Introduction

This post explains how to install and run Apache Spark on a computer with Windows 10 (it may also help with earlier versions of Windows, or even Linux and Mac OS systems), so you can try out the engine and learn how to interact with it without spending too many resources. If you really want to build a serious prototype, I strongly recommend installing one of the virtual machines I mentioned in this post a couple of years ago: Hadoop self-learning with pre-configured Virtual Machines, or spending some money on a Hadoop distribution in the cloud. The new versions of these VMs come with Spark ready to use.

A few words about Apache Spark

Apache Spark is making a lot of noise in the IT world as a general engine for large-scale data processing, able to run programs up to 100x faster than Hadoop MapReduce thanks to its in-memory computing capabilities. You can write Spark applications in Java, Python, Scala, or R, and it comes with built-in libraries to work with structured data (Spark SQL), graph computation (GraphX), machine learning (MLlib), and streaming (Spark Streaming).

Spark runs on Hadoop, on Mesos, in the cloud, or standalone. The latter is the case in this post: we are going to install Spark 1.6.0 standalone on a computer with a 32-bit Windows 10 installation (my very old laptop). Let’s get started.

Install or update Java

For any application that uses the Java Virtual Machine, it is always recommended to install the appropriate Java version. In this case I just updated my Java version as follows:

Start –> All apps –> Java –> Check For Updates

Update Java

In the same way you can verify your Java version. This is the version I used:

Extract the folder containing the file winutils.exe to any location of your preference.

Environment Variables Configuration

This is also crucial in order to run some commands without problems using the command prompt.

_JAVA_OPTIONS: I set this variable to the value shown in the figure below. I was getting Java heap memory problems with the default values, and this fixed the problem.

HADOOP_HOME: even though Spark can run without Hadoop, the version I downloaded is prebuilt for Hadoop 2.6 and looks for it in the code. To fix this inconvenience, I set this variable to the folder containing the winutils.exe file.

JAVA_HOME: usually you already set this variable when you install Java, but it is better to verify that it exists and is correct.

SCALA_HOME: the bin folder of the Scala location. If you used the standard location from the installer, it should be the path in the figure below.

SPARK_HOME: the bin folder path of where you uncompressed Spark.
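As a sketch, the variables above can also be set from an administrator command prompt with setx instead of the Control Panel. The paths and the _JAVA_OPTIONS heap values below are examples from my setup, not the only valid ones; adjust them to wherever you installed each component:

```shell
:: Example values only -- adjust every path to your own installation folders
setx _JAVA_OPTIONS "-Xmx512M -Xms512M"
setx HADOOP_HOME "C:\Hadoop"
setx JAVA_HOME "C:\Program Files\Java\jre1.8.0_74"
setx SCALA_HOME "C:\Program Files (x86)\scala\bin"
setx SPARK_HOME "C:\spark-1.6.0-bin-hadoop2.6\bin"
```

Note that setx only affects new command prompt windows, so close and reopen the prompt after running it.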

Environment Variables 1/2

Environment Variables 2/2

Permissions for the folder tmp/hive

I struggled a little bit with this issue. After I set everything up, I tried to run the spark-shell from the command line and got an error that was hard to debug: the shell tries to find the folder tmp/hive and was not able to set the SQL context.

I looked at my C drive and found that the C:\tmp\hive folder had been created. If it is not there, you can create it yourself and set 777 permissions on it. In theory you can do this with the advanced sharing options of the Sharing tab in the folder’s properties, but I did it from the command line using winutils:

Open a command prompt as administrator and type:

Set 777 permissions for tmp/hive
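The command shown in the figure, assuming winutils.exe was extracted to C:\Hadoop\bin (adjust the path if you saved it elsewhere), is:

```shell
:: Grant full (777) permissions on the tmp/hive folder the SQL context needs
c:\Hadoop\bin\winutils.exe chmod 777 \tmp\hive
```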

Please be aware that you need to adjust the path of the winutils.exe above if you saved it to another location.

We are finally done and can start the spark-shell, which is an interactive way to analyze data using Scala or Python. This way we will also test our Spark installation.

Using the Scala Shell to run our first example

In the same command prompt, go to the Spark folder and type the following command to run the Scala shell:

Start the Spark Scala Shell
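Assuming Spark was uncompressed to C:\spark-1.6.0-bin-hadoop2.6 (adjust to your own folder), the commands look like this:

```shell
:: Change to the Spark root folder and launch the interactive Scala shell
cd C:\spark-1.6.0-bin-hadoop2.6
bin\spark-shell
```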

After some execution lines you should see a similar screen:

Shell started

You are going to receive several warnings and informational messages in the shell because we have not set various configuration options. For now, just ignore them.

Let’s run our first program in the shell; I took the example from the Spark Programming Guide. The first command creates a resilient distributed dataset (RDD) from a text file included in Spark’s root folder. After the RDD is created, the second command just counts the number of items inside:
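A sketch of those two commands, following the quick start in the Spark Programming Guide. They are meant to be typed inside the spark-shell, which predefines the SparkContext as sc; README.md ships in the Spark root folder:

```scala
// Create an RDD from the README.md text file in the Spark root folder
val textFile = sc.textFile("README.md")

// Count the number of items (lines) in the RDD
textFile.count()
```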

Running a Spark Example

And that’s it. I hope you can follow my explanation and run this simple example. I wish you a lot of fun with Apache Spark.

12 Responses to Apache Spark installation on Windows 10

Hi Paul
This is a great help to me, but seems I’m doing something wrong.
I have Windows 10 Pro 64 bits. I downloaded the winutils.exe (64 bits) but when I tried to execute the:
C:\WINDOWS\system32>c:\Hadoop\bin\winutils.exe chmod 777 \tmp\hive
I obtain an error (the winutils.exe is not compatible with the Windows version).

Hi Paul,
The winutils issue was my headache. Please try to do the following:
– Copy the content of the whole library and try again.
– If this doesn’t help, try to build the hadoop sources by yourself, I wrote a post about it (https://wordpress.com/stats/day/hernandezpaul.wordpress.com). It was also a pain in the a…
– If you don’t want to walk this way just let me know and I will share a link to download the winutils I built. I did it with Windows Server 64 bits but it should work also for Windows 10.
– Last thing I can offer to you is download the hadoop binaries that this blogger offers in this post: http://kplitzkahran.blogspot.de/2015/08/hadoop-271-for-windows-10-binary-build.html
the download link is at the very end of the post.
Kind Regards,
Paul