Since Apache Spark 1.4 came out I've been wanting to deploy it on my computer to test it out. I hit several bumps along the way, but at this point I think I have a pretty good idea of how it's done.
Download Spark
The first step is to get the newest version of Spark downloaded onto your machine. Go to this link and select Spark 1.4 pre-built for Hadoop 2.6; this is the version I've installed. Once it's downloaded, untar it either with your file browser or with the command tar -xzf spark-1.4.0-bin-hadoop2.6.tgz. Now copy the entire directory to a safe place on your computer.
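If you'd rather do the whole thing from the terminal, the steps look roughly like this (the mirror URL and the destination path are just examples; use whichever mirror the download page gives you and wherever you want Spark to live):

# Example only: download, unpack, and move Spark to a permanent home
wget http://archive.apache.org/dist/spark/spark-1.4.0/spark-1.4.0-bin-hadoop2.6.tgz
tar -xzf spark-1.4.0-bin-hadoop2.6.tgz
mv spark-1.4.0-bin-hadoop2.6 ~/Projects/spark/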
Setting up Your Environment
Once we have the directory all safe and secure, we need to establish some environment variables, most notably $SPARK_HOME. In your .bashrc, add the following line:
export SPARK_HOME=/home/zoë/Projects/spark/spark-1.4.0-bin-hadoop2.6
replacing the directory with whichever directory Spark resides in on your machine. (Note there's no $ on the left-hand side of an export.)
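After reloading your shell configuration, a quick sanity check makes sure the variable points at the right place:

source ~/.bashrc
echo $SPARK_HOME              # should print the directory you set above
ls $SPARK_HOME/bin            # spark-submit, pyspark, spark-shell, ...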
Configuring Spark Options
Since 1.3 the method for deploying a standalone cluster has changed a little bit, so we need to configure the worker options ourselves. First, copy the template file provided at $SPARK_HOME/conf/spark-env.sh.template to $SPARK_HOME/conf/spark-env.sh. Now edit the file with your favorite editor and add the following line:
SPARK_WORKER_INSTANCES=4
This tells the scripts to create 4 workers. You can adjust this as necessary, as well as adjust any other parameters in the configuration file.
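For reference, a minimal spark-env.sh might end up looking something like this; the core and memory numbers are just examples, so tune them to your hardware:

# $SPARK_HOME/conf/spark-env.sh -- example values, adjust for your machine
SPARK_WORKER_INSTANCES=4    # number of worker processes on this machine
SPARK_WORKER_CORES=2        # cores each worker is allowed to use
SPARK_WORKER_MEMORY=2g      # memory each worker is allowed to use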
The Run Script
The file $SPARK_HOME/sbin/start-all.sh will start the master and all of your workers. It's kinda clunky, so I wrote a Python script to make things work a bit more nicely.
#!/usr/bin/env python3
import os
import argparse
import time

# Optionally launch an IPython notebook alongside the cluster
parser = argparse.ArgumentParser()
parser.add_argument('-p', '--python', action='store_true', default=False)
args = parser.parse_args()

try:
    # Load Spark's environment variables, then start the master and workers
    os.system('bash $SPARK_HOME/bin/load-spark-env.sh')
    os.system('bash $SPARK_HOME/sbin/start-all.sh')
    print("Spark Running")
    if args.python:
        # Open an IPython notebook with the pyspark profile
        os.system('ipython2.7 notebook --profile=pyspark')
    else:
        # Keep the script alive so Ctrl+C can trigger a clean shutdown
        while True:
            time.sleep(3600)
except KeyboardInterrupt:
    # Ctrl+C: shut the master and workers back down
    os.system('$SPARK_HOME/sbin/stop-all.sh')
And that's it! If you put that script in $SPARK_HOME, it will automatically load all the other environment variables and start Spark. Once you press Ctrl+C it will quit gracefully, shutting down Spark with it.
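To give it a spin, assuming you saved the script as start_spark.py (the name is up to you) inside $SPARK_HOME:

cd $SPARK_HOME
python3 start_spark.py        # start the master and the 4 workers
python3 start_spark.py -p     # same, but also launch the IPython notebook
# The master's web UI should then be reachable at http://localhost:8080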