Since Apache Spark 1.4 came out I've been wanting to deploy it on my computer to test it out. I hit several bumps along the way, but at this point I think I've got a pretty good handle on how it's done.
The first step is to download the latest version of Spark onto your machine. Go to this
link and select Spark 1.4, pre-built for Hadoop 2.6. This is the version that
I've installed. Once it's downloaded, untar it either with your file browser or with the command
tar -xzf spark-1.4.0-bin-hadoop2.6.tgz. Now copy the entire directory into a safe place on your computer.
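For example, assuming the tarball landed in ~/Downloads and you want to keep Spark under ~/opt (both paths are just examples, use whatever locations you like):

```shell
# Example paths only -- adjust to your own machine
cd ~/Downloads
tar -xzf spark-1.4.0-bin-hadoop2.6.tgz
mkdir -p ~/opt
mv spark-1.4.0-bin-hadoop2.6 ~/opt/
```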
Setting up Your Environment
Once we have the directory all safe and secure, we need to establish some environment variables, most notably SPARK_HOME. In your
.bashrc, add the following line:
Where you replace the directory with whichever directory Spark resides in on your machine.
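The line itself is just an export of SPARK_HOME; a reasonable version, assuming you copied Spark to ~/opt as above (the path is only an example), would be:

```shell
# Add to ~/.bashrc -- point SPARK_HOME at your own install directory
export SPARK_HOME=$HOME/opt/spark-1.4.0-bin-hadoop2.6
# Optional but handy: put the Spark binaries on your PATH
export PATH=$SPARK_HOME/bin:$PATH
```

Remember to open a new terminal or run source ~/.bashrc afterwards so the variable is actually set.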
Configuring Spark Options
Since 1.3 the method for deploying a standalone cluster has changed a little bit. That being said, we need to configure
the worker options. First, copy the template file $SPARK_HOME/conf/spark-env.sh.template to
$SPARK_HOME/conf/spark-env.sh. Now edit that file with your favorite editor and add the following line:
This will tell the scripts to create 4 workers. You can adjust this as necessary, as well as adjust any other parameters in the configuration file.
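Given the four-worker setup described above, the line in question is presumably the SPARK_WORKER_INSTANCES setting:

```shell
# In $SPARK_HOME/conf/spark-env.sh -- number of worker instances per machine
export SPARK_WORKER_INSTANCES=4
```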
The Run Script
$SPARK_HOME/sbin/start-all.sh will start the master and all of your workers. It's a bit clunky, though, so I wrote a
Python script to make things run more smoothly.
#!/usr/bin/env python3
import os
import argparse
import time

parser = argparse.ArgumentParser()
parser.add_argument('-p', '--python', action='store_true', default=False)
args = parser.parse_args()

try:
    # Load Spark's environment, then start the master and all workers
    os.system('bash $SPARK_HOME/bin/load-spark-env.sh')
    os.system('bash $SPARK_HOME/sbin/start-all.sh')
    print("Spark Running")
    if args.python:
        # With -p/--python, also launch an IPython notebook using the pyspark profile
        os.system('ipython2.7 notebook --profile=pyspark')
    else:
        # Otherwise just idle so the cluster stays up until interrupted
        while True:
            time.sleep(3600)
except KeyboardInterrupt:
    # Ctrl+C: shut the master and workers back down
    os.system('$SPARK_HOME/sbin/stop-all.sh')
And that's it! If you put that script in
$SPARK_HOME, it will automatically load all of the other environment variables and
start Spark. Once you press Ctrl+C it will quit gracefully, shutting down Spark with it.
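Assuming you save the script as start_spark.py (the name is arbitrary) inside $SPARK_HOME, using it looks something like this:

```shell
# Example usage -- the script name is just a placeholder
cd $SPARK_HOME
chmod +x start_spark.py
./start_spark.py            # start the master and workers, idle until Ctrl+C
./start_spark.py --python   # also launch the IPython notebook
```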