The following is a quickstart for running Flask on Spark.

Most of the example tutorials I have found are about submitting a batch of Spark jobs to a cluster and collecting a result. I was interested in long-running tasks, and in whether I could build a web app that ran on Spark. I thought it would be possible, but I didn’t expect it to be this easy. Note that the same procedure works for plenty of other Python scripts, and I am curious to see what else I can load into Spark.

Prerequisites:

If you haven’t installed Spark, grab the latest build from https://spark.apache.org/downloads.html and untar it in a directory somewhere. It doesn’t need to be anywhere special; I just used ~/Downloads/.

Change into the root of the extracted Spark directory:

cd ~/Downloads/spark-2.3.0-bin-hadoop2.7/

Create the following file, start_standalone.sh, to start Spark. This is optional, but it helps me. Set JAVA_HOME correctly for your system; the trick below works for me.

#! /bin/sh
#
# start_standalone.sh
#

# Derive JAVA_HOME from whichever java is on the PATH
# (readlink -f follows the symlink chain to the real binary).
export JAVA_HOME=$(dirname $(dirname $(readlink -f $(which java))))

# Start a master, then attach a single worker to it on the default port 7077.
./sbin/start-master.sh
./sbin/start-slave.sh spark://$(hostname -s):7077

You should also make sure you are running the same version of Python when you start Spark as when you run spark-submit. The easiest way to do that is to quickly create a virtualenv and activate it. Install Flask in there while we are at it.

python3 -m venv env
source env/bin/activate
pip install flask
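
If you want to enforce the interpreter match from the script side as well, a minimal sketch is to pin PYSPARK_PYTHON (the environment variable Spark reads when launching Python workers) to the driver's own interpreter. Treat the placement as an assumption that holds for a single-machine standalone cluster like this one:

import os
import sys

# Assumption: setting this in the driver, before the SparkSession exists,
# is early enough on a one-machine standalone cluster. PYSPARK_PYTHON is
# the interpreter Spark launches for Python workers.
os.environ.setdefault("PYSPARK_PYTHON", sys.executable)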

Now we can start our cluster (of one). Make the script executable first (chmod +x start_standalone.sh), then run it:

./start_standalone.sh

You should now be able to access the Spark UI at http://localhost:8080/ and see that it has 1 worker attached.
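
If you prefer to check from code instead of the browser, here is a quick standard-library sketch. It assumes the standalone master UI serves a JSON status view at /json, which has been my experience but may vary by build:

import json
from urllib.request import urlopen

# Query the standalone master's JSON status page and report the worker count.
status = json.load(urlopen("http://localhost:8080/json/"))
print(status["status"], "with", len(status["workers"]), "worker(s)")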

Next we write our Python Flask script. I have created 2 routes, one of which calculates Pi using the example from the source code at examples/src/main/python/pi.py, with a few tweaks to remove the command-line arguments and read the partition count from a GET parameter instead.

#! /usr/bin/env python
# -*- coding: utf-8 -*-
# vim:fenc=utf-8
#

"""
Flask on Spark example.

Run with:

    ./bin/spark-submit --master spark://$(hostname -s):7077 exampleweb.py

"""

import sys
sys.path.append('./python')  # make pyspark importable when run from the Spark root
from random import random
from operator import add

from flask import Flask, request
from pyspark.sql import SparkSession

app = Flask(__name__)

# One SparkSession is created at startup and shared by every request.
spark = SparkSession\
        .builder\
        .appName("Flark - Flask on Spark")\
        .getOrCreate()

@app.route("/")
def hello():
    return "Hello World! There is a spark example at <a href=\"/pi?partitions=1\">/pi</a>"

@app.route("/pi")
def pi():

    try:
        partitions = int(request.args.get('partitions', '1'))
    except ValueError as e:
        # Flask cannot return an exception object, so return its message.
        return str(e), 400

    n = 1000000 * partitions

    def f(_):
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    return "Pi is roughly %f" % (4.0 * count / n)


if __name__ == "__main__":
    app.run()
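
Why the estimate works: the sampled points are uniform over the square from -1 to 1 on both axes, and the unit circle covers pi/4 of that square, so 4 * count / n converges to Pi as n grows. You can sanity-check the same estimator in plain Python, no Spark required:

from random import random

# Same Monte Carlo estimator as the /pi route, run locally:
# sample points in the square, count how many land inside the unit circle.
n = 1000000
count = 0
for _ in range(n):
    x = random() * 2 - 1
    y = random() * 2 - 1
    count += x ** 2 + y ** 2 <= 1
print("Pi is roughly %f" % (4.0 * count / n))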

Now we can start our Flask application by submitting it to Spark.

./bin/spark-submit --master spark://$(hostname -s):7077 exampleweb.py

And then access it at http://localhost:5000/ and http://localhost:5000/pi?partitions=1.
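
To poke at both routes without a browser, here is a small standard-library sketch. It assumes the app is running on Flask's default port, 5000:

from urllib.request import urlopen

# Fetch both routes and print the raw responses.
for url in ("http://localhost:5000/",
            "http://localhost:5000/pi?partitions=4"):
    with urlopen(url) as resp:
        print(url, "->", resp.read().decode())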