Detailed Guide to Setting up Scalable Apache Spark Infrastructure on Docker - Standalone Cluster With History Server

This post is a complete guide to building a scalable Apache Spark infrastructure using Docker. We will see how to enable the History Server for log persistence.

Pavan Kulkarni

10 minute read

The ability to scale up and down is one of the key requirements of today's distributed infrastructure. By the end of this guide, you should have a pretty fair understanding of setting up Apache Spark on Docker, and we will see how to run a sample program on the cluster.

Prerequisites:

  1. Docker (Installation Instructions Here)
  2. Eclipse (download from here)
  3. Scala (Read this to Install Scala)
  4. Gradle (Read this to Install Gradle)
  5. Apache Spark (Read this to Install Spark)

GitHub Repos:

  1. docker-spark-image - This repo contains the Dockerfile required to build the base image for the containers.
  2. create-and-run-spark-job - This repo contains all the necessary files required to build a scalable infrastructure.
  3. Docker_WordCount_Spark - This repo contains the source code that I will be running as part of the demo. You can clone this repo and run gradle clean build to generate an executable jar.

Let’s Begin ….!!!

Setting up Apache Spark in Docker gives us the flexibility of scaling the infrastructure as per the complexity of the project. This way we are:

  1. Neither under-utilizing nor over-utilizing the power of Apache Spark
  2. Neither under-allocating nor over-allocating resources to the cluster
  3. In a shared environment, we have some liberty to spawn our own clusters and bring them down.

So, here’s what I will be covering in this tutorial:

  1. Create a base image for all the Spark nodes.
  2. Create a bridged network to connect all the containers internally.
  3. Add shared volumes across the containers for data sharing.
  4. Submit a job to cluster.
  5. Finally, monitor the job for performance optimization.

Let’s go over each one of these above steps in detail.

1. Creating a base image for all our Spark nodes.

  1. Create a directory docker-spark-image that will contain the following files - Dockerfile, master.conf, slave.conf, history-server.conf and spark-defaults.conf.
  2. master.conf - This configuration file is used to start the master node in the container. This is started in supervisord mode; a minimal sketch of such a file is shown after the quote below.

From the Docker docs:
supervisord - Use a process manager like supervisord. This is a moderately heavy-weight approach that requires you to package supervisord and its configuration in your image (or base your image on one that includes supervisord), along with the different applications it manages. Then you start supervisord, which manages your processes for you.
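
For orientation, a minimal master.conf in supervisord's INI format might look like the sketch below; the program name and the exact spark-class invocation are my assumptions, so check the docker-spark-image repo for the actual file:

    [supervisord]
    nodaemon=true

    [program:spark-master]
    command=/opt/spark/bin/spark-class org.apache.spark.deploy.master.Master --host master --port 7077 --webui-port 8080
    autorestart=true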

  3. slave.conf - This configuration file is used to start the slave node in the container and allow it to connect to the master node. This is also started in supervisord mode. Make sure to note the name master; we will be referencing the master node as master throughout this post.
  4. history-server.conf - This configuration file is used to start the history server in the container. This is also started in supervisord mode. I build a separate container just to persist the application history logs of Spark jobs. The history log location specified in spark-defaults.conf is a shared volume between all containers, and it is mounted to the local filesystem for log persistence. We will see below how this is done.
  5. spark-defaults.conf - This configuration file is used to enable event logging and set the log locations used by the history server.

    spark.eventLog.enabled             true
    spark.eventLog.dir                 file:///opt/spark-events
    spark.history.fs.logDirectory      file:///opt/spark-events
    
  6. Finally, Dockerfile - Lines 6-31 update the package index and install Java 8, supervisord and Apache Spark 2.2.1 with Hadoop 2.7. Then, all the configuration files are copied to the image and the log location specified in spark-defaults.conf is created. All the required ports are exposed for proper communication between the containers and for job monitoring via the WebUI.

    # adding conf files to all images. This will be used in supervisord for running spark master/slave
    COPY master.conf /opt/conf/master.conf
    COPY slave.conf /opt/conf/slave.conf
    COPY history-server.conf /opt/conf/history-server.conf
    
    # Adding configurations for history server
    COPY spark-defaults.conf /opt/spark/conf/spark-defaults.conf
    RUN  mkdir -p /opt/spark-events
    
    # expose port 8080 for spark UI
    EXPOSE 4040 6066 7077 8080 18080 8081
    

    Additionally, you can start a dummy process in the container so that the container does not exit unexpectedly after creation. This, in combination with the supervisord daemon, ensures that the container stays alive until killed or stopped manually.

    #default command: this is just an option 
    CMD ["/opt/spark/bin/spark-shell", "--master", "local[*]"]
    
  7. Create an image by running the command below from the docker-spark-image directory.

    Pavans-MacBook-Pro:docker-spark-image pavanpkulkarni$ docker build -t pavanpkulkarni/spark_image .
        
    Sending build context to Docker daemon  81.92kB
    [WARNING]: Empty continuation line found in:
        RUN apt-get install software-properties-common -y &&  apt-add-repository ppa:webupd8team/java -y &&  apt-get update -y &&  echo debconf shared/accepted-oracle-license-v1-1 select true | sudo debconf-set-selections &&  echo debconf shared/accepted-oracle-license-v1-1 seen true | sudo debconf-set-selections &&  apt-get install -y oracle-java8-installer     supervisor
    [WARNING]: Empty continuation lines will become errors in a future release.
    Step 1/16 : FROM ubuntu:14.04
    14.04: Pulling from library/ubuntu
    
    .
    .
    .
    .
    .
    .
    
    
    Removing intermediate container 2d860633548a
     ---> bb560415c8bf
    Step 16/16 : CMD ["/opt/spark/bin/spark-shell", "--master", "local[*]"]
     ---> Running in 4d354e2b3984
    Removing intermediate container 4d354e2b3984
     ---> 4c1113febcc4
    Successfully built 4c1113febcc4
    Successfully tagged pavanpkulkarni/spark_image:latest
    
    Pavans-MacBook-Pro:docker-spark-image pavanpkulkarni$ docker images
    REPOSITORY                   TAG                 IMAGE ID            CREATED             SIZE
    pavanpkulkarni/spark_image   latest              4c1113febcc4        46 seconds ago      1.36GB
    ubuntu                       14.04               8cef1fa16c77        13 days ago         223MB
    
    Pavans-MacBook-Pro:docker-spark-image pavanpkulkarni$ docker tag 4c1113febcc4 pavanpkulkarni/spark_image:2.2.1
    Pavans-MacBook-Pro:docker-spark-image pavanpkulkarni$ docker images
    REPOSITORY                   TAG                 IMAGE ID            CREATED             SIZE
    pavanpkulkarni/spark_image   2.2.1               4c1113febcc4        4 minutes ago       1.36GB
    pavanpkulkarni/spark_image   latest              4c1113febcc4        4 minutes ago       1.36GB
    ubuntu                       14.04               8cef1fa16c77        13 days ago         223MB
    
    
    

    OR

    You can pull this image from my Docker Hub as follows:

    docker pull pavanpkulkarni/spark_image:2.2.1
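
    A quick sanity check after building or pulling is to start a throwaway container from the image; with no extra arguments it runs the default CMD (a local-mode spark-shell), which confirms that Java and Spark are wired up correctly:

        docker run --rm -it pavanpkulkarni/spark_image:2.2.1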
    

Creating a cluster

To create a cluster, I make use of the docker-compose utility.

From the docker-compose docs:
docker-compose - Compose is a tool for defining and running multi-container Docker applications. With Compose, you use a YAML file to configure your application’s services. Then, with a single command, you create and start all the services from your configuration.

Create a new directory create-and-run-spark-job. This directory will contain - docker-compose.yml, Dockerfile, the executable jar and any supporting files required for execution.
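
For the demo in this post, the directory ends up looking roughly like this (the jar is built from the Docker_WordCount_Spark repo and sample.txt is the input file that gets baked into the image):

    create-and-run-spark-job/
    ├── docker-compose.yml
    ├── Dockerfile
    ├── Docker_WordCount_Spark-1.0.jar
    └── sample.txt

The docker-volume/ directory used for the shared volumes does not need to be created by hand; it shows up automatically once the containers start.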

  1. We start by creating docker-compose.yml. Let's create 3 sections, one each for master, slave and history-server. The image needs to be specified for each container. The ports field specifies port bindings between the host and the container as HOST_PORT:CONTAINER_PORT. Under the slave section, port 8081 is exposed to the host (expose can be used instead of ports); here 8081 is free to bind to any available port on the host side. command is used to run a command in the container. The volumes field creates and mounts volumes between the container and the host, and follows the HOST_PATH:CONTAINER_PATH format.

    These are the minimum configurations we need to have in docker-compose.yml

    master node :

        master:
          build: .
          image: pavanpkulkarni/spark_image:2.2.1
          container_name: master
          ports:
            - 4040:4040
            - 7077:7077
            - 8080:8080
            - 6066:6066
          command: ["/usr/bin/supervisord", "--configuration=/opt/conf/master.conf"]
    

    slave node:

        image: pavanpkulkarni/spark_image:2.2.1
        depends_on:
          - master
        ports:
          - "8081"
        command: ["/usr/bin/supervisord", "--configuration=/opt/conf/slave.conf"]
        volumes:
            - ./docker-volume/spark-output/:/opt/output
            - ./docker-volume/spark-events/:/opt/spark-events
    

    history-server container:

      image: pavanpkulkarni/spark_image:2.2.1
      container_name: history-server
      depends_on:
        - master
      ports:
        - "18080:18080"
      command: ["/usr/bin/supervisord", "--configuration=/opt/conf/history-server.conf"]
      volumes:
        - ./docker-volume/spark-events:/opt/spark-events
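
    Putting the three fragments together, the complete docker-compose.yml has them nested under a top-level services key. The sketch below assumes the version 3 Compose file format, so the actual file in the create-and-run-spark-job repo may differ slightly:

        version: "3"
        services:
          master:
            build: .
            image: pavanpkulkarni/spark_image:2.2.1
            container_name: master
            ports:
              - 4040:4040
              - 7077:7077
              - 8080:8080
              - 6066:6066
            command: ["/usr/bin/supervisord", "--configuration=/opt/conf/master.conf"]

          slave:
            image: pavanpkulkarni/spark_image:2.2.1
            depends_on:
              - master
            ports:
              - "8081"
            command: ["/usr/bin/supervisord", "--configuration=/opt/conf/slave.conf"]
            volumes:
              - ./docker-volume/spark-output/:/opt/output
              - ./docker-volume/spark-events/:/opt/spark-events

          history-server:
            image: pavanpkulkarni/spark_image:2.2.1
            container_name: history-server
            depends_on:
              - master
            ports:
              - "18080:18080"
            command: ["/usr/bin/supervisord", "--configuration=/opt/conf/history-server.conf"]
            volumes:
              - ./docker-volume/spark-events:/opt/spark-events

    Note that the service is named slave; that name is what the --scale slave=3 flag refers to when we spawn the cluster below.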
    
    
  2. Executable jar - I have built the project using gradle clean build. I will be using Docker_WordCount_Spark-1.0.jar for the demo. This jar is an application that performs a simple WordCount on sample.txt and writes the output to a directory. The jar takes 2 arguments, as shown in the usage line below; output_directory is on the mounted volume of the worker nodes (slave containers). A minimal sketch of the job follows the usage line.

Docker_WordCount_Spark-1.0.jar [input_file] [output_directory]
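
For reference, the core of such a job could look like the Scala sketch below. This is a minimal reconstruction based on the class name and arguments used in the spark-submit command later on, not the exact source from the Docker_WordCount_Spark repo:

    package com.pavanpkulkarni.dockerwordcount

    import org.apache.spark.sql.SparkSession

    object DockerWordCount {
      def main(args: Array[String]): Unit = {
        // args(0) = input file, args(1) = output directory
        val Array(inputFile, outputDir) = args

        val spark = SparkSession.builder().appName("DockerWordCount").getOrCreate()
        import spark.implicits._

        // Split lines into words, count each word, and write the counts out as CSV part files
        spark.read.textFile(inputFile)
          .flatMap(_.split("\\s+"))
          .filter(_.nonEmpty)
          .groupBy("value")
          .count()
          .write
          .csv(outputDir)

        spark.stop()
      }
    }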

  3. Dockerfile - This is an application-specific Dockerfile that contains only the jar and the application-specific files. docker-compose uses this Dockerfile to build the containers.

    FROM pavanpkulkarni/spark_image:2.2.1
    LABEL authors="pavanpkulkarni@pavanpkulkarni.com"
    
    COPY Docker_WordCount_Spark-1.0.jar /opt/Docker_WordCount_Spark-1.0.jar
    COPY sample.txt /opt/sample.txt
    

To build a 3-node cluster:

  1. cd to create-and-run-spark-job
  2. Run docker images and docker ps -a to check the state of images and containers. We start with the Spark image and no containers.

    Pavans-MacBook-Pro:create-and-run-spark-job pavanpkulkarni$ docker images
    REPOSITORY                   TAG                 IMAGE ID            CREATED             SIZE
    pavanpkulkarni/spark_image   2.2.1               4c1113febcc4        About an hour ago   1.36GB
    pavanpkulkarni/spark_image   latest              4c1113febcc4        About an hour ago   1.36GB
    ubuntu                       14.04               8cef1fa16c77        13 days ago         223MB
        
    Pavans-MacBook-Pro:create-and-run-spark-job pavanpkulkarni$ docker ps -a
    CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
    
    
  3. Build the image with docker-compose, using the application-specific Dockerfile.

    Pavans-MacBook-Pro:create-and-run-spark-job pavanpkulkarni$ docker-compose build
    Building master
    Step 1/4 : FROM pavanpkulkarni/spark_image:2.2.1
     ---> 4c1113febcc4
    Step 2/4 : LABEL authors="pavanpkulkarni@pavanpkulkarni.com"
     ---> Running in 8d4cce9730cb
    Removing intermediate container 8d4cce9730cb
     ---> 0e0f1aba18ed
    Step 3/4 : COPY Docker_WordCount_Spark-1.0.jar /opt/Docker_WordCount_Spark-1.0.jar
     ---> 215e22127d54
    Step 4/4 : COPY sample.txt /opt/sample.txt
     ---> a79cd3fb5e33
    Successfully built a79cd3fb5e33
    Successfully tagged pavanpkulkarni/spark_image:2.2.1
    slave uses an image, skipping
    history-server uses an image, skipping
    
    
  4. Spawn a 3-node cluster

    Pavans-MacBook-Pro:create-and-run-spark-job pavanpkulkarni$ docker-compose up -d --scale slave=3
    Creating master ... done
    Creating history-server                   ... done
    Creating create-and-run-spark-job_slave_1 ... done
    Creating create-and-run-spark-job_slave_2 ... done
    Creating create-and-run-spark-job_slave_3 ... done
    
    Pavans-MacBook-Pro:create-and-run-spark-job pavanpkulkarni$ docker ps
    CONTAINER ID        IMAGE                              COMMAND                  CREATED             STATUS              PORTS                                                                                                                 NAMES
    043cf5a74586        pavanpkulkarni/spark_image:2.2.1   "/usr/bin/supervisor…"   22 seconds ago      Up 24 seconds       4040/tcp, 6066/tcp, 7077/tcp, 8080-8081/tcp, 0.0.0.0:18080->18080/tcp                                                 history-server
    bd762af3600e        pavanpkulkarni/spark_image:2.2.1   "/usr/bin/supervisor…"   22 seconds ago      Up 25 seconds       4040/tcp, 6066/tcp, 7077/tcp, 8080/tcp, 18080/tcp, 0.0.0.0:32770->8081/tcp                                            create-and-run-spark-job_slave_3
    ee254a16787f        pavanpkulkarni/spark_image:2.2.1   "/usr/bin/supervisor…"   22 seconds ago      Up 24 seconds       4040/tcp, 6066/tcp, 7077/tcp, 8080/tcp, 18080/tcp, 0.0.0.0:32769->8081/tcp                                            create-and-run-spark-job_slave_1
    463edf008d05        pavanpkulkarni/spark_image:2.2.1   "/usr/bin/supervisor…"   22 seconds ago      Up 24 seconds       4040/tcp, 6066/tcp, 7077/tcp, 8080/tcp, 18080/tcp, 0.0.0.0:32768->8081/tcp                                            create-and-run-spark-job_slave_2
    ad6a781d9437        pavanpkulkarni/spark_image:2.2.1   "/usr/bin/supervisor…"   22 seconds ago      Up 25 seconds       0.0.0.0:4040->4040/tcp, 0.0.0.0:6066->6066/tcp, 0.0.0.0:7077->7077/tcp, 8081/tcp, 0.0.0.0:8080->8080/tcp, 18080/tcp   master
    Pavans-MacBook-Pro:create-and-run-spark-job pavanpkulkarni$
    

    This will give us a 3-node cluster consisting of:

    • Master - master
    • Workers - create-and-run-spark-job_slave_1, create-and-run-spark-job_slave_2, create-and-run-spark-job_slave_3
    • History Server - history-server

    The mounted volumes will now be visible on the host. In my case, I can see 2 directories created in my current directory:

    Pavans-MacBook-Pro:create-and-run-spark-job pavanpkulkarni$ ls -ltr docker-volume/
    total 0
    drwxr-xr-x  3 pavanpkulkarni  staff  96 May 10 09:57 spark-output
    drwxr-xr-x  2 pavanpkulkarni  staff  64 May 11 15:51 spark-events
    

    Note on docker-compose networking from docker-compose docs -
    docker-compose - By default Compose sets up a single network for your app. Each container for a service joins the default network and is both reachable by other containers on that network, and discoverable by them at a hostname identical to the container name.

    In our case, we have a bridged network called create-and-run-spark-job_default. The name of the network defaults to the name of the parent directory; this can be changed by setting the COMPOSE_PROJECT_NAME environment variable. A deeper inspection can be done by running the docker network inspect create-and-run-spark-job_default command.

    Pavans-MacBook-Pro:create-and-run-spark-job pavanpkulkarni$ docker network ls
    NETWORK ID          NAME                               DRIVER              SCOPE
    dc9ce7304889        bridge                             bridge              local
    a5bd0ff97b90        create-and-run-spark-job_default   bridge              local
    8433fa00d5d8        host                               host                local
    fbbb577e1d8e        none                               null                local
    
  5. The Spark cluster can be verified to be up and running via the WebUI - the master UI is available at http://localhost:8080 and the History Server at http://localhost:18080 (both ports are published in docker-compose.yml).

  6. The cluster can be scaled up or down by re-running the command below with n replaced by the desired number of worker nodes.

    docker-compose up -d --scale slave=n
    
  7. Let's submit a job to this 3-node cluster from the master node. This is a simple spark-submit command that will produce the output in the /opt/output/wordcount_output directory.

    Pavans-MacBook-Pro:create-and-run-spark-job pavanpkulkarni$ docker exec master /opt/spark/bin/spark-submit --class com.pavanpkulkarni.dockerwordcount.DockerWordCount --master spark://master:6066 --deploy-mode cluster --verbose /opt/Docker_WordCount_Spark-1.0.jar /opt/sample.txt  /opt/output/wordcount_output
        
    Using properties file: /opt/spark/conf/spark-defaults.conf
    Adding default property: spark.eventLog.enabled=true
    Adding default property: spark.eventLog.dir=file:///opt/spark-events
    Adding default property: spark.history.fs.logDirectory=file:///opt/spark-events
    Parsed arguments:
      master                  spark://master:6066
      deployMode              cluster
      executorMemory          null
      executorCores           null
      totalExecutorCores      null
      propertiesFile          /opt/spark/conf/spark-defaults.conf
      driverMemory            null
      driverCores             null
      driverExtraClassPath    null
      driverExtraLibraryPath  null
      driverExtraJavaOptions  null
      supervise               false
      queue                   null
      numExecutors            null
      files                   null
      pyFiles                 null
      archives                null
      mainClass               com.pavanpkulkarni.dockerwordcount.DockerWordCount
      primaryResource         file:/opt/Docker_WordCount_Spark-1.0.jar
      name                    com.pavanpkulkarni.dockerwordcount.DockerWordCount
      childArgs               [/opt/sample.txt /opt/output/wordcount_output]
      jars                    null
      packages                null
      packagesExclusions      null
      repositories            null
      verbose                 true
    
    Spark properties used, including those specified through
     --conf and those from the properties file /opt/spark/conf/spark-defaults.conf:
      (spark.eventLog.enabled,true)
      (spark.history.fs.logDirectory,file:///opt/spark-events)
      (spark.eventLog.dir,file:///opt/spark-events)
    
    
    Running Spark using the REST application submission protocol.
    Main class:
    org.apache.spark.deploy.rest.RestSubmissionClient
    Arguments:
    file:/opt/Docker_WordCount_Spark-1.0.jar
    com.pavanpkulkarni.dockerwordcount.DockerWordCount
    /opt/sample.txt
    /opt/output/wordcount_output
    System properties:
    (spark.eventLog.enabled,true)
    (SPARK_SUBMIT,true)
    (spark.history.fs.logDirectory,file:///opt/spark-events)
    (spark.driver.supervise,false)
    (spark.app.name,com.pavanpkulkarni.dockerwordcount.DockerWordCount)
    (spark.jars,file:/opt/Docker_WordCount_Spark-1.0.jar)
    (spark.submit.deployMode,cluster)
    (spark.eventLog.dir,file:///opt/spark-events)
    (spark.master,spark://master:6066)
    Classpath elements:
    
    
    
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    18/05/11 20:06:54 INFO RestSubmissionClient: Submitting a request to launch an application in spark://master:6066.
    18/05/11 20:06:55 INFO RestSubmissionClient: Submission successfully created as driver-20180511200654-0000. Polling submission state...
    18/05/11 20:06:55 INFO RestSubmissionClient: Submitting a request for the status of submission driver-20180511200654-0000 in spark://master:6066.
    18/05/11 20:06:55 INFO RestSubmissionClient: State of driver driver-20180511200654-0000 is now RUNNING.
    18/05/11 20:06:55 INFO RestSubmissionClient: Driver is running on worker worker-20180511194135-172.18.0.4-33221 at 172.18.0.4:33221.
    18/05/11 20:06:55 INFO RestSubmissionClient: Server responded with CreateSubmissionResponse:
    {
      "action" : "CreateSubmissionResponse",
      "message" : "Driver successfully submitted as driver-20180511200654-0000",
      "serverSparkVersion" : "2.2.1",
      "submissionId" : "driver-20180511200654-0000",
      "success" : true
    }
    
    
    

    The output is available on the mounted volume on the host -

    Pavans-MacBook-Pro:create-and-run-spark-job pavanpkulkarni$ ls -ltr docker-volume/spark-output/wordcount_output/
    total 32
    -rw-r--r--  1 pavanpkulkarni  staff  12324 May 11 16:07 part-00000-a4225163-c8dd-4a51-9ef4-44e085e553e4-c000.csv
    -rw-r--r--  1 pavanpkulkarni  staff      0 May 11 16:07 _SUCCESS
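
    The event logs for the run land in the other shared volume, docker-volume/spark-events, so they survive even after the cluster is brought down, and the finished application should now be visible in the History Server UI at http://localhost:18080:

        ls -ltr docker-volume/spark-events/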
    

Automation

Should the Ops team choose to put the job on a scheduler for daily processing, or simply for the convenience of developers, I have created a simple script to take care of the above steps - RunSparkJobOnDocker.sh. This script alone can be used to scale the cluster up or down as required.
TIP: Using the Spark standalone REST API on port 6066, we can monitor the job and bring the cluster down after it completes.
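
As a rough sketch of that idea, the standalone master's REST endpoint on port 6066 (the same one spark-submit talked to above) can be polled for the driver state, and the cluster torn down once the job finishes. The submission ID below is the one returned in the CreateSubmissionResponse earlier:

    # Poll the driver state (SUBMITTED / RUNNING / FINISHED / FAILED ...)
    curl http://localhost:6066/v1/submissions/status/driver-20180511200654-0000

    # Once the driver reports FINISHED, bring the whole cluster down
    docker-compose down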

References

  1. https://docs.docker.com/config/containers/multi-service_container/
  2. https://docs.docker.com/compose/compose-file/
  3. https://databricks.com/session/lessons-learned-from-running-spark-on-docker
  4. https://grzegorzgajda.gitbooks.io/spark-examples/content/basics/docker.html