Install Apache Spark 2.3

This post will guide you through the steps to install Spark

Pavan Kulkarni

This post will guide you through the installation of Apache Spark 2.3.

  1. Download the latest version of Apache Spark to your local machine from the Apache Spark downloads page. This will download spark-x.x.x-bin-hadoop2.7.tgz.
  2. Uncompress the .tgz into your desired directory. For the purpose of this post, I will extract it to /Users/pavanpkulkarni/Documents/spark.
  3. Add the entries below to your ~/.bash_profile:

    #Spark Home
    export SPARK_HOME=/Users/pavanpkulkarni/Documents/spark/spark-2.3.0-bin-hadoop2.7
    export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
  4. Source the ~/.bash_profile file to reflect the changes.

    source ~/.bash_profile
  5. Verify the installation by launching spark-shell:

    Pavans-MacBook-Pro:~ pavanpkulkarni$ spark-shell
    2018-04-09 14:00:15 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    Spark context Web UI available at
    Spark context available as 'sc' (master = local[*], app id = local-1523296821403).
    Spark session available as 'spark'.
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 2.3.0
    Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_152)
    Type in expressions to have them evaluated.
    Type :help for more information.

    While the shell is running, the Web UI should be available at http://localhost:4040/

  6. Run some sample code in spark-shell:

    scala> val rdd = sc.parallelize(1 to 1000000, 10)
    rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
    scala> rdd.count()
    res0: Long = 1000000
    scala> rdd.take(20)
    res1: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)
    scala> val rdd1 = rdd.map( _ + 1 )
    rdd1: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:25
    scala> rdd1.count()
    res2: Long = 1000000
    scala> rdd1.take(20)
    res3: Array[Int] = Array(2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21)
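Steps 3 and 4 above can also be scripted so that re-running the setup does not add duplicate entries to the profile. The following is a minimal sketch, not part of Spark itself: the helper names `add_spark_env` and `check_spark_home` are hypothetical, and the profile path and Spark directory are passed in as arguments rather than assumed.

```shell
# Sketch only: add_spark_env and check_spark_home are hypothetical helpers,
# not commands shipped with Spark.

# Append the SPARK_HOME and PATH entries to a profile file.
# Does nothing if the file already contains an "export SPARK_HOME=" line.
add_spark_env() {
  local spark_home="$1"   # the unpacked spark-x.x.x-bin-hadoop2.7 directory
  local profile="$2"      # for example ~/.bash_profile

  if [ -f "$profile" ] && grep -q '^export SPARK_HOME=' "$profile"; then
    echo "SPARK_HOME is already set in $profile; leaving it unchanged"
    return 0
  fi

  {
    echo '#Spark Home'
    echo "export SPARK_HOME=\"$spark_home\""
    # Single quotes keep $PATH and $SPARK_HOME unexpanded in the profile.
    echo 'export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin'
  } >> "$profile"
}

# Succeed only if the directory contains an executable bin/spark-shell.
check_spark_home() {
  local spark_home="$1"
  [ -x "$spark_home/bin/spark-shell" ]
}
```

With the layout used in this post, a call might look like `add_spark_env /Users/pavanpkulkarni/Documents/spark/spark-2.3.0-bin-hadoop2.7 ~/.bash_profile`, followed by `source ~/.bash_profile`. Running `check_spark_home "$SPARK_HOME"` afterwards is a quick way to catch a mistyped path before launching spark-shell.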