Install Apache Spark 2.3

This post will guide you through the steps to install Spark

Pavan Kulkarni

2 minute read

This post will guide you through the installation of Apache Spark 2.3.

  1. Download the latest version of Apache Spark to your local machine from here. This will download spark-x.x.x-bin-hadoop2.7.tgz.
  2. Uncompress the .tgz file (e.g., with tar -xzf) to your desired directory. For the purpose of this post, I will extract it to /Users/pavanpkulkarni/Documents/spark
  3. Add the entries below to your ~/.bash_profile

    #Spark Home
    export SPARK_HOME=/Users/pavanpkulkarni/Documents/spark/spark-2.3.0-bin-hadoop2.7
    export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
    
  4. Source the ~/.bash_profile file so the changes take effect.

    source ~/.bash_profile
    
  5. Verify the installation by launching spark-shell

    Pavans-MacBook-Pro:~ pavanpkulkarni$ spark-shell
    2018-04-09 14:00:15 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    Spark context Web UI available at http://10.0.0.67:4040
    Spark context available as 'sc' (master = local[*], app id = local-1523296821403).
    Spark session available as 'spark'.
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 2.3.0
          /_/
    
    Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_152)
    Type in expressions to have them evaluated.
    Type :help for more information.
    
    scala> 
    

    The Spark Web UI should be available at http://localhost:4040/
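
    You can also sanity-check the install from inside the shell, using the sc and spark handles mentioned in the banner above. A minimal sketch; the comments describe the expected results, which may differ slightly on your machine:

    scala> spark.version   // should return the installed version, e.g. 2.3.0
    scala> sc.master       // should return local[*] when running the shell locally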

  6. Run some sample code in spark-shell

    scala> val rdd = sc.parallelize(1 to 1000000, 10)
    rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
    
    scala> rdd.count()
    res0: Long = 1000000
    
    scala> rdd.take(20)
    res1: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)
    
    scala> val rdd1 = rdd.map( _ + 1 )
    rdd1: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:25
    
    scala> rdd1.count()
    res2: Long = 1000000
    
    scala> rdd1.take(20)
    res3: Array[Int] = Array(2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21)
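
    Since the shell also exposes a SparkSession as spark (with spark.implicits._ already imported), you can try the DataFrame API too. A minimal sketch; the sample data and column names below are purely illustrative:

    scala> val df = Seq(("alice", 34), ("bob", 45), ("carol", 29)).toDF("name", "age")
    scala> df.filter($"age" > 30).show()   // should print only the alice and bob rows
    scala> df.count()                      // should return 3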