How to Set Up PyCharm to Run PySpark Jobs

This post is a step-by-step guide to setting up PyCharm to run PySpark jobs

Pavan Kulkarni

4 minute read

This post walks through how to set up your local system to test PySpark jobs, followed by a demo of running the same code with the spark-submit command.

Prerequisites

  1. PyCharm (download from here)
  2. Python (Read this to Install Python)
  3. Apache Spark (Read this to Install Spark)

Let’s Begin

  1. Clone my repo from GitHub for a sample WordCount in PySpark.
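
    For reference, the core of the word-count script looks roughly like the sketch below (a minimal version, assuming a hypothetical input path of input.txt; the repo's actual paths and structure may differ):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split, col

    # build a local SparkSession; the application name matches the logs further down
    spark = SparkSession.builder.master("local[*]").appName("WordCount").getOrCreate()

    # read the input file, split each line on spaces, and count word occurrences
    lines = spark.read.text("input.txt")  # hypothetical input path
    words = lines.select(explode(split(col("value"), " ")).alias("word"))
    counts = words.groupBy("word").count()
    counts.show()

    spark.stop()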

  2. Import the cloned project into PyCharm:

    File –> Open –> path_to_project

  3. Open ~/.bash_profile and add the lines below.

    # Spark Home
    export SPARK_HOME=/Users/pavanpkulkarni/Documents/spark/spark-2.2.1-bin-hadoop2.7
    export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
    export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
    # the py4j version must match the zip shipped under $SPARK_HOME/python/lib
    export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH
    

    Source ~/.bash_profile to apply the changes:

    source ~/.bash_profile
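
    To confirm the variables are picked up, a quick sanity check from a plain Python shell should now work (the printed version should match your Spark install):

    # with PYTHONPATH set as above, pyspark is importable without pip-installing it
    import pyspark
    print(pyspark.__version__)  # e.g. 2.2.1 for this setup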
    
  4. Install pyspark and pypandoc as shown below.

    PyCharm –> Preferences –> Project Interpreter

    (Screenshot: Set_Project_Interpreter)
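
    Alternatively, the same packages can be installed from the terminal with pip (assuming pip points at the project's interpreter):

    pip install pyspark pypandoc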

  5. Go to PyCharm –> Preferences –> Project Structure. Click on Add Content Root and add the following paths:

    /Users/pavanpkulkarni/Documents/spark/spark-2.2.1-bin-hadoop2.7/python
    /Users/pavanpkulkarni/Documents/spark/spark-2.2.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip

    (Screenshot: Set_Project_Structure)
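
    Adding these content roots lets PyCharm resolve the PySpark sources for code completion, so imports such as the one below are no longer flagged as unresolved:

    from pyspark.sql import SparkSession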

  6. Restart PyCharm.

  7. Run the project. You should see output similar to this:

    /Users/pavanpkulkarni/venv/bin/python /Users/pavanpkulkarni/Documents/workspace/PySpark_WordCount/com/pavanpkulkarni/wordcount/PysparkWordCount.py
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    18/04/27 15:00:16 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    +---------+-----+
    |     word|count|
    +---------+-----+
    |      ...|    5|
    |     July|    2|
    |       By|    1|
    |   North,|    1|
    |   taking|    1|
    |    harry|   18|
    |     #TBT|    1|
    |  Potter:|    3|
    |character|    1|
    |        7|    2|
    |  Phoenix|    1|
    |   Number|    1|
    |      day|    1|
    |   (Video|    1|
    | seconds)|    1|
    | Hermione|    3|
    |    Which|    1|
    |      did|    1|
    |   Potter|   38|
    |Voldemort|    1|
    +---------+-----+
    only showing top 20 rows
    
    
     Word Count :  None
    
    Process finished with exit code 0
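
    Note: the "Word Count :  None" line above comes from the script itself, presumably because it prints the return value of show(). DataFrame.show() writes the table to stdout and returns None, as this hypothetical snippet illustrates:

    # show() prints the DataFrame and returns None, hence "Word Count :  None"
    print("Word Count : ", counts.show())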
    
    

Run PySpark with spark-submit

  1. Open a terminal and run:

    Pavans-MacBook-Pro:PySpark_WordCount pavanpkulkarni$ spark-submit com/pavanpkulkarni/wordcount/PysparkWordCount.py
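
    spark-submit also accepts options before the script path; for example, to be explicit about running locally with four cores (an illustrative variant, not required for this demo):

    spark-submit --master "local[4]" com/pavanpkulkarni/wordcount/PysparkWordCount.py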
    
  2. Output:

    
    Pavans-MacBook-Pro:PySpark_WordCount pavanpkulkarni$ spark-submit com/pavanpkulkarni/wordcount/PysparkWordCount.py
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    18/04/27 15:06:32 INFO SparkContext: Running Spark version 2.2.1
    18/04/27 15:06:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    18/04/27 15:06:32 INFO SparkContext: Submitted application: WordCount
    18/04/27 15:06:32 INFO SecurityManager: Changing view acls to: pavanpkulkarni
    18/04/27 15:06:32 INFO SecurityManager: Changing modify acls to: pavanpkulkarni
    18/04/27 15:06:32 INFO SecurityManager: Changing view acls groups to:
    18/04/27 15:06:32 INFO SecurityManager: Changing modify acls groups to:
    18/04/27 15:06:32 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(pavanpkulkarni); groups with view permissions: Set(); users  with modify permissions: Set(pavanpkulkarni); groups with modify permissions: Set()
    18/04/27 15:06:33 INFO Utils: Successfully started service 'sparkDriver' on port 60536.
    18/04/27 15:06:33 INFO SparkEnv: Registering MapOutputTracker
    18/04/27 15:06:33 INFO SparkEnv: Registering BlockManagerMaster
    18/04/27 15:06:33 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
    18/04/27 15:06:33 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
    18/04/27 15:06:33 INFO DiskBlockManager: Created local directory at /private/var/folders/nb/ygmwx13x6y1_9pyzg1_82w440000gn/T/blockmgr-67bf5d7f-1b6a-4ce0-9795-b11d17618bd3
    18/04/27 15:06:33 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
    18/04/27 15:06:33 INFO SparkEnv: Registering OutputCommitCoordinator
    18/04/27 15:06:33 INFO Utils: Successfully started service 'SparkUI' on port 4040.
    18/04/27 15:06:33 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.0.0.67:4040
    18/04/27 15:06:33 INFO SparkContext: Added file file:/Users/pavanpkulkarni/Documents/workspace/PySpark_WordCount/com/pavanpkulkarni/wordcount/PysparkWordCount.py at file:/Users/pavanpkulkarni/Documents/workspace/PySpark_WordCount/com/pavanpkulkarni/wordcount/PysparkWordCount.py with timestamp 1524855993821
    18/04/27 15:06:33 INFO Utils: Copying /Users/pavanpkulkarni/Documents/workspace/PySpark_WordCount/com/pavanpkulkarni/wordcount/PysparkWordCount.py to /private/var/folders/nb/ygmwx13x6y1_9pyzg1_82w440000gn/T/spark-711b082d-dde7-474b-b370-f046fc4bdf04/userFiles-cf6b481c-9b81-41b4-b929-a20d39e9f4cd/PysparkWordCount.py
    18/04/27 15:06:33 INFO Executor: Starting executor ID driver on host localhost
    18/04/27 15:06:33 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 60537.
    18/04/27 15:06:33 INFO NettyBlockTransferService: Server created on 10.0.0.67:60537
    18/04/27 15:06:33 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
    18/04/27 15:06:33 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.0.0.67, 60537, None)
    18/04/27 15:06:33 INFO BlockManagerMasterEndpoint: Registering block manager 10.0.0.67:60537 with 366.3 MB RAM, BlockManagerId(driver, 10.0.0.67, 60537, None)
    18/04/27 15:06:33 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.0.0.67, 60537, None)
    18/04/27 15:06:33 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.0.0.67, 60537, None)
    18/04/27 15:06:34 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/Users/pavanpkulkarni/Documents/workspace/PySpark_WordCount/spark-warehouse/').
    18/04/27 15:06:34 INFO SharedState: Warehouse path is 'file:/Users/pavanpkulkarni/Documents/workspace/PySpark_WordCount/spark-warehouse/'.
    18/04/27 15:06:34 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
    18/04/27 15:06:36 INFO FileSourceStrategy: Pruning directories with:
    18/04/27 15:06:36 INFO FileSourceStrategy: Post-Scan Filters:
    18/04/27 15:06:36 INFO FileSourceStrategy: Output Data Schema: struct<value: string>
    18/04/27 15:06:36 INFO FileSourceScanExec: Pushed Filters:
    18/04/27 15:06:36 INFO ContextCleaner: Cleaned accumulator 1
    
        .
        .
        .
        .
    
    18/04/27 15:06:38 INFO TaskSetManager: Finished task 3.0 in stage 5.0 (TID 6) in 62 ms on localhost (executor driver) (8/8)
    18/04/27 15:06:38 INFO TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks have all completed, from pool
    18/04/27 15:06:38 INFO DAGScheduler: ResultStage 5 (showString at NativeMethodAccessorImpl.java:0) finished in 0.062 s
    18/04/27 15:06:38 INFO DAGScheduler: Job 2 finished: showString at NativeMethodAccessorImpl.java:0, took 0.070147 s
    +---------+-----+
    |     word|count|
    +---------+-----+
    |      ...|    5|
    |     July|    2|
    |       By|    1|
    |   North,|    1|
    |   taking|    1|
    |    harry|   18|
    |     #TBT|    1|
    |  Potter:|    3|
    |character|    1|
    |        7|    2|
    |  Phoenix|    1|
    |   Number|    1|
    |      day|    1|
    |   (Video|    1|
    | seconds)|    1|
    | Hermione|    3|
    |    Which|    1|
    |      did|    1|
    |   Potter|   38|
    |Voldemort|    1|
    +---------+-----+
    only showing top 20 rows
    
    
     Word Count :  None
    18/04/27 15:06:38 INFO SparkUI: Stopped Spark web UI at http://10.0.0.67:4040
    18/04/27 15:06:38 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
    18/04/27 15:06:38 INFO MemoryStore: MemoryStore cleared
    18/04/27 15:06:38 INFO BlockManager: BlockManager stopped
    18/04/27 15:06:38 INFO BlockManagerMaster: BlockManagerMaster stopped
    18/04/27 15:06:38 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
    18/04/27 15:06:38 INFO SparkContext: Successfully stopped SparkContext
    18/04/27 15:06:38 INFO ShutdownHookManager: Shutdown hook called
    18/04/27 15:06:38 INFO ShutdownHookManager: Deleting directory /private/var/folders/nb/ygmwx13x6y1_9pyzg1_82w440000gn/T/spark-711b082d-dde7-474b-b370-f046fc4bdf04
    18/04/27 15:06:38 INFO ShutdownHookManager: Deleting directory /private/var/folders/nb/ygmwx13x6y1_9pyzg1_82w440000gn/T/spark-711b082d-dde7-474b-b370-f046fc4bdf04/pyspark-f0149541-b9b5-4fc9-b5d6-08ae6ce63968
    
    Pavans-MacBook-Pro:PySpark_WordCount pavanpkulkarni$
        
    

References:

  1. https://www.jetbrains.com/pycharm/
  2. https://spark.apache.org/docs/2.2.1/api/python/pyspark.html