PySpark – dev set up – Eclipse – Windows

October 4, 2017

For the purposes of this example, we will set up Spark in: C:\Users\Public\Spark_Dev_set_up
Note: I am running Eclipse Neon

Prerequisites

  1. Python 3.5
  2. JRE 8
  3. JDK 1.8
  4. Eclipse plugins: PyDev
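
If you want to sanity-check the prerequisites from code, a minimal Python sketch like the one below verifies the Python version and that a Java runtime is reachable on the PATH:

    import shutil
    import subprocess
    import sys

    # This guide assumes Python 3.5
    if sys.version_info[:2] != (3, 5):
        print("Warning: expected Python 3.5, found %d.%d" % sys.version_info[:2])

    # Spark needs a Java runtime on the PATH
    java = shutil.which("java")
    if java is None:
        raise SystemExit("No 'java' executable found on PATH - install JRE/JDK 8")

    # 'java -version' writes its report to stderr
    print(subprocess.check_output([java, "-version"], stderr=subprocess.STDOUT).decode())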

Steps to set up:

  1. Download Spark from https://spark.apache.org/downloads.html
    1. Choose a Spark release: 2.1.0
    2. Choose a package type: Pre-built for Apache Hadoop 2.6
    3. Download the resulting archive (spark-2.1.0-bin-hadoop2.6.tgz) and extract it to C:\Users\Public\Spark_Dev_set_up


  2. Download winutils.exe
    Download from http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe and copy it to C:\Users\Public\Spark_Dev_set_up\spark-2.1.0-bin-hadoop2.6\winutils\bin
  3. Final folder structure

    C:\Users\Public\Spark_Dev_set_up\spark-2.1.0-bin-hadoop2.6
        bin/
        conf/
        data/
        examples/
        jars/
        licenses/
        python/
        R/
        sbin/
        winutils/
            bin/
        yarn/
        LICENSE
        NOTICE
        README.md
        RELEASE
    
  4. In Eclipse, set the environment variables:
    Window -> Preferences -> PyDev -> Interpreters -> Python Interpreter -> Environment

    Variable: SPARK_HOME
    Value: C:\Users\Public\Spark_Dev_set_up\spark-2.1.0-bin-hadoop2.6

    Variable: HADOOP_HOME
    Value: C:\Users\Public\Spark_Dev_set_up\spark-2.1.0-bin-hadoop2.6\winutils
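
    Alternatively, the same variables can be set from the script itself, before anything from pyspark is imported; a minimal sketch, assuming the folder layout above:

    import os

    # Must be set before pyspark is imported
    os.environ["SPARK_HOME"] = r"C:\Users\Public\Spark_Dev_set_up\spark-2.1.0-bin-hadoop2.6"
    os.environ["HADOOP_HOME"] = r"C:\Users\Public\Spark_Dev_set_up\spark-2.1.0-bin-hadoop2.6\winutils"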
    


  5. In Eclipse, add the Spark libraries to the PYTHONPATH:
    Window -> Preferences -> PyDev -> Interpreters -> Python Interpreter -> Libraries -> New Egg/Zip(s) -> C:\Users\Public\Spark_Dev_set_up\spark-2.1.0-bin-hadoop2.6\python\lib\pyspark.zip

    Window -> Preferences -> PyDev -> Interpreters -> Python Interpreter -> Libraries -> New Egg/Zip(s) -> C:\Users\Public\Spark_Dev_set_up\spark-2.1.0-bin-hadoop2.6\python\lib\py4j-0.10.4-src.zip
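
    Equivalently, the two archives can be added to the module search path at runtime instead of through the PyDev preferences; a minimal sketch, assuming the paths from step 4:

    import os
    import sys

    spark_home = r"C:\Users\Public\Spark_Dev_set_up\spark-2.1.0-bin-hadoop2.6"

    # Make pyspark and its bundled py4j importable without PyDev configuration
    sys.path.insert(0, os.path.join(spark_home, "python", "lib", "pyspark.zip"))
    sys.path.insert(0, os.path.join(spark_home, "python", "lib", "py4j-0.10.4-src.zip"))

    from pyspark.sql import SparkSession  # should now resolve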
    

  6. How to use it:

    In Eclipse, create a new Python project:

    File -> New -> PyDev Project -> "sample"

    Note: If you don't see a Python interpreter configured, click "Click here to configure an interpreter not listed" -> Quick Auto-Config

  7. Create sample program

    Right click on the Project “sample” -> New PyDev Module -> test1.py

    from pyspark.sql import SparkSession

    # Create (or reuse) a SparkSession; when run as a plain Python
    # program, spark-submit defaults the master to local[*]
    spark = SparkSession\
            .builder\
            .appName("sample")\
            .getOrCreate()

    # Grab the underlying SparkContext and run a trivial RDD job
    sc = spark.sparkContext
    myRdd = sc.parallelize([1, 2, 3, 4])
    print(myRdd.take(5))  # prints [1, 2, 3, 4]
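
    The same SparkSession also exposes the DataFrame API; as a short optional extension of the sample above:

    # Build a small DataFrame from local data and run a trivial query
    df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "letter"])
    df.filter(df.id > 1).show()

    # Stop the session to release the local JVM when done
    spark.stop()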
    
  8. Run the program

    Right click on the program -> Run As -> Python Run
    Program output: [1, 2, 3, 4]
