Installation
PySpark is included in the official releases of Spark available on the Apache Spark website. For Python users, PySpark also provides pip installation from PyPI. This is usually for local usage or as a client to connect to a cluster, rather than for setting up a cluster itself.
This page includes instructions for installing PySpark by using pip, Conda, downloading manually, and building from the source.
Python Versions Supported
Python 3.8 and above.
Using PyPI
PySpark installation using PyPI is as follows:
pip install pyspark
If you want to install extra dependencies for a specific component, you can install it as below:
# Spark SQL
pip install pyspark[sql]
# pandas API on Spark
pip install pyspark[pandas_on_spark] plotly  # to plot your data, you can install plotly together.
# Spark Connect
pip install pyspark[connect]
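After running one of the pip commands above, it can help to confirm that the package is actually visible to the interpreter you intend to use. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of PySpark.

```python
import importlib.metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return importlib.metadata.version(package)
    except importlib.metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("pyspark")
    if version is None:
        print("pyspark is not installed in this environment")
    else:
        print("pyspark version:", version)
```

Running this with the same Python executable that pip installed into avoids confusion when several environments exist on one machine.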
For PySpark with/without a specific Hadoop version, you can install it by using the PYSPARK_HADOOP_VERSION environment variable as below:
PYSPARK_HADOOP_VERSION=3 pip install pyspark
The default distribution uses Hadoop 3.3 and Hive 2.3. If users specify different versions of Hadoop, the pip installation automatically downloads a different version and uses it in PySpark. Downloading it can take a while depending on the network and the mirror chosen. PYSPARK_RELEASE_MIRROR can be set to manually choose the mirror for faster downloading.
PYSPARK_RELEASE_MIRROR=http://mirror.apache-kr.org PYSPARK_HADOOP_VERSION=3 pip install pyspark
It is recommended to use the -v option in pip to track the installation and download status.
PYSPARK_HADOOP_VERSION=3 pip install pyspark -v
Supported values in PYSPARK_HADOOP_VERSION are:
- without : Spark pre-built with user-provided Apache Hadoop
- 3 : Spark pre-built for Apache Hadoop 3.3 and later (default)
Note that this installation of PySpark with/without a specific Hadoop version is experimental. It can change or be removed between minor releases.
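If the installation is scripted, the two environment variables described above can be prepared in Python before pip is invoked. This is only a sketch: the helper name `pip_install_env` is hypothetical, and the set of accepted values is taken from the list above rather than from any PySpark API.

```python
import os
from typing import Dict, Optional

# Values listed above as supported for PYSPARK_HADOOP_VERSION.
SUPPORTED_HADOOP_VERSIONS = ("without", "3")


def pip_install_env(hadoop_version: str = "3",
                    mirror: Optional[str] = None) -> Dict[str, str]:
    """Build an environment for `pip install pyspark` with the Hadoop variant set."""
    if hadoop_version not in SUPPORTED_HADOOP_VERSIONS:
        raise ValueError(
            f"unsupported PYSPARK_HADOOP_VERSION {hadoop_version!r}; "
            f"expected one of {SUPPORTED_HADOOP_VERSIONS}"
        )
    env = dict(os.environ)  # copy, so the current process is left untouched
    env["PYSPARK_HADOOP_VERSION"] = hadoop_version
    if mirror is not None:
        env["PYSPARK_RELEASE_MIRROR"] = mirror
    return env


# Possible usage (not executed here):
#   import subprocess, sys
#   subprocess.run([sys.executable, "-m", "pip", "install", "pyspark", "-v"],
#                  env=pip_install_env("3"), check=True)
```

Validating the value up front gives a clear error before any download starts.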
Using Conda
Conda is an open-source package management and environment management system (developed by Anaconda), which is best installed through Miniconda or Miniforge. The tool is both cross-platform and language agnostic, and in practice, conda can replace both pip and virtualenv.
Conda uses so-called channels to distribute packages, and together with the default channels by Anaconda itself, the most important channel is conda-forge, which is the community-driven packaging effort that is the most extensive & the most current (and also serves as the upstream for the Anaconda channels in most cases).
To create a new conda environment from your terminal and activate it, proceed as shown below:
conda create -n pyspark_env
conda activate pyspark_env
After activating the environment, use the following command to install pyspark, a python version of your choice, as well as other packages you want to use in the same session as pyspark (you can install in several steps too).
conda install -c conda-forge pyspark # can also add "python=3.8 some_package [etc.]" here
Note that PySpark for conda is maintained separately by the community; while new versions generally get packaged quickly, the availability through conda(-forge) is not directly in sync with the PySpark release cycle.
While using pip in a conda environment is technically feasible (with the same command as above), this approach is discouraged, because pip does not interoperate with conda.
For a short summary about useful conda commands, see their cheat sheet.
Manually Downloading
PySpark is included in the distributions available at the Apache Spark website. You can download a distribution you want from the site. After that, uncompress the tar file into the directory where you want to install Spark, for example, as below:
tar xzvf spark-master-bin-hadoop3.tgz
Ensure the SPARK_HOME environment variable points to the directory where the tar file has been extracted. Update the PYTHONPATH environment variable so that it can find PySpark and Py4J under SPARK_HOME/python/lib. One example of doing this is shown below:
cd spark-master-bin-hadoop3
export SPARK_HOME=`pwd`
export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYTHONPATH
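The same effect can be approximated from inside a running Python process by adding the zip archives under SPARK_HOME/python/lib to `sys.path`. The sketch below assumes the directory layout described above; the function name `add_pyspark_to_sys_path` is hypothetical.

```python
import glob
import os
import sys
from typing import List


def add_pyspark_to_sys_path(spark_home: str) -> List[str]:
    """Put every zip under <spark_home>/python/lib at the front of sys.path."""
    pattern = os.path.join(spark_home, "python", "lib", "*.zip")
    archives = sorted(glob.glob(pattern))
    # Insert in reverse so the final order in sys.path matches the sorted order.
    for archive in reversed(archives):
        if archive not in sys.path:
            sys.path.insert(0, archive)
    return archives


if __name__ == "__main__":
    home = os.environ.get("SPARK_HOME")
    if home:
        print("added:", add_pyspark_to_sys_path(home))
    else:
        print("SPARK_HOME is not set")
```

Unlike the shell export, this only affects the current interpreter, which is convenient for notebooks but does not persist across sessions.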
Installing from Source
To install PySpark from source, refer to Building Spark.
How to Install PySpark on Your Windows Machine Effortlessly
Are you excited about diving into the world of PySpark, but puzzled by the install process on your Windows machine? Fret not! In this step-by-step guide, we’ll walk you through installing PySpark without breaking a sweat.
What Is PySpark?
PySpark, the Python API for Apache Spark, empowers users to conduct real-time, large-scale data processing in distributed settings using Python. It offers a PySpark shell for interactive data analysis, blending Python’s user-friendliness with the robust capabilities of Apache Spark. PySpark encompasses Spark’s full suite of functionalities, including Spark SQL, DataFrames, Structured Streaming, Machine Learning (MLlib), and Spark Core, making it accessible to Python-savvy individuals for data processing and analysis of any scale.
Prerequisites
Before we jump into the installation of PySpark, make sure you have the following prerequisites in place:
- Python: PySpark requires Python. Ensure you have Python installed on your Windows machine. If not, download and install Python from the official website.
- Java: PySpark relies on Java, so ensure you have Java Development Kit (JDK) installed. You can download JDK from Oracle’s website.
Install PySpark on Windows
Now, let’s get to the heart of the matter — install PySpark.
Follow these simple steps:
Step 1: Download Spark
- Visit the official Apache Spark website (https://spark.apache.org/downloads.html).
- Choose the latest stable version of Spark.
- Select “Pre-built for Apache Hadoop” and download the “Direct Download” link for your chosen version.
- Extract the downloaded .tgz file to your preferred location
- Let’s say the extracted folder is C:\Spark\spark-3.4.1-bin-hadoop3.2
- This location will be set as SPARK_HOME
Step 2: Download Hadoop
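- Download the winutils.exe file from GitHub
- Select the Hadoop version that matches the version selected in Step 1
- Click winutils.exe and download it under C:\Spark\spark-3.4.1-bin-hadoop3.2\Hadoop\bin
- The parent folder, C:\Spark\spark-3.4.1-bin-hadoop3.2\Hadoop, will be set as HADOOP_HOME (Spark looks for winutils.exe in %HADOOP_HOME%\bin)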
Step 3: Set Environment Variables
- Open the File Explorer
- Right Click on “This PC”
- Click on “Properties”
- Click on “Advanced system settings” on the left.
- In the System Properties window, click the “Environment Variables” button.
- Under “System variables,” click “New” and add the following variables:
- Variable name: SPARK_HOME
- Variable value: The path to the Spark folder you extracted earlier (C:\Spark\spark-3.4.1-bin-hadoop3.2).
- Variable name: HADOOP_HOME
- Variable value: The path to the Hadoop folder within the Spark directory, that is, the folder that contains bin (e.g., C:\Spark\spark-3.4.1-bin-hadoop3.2\Hadoop).
- Variable name: Path
- Variable value: The folder that contains your Python executable (e.g., C:\Users\userName\anaconda3). Path usually exists already, so select it and click “Edit” to add the entry instead of creating a new variable.
- Click “OK” to save the environment variables.
Step 4: Install PySpark Findspark
In this brief guide, we’ve simplified the process of installing PySpark on your Windows machine. Now you’re ready to embark on your data adventures with PySpark, harnessing its immense capabilities effortlessly.
Remember, the journey of data exploration and analysis begins with a single installation step. Happy coding!
How to setup PySpark on Windows?
Apache Spark is an engine widely used for big data processing. But why do we need it? Firstly, we have produced and consumed a huge amount of data within the past decade and a half. Secondly, we decided to process this data for decision-making and better predictions. Now as the amount of data grows, so does the need for infrastructure to process it efficiently and quickly (oh! The impatient Homo sapiens).
Apache Spark is an open-source engine and was released by the Apache Software Foundation in 2014 for handling and processing a humongous amount of data. Currently, Apache Spark provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. Spark also supports higher-level tools including Spark SQL for SQL and structured data processing, and MLlib for machine learning, to name a few.
Spark helps by partitioning the data across the nodes of a cluster and parallelizing the data processing task for GBs and TBs of data. It does so at a very low latency, too. You can read further about the features and usage of Spark here.
But what is PySpark?
To put it in simple words, PySpark is a set of Spark APIs in Python language. It includes almost all Apache Spark features. Because of the simplicity of Python and the efficient processing of large datasets by Spark, PySpark became a hit among the data science practitioners who mostly like to work in Python.
What is wrong with “pip install pyspark” ?
Well, we (Python coders) love Python partly because of the rich libraries and easy one-step installation. In the case of PySpark, it is a bit different: you can still use the above-mentioned command, but your capabilities with it are limited. When using pip, you can install only the PySpark package which can be used to test your jobs locally or run your jobs on an existing cluster running with Yarn, Standalone, or Mesos. It does not contain features or libraries to set up your own cluster, which is a capability you want to have as a beginner.
If you want PySpark with all its features, including starting your own cluster, then follow this blog further…
PySpark Installation
Dependencies of PySpark for Windows system include:
1. Download and Install JAVA
As Spark uses Java Virtual Machine internally, it has a dependency on JAVA. Install the latest version of the JAVA from here.
- JAVA Download Link: here
- Install JAVA by running the downloaded file (easy and traditional browse…next…next…finish installation)
2. Download and Install Python
If you are going to work on a data science related project, I recommend you download Python and Jupyter Notebook together with the Anaconda Navigator.
- Anaconda Download Link: here
- Follow the self-explanatory traditional installation steps (same as above)
Otherwise, you can also download Python and Jupyter Notebook separately
- Python Download link: here
- Run the downloaded file for installation, make sure to check the “include python to Path” and install the recommended packages (including ‘pip’)
To see if Python was successfully installed and that Python is in the PATH environment variable, go to the command prompt and type “python”. You should see something like this. (my Python version is 3.8.5, yours could be different)
In case you do not see the above command, please follow this tutorial for help.
Next, you will need the Jupyter Notebook to be installed for learning integration with PySpark
Install Jupyter Notebook by typing the following command on the command prompt: “pip install notebook”
3. Download and unzip PySpark
Finally, it is time to get PySpark. From the link provided below, download the .tgz file using bullet point 3. You can choose the version from the drop-down menus. Then download the 7-zip or any other extractor and extract the downloaded PySpark file. Remember, you will have to unzip the file twice.
- PySpark Download Link: here
- 7zip Download Link: here
Note: The location of my file where I extracted Pyspark is
“E:\PySpark\spark-3.2.1-bin-hadoop3.2” (we will need it later)
4. Download winutils.exe
In order to run Apache Spark locally, winutils.exe is required on the Windows operating system. This is because Spark needs elements of the Hadoop codebase called ‘winutils’ when it runs on Windows. These Windows utilities (winutils) help manage the POSIX (Portable Operating System Interface) file system permissions that HDFS (Hadoop Distributed File System) requires from the local (Windows) file system.
Too-technical? Just download it. Make sure to select the correct Hadoop version.
- winutils.exe Download Link: here
- Create a folder structure hadoop\bin within the Pyspark folder and put the downloaded winutils.exe file there.
Note: The location of my winutils.exe is
“E:\PySpark\spark-3.2.1-bin-hadoop3.2\hadoop\bin”
5. Set Environment variables
Now that we have downloaded everything we need, it is time to make it accessible through the command prompt by setting the environment variables.
Some Side Info: What are Environment variables?
Environment variables are global system variables accessible by all the processes / users running under the operating system.
PATH is the most frequently used environment variable. It stores a list of directories to search for executable programs (.exe files). To reference a variable in Windows, you can use %varname%.
Some more side info: What does PATH do?
When you launch an executable program (with a file extension of ".exe", ".bat" or ".com") from the command prompt, Windows searches for the executable program in the current working directory, followed by all the directories listed in the PATH environment variable. If the program is not found in these directories, you will get the following error saying “the command is not recognized”.
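To make the lookup order concrete, here is a simplified sketch of the search described above: current working directory first, then each PATH entry in order. It is an illustration only. The function name `find_program` is hypothetical, and real Windows consults the PATHEXT variable for the extension list, which this sketch replaces with the three extensions named in the text.

```python
import os
from typing import Optional, Sequence


def find_program(name: str, cwd: str, path_value: str,
                 extensions: Sequence[str] = (".exe", ".bat", ".com")) -> Optional[str]:
    """Return the first matching executable, searching cwd and then PATH entries."""
    directories = [cwd] + [d for d in path_value.split(os.pathsep) if d]
    for directory in directories:
        for extension in extensions:
            candidate = os.path.join(directory, name + extension)
            if os.path.isfile(candidate):
                return candidate
    # Nothing found: this is the case where the shell reports
    # that the command is not recognized.
    return None
```

For real lookups, the standard library's `shutil.which` performs this search using the actual PATH of the running process.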
Back to the PySpark installation. In order to set the environment variables
- Go to Windows search
- Type “env”. It will show “edit environment variable for your account”; click on it
- Click on “New” for the user variables and add the following variable name and values (depending upon the location of the downloaded files)
Next, update the PATH variable with the \bin folder addresses that contain the executable files of PySpark and Hadoop. This makes it possible to run PySpark from the command prompt.
- Click on the “Path” variable
- Then add the following two values (we are using the previously defined environment variables here)
%SPARK_HOME%\bin
%HADOOP_HOME%\bin
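Before starting PySpark, the variables can be sanity-checked from Python. The sketch below assumes the layout used in this guide: SPARK_HOME is the extracted Spark folder (containing bin), and HADOOP_HOME is the hadoop folder whose bin subfolder holds winutils.exe. The function name `check_spark_env` is hypothetical.

```python
import os
from typing import List, Mapping


def check_spark_env(env: Mapping[str, str]) -> List[str]:
    """Return a list of problems found in the Spark-related environment variables."""
    problems = []

    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
    elif not os.path.isdir(os.path.join(spark_home, "bin")):
        problems.append("SPARK_HOME has no bin folder")

    hadoop_home = env.get("HADOOP_HOME")
    if not hadoop_home:
        problems.append("HADOOP_HOME is not set")
    elif not os.path.isfile(os.path.join(hadoop_home, "bin", "winutils.exe")):
        problems.append("winutils.exe not found under HADOOP_HOME\\bin")

    return problems


if __name__ == "__main__":
    issues = check_spark_env(os.environ)
    print("environment looks fine" if not issues else "\n".join(issues))
```

Open a new command prompt before running the check, since already open windows do not pick up newly added variables.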
6. Let’s fire PySpark!
Test if PySpark has been installed correctly and all the environment variables are set.
Great! You have now installed PySpark successfully and it seems to be running. To see the Spark web UI, go to “http://localhost:4040” without closing the command prompt and check for yourself.
7. Jupyter Notebook integration with Python
Now, once PySpark is running in the background, you can open a Jupyter notebook and start working in it. But running PySpark commands will still throw an error, because the notebook's Python cannot locate the PySpark installation. In that case, you will have to use the Python library “findspark” and run the following two commands before the PySpark import statements in the Jupyter Notebook.
import findspark
findspark.init()
But there is a workaround. You can configure PySpark to fire up a Jupyter Notebook instantiated with the current Spark cluster by running just the command “pyspark” on the command prompt. To achieve this, you will not have to download additional libraries. For this…
… you will need to add two more environment variables
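The two variables are not listed in the text. A commonly used pair, assumed here rather than taken from this guide, is PYSPARK_DRIVER_PYTHON=jupyter and PYSPARK_DRIVER_PYTHON_OPTS=notebook, which tells the pyspark launcher to start Jupyter as the driver front end. The sketch below only builds such an environment; the function name `jupyter_driver_env` is hypothetical.

```python
import os
from typing import Dict, Mapping, Optional


def jupyter_driver_env(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with the Jupyter driver settings added."""
    env = dict(os.environ if base is None else base)
    # Assumed values; adjust them if your setup uses a different front end.
    env["PYSPARK_DRIVER_PYTHON"] = "jupyter"
    env["PYSPARK_DRIVER_PYTHON_OPTS"] = "notebook"
    return env


# Possible usage (not executed here):
#   import subprocess
#   subprocess.run("pyspark", env=jupyter_driver_env(), shell=True)
```

Setting the same two names in the Windows environment variables dialog makes the behaviour permanent for every new command prompt.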
Now, when you run the “pyspark” in the command prompt:
- It will give information on how to open the Jupyter Notebook.
- Just copy the URL (highlight and use CTRL+c) and paste it into the browser along with the token information — this will open Jupyter Notebook.
8. Running a sample code on the Jupyter Notebook
This step makes sure everything is working fine and that you are ready to use PySpark integrated with your Jupyter Notebook.
- Run Pyspark through the command prompt
- Open Jupyter Notebook
- Write the following commands and execute them
# Import Libraries
from pyspark.sql import SparkSession
# Set up the Spark session (the entry point for DataFrame operations)
spark = SparkSession.builder.getOrCreate()
# Add Data
data = [(1580, "John", "Doe", "Mars"),
        (5820, "Jane", "Doe", "Venus"),
        (2340, "Kid1", "Doe", "Jupiter"),
        (7860, "Kid2", "Doe", "Saturn")]
# Set up the Data Frame
user_data_df = spark.createDataFrame(data)
# Display the Data Frame
user_data_df.show()
- Open the URL http://localhost:4040 and check for yourself.
My Version information
- Python: 3.8.5
- JAVA: 1.8.0_331
Java™ SE Runtime Environment (build 1.8.0_331-b09)
Java HotSpot™ 64-Bit Server VM (build 25.331-b09, mixed mode)
- PySpark: 3.2.1 (spark-3.2.1-bin-hadoop3.2.tgz)
- Hadoop winutils.exe: 3.2.1
- Jupyter:
IPython : 7.30.1
ipykernel : 6.6.0
jupyter_client : 7.0.6
jupyter_core : 4.9.1
notebook : 6.4.6
CONGRATULATIONS! You were able to set up the environment for PySpark on your Windows machine.
Please write in the comment section if you face any issues.
Installing Apache PySpark on Windows 10
Apache Spark Installation Instructions for Product Recommender Data Science Project
Published in Towards Data Science, Aug 30, 2019
Over the last few months, I was working on a Data Science project which handles a huge dataset and it became necessary to use the distributed environment provided by Apache PySpark.
I struggled a lot while installing PySpark on Windows 10. So I decided to write this blog to help anyone easily install and use Apache PySpark on a Windows 10 machine.
1. Step 1
PySpark requires Java version 7 or later and Python version 2.6 or later. Let’s first check if they are already installed or install them and make sure that PySpark can work with these two components.
Installing Java
Check if Java version 7 or later is installed on your machine. For this execute following command on Command Prompt.
If Java is installed and configured to work from a Command Prompt, running the above command should print the information about the Java version to the console. Else if you get a message like:
‘java’ is not recognized as an internal or external command, operable program or batch file.
then you have to install java.
a) For this download java from Download Free Java Software
b) Get Windows x64 (such as jre-8u92-windows-x64.exe) unless you are using a 32 bit version of Windows in which case you need to get the Windows x86 Offline version.
c) Run the installer.
d) After the installation is complete, close your current Command Prompt if it was already open, reopen it and check that you can successfully run the java -version command.
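The same check can be scripted. The sketch below looks a command up on PATH and returns the first line of its version output; the function name `command_version` is hypothetical. Note that `java -version` writes to standard error, so both streams are inspected.

```python
import shutil
import subprocess
from typing import Optional


def command_version(command: str, flag: str = "-version") -> Optional[str]:
    """Return the first line of the command's version output, or None if not on PATH."""
    path = shutil.which(command)
    if path is None:
        # Equivalent to the "is not recognized" message in Command Prompt.
        return None
    result = subprocess.run([path, flag], capture_output=True, text=True)
    output = (result.stderr or result.stdout).strip()
    return output.splitlines()[0] if output else ""


if __name__ == "__main__":
    version = command_version("java")
    print(version if version is not None else "java was not found on PATH")
```

A result of None means the installer either has not run or its folder is missing from PATH.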
2. Step 2
Python
Python is used by many other software tools. So it is quite possible that a required version (in our case version 2.6 or later) is already available on your computer. To check if Python is available and find its version, open Command Prompt and type…