Pandas is a fast and powerful open-source data analysis and manipulation framework written in Python. Apache Spark is an open-source unified analytics engine for distributed large-scale data processing. Both are widely adopted in the data engineering and data science communities.
Even though combining them offers great value in terms of productivity, scalability, and performance, that combination is often overlooked. In this blog post, we’ll give a sneak peek into combining these tools to get the best of both worlds!
For this example, we used a 1.9GB CSV file of fire department call data, obtained from https://data.sfgov.org/Public-Safety/Fire-Department-Calls-for-Service/nuek-vuh3 as of 11/11/2019.
First, let’s try to calculate the total number of calls per zip code, using Pandas:
import pandas as pd
import time

# Record the start time
start = time.time()

# Read the CSV file with the header
pandasDF = pd.read_csv('/dbfs/databricks-datasets/timeseries/Fires/Fire_Department_Calls_for_Service.csv', header=0)

# Compute the total number of calls per zip code
pandasDF.groupby('Zipcode of Incident')['Call Number'].count()

# Record the end time
end = time.time()
print('Command took ', end - start, ' seconds')
This took roughly 40 seconds on an i3.xlarge machine (with 30.5GB RAM and 4 cores). Keep in mind this is a small dataset, used here for illustration purposes.
Can we improve it? With Pandas API on Spark – we can!
Apache Spark is a distributed processing engine, which will allow us to easily parallelize the computation.
Pandas API on Spark is a drop-in replacement compatible with the Pandas API, which gives Pandas users the benefits of Spark with minimal code changes.
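For instance, here is a minimal sketch (with hypothetical toy data, not the fire calls CSV, and assuming a Spark-enabled environment such as a Databricks notebook) of moving data between plain Pandas and Pandas API on Spark:

import pandas as pd
import pyspark.pandas as ps

# Hypothetical toy data, just to illustrate the round trip
pdf = pd.DataFrame({'zip': ['94102', '94103', '94110'], 'calls': [120, 95, 140]})

# Distribute the existing Pandas DataFrame across the cluster
psdf = ps.from_pandas(pdf)

# The familiar Pandas-style API now runs on Spark under the hood
total_calls = psdf.groupby('zip')['calls'].sum()

# Collect the (small!) result back to plain Pandas on the driver when needed
print(total_calls.to_pandas())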
It is also useful for PySpark users, as it supports tasks that are easier to accomplish with Pandas, such as plotting an Apache Spark DataFrame.
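As a rough sketch of what that looks like (assuming Spark 3.2+, where DataFrame.pandas_api() is available, and a running SparkSession named spark, as in a Databricks notebook; the data below is hypothetical):

# A regular PySpark DataFrame with hypothetical data
sdf = spark.createDataFrame(
    [('94102', 120), ('94103', 95), ('94110', 140)],
    ['zip', 'calls'],
)

# pandas_api() converts it to a pandas-on-Spark DataFrame, which exposes
# the Pandas-style plot accessor (plotly backend by default)
psdf = sdf.pandas_api()
psdf.plot.bar(x='zip', y='calls')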
Let’s try the same example, but this time – using Pandas API on Spark:
import pyspark.pandas as ps
import time

# Record the start time
start = time.time()

# Read the CSV file with the header
pysparkDF = ps.read_csv('dbfs:/databricks-datasets/timeseries/Fires/Fire_Department_Calls_for_Service.csv', header=0)

# Compute the total number of calls per zip code
pysparkDF.groupby('Zipcode of Incident')['Call Number'].count()

# Record the end time
end = time.time()
print('Command took ', end - start, ' seconds')
Notice that the only code change was swapping import pandas as pd for import pyspark.pandas as ps (and pointing read_csv at the dbfs:/ URI instead of the local /dbfs mount).
This time, it took only about 7 seconds, which can be attributed to the fact that the computation was executed in a distributed manner (as opposed to Pandas, which runs on a single machine). In this example, we used the same i3.xlarge machine (with 30.5GB RAM and 4 cores) as the cluster driver, and 4 i3.xlarge machines as the cluster workers.
Essentially, Spark divided the 1.9GB file into smaller chunks (which are called “partitions”), and all partitions were processed concurrently across all machines in the cluster.
That means Spark was able to run 16 tasks concurrently (4 workers × 4 cores each).
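If you want to check how many partitions Spark created for the file, here is a small sketch continuing from the pysparkDF above; to_spark() converts the pandas-on-Spark DataFrame back to a regular PySpark DataFrame:

# Each partition becomes a task that can be scheduled on an available core
print(pysparkDF.to_spark().rdd.getNumPartitions())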
It’s important to note that the larger the dataset, the greater the performance improvement you can expect (e.g., think about what would have happened if we had chosen a 1900GB dataset rather than a 1.9GB one…).
Using the Pandas API on Spark is just one of the options available to Python developers to easily use Spark and enjoy the performance, scalability, and stability benefits it provides.
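For comparison, here is a sketch of the same aggregation written directly against the PySpark DataFrame API, another of those options (again assuming a running SparkSession named spark, as on Databricks):

from pyspark.sql import functions as F

# Read the same CSV with the regular PySpark reader
sparkDF = spark.read.csv(
    'dbfs:/databricks-datasets/timeseries/Fires/Fire_Department_Calls_for_Service.csv',
    header=True,
)

# Total number of calls per zip code, expressed with the DataFrame API
sparkDF.groupBy('Zipcode of Incident') \
    .agg(F.count('Call Number').alias('call_count')) \
    .show()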