Supercharge Your Pandas Code with Apache Spark

by ITECHNEWS
March 2, 2022
in Data Science, Leading Stories

Pandas is a fast and powerful open-source data analysis and manipulation framework written in Python. Apache Spark is an open-source unified analytics engine for distributed large-scale data processing. Both are widely adopted in the data engineering and data science communities.

Even though there’s great value in combining them in terms of productivity, scalability, and performance, this combination is often overlooked. In this blog post, we’ll give a sneak peek into combining these tools to enjoy the best of both worlds!

For the purpose of this example, we’ve used a 1.9GB CSV file of fire department call data, obtained from https://data.sfgov.org/Public-Safety/Fire-Department-Calls-for-Service/nuek-vuh3 as of 11/11/2019.

First, let’s try to calculate the total number of calls per zip code, using Pandas:

import pandas as pd
import time
# Record the start time
start = time.time()
# Read the CSV file with the header
pandasDF = pd.read_csv('/dbfs/databricks-datasets/timeseries/Fires/Fire_Department_Calls_for_Service.csv', header=0)
# Compute the total number of calls per zip code 
pandasDF.groupby('Zipcode of Incident')['Call Number'].count()
# Record the end time
end = time.time()
print(f'Command took {end - start:.2f} seconds')

In our run, this took roughly 40 seconds on an i3.xlarge machine (with 30.5GB RAM and 4 cores). Keep in mind that this is a small dataset, chosen for example purposes.
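To make it clearer what the groupby above computes, here is a minimal sketch on a tiny in-memory DataFrame. The column names match the dataset, but the rows and the `toy` and `calls_per_zip` names are made up for illustration. Note that `count()` counts non-null `Call Number` values per group, and that rows with a missing zip code are dropped by default.

```python
import pandas as pd

# Hypothetical miniature of the fire-calls data; the values are invented.
toy = pd.DataFrame({
    "Call Number": [1001, 1002, 1003, 1004, 1005],
    "Zipcode of Incident": ["94102", "94102", "94103", None, "94103"],
})

# Same expression as in the example above: calls per zip code.
# The row with a missing zip code is excluded (groupby uses dropna=True by default).
calls_per_zip = toy.groupby("Zipcode of Incident")["Call Number"].count()
print(calls_per_zip)
```

If you want to keep the calls without a zip code as their own group, pandas accepts `dropna=False` in `groupby`.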

Can we improve on that? With the Pandas API on Spark, we can!

Apache Spark is a distributed processing engine, which will allow us to easily parallelize the computation.

The Pandas API on Spark is a drop-in replacement that is compatible with the Pandas API and gives Pandas users the benefits of Spark with minimal code changes.

It is also useful for PySpark users, because it supports tasks that are easier to accomplish with Pandas, such as plotting an Apache Spark DataFrame.

Let’s try the same example, this time using the Pandas API on Spark:

import pyspark.pandas as ps
import time
# Record the start time
start = time.time()
# Read the CSV file with the header
pysparkDF = ps.read_csv('dbfs:/databricks-datasets/timeseries/Fires/Fire_Department_Calls_for_Service.csv', header=0)
# Compute the total number of calls per zip code 
pysparkDF.groupby('Zipcode of Incident')['Call Number'].count()
# Record the end time
end = time.time()
print(f'Command took {end - start:.2f} seconds')

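Notice that, apart from the file path scheme (dbfs:/ for Spark instead of the local /dbfs/ mount), the only change we had to make was replacing import pandas as pd with import pyspark.pandas as ps.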


This time, it took only about 7 seconds, which can be attributed to the fact that it is executed in a distributed manner (as opposed to Pandas). In this example, we used the same i3.xlarge machine (with 30.5GB RAM and 4 cores) as the cluster driver, and 4 i3.xlarge machines as the cluster workers.

Essentially, Spark divided the 1.9GB file into smaller chunks (which are called “partitions”), and all partitions were processed concurrently across all machines in the cluster.
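To build some intuition for this, here is a toy sketch in plain Python. It is not how Spark is implemented: the function names are hypothetical, the rows are invented, and local threads merely stand in for the cluster's machines. It only illustrates the general idea of splitting the data into partitions, counting within each partition independently, and then merging the partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, num_partitions):
    """Split rows into roughly equal contiguous chunks ("partitions")."""
    size = max(1, -(-len(rows) // num_partitions))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def count_partition(partition):
    """Count calls per zip code within a single partition."""
    return Counter(zipcode for zipcode, _call_number in partition)


def count_calls_per_zip(rows, num_partitions=4):
    """Count each partition concurrently, then merge the partial counts."""
    partitions = split_into_partitions(rows, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_counts = list(pool.map(count_partition, partitions))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


# Invented (zip code, call number) rows, for illustration only.
rows = [
    ("94102", 1001), ("94103", 1002), ("94102", 1003),
    ("94110", 1004), ("94103", 1005), ("94102", 1006),
]
totals = count_calls_per_zip(rows, num_partitions=3)
print(dict(totals))
```

The merge step works because per-partition counts can simply be added together, which is what makes this kind of aggregation easy to distribute.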

That means Spark was able to run 16 tasks concurrently (4 workers with 4 cores each), as you can see below:

[Screenshot: using Apache Spark]
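The figure of 16 can be checked with some back-of-the-envelope arithmetic. The sketch below assumes one task per core and a partition size of 128 MB (the default value of spark.sql.files.maxPartitionBytes). The function names are hypothetical, and the real number of partitions depends on your Spark configuration, so treat this as a rough estimate only.

```python
import math


def concurrent_task_slots(num_workers, cores_per_worker):
    """Tasks that can run at once, assuming one task per core."""
    return num_workers * cores_per_worker


def estimate_partitions(file_size_mb, partition_size_mb=128):
    """Rough partition count for a file, assuming fixed-size partitions."""
    return math.ceil(file_size_mb / partition_size_mb)


# The cluster from this post: 4 i3.xlarge workers with 4 cores each.
slots = concurrent_task_slots(num_workers=4, cores_per_worker=4)

# The 1.9GB CSV file, taking 1 GB = 1024 MB.
partitions = estimate_partitions(1.9 * 1024)

# How many rounds of tasks are needed to process every partition.
waves = math.ceil(partitions / slots)

print(f"{slots} slots, about {partitions} partitions, {waves} wave(s) of tasks")
```

Under these assumptions, the whole file fits in a single wave of tasks, which is consistent with the speedup we observed.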

It’s important to note that the larger the dataset, the greater the performance improvement you can expect (e.g., think about what would have happened if we had chosen a 1900GB dataset rather than a 1.9GB one).

Using the Pandas API on Spark is just one of the options available to Python developers to easily use Spark and enjoy the performance, scalability, and stability benefits it provides.
