The Pandas Series and Memory Consumption

by ITECHNEWS
July 5, 2022
in Data Science, Leading Stories

The pandas Series is a numpy array with the addition of an index for each entry. It is also the building block object for the pandas DataFrame, like below.

[Image: the example DataFrame built in this article, with year, python_version, and pandas_version columns]


Since a Series is constructed from a numpy array, it comes with the added benefit of faster computation, both within a single Series and between multiple ones, thanks to numpy’s vectorized operations. With a base Python list, a loop must be used to operate on each element, checking the index and confirming the data type as it goes, whereas numpy handles the same work in pre-compiled C code for much faster computation.
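As a rough illustration of that difference, here is a minimal sketch (not from the original article) performing the same element-wise operation with a Python loop and with a single vectorized expression on a Series:

import pandas as pd

prices = pd.Series([10.0, 20.0, 30.0, 40.0])

# Pure-Python approach: loop over the values one element at a time
taxed_loop = [p * 1.05 for p in prices]

# Vectorized approach: one expression, evaluated in numpy's compiled code
taxed_vectorized = prices * 1.05

print(taxed_loop)
print(taxed_vectorized)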
Let’s construct the above DataFrame by building a Series for each column and exploring how to change the data type of our values. Although these columns are trivially short, knowing how to maximize the efficiency of your code by choosing the right data type becomes important when dealing with larger amounts of data and performing more complex computations.
We’ll start by creating a Series for the year column. We can do this by initializing a variable named ‘years’ that contains a base Python list filled with the years for the past 8 years. Then we will pass it to the pandas Series constructor. We will also give the Series a name by passing a string to the name= keyword argument. This Series name will also provide pandas with the column name for these values when added to a DataFrame.

import pandas as pd
import numpy as np
years = [2021, 2020, 2019, 2018, 2017, 2016, 2015, 2014]
year_series = pd.Series(data=years, name='year')
year_series

[output]

0    2021
1    2020
2    2019
3    2018
4    2017
5    2016
6    2015
7    2014
Name: year, dtype: int64

As you can see, pandas has automatically generated a numeric index for our Series and decided to use the ‘int64’ data type for our values based on the data we passed through. Let’s use some attributes and methods to explore what this Series is made of.
The .values attribute gives us the list of years we gave to pandas initially, but we see here that it has been turned into a numpy array.

print(type(year_series.values))
year_series.values

[output]

<class 'numpy.ndarray'>
array([2021, 2020, 2019, 2018, 2017, 2016, 2015, 2014])

Now let’s check how much memory our Series is using with the .memory_usage() method, which outputs the amount of space being used in bytes. As a refresher, every byte is 8 bits. Conveniently, since our Series contains 8 values, the total number of bytes used for the array happens to equal the number of bits used per element.
Since our data type is ‘int64’, meaning 64 bits per value, and we have 8 values, we can expect our Series to consume 64 bits x 8 values = 512 bits. Converting that to bytes, the .memory_usage() method should give us an output of 64 bytes.
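That expectation can also be computed directly from numpy’s dtype metadata; here is a small sketch using .itemsize, which reports the size of one element in bytes:

import numpy as np

n_values = 8
bytes_per_value = np.dtype('int64').itemsize  # 8 bytes = 64 bits per element
print(n_values * bytes_per_value, 'bytes expected for the values')  # 64 bytes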

print(year_series.memory_usage(), 'bytes')

[output]

192 bytes

Interesting, it did not. Let’s dig around to find where the other 128 bytes are coming from. The .nbytes attribute will give us the memory consumption of a numpy array. Let’s combine it with the .values and .index attributes on the Series object to check the consumption of both.

print(
f'''
values consumption: {year_series.values.nbytes} bytes
index consumption:  {year_series.index.nbytes} bytes
''')

[output]

values consumption: 64 bytes
index consumption:  128 bytes

The index is storing its values in twice the amount of memory as the actual data! This behavior is even more odd considering that checking the dtype of the index reports ‘int64’.

year_series.index.dtype

[output]

dtype('int64')

This will be the same index behavior when building a DataFrame with the pd.DataFrame() constructor. To reduce the amount of memory used by the index, let’s try creating our own index with the pd.Index() constructor. Then we can recreate our Series with our own custom index by passing it to the index= keyword argument. For even more memory efficiency we can alter the dtype used for the underlying data in our Series. Pandas decided to use ‘int64’, which can store values between -9,223,372,036,854,775,808 and 9,223,372,036,854,775,807. This is far more range than we need. Let’s downgrade the data type of our values to ‘uint16’. The ‘u’ is for unsigned, which works well for values that will never be negative, like a year, and the ‘16’ is for 16 bits, which can store numbers ranging from 0 to 65,535. That is plenty for representing our years.
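Those ranges can be confirmed with numpy’s iinfo helper, as in this quick sketch (not part of the original walkthrough):

import numpy as np

print(np.iinfo(np.int64).min, np.iinfo(np.int64).max)    # -9223372036854775808 9223372036854775807
print(np.iinfo(np.uint16).min, np.iinfo(np.uint16).max)  # 0 65535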

index = pd.Index(list(range(len(years))))
year_series2 = pd.Series(years, index=index, dtype='uint16', name='year')
year_series2

[output]

0    2021
1    2020
2    2019
3    2018
4    2017
5    2016
6    2015
7    2014
Name: year, dtype: uint16

Our custom index, built from the range of our data’s length converted to a list, seems to give us the same index as before. Let’s see if its memory consumption is the same.

print(
f'''
total consumption:  {year_series2.memory_usage()} bytes
values consumption: {year_series2.values.nbytes} bytes
index consumption:  {year_series2.index.nbytes} bytes
''')

[output]

total consumption:  80 bytes
values consumption: 16 bytes
index consumption:  64 bytes

Great, that cut the memory consumption of our index in half from 128 bytes to 64 bytes and of our values from 64 bytes to 16 bytes.
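If a Series already exists, the same downcast can also be applied without rebuilding it by using the .astype() method; a minimal sketch with the Series created above:

# Downcast the original int64 Series to uint16 in one step
year_series_small = year_series.astype('uint16')
print(year_series_small.dtype)                   # uint16
print(year_series_small.values.nbytes, 'bytes')  # 16 bytes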
Now that we can create a Series and alter the data types used, let’s continue building the DataFrame above with a Series holding the Python version for each year in our first Series. We already have a custom index created, so we will just pass that into the new Series. As for the underlying data, since our version numbers are floating point values, the dtype for this Series will default to ‘float64’, just as our Series of years defaulted to ‘int64’. Let’s go ahead and downgrade this to ‘float32’.

python_versions = [3.9, 3.9, 3.8, 3.7, 3.6, 3.6, 3, 3]
python_series = pd.Series(python_versions, index=index, dtype='float32', name='python_version')
python_series

[output]

0    3.9
1    3.9
2    3.8
3    3.7
4    3.6
5    3.6
6    3.0
7    3.0
Name: python_version, dtype: float32

Now let’s check the memory consumption.

print(
f'''
total consumption:  {python_series.memory_usage()} bytes
values consumption: {python_series.values.nbytes} bytes
index consumption:  {python_series.index.nbytes} bytes
''')

[output]

total consumption:  96 bytes
values consumption: 32 bytes
index consumption:  64 bytes

Not bad. This column takes up more memory than the one with our years, but we were able to decrease the underlying data’s memory consumption from 64 bytes to 32 bytes.
Now let’s create our last Series, which contains the versions of pandas for the years in our first Series. Since each year has multiple pandas updates, these are stored as a string representing the range of version updates for that year.

pandas_versions = ['1.2->1.4', '1.0->1.1', '0.24->0.25', '0.23', '0.20->0.22', 
                    '0.18->0.19','0.16->0.17','0.13->0.15']
pandas_series = pd.Series(pandas_versions, index=index, name='pandas_version')
pandas_series

[output]

0      1.2->1.4
1      1.0->1.1
2    0.24->0.25
3          0.23
4    0.20->0.22
5    0.18->0.19
6    0.16->0.17
7    0.13->0.15
Name: pandas_version, dtype: object

We can see a string is represented in the Series as the numpy ‘object’ dtype.
Now for the memory consumption.

print(
f'''
total consumption:  {pandas_series.memory_usage()} bytes
values consumption: {pandas_series.values.nbytes} bytes
index consumption:  {pandas_series.index.nbytes} bytes
''')

[output]

total consumption:  128 bytes
values consumption: 64 bytes
index consumption:  64 bytes

We can see the ‘object’ dtype totals 64 bytes for the column, or 8 bytes (64 bits) per element. Unlike with numeric values, there is no smaller ‘object’ dtype to downgrade to.
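Keep in mind that for the ‘object’ dtype these 8 bytes per element are only the pointers to the Python string objects; the strings themselves live outside the array. Passing deep=True to .memory_usage() includes that additional cost, as in this short sketch:

# deep=True also counts the memory of the string objects the pointers reference
print(pandas_series.memory_usage(), 'bytes (shallow)')
print(pandas_series.memory_usage(deep=True), 'bytes (deep)')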
Now let’s finally put these in a DataFrame with the pd.concat() method.

df = pd.concat([year_series2, python_series, pandas_series], axis = 1)
df

[output]

   year  python_version pandas_version
0  2021             3.9       1.2->1.4
1  2020             3.9       1.0->1.1
2  2019             3.8     0.24->0.25
3  2018             3.7           0.23
4  2017             3.6     0.20->0.22
5  2016             3.6     0.18->0.19
6  2015             3.0     0.16->0.17
7  2014             3.0     0.13->0.15

Memory consumption of this DataFrame can be checked with the same .memory_usage() method we used on the Series.

print(df.memory_usage())
print()
print(f'total memory consumption: {df.memory_usage().sum()} bytes')

[output]

Index             64
year              16
python_version    32
pandas_version    64
dtype: int64

total memory consumption: 176 bytes

For comparison, I created this same DataFrame by passing a dictionary with the values to the pd.DataFrame() constructor method and checked the memory consumption.

data = {'year': years, 'python_versions': python_versions, 'pandas_versions': pandas_versions}
df2 = pd.DataFrame(data=data)
df2

[output]

   year  python_versions pandas_versions
0  2021              3.9        1.2->1.4
1  2020              3.9        1.0->1.1
2  2019              3.8      0.24->0.25
3  2018              3.7            0.23
4  2017              3.6      0.20->0.22
5  2016              3.6      0.18->0.19
6  2015              3.0      0.16->0.17
7  2014              3.0      0.13->0.15

Memory consumption for this DataFrame.

print(df2.memory_usage())
print()
print(f'total memory consumption: {df2.memory_usage().sum()} bytes')

[output]

Index              128
year                64
python_versions     64
pandas_versions     64
dtype: int64

total memory consumption: 320 bytes

Our DataFrame made from customized Series added up to 176 bytes, while our DataFrame made from the pd.DataFrame() constructor with no customizing came in at 320 bytes. That’s a 45% reduction in memory consumption!
As you can see, changing the data types used in your pandas DataFrames can drastically decrease the amount of memory used. For larger datasets this can pay off in both file size and computation time.
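The same idea can be applied after the fact to a DataFrame built the default way by passing a dtype mapping to .astype(); a sketch using the df2 built above:

# Downcast the numeric columns of the naively-built DataFrame
df2_small = df2.astype({'year': 'uint16', 'python_versions': 'float32'})
print(df2_small.memory_usage())
print(f'total memory consumption: {df2_small.memory_usage().sum()} bytes')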

Source: manifoldmindaway