Organised Preprocessing for Pandas Dataframe

Introduction on Preprocessing 

Preprocessing is an essential step in machine learning. We underestimate preprocessing but in reality, choosing the right preprocessing for our data is equally important as choosing the right model, if not more. Most of the time we go with some random preprocessing and don’t change it much as it requires a lot of changes in the code. To solve this, we need to make preprocessing organized so that it becomes easier to experiment with different preprocessing.

In this article, I will be implementing preprocessing functions for – Cleaning Data, Encoding Data, Normalizing Data, Train-Val-Test Split, and Performing Preprocessing for the Cars93 Dataset using Pandas Dataframe. After the end of this Article, we will be able to preprocess our data in just four lines in an organized and simple manner. (Jump to Summary to get an Overview)

Objectives

To Make Universal Functions

We want our functions to work for all datasets by passing a pandas dataframe as an input to our function instead of manually having to change the code for each dataframe. Not even once, in any of the functions, we will use anything specific to the dataset we have taken as an example.

To Make a Single Function  for Each Preprocessing Task

For each task (cleaning, encoding, normalizing, splitting) we want to make a combined function where we can do the task by just using that single function and playing with the arguments we pass into the function.

To Make Preprocessing as Simple and Flexible as Possible

We want our main function to make it possible to preprocess the data the way we want without having to change anything in the code. We will do this with a smart choice of arguments.

To Make Experimenting with Different Preprocessing Very Easy

Since preprocessing our data will become simple with our functions, we will not feel lazy to try different preprocessing to get our accuracy up. It will save us a lot of time that we would have otherwise spent making changes through the code even for minute changes in preprocessing.

 

 

We will be Implementing the following Preprocessing Functions

I will be explaining them further in the article as we implement them

Cleaning Data

Encoding Data

Normalizing Data

Splitting Data

Link to Dataset – Cars93 Dataset

Importing Data 

import pandas as pd

df = pd.read_csv(path+'Cars93.csv')

df = df[['Model','Manufacturer','Type','Price','AirBags','Cylinders','Horsepower','RPM']]

# We keep a few features only for our purpose

df.head(3)

 

Cleaning Data

We need to clean the data so that we do not face any issues later when we apply a model to our data.

Removing Missing Data

We need to remove data where the value of any feature is nan or na or empty.

def remove_missing(df) : 
  remove = []
  for i, row in df.iterrows():
    if row.isna().values.any() : remove.append(i)
  df.drop(remove,axis=0,inplace=True)

Remove Mismatch Data

We need to remove data with mismatches. For eg. a data point with a string value for a numerical feature. For this, we will check what data type is the majority for each feature and remove the data with a different data type for those features.

We also make provision of ‘exceptions’ where we can specify features for which values can have different data types and we don’t want to remove mismatches.

def remove_mismatch(df,exceptions=[]) : 
  for col in df : 
    if col in exceptions : continue
    df.reset_index(drop=True, inplace=True)
    s = [False]*len(df[col])
    for i,cell in enumerate(df[col]) : 
      try : n = int(cell)
      except : s[i] = True
    t = s.count(True)
    f = s.count(False)
    st = False
    if(t>f) : st = True
    remove = [i for i in range(len(df[col])) if s[i]!=st]
    df.drop(remove,axis=0,inplace=True)

Converting Numeric Data Stored as String to Numerical Form –

Sometimes Numeric Data (eg. int) is stored as a String, leading to an error when we train our model or normalize our data. We need to identify such cases and convert them to their original numerical form.

def str_to_num(df) : 
  for col in df : 
    try : df[col] = pd.to_numeric(df[col])
    except : pass

Single Function for Cleaning Data

def clean(df,exceptions_mismatch=[]) : 
  remove_missing(df) 
  remove_mismatch(df,exceptions=exceptions_mismatch)
  str_to_num(df)
clean(df,exceptions_mismatch=['Model'])

Encoding Data

Label Encoding- Assigning an integer to each unique value of a column/feature.
One Hot Encoding- Converting 1 column to n columns where n is the number of unique values in that column. Each new column represents a unique value in the original column and it contains either 0 or 1. So in each row, only one of the n columns will have the value 1 and the remaining n-1 columns will have the value 0.
We are going to represent the type of encoding we want for each column using a dictionary, where the keys will be the column/feature names and their values will be the type of encoding we want.

labels = {}
labels['AirBags'] = ['None','Driver only','Driver & Passenger']
labels['Type'] = None
labels['Manufacturer'] = []
labels['Model'] = []

This way experimenting with different encoding will become very easy. For eg., if we want to change the encoding of the ‘Type’ column from One Hot to Label, we can do it by simply changing its value in the labels dictionary from None to [].

Label Encoding

The function takes the column name and order as input.

Lets say, df['col'] = ['b','a','b','c']
def encode_label(df,col,order=[]) :
  if(order==[]) : order = list(df[col].unique())
  for i,cell in enumerate(df[col]) : 
    try : 
      df.at[i,col] = order.index(df[col][i])
    except : 
      df.at[i,col] = -1

One Hot Encoding

The function takes the column name as input.

Lets say, df['col'] = ['b','a','b','c']

After One Hot Encoding -
  • df['col_b'] = [1,0,1,0]
  • df['col_a'] = [0,1,0,0]
  • df['col_c'] = [0,0,0,1]
def encode_onehot(df,col) :
  k = {}
  n = df[col].shape[0]
  unique = df[col].unique()
  for unq in unique : k[unq] = [0]*n
  for i in range(n) :
    k[df.at[i,col]][i] = 1
  for unq in unique : df[f"{col}_{unq}"] = k[unq] 
  df.drop(col,axis=1,inplace=True)

Single Function For Encoding Data

def encode(df,cols) : 
  for col in cols.keys() : 
    if(cols[col] is None) : encode_onehot(df,col)
    else : encode_label(df,col,cols[col])
encode(df,labels)

Normalizing Data

# Dividing by largest
def normalize_dbl(df,cols,round=None) : 
  if(type(cols)!=list) : cols = [cols]
  for col in cols : 
    l = df[col].max()
    if round is None : df[col] = df[col].div(l)
    else : df[col] = df[col].div(l).round(round)
# Dividing by constant
def normalize_dbc(df,cols,round=None,c=1) :
  if(type(cols)!=list) : cols = [cols]
  for col in cols : 
    if round is None : df[col] = df[col].div(c)
    else : df[col] = df[col].div(c).round(round)
# Dividing by constant x largest
def normalize_dblc(df,cols,round=None,c=1) :
  if(type(cols)!=list) : cols = [cols]
  for col in cols : 
    l = df[col].max() * c
    if round is None : df[col] = df[col].div(l)
    else : df[col] = df[col].div(l).round(round)
# min-max normalization
def normalize_rescale(df,cols,round=None) :
  if(type(cols)!=list) : cols = [cols]
  for col in cols : 
    df[col] = df[col] - df[col].min()
    l = df[col].max()
    if round is None : df[col] = df[col].div(l)
    else : df[col] = df[col].div(l).round(round)
# mean normalization
def normalize_mean(df,cols,round=None) :
  if(type(cols)!=list) : cols = [cols]
  for col in cols : 
    mean = df[col].mean()
    l = df[col].max() - df[col].min()
    df[col] = df[col] - mean
    if round is None : df[col] = df[col].div(l)
    else : df[col] = df[col].div(l).round(round)


Single Function for Normalizing Data
def normalize(df,cols=None,kinds='dbl',round=None,c=1,exceptions=[]) :
  if(cols is None) : 
    cols = []
    for col in df : 
      if(pd.api.types.is_numeric_dtype(df[col])) : 
        if(max(df[col])>1 or min(df[col])<-1) : 
          if(col not in exceptions) : cols.append(col)
  if(type(cols)!=list) : cols = [cols]
  n = len(cols)
  if(type(kinds)!=list) : kinds = [kinds]*n
  for i,kind in enumerate(kinds) : 
    if(kind=='dbl') : normalize_dbl(df,cols[i],round)
    if(kind=='dbc') : normalize_dbc(df,cols[i],round,c)
    if(kind=='dblc') : normalize_dblc(df,cols[i],round,c)
    if(kind in ['min-max','rescale','scale']) : normalize_rescale(df,cols[i],round)
    if(kind=='mean') : normalize_mean(df,cols[i],round)

We can vastly vary the overall normalizations by easily making changes in the parameters of this function when we call it. This helps in experimenting with different normalizations.

Some examples of various ways in which we can normalize our data using this function –

If we want to normalize all columns (it detects numeric columns) –

normalize(df)

If we want to normalize and round to 3 decimal places –

normalize(df,round=3)

If we want to normalize all columns by a kind other than dividing by largest –

normalize(df,kinds='mean')

If we want to normalize some columns with a kind and some columns with other kind –

normalize(df,['Price','Horsepower'],'dbl')
normalize(df,['AirBags','Cylinders'],'min-max')
normalize(df,['RPM'],'dblc',c=1.25)

or

normalize(df,['Price','AirBags','Cylinders','Horsepower','RPM'],['dbl','min-max','min-max','dbl','dblc'],c=1.25)

If we want to normalize all columns except a few –

normalize(df,kinds='min-max',exceptions=['AirBags','RPM'],round=4)

Splitting Data

We will use sklearn and make a function to split data where we won’t even need to mention if we are splitting our data into 2 or 3 portions.

We also need to reset the index of x_train, x_test, etc., otherwise, we can face problems while iterating over them in the future.

from sklearn.model_selection import train_test_split
x = df.drop(['Price'], axis=1)
y = df.loc[:,'Price']

Way Split 

def train_test(x,y,train_size=-1,test_size=-1) :

if(train_size==-1) : train_size = 1-test_size

x_train,x_test,y_train,y_test = train_test_split(x,y,train_size=train_size,random_state=101)

x_train.reset_index(drop=True,inplace=True)

x_test.reset_index(drop=True,inplace=True)

y_train.reset_index(drop=True,inplace=True)

y_test.reset_index(drop=True,inplace=True)

return x_train,x_test,y_train,y_test

Way Split 

def train_val_test(x,y,train_size=-1,val_size=-1,test_size=-1) :

if(train_size==-1) : train_size = 1-val_size-test_size

if(val_size==-1) : val_size = 1-train_size-test_size

x_train,x_val,y_train,y_val = train_test_split(x,y,train_size=train_size,random_state=101)

x_val,x_test,y_val,y_test = train_test_split(x_val,y_val,train_size=(val_size/(1-train_size)),random_state=101)

x_train.reset_index(drop=True,inplace=True)

x_val.reset_index(drop=True,inplace=True)

x_test.reset_index(drop=True,inplace=True)

y_train.reset_index(drop=True,inplace=True)

y_val.reset_index(drop=True,inplace=True)

y_test.reset_index(drop=True,inplace=True)

return x_train,x_val,x_test,y_train,y_val,y_test

Single Function for Splitting Data

If we pass two sizes in the function (eg. train_size & val_size) then it will be a three-way split, if we pass one size (eg. train_size) it will be a two-way split.

Exit mobile version