You’ve probably done your online searches on “Feature Selection”, and you’ve probably found tons of articles describing the three umbrella terms that group selection methodologies, i.e., “Filter Methods”, “Wrapper Methods” and “Embedded Methods”.
Under the “Filter Methods”, we find statistical tests that select features based on their distributions. These methods are computationally very fast, but in practice they do not necessarily select the most predictive features for our models. In addition, when we have big datasets, the p-values of statistical tests tend to be very small, flagging as significant tiny differences in distributions that may not be really important.
The “Wrapper Methods” category includes greedy search algorithms that evaluate many feature combinations through forward selection, backward elimination, or exhaustive search. For each feature combination, these methods train a machine learning model, usually with cross-validation, and determine its performance. Thus, wrapper methods are computationally very expensive, and often impossible to carry out.
The “Embedded Methods,” on the other hand, train a single machine learning model and select features based on the feature importance returned by that model. They tend to work very well in practice and are faster to compute. On the downside, we can’t derive feature importance values from every machine learning model; for example, nearest neighbours do not return feature importances. In addition, collinearity affects the coefficient values returned by linear models and the importance values returned by decision-tree-based algorithms, which may mask the real importance of some features. Finally, decision-tree-based algorithms may not perform well in very big feature spaces, and thus the importance values might be unreliable.
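To make these umbrella terms concrete, here is a minimal sketch of one representative of each family using Scikit-learn (the dataset, estimators and numbers of features to select are purely illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import (
    SelectFromModel, SelectKBest, SequentialFeatureSelector, f_classif
)

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# filter: keep the 10 features with the highest ANOVA F-statistic
filter_selector = SelectKBest(score_func=f_classif, k=10).fit(X, y)

# wrapper: greedy forward selection with cross-validation (computationally expensive)
wrapper_selector = SequentialFeatureSelector(
    RandomForestClassifier(n_estimators=10, random_state=0),
    n_features_to_select=10,
    direction="forward",
    cv=3,
).fit(X, y)

# embedded: keep features whose random forest importance is above the mean importance
embedded_selector = SelectFromModel(RandomForestClassifier(random_state=0)).fit(X, y)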
Filter Methods are fast but tend not to select the best features for our models; Wrapper Methods are computationally expensive and often impossible to carry out; and Embedded Methods are not suitable for every scenario or every machine learning model. What do we do then? How else can we select predictive features?
Fortunately, there are more ways to select features for supervised learning. And I will cover three of them in detail throughout this blog post. For more feature selection methods, check out the online course Feature Selection for Machine Learning.
Alternative feature selection methods
In this article, I will describe three algorithms that select features based on their impact on model performance. They are often referred to as “Hybrid Methods” because they share characteristics of Wrapper and Embedded Methods. Some of them rely on training more than one machine learning model, a bit like wrapper methods; others derive the selection from a measure of feature importance, like embedded methods.
But nomenclature aside, these methods have been successfully used in the industry or in data science competitions, and provide additional ways of finding the most predictive features for a certain machine learning model.
Throughout the article, I will lay out the logic and procedure of some of these feature selection methods and show how we can implement them in Python using the open source library Feature-engine. Let’s get started.
We will discuss selection by:
- Feature shuffling
- Feature performance
- Target mean performance
Feature shuffling
Feature shuffling, or permutation feature importance, consists of assigning importance to a feature based on the decrease in a model performance score when the values of that single feature are randomly shuffled. Shuffling the order of the feature values (across the rows of the dataset) alters the original relationship between the feature and the target, so the drop in the model performance score is indicative of how much the model depends on that feature.
The procedure works as follows:
- It trains a machine learning model and determines its performance.
- It shuffles the order of the values of 1 feature.
- It makes predictions with the model trained in step 1, and determines the performance.
- If the performance drop is bigger than a threshold, it keeps the feature; otherwise, it removes it.
- It repeats from step 2 until all features are examined.
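To make these steps concrete, here is a rough, simplified sketch of that loop (illustrative estimator, variable names and threshold; the Feature-engine implementation shown later adds cross-validation and sensible defaults):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# X is a DataFrame of features and y the target (illustrative)
model = LinearRegression().fit(X, y)              # step 1: train the model
baseline = r2_score(y, model.predict(X))          # ... and determine its performance

rng = np.random.default_rng(0)
selected = []
for feature in X.columns:
    X_shuffled = X.copy()
    X_shuffled[feature] = rng.permutation(X_shuffled[feature].values)  # step 2: shuffle one feature
    drop = baseline - r2_score(y, model.predict(X_shuffled))           # step 3: predict and re-score
    if drop > 0.01:                                                    # step 4: arbitrary threshold
        selected.append(feature)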
Selection by shuffling features has several advantages. First, we need to train only one machine learning model. The importance is subsequently assigned by shuffling the feature values and making predictions with that model. Second, we can select features for any supervised machine learning model of our choice. Third, we can implement this selection procedure with open-source software, as we will see in the coming paragraphs.
Pros:
- It only trains one machine learning model, so it is quick.
- It is suitable for any supervised machine learning model.
- It is available in Feature-engine, a Python open source library.
On the downside, if two features are correlated, when one of the features is shuffled, the model will still have access to the information through its correlated variable. This may result in a lower importance value for both features, even though they might actually be important. In addition, to select features, we need to define an arbitrary importance threshold below which features will be removed. With higher threshold values, fewer features will be selected. Finally, shuffling features introduces an element of randomness, so for features with borderline importance, that is, importance values close to the threshold, different runs of the algorithm may return different subsets of features.
Considerations:
- Correlations may affect the interpretation of the feature’s importance.
- The user needs to define an arbitrary threshold.
- The element of randomness makes the selection procedure non-deterministic.
With this in mind, selecting features by feature shuffling is a good feature selection method that focuses on highlighting those variables that directly affect the model performance. We can manually derive the permutation importance with Scikit-learn, and then select those variables that show an importance above a certain threshold. Or we can automate the entire procedure with Feature-engine.
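For reference, the manual Scikit-learn route might look roughly like this (a sketch with illustrative names and threshold; ideally the importances would be computed on a held-out set):

from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

# X is a DataFrame of features and y the target (illustrative)
model = LinearRegression().fit(X, y)
result = permutation_importance(model, X, y, scoring="r2", n_repeats=5, random_state=0)

# keep features whose mean performance drop exceeds a chosen threshold
selected = X.columns[result.importances_mean > 0.01]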
Python Implementation
Let’s see how to carry out selection by feature shuffling with Feature-engine. We will use the diabetes dataset that comes with Scikit-learn. First, we load the data:
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from feature_engine.selection import SelectByShuffling

# load dataset
diabetes_X, diabetes_y = load_diabetes(return_X_y=True)
X = pd.DataFrame(diabetes_X)
y = pd.DataFrame(diabetes_y)
We set up the machine learning model we are interested in:
# initialize linear regression estimator
linear_model = LinearRegression()
We will select features based on the drop in r2, using 3-fold cross-validation:
# initialize the feature selector
tr = SelectByShuffling(estimator=linear_model, scoring="r2", cv=3)
With the method fit(), the transformer finds the important variables: those that cause a drop in r2 when shuffled. By default, features will be selected if the performance drop is bigger than the mean drop caused by all features.
# fit transformer
tr.fit(X, y)
With the method transform() we drop the unselected features from the dataset:
Xt = tr.transform(X)
We can inspect the individual feature’s importance through one of the transformer’s attributes:
tr.performance_drifts_

{0: -0.02368121940502793,
 1: 0.017909161264480666,
 2: 0.18565460365508413,
 3: 0.07655405817715671,
 4: 0.4327180164470878,
 5: 0.16394693824418372,
 6: -0.012876023845921625,
 7: 0.01048781540981647,
 8: 0.3921465005640224,
 9: -0.01427065640301245}
We can access the features that will be removed through another attribute:
tr.features_to_drop_

[0, 1, 3, 6, 7, 9]
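If we want to double-check the default behaviour described earlier, we can reproduce the selection from the drifts dictionary ourselves (a quick sanity check, not part of Feature-engine’s API):

import numpy as np

# by default, the threshold is the mean performance drift across all features
mean_drift = np.mean(list(tr.performance_drifts_.values()))

# the dropped features are exactly those whose drift falls below that mean
[f for f, drift in tr.performance_drifts_.items() if drift < mean_drift]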
That’s it, simple. We have a reduced dataframe in Xt.
Feature performance
A direct way of determining the importance of a feature is to train a machine learning model using solely that feature. In this case, the “importance” of the feature is given by the performance score of the model. In other words, how well a model trained on a single feature predicts the target. Poor performance metrics speak of weak or non-predictive features.
The procedure works as follows:
- It trains a machine learning model for each feature.
- For each model, it makes predictions and determines model performance.
- It selects features with performance metrics above a threshold.
In this selection procedure, we train one machine learning model per feature. The model uses an individual feature to predict the target variable. Then, we determine the model performance, usually with cross-validation, and select features whose performance falls above a certain threshold.
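A minimal sketch of this loop with Scikit-learn could look as follows (illustrative estimator, threshold and variable names; Feature-engine automates this below):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# X is a DataFrame of features and y the target (illustrative)
performance = {
    feature: cross_val_score(
        LinearRegression(), X[[feature]], y, cv=3, scoring="r2"
    ).mean()
    for feature in X.columns
}

# keep the features whose single-feature r2 exceeds an arbitrary threshold
selected = [feature for feature, score in performance.items() if score > 0.01]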
On the one hand, this method is computationally costly, because we train as many models as there are features in our dataset. On the other hand, models trained on a single feature tend to train fairly quickly.
With this method, we can select features for any model that we want, because the importance is given by the performance metric. On the downside, we need to provide an arbitrary threshold for the feature selection. With higher threshold values, we select smaller feature groups. Some threshold values can be fairly intuitive. For example, if the performance metric is the roc-auc, we can select features whose performance is above 0.5. For other metrics, like accuracy, what determines a good value is not so clear.
Pros:
- It is suitable for any supervised machine learning model.
- It explores features individually, thus avoiding correlation issues.
- It is available in Feature-engine, a Python open source project.
Considerations:
- Training one model per feature can be computationally costly.
- The user needs to define an arbitrary threshold.
- It does not pick up feature interactions.
We can implement selection by single feature performance utilizing Feature-engine.
Python Implementation
Let’s load the diabetes dataset from Scikit-learn:
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from feature_engine.selection import SelectBySingleFeaturePerformance

# load dataset
diabetes_X, diabetes_y = load_diabetes(return_X_y=True)
X = pd.DataFrame(diabetes_X)
y = pd.DataFrame(diabetes_y)
We want to select features whose r2 > 0.01, using a linear regression and 3-fold cross-validation.
# initialize the feature selector
sel = SelectBySingleFeaturePerformance(
    estimator=LinearRegression(),
    scoring="r2",
    cv=3,
    threshold=0.01,
)
With the method fit(), the transformer fits one model per feature, determines each model’s performance, and selects the important features.
# fit transformer
sel.fit(X, y)
We can explore the features that will be dropped:
sel.features_to_drop_

[1]
We can also examine each individual feature’s performance:
sel.feature_performance_

{0: 0.029231969375784466,
 1: -0.003738551760264386,
 2: 0.336620809987693,
 3: 0.19219056680145055,
 4: 0.037115559827549806,
 5: 0.017854228256932614,
 6: 0.15153886177526896,
 7: 0.17721609966501747,
 8: 0.3149462084418813,
 9: 0.13876602125792703}
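Consistent with the 0.01 threshold, only feature 1 shows a single-feature r2 below it, which we can confirm with a quick check:

# features whose single-feature r2 falls below the 0.01 threshold
[f for f, perf in sel.feature_performance_.items() if perf < 0.01]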
With the method transform() we remove the features from the dataset:
# drop variables
Xt = sel.transform(X)
And that’s it. Now we have a reduced dataset.
Target mean performance
The selection procedure that I will discuss now was introduced in the KDD 2009 data science competition by Miller and co-workers. The authors do not attribute any name to the technique, but since it uses the mean target value per group of observations as a proxy for predictions, I like to call this technique “Selection by Target Mean Performance.”
This selection methodology also assigns an “importance” value to each feature, derived from a performance metric. Interestingly, the procedure does not train any machine learning models. Instead, it uses a much simpler proxy as the prediction.
In a nutshell, the procedure uses the mean target value per category or per interval (if the variable is continuous) as a proxy for prediction. With this prediction, it derives a performance metric, like r2, accuracy, or any other metric that assesses a prediction against the truth.
How does this procedure exactly work?
For categorical variables:
- It splits the dataframe into a training and a testing set.
- For every categorical feature, it determines the mean target value per category (using the train set).
- It replaces the categories with the corresponding target mean values in the test set.
- It determines a performance metric using the encoded features and the target (on the test set).
- It selects features whose performance is above a threshold.
In other words, for categorical variables, the mean target value is determined for each category using the train set. Then, the categories in the test set are replaced by the learned values, and these encoded values are used to determine the performance metric.
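A rough sketch of this idea for a single categorical feature might look like this (hypothetical column names; train and test are DataFrames from an earlier split):

from sklearn.metrics import roc_auc_score

# mean target per category, learned on the train set only
category_means = train.groupby("cabin")["target"].mean()

# the encoded category acts as the "prediction" for the test set
# (categories unseen in the train set would produce NaN here)
test_pred = test["cabin"].map(category_means)

print(roc_auc_score(test["target"], test_pred))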
For continuous variables, the procedure is fairly similar:
- It splits the dataframe into a training and a testing set.
- For every continuous feature, it sorts the values into discrete intervals, finding the interval limits using the train set.
- It determines the mean target value per interval (using the train set).
- It sorts the variables in the test set into the intervals identified in step 2.
- It replaces the intervals in the test set with the corresponding target mean values.
- It determines a performance metric between the encoded feature and the target (on the test set).
- It selects features whose performance is above a threshold.
For continuous variables, the authors first separated the observations into bins, a process otherwise called discretization. They used 1% quantiles. Then they determined the mean value of the target in each bin using the training set and evaluated the performance after replacing the bin values with the target mean in the test set.
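Here is a similar sketch for a continuous feature, using a handful of equal-frequency bins instead of the authors’ 1% quantiles (hypothetical column names; no missing values assumed):

import pandas as pd
from sklearn.metrics import roc_auc_score

# equal-frequency interval limits learned on the train set (5 bins for illustration)
train_bin, edges = pd.qcut(train["age"], q=5, labels=False, retbins=True, duplicates="drop")
edges[0], edges[-1] = float("-inf"), float("inf")   # so extreme test values still fall in a bin

# mean target per interval, learned on the train set only
bin_means = train.groupby(train_bin)["target"].mean()

# sort the test set values into the same intervals and map the learned means as the "prediction"
test_bin = pd.cut(test["age"], bins=edges, labels=False)
test_pred = test_bin.map(bin_means)

# performance of the encoded feature against the true target
print(roc_auc_score(test["target"], test_pred))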
This feature selection technique is very simple; it involves taking the mean of the responses for each level (category or interval), and comparing these values to the target values to obtain a performance metric. Despite its simplicity, it has a number of advantages.
First, it does not involve training a machine learning model, so it is incredibly fast to compute. Second, it captures non-linear relationships with the target. Third, it is suitable for categorical variables, unlike the great majority of the existing selection algorithms. It is robust to outliers as these values will be allocated to one of the extreme bins. According to the authors, it offers comparable performance between categorical and numerical variables. And, it is model-agnostic. The features selected by this procedure should, in theory, be suitable for any machine learning model.
Pros:
- It is fast because no machine learning model is trained.
- It is suitable for categorical and numerical variables alike.
- It is robust to outliers.
- It captures non-linear relationships between features and the target.
- It is model-agnostic.
This selection method also presents some limitations. First, for continuous variables, the user needs to define an arbitrary number of intervals in which the values will be sorted. This poses a problem for skewed variables, where most of the values may fall into just one bin. Second, categorical variables with infrequent labels may lead to unreliable results as there are few observations for those categories. Therefore, the mean target value per category will be unreliable. In extreme cases, if a category was not present in the training set, we would not have a mean target value to use as a proxy to determine performance.
Considerations:
- It needs tuning of interval numbers for skewed variables.
- Rare categories will offer unreliable performance proxies or make the method impossible to compute.
With these considerations in mind, we can select variables based on the target mean performance with Feature-engine.
Python Implementation
We will use this method to select variables from the Titanic dataset, which has a mix of numerical and categorical variables. When loading the data, I will do some preprocessing to facilitate the demonstration and then separate it into train and test.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from feature_engine.selection import SelectByTargetMeanPerformance

# load data
data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')

# extract cabin letter
data['cabin'] = data['cabin'].str[0]

# replace infrequent cabins by N
data['cabin'] = np.where(data['cabin'].isin(['T', 'G']), 'N', data['cabin'])

# cap maximum values
data['parch'] = np.where(data['parch'] > 3, 3, data['parch'])
data['sibsp'] = np.where(data['sibsp'] > 3, 3, data['sibsp'])

# cast variables as object to treat them as categorical
data[['pclass', 'sibsp', 'parch']] = data[['pclass', 'sibsp', 'parch']].astype('O')

# separate train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived'], axis=1),
    data['survived'],
    test_size=0.3,
    random_state=0,
)
We will select features based on the roc-auc, using 2-fold cross-validation. The first thing to note is that Feature-engine allows us to use cross-validation, which is an improvement with respect to the original method described by the authors.
Feature-engine also allows us to decide how we will determine the intervals for numerical variables. We can choose equal frequency or equal width intervals. The authors used 1% quantiles, which is suitable for continuous variables with a fair spread of values, but not often suitable for skewed variables. In this demo, we will separate numerical variables into equal frequency intervals.
Finally, we want to select features for which the roc-auc is greater than 0.6.
# Feature-engine automates the selection of
# categorical and numerical variables
sel = SelectByTargetMeanPerformance(
    variables=None,
    scoring="roc_auc_score",
    threshold=0.6,
    bins=3,
    strategy="equal_frequency",
    cv=2,  # cross-validation
    random_state=1,  # seed for reproducibility
)
With the method fit() the transformer:
- replaces categories by the target mean
- sorts numerical variables into equal frequency bins
- replaces bins by the target mean
- determines the roc-auc using the target-mean encoded variables
- selects features whose roc-auc > 0.6
# find important features
sel.fit(X_train, y_train)
We can explore the ROC-AUC for each feature:
sel.feature_performance_

{'pclass': 0.6802934787230475,
 'sex': 0.7491365252482871,
 'age': 0.5345141148737766,
 'sibsp': 0.5720480307315783,
 'parch': 0.5243557188989476,
 'fare': 0.6600883312700917,
 'cabin': 0.6379782658154696,
 'embarked': 0.5672382248783936}
We can find the features that will be dropped from the data:
sel.features_to_drop_

['age', 'sibsp', 'parch', 'embarked']
With the method transform() we drop the features from the data sets:
# remove features
X_train = sel.transform(X_train)
X_test = sel.transform(X_test)
Simple. Now we have reduced versions of the train and test sets.