6 Predictive Models Every Beginner Data Scientist Should Master

As you fall into the hype vortex of Machine Learning and Artificial Intelligence, it seems that only advanced techniques will solve all your problems when you want to build a predictive model. But, as you get your hands dirty in the code, you find out that the truth is very, very different. A lot of the problems you will face as a data scientist are solved with a combination of several models and most of them have been around for ages.

And, even if you solve problems using more advanced models, learning the fundamentals will give you a head start in most discussions. In particular, learning the benefits and shortcomings of simpler models will help you steer a data science project toward success. The truth is that advanced models do one of two things: amplify or amend some of the flaws of the simpler models they are based on.

That being said, let’s jump into the DS world and get to know 6 models that you should learn and master if you want to be a Data Scientist.

Linear Regression

One of the oldest models around (for example, Francis Galton used the term “Regression” in the 19th century) and still one of the most effective for representing linear relationships in data.

Studying linear regression is a staple of econometrics classes all around the world. Learning this linear model will give you a good intuition for solving regression problems (one of the most common problems to solve with ML) and will also help you understand how you can fit a simple line to predict phenomena using math.

There are also other benefits to learning Linear Regression, particularly when you learn both of the methods available to find the best-fitting weights:

Closed form solution, an almost magical formula that gives you the weights of the variables with a simple linear algebra equation.
Gradient Descent, an optimization method that iteratively moves toward the optimal weights and that is also used to optimize many other types of algorithms.
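
To make the two methods concrete, here is a minimal sketch for the single-feature case, using only the Python standard library. The toy data (generated from y = 2x + 1), the function names, the learning rate and the number of steps are all assumptions chosen for illustration, not part of any particular library.

```python
# Toy data generated from y = 2x + 1 (made-up values for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def fit_closed_form(xs, ys):
    """Least-squares slope and intercept for one feature, computed directly."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def fit_gradient_descent(xs, ys, learning_rate=0.05, n_steps=5000):
    """Minimize the mean squared error by repeatedly stepping against its gradient."""
    n = len(xs)
    slope, intercept = 0.0, 0.0
    for _ in range(n_steps):
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        grad_slope = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_intercept = 2 / n * sum(errors)
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


closed_slope, closed_intercept = fit_closed_form(xs, ys)
gd_slope, gd_intercept = fit_gradient_descent(xs, ys)
print("closed form:     ", closed_slope, closed_intercept)
print("gradient descent:", gd_slope, gd_intercept)
```

Both approaches should land on essentially the same line. The closed form gets there in one calculation, while Gradient Descent gets there step by step, which is why it carries over to models that have no closed form.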

Additionally, the fact that we can visualize Linear Regression in practice using a simple 2-D plot makes this model a really good starting point for understanding algorithms.

Logistic Regression

Although it is named Regression, Logistic Regression is a great model with which to start mastering Classification Problems.

There are several benefits to learning Logistic Regression, namely:

Having a first glance at classification and multi-class classification problems (a huge part of ML tasks).
Understanding function transformations such as the one done by the Sigmoid Function.
Seeing Gradient Descent applied to a different function, and how it is agnostic to the function being optimized.
Having a first glance at the Log-Loss function.
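
The pieces in this list fit together in a few lines of code. The following is a minimal sketch, assuming a single feature and a made-up, deliberately non-separable dataset. The function names and the training settings are hypothetical choices for illustration.

```python
import math


def sigmoid(z):
    """Squash any real number into the (0, 1) range so it reads as a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(ys, probs):
    """Average negative log-likelihood of the true labels."""
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p) for y, p in zip(ys, probs))
    return -total / len(ys)


def predict_probs(xs, weight, bias):
    return [sigmoid(weight * x + bias) for x in xs]


def fit_logistic(xs, ys, learning_rate=0.1, n_steps=2000):
    """Same Gradient Descent loop as for a linear model, now minimizing Log-Loss."""
    weight, bias = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        probs = predict_probs(xs, weight, bias)
        grad_weight = sum((p - y) * x for p, y, x in zip(probs, ys, xs)) / n
        grad_bias = sum(p - y for p, y in zip(probs, ys)) / n
        weight -= learning_rate * grad_weight
        bias -= learning_rate * grad_bias
    return weight, bias


# Made-up data: one feature and a binary label (1 = positive class).
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

initial_loss = log_loss(ys, predict_probs(xs, 0.0, 0.0))
weight, bias = fit_logistic(xs, ys)
final_loss = log_loss(ys, predict_probs(xs, weight, bias))
print("log-loss before and after training:", initial_loss, final_loss)
```

The only things that change compared with a linear model are the sigmoid wrapped around the linear combination and the loss being minimized. The Gradient Descent update itself stays the same.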

What should you expect to know after studying Logistic Regression? You will be able to understand the mechanism behind Classification Problems and how you can use Machine Learning to separate classes. Some problems that fall into this category:

Understanding if a transaction is fraudulent or not.
Understanding if a customer will churn or not.
Classifying loans according to their probability of default.

Just like Linear Regression, Logistic Regression is also a linear algorithm. After studying both of them, you will get to know the main limitations of linear algorithms and how they fail to represent many real-world complexities.

Decision Trees

The first non-linear algorithm to study should be the Decision Tree. A fairly simple and explainable algorithm based on if-else rules, the Decision Tree will give you a good grasp of non-linear algorithms and their advantages and disadvantages.

Decision Trees are the building block of all tree-based models. By learning them you will also be prepared to study other techniques such as XGBoost or LightGBM (more about them below).

The cool part is that Decision Trees apply to both Regression and Classification problems, with minimal differences between the two. The rationale for choosing the best variables to split on is roughly the same; you just switch the criterion used to do it, in this case the error measure.
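
As an illustration of switching only the criterion, here is a minimal sketch of the search for a single split. It assumes Gini impurity for classification and variance for regression, which are common choices but not the only ones. The function names and the toy data are hypothetical.

```python
def gini(labels):
    """Gini impurity: 0 when all labels agree, higher when classes are mixed."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def variance(values):
    """Mean squared deviation from the mean, a common regression criterion."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(xs, ys, criterion):
    """Return the threshold on xs that minimizes the weighted criterion of both children."""
    best_threshold, best_score = None, float("inf")
    # Skip the largest value: splitting there would leave the right child empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        score = (len(left) * criterion(left) + len(right) * criterion(right)) / len(ys)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Made-up data: one feature, with a class target and a numeric target.
xs = [1, 2, 3, 4, 5, 6]
class_labels = ["no", "no", "no", "yes", "yes", "yes"]
numeric_targets = [10.0, 11.0, 9.0, 30.0, 31.0, 29.0]

class_threshold, class_score = best_split(xs, class_labels, gini)
reg_threshold, reg_score = best_split(xs, numeric_targets, variance)
print("classification split:", class_threshold, class_score)
print("regression split:    ", reg_threshold, reg_score)
```

The search loop in best_split is identical for both problems. Only the function passed as criterion changes. A full tree would repeat this search recursively on each child and across every feature.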

Although hyper-parameters also exist for regression models (such as the regularization parameter), in Decision Trees they are extremely important: they can make the difference between a good model and one that is absolute garbage. Hyper-parameters will be essential on your journey in ML, and Decision Trees are an excellent opportunity to experiment with them.

Random Forest

Due to their sensitivity to hyper-parameters and fairly simple assumptions, Decision Trees are fairly limited in what they can achieve on their own. As you study them, you will see that they are really prone to over-fitting, creating models that don’t generalize to new data.

The concept of Random Forest is really simple: if Decision Trees are a dictatorship, Random Forests are a democracy. They spread the decision across many different decision trees, which brings robustness to your algorithm. Just like with Decision Trees, you can configure a ton of hyper-parameters to enhance the performance of this Bagging model. What’s Bagging? A really important concept in ML that brings stability to models: you train several models on random resamples of the data and use the average or a voting mechanism to combine their results into a single prediction.

In practice, Random Forest trains a fixed number of Decision Trees and (normally) averages the results from all of those models. Just like with Decision Trees, we have Classification and Regression Random Forests. If you’ve heard about the concept of the Wisdom of the Crowds, bagging models apply that concept to the training of ML models.
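
The bagging idea can be sketched in a few lines. This is a simplified illustration, not a full Random Forest: it uses one-split trees (stumps) on a single made-up feature and covers only the bootstrap-and-vote part, whereas a real Random Forest also samples a random subset of features at each split. All function names, the data and the seed are assumptions.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """One-split tree: pick the threshold with the fewest training mistakes."""
    best = None
    for threshold in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        # If nothing falls to the right, reuse the left side's majority label.
        right = [y for x, y in zip(xs, ys) if x > threshold] or left
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        mistakes = sum(
            (left_label if x <= threshold else right_label) != y
            for x, y in zip(xs, ys)
        )
        if best is None or mistakes < best[0]:
            best = (mistakes, threshold, left_label, right_label)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_label, right_label = stump
    return left_label if x <= threshold else right_label


def fit_bagged_stumps(xs, ys, n_trees=25, seed=0):
    """Train each stump on its own bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in rows], [ys[i] for i in rows]))
    return forest


def predict_forest(forest, x):
    """Majority vote across all the stumps."""
    votes = Counter(predict_stump(stump, x) for stump in forest)
    return votes.most_common(1)[0][0]


# Made-up, slightly noisy data: the label mostly flips from 0 to 1 around x = 5.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

forest = fit_bagged_stumps(xs, ys)
print("vote for x=1: ", predict_forest(forest, 1))
print("vote for x=10:", predict_forest(forest, 10))
```

Each stump sees a slightly different version of the data, so individual stumps can disagree near the noisy points. The vote smooths those disagreements out.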

XGBoost/LightGBM

Other algorithms that are based on Decision Trees and bring them stability are XGBoost and LightGBM. These models are boosting algorithms: they train weak learners in sequence, each one working on the errors made by the previous ones, to find patterns that are more robust and generalize better.
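
To illustrate what "working on the errors of previous weak learners" means, here is a minimal sketch of boosting for a regression problem, where each new stump is fitted to the residuals left by the ensemble so far. It is a bare-bones illustration of the idea and not how XGBoost or LightGBM are actually implemented. The function names, data and settings are assumptions.

```python
def fit_regression_stump(xs, residuals):
    """Weak learner: one threshold, predicting the mean residual on each side."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosted_stumps(xs, ys, n_rounds=50, learning_rate=0.3):
    base = sum(ys) / len(ys)  # start from the simplest model: the mean
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # Each round fits a new stump to what the ensemble still gets wrong.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_regression_stump(xs, residuals)
        stumps.append(stump)
        predictions = [
            p + learning_rate * predict_stump(stump, x)
            for p, x in zip(predictions, xs)
        ]
    return base, learning_rate, stumps


def predict_boosted(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mean_squared_error(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


# Made-up data for illustration.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.5, 3.5, 4.0, 3.0, 6.5, 8.0, 7.5]

model = fit_boosted_stumps(xs, ys)
baseline_mse = mean_squared_error(ys, [sum(ys) / len(ys)] * len(ys))
boosted_mse = mean_squared_error(ys, [predict_boosted(model, x) for x in xs])
print("training MSE, mean only vs boosted:", baseline_mse, boosted_mse)
```

Note that this only measures the error on the training data. In practice the number of rounds and the learning rate are hyper-parameters that need to be tuned against held-out data, since boosting can over-fit too.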

This line of thought in Machine Learning, which gained traction after Michael Kearns’s paper on weak learners and hypothesis boosting, suggests that boosting models may be an excellent way to handle the bias/variance trade-off that all models suffer from. Additionally, these models are some of the favorite choices in Kaggle competitions.

XGBoost and LightGBM are two famous implementations of Boosting algorithms.

Artificial Neural Networks

Finally, the current holy grail of predictive models: Artificial Neural Networks (ANNs).

ANNs are currently one of the best models to find non-linear patterns in data and to build really complex relationships between independent and dependent variables. By learning them you will be exposed to the concepts of activation function, back-propagation and neural network layers — these concepts should give you good foundations to study Deep Learning models.
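
The three concepts mentioned above (layers, activation functions and back-propagation) can be seen together in a very small network. The sketch below assumes two inputs, one hidden layer of two tanh units and a single sigmoid output trained with Log-Loss. The starting weights, the single training example and the function names are all made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    """Hidden layer with a tanh activation, followed by a sigmoid output unit."""
    w_hidden, b_hidden, w_out, b_out = params
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(weights, x)) + b)
        for weights, b in zip(w_hidden, b_hidden)
    ]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)
    return hidden, output


def loss(params, x, y):
    _, output = forward(params, x)
    return -(y * math.log(output) + (1 - y) * math.log(1 - output))


def backward(params, x, y):
    """Back-propagation: apply the chain rule from the output back to every weight."""
    _, _, w_out, _ = params
    hidden, output = forward(params, x)
    d_out = output - y  # gradient of the loss w.r.t. the output pre-activation
    grad_w_out = [d_out * h for h in hidden]
    grad_b_out = d_out
    # The derivative of tanh(z) is 1 - tanh(z)^2.
    d_hidden = [d_out * w * (1 - h ** 2) for w, h in zip(w_out, hidden)]
    grad_w_hidden = [[d * xi for xi in x] for d in d_hidden]
    grad_b_hidden = d_hidden
    return grad_w_hidden, grad_b_hidden, grad_w_out, grad_b_out


def gradient_step(params, x, y, learning_rate=0.1):
    w_hidden, b_hidden, w_out, b_out = params
    g_w_hidden, g_b_hidden, g_w_out, g_b_out = backward(params, x, y)
    new_w_hidden = [
        [w - learning_rate * g for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(w_hidden, g_w_hidden)
    ]
    new_b_hidden = [b - learning_rate * g for b, g in zip(b_hidden, g_b_hidden)]
    new_w_out = [w - learning_rate * g for w, g in zip(w_out, g_w_out)]
    new_b_out = b_out - learning_rate * g_b_out
    return new_w_hidden, new_b_hidden, new_w_out, new_b_out


# Made-up starting weights and a single made-up training example.
initial_params = ([[0.5, -0.3], [0.8, 0.2]], [0.1, -0.1], [0.7, -0.4], 0.05)
x, y = [1.0, 0.0], 1

trained_params = initial_params
for _ in range(20):
    trained_params = gradient_step(trained_params, x, y)

loss_before = loss(initial_params, x, y)
loss_after = loss(trained_params, x, y)
print("loss before and after 20 steps:", loss_before, loss_after)
```

Real networks are trained on many examples at once and rely on libraries that compute these gradients automatically, but the chain of forward pass, loss, backward pass and weight update is the same.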

Additionally, Neural Networks come in a ton of different flavors when it comes to their architecture. Studying the most basic ones will lay the foundations for jumping to other types of models such as Recurrent Neural Networks (mostly used in Natural Language Processing) and Convolutional Neural Networks (mostly used in Computer Vision).

And, that’s it! These models should give you a nice head start in Data Science and Machine Learning. By learning them you will be prepared to learn more advanced models and easily grasp the math behind those models.

The good part is that the more advanced stuff is normally based on the 6 models I’ve presented here, so knowing their underlying math and mechanisms will never hurt, even in projects where you need to bring the “big guns”.

Source: Ivo Bernardo, Data Scientist
