Big datasets pose computational problems for software such as R and Python: even basic machine learning algorithms can seem to run forever, and most of the time it is difficult to even estimate how long they will take. Enter H2O, an open-source software platform for big-data analysis produced by the company H2O.ai.
H2O can be called from the statistical package R, from Python, and from other environments. It is used for exploring and analyzing datasets held in cloud computing systems and in the Apache Hadoop Distributed File System, as well as on the conventional operating systems Linux, macOS, and Microsoft Windows. H2O allows users to fit thousands of potential models as part of discovering patterns in data.
H2O runs in a Java Virtual Machine (JVM) and is optimized for in-memory processing of distributed, parallel machine learning algorithms on clusters. A cluster is a software construct that can be fired up on your laptop, on a server, or across the multiple nodes of a cluster of real machines, including computers that form a Hadoop cluster. According to the H2O documentation, a cluster's memory capacity is the sum across all H2O nodes in the cluster.
H2O provides great flexibility in training and scaling machine learning algorithms on large datasets, which we will witness as we progress through this tutorial.
In our case, we will focus on the Kaggle challenge and use H2O to aim for a great score on the leaderboard.
Through this tutorial series, we will explore different machine learning algorithms offered by H2O, such as Generalized Linear Models, Gradient Boosting Machines, Stacked Ensembles, and Deep Learning.
In the first tutorial, we will learn how to set up H2O on our machine and run some basic H2O algorithms to get their baseline performance.
In subsequent tutorials, we will discuss the algorithms in detail, tune them to our advantage, create stacked ensembles, perform some interesting feature engineering, and try to work our way to the top of the leaderboard.
You can install the H2O package either from the terminal or directly from your Jupyter notebook.
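One way to install from inside a notebook is to call pip through the interpreter that the kernel is running, so the package lands in the same environment. The following is a minimal sketch under that assumption; the helper names `pip_install_command` and `install_h2o` are made up for illustration, and `h2o` is the package name on PyPI.

```python
import subprocess
import sys


def pip_install_command(package="h2o"):
    """Build a pip command that targets the interpreter running this code."""
    return [sys.executable, "-m", "pip", "install", package]


def install_h2o():
    """Install the h2o package from PyPI; raises CalledProcessError on failure."""
    subprocess.check_call(pip_install_command("h2o"))
```

Calling `install_h2o()` in a notebook cell should have the same effect as running `pip install h2o` in a terminal with that environment active.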
Also, make sure you have JDK 8 and JRE 8 installed.
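If you are unsure which Java version is on your path, a quick check from Python can help. This is only a sketch: the function names are hypothetical, and it relies on the usual `java -version` report, where Java 8 identifies itself as `1.8.x` and later releases as `11.x`, `17.x`, and so on.

```python
import re
import subprocess


def parse_java_major(version_text):
    """Extract the major Java version from `java -version` output.

    Legacy releases report "1.8.0_281" (major 8); newer ones report "11.0.2".
    Returns None when no version string is found.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_text)
    if match is None:
        return None
    first, second = match.group(1), match.group(2)
    if first == "1" and second is not None:
        return int(second)
    return int(first)


def installed_java_major():
    """Run `java -version`; return the major version, or None if Java is missing."""
    try:
        result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    except FileNotFoundError:
        return None
    # `java -version` writes its report to stderr rather than stdout.
    return parse_java_major(result.stderr + result.stdout)
```

A return value of `8` from `installed_java_major()` would match the requirement above, while `None` suggests Java is not installed or not on the path.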
We then import the H2O Python package and initialize the H2O cluster. If no address is passed to h2o.init(), H2O will initialize a cluster on your local machine.
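The initialization step can be sketched roughly as follows. The wrapper `start_cluster` is a hypothetical helper, not part of H2O; `url` and `max_mem_size` are optional arguments of `h2o.init()`, and the example values in the comments are placeholders.

```python
def start_cluster(url=None, max_mem_size=None):
    """Start or connect to an H2O cluster and return the h2o module.

    With url=None, h2o.init() looks for a cluster on the local machine and
    launches one if none is running.
    """
    import h2o  # imported here so the helper can be defined before h2o is installed

    kwargs = {}
    if url is not None:
        kwargs["url"] = url  # e.g. "http://localhost:54321" (placeholder address)
    if max_mem_size is not None:
        kwargs["max_mem_size"] = max_mem_size  # e.g. "4G" to cap the JVM heap
    h2o.init(**kwargs)
    return h2o
```

In a notebook, `h2o = start_cluster()` is then equivalent to `import h2o` followed by `h2o.init()`.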
We read the CSV file into a variable df, which stores it as an H2O data frame. Remember that an H2O data frame is different from a regular pandas data frame.
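As a rough illustration of the difference, an H2O frame is held in the cluster's memory, and only an explicit conversion copies it into a local pandas object. The helper names and the `train.csv` path below are assumptions for the sketch; `h2o.import_file` and `as_data_frame` are the H2O calls doing the work.

```python
def load_frame(path):
    """Read a CSV file into an H2OFrame, which lives in the cluster's memory."""
    import h2o  # assumes h2o.init() has already been called

    return h2o.import_file(path)


def to_pandas(frame):
    """Copy an H2OFrame into a local pandas DataFrame (pulls all rows to the client)."""
    return frame.as_data_frame()
```

For example, `df = load_frame("train.csv")` gives the H2O frame used in the rest of the tutorial, and `to_pandas(df)` is only worth calling when the data fits comfortably in local memory.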
Let’s check the dimensions of our data frame. It has 595212 rows and 59 columns.
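An H2O frame exposes its size through a `shape` attribute, much like pandas. A small sketch, with `describe_dimensions` as a hypothetical helper name:

```python
def describe_dimensions(frame):
    """Return a short description of an H2OFrame's size.

    H2OFrame exposes `shape` as a (rows, columns) pair.
    """
    nrows, ncols = frame.shape
    return f"{nrows} rows x {ncols} columns"
```

For the frame described above, `describe_dimensions(df)` would report 595212 rows and 59 columns.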
Specify the target variable and convert it to a factor variable.
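Converting the response to a factor is what tells H2O to treat the problem as classification rather than regression. The sketch below assumes the response column is named `target` and that there is an `id` column to leave out of the predictors; both names are assumptions, so adjust them to your data.

```python
def prepare_target(frame, target="target", exclude=("id",)):
    """Convert the response column to a factor and return the predictor names.

    `target` and `exclude` are assumed column names, not taken from the data.
    """
    # asfactor() turns a numeric 0/1 column into a categorical one.
    frame[target] = frame[target].asfactor()
    skipped = set(exclude) | {target}
    return [name for name in frame.columns if name not in skipped]
```

The returned list can be passed as the `x` argument when training a model, with `target` as `y`.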
Create training, validation, and test sets in H2O.
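A three-way split can be sketched with `split_frame`, which takes the fractions for all but the last piece. The 70/15/15 proportions and the seed are illustrative assumptions, and `split_three_ways` is a hypothetical helper name.

```python
def split_three_ways(frame, train_frac=0.7, valid_frac=0.15, seed=42):
    """Split an H2OFrame into train, validation, and test frames.

    The remainder (1 - train_frac - valid_frac) becomes the test set.
    H2O assigns rows at random, so the resulting sizes are approximate.
    """
    if train_frac + valid_frac >= 1:
        raise ValueError("train_frac + valid_frac must be below 1")
    train, valid, test = frame.split_frame(ratios=[train_frac, valid_frac], seed=seed)
    return train, valid, test
```

Fixing the seed keeps the split reproducible across runs, which matters when comparing baseline models later.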
Create a baseline gradient boosting machine (GBM) model. We will discuss this algorithm in more detail in later tutorials, where we will also tune it to achieve optimal performance.
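A baseline here means a model with its hyperparameters left at their defaults. A minimal sketch, assuming the frames and predictor list come from the earlier steps and that the response column is called `target`; the function name and seed are illustrative.

```python
def train_baseline_gbm(train, valid, predictors, target="target", seed=42):
    """Fit a gradient boosting machine with default hyperparameters."""
    # Imported lazily so the helper can be defined before h2o is installed.
    from h2o.estimators.gbm import H2OGradientBoostingEstimator

    model = H2OGradientBoostingEstimator(seed=seed)  # everything else at defaults
    model.train(
        x=predictors,
        y=target,
        training_frame=train,
        validation_frame=valid,  # used for validation metrics, not for fitting
    )
    return model
```

Keeping this untuned version around gives a reference point for judging whether later tuning actually helps.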
Print the model summary.
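Besides printing the summary, it is often useful to score the model on the held-out test split. The sketch below assumes a binary target, so that AUC is a meaningful metric; `evaluate_model` is a hypothetical helper name.

```python
def evaluate_model(model, test):
    """Print a trained H2O model's details and return its AUC on `test`."""
    model.show()  # prints model details along with its reported metrics
    performance = model.model_performance(test_data=test)
    return performance.auc()  # assumes a binary classification model
```

Comparing this test AUC with the validation metrics in the printed summary gives a first hint of whether the baseline is overfitting.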
Now let us get predictions for the actual test set in the competition.
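For a binary target, `predict` returns a frame with the predicted class and one probability column per class. The sketch below pairs the positive-class probability with an identifier column and writes the result to disk. The column names `id` and `target` and the output path are assumptions about the submission format, not details from the competition.

```python
def build_submission(model, test_frame, id_col="id", out_path="submission.csv"):
    """Score `test_frame` and export (id, probability) pairs as a CSV file."""
    import h2o  # assumes a running cluster that already holds both objects

    predictions = model.predict(test_frame)  # columns: predict, p0, p1
    submission = test_frame[id_col].cbind(predictions["p1"])
    submission.names = [id_col, "target"]  # assumed header for the submission
    h2o.export_file(submission, out_path, force=True)  # force=True overwrites
    return submission
```

Here `test_frame` would be the competition's test file loaded with `h2o.import_file`, not the test split carved out of the training data.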
We will repeat the steps we used for gradient boosting with a Generalized Linear Model as well.
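The GLM version follows the same pattern with a different estimator. This sketch assumes a binary target, for which `family="binomial"` gives logistic regression; the function name and the `target` column name are illustrative.

```python
def train_baseline_glm(train, valid, predictors, target="target"):
    """Fit a logistic-regression GLM with otherwise default settings."""
    # Imported lazily so the helper can be defined before h2o is installed.
    from h2o.estimators.glm import H2OGeneralizedLinearEstimator

    model = H2OGeneralizedLinearEstimator(family="binomial")  # binary response
    model.train(
        x=predictors,
        y=target,
        training_frame=train,
        validation_frame=valid,
    )
    return model
```

The fitted GLM can be summarized and used for prediction in the same way as the GBM, since H2O estimators share those methods.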
In the next tutorial, we will discuss Gradient Boosting Machines in detail and learn how to tune this algorithm better.