Introduction
There are many great Python boosting libraries for data scientists to take advantage of, including XGBoost and the newer CatBoost. However, one algorithm combines characteristics of both, which makes it a must for data scientists. It is great for learning and education, but more importantly, it suits a fast-paced professional environment that requires a fast algorithm. Below, I will discuss the benefits of LightGBM [1] and how they apply to your data science job.
Categorical Encoding
Perhaps the best feature of this library is its categorical feature support. Whereas a lot of data scientists use one-hot encoding, which creates many new columns from a single categorical feature, LightGBM lets you specify the categorical features directly with the categorical_feature parameter.
While one-hot encoding is useful in academia or inside your Jupyter Notebook, it can be less useful in a professional setting. Say you have 10 categorical features with 100 unique values each: one-hot encoding expands them into 1,000 new columns. Not only does this make your dataframe sparse, it also makes your model much slower. Another stressful outcome of this sparsity comes when you have to translate your features into production code for the software engineers working on your prediction service and deployment. This handoff of responsibilities (if you have that setup, of course) can be confusing and overwhelming for both parties.
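To make the column arithmetic concrete, here is a minimal sketch using pandas on made-up data. The column names (feature_0 through feature_9) and the row count are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows, n_features, n_levels = 1000, 10, 100

# Hypothetical data: 10 categorical columns, each with exactly 100 unique values.
df = pd.DataFrame({
    f"feature_{i}": pd.Categorical(rng.permutation(np.arange(n_rows) % n_levels))
    for i in range(n_features)
})

# One-hot encoding creates one new column per (feature, value) pair.
one_hot = pd.get_dummies(df)

# Fraction of cells that are zero: each row has only 10 ones across 1,000 columns.
zero_fraction = 1 - one_hot.to_numpy().mean()

print(df.shape)       # (1000, 10)
print(one_hot.shape)  # (1000, 1000)
print(zero_fraction)  # about 0.99
```

Ten readable columns become a thousand mostly empty ones, and that is the dataframe you would have to describe to whoever maintains the prediction service.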
Here are some of the benefits of categorical encoding with LightGBM:
- Easier to encode categorical features
- Easier to use
- Easier to work with other data scientists, software engineers, backend engineers, and product managers
- Can retain original column names
- Can reap the benefits of categorical features rather than traditional numeric conversion with one-hot encoding
- These benefits can ultimately make your model faster and more accurate
Fast
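Not only does handling your categorical features natively make your model faster, but LightGBM also has a few other tricks to improve training and prediction speed. LightGBM uses both Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), as well as histogram-based splitting. GOSS keeps the rows with large gradients and samples from the rest, EFB bundles sparse features that are rarely non-zero at the same time, and histogram-based splitting buckets continuous values into bins so that fewer split points need to be evaluated.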
Here is why a fast LightGBM model is useful for professionals:
- Not every job will allow you weeks or months to come up with a model, and some may even want one the same week, or at least a proof-of-concept model
- Faster training lets you test features and parameters sooner, which helps you keep up in a fast-paced environment
- Can test more features without slowing down your model as much as in other algorithms
It is simple, it is fast, and when you have a lot of people depending on your model, that speed will allow you to help the business more efficiently.
Accurate
XGBoost, CatBoost, and LightGBM are all accurate models. Yes, it ultimately depends on your problem, features, and data, but in general, these algorithms lead to accurate results once you have performed the necessary preparation steps.
Because you can use categorical features directly, you are likely to get a more accurate model than with an algorithm that relies on one-hot encoding alone. The way that LightGBM splits, growing trees leaf-wise rather than level-wise, can lead to more accurate models as well. It is important to note, though, that leaf-wise growth can overfit, so you will want to guard against that.
Here are some of the reasons why LightGBM can be more accurate, and how it can help you professionally:
- Splitting method
- Categorical feature support
- Of course, everyone wants a more accurate model, especially in a business (just have to make sure you do not overfit)
Summary
Although these benefits are simple, they are incredibly important and make your work a lot easier. As a result, your company's stakeholders and engineers are likely to be satisfied with you using LightGBM.
To summarize, here are some of the main benefits of using LightGBM professionally:
- Categorical Encoding
- Fast
- Accurate
I hope you found my article both interesting and useful. Please feel free to comment down below if you agree or disagree with these benefits. Why or why not? What other benefits do you think are important to point out in LightGBM? These can certainly be clarified even further, but I hope I was able to shed some light on LightGBM.