Prajwal Kailas (~prajwal) |
It is relatively easy to build a first-cut machine learning model. But what does it take to build a reasonably good model, or even a state-of-the-art model?
Ensemble models. They are our best friends. They help us exploit the power of computing. Ensemble methods aren't new. They form the basis for some extremely powerful machine learning algorithms like random forests and gradient boosting machines. The key point about ensembles is that consensus from diverse models is more reliable than a single source. This talk will cover how we can combine model outputs from various base models to create a stronger/better model output.
This talk will cover various strategies to create ensemble models.
Using third-party Python libraries along with scikit-learn, this talk will demonstrate the following ensemble methodologies: 1) Weighted Average 2) Stacking 3) Blending
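To make the first of these concrete, here is a minimal sketch of a weighted average of two base models' predicted probabilities, using plain scikit-learn on a synthetic dataset (the models and weights are illustrative, not the talk's exact code; in practice the weights would be tuned on validation data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data for illustration.
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two base models of different families, to get diverse predictions.
lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
dt = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Hypothetical weights; normally chosen by validating the ensemble.
weights = [0.6, 0.4]
avg_proba = (weights[0] * lr.predict_proba(X_test)[:, 1]
             + weights[1] * dt.predict_proba(X_test)[:, 1])
ensemble_pred = (avg_proba >= 0.5).astype(int)
```

The weighted probability is thresholded at 0.5 to recover hard class labels, exactly as a single classifier's `predict` would.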
Real-life examples from the enterprise world will be showcased in which ensemble models consistently produced better results than the single best-performing model.
We at Unnati Data Labs have built a package for performing the ensembling of machine learning models.
The motivation was to incorporate the various techniques described above into a single Python package. The current version of the package can perform weighted averaging and build stacking and blending models for binary classification; other ensembling techniques will be incorporated in the future.

The package enables data encoding: the data the user wishes to classify can be encoded in many different ways, such as label encoding (categorical encoding), one-hot encoding, sum encoding, polynomial encoding, backward difference encoding, Helmert encoding, and hashing encoding. The encoded data is then used to train the base models. The package lets the user build a number of base models, such as gradient boosting (XGBoost), multi-layer perceptron, random forest, decision tree, linear regression, and logistic regression. The user can build any or all of the base models with default parameter values, change the parameter values, or provide a list of parameter values to perform hyperparameter optimisation (using Hyperopt and grid search) and identify the optimum values. Once the desired parameter values have been obtained, the respective base models are trained. The trained models are then used to obtain predictions on the cross-validation data, and these predictions are assembled into a data frame that is used to train the stacking and blending models and to perform weighted averaging.
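The core data structure described here — a data frame whose columns are the base models' out-of-fold predictions — can be sketched with scikit-learn and pandas as follows (the model choices and column names are illustrative, not the package's API):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the user's (already encoded) training data.
X, y = make_classification(n_samples=300, random_state=0)

# Two illustrative base models.
base_models = {
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# cross_val_predict yields out-of-fold predictions: each row is predicted
# by a model that never saw it during training, which keeps the resulting
# meta-features honest.
meta_features = pd.DataFrame({
    name: cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]
    for name, model in base_models.items()
})
```

`meta_features` has one row per training sample and one column per base model; this is the data frame the ensemble stage consumes.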
Once the base models have been trained, the user can select which ensembling technique to use. The data frame of predictions is used to perform weighted averaging and to train the stacking model; for the blending model, the data frame of predictions is appended to the cross-validation data. The ensemble models can be trained with any of the algorithms/classifiers provided for training the base models. For testing the stacking and blending models, we can hold out a test set to examine how well these ensemble models perform: whether they overfit, underfit, or provide better performance than the base models.
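The distinction between stacking and blending drawn above can be sketched in a few lines: the stacking meta-model sees only the base predictions, while the blending meta-model sees the base predictions appended to the original features. This is a simplified illustration under those definitions, not the package's implementation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

base = [LogisticRegression(max_iter=1000),
        DecisionTreeClassifier(max_depth=3, random_state=0)]

# Out-of-fold base-model predictions on the training data.
train_preds = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=5, method="predict_proba")[:, 1]
    for m in base
])

# Stacking: the meta-model is trained on the predictions alone.
stacker = LogisticRegression().fit(train_preds, y_train)

# Blending: the predictions are appended to the original features.
blender = LogisticRegression(max_iter=1000).fit(
    np.hstack([train_preds, X_train]), y_train)

# At test time, base models refit on the full training set supply
# the prediction columns for the held-out set.
for m in base:
    m.fit(X_train, y_train)
test_preds = np.column_stack([m.predict_proba(X_test)[:, 1] for m in base])
stack_acc = stacker.score(test_preds, y_test)
blend_acc = blender.score(np.hstack([test_preds, X_test]), y_test)
```

Comparing `stack_acc` and `blend_acc` against the base models' own held-out accuracy is exactly the overfit/underfit check described above.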
Why does this package stand out? First off, the package enables data encoding, which provides more flexibility and information and can improve model performance. The package uses Hyperopt and grid search to perform hyperparameter optimization, so the models are trained with optimum parameter values that yield improved performance. Since ensembling involves building and training multiple models, with the option of performing hyperparameter optimization, the time taken to train these models becomes an issue. To tackle this, the package makes use of the joblib library for parallelization. With the help of joblib, the various base models are trained in parallel, and their predictions on the cross-validation data are obtained by running the predict functions in parallel. The different ensemble models are likewise trained in parallel, as is the testing phase of the ensemble models. This makes for a fast, robust, well-defined package with immense potential in the world of machine learning.
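The joblib pattern described here — fitting several independent base models in parallel, then running their predict functions in parallel — looks roughly like this (a generic sketch, not the package's internal code):

```python
from joblib import Parallel, delayed
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

models = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(random_state=0),
    RandomForestClassifier(n_estimators=50, random_state=0),
]

# Fit each base model in its own worker; fit() returns the fitted
# estimator, so the results list holds the trained models.
fitted = Parallel(n_jobs=-1)(delayed(m.fit)(X, y) for m in models)

# Predictions are embarrassingly parallel in the same way.
preds = Parallel(n_jobs=-1)(delayed(m.predict)(X) for m in fitted)
```

Because the base models are independent of one another, this stage scales with the number of cores, which is what keeps hyperparameter search over many models tractable.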
The package is still in its primary stage, and will continue to be worked upon. Some of the key areas to work on will be adding more base models as well as different ensembling techniques and also expanding to regression, multi class classification and more.
Creating better models is the critical component of building a good data science product.
Basic knowledge of machine learning and of Python libraries like sklearn and pandas.
Ensemble Package: https://github.com/unnati-xyz/ensemble-package
A tech enthusiast and an optimist, pursuing his Bachelors in Computer Science & Engineering at the National Institute of Technology, Karnataka. He has finished his second year of engineering and is interning at Unnati Data Labs.