From Reading CSV to Baseline Submission in 10 minutes. Hackathons unfolded.
Time to explore the Data Technology. Get Ready.
It's imperative to understand the general approach and follow it closely if you want to finish in the top ranks. What distinguishes the best Kagglers from average ones is their ability to keep improving all the time. Leaving that for an advanced discussion, let's start by defining the Benchmark Score.
- What is a Benchmark Score?
The organizers of a hackathon, using the most basic Machine Learning algorithm, define a score that they expect any naive participant to achieve. This can be regarded as a threshold that can be crossed with a simple, unrefined approach. For example, a classification benchmark can be set at 0.4. If the metric is accuracy, this implies that a simple classifier predicts the correct class for 40% of the samples.
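To make the 0.4 example concrete, here is a minimal sketch of how such a benchmark could arise. It assumes the metric is accuracy and that the "simple classifier" always predicts the most frequent class; the labels are made up for illustration, and scikit-learn's DummyClassifier stands in for whatever model the organizers actually use.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: three classes, the largest covering 40% of the rows.
y = np.array([0] * 40 + [1] * 30 + [2] * 30)
X = np.zeros((len(y), 1))  # the dummy model ignores the features

# A naive model that always predicts the most frequent class.
naive = DummyClassifier(strategy="most_frequent")
naive.fit(X, y)

# Fraction of rows predicted correctly: the benchmark to beat.
benchmark_score = accuracy_score(y, naive.predict(X))
print(benchmark_score)
```

Any model that cannot beat this number is not learning anything useful from the features.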
Submission in 10 Minutes!
As soon as the data is released in CSV format, people start generating hypotheses and thinking of features. There is nothing wrong with that, but in scenarios as competitive as hackathons, it's crucial to set a benchmark for yourself first and improve from there.
The pipeline created for a Baseline submission to achieve that benchmark shall be detailed in the talk:
- Variable Identification
- Univariate and Bivariate Analysis
- Missing Value Treatment
- Outlier Detection
- Feature Engineering: Creation and Transformation
- Feature Selection
- Model Implementation
- Cross Validation
After this, you will have a score to use as a baseline. If the score drops after you add a new idea to your model, you know that something has gone wrong!
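The steps listed above can be sketched end to end as follows. This is only an illustration under stated assumptions, not the pipeline from the talk: the data frame is a synthetic stand-in for the contest CSV, the column names (age, income, city, target) are hypothetical, and the model is a plain logistic regression scored by accuracy. The univariate and bivariate plots are left out for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for pd.read_csv("train.csv"); all columns are made up.
rng = np.random.RandomState(0)
n = 200
train = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.lognormal(10, 1, n),
    "city": rng.choice(["A", "B", "C"], n),
})
train["target"] = (train["age"] + rng.normal(0, 5, n) > 40).astype(int)
train.loc[rng.choice(n, 20, replace=False), "age"] = np.nan
train.loc[rng.choice(n, 10, replace=False), "city"] = np.nan

# Variable identification: split target, numeric and categorical columns.
y = train["target"]
features = train.drop(columns=["target"]).copy()
num_cols = features.select_dtypes(include=[np.number]).columns
cat_cols = features.columns.difference(num_cols)

# Missing value treatment: median for numeric, a "missing" level for categorical.
features[num_cols] = features[num_cols].fillna(features[num_cols].median())
features[cat_cols] = features[cat_cols].fillna("missing")

# Outlier treatment: cap numeric columns at their 1st and 99th percentiles.
low = features[num_cols].quantile(0.01)
high = features[num_cols].quantile(0.99)
features[num_cols] = features[num_cols].clip(lower=low, upper=high, axis=1)

# Feature engineering: one transformation and one-hot encoding.
features["log_income"] = np.log1p(features["income"])
X = pd.get_dummies(features, columns=list(cat_cols)).astype(float)

# Feature selection and model, wrapped together so that selection
# is refit inside every cross-validation fold.
model = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif, k=3),
    LogisticRegression(max_iter=1000),
)

# Cross validation: the mean score is the baseline to improve on.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
baseline_score = scores.mean()
print(baseline_score)

# Submission: in a real contest, predict on the test file after applying
# the same preprocessing; the training rows are reused here only as a stand-in.
model.fit(X, y)
submission = pd.DataFrame({"id": X.index, "target": model.predict(X)})
# submission.to_csv("submission.csv", index=False)
```

Keeping feature selection inside the pipeline means each fold selects features from its own training part only, so the cross-validated score is not inflated by information from the held-out rows.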
Note: This talk assumes that you have a zeal to derive insights from data, and to leverage its power to learn about the future beforehand.
Apart from an interest in problem solving and real-life applications, attendees are advised to come prepared with the following in their arsenal for better understanding:
Python 2.7: Anaconda can be used for easy installation with all dependencies.
Otherwise, these packages can be installed separately:
- SKLearn for ML Packages.
- Pandas for Data Exploration.
- Matplotlib and Seaborn for Visualisation of Data.
- Numpy for Mathematical Computation.
- XGBoost for enhanced prediction.
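A quick way to confirm that the packages above are available is to try importing each one. The snippet below is a small convenience sketch, not part of the talk material; the names in the list are the usual import names (sklearn is the import name of scikit-learn).

```python
import importlib

# Import names for the packages listed above.
required = ["sklearn", "pandas", "matplotlib", "seaborn", "numpy", "xgboost"]

status = {}
for name in required:
    try:
        module = importlib.import_module(name)
        status[name] = getattr(module, "__version__", "unknown version")
    except Exception:  # package is missing or its installation is broken
        status[name] = None

missing = sorted(name for name, version in status.items() if version is None)
print("Missing packages: " + (", ".join(missing) if missing else "none"))
```

Anything reported as missing can then be installed separately before the session.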
Dataset to be used for demonstration shall be uploaded shortly.
Note: The requirements are not compulsory. They are recommended for those of you who want to try this approach during the session itself. :)
The content for the talk has been divided into 3 segments:
Having recently graduated with a B.Tech. degree in Production and Industrial Engineering from IIT Roorkee, I am currently working as a Data Science Associate in one of the top management consulting firms in India. I have been contributing to the field of Data Science since my sophomore year, and have been looking for varied applications of it in other domains ever since. Having helped initiate and build a community of more than 1500 enthusiasts in college, I have closely mentored and motivated people around me to learn DS the right way!
I avidly take part in hackathons, and feel that working on real data is the way ahead. On the research side, I have 2 publications to my name, one focusing on Terrorism Mining in India and the other on Sports Analytics. I have done a lot of work in Supply Chain Optimization as well, and believe in integrating Operations Research with Machine Learning to build products which can help in making the world a better place.
Recently, I have given talks at Data Science conferences, both at a national and an international level. At Data Science Congress, I talked about applications of Data Mining to Terrorism, and at ICIS 2016, I briefed the audience on Data Science with Python and beyond.