Harnessing Open Data to build user profiles using Python's scikit-learn

Abhiram Ravikumar (~abhiram89)





Want to learn how to use the large amounts of open data available on platforms like Twitter, GitHub and Stack Overflow to build a profile of a software developer? It is possible with Python's scikit-learn library. Mine data, extract features, compute quotients and, finally, visualize!

Detailed description

The talk starts with an overview of data mining and machine learning concepts, clearing up common misconceptions about data science along the way. As a real-life example, the problem of job-seeking and recruitment is introduced. This leads to the solution, Vritthi, an open source project, followed by its technical aspects.

Vritthi uses data mining and machine learning to help job-seekers understand their skill sets and take up courses that would improve their technical expertise. Vritthi can automatically calculate a professional quotient by collating data from websites like GitHub, Stack Overflow and LinkedIn. The analysis is based on parsing thousands of similar profiles available through the APIs of these websites.

GitHub Archive is one of our data sources; it helps set the standards against which the coding competencies of individual profiles are measured.

Collecting data from GitHub through its API is explained in detail, along with the feature set used to analyze profiles. Once collected, the data passes through a cleaning phase, after which a set of features is extracted. These features can be as simple as the number of commits or the number of projects in a particular programming language; “programming languages used”, for example, is one of the attributes of the feature vector. Next, Python’s scikit-learn is used to build the data model required for the analysis. A supervised learning model is used, consisting of two phases: clustering profiles and computing quotient values. Once the model is ready, the focus shifts to computing technical quotient values per programming language or skill. Finally, the computed quotients are visualized in a web application that uses Python’s Bokeh visualization library.

Thus, classic data mining and machine learning techniques are applied to openly available data to solve a specific problem.

Who is this talk for?

  • Python developers who’d like to explore scikit-learn
  • Web developers who’d like to explore Python’s Bokeh library for data visualization
  • Entrepreneurs who would like to see how a practical use case is solved using open data

What will participants take away?

  • A live example of machine learning and of how to adopt Python’s scikit-learn library in an ML use case
  • A solid understanding of data science and how it can solve problems in real life
  • Deeper understanding of GitHub’s API for data extraction and mining
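
As a small taste of the GitHub data extraction mentioned in the takeaways, the sketch below turns the JSON returned by GitHub's "list repositories for a user" REST endpoint into a few simple features. The field names (`language`, `fork`, `stargazers_count`) come from that endpoint's response; the helper names `fetch_repos` and `extract_features`, the chosen features and the sample payload are hypothetical and are not taken from Vritthi.

```python
import json
from collections import Counter
from urllib.request import Request, urlopen

API_ROOT = "https://api.github.com"


def fetch_repos(username):
    """Fetch the first page of a user's public repositories.

    Unauthenticated requests are rate-limited, and pagination is
    omitted to keep the sketch short.
    """
    request = Request(
        f"{API_ROOT}/users/{username}/repos",
        headers={"Accept": "application/vnd.github+json"},
    )
    with urlopen(request) as response:
        return json.load(response)


def extract_features(repos):
    """Reduce a list of repository objects to a few profile features."""
    # Forks are skipped so that only the user's own projects are counted.
    own = [repo for repo in repos if not repo.get("fork")]
    languages = Counter(repo["language"] for repo in own if repo.get("language"))
    return {
        "repo_count": len(own),
        "total_stars": sum(repo.get("stargazers_count", 0) for repo in own),
        "repos_per_language": dict(languages),
    }


# Made-up payload in the shape the endpoint returns, so the example
# runs without network access.
sample = [
    {"name": "scraper", "language": "Python", "fork": False, "stargazers_count": 4},
    {"name": "site", "language": "JavaScript", "fork": False, "stargazers_count": 1},
    {"name": "linux", "language": "C", "fork": True, "stargazers_count": 0},
]

features = extract_features(sample)
print(features)
```

With network access, `extract_features(fetch_repos("some-username"))` would produce the same kind of dictionary for a real account.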


Prerequisites:

Basic programming knowledge in any object-oriented language would be helpful.

Content URLs:

Speaker Info:

Abhiram has been a part of the open source world in Bangalore for over 3 years now. As a student volunteer in Bangalore, he started contributing to Mozilla as well as FSMK (Free Software Movement Karnataka). Since becoming a Mozilla Rep, he has presented over 40 sessions and workshops on Python scripting, web development, Rust and Git version control at various venues all over India. As an internet activist, he was an integral part of the #SaveTheInternet campaign in India during the fight against net neutrality violations. In 2016, he was invited to Mozilla’s Leadership Summit in Singapore to present a talk on running a successful campus club for ~3 years.

Currently, he is a Mozilla Tech Speaker well versed in topics like full stack web development, decentralization, scalable infrastructure setup, open source contribution practices and mentoring web enthusiasts. For the past 2 years, he has been working at SAP Labs in Bangalore as a full stack web developer, and he continues to contribute to Mozilla India on a voluntary basis.

Recently, he was invited to record a programming course on Rust by the educational website Lynda.com in Los Angeles, California. The course is titled First Look: Rust and it went live last week!

Speaker Links:

Events and speaking engagements

Blogs and social media

Id: 989
Section: Data science
Type: Talks
Target Audience: Beginner
Last Updated: