Experiments in data mining, entity disambiguation and how to think data-structures for designing beautiful algorithms
by Ekta Grover (speaking)
- Scientific Computing
- Technical level
This session is about 3 main things –
How to “think” about data mining problems and quickly try out different algorithmic approaches (fail cheap, if you must fail)
Three problems in data-mining around disambiguation, Natural language processing & Information Retrieval
Measuring your Model performance - developing a near-real time performance aggregation platform
And putting it all together: building the problem structure (think data structures), analyzing, visualizing and, hopefully, monetizing.
You will walk away with concrete ideas for building your own custom disambiguation algorithms and distance metrics.
This session is for people who would like to move from pure-play analytics to heavy data lifting and data engineering applications. With a background in computer science and quantitative economics, I blend the two approaches to get quick wins (and fail cheaper) on some of the most interesting problems around us today, for fun and profit.
The four problems I will discuss are around –
Scoring your connections to find your net worth in LinkedIn [Data structures, Algorithms, Data mining & Visualization]
Building your custom distance metrics (with graphs as the base data structure) and finger keying, with applications to hand-collected data sets, which call for entirely different distance metrics than those traditionally explored [Algorithms]
Building relevant job feeds in LinkedIn (TF-IDF) and (possibly) hacking around job applications [Relevance algorithms, linguistic pre-processing & visualization]
Developing a custom batch aggregation platform - with competing system goals [developing scoring metrics, concept of centrality & formal constructs to measure recency]
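To make the second problem more concrete, here is a minimal sketch of a keyboard-aware distance built on a graph. It is an illustration only, not the code from the talk's repository: the function names, the simplified three-row QWERTY layout, and the choice to charge 0.5 for substituting a physically adjacent key are all assumptions. The idea is that in hand-keyed data, "linkedim" is a far more likely typo for "linkedin" than "linkedip", so the metric should say so.

```python
from collections import deque

# Simplified QWERTY letter rows (assumption: letters only, no digits or symbols).
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def build_keyboard_graph(rows=ROWS):
    """Undirected graph: each key is linked to its physical neighbours."""
    graph = {ch: set() for row in rows for ch in row}
    for r, row in enumerate(rows):
        for i, ch in enumerate(row):
            if i + 1 < len(row):  # neighbour to the right in the same row
                graph[ch].add(row[i + 1])
                graph[row[i + 1]].add(ch)
            if r + 1 < len(rows):  # staggered row below: keys i-1 and i
                below = rows[r + 1]
                for j in (i - 1, i):
                    if 0 <= j < len(below):
                        graph[ch].add(below[j])
                        graph[below[j]].add(ch)
    return graph


def key_distance(graph, a, b):
    """Number of key-to-key hops between a and b (breadth-first search)."""
    if a == b:
        return 0
    seen = {a}
    queue = deque([(a, 0)])
    while queue:
        node, hops = queue.popleft()
        for nxt in graph[node]:
            if nxt == b:
                return hops + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    raise ValueError("keys are not connected: %r, %r" % (a, b))


def substitution_cost(graph, a, b):
    """Assumed cost model: 0 if equal, 0.5 for adjacent keys, 1.0 otherwise."""
    if a == b:
        return 0.0
    if a not in graph or b not in graph:
        return 1.0  # characters outside the modelled layout
    return min(1.0, key_distance(graph, a, b) / 2.0)


def keyboard_edit_distance(s, t, graph):
    """Levenshtein distance with keyboard-weighted substitutions."""
    s, t = s.lower(), t.lower()
    prev = [float(j) for j in range(len(t) + 1)]
    for i, cs in enumerate(s, 1):
        cur = [float(i)]
        for j, ct in enumerate(t, 1):
            cur.append(min(
                prev[j] + 1.0,                                   # deletion
                cur[j - 1] + 1.0,                                # insertion
                prev[j - 1] + substitution_cost(graph, cs, ct),  # substitution
            ))
        prev = cur
    return prev[-1]


graph = build_keyboard_graph()
print(keyboard_edit_distance("linkedin", "linkedim", graph))  # 0.5 (n and m are adjacent)
print(keyboard_edit_distance("linkedin", "linkedip", graph))  # 1.0 (n and p are far apart)
```

Keeping the layout in a graph means the cost model can change (for example, to a different keyboard) without touching the edit-distance routine.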
Most of all, you will learn the why and why not of things, which I will keep coming back to as I discuss the problems above. This, I feel, is the most important differentiator between coding well and coding for scale, and it helps build a structured thought process for problem solving.
By the end of the session, you will have seen a blend of tools: Python (algorithms, data mining & graph theory) and Python, R & Inkscape (graphical representation & visualization).
The session will be heavy on algorithms and on thinking in data structures, so to make the most of it you need a background or interest in computer science, quantitative rigor, some hands-on coding experience, and a mind that wants to learn more.
Base operating system: any *nix flavor (I am on Oracle VirtualBox with Ubuntu 12.04)
Python 2.7 with some data-mining-specific modules installed (will update the exhaustive list shortly)
Preferable, but not mandatory: R & Inkscape installed. We use these tools only for representation and visualization of the problem statement; Python is the base tool we will be working with. Some head start with Python would help you make the most of the session.
Ekta is a Sr. Analytics Consultant with the data sciences team at 7, Innovation labs. She has a background in Quantitative Economics (MS) from Goethe University, Frankfurt and Computer Science (BE) from PESIT, Bangalore, and enjoys monetizing and leveraging technology to solve abstract business problems.
While at graduate school she became passionately interested in rationality, framing problems and how we humans respond to ambiguous choices, something she weaves into her technical work with scientific rigor in the data mining context, by thinking about the process that generated the data in the first place.
Her current profile with 7, Inc involves end-to-end solutioning, statistical analysis and deployment of analytic models for e-commerce clients, and designing intuitive customer experiences.
Comprehensive GitHub links to the 4 problems I intend to speak about:
Experiments in data mining: http://bit.ly/12IFUfM
A QWERTY keyboard simulator for custom distance in hand-crafted datasets: http://bit.ly/1bOgVeH
Scoring your LinkedIn connections: http://bit.ly/1e0ftHi
Developing your near-real time performance benchmark platform