Experiments in data mining, entity disambiguation and how to think data-structures for designing beautiful algorithms

by Ekta Grover (speaking)

Section: Scientific Computing
Technical level: Intermediate

Objective

This session is about 3 main things –

How to “think” about data mining problems and quickly try out different algorithmic approaches (fail cheap, if you fail at all)
Three problems in data mining around disambiguation, natural language processing & information retrieval
Measuring your model's performance: developing a near-real-time performance aggregation platform

And putting it all together: building the problem structure (think data structures), analyzing, visualizing and, hopefully, monetizing.

The audience will walk away with concrete ideas on building their own custom disambiguation algorithms & distance metrics.
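
As a rough illustration of what a custom distance metric can look like, here is a minimal sketch (not the code from the talk) that models a keyboard as a graph and uses hop counts between keys to make substitutions of nearby keys cheaper in an edit distance. The simplified three-row layout, the function names and the `far` cut-off are all assumptions made for this example.

```python
from collections import deque

# Assumed, simplified QWERTY layout: three letter rows, row stagger ignored.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def build_keyboard_graph(rows=ROWS):
    """Return an adjacency dict: key -> set of physically neighbouring keys."""
    position = {}
    for r, row in enumerate(rows):
        for c, key in enumerate(row):
            position[key] = (r, c)
    graph = {key: set() for key in position}
    for key, (r, c) in position.items():
        for other, (r2, c2) in position.items():
            # Keys touch if they are at most one row and one column apart.
            if key != other and abs(r - r2) <= 1 and abs(c - c2) <= 1:
                graph[key].add(other)
    return graph


def key_distance(graph, a, b):
    """Number of hops between two keys, found by breadth-first search."""
    if a == b:
        return 0
    seen = {a}
    queue = deque([(a, 0)])
    while queue:
        key, hops = queue.popleft()
        for nxt in graph[key]:
            if nxt == b:
                return hops + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    raise ValueError("keys are not connected")


def keyed_edit_distance(graph, s, t, far=4):
    """Edit distance in which substituting nearby keys costs less than 1.

    Insertions and deletions cost 1. A substitution costs hops / far,
    capped at 1 (the cap `far` is an arbitrary choice for this sketch).
    """
    prev = [float(j) for j in range(len(t) + 1)]
    for i, cs in enumerate(s, 1):
        cur = [float(i)]
        for j, ct in enumerate(t, 1):
            if cs == ct:
                sub = 0.0
            elif cs in graph and ct in graph:
                sub = min(key_distance(graph, cs, ct), far) / float(far)
            else:
                sub = 1.0  # characters outside the layout get the full cost
            cur.append(min(prev[j] + 1.0,       # delete cs
                           cur[j - 1] + 1.0,    # insert ct
                           prev[j - 1] + sub))  # substitute cs -> ct
        prev = cur
    return prev[-1]
```

With this sketch, a likely slip such as "grober" for "grover" (neighbouring keys) scores closer to the original than "groper", which a plain edit distance would treat as equally far.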

Description

This session is for people who would like to move from pure-play analytics to heavy data lifting and data engineering applications. With a background in computer science and quantitative economics, I blend these two approaches to get quick wins (and fail cheaper) while solving some of the most interesting problems around us today, for fun and profit.

The four problems I will discuss are around –

Scoring your connections to find your net worth on LinkedIn [Data structures, Algorithms, Data mining & Visualization]
Building your custom distance metrics (with graphs as the base data structure) and finger keying, with applications to hand-collected datasets that lend themselves to distance metrics entirely different from those traditionally explored [Algorithms]
Building relevant job feeds in LinkedIn (TF-IDF) and (possibly) hacking around job applications [Relevance algorithms, Linguistic pre-processing & Visualization]
Developing a custom batch aggregation platform with competing system goals [Developing scoring metrics, concept of centrality & formal constructs to measure recency]
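
To make the TF-IDF idea behind the job-feed problem concrete, here is a minimal sketch (an illustration under stated assumptions, not the implementation from the talk). It ranks job descriptions against a profile by cosine similarity of TF-IDF vectors. The tokenizer, the `tf * log(N / df)` weighting variant and all function names are assumptions chosen for brevity.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters; a stand-in for real pre-processing."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = float(len(tokens)) or 1.0
        # Terms that occur in every document get weight log(1) = 0.
        vectors.append({t: (c / total) * math.log(n / float(df[t]))
                        for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def rank_jobs(profile, jobs):
    """Return job indices ordered from most to least similar to the profile."""
    vectors = tfidf_vectors([profile] + list(jobs))
    scores = [cosine(vectors[0], v) for v in vectors[1:]]
    return sorted(range(len(jobs)), key=lambda i: scores[i], reverse=True)
```

Calling `rank_jobs(profile_text, list_of_job_texts)` returns positions into the job list, best match first. A real feed would likely add stop-word removal and stemming before weighting.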

Most of all, you will learn the why and why not of things, which I will keep coming back to as I discuss the problems above. This, I feel, is the most important differentiator between coding well and coding for scale, and it helps build a structured thought process for problem solving.
By the end of the session, you will have seen a blend of tools: Python (algorithms, data mining & graph theory), and Python, R & Inkscape (graphical representation & visualization).

The session will be heavy on algorithms and on thinking in data structures, so to make the most of it you need a background or interest in computer science, quantitative rigor, some hands-on coding experience, and a mind that wants to learn more.

Requirements

Base operating system: any *nix flavor (I am on Oracle VirtualBox with Ubuntu 12.04)
Python 2.7 with some data-mining-specific modules installed (will update the exhaustive list shortly)

Preferable, but not mandatory: R & Inkscape installed. We use these tools only for representation and visualization of the problem statement. Python is the base tool we will be working with. Some head start with Python would be good, to make the most of the session.

Speaker bio

Ekta is a Sr. Analytics Consultant with the data sciences team at [24]7 Innovation Labs. She has a background in Quantitative Economics (MS) from Goethe University, Frankfurt and Computer Science (BE) from PESIT, Bangalore, and enjoys monetizing and leveraging technology to solve abstract business problems.

While at graduate school, she became passionately interested in rationality, framing problems and how we humans respond to ambiguous choices. She weaves this into the technical dimensions of data mining with scientific rigor, by thinking about the process that generated the data in the first place.

Her current profile with [24]7, Inc. involves end-to-end solutioning, statistical analysis and deployment of analytic models for e-commerce clients, and designing intuitive customer experiences.

Digital footprint (technical & professional):
LinkedIn: http://www.linkedin.com/in/ektagrover
Quora: http://www.quora.com/Ekta-Grover

Comprehensive links on GitHub to the 4 problems I intend to speak about:
Experiments in data mining: http://bit.ly/12IFUfM
A QWERTY keyboard simulator for custom distance in hand-crafted datasets: http://bit.ly/1bOgVeH
Scoring your LinkedIn connections: http://bit.ly/1e0ftHi
Developing your near-real-time performance benchmark platform: http://bit.ly/1fpM9I8

Slides

http://www.slideshare.net/ekta1007/pycon-2013-experiments-in-data-mining-entity-disambiguation-and-how-to-think-datastructures-for-designing-beautiful-algorithms

Links

Uploaded preliminary slides. Will update them over the coming days. This will be a broad skeleton of what we will cover, with some richer content, resources & links.

Comments

nidhi 585 days ago

very nice:)

Ekta Grover 549 days ago

Uploaded my preliminary slides - looking for some constructive feedback & suggestions.

Ekta Grover 544 days ago

Looks like the embed of the latest slides at Slideshare is not reflecting at this page - here's the link to the latest version-
http://www.slideshare.net/ekta1007/pycon-2013-experiments-in-data-mining-entity-disambiguation-and-how-to-think-datastructures-for-designing-beautiful-algorithms
