Real-time processing of high-velocity social media data streams with Apache Storm

Sarthak Dev (~sarthak)


Votes: 19

Description:

Objective

  • Introduce real-time stream computation in Python and show how it can be a better alternative to Python worker-queue frameworks such as Celery/RabbitMQ.
  • Insights on high-velocity social stream computation for analytics

Analytics and social-media

For real-time analytics, it is imperative to process data faster than it is consumed, and with social media even more so. There are many task and messaging queue options, e.g. Celery and RabbitMQ. Both are broker-centric and maintain state for every event that happens, which adds an extra dependency layer. Storm, however, works entirely within its own layer when it comes to consuming and processing tasks. This makes it more lightweight and easier to maintain.

Apache Storm

Storm is an open-source, distributed "real-time" computing system. It processes data as it is received and is often seen as the real-time counterpart to Hadoop's batch processing. It is based on a "topology" architecture built from two kinds of computing units: Spouts and Bolts. Spouts generate the data and feed it to Bolts, which process it and may or may not feed it to further Bolts. Storm is scalable, guarantees the processing of any data that has been generated, is simple to set up and can be used with any language.
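To make the Spout and Bolt roles concrete, here is a minimal single-process sketch in plain Python. It does not use the Storm or streamparse APIs, and it provides none of Storm's distribution or processing guarantees. The function names and the sample posts are hypothetical, chosen only for illustration: a spout emits posts, one bolt extracts hashtags, and a second bolt keeps a running count.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

# Hypothetical sample data standing in for a live social-media feed.
SAMPLE_POSTS = [
    "Loving #python at the conference",
    "Stream processing with #storm and #python",
    "No hashtags here",
]


def post_spout(posts: Iterable[str]) -> Iterator[str]:
    """Spout: the source of the stream; emits one post at a time."""
    for post in posts:
        yield post


def hashtag_bolt(stream: Iterable[str]) -> Iterator[str]:
    """Bolt: extracts lower-cased hashtags from each post and emits them."""
    for post in stream:
        for word in post.split():
            if word.startswith("#") and len(word) > 1:
                yield word.lower()


def count_bolt(stream: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Bolt: keeps a running count and emits (hashtag, count) per update."""
    counts: Counter = Counter()
    for tag in stream:
        counts[tag] += 1
        yield tag, counts[tag]


def run_topology(posts: Iterable[str]) -> List[Tuple[str, int]]:
    """Wire spout -> hashtag_bolt -> count_bolt and collect the updates."""
    return list(count_bolt(hashtag_bolt(post_spout(posts))))


if __name__ == "__main__":
    for tag, count in run_topology(SAMPLE_POSTS):
        print(tag, count)
```

Because each stage is a generator, every post flows through the whole chain as soon as it is emitted, which mirrors the "process as it arrives" idea. In a real Storm topology the stages would run as separate, parallel tasks across a cluster.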

USP: Twitter open-sourced this project and uses it for their analytics; it is now developed under the Apache Software Foundation.

Prerequisites:

There isn't much in terms of pre-requisites, but awareness of how social-media works would be great. Knowledge about Twitter/Facebook APIs would be an added bonus.

Content URLs:

I don't have slides currently but would put them up ASAP.

Speaker Info:

I am Sarthak, a backend engineer with Airwoot. I've been working full-time with Python for the last two and a half years and have always had an inclination towards data science and making sense of open data. Yes, http://data.gov happens to be one of my favourite sites. I'm extremely active on Twitter and, during big events, often open up my Python console to take a quick look at how they're spreading.

At Airwoot, we work with social data on a daily basis, and feedback from our clients has led us to believe analytics is one of our stronger suits. Our tech stack is completely Python, with databases spread across PostgreSQL and MongoDB. We have been using Apache Storm for our analytics over the last few months and have derived fantastic value from it.

Having worked with start-ups for close to 3 years and often having shouldered the responsibility of being the data guy, I believe this is an exciting opportunity to share my experience with what I believe is a rather fabulous piece of technology.

Speaker Links:

You can find me on GitHub and Twitter.

Section: Data Visualization and Analytics
Type: Talks
Target Audience: Intermediate
Last Updated:

@sarthak: Thank you for your proposal. Could you please write down a rough agenda for your talk? Also, assuming you are writing your topologies in Python, please mention the libraries you plan to use.

Looking forward to it.

konark modi (~konark)

AFAIK, it is now an Apache Software Foundation project, so you might want to update the abstract to say that instead of Twitter.

konark modi (~konark)

@konark: The library I am using is streamparse by the guys at Parse.ly.

The rough agenda would be to take the audience through Storm, its power and capabilities, and then run through a Python way to do this. Maybe even include a small demo.

Also, I have mentioned Apache Storm itself throughout. Twitter now owns this project, by the way.

Sarthak Dev (~sarthak)
