Real-time processing of high-velocity social media data streams with Apache Storm
Sarthak Dev (~sarthak)
- Introduce real-time stream computation in Python and show how it can be a better alternative to Python worker-queue setups such as Celery/RabbitMQ.
- Insights on high-velocity social stream computation for analytics
Analytics and social-media
For real-time analytics, processing has to keep pace with the rate at which data arrives, and with social media this matters even more. There are many task and message queue options, e.g. Celery and RabbitMQ. Setups built on them are broker-centric and keep state for every event that passes through, which adds an extra dependency layer. Storm, however, handles the consumption and processing of tasks entirely within its own layer, which makes it more lightweight and easier to maintain.
Storm is an open-source, distributed, real-time computation system. It processes data as it arrives and is often seen as the real-time counterpart to Hadoop's batch processing. It is built around a "topology": a graph of two kinds of computing units, spouts and bolts. Spouts are the sources of the data stream and feed it to bolts, which process it and may or may not pass results on to further bolts. Storm is scalable, guarantees that every piece of data emitted gets processed, is simple to set up, and can be used with any language.
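To make the spout/bolt idea concrete, here is a minimal single-process sketch in plain Python. It does not use Storm or any Storm client library; it only mimics the data flow of a topology with generators. The names `tweet_spout`, `hashtag_bolt`, `count_bolt` and `run_topology` are hypothetical and chosen for illustration, and the "tweets" are hard-coded strings rather than data from a real API.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def tweet_spout(tweets: Iterable[str]) -> Iterator[str]:
    """Spout stand-in: the source of the stream, emitting one tweet at a time."""
    for tweet in tweets:
        yield tweet


def hashtag_bolt(stream: Iterable[str]) -> Iterator[str]:
    """Bolt stand-in: receives tweets and emits each hashtag, lower-cased."""
    for tweet in stream:
        for word in tweet.split():
            if word.startswith("#") and len(word) > 1:
                yield word.lower()


def count_bolt(stream: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Bolt stand-in: keeps a running count and emits (hashtag, count) per item."""
    counts: Counter = Counter()
    for tag in stream:
        counts[tag] += 1
        yield tag, counts[tag]


def run_topology(tweets: Iterable[str]) -> Dict[str, int]:
    """Wire spout -> hashtag_bolt -> count_bolt and return the latest counts."""
    latest: Dict[str, int] = {}
    for tag, count in count_bolt(hashtag_bolt(tweet_spout(tweets))):
        latest[tag] = count  # each item is handled as soon as it is emitted
    return latest


if __name__ == "__main__":
    # Hard-coded sample input (an assumption; a real spout would read a live feed).
    sample = ["Loving #PyCon and #python", "#Python streams", "no tags here"]
    print(run_topology(sample))
```

Each item flows through the whole chain as soon as the spout emits it, which is the processing model described above. What this sketch leaves out is everything Storm adds on top: distribution across machines, parallel instances of each bolt, and guaranteed processing of emitted data.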
USP: Twitter open-sourced Storm and has used it for its analytics; the project is now maintained under the Apache Software Foundation.
There isn't much in terms of prerequisites, but an awareness of how social media works would be great. Knowledge of the Twitter/Facebook APIs would be an added bonus.
I don't have slides yet, but I will put them up ASAP.
I am Sarthak, a backend engineer with Airwoot. I've been working full-time with Python for the last two and a half years and have always had an inclination towards data science and making sense of open data. Yes, http://data.gov happens to be one of my favourite sites. I'm extremely active on Twitter and, during big events, often open up my Python console to take a quick look at how they're spreading.
At Airwoot, we work with social data on a daily basis, and feedback from our clients has led us to believe analytics is one of our stronger suits. Our tech stack is entirely Python, with databases spread across PostgreSQL and MongoDB. We have been using Apache Storm for our analytics over the last few months and have derived fantastic value from it.
Having worked with start-ups for close to 3 years, and having often shouldered the responsibility of being the data guy, I believe this is an exciting opportunity to share my experience with what I believe is a rather fabulous piece of technology.