Scalable Big Data solutions using Lambda Architecture
Navya Agarwal (~navya)
In this digital age, we are generating data at an unprecedented rate. But generating data is not the same as curating knowledge. To extract useful insights from the data and to tame the three Vs of data (Volume, Velocity and Variety), we need to rethink our tools and design principles.
There are two orthogonal approaches to solving this problem. One approach is to use a new set of tools:
- NoSQL Databases - MongoDB, Cassandra, HBase
- Highly Scalable Message Queues - Kafka
- Distributed filesystems - HDFS
- MapReduce Paradigm - Hadoop, Spark
The other, more fundamental line of thought is to innovate on the underlying architecture itself. In this line of innovation, we have an alternative paradigm for Big Data computation - the Lambda Architecture.
The Lambda Architecture is a generic, scalable and fault-tolerant data processing architecture that goes beyond any specific set of tools or libraries. As a concrete example, PySpark provides abstractions which make it very easy to write big data applications. Alternatively, one may build their own big data application using components like Airflow, PyAkka etc. The idea of this talk is to introduce the architecture itself and make it easy for people to understand and relate to it.
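To give a flavour of the design: an append-only master dataset feeds a batch layer that periodically recomputes views from scratch, a speed layer covers the most recent events incrementally, and a serving layer merges both at query time. The following is a minimal, illustrative sketch in plain Python of a toy page-view counter (all class and field names are ours, not from any library; a real deployment would use e.g. HDFS/Spark for the batch layer and Kafka plus a stream processor for the speed layer):

```python
from collections import Counter

class BatchLayer:
    """Recomputes views from the immutable master dataset (high latency)."""
    def __init__(self):
        self.master_dataset = []      # append-only log of raw events
        self.batch_view = Counter()

    def append(self, event):
        self.master_dataset.append(event)

    def recompute(self):
        # Full recomputation from raw data: simple and fault-tolerant,
        # since any bug can be fixed by recomputing from the log.
        self.batch_view = Counter(e["page"] for e in self.master_dataset)

class SpeedLayer:
    """Incrementally updates a real-time view for not-yet-batched events."""
    def __init__(self):
        self.realtime_view = Counter()

    def update(self, event):
        self.realtime_view[event["page"]] += 1

    def reset(self):
        # Discarded once the batch layer has absorbed these events.
        self.realtime_view.clear()

class ServingLayer:
    """Answers queries by merging the batch and real-time views."""
    def __init__(self, batch, speed):
        self.batch, self.speed = batch, speed

    def query(self, page):
        return self.batch.batch_view[page] + self.speed.realtime_view[page]

batch, speed = BatchLayer(), SpeedLayer()
serving = ServingLayer(batch, speed)

# Events flow into both layers; queries always see batch + real-time.
for event in [{"page": "/home"}, {"page": "/home"}, {"page": "/docs"}]:
    batch.append(event)
    speed.update(event)

print(serving.query("/home"))  # 2 (served entirely from the real-time view)
batch.recompute()
speed.reset()
print(serving.query("/home"))  # 2 (now served entirely from the batch view)
```

The key property the sketch shows is that query results stay correct whether the data has been absorbed by the batch layer or still sits only in the speed layer.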
In this talk, we would cover the following aspects:
- Introduction and motivation
- Design philosophy behind Lambda Architecture
- Pros and Cons
- Alternatives to Lambda Architecture
We would deliver this talk as a dialogue between the two speakers to encourage thinking and brainstorming in the audience. In our experience, a conversational setting is more engaging and relatable for the audience.
The notes for the talk are available at https://github.com/shagunsodhani/Lambda-Architecture
The presentation (along with a detailed explanation) will be available soon.
I am a Machine Learning developer working with the Data Science and Analytics team at Adobe Systems. I have also been a teaching assistant for the 3-course series titled Data Science and Engineering with Spark XSeries, created in partnership with professors from University of California, Berkeley, University of California, Los Angeles and Databricks and offered on the edX platform.
I have good public speaking experience and have previously given talks at:
- PyCon 2016
- Big Data Training Program, IIT Roorkee
I am a polyglot developer working with the LiveFyre team at Adobe Systems. I currently look after the authentication and orchestration parts of the stack with the broad goal of optimizing the performance and scalability of the system. I am also looking at various language modeling use cases for our product. Over the past 2 years, I have dabbled with multiple tech stacks and have worked on various innovative ideas with different products. Prior to joining Adobe, I was a DAAD research scholar at Chemnitz University of Technology (Germany) and a gold medalist at MNNIT. I have also worked in the area of statistical machine translation at IIIT Hyderabad.
I am also delivering a talk on Encoder Decoder Systems at PyDataDelhi Conference in September.
I have been an active speaker at Adobe and have given tech talks on various topics, including:
- Quartz as the core scheduling service for our workflows
- Lambda Expression as the building blocks for our services