Machine learning on the run: Optimized feature engineering for streaming timeseries data in industrial automation



With the emergence of large-scale industrial IoT systems, large volumes of telemetry data are being collected and stored at high frequency. Many industrial processes are now automated with sensors and control systems which, together with manufacturing execution systems (MES), generate this data through a large number of sensors deployed on the production line. The data must be processed in near real time. In Industry 4.0 settings, the data is time series and the processing includes stream ingestion, ETL (Extract, Transform, Load) and data preprocessing for machine learning. Machine learning models are deployed for fault detection, predictive maintenance and root cause analysis, to name a few, and are often followed by complex event processing (CEP).

There are multiple challenges for data engineering in the case of large scale industrial IoT systems:

  1. Latency: For sampling intervals on the order of a few milliseconds, data has to be processed at a similar rate so that the lag between raw data intake and data processing stays minimal.
  2. Interdependence: Often the result of one model or transform depends on another. Different sampling rates or sudden changes in the data streams can therefore delay computation.
  3. Granularity: Data from multiple sensors arrives at varying granularity. The feature engineering system has to account for this.
  4. Volume of data: The volume of data is typically very large. It can easily reach 500M data points a day from hundreds of thousands of deployed sensors, with 10K to 30K machine learning models running in real time.
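To make the volume and granularity points concrete, here is a minimal sketch using only the Python standard library. It computes the average ingest rate implied by 500M data points a day, and aligns two sensors sampled at different intervals onto a common time grid by forward-filling the last known value. The function names, sample values and the forward-fill strategy are illustrative assumptions, not part of the library discussed in the talk.

```python
from bisect import bisect_right

SECONDS_PER_DAY = 24 * 60 * 60


def average_points_per_second(points_per_day):
    """Average ingest rate implied by a daily data-point count."""
    return points_per_day / SECONDS_PER_DAY


def align_forward_fill(samples, grid):
    """Resample sorted (timestamp_ms, value) pairs onto `grid`.

    Each grid point takes the most recent value at or before it.
    Grid points earlier than the first sample map to None.
    """
    timestamps = [t for t, _ in samples]
    aligned = []
    for g in grid:
        idx = bisect_right(timestamps, g) - 1
        aligned.append(samples[idx][1] if idx >= 0 else None)
    return aligned


# 500M points a day, spread evenly, is roughly 5.8K points per second.
rate = average_points_per_second(500_000_000)

# Hypothetical sensors: one sampled every 5 ms, one every 20 ms.
fast = [(0, 1.0), (5, 1.1), (10, 1.2), (15, 1.3), (20, 1.4)]
slow = [(0, 70.0), (20, 71.0)]
grid = [0, 10, 20]  # common 10 ms grid

fast_on_grid = align_forward_fill(fast, grid)
slow_on_grid = align_forward_fill(slow, grid)
```

The average rate hides bursts, so a real system would need headroom above this figure. Forward fill is only one possible alignment policy; interpolation or windowed aggregation may suit some signals better.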

In this talk, the speaker will discuss a crucial component of data preparation for machine learning deployments: one that optimizes data processing so that the data is ready for machine learning or complex event processing. The central idea is to use extended semantic graphs with parallel computation of independent transformations to achieve higher throughput on the data stream. Semantic graphs are well suited to feature engineering because of their advantages in code refactoring and traceability, to name a few. Adding parallelism lets independent transformations run on parallel threads, which reduces the time required for data preprocessing. The library is built on top of Yahoo's open source graphkit library, which is for creating lightweight computation graphs.
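The idea of running independent transformations in parallel within a computation graph can be sketched as follows. This is not the speaker's library and does not use the graphkit API; it is a standard-library illustration (Python 3.9+ for `graphlib`) in which the node names, the example features and the `run_graph` helper are all hypothetical. Nodes whose inputs are already available are submitted to a thread pool together, and dependent nodes run once their inputs are computed.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter
from statistics import mean

# Each node: output name -> (function, names of the inputs it needs).
# "rolling_mean" and "peak" are independent; "crest_factor" needs both.
NODES = {
    "rolling_mean": (lambda raw: mean(raw[-3:]), ["raw"]),
    "peak": (lambda raw: max(raw), ["raw"]),
    "crest_factor": (
        lambda peak, rolling_mean: peak / rolling_mean,
        ["peak", "rolling_mean"],
    ),
}


def run_graph(nodes, inputs, max_workers=4):
    """Evaluate all nodes, running mutually independent ones concurrently."""
    values = dict(inputs)
    # Keep only edges between computed nodes; external inputs are already known.
    deps = {
        name: {n for n in needs if n in nodes}
        for name, (_, needs) in nodes.items()
    }
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Every node returned here has all of its dependencies computed.
            ready = sorter.get_ready()
            futures = {
                name: pool.submit(
                    nodes[name][0], *(values[n] for n in nodes[name][1])
                )
                for name in ready
            }
            for name, future in futures.items():
                values[name] = future.result()
                sorter.done(name)
    return values


result = run_graph(NODES, {"raw": [1.0, 2.0, 3.0, 6.0]})
```

Because each feature is a named node with declared inputs, the path from a raw signal to a derived feature stays traceable. Note that in CPython, threads mainly help when the transforms release the global interpreter lock (for example I/O or NumPy operations); pure-Python CPU-bound transforms would need processes instead.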

Basic Outline of the talk:

  1. (1 min): Discussing the agenda and speaker intro
  2. (2-3 mins): Current challenges with processing data in the case of industry 4.0
  3. (3-4 mins): Brief description of semantic graphs
  4. (9-10 mins): How semantic graphs help in data processing and the advantages they bring over conventional data processing
  5. (5-6 mins): Why parallelization is required
  6. (1-2 mins): Challenges and limitations of this approach
  7. (1-2 mins): Key takeaways and conclusion
  8. (4-5 mins): Audience questions

This project was inspired by the research paper:

Prerequisites:

  1. Basic understanding of graph data structures
  2. Interest in IoT applications

Content URLs:


The slides will be updated soon with more detailed, explanatory content.

Speaker Info:

Mayank Prasoon is currently a software developer at, where he works on better and faster ways to deploy industrial automation solutions. He graduated from IIT BHU and has represented his college at the Microsoft research center in Hyderabad with his project "Physiotherapy using motion sensored game with live tracking" for the hackathon "code fun do". He was a hobbyist developer during college, working on various hackathons and on projects in web development, computer vision and machine learning. In his free time, he loves playing the ukulele.

Speaker Links:



YouTube link for Microsoft presentation:

Id: 1270
Section: Data Science, Machine Learning and AI
Type: Talks
Target Audience: Intermediate
Last Updated: