PyVelox: Interfacing Python bindings for the unified execution engine by Meta
Sanjiban Sengupta (~sanjiban)
Description:
Velox, an open-source project by Meta, is a C++ database acceleration library which provides high-performance components for processing huge datasets. Voltron Data, in collaboration with the Meta open-source team, has been developing PyVelox, a Python package that adds bindings to commonly used Velox APIs. This addition empowers Velox developers to leverage Python’s interactive REPL, enabling them to efficiently explore and triage Velox vectors and associated data components. Consequently, PyVelox enables Python developers to execute data queries, including SQL queries, across a wide spectrum of workloads such as batch processing, stream processing, AI/ML, and more.
Velox stands out as a unified execution engine that integrates with diverse data compute architectures. This integration minimizes redundancy while extending consistent functionality across frameworks. Currently, Velox is used in engines such as Presto and Apache Spark, alongside Meta's internal streaming service XStream, with integration plans underway for Apache Flink. Previously, for a data compute architecture built on Spark or Presto, the user needed the respective engine to execute data queries. Velox instead allows the same execution engine to be used for both, unifying the process for any data compute operation. This unification not only reduces complexity but also ensures universal semantics throughout the entire data lifecycle, so that features generated during ad hoc training or online execution remain consistent.
In this talk, we aim to briefly discuss Velox, its philosophy and methodology. Following this, we shall move to PyVelox, its data types, expressions, and functionalities, thus demonstrating the simplicity of running database queries on Velox using Python APIs without losing efficiency.
Outline:
- Velox Incubation
- Open-sourced by Facebook in late 2021.
- Velox development over the years
- Need for PyVelox
- PyVelox Developments
- Data types
- Expressions
- Serialization-deserialization
- Conversion to-and-from Apache Arrow
- Type and function signatures
- Demo
- PyVelox future goals
Prerequisites:
Knowledge of data engineering and analytics will be helpful. The project is an execution engine that can be integrated into a data compute architecture, so experience with SQL or data relational queries will be beneficial. The talk will include topics on Python bindings based on pybind11, so Python and intermediate C++ knowledge is expected.
Speaker Info:
Sanjiban works as a Software Engineer with the Data Engineering team at Voltron Data. His work primarily focuses on the development of open-source projects such as Apache Arrow, Substrait, and Velox by Meta. He co-created Substrait Fiddle, an online tool to prototype, visualize, and share data relational queries based on the Substrait specification. As a part of Voltron Data, he collaborated with the Meta open-source team to develop PyVelox, in particular implementing support for Arrow-Velox conversion, complex data types, etc.
Sanjiban has been working in the open-source data science and engineering domain since his junior year of college in 2021. He was accepted to participate in Google Summer of Code 2021 for CERN-HSF, where he worked on developing storage functionalities for deep learning models. A year later, he was selected to participate in the CERN Summer Student Program in Geneva, Switzerland, and worked on enhancing TMVA SOFIE, a fast machine learning inference engine by CERN. In SOFIE, he was particularly involved in the development of the Keras and PyTorch parsers, machine learning operators based on ONNX standards, Graph Neural Networks support, etc. Moreover, he volunteered as a Mentor for the contributors of Google Summer of Code 2022, and again in 2023, and for the CERN Summer Students of 2023 working on CERN’s ROOT Data Analysis Project.
Sanjiban finds hackathon and ideation events very interesting, and has participated in many of them at different levels. Previously, he worked with various startups as well as corporations, gaining industry experience. During college, he served as the Vice Chair, and then the Chair, of the ACM Student Chapter of IIIT Bhubaneswar. He also served as the ML Head of various student technical societies.
His work on CERN's TMVA SOFIE Machine Learning Inference Engine has been published/presented as follows:
- Moneta L., Sengupta S., Hamdan A. "New developments of TMVA/SOFIE: Code Generation and Fast Inference for Graph Neural Networks". Oral Presentation at 26th International Conference on Computing in High Energy & Nuclear Physics; May 2023; Virginia, USA.
- Sitong An, Sanjiban Sengupta et al. C++ Code Generation for Fast Inference of Deep Learning Models in ROOT/TMVA. 2023 Journal of Physics: Conference Series 2438 012013
- An S., Moneta L., Sengupta S., Hamdan A., Shah N., Shende H., Mittal S., Zapata O. "ROOT Machine Learning Ecosystem for Data Analysis". Poster presented at 21st International Workshop on Advanced Computing and Analysis Techniques in Physics Research; October 2022; Bari, Italy.
- An S., Moneta L., Sengupta S., Hamdan A., Sossai F., Saxena A. "SOFIE: C++ Code Generation from ROOT/TMVA for Fast Deep Learning Inference". Poster presented at 20th International Workshop on Advanced Computing and Analysis Techniques in Physics Research; November 2021; Daejeon, South Korea.
- Sengupta S. "TMVA SOFIE: Enhancing the Machine Learning Inference Engine". A report published for the CERN Summer Student Program; December 2022; Geneva, Switzerland.
Speaker Links:
Past talks:
TMVA SOFIE: Developing the Machine Learning Inference Engine
CERN Student Sessions 2022, Geneva; August 2022
Link to talk
ROOT Storage of Deep Learning models in TMVA
CERN-HSF’s GSoC 2021 End of Program Presentation Series; August 2021
Link to talk