PyVelox: Interfacing Python bindings for the unified execution engine by Meta

Sanjiban Sengupta (~sanjiban)



Description:

Velox, an open-source project by Meta, is a C++ database acceleration library that provides high-performance components for processing large datasets. Voltron Data, in collaboration with the Meta open-source team, has been developing PyVelox, a Python package that adds bindings to commonly used Velox APIs. These bindings let Velox developers use Python’s interactive REPL to explore and triage Velox vectors and associated data components efficiently. They also let Python developers execute data queries, including SQL queries, across a wide range of workloads such as batch processing, stream processing, and AI/ML.

Velox stands out as a unified execution engine that integrates with diverse data compute architectures. This integration minimizes redundancy while extending consistent functionality across frameworks. Currently, Velox is used in engines such as Presto and Apache Spark, as well as Meta's internal streaming service XStream, with integration plans underway for Apache Flink. Previously, a data compute architecture built on Spark or Presto needed that system's own engine to execute data queries. Velox makes it possible to use the same execution engine for both, unifying the process across data compute operations. This unification not only reduces complexity but also ensures the same semantics throughout the entire data lifecycle, so features generated during ad hoc training or online execution remain consistent.
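The idea of several front ends sharing one execution engine can be illustrated with a small toy model. This is only a sketch under stated assumptions: every name in it (Plan, FUNCTIONS, execute, sql_frontend, dataframe_frontend) is hypothetical and none of it is Velox or PyVelox API. It shows two different entry points lowering to the same plan and being evaluated by the same function registry, which is why their results agree.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical toy model: none of these names come from Velox or PyVelox.


@dataclass(frozen=True)
class Plan:
    """A tiny engine-neutral plan: apply one named function to one column."""
    function: str
    column: str


# The single shared "engine": one registry of function implementations.
FUNCTIONS: Dict[str, Callable[[int], int]] = {
    "double": lambda x: x * 2,
    "negate": lambda x: -x,
}


def execute(plan: Plan, table: Dict[str, List[int]]) -> List[int]:
    """Evaluate a plan against an in-memory table of integer columns."""
    fn = FUNCTIONS[plan.function]
    return [fn(value) for value in table[plan.column]]


def sql_frontend(query: str) -> Plan:
    """Lower a query of the exact form 'SELECT fn(col)' to a Plan."""
    body = query.strip()[len("SELECT "):]
    function, rest = body.split("(", 1)
    return Plan(function.strip(), rest.rstrip(")").strip())


def dataframe_frontend(column: str, function: str) -> Plan:
    """Lower a dataframe-style call to the same Plan representation."""
    return Plan(function, column)


table = {"a": [1, 2, 3]}
from_sql = execute(sql_frontend("SELECT double(a)"), table)
from_df = execute(dataframe_frontend("a", "double"), table)
print(from_sql, from_df)
```

Because both front ends resolve "double" through the same registry, a change to that function's behavior reaches every front end at once, which is the consistency property the paragraph above describes.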

In this talk, we briefly discuss Velox, its philosophy, and its methodology. We then move to PyVelox, covering its data types, expressions, and functionality, and demonstrate how simple it is to run database queries on Velox through Python APIs without losing efficiency.


Outline:

  • Velox Incubation
    • Open-sourced by Facebook in late 2021.
  • Velox development over the years
  • Need for PyVelox
  • PyVelox Developments
    • Data types
    • Expressions
    • Serialization-deserialization
    • Conversion to-and-from Apache Arrow
    • Type and function signatures
  • Demo
  • PyVelox future goals

Prerequisites:

Knowledge of data engineering and analytics will be helpful. The project is an execution engine that can be integrated into a data compute architecture, so experience with SQL or relational data queries will be beneficial. The talk covers Python bindings based on pybind11, so Python and intermediate C++ knowledge is expected.

Speaker Info:

Sanjiban works as a Software Engineer with the Data Engineering team at Voltron Data. His work primarily focuses on the development of open-source projects such as Apache Arrow, Substrait, and Velox by Meta. He co-created Substrait Fiddle, an online tool to prototype, visualize, and share relational data queries based on the Substrait specification. As a part of Voltron Data, he collaborated with the Meta open-source team to develop PyVelox, in particular implementing support for Arrow-Velox conversion and complex data types.

Sanjiban has been working in the open-source data science and engineering domain since his junior year of college in 2021. He was accepted to Google Summer of Code 2021 with CERN-HSF, where he worked on developing storage functionalities for deep learning models. A year later, he was selected for the CERN Summer Student Program in Geneva, Switzerland, and worked on enhancing TMVA SOFIE, a fast machine learning inference engine by CERN. In SOFIE, he was particularly involved in developing the Keras and PyTorch parsers, machine learning operators based on ONNX standards, and Graph Neural Network support. He also volunteered as a mentor for Google Summer of Code contributors in 2022 and 2023, and for the CERN Summer Students of 2023 working on CERN’s ROOT Data Analysis Project.

Sanjiban finds hackathons and ideation events very interesting and has participated in many of them at different levels. He has previously worked with various startups as well as corporations, gaining industry experience. During college, he served as the Vice Chair, and then the Chair, of the ACM Student Chapter of IIIT Bhubaneswar. He also served as the ML Head of various student technical societies.

His work on CERN's TMVA SOFIE Machine Learning Inference Engine has been published/presented as follows:

Speaker Links:

Past talks:

  • TMVA SOFIE: Developing the Machine Learning Inference Engine
    CERN Student Sessions 2022, Geneva; August 2022
    Link to talk

  • ROOT Storage of Deep Learning models in TMVA
    CERN-HSF’s GSoC 2021 End of Program Presentation Series; August 2021
    Link to talk

GitHub Profile
LinkedIn Profile
Speaker's Personal website

Section: Data Science, AI & ML
Type: Talks
Target Audience: Intermediate
Last Updated: