Hangar: Git for your data

Sherin Thomas (~hhsecond)


Description:

Software development is entering an era where the behavior of programs critically depends on the data they were trained on. In this setting, data is the new source code, which raises new challenges such as versioning and collaboration on numerical data. Enter Hangar, an open-source tool by [tensor]werk that brings Git-style version control to n-dimensional arrays. It supports versioning, branching, merging, time travel, diffing, remote repositories and partial fetching, with data loaders for the major deep learning frameworks.

At its core, Hangar is designed to solve many of the same problems faced by traditional code version control systems (i.e. Git), just adapted for numerical data:

  • Time travel through the historical evolution of a dataset
  • Zero-cost branching to enable exploratory analysis and collaboration
  • Cheap merging to build datasets over time (with multiple collaborators)
  • Completely abstracted organization and management of data files on disk
  • Ability to retrieve only a small portion of the data (as needed) while still maintaining a complete historical record
  • Ability to push and pull changes directly to collaborators or a central server (i.e. a truly distributed version control system)

The ability of version control systems to perform these tasks for codebases is largely taken for granted by almost every developer today. However, we are in fact standing on the shoulders of giants: decades of engineering have produced these phenomenally useful tools. Now that a new era of “Data-Defined software” is taking hold, there is a strong need for analogous version control systems designed to handle numerical data at large scale. Welcome to Hangar, a version control system for your data, written entirely in Python.
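To make the Git analogy concrete, here is a minimal toy sketch of content-addressed array versioning, using only the Python standard library. It is not Hangar's API or its storage design: the `ToyRepo` class and every method name in it are hypothetical, chosen only to illustrate why branching can be nearly free (a branch is just a pointer to a commit) and how older versions stay readable.

```python
import hashlib
import json
from array import array


class ToyRepo:
    """Toy content-addressed store for numeric arrays (hypothetical, NOT Hangar's API)."""

    def __init__(self):
        self.blobs = {}                    # digest -> (shape, raw bytes)
        self.commits = {}                  # commit id -> {"parent", "tree", "message"}
        self.branches = {"master": None}   # branch name -> commit id
        self.staged = {}                   # sample name -> digest

    def add(self, name, values, shape):
        # Identical data hashes to the same digest, so it is stored only once.
        raw = array("d", values).tobytes()
        digest = hashlib.sha1(repr(shape).encode() + raw).hexdigest()
        self.blobs.setdefault(digest, (shape, raw))
        self.staged[name] = digest
        return digest

    def commit(self, branch, message):
        # A commit records its parent and a full name -> digest mapping.
        parent = self.branches[branch]
        tree = dict(self.commits[parent]["tree"]) if parent else {}
        tree.update(self.staged)
        record = {"parent": parent, "tree": tree, "message": message}
        commit_id = hashlib.sha1(json.dumps(record, sort_keys=True).encode()).hexdigest()
        self.commits[commit_id] = record
        self.branches[branch] = commit_id
        self.staged = {}
        return commit_id

    def branch(self, name, base):
        # Branching copies a pointer, never the data.
        self.branches[name] = self.branches[base]

    def read(self, commit_id, name):
        # "Time travel": read a sample as it existed at a given commit.
        shape, raw = self.blobs[self.commits[commit_id]["tree"][name]]
        values = array("d")
        values.frombytes(raw)
        return shape, list(values)


repo = ToyRepo()
repo.add("sample_0", [1.0, 2.0, 3.0, 4.0], shape=(2, 2))
first = repo.commit("master", "initial data")

repo.branch("experiment", base="master")
repo.add("sample_0", [1.0, 2.0, 3.0, 5.0], shape=(2, 2))
second = repo.commit("experiment", "change one value")
```

After this runs, `master` still points at `first`, `experiment` points at `second`, and both versions of `sample_0` remain readable through `repo.read`. A real system additionally needs merging, conflict handling, on-disk backends and remotes, which this sketch leaves out.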

Talk outline

  • What is data
  • Hangar fundamentals
    • Branching, merging, conflict management
    • Hangar data philosophy
  • Hangar backends: LMDB, HDF5, TileDB
  • Hangar storage: filesystem, S3, etc.
  • Versioning and time travel
  • Hangar remote
  • Hangar CLI
    • Performance
    • Import and export
    • Other operations
  • Python APIs

Prerequisites:

  • Basic understanding of existing numerical computing toolkits like NumPy
  • Thorough understanding of Git

Speaker Info:

I work on the development team at [tensor]werk, an infrastructure development company focusing on deep learning deployment problems. My team and I build open source tools for setting up a seamless deep learning workflow, including RedisAI and Hangar. I have been programming since 2012, started using Python in 2014 and moved to deep learning in 2015. I am an open source enthusiast and have contributed to the core of several widely used projects like PyTorch. I spend most of my research time on improving the interpretability of AI models using TuringNetwork. I have authored a deep learning book. I go by hhsecond on the internet.

Speaker Links:

  • https://github.com/hhsecond
  • https://medium.com/@hhsecond
  • https://www.amazon.in/dp/B078TLWD3F

Section: Data Science, Machine Learning and AI
Type: Talks
Target Audience: Intermediate
Last Updated: