Kandinsky: Using KMeans (and friends) to play with the colors of photograph(s)

shaurya shaurya3 (~shaurya3) | 09 Jun, 2024

Description:

Introduction and Overview

Clustering is tricky yet absolutely essential for many a Machine Learning initiative. The what, the how and the why confound us each time we look at the data, whether the task is customer segmentation (or cohort) analysis, finding centers of influence, or breaking down a population into groups to build a different model for each.

Studying clustering algorithms like K-Means using toy datasets is insufficient (and often tedious) because it does not let you experience real-world problems: for example, centroids that don't settle, or situations where we have too many or too few clusters. Which distance measure should we use, and when? How should we prepare (normalize? standardize?) the dataset for clustering?

Also, not too many real-world scenarios are "visual", unless we plot a graph or two, and that fails when we deal with higher dimensions.

What if we could use a non-trivial but visual data source? Like the colors and pixels of a photograph, where we could see the data that went in and the resultant output clusters?

In this talk I'll discuss a very special use case as a demonstration: cinema, and how I leveraged clustering to guarantee the integrity of color data throughout the process of shooting, grading and mastering a full-length feature film.

The obvious takeaways of this talk, in my experience, are that Data Science and Data Engineering practitioners gain a deeper understanding of what's going on in the clustering algorithms in a fun, very "visual" and engaging manner; and also build a better intuition about the best approach to take for solving a problem.

Outline

In this talk, we'll use photographs as "visual" input data by breaking the colors and pixels into individual 3-dimensional observations.

These will provide sufficiently varied data to make creating clusters a non-trivial yet easy-to-grasp pursuit.
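As a rough illustration of what "3-dimensional observations" can mean here, the sketch below flattens a tiny hand-made grid of RGB pixels into a list of points. The grid, the helper name `pixels_to_observations` and the scaling to the 0 to 1 range are all assumptions made for this example; the actual notebooks may load and prepare real photographs differently.

```python
# A tiny synthetic "photograph": 2 rows x 3 columns of (R, G, B) pixels.
# Assumption: a real notebook would read pixels from an image file instead.
image = [
    [(255, 0, 0), (250, 10, 5), (0, 0, 255)],
    [(5, 5, 250), (0, 255, 0), (10, 250, 10)],
]


def pixels_to_observations(image):
    """Flatten a grid of 8-bit RGB pixels into a list of 3-D points.

    Each channel is scaled from 0-255 to 0.0-1.0 (an illustrative choice,
    not necessarily the preparation used in the talk).
    """
    return [
        tuple(channel / 255.0 for channel in pixel)
        for row in image
        for pixel in row
    ]


observations = pixels_to_observations(image)
print(len(observations), "observations, each with", len(observations[0]), "dimensions")
```

Every pixel becomes one observation, so even a small photograph yields a large and varied dataset.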

Building K-Means from scratch this way also enables us to understand the pitfalls and best practices. For example: What should we do when the centroids do not converge? How do we prepare a large and varied dataset? How should we think about distance and measuring distances?

The talk will be largely interactive, using Jupyter Notebooks along with photographs pre-selected for the talk. However, fellow data scientists or engineers will be able to play with the notebooks using their own photographs as well.
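To make the convergence question concrete, here is a minimal from-scratch sketch of K-Means using only the Python standard library. It is not the code from the Kandinsky notebooks; the function names, the squared Euclidean distance, the `max_iter` cap and the `tol` threshold are all assumptions chosen for illustration. The cap is one simple way to cope with centroids that never settle: the function stops anyway and reports that it did not converge.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def kmeans(points, k, max_iter=100, tol=1e-6, seed=0):
    """Return (centroids, iterations_used, converged)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct observations
    for iteration in range(1, max_iter + 1):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            mean_point(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        shift = max(squared_distance(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            return centroids, iteration, True
    return centroids, max_iter, False  # gave up: centroids did not settle


# Six colors in the 0-1 range: three reddish, three bluish.
points = [
    (0.9, 0.1, 0.1), (1.0, 0.0, 0.0), (0.95, 0.05, 0.0),
    (0.0, 0.0, 1.0), (0.1, 0.1, 0.9), (0.05, 0.0, 0.95),
]
centroids, iterations, converged = kmeans(points, k=2)
print(centroids, iterations, converged)
```

Swapping `squared_distance` for another measure is a quick way to see how the choice of distance changes the resulting clusters.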

The talk will have the following rough outline (with all notebooks and code available on GitHub) with about 5 minutes for each point:
1. Breaking down photographs to create our seemingly random data.
2. Exploring the data before deciding upon clustering.
3. A walkthrough of the K-Means approach.
4. Running K-Means on uncompressed photographs - too long and tedious.
5. Building efficiencies with Quantization - this is a popular approach in LLMs as well.
6. Pitfalls and Best Practices - Exploring our output on a variety of photographs and paintings.
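
One possible reading of the quantization step in point 5 is sketched below: each 8-bit channel is snapped to a small number of levels, so many near-identical pixels collapse into a few weighted colors before clustering. The uniform binning, the helper names and the choice of 4 levels are assumptions for this example; the talk may use a different quantization scheme.

```python
from collections import Counter


def quantize_channel(value, levels):
    """Snap an 8-bit channel value (0-255) to the centre of one of `levels` equal-width bins."""
    bin_width = 256 / levels
    bin_index = min(int(value // bin_width), levels - 1)
    return int(bin_index * bin_width + bin_width / 2)


def quantize_pixels(pixels, levels=4):
    """Map each pixel to its quantized color and count how often each color occurs."""
    return Counter(
        tuple(quantize_channel(channel, levels) for channel in pixel)
        for pixel in pixels
    )


pixels = [
    (255, 0, 0), (250, 10, 5),    # two slightly different reds
    (0, 0, 255), (5, 5, 250),     # two slightly different blues
    (0, 255, 0), (10, 250, 10),   # two slightly different greens
]
weighted_colors = quantize_pixels(pixels, levels=4)
print(len(pixels), "pixels reduced to", len(weighted_colors), "distinct colors")
```

Clustering the distinct colors, weighted by their counts, touches far fewer points than clustering every pixel of an uncompressed photograph.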

Previous Accolades

Kandinsky is a project I built several years ago to explore clustering in JavaScript and later rebuilt in Python to help in the production of my feature film - "Eight Down Toofaan Mail".
This talk has received a very warm response at several international conferences and college lectures previously and, in its upgraded avatar, I feel it will be of tremendous value to a larger, more astute audience like that at PyCon India 2024.

Prerequisites:

- Minimal exposure to Python programming and Jupyter Notebooks is recommended.
- A little courage in the face of Math is requested.
- A lot of ensuing fun is guaranteed.

Video URL:

https://www.youtube.com/watch?v=sVtkDhz5liI&t=29814s

Content URLs:

An earlier session on Kandinsky at PyCascades 2024
Another talk presented at PyData Global 2023
Kandinsky on GitHub

Speaker Info:

Shaurya Agarwal, Deputy Head - Engineering, at Barnes and Noble (BNED LoudCloud).

With 20+ years of experience in Analytics & ML, Big Data and Cloud Computing, Shaurya is leading the engineering teams at BNED that are working on building the next generation of data products for the company.

Speaker Links:
Github: https://github.com/shauryashaurya
LinkedIn: https://www.linkedin.com/in/shauryashaurya/
X / Twitter: @shauryashaurya (https://twitter.com/shauryashaurya)
Talks:
- An earlier session on Kandinsky at PyCascades 2024
- Another talk presented at PyData Global 2023
- 2023 panel discussion at DataOps Observability Conf 2023: https://www.youtube.com/live/GM1EzNChtdk?feature=share&t=2884

Section:	Artificial Intelligence and Machine Learning
Type:	Talk
Target Audience:	Beginner
Last Updated:	10 Jun, 2024
