Demystifying the refactoring of machine learning codebases

rito-sixt


Description:

Abstract

A lot has been said over the years about how we should implement and refactor code for machine learning use-cases. I have personally encountered claims ranging from "it should work exactly as it does in software engineering" to "it simply cannot be done because of the experimental nature of the job". With gradual experience of dealing with multiple large-scale production systems ( both analytical and otherwise ), some of which served millions of recommendations a day, I realised that the former is not quite true, and needless to say the latter is definitely untrue. Refactoring principles for machine learning code cannot be lifted and shifted wholesale from traditional software engineering practice. However, writing clean and maintainable code remains just as important, if not more so, because of the added complexity of our solutions.

Having seen both worlds in relative depth, I've come to see that an amalgamation of refactoring principles ( as laid out by industry pioneers like Martin Fowler and Robert Martin ), with slight adaptations for specific use-cases, works best. In this hands-on workshop, I will demonstrate the refactoring of a legacy machine learning codebase. The primary intent is for the audience to walk away with a decent understanding of what it means to write clean data science code, how to look for code smells, and what concrete steps they can take to address them. We will start with a smelly codebase that trains a machine learning model and generates predictions, and then slowly work our way through refactoring its various elements to make it more readable, maintainable and reusable. We will pay special attention to the underlying principles of why we do something, with tons of discussion on why and how we improve certain code segments to make the world a better place!

"Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live" — John Woods

Sections

The overall workshop can be broadly divided into a few sections:

  1. Introduction and context setting - Here we familiarise ourselves with each other, the legacy codebase and the associated business problem. This section builds the foundation for 180 minutes of intense hacking away at the code to improve its overall quality, and the overarching context of why we are doing this will come in handy more often than not.
  2. Identifying code smells and prioritisation - Once we understand what the code intends to achieve, we start to objectively evaluate smaller sections. The aim is to identify where the code does a decent job and, more importantly, where it doesn't. After that, we evaluate the discovered smells and prioritise them into an order of addressing. This section is especially vital, as a lot of teams struggle to carve out bandwidth for refactoring, even when they have a good grasp of what needs to be refactored. The only way we ever get around to refactoring is through efficient prioritisation and chalking out the expected impact of the exercise.
  3. Refactoring principles in action - The meat of this workshop focuses on evaluating each of the shortlisted code smells, understanding why it is a problem, discussing in depth how it can be addressed, and then implementing the solution. This section largely focuses on why things are done in a certain way, and we should expect lots of discussion with the audience, in a mob programming format.
  4. Look-back, key takeaways and conclusion - After some serious refactoring for over 120 minutes, we recap and look back to see how far we have come. We discuss the key takeaways for the audience, and how they can benefit from them in their day-to-day work.

Target Audience

The primary user personas who would benefit from this workshop are people who deal with analytical codebases day in, day out, whether by developing machine learning models, putting them into production or consuming their output. If you are someone who has to deal with a shabbily written ML codebase, or if you are aware that your team's codebase is not exactly the prettiest thing out there, you will benefit greatly from this workshop. Over the years I have seen that a lot of teams realise they are crippled by badly written code, but the inertia in the team is far too great and they are puzzled about where to start. This is the exact situation we will address in the workshop, hence the conscious choice of starting with a medium-sized legacy codebase. Furthermore, a lot of the refactoring concepts discussed will be adapted to the context of dealing with large-scale data and its analytical processing.

Talk outline

  • Introduction ( Speaker and Subject ) - 5 minutes
  • Clean code - What and why should you care? - 10 minutes
  • Walkthrough of the legacy code and the associated business use-case - 10 minutes
  • Identifying code smells - 15 minutes
  • Prioritisation: How to decide what order to refactor using Impact vs Effort Matrix - 10 minutes
  • Unit testing: Building safety nets before we refactor - 15 minutes
  • Meaningful naming - 10 minutes
  • DRY: Don't repeat yourself - 10 minutes
  • Single Responsibility Principle - 10 minutes
  • Writing meaningfully clean functions - 15 minutes
  • Design patterns Overview - 10 minutes
  • Eliminating if-else ladders using Strategy pattern - 20 minutes
  • Avoiding mutation while using Pandas - 10 minutes
  • Mastering IDE usage - 15 minutes
  • Overview of the cleaned code - 10 minutes
  • Discuss key take-aways - 10 minutes
  • Concluding the workshop - 5 minutes
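To give a flavour of the "Unit testing: Building safety nets before we refactor" segment, here is a minimal, hypothetical sketch of a characterization test. The function `clean_age` and its quirks are illustrative inventions, not taken from the workshop codebase; the point is that we pin down the legacy behaviour with assertions before touching the code.

```python
# Hypothetical legacy helper from a feature-engineering pipeline.
# Its quirks ( default for missing values, clamping negatives ) may look
# odd, but downstream code could depend on them, so we lock them in first.
def clean_age(raw):
    if raw is None:
        return 30          # legacy default for missing ages
    return max(0, int(raw))  # negatives clamped to 0, strings coerced

def test_clean_age_preserves_legacy_behaviour():
    # Characterization test: assert what the code DOES today,
    # so any refactoring that changes behaviour fails loudly.
    assert clean_age(None) == 30    # default for missing values
    assert clean_age(-5) == 0       # negatives clamped
    assert clean_age("42") == 42    # string coercion kept

test_clean_age_preserves_legacy_behaviour()
```

With this safety net in place, we can rename, extract and restructure freely, re-running the test after each small step.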
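The "Eliminating if-else ladders using Strategy pattern" segment can be previewed with a small sketch. The scaling methods below are illustrative stand-ins for whatever branching logic the legacy codebase contains; the shape of the refactoring is what matters.

```python
# Before: a growing if-else ladder selecting a preprocessing strategy.
def scale_before(values, method):
    if method == "minmax":
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) for v in values]
    elif method == "zscore":
        mean = sum(values) / len(values)
        std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
        return [(v - mean) / std for v in values]
    else:
        raise ValueError(f"Unknown method: {method}")

# After: each strategy is a small callable registered in a dict,
# so adding a new scaler never touches the dispatch logic again.
def _minmax(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def _zscore(values):
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

SCALERS = {"minmax": _minmax, "zscore": _zscore}

def scale(values, method):
    try:
        return SCALERS[method](values)
    except KeyError:
        raise ValueError(f"Unknown method: {method}") from None
```

Both versions behave identically, but the second one satisfies the open-closed principle: new strategies are added by registering a function, not by editing a ladder.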
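For the "Avoiding mutation while using Pandas" segment, a minimal sketch of the idea, assuming an illustrative ride-pricing frame ( the column names are invented, not from the workshop code ): transformation steps that return new frames instead of mutating their arguments stay composable and easy to test.

```python
import pandas as pd

# Smelly: mutates the caller's frame as a hidden side effect.
def add_fare_per_km_inplace(df):
    df["fare_per_km"] = df["fare"] / df["distance_km"]

# Cleaner: DataFrame.assign returns a new frame, leaving the input untouched.
def add_fare_per_km(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(fare_per_km=df["fare"] / df["distance_km"])

rides = pd.DataFrame({"fare": [10.0, 9.0], "distance_km": [5.0, 3.0]})
enriched = add_fare_per_km(rides)
# `rides` keeps only its original columns; `enriched` carries the new one.
```

The pure version also chains naturally with other steps ( `df.pipe(add_fare_per_km).pipe(...)` ), which is how we will compose the refactored pipeline.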

Prerequisites:

  • Familiarity with training machine learning models ( Beginner )
  • Python ( Intermediate )

Note - Neither of the above should be a strong deterrent, even if you are not too familiar with them. With a little bit of pre-reading ( say a week or so ), you should be able to follow along quite well. The only hard prerequisite is some previous coding experience in any language.

Speaker Info:

I ( Ritabrata Moitra ) currently work for Sixt Research and Development as Data Scientist III - MLOps Specialist, building Sixt's end-to-end machine learning platform from scratch. Prior to SIXT, I spent most of my career with Thoughtworks, where I was lucky to work with some of the brightest minds I have ever come across. Thoughtworks is considered an industry pioneer in essential software development practices like Clean Code and Agile, so I was exposed to these "best practices" quite early on, and a passion for writing maintainable and scalable software was instilled in me quite strongly.

Having played a multitude of roles over the years ( Software Dev, Data Engineer, Data Scientist and now MLOps ), and having gained an in-depth understanding of the nuances embedded in each, I aim to act as the bridge between these disciplines, ensuring symbiotic cross-learning for maximum throughput for all.

When I am not working, I spend my time building custom mechanical keyboards or hacking away at open source repositories.

Speaker Links:

Here are a few links to my previous work -

Professional Platform Links

Open Source Contributions

Section: Data Science, AI & ML
Type: Workshops
Target Audience: Intermediate