Committing to Sanity

Introduction

You successfully complete a modeling project and three months go by. You've since learned something and need to go back to make a change and rebuild the model. So, you start going through your project directory. Hmm, was the final model training code in test_90_new_features-Copy4.py or prod_90_add_attrs-Copy2.py? Did you generate the training data with data_prep-Copy2_with_ohe.py or full_data_pipe-Copy4.py?

If this sounds familiar, you're not using version control as a fundamental way of doing your work, and it's doing you a disservice. In this post, I talk about why git should be embraced by data scientists. In Part 2 I will provide some implementation details for how I structure projects to get you started.

Git

In case you're not familiar, git is a version control system, which essentially takes snapshots of your project. If you change a file, it knows, and it keeps a record of all changes made to all tracked files. When you've finished something, you tell git to make a snapshot (a “commit”). Since it knows of all the changes, it can easily undo them.
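The snapshot-and-undo cycle described above can be sketched in a few lines. This is only an illustration, assuming git is installed and on your PATH: the `git` helper is a thin hypothetical wrapper around `subprocess.run`, and the file name `train_model.py` and its contents are made up for the demo. Everything happens in a throwaway temporary directory.

```python
import subprocess
import tempfile
from pathlib import Path


def git(repo: Path, *args: str) -> str:
    """Run a git command inside `repo` and return its stdout (hypothetical helper)."""
    result = subprocess.run(
        ["git", *args], cwd=repo, check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def snapshot_and_undo_demo() -> tuple:
    """Commit a file, make a bad change, then restore the committed version."""
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        git(repo, "init")
        # Set an identity for this repo only, so the commit works even on a
        # machine with no global git configuration.
        git(repo, "config", "user.name", "Demo User")
        git(repo, "config", "user.email", "demo@example.com")

        # Hypothetical training script; the name and contents are illustrative.
        script = repo / "train_model.py"
        script.write_text("learning_rate = 0.1\n")

        # Tell git to track the file, then take a snapshot (a commit).
        git(repo, "add", "train_model.py")
        git(repo, "commit", "-m", "Add baseline training script")

        # An experiment that turns out to be a dead end.
        script.write_text("learning_rate = 100\n")
        status = git(repo, "status", "--short")  # git reports the file as modified

        # Undo: restore the file to its last committed state.
        git(repo, "checkout", "--", "train_model.py")
        return status, script.read_text()


if __name__ == "__main__":
    status, restored = snapshot_and_undo_demo()
    print("git noticed:", status)
    print("after undo:", restored)
```

In day-to-day work you would type the same `git add`, `git commit`, and `git checkout -- <file>` commands directly in a terminal; the Python wrapper is here only to keep the demo self-contained.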

If you're old enough to have listened to music on cassette tapes, you remember the hassle of needing to rewind or fast-forward to cue up a particular song. Push the button, wait a few seconds, stop, play, realize you overshot or haven't gotten there yet, repeat. With CDs came checkpoints so you could quickly navigate to the beginning of songs. It's now hard to imagine not having that option. Similarly, you'll wonder how you ever managed modeling experiments without logical checkpoints that you could easily navigate and recover.

The availability of checkpoints leads to three key benefits: safety, clarity, repeatability. Let's dig more into each.

Safety

It's obvious that with regular checkpoints, your work is safer since you can hit the “undo” button when you go down dead ends. But there are some related benefits that are perhaps not as obvious. The first is that you're likely to also be syncing with a remote git server, like GitHub, so your work is backed up on a different machine. By using version control, you've automatically gained the benefit of backing up your work. That may not matter when everything is working well, but it will matter a great deal when you spill coffee on your laptop.

Second, the additional safety leads to a greater willingness to experiment. Instead of modifying a function directly, how many times have you copied and pasted it, added “_v2” to the name, and then modified it, just in case you might want to revert? This is a poor man's version control, and it results in a mess. With checkpoints, you're free to make bold changes without worry because you can always revert. Dead ends can be deleted, breakthroughs can be merged, and potentially useful tangents can just sit on the shelf.
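One common way to get this freedom is to run each experiment on its own branch. The sketch below is a minimal illustration, assuming git is installed and on your PATH. The `git` helper, the branch names (`try-ohe`, `maybe-later`, `add-income`), and the `features.py` file are all hypothetical; it shows a dead end being deleted, a tangent left on the shelf, and a breakthrough merged.

```python
import subprocess
import tempfile
from pathlib import Path


def git(repo: Path, *args: str) -> str:
    """Run a git command inside `repo` and return its stdout (hypothetical helper)."""
    result = subprocess.run(
        ["git", *args], cwd=repo, check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def branch_experiment_demo() -> tuple:
    """Delete a dead-end branch, shelve a tangent, and merge a breakthrough."""
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        git(repo, "init")
        git(repo, "config", "user.name", "Demo User")
        git(repo, "config", "user.email", "demo@example.com")

        # Baseline snapshot on the default branch (its name varies by git setup).
        features = repo / "features.py"
        features.write_text("FEATURES = ['age']\n")
        git(repo, "add", "features.py")
        git(repo, "commit", "-m", "Baseline features")
        main_branch = git(repo, "rev-parse", "--abbrev-ref", "HEAD")

        # Dead end: try an idea on its own branch, then throw the branch away.
        git(repo, "checkout", "-b", "try-ohe")
        features.write_text("FEATURES = ['age', 'zip_ohe']\n")
        git(repo, "commit", "-am", "Try one-hot encoded zip code")
        git(repo, "checkout", main_branch)
        git(repo, "branch", "-D", "try-ohe")

        # Tangent: possibly useful later, so the branch is simply left alone.
        git(repo, "checkout", "-b", "maybe-later")
        features.write_text("FEATURES = ['age', 'tenure']\n")
        git(repo, "commit", "-am", "Sketch tenure feature")
        git(repo, "checkout", main_branch)

        # Breakthrough: merge it back into the main line of work.
        git(repo, "checkout", "-b", "add-income")
        features.write_text("FEATURES = ['age', 'income']\n")
        git(repo, "commit", "-am", "Add income feature")
        git(repo, "checkout", main_branch)
        git(repo, "merge", "add-income")

        branches = git(repo, "branch", "--format=%(refname:short)").splitlines()
        return features.read_text(), branches


if __name__ == "__main__":
    content, branches = branch_experiment_demo()
    print(content, branches)
```

Note that the main branch never contains a “_v2” copy of anything: it holds only the merged result, while the shelved tangent stays available on its own branch.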

Clarity

When you are free to remove code that isn't relevant to the particular task at hand, your code becomes more focused and clear. If the new model you're currently testing doesn't use that dead-end feature engineering code block, then delete it. This avoids the growth of a big web of incomprehensible code that emerges from experimentation in an environment of uncertainty about the end result.

Repeatability

When you have code spread across multiple copies of Jupyter notebooks that need to be run in a very particular order to produce the model, you have a repeatability problem. While it's probably a bad idea to use Jupyter notebooks for anything except early experimentation, using version control to keep your code concise and clean reduces that risk and makes the code easier to port into a proper module. Future You or Future Teammate will be very happy that there's exactly one data pipeline and one model build.

Implementation

Git is typically thought about in the world of software engineering, where you likely have a team of people all working on the same code base. This presents many challenges, and the best practices that have emerged are in this context. While there's surely a lot of overlap, it's not exactly the same as an individual doing Data Science experiments. Check out Part 2 for a more detailed discussion of these differences, along with some thoughts on how to effectively use version control as Data Scientists.

Zak Jost
ML Scientist @ AWS; Blogger; YouTuber

My research interests include distributed robotics, mobile computing and programmable matter.
