Implementing Git in Data Science
I hope Part 1 sold you on the idea that version control is a critical tool for managing data science experiments. But the devil is in the details, so let's talk about how to implement version control in a data science project.
There are several paradigms for using git, but I have essentially adapted “feature branching” for the purposes of data science experiments. Briefly, feature branching means there is a “master” branch that you use as a baseline, and new features are added to the code base by branching off of “master”, making all the changes required to implement the feature, and then merging the new branch back to master once successful.
In my case, I create a new branch for a new experiment, or a new modeling idea I want to try. At this point, you need to consciously make a decision: are you modifying code so that it will only work in this experiment, or are you hoping to modify it in a way that will work with both this experiment and previous experiments? Another way of wording this question is: do you want to replace what you've done, or add to it? The answer will determine whether you can merge the new branch back into master, or whether it will stay its own thing forever.
My recommendation is to make the extra effort to extract key components into a library that is then reused across multiple experiments. This is far preferable to having multiple copies of the same (or worse, slightly different) code, which then need to be maintained separately. That kind of code divergence is a likely source of mistakes. As the adage goes: the best code is no code. By extracting key components to a shared library, you can make incremental improvements and end up with a cohesive code base that can repeatably run a series of experiments. If you instead keep introducing backwards-incompatible changes, you'll find yourself frequently jumping between branches to copy and paste useful parts of the code, and then modifying them because the components weren't designed to work together. With a large experiment, this can grow unwieldy.
The advantage of a feature branching approach is that you can merge your experiment branch back into master and then run any of the experiments. The cost of doing this is that when you make changes to the core library, you might also need to change the implementation of other experiments. So, like everything, it's a trade-off decision. In my experience, it evolves organically and I find myself thinking about extracting common code whenever I am tempted to copy and paste.
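The branch-per-experiment cycle described above can be sketched in a few lines of Python. This is only an illustration: the function names and the branch name experiment_2 are hypothetical, and the --no-ff flag (which keeps a merge commit so the experiment stays visible in the history) is my own choice here rather than a requirement of feature branching.

```python
import subprocess


def start_experiment(branch, base="master"):
    """Git commands that create an experiment branch from the baseline."""
    return [
        ["git", "checkout", base],
        ["git", "checkout", "-b", branch],
    ]


def merge_experiment(branch, base="master"):
    """Git commands that merge a finished experiment back into the baseline."""
    return [
        ["git", "checkout", base],
        ["git", "merge", "--no-ff", branch],
    ]


def run(commands, cwd="."):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, cwd=cwd, check=True)


if __name__ == "__main__":
    # Print instead of running, so the sketch is safe to try anywhere.
    for command in start_experiment("experiment_2") + merge_experiment("experiment_2"):
        print(" ".join(command))
```

If the experiment replaces earlier work instead of adding to it, you would simply never call merge_experiment and the branch would stay its own thing.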
An example directory structure I have found useful is as follows:
|-- core/
    |-- tests/
        |-- test_pull_data.py
        |-- test_prepare_data.py
        |-- test_model.py
        |-- test_deploy.py
        |-- test_utils.py
    |-- pull_data.py
    |-- prepare_data.py
    |-- model.py
    |-- deploy.py
    |-- utils.py
|-- experiment_1/
    |-- data/
        |-- training.csv
        |-- validation.csv
        |-- test.csv
    |-- output/
        |-- results.json
    |-- models/
        |-- model1
        |-- model2
    |-- job_config.py
    |-- build_data.py
    |-- train.py
    |-- evaluate.py
    |-- prod.py
|-- experiment_2/
    |-- data/
        |-- training.csv
        |-- validation.csv
        |-- test.csv
    |-- output/
        |-- results.json
    |-- models/
        |-- model1
        |-- model2
    |-- job_config.py
    |-- build_data.py
    |-- train.py
    |-- evaluate.py
    |-- prod.py
In this case, the main logic is in the core/ directory. Experiments are then organized in directories, which contain the code to execute the core logic for the experiment and the input/output assets that result from running it. The implementation code should be extremely simple, and only do things that are specific to this particular experiment. For instance, if it's comparing approaches A and B, then it will import the configurations of A and B, instantiate the relevant code from core, and call “run” for each.
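To make that concrete, here is a minimal sketch of what such an experiment script might look like. Everything in it is hypothetical: the Experiment class stands in for whatever lives in core/, the two configurations stand in for job_config.py, and the parameter names are invented for illustration. It is all in one block only so that it runs on its own.

```python
from dataclasses import dataclass


# In the layout above, this class would live in core/ (for example core/model.py).
@dataclass
class Experiment:
    name: str
    learning_rate: float
    n_epochs: int

    def run(self):
        # Placeholder for the real pull/prepare/train/evaluate logic in core/.
        return {
            "name": self.name,
            "learning_rate": self.learning_rate,
            "n_epochs": self.n_epochs,
        }


# In the layout above, these would live in experiment_1/job_config.py.
CONFIG_A = {"name": "A", "learning_rate": 0.1, "n_epochs": 10}
CONFIG_B = {"name": "B", "learning_rate": 0.01, "n_epochs": 10}


def main():
    # The experiment script only wires configurations to the core logic.
    return [Experiment(**config).run() for config in (CONFIG_A, CONFIG_B)]


if __name__ == "__main__":
    for result in main():
        print(result)
```

The point of the design is that main() contains no modeling logic at all, so a new experiment is mostly a new set of configurations.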
Notice that this structure gives a natural place for implementing unit/functional/integration tests. Further, the mere act of extracting general components into a multi-use library helps make the code more testable. Since this code is likely parameterized instead of relying on hard-coded experiment details, it becomes easier to create toy examples for tests. A future post will dig deeper into writing tests for data science projects.
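As a small taste of what a toy test can look like, here is a sketch assuming a hypothetical split_rows helper that could live in core/prepare_data.py. Because the validation fraction is a parameter and not a hard-coded experiment detail, a ten-row list is enough to check the behavior. The test functions use plain assert statements, so a runner such as pytest could pick them up, but that choice is mine and not something the layout requires.

```python
def split_rows(rows, validation_fraction):
    """Split rows into (training, validation), keeping the original order.

    Hypothetical helper used only for this illustration.
    """
    if not 0 <= validation_fraction <= 1:
        raise ValueError("validation_fraction must be between 0 and 1")
    n_validation = int(len(rows) * validation_fraction)
    cutoff = len(rows) - n_validation
    return rows[:cutoff], rows[cutoff:]


def test_toy_example():
    # Ten rows with a 20% validation split: the last two rows are held out.
    training, validation = split_rows(list(range(10)), 0.2)
    assert training == [0, 1, 2, 3, 4, 5, 6, 7]
    assert validation == [8, 9]


def test_rejects_bad_fraction():
    # A fraction above 1 is a configuration mistake and should fail loudly.
    try:
        split_rows([1, 2, 3], 1.5)
    except ValueError:
        return
    raise AssertionError("expected ValueError for validation_fraction=1.5")
```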
Here are a few simple practices I have found helpful.
1. .gitignore
A .gitignore file tells git which files to ignore. Setting it up should be the first priority in a new project because once you commit something stupid, it's there forever unless you take special action.
It's most important to exclude sensitive information, like passwords and API keys. If you commit a file containing sensitive information early on, this quickly becomes a nightmare. Deleting it from the current snapshot is not enough: you need to eliminate it from all previous commits. Do yourself a favor and just avoid having to learn how to do that.
The next step is to ignore very large data files and unimportant files that do not need to be tracked (e.g. IPython notebook checkpoints, settings files from your IDE, __pycache__ directories, .pyc files, etc.). In the example above, all the input/output artifacts should be ignored too, since they are fully determined by the code itself and can be regenerated if needed.
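For the example layout, a starter .gitignore could be generated with a short Python sketch like the one below. The patterns are assumptions, not a definitive list: the secrets file names (.env, secrets.json) and the IDE directories (.idea/, .vscode/) are placeholders for whatever your own project uses, and the experiment_* patterns assume the directory names shown earlier.

```python
from pathlib import Path

# Assumed patterns for the example layout; adjust them to your own project.
IGNORE_PATTERNS = [
    "# Secrets (hypothetical file names)",
    ".env",
    "secrets.json",
    "",
    "# Tooling and editor residue",
    ".ipynb_checkpoints/",
    "__pycache__/",
    "*.pyc",
    ".idea/",
    ".vscode/",
    "",
    "# Experiment artifacts that the code can regenerate",
    "experiment_*/data/",
    "experiment_*/output/",
    "experiment_*/models/",
]


def write_gitignore(project_dir="."):
    """Create .gitignore in project_dir, refusing to overwrite an existing one."""
    path = Path(project_dir) / ".gitignore"
    # Mode "x" raises FileExistsError if the file is already there.
    with path.open("x") as handle:
        handle.write("\n".join(IGNORE_PATTERNS) + "\n")
    return path
```

Calling write_gitignore() once at the root of a new repository, before the first commit, is the intended use. Writing the same lines by hand works just as well.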
2. Frequent commits
If you finish a reasonable chunk of work, make a commit. There's no need to be stingy, and it might get you out of a jam.
3. Clear commit messages
If you're making frequent enough commits, then your chunks of work are probably pretty focused. This should enable clear commit messages. Nothing is more satisfying than trying to trace back an undesirable change and quickly finding it because of a properly annotated commit history. If the description of what you did goes like: “Implemented 3 new features, added dropout, built a cross validation component, and refactored the training logic”, then you're not making frequent enough commits.
What about you?
These thoughts are a work in progress as I try different strategies. If you've developed a different approach that you have found useful, I'd love to hear about it!