Since the Data Science blogosphere is almost exclusively focused on the nitty-gritty details of Machine Learning algorithms, you would be forgiven for thinking that’s the most important aspect of being an ML practitioner. But those in industry building models to solve real business problems know this is a small part of the job. Whereas an academic might build models locally and evaluate them on benchmark data to get publishable results, a person in industry is integrating the model into a larger system with many opportunities for failure. This post is about the other important things you need to do to be successful in that setting.
Too often in industry, when someone asks about how well a model is performing, the response is: let me do some analysis. If this is your scenario, you’re at risk of unnecessary pain in the form of operational firefighting.
Some day, some person will be reviewing the business metrics and inevitably discover a major problem. They’ll say something like, “Your model says all these things are X, but really they’re Y. That doesn’t make any sense and it’s costing us a bajillion dollars a day!”
When decisions based on a model turn out to be bad, a few obvious questions follow: Is my model broken? Is my data pipeline broken? Has the thing I’m modeling changed? Answering these questions should be no harder than looking at a dashboard. This brings us to Important Thing 1.
Important Thing 1: Until you have real-time-ish performance monitoring, you’re not done
(The definition of “real-time” will depend on your business problem and is subject to constraints such as label maturity.) This might sound really hard, but unless you’re in an organization with extreme data silos, you can build something in a couple of hours. All you need is:
- Access to model scores and labels
- A machine that can run a cron job
- A machine that can host a dashboarding stack, like Grafana + InfluxDB
Using Docker images, I built this entire stack in about 3 hours despite never having used these tools before. Even if you’re the only one who has access to it, the exercise is worth your time because it gives you early, actionable signals when there are problems. You’ll not only be able to troubleshoot more quickly, but you’ll also learn a lot more about your models by observing their temporal behavior.
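To make this concrete, here is a minimal sketch of the kind of script a cron job could run: it joins scores with labels and computes one performance number per day. The metric (AUC), the function names `auc` and `daily_auc`, and the sample records are all assumptions for illustration. Your own metric, label source, and the step that writes the result into your dashboard's database will differ.

```python
from collections import defaultdict
from datetime import datetime


def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison.

    Quadratic in batch size, which is fine for a sketch but not for huge batches.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined unless both classes are present
    # A positive outscoring a negative counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def daily_auc(records):
    """Group (timestamp, score, label) records by day and compute AUC per day."""
    by_day = defaultdict(list)
    for ts, score, label in records:
        by_day[ts.date()].append((score, label))
    result = {}
    for day, pairs in sorted(by_day.items()):
        scores, labels = zip(*pairs)
        result[day.isoformat()] = auc(scores, labels)
    return result


# Hypothetical scored events whose labels have already matured.
records = [
    (datetime(2024, 1, 1, 9), 0.9, 1),
    (datetime(2024, 1, 1, 10), 0.2, 0),
    (datetime(2024, 1, 1, 11), 0.6, 1),
    (datetime(2024, 1, 1, 12), 0.7, 0),
    (datetime(2024, 1, 2, 9), 0.8, 1),
    (datetime(2024, 1, 2, 10), 0.1, 0),
]

if __name__ == "__main__":
    # In a real job, each (day, value) pair would be written to the
    # time-series database behind the dashboard instead of printed.
    for day, value in daily_auc(records).items():
        print(day, value)
```

Each run produces one point per day, which is exactly the shape a time-series dashboard wants.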
It might turn out that your model is fine, and instead it’s the data coming into your model that’s the problem. Great news: this is easy to measure too. You just have to hook it up to your new dashboarding solution. Once you have a monitoring dashboard and database set up for performance monitoring, it becomes much easier to add new views like these. This is Important Thing 2.
Important Thing 2: Monitor your model inputs and outputs
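As a rough illustration, two cheap input checks are the fraction of missing values in a feature and a drift score comparing today's values with a reference sample. The sketch below uses the population stability index (PSI) as the drift score. That is my choice of technique here, not something prescribed above, and the function names and bin edges are hypothetical.

```python
import math


def missing_rate(values):
    """Fraction of values that are None."""
    return sum(v is None for v in values) / len(values)


def psi(expected, actual, edges, eps=1e-6):
    """Population stability index between two samples of non-missing values.

    `edges` are fixed bin boundaries; `eps` keeps empty bins from
    producing log(0) or a division by zero.
    """
    def fractions(sample):
        counts = [0] * (len(edges) + 1)
        for x in sample:
            # The bin index is the number of edges at or below x.
            counts[sum(x >= e for e in edges)] += 1
        return [max(c / len(sample), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


if __name__ == "__main__":
    reference = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7]  # e.g. training-time values
    today = [0.6, 0.7, 0.8, 0.9, 0.9, 0.95]     # e.g. today's values
    print("missing rate:", missing_rate([1.0, None, 3.0, None]))
    print("psi:", psi(reference, today, edges=[0.25, 0.5, 0.75]))
```

The same two functions work on model scores, so one job can cover both inputs and outputs.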
Great! You have monitoring on the inputs, outputs, and performance of your models. Now you can check it when things break! Or check it every spare minute to make sure everything is OK. Right? Wrong.
Important Thing 3: Build alarms on your monitors
If you went to the trouble of building the stuff that puts all your important metrics in one place, why not build some simple rules to alert you when things go wonky? It is far, far preferable to get an alarm that one of your important input variables has suddenly gone missing than to figure it out indirectly from all the pain the problem caused for the business. The alarms will allow you to contact the business and say, “Heads up: there’s missing data, and it’s causing my model to act weird, and that will cause bad decisions and a bajillion dollars to be lost, so let’s pause things until the data issue is fixed. I’ve already escalated this to the team responsible for the data.”
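The rules really can be simple. Here is a sketch, assuming your monitoring job already produces a dictionary of the latest metric values. The metric names, thresholds, and the `Rule` and `check` helpers are hypothetical, and dashboarding tools such as Grafana ship their own alerting features that you may prefer over hand-rolled code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    metric: str                        # key in the metrics dictionary
    breached: Callable[[float], bool]  # returns True when the alarm should fire
    message: str


def check(metrics: Dict[str, float], rules: List[Rule]) -> List[str]:
    """Return one alert string per rule that fires."""
    alerts = []
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is None:
            # A metric that stops reporting is itself worth an alarm.
            alerts.append(f"{rule.metric}: no value reported")
        elif rule.breached(value):
            alerts.append(f"{rule.metric}={value}: {rule.message}")
    return alerts


# Hypothetical thresholds; tune them to your own model and data.
RULES = [
    Rule("auc", lambda v: v < 0.70, "performance below floor"),
    Rule("income_missing_rate", lambda v: v > 0.05, "input going missing"),
]

if __name__ == "__main__":
    latest = {"auc": 0.65, "income_missing_rate": 0.01}
    for alert in check(latest, RULES):
        print(alert)  # in practice, send to email, chat, or a pager
```

Treating a missing metric as an alert matters: a broken pipeline often shows up as silence rather than as a bad number.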
Models almost always degrade with time for a variety of reasons. Eventually, you’re likely going to want to retrain your model using the most recent data. Since you have a fancy dashboard visualizing performance, at the very least you should be able to anticipate and plan the retrain rather than doing it in reaction to an urgent problem. But there’s a much better way, and this is the next Important Thing.
Important Thing 4: Automate model retraining and deployment
This not only keeps your model fresh and performant, but it’s an investment that should pay dividends for years, even after you’ve moved on from the team.
How does it save you time? Because technical debt begets more technical debt. When a team is drowning in ops pain, they tend to be reactive and fight the urgent fires. This means they probably don’t know the model is degraded until it causes a problem, and then they fix that problem as quickly as they can. And that “as quickly as they can” part is what leads to additional technical debt.
Contrast this with a model that automatically refreshes itself. If you know this is a requirement from the beginning, you make design decisions that enable it along the way. So rather than one big Jupyter notebook that pulls in a CSV from your laptop and spits out a model artifact that you deploy into production with some black magic, you separate your data pull process from your model building process, and you automate the tracking and deployment of the model. With this in place, a model retrain goes from a project that takes weeks or months to something that happens daily or weekly without a human in the loop.
Here is one other reason this is very powerful: people tend to rebuild from scratch instead of reverse engineer. If you do not automate a process to keep the model fresh, then your work will likely be throw-away as soon as you are not there to actively maintain it. Your impact is amplified if your solutions continue to add value after you’ve left.
And while we’re on the topic, here’s one final important thing you can do to make it more likely your work doesn’t become quickly irrelevant:
Important Thing 5: Document, document, document
Different companies have different cultural norms for persisting information. Whatever yours is, do it and do it well. Start early and don’t leave it as one of those things you’ll come back to, because you likely never will. Write the formal whitepaper that documents your design decisions and analyses. Also create the more accessible wiki page or whatever that’s targeted at the next person who might need to do maintenance. One of the primary reasons systems are rebuilt from scratch is that the original isn’t well documented.
So, here are the 5 important things that will keep your work robust and relevant, while saving you lots of time that would otherwise be wasted on unnecessary operational firefighting:
- Monitor your model performance in real-time (or close to it)
- Monitor your model inputs and outputs
- Alarm on your monitors
- Automate model retraining and deployment
- Document your work