Lately, I have been running a lot of Machine Learning experiments both at work and on Kaggle. One thing has become clear: it’s really hard to confidently say Model A is better than Model B. Typically there’s a clear trade-off between speed and confidence in results. In a setup like Kaggle, your success is largely driven by how many ideas you can evaluate, and that in turn is driven by the design of your test bed.
For instance, you could train models A and B, measure the performance, and choose the best one. Fast and easy. But if you change the random seed and repeat, maybe you get the opposite result. You then might decide to do 5-fold cross-validation to get error bars, but this takes ~5x the time and raises more questions about how to sample from the data to form the folds (e.g. group k-fold, stratified k-fold, etc.). Maybe you then decide to shorten the duration of each run by increasing the learning rate or downsampling the training set, but this adds variance to your results, making it even harder to find the signal in the noise.
And what about hyper-parameter tuning? Do you tune separately for models A and B to ensure you’re choosing the best parameters for each data/model combination, or do you lock in the parameters and accept the possibility that they will favor one model over the other? If you’re tuning the parameters for each, how do you evaluate parameter set 1 against parameter set 2? This is essentially back to comparing two different models, and raises all the same questions about signal and noise.
The answer is: there is no easy answer. Ultimately, constraints on time and compute resources will dictate the power of your comparisons, and that will limit your ability to discriminate between increasingly small performance differences. Good engineering then means making decisions with a clear understanding of how noise and variation interact with your experimental design. Ignoring them will result in arbitrary decisions, dead ends, and wasted time.
Side note/rant: It’s worth pointing out that these lessons are not unique to Machine Learning. I experienced the exact same challenges as a process engineer in the semiconductor industry. In that world, the goal was to reduce the number of particles on a wafer. I clearly remember a day when I was trying to advocate for a process change using an experiment on 100 wafers split into experiment/control conditions. When I calculated the required sample size given the variation in the particle counts, I discovered that I actually needed >100,000 wafers, which was a totally absurd proposition. As scientists, we’re often faced with these inconvenient truths and, all too often, the reaction is to ignore them. This, I think, partly explains the replication crisis in the literature of many scientific fields, particularly in deep learning, where it’s infeasibly expensive to run hundreds of iterations of an experiment.
Here’s an outline of a process that might help you tame the randomness:
1. Know how big of a change you want to measure.
Are you trying to find improvements of 0.001 AUC, or 0.01 AUC? The answer will strongly shape the design of your test bed. Be careful in choosing your metric, because some are more sensitive than others and you want to give yourself the best chance of finding changes in the thing you actually care about.
2. Understand the sources of your randomness.
If you downsample the data, the sampling process itself introduces variation: the smaller the sample, the more variation. The rows sampled to form the validation and test sets also introduce randomness, since those sets are used for early stopping and/or evaluation. As noted earlier, changing the random seed used for training can also change the result.
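To make "the smaller the sample, the more variation" concrete, here is a minimal sketch. It uses accuracy rather than AUC, purely because accuracy has a simple closed-form standard error (the binomial formula). The function name `accuracy_standard_error` and the example accuracy of 0.9 are illustrative assumptions, not part of any library or of the experiments described here.

```python
import math


def accuracy_standard_error(accuracy, n_rows):
    """Binomial standard error of an accuracy estimate measured on n_rows rows.

    Illustrative helper (hypothetical name): assumes each row is an
    independent correct/incorrect trial with the same success probability.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    if n_rows <= 0:
        raise ValueError("n_rows must be positive")
    return math.sqrt(accuracy * (1.0 - accuracy) / n_rows)


# Noise in the measured accuracy shrinks with the square root of the sample size.
for n_rows in (1_000, 10_000, 100_000):
    print(n_rows, round(accuracy_standard_error(0.9, n_rows), 5))
```

Under these assumptions, a 10x larger evaluation set only cuts the noise by about 3x, which is why downsampling to save time can quickly swamp small improvements.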
3. Measure the variation for your baseline model.
You’ll do this by drawing a fresh random value for each of the random processes identified in Step 2. For instance, you’ll randomly sample a training set, bootstrap sample a validation set, and bootstrap sample a test set. Then, build a model using the sampled validation and test sets for early stopping and performance evaluation. Repeat this N times (10-50?) and use the results for statistical analysis. If your run-to-run variation is 0.01 AUC and you’d like to detect changes of 0.001 AUC, then you’ll either need to run many, many instances of each experiment, or you’ll need to redefine your process in a way that reduces variance, like increasing the training data size.
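The repeat-N-times loop in Step 3 could be sketched roughly as follows. Everything here is an assumption for illustration: `measure_variation`, `toy_run`, and `auc` are hypothetical names, and `toy_run` fakes the model scores instead of training anything, so its numbers are not real results. In practice you would replace `toy_run` with a function that samples the training set, trains with early stopping on a bootstrapped validation set, and scores a bootstrapped test set.

```python
import random
import statistics


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def toy_run(seed, n_rows=300):
    """Hypothetical stand-in for one full experiment run.

    No model is trained here: labels and scores are simulated, and only
    the test rows are bootstrapped. Swap in your real pipeline.
    """
    rng = random.Random(seed)
    labels = [rng.randint(0, 1) for _ in range(n_rows)]
    scores = [y + rng.gauss(0, 1) for y in labels]  # fake model output
    rows = rng.choices(range(n_rows), k=n_rows)     # bootstrap sample of the test set
    return auc([labels[i] for i in rows], [scores[i] for i in rows])


def measure_variation(run_once, n_runs=30, base_seed=0):
    """Repeat the experiment with a different seed each time and summarise the metric."""
    results = [run_once(base_seed + i) for i in range(n_runs)]
    return statistics.mean(results), statistics.stdev(results), results


if __name__ == "__main__":
    mean_auc, std_auc, _ = measure_variation(toy_run, n_runs=30)
    print(f"baseline AUC: {mean_auc:.4f} +/- {std_auc:.4f} over 30 runs")
```

Passing the run as a callable keeps the harness independent of the model, so the same loop can be reused for model B with the same seeds.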
4. Calculate the required sample size.
Once you know the variation for your baseline and the amount of change you’d like to measure, you can use standard methods, such as a power analysis for a two-sample comparison, to calculate the number of runs you’ll need at a given significance level and statistical power. If that number is far greater than is feasible, you’ll need to design a different process or adjust your expectations.
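As one possible version of the "standard methods" in Step 4, the sketch below uses the textbook normal-approximation formula for comparing two means: runs per model = 2 * ((z_alpha + z_power) * sigma / delta)^2. The function name `required_runs_per_model` and the defaults (5% two-sided significance, 80% power) are assumptions for illustration. It also assumes both models have the same run-to-run standard deviation and get the same number of runs.

```python
import math
from statistics import NormalDist


def required_runs_per_model(sigma, delta, alpha=0.05, power=0.80):
    """Runs needed per model to detect a mean difference of `delta`.

    sigma: run-to-run standard deviation of the metric (from Step 3)
    delta: smallest change you want to detect (from Step 1)
    Uses a two-sided test with a normal approximation; a t-based
    calculation would give slightly larger numbers for small n.
    """
    if sigma <= 0 or delta <= 0:
        raise ValueError("sigma and delta must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    n = 2 * ((z_alpha + z_power) * sigma / delta) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # The example from Step 3: 0.01 AUC of noise, hunting for 0.001 AUC.
    print(required_runs_per_model(sigma=0.01, delta=0.001))
```

With the numbers from Step 3 (0.01 AUC of variation, a 0.001 AUC target), this approximation asks for about 1,570 runs per model, which is exactly the kind of inconvenient answer that forces a redesign of the process or a change in expectations.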