Does your machine learning smell?
Martin Fowler and Kent Beck popularized the term ‘code smell’ in the book Refactoring. They were describing the subtle signs of deeper trouble in code: signs that a program’s source might need refactoring (restructuring and rewriting). There are too many aromas to list here, but here are some examples (remember, these things are not necessarily problems in themselves, but they suggest you need to look more closely):
Duplicated code.
Contrived complexity (also known as showing off).
Functions with many arguments, suggesting overwork.
Very long functions, which are hard to read.
More recently, researcher Felienne Hermans applied the principle to the world’s number one programming environment: spreadsheets. The statistics on spreadsheet bugs are quite worrying, and Hermans enumerated the smells that might lead you to them. Here are four of her original five ‘formula’ smells; notice how they correspond to the code smells above:
Duplicated formulas.
Conditional complexity (e.g. nested IF statements).
Multiple references, analogous to the ‘many arguments’ smell.
Multiple operations in one cell.
What does a machine learning project smell like?
Most machine learning projects are code projects, so some familiar smells might be emanating from the codebase (if we even have access to it). But machine learning models are themselves functions — machines that map input X to some target y. And even if the statistical model is simple, like a KNN classifier, the workflow is a sort of ‘metamodel’ and can have complexities of its own. So what are the ‘ML smells’ that might alert us to deeper problems in our prediction tools?
I asked this question on Twitter and in the Software Underground…
I got some great responses. Here are some ideas adapted from them, with due credit to the people named:
Very high accuracy, especially a complex model on a novel task. (Ari Hartikainen, Helsinki, and Lukas Mosser, Athens; both mentioned numbers around 0.99, but on earth science problems I start to get suspicious well before that: anything over 0.7 is excellent, and anything over 0.8 suggests ‘special efforts’ have been made.)
Excessive precision on hyperparameters might suggest over-tuning. (Chris Dinneen, Perth)
Counterintuitive model weights, e.g. known effects have low feature importance. (Reece Hopkins, Anchorage)
Unreproducible, non-deterministic code, e.g. not setting random seeds; see the first sketch after this list. (Reece Hopkins again)
No description of the train–val–test split, or justification for how it was done. Leakage between training and blind data is easy to introduce with random splits in spatially correlated data; there’s a group-split sketch after this list. (Justin Gosses, Houston)
No discussion of ground truth and how the target labels relate to it. (Justin Gosses again)
Less than 80% of the effort spent on preparing the data. (Michael Pyrcz, Austin — who actually said 90%)
No discussion of the evaluation metric, e.g. how it was selected or designed. (Dan Buscombe, Flagstaff)
No consideration of the precision–recall trade-off, especially in a binary classification task; see the last sketch after this list. (Dan Buscombe again)
Strong class imbalance and no explicit mention of how it was handled. (Dan Buscombe again)
Skewed feature importance (on one or two features) might suggest feature leakage. (John Ramey, Austin)
Excuses, excuses — “we need more data”, “the labels are bad”, etc. (Hallgrim Ludvigsen, Stavanger)
AutoML, e.g. using a black-box service, or an exhaustive automated search of models and hyperparameters.
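A few of these smells are cheap to check for, or to fix. Take the reproducibility one: here’s a minimal sketch in Python, assuming a NumPy and scikit-learn stack (the seed value and the estimator are just placeholders). Fixing the seed of every random number generator your code touches is usually the first step towards a deterministic pipeline.

```python
# A minimal reproducibility sketch, assuming a NumPy + scikit-learn stack.
import random

import numpy as np
from sklearn.ensemble import RandomForestClassifier

SEED = 42  # arbitrary placeholder; any fixed value will do

random.seed(SEED)     # Python's built-in RNG
np.random.seed(SEED)  # NumPy's global RNG (shuffles, noise, etc.)

# Stochastic estimators take an explicit random_state as well.
model = RandomForestClassifier(n_estimators=100, random_state=SEED)
```

Deep learning frameworks have their own seeds (and some genuinely non-deterministic operations), so this only goes so far, but an author who has dealt with it will usually say so.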
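The leakage smell is the one I’d check most carefully on spatial data. Here’s a sketch of a group-aware split using scikit-learn’s GroupShuffleSplit; the fake data and the well IDs are hypothetical, the point is that every sample from a given group stays on the same side of the split, instead of being scattered across train and test by a purely random shuffle.

```python
# A sketch of a group-aware train/test split, assuming each sample
# carries a group label such as a well ID (all data here is made up).
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))          # fake features
y = rng.integers(0, 2, size=1000)       # fake binary labels
wells = rng.integers(0, 20, size=1000)  # hypothetical well ID per sample

# All samples from a given well land on the same side of the split,
# so spatially correlated neighbours can't leak into the test set.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=wells))

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
```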
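And for the metric-related smells, here’s a sketch of looking at the precision–recall trade-off on a strongly imbalanced binary problem, using synthetic data from scikit-learn. Passing class_weight='balanced' is just one simple way to acknowledge the imbalance explicitly; the numbers are placeholders, not recommendations.

```python
# A sketch of the precision-recall trade-off on an imbalanced binary problem.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Roughly 5% positives, to mimic a strong class imbalance.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# class_weight='balanced' is one simple, explicit way to handle the imbalance.
clf = LogisticRegression(class_weight='balanced', max_iter=1000)
clf.fit(X_train, y_train)

# Inspect the whole curve instead of accuracy at the default 0.5 threshold.
scores = clf.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, scores)
for p, r, t in list(zip(precision, recall, thresholds))[::25]:
    print(f"threshold {t:.2f}  precision {p:.2f}  recall {r:.2f}")
```

If a report on an imbalanced problem contains nothing like this, only a single accuracy number, that’s the smell.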
That’s already a long list, but I’m sure there are others. Or perhaps some of these are really the same thing, or are at least connected. What do you think? What red — or at least yellow — flags do you look out for when reviewing machine learning projects? Let us know in the comments below.
If you enjoyed this post, check out the Machine learning project review checklist I wrote about last year. I’m currently working on a new version that includes some tips on what to look for as you go through it. Stay tuned for that.
The thumbnail for this post was generated automatically from text (something like “a robot smelling a flower”… but I made so many I can’t remember exactly!). Like a lot of unconstrained image generation by AIs, it’s not great, but I quite like it all the same.
The AI is LXMERT from the Allen Institute. Try it out or read the paper.