Functional but unsafe machine learning

There are always more ways to mess something up than to get it right. That’s just statistics, specifically entropy: building things is a fight against the second law of thermodynamics. And while messing up a machine learning model might sound abstract, it could result in poor decisions, leading to wasted resources, environmental risk, or unsafe conditions.

Okay then: bad solutions outnumber good solutions. No problem: we are professionals, we can tell the difference between good ones and bad ones… most of the time. Sometimes, though, bad solutions are difficult to discern, especially when we're so motivated to find good ones!

How engineered systems fail

A machine learning pipeline is an engineered system:

Engineered system: a combination of components that work in synergy to collectively perform a useful function

Some engineered systems are difficult to put together badly, because when you do, they very obviously don't work: not only are they useless for their intended purpose, but any lay person can see it. Take a poorly assembled aeroplane: it probably won't fly. If it does fly, it still has safety criteria to meet. So if you have a working system, you're happy.

There are multiple forces at work here: decades of industrial design narrow the options, physics takes care of a big chunk of failed builds, strong regulation takes care of almost all of the rest, and daily inspections keep it all functional. The result: aeroplane accidents are very rare.

In other domains, systems can be put together badly and still function safely. Take cookery: most of the failures are relatively benign, they just taste horrible. They are not unsafe, and they 'function' insofar as they sustain you. So in cookery, if you have a working system, you might not be happy, but at least you're alive.

Where does machine learning fit? Is it like building aeroplanes, or cooking supper? Neither.

[Figure: Types of engineered system failure]

Machine learning with modern tools combines the worst of both worlds: a great many apparently functional but malignantly unfit failure modes. Broken ML models appear to work — given data \(X\), you get predictions \(\hat{y}\) — so you might think you're happy… but the predictions are bad, so you end up in hospital with food poisoning.

What kind of food poisoning? It ranges from severe and acute malfunction to much more subtle and insidious errors. Here are some examples:

  • Allowing information leakage across features or across records, resulting in erroneously high accuracy claims. For example, splitting related (e.g. nearby) records into the training and validation sets.

  • Not accounting for under-represented classes, so that predictions are biased towards over-represented ones. This kind of error was common in work on the McMurray Formation of Alberta, which is 80% pay.

  • Forgetting to standardize or normalize numerical inputs to a model in production, producing erroneous predictions. For example, training on gamma-ray Z-scores of roughly –3 to +3, then asking for a prediction for a value of 75.

  • Using cost functions that do not reflect expert human judgment about what makes a 'good' or 'bad' prediction.

  • Racial or gender bias in a human resource model, such as might be used for hiring or career mapping.
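The first failure mode in the list, leakage across related records, is easy to reproduce. Here is a minimal sketch with scikit-learn and entirely synthetic 'well' data (the wells, feature, and model choice are made up for illustration): samples from the same well are near-duplicates, so a random split puts siblings into both training and validation and the naive score flatters a model that has learned nothing transferable, while a group-aware split that holds out whole wells tells the truth.

```python
import numpy as np
from sklearn.model_selection import train_test_split, GroupShuffleSplit
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
n_wells, per_well = 20, 50
groups = np.repeat(np.arange(n_wells), per_well)

# Each well has its own feature signature (a) and target level (b),
# and crucially a and b are independent: the feature has no real
# predictive power on wells the model has never seen.
a = rng.normal(0, 5, n_wells)[groups]
b = rng.normal(0, 5, n_wells)[groups]
X = (a + rng.normal(0, 0.5, groups.size)).reshape(-1, 1)
y = b + rng.normal(0, 0.5, groups.size)

model = KNeighborsRegressor(n_neighbors=5)

# Naive random split: records from the same well leak into validation,
# so the model can 'memorize' wells and the score looks great.
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)
naive = model.fit(X_tr, y_tr).score(X_va, y_va)

# Group-aware split: whole wells are held out, exposing the truth.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
tr, va = next(splitter.split(X, y, groups))
honest = model.fit(X[tr], y[tr]).score(X[va], y[va])
```

The same pattern applies whenever records are related: group by well, patient, survey, or acquisition date rather than splitting individual rows at random.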

Tomorrow I’ll suggest some ways to build safe machine learning models. In the meantime, please share what you think about this idea. Does it help to think about machine learning failures this way? Do you have another perspective? Let me know in the comments below.


UPDATE on 7 January 2021: Here's the follow-up post, Machine learning safety measures.


Like this? Check out these other posts about quality assurance and machine learning:

Machine learning project review checklist

Imagine being a manager or technical chief whose team has been working on a machine learning project. What questions should you be thinking about when your team tells you about their work?

Here are some suggestions. Some of the questions are getting at reproducibility (for testing, archiving, or sharing the workflow), others at quality assurance. A few of the questions might depend on the particular task in hand, although I’ve tried to keep it pretty generic.

There are a few must-ask questions, highlighted in bold.

High-level questions about the project

  • What question were you trying to answer? How did you frame it as an ML task?

  • What is human-level performance on that task? What level of performance is needed?

  • Is it possible to approach this problem without machine learning?

  • If the analysis focused on deep learning methods, did you try shallow learning methods?

  • What are the ethical and legal aspects of this project?

  • Which domain experts were involved in this analysis?

  • Which data scientists were involved in this analysis?

  • Which tools or frameworks did you use? (How much of a known quantity are they?)

  • Where is the pipeline published? (E.g. public or internal git repositories.)

  • How thorough is the documentation?

Questions about the data preparation

  • Where did the feature data come from?

  • Where did the labels come from?

  • What kind of data exploration did you do?

  • How did you clean the data? How long did this take?

  • Are the classes balanced? How did the distribution change your workflow?

  • What kind of normalization did you do?

  • What did you do about missing data? E.g. what kind of imputation did you do?

  • What kind of feature engineering did you do?

  • How did you split the data into train, validate and test?
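On the normalization and splitting questions above, here is one sketch of the answer I would hope to hear (assuming scikit-learn; the gamma-ray numbers and labels are synthetic): split first, fit the scaler on the training set only, and reuse that same fitted scaler everywhere downstream, including on raw production inputs.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
gr = rng.normal(60, 20, size=(300, 1))   # synthetic gamma-ray values
y = (gr[:, 0] > 60).astype(int)          # made-up labels

# Split FIRST, then fit the scaler on the training set only, so no
# statistics from the validation or test data leak into training.
X_train, X_test, y_train, y_test = train_test_split(
    gr, y, test_size=0.2, random_state=0
)
scaler = StandardScaler().fit(X_train)

Z_train = scaler.transform(X_train)
Z_test = scaler.transform(X_test)

# In production, raw inputs go through the SAME fitted scaler, so a
# gamma-ray reading of 75 becomes a sensible Z-score, not a wild outlier.
z = scaler.transform([[75.0]])
```

Persisting the fitted scaler alongside the model (for example with joblib) is what prevents the train-on-Z-scores, predict-on-raw-values failure described earlier.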

Questions about training and evaluation

  • Which models did you explore and why? Did you also try the simplest models that fit the problem?

  • How did you tune the hyperparameters of the model? Did you try grid search or other methods?

  • What kind of validation did you do? Did you use cross-validation? How did you choose the folds?

  • What evaluation metric are you using? Why is it the most appropriate one?

  • How do training, validation, and test metrics compare?

  • If this was a classification task, how does a dummy classifier score?

  • How are errors/residuals distributed? (Ideally normally distributed and homoscedastic.)

  • How interpretable is your model? That is, do the learned parameters mean anything, and can we learn from them? E.g. what is the feature importance?

  • If this was a classification task, are probabilities available in your model and did you use them?

  • If this was a regression task, have you checked the residuals for normality and homoscedasticity?

  • Are there benchmarks for this task, and how well does your model do on them?
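The dummy-classifier question is quick to act on. Here is a minimal sketch with scikit-learn, using synthetic 80:20 labels like the '80% pay' example earlier: a no-skill baseline that always predicts the majority class already scores about 80% accuracy, which is exactly why the checklist asks for the comparison, and why a metric such as balanced accuracy can be more appropriate.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.8).astype(int)  # ~80:20 class imbalance
X = np.zeros((1000, 1))                        # features are irrelevant here

# A classifier with zero skill: it always predicts the majority class.
dummy = DummyClassifier(strategy="most_frequent").fit(X, y_true)
y_pred = dummy.predict(X)

acc = accuracy_score(y_true, y_pred)           # high, despite zero skill
bal = balanced_accuracy_score(y_true, y_pred)  # 0.5: reveals the trick
```

If your real model only narrowly beats the dummy's accuracy, it may have learned little more than the class distribution.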

Next steps for the project

  • How will you improve the model?

  • Would collecting more data help? Can we address the imbalance with more data?

  • Are there human or computing resources you need access to?

  • How will you deploy the model?

Rather than asking them explicitly, a reviewer might check things off while reading a report or listening to a presentation. A thorough review would cover most of the points without being prompted. And I’d go so far as to say that a person or team who has done a rigorous treatment should readily have answers to all of these questions. They aren't supposed to be 'traps' exactly, but they are supposed to get to the heart of the issues the data scientist or team likely faced during their work.

What do you think? Are the questions fair? Are there any you would remove, or others you would add? Let me know in the comments.

Visit a Google Docs version of this checklist.


Thank you to members of the Software Underground Slack channel for discussion of these questions, especially Anton Biryukov, Justin Gosses, and Lukas Mosser.