A machine learning safety net

A while back, I wrote about machine learning safety measures. I was thinking about how easy it is to accidentally make terrible models (e.g. training a support vector machine on unscaled data), or misuse good models (e.g. forgetting to scale data before making a prediction). I suggested that one solution might be to make tools that help spot these kinds of mistakes:

[We should build] software to support good practice. Many of the problems I’m talking about are quite easy to catch, or at least warn about, during the training and evaluation process. Unscaled features, class imbalance, correlated features, non-IID records, and so on. Education is essential, but software can help us notice and act on them.

Introducing redflag

I’m pleased, and a bit nervous, to introduce redflag, a new Python library to help find the sorts of issues I’m describing. The vision for this tool is as a kind of safety net, or ‘entrance exam for data’ (a phrase Evan coined several years ago). It should be able to look at an array (or Pandas DataFrame), and flag potential issues, perhaps generating a report. And it should be able to sit in your Scikit-Learn pipeline, watching for issues.

The current version, 0.1.9, is still rather rough and experimental. The code is far from optimal, with quite a bit of repetition. But it does a few useful things. For example, suppose we have a DataFrame with a column, Lithology, which contains strings denoting 9 rock types (‘sandstone’, ‘limestone’, etc.). We’d like to know if the classes are ‘balanced’ — present in roughly similar numbers — or not. If they are not, we will have to be careful with how we split this dataset up for our model evaluation workflow.

>>> import redflag as rf
>>> rf.imbalance_degree(df['Lithology'])
3.37859304086633
>>> rf.imbalance_ratio(df['Lithology'])
8.347368421052632

The imbalance degree, defined by Ortigosa-Hernandez et al. (2017), tells us that there are 4 minority classes (the next integer above this number), and that the imbalance severity is somewhere in the middle (3.1 would be well balanced, 3.9 would be strongly imbalanced). The simpler imbalance ratio tells us that there’s about 8 times as much of the biggest majority class as of the smallest minority class. Conclusion: depending on the size of this dataset, the class imbalance is probably not a show-stopper, but we need to pay attention.
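If you want to see where that ratio comes from, you can get close to it from the raw class counts. A minimal sketch in plain Python, assuming the ratio is simply the size of the biggest class divided by the size of the smallest:

>>> from collections import Counter
>>> counts = Counter(df['Lithology'])            # class label -> number of samples
>>> max(counts.values()) / min(counts.values())  # biggest class over smallest class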

Our dataset contains well log data. Unless they are very far apart, well log samples are usually not independent — they are correlated in depth — and this means we can’t split the data randomly in our evaluation workflow. Redflag has a function to help detect features that are correlated to themselves in this way:

>>> rf.is_correlated(df['GR'])
True

We need to be careful!
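For a rough feel for what this is about, you can compare a log with a lagged copy of itself. A minimal sketch with NumPy, not necessarily the algorithm redflag uses under the hood:

>>> import numpy as np
>>> gr = np.asarray(df['GR'], dtype=float)
>>> np.corrcoef(gr[:-1], gr[1:])[0, 1]  # lag-1 autocorrelation; close to 1 means neighbouring samples are far from independent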

Another function, rf.wasserstein(), computes the Wasserstein distance, aka the earth mover’s distance, between distributions. This can help us figure out whether our data splits all have similar distributions — an important condition for our evaluation workflow. I’ll feed it three splits in which I have forgotten to scale the first feature (i.e. the first column) in the X_test dataset:

>>> rf.wasserstein([X_train, X_val, X_test])
array([[32.108,  0.025,  0.043,  0.034],
       [16.011,  0.025,  0.039,  0.057],
       [64.127,  0.049,  0.056,  0.04 ]])

The large distances in the first column are the clue that the distribution of the data in this column varies a great deal between the three datasets. Plotting the distributions makes it clear what happened.
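For example, a minimal matplotlib sketch like this one (assuming the splits are arrays with features in columns) shows the problem in the first feature at a glance:

import matplotlib.pyplot as plt

# Overlay the distributions of the first feature (column 0) in each split.
for X, label in zip([X_train, X_val, X_test], ['train', 'val', 'test']):
    plt.hist(X[:, 0], bins=50, alpha=0.5, density=True, label=label)
plt.xlabel('Feature 0')
plt.legend()
plt.show()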

Working with sklearn

Since we’re often already working with scikit-learn pipelines, and because I don’t really want to have to remember all these extra steps and functions, I thought it would be useful to make a special redflag pipeline that runs “all the things”. It’s called rf.pipeline and it might be all you need. Here’s how to use it:

import redflag as rf
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = make_pipeline(StandardScaler(), rf.pipeline, SVC())

Here’s what this object contains:

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('pipeline',
                 Pipeline(steps=[('rf.imbalance', ImbalanceDetector()),
                                 ('rf.clip', ClipDetector()),
                                 ('rf.correlation', CorrelationDetector()),
                                 ('rf.outlier', OutlierDetector()),
                                 ('rf.distributions',
                                  DistributionComparator())])),
                ('svc', SVC())])

Those redflag items in the inner pipeline are just detectors — think of them like smoke alarms — they do not change any data. Some of them acquire statistics during model fitting, then apply them during prediction. For example, the DistributionComparator learns the feature distributions from the training data, then compares the prediction data to them, to help ensure that you aren’t trying to extrapolate with your model. So it will warn you if, say, you train a model on low-GR sandstones and then try to predict on high-GR shales.

Here’s what happens when I fit my data with this pipeline:
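Roughly speaking, nothing changes except that the detectors chime in. A minimal sketch, assuming they report via Python’s warnings module; the exact messages depend on your data and thresholds:

import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    pipe.fit(X_train, y_train)   # the scaler and SVC fit as usual

for w in caught:
    print(w.message)             # whatever issues the redflag detectors spotted (assuming they use warnings.warn)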

These are just warnings, and it’s up to me to act on them. I can adjust detection thresholds and other aspects of the algorithms under the hood, but the goal is for redflag to wave its little flag without getting in the way. Apart from the warnings, this pipeline works exactly as it did before.


If this project sounds interesting or useful to you, please give it a look. The documentation is here, and contains more examples like those above. If you find bugs or want to request enhancements, there’s the GitHub Issues page. And if you use it for anything you can share, I’d love to hear how you get along!

Machine learning safety measures

Yesterday in Functional but unsafe machine learning I wrote about how easy it is to build machine learning pipelines that yield bad predictions — a clear business risk. Today I want to look at some ways we might reduce this risk.


The diagram I shared yesterday tries to illustrate the idea that it’s easy to find a functional solution in machine learning, but only a few of those solutions are safe or fit for purpose. The question to ask is: what can we do about it?

[Figure: engineered system failure types.]

You can’t make bad models safe, so there’s only one thing to do: shrink the field of functional models so that almost all of them are safe:

[Figure: the same diagram with the field of functional models shrunk so that almost all of them are safe.]

But before we do this any old way, we should ask why the orange circle is so big, and what we’re prepared to do to shrink it.

Part of the reason is that libraries like scikit-learn, and the Python ecosystem in general, are very easy to use and completely free. So it’s absolutely possible for any numerate person with a bit of training to make sophisticated machine learning models in a matter of minutes. This is a wonderful and powerful thing, unprecedented in history, and it’s part of why machine learning has been so hot for the last 6 or 8 years.

Given that we don’t want to lose this feature, what actions could we take to make it harder to build bad models? How can we improve over time like aviation has, and without premature regulation? Here are some ideas:

  • Fix and maintain the data pipeline (not the data!). We spend most of our time getting training and validation data straight, and it always makes a big difference to the outcomes. But we’re obsessed with fixing broken things (which is not sustainable), when we should be coping with them instead.

  • Raise the digital literacy rate: educate all scientists about machine learning and data-driven discovery. This process starts at grade school, but it must continue at university, through grad school, and at work. It’s not a ‘nice to have’, it’s essential to being a scientist in the 21st century.

  • Build software to support good practice. Many of the problems I’m talking about are quite easy to catch, or at least warn about, during the training and evaluation process. Unscaled features, class imbalance, correlated features, non-IID records, and so on. Education is essential, but software can help us notice and act on them.

  • Evolve quality assurance processes to detect ML smell. Organizations that are adopting (building or buying) machine learning (i.e. all of them) must get really good at sniffing out problems with machine learning projects — then fixing those problems — and at connecting practitioners so they can learn together and share good practice.

  • Recognize that machine learning models are made from code, and must be subject to similar kinds of quality assurance. We should adopt habits such as testing, documentation, code review, continuous integration, and issue tracking for users to report bugs and request enhancements. We already know how to do these things (there’s a small sketch of the testing habit below).
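For example, even one test like the following sketch can catch problems early in continuous integration. It uses pytest, and build_model() and the data paths are hypothetical stand-ins for your own project:

# test_model.py
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

from mymodel import build_model   # hypothetical: whatever constructs your pipeline


def test_model_beats_dummy_baseline():
    X = np.load('tests/data/X.npy')   # hypothetical fixture data
    y = np.load('tests/data/y.npy')
    model_score = cross_val_score(build_model(), X, y, cv=5).mean()
    dummy_score = cross_val_score(DummyClassifier(strategy='most_frequent'), X, y, cv=5).mean()
    # A model that cannot beat 'always predict the majority class' is a red flag.
    assert model_score > dummy_score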

I know some of this might sound like I’m advocating command and control, but that approach is not compatible with a lean, agile organization. So if you’re a CTO reading this, the fastest path to success here is not hiring a know-it-all Chief Data Officer from a cool tech giant, then brow-beating your data science practitioners with Best Practice documents. Instead, help your digital professionals create a high-functioning community of practice, connected both inside and outside the organization, and support them as they learn and adapt together. Yes, it takes longer, but it’s much more effective.

What do you think? Are people already doing these things? Do you see people using other strategies to reduce the risk of building poor machine learning models? Share your stories in the comments below.

Functional but unsafe machine learning

There are always more ways to mess something up than to get it right. That’s just statistics, specifically entropy: building things is a fight against the second law of thermodynamics. And while messing up a machine learning model might sound abstract, it could result in poor decisions, leading to wasted resources, environmental risk, or unsafe conditions.

Okay then, bad solutions outnumber good solutions. No problem: we are professionals, we can tell the difference between good ones and bad ones… most of the time. Sometimes, though, bad solutions are difficult to discern — especially when we’re so motivated to find good solutions to things!

How engineered systems fail

A machine learning pipeline is an engineered system:

Engineered system: a combination of components that work in synergy to collectively perform a useful function

Some engineered systems are difficult to put together badly because when you do, they very obviously don't work. Not only can they not be used for their intended purpose, but any lay person can tell this. Take a poorly assembled aeroplane: it probably won’t fly. If it does, it then has safety criteria to meet. So if you have a working system, you're happy.

There are multiple forces at work here: decades of industrial design narrow the options, physics takes care of a big chunk of failed builds, strong regulation takes care of almost all of the rest, and daily inspections keep it all functional. The result: aeroplane accidents are very rare.

In other domains, systems can be put together badly and still function safely. Take cookery — most of the failures are relatively benign, they just taste horrible. They are not unsafe and they 'function' insofar as they sustain you. So in cookery, if you have a working system, you might not be happy, but at least you're alive.

Where does machine learning fit? Is it like building aeroplanes, or cooking supper? Neither.

[Figure: engineered system failure types.]

Machine learning with modern tools combines the worst of both worlds: a great many apparently functional but malignantly unfit failure modes. Broken ML models appear to work — given data \(X\), you get predictions \(\hat{y}\) — so you might think you're happy… but the predictions are bad, so you end up in hospital with food poisoning.

What kind of food poisoning? It ranges from severe and acute malfunction to much more subtle and insidious errors. Here are some examples:

  • Allowing information leakage across features or across records, resulting in erroneously high accuracy claims. For example, splitting related (e.g. nearby) records into the training and validation sets.

  • Not accounting for under-represented classes, so that predictions are biased towards over-represented ones. This kind of error was common in models of the McMurray Formation of Alberta, which is 80% pay.

  • Forgetting to standardize or normalize numerical inputs to a model in production, producing erroneous predictions. For example, training on gamma-ray Z-scores of roughly –3 to +3, then asking for a prediction for a value of 75 (see the sketch after this list).

  • Using cost functions that do not reflect the opinions of humans with expertise about ‘good’ vs ‘bad’ predictions.

  • Racial or gender bias in a human resource model, such as might be used for hiring or career mapping.
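To make the third of those concrete, here is a minimal sketch with scikit-learn, in which gr_train is a hypothetical (n, 1) array of gamma-ray values and labels is a matching array of facies. The scaler fitted on the training data must also be applied at prediction time, and the easiest way to guarantee that is to keep it inside a pipeline.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

scaler = StandardScaler().fit(gr_train)              # learns the training mean and standard deviation
model = SVC().fit(scaler.transform(gr_train), labels)

model.predict([[75.0]])                              # WRONG: raw gamma-ray, but the model saw Z-scores of roughly -3 to +3
model.predict(scaler.transform([[75.0]]))            # right: scale new data with the training statistics

# Safer still: bundle the scaler and the model so you can't forget.
pipe = make_pipeline(StandardScaler(), SVC()).fit(gr_train, labels)
pipe.predict([[75.0]])                               # scaling happens automatically inside the pipeline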

Tomorrow I’ll suggest some ways to build safe machine learning models. In the meantime, please share what you think about this idea. Does it help to think about machine learning failures this way? Do you have another perspective? Let me know in the comments below.


UPDATE on 7 January 2021: Here’s the follow-up post, Machine learning safety measures.


Like this? Check out these other posts about quality assurance and machine learning:

Does your machine learning smell?

Martin Fowler and Kent Beck popularized the term ‘code smell’ in the book Refactoring. They were describing the subtle signs of deeper trouble in code — signs that a program’s source code might need refactoring (restructuring and rewriting). There are too many aromas to list here, but here are some examples (remember, these things are not necessarily problems in themselves, but they suggest you need to look more closely):

  • Duplicated code.

  • Contrived complexity (also known as showing off).

  • Functions with many arguments, suggesting overwork.

  • Very long functions, which are hard to read.

More recently, data scientist Felienne Hermans applied the principle to the world’s number one programming environment: spreadsheets. The statistics on spreadsheet bugs are quite worrying, and Hermans enumerated the smells that might lead you to them. Here are four of her original five ‘formula’ smells; notice how they correspond to the code smells above:

  • Duplicated formulas.

  • Conditional complexity (e.g. nested IF statements).

  • Multiple references, analogous to the ‘many arguments’ smell.

  • Multiple operations in one cell.

What does a machine learning project smell like?

Most machine learning projects are code projects, so some familiar smells might be emanating from the codebase (if we even have access to it). But machine learning models are themselves functions — machines that map input X to some target y. And even if the statistical model is simple, like a KNN classifier, the workflow is a sort of ‘metamodel’ and can have complexities of its own. So what are the ‘ML smells’ that might alert us to deeper problems in our prediction tools?

I asked this question on Twitter and in the Software Underground.

I got some great responses. Here are some ideas adapted from them, with due credit to the people named:

  • Very high accuracy, especially from a complex model on a novel task. (Ari Hartikainen, Helsinki, and Lukas Mosser, Athens; both mentioned numbers around 0.99, but on earth science problems I start to get suspicious well before that: anything over 0.7 is excellent, and anything over 0.8 suggests ‘special efforts’ have been made.)

  • Excessive precision on hyperparameters might suggest over-tuning. (Chris Dinneen, Perth)

  • Counterintuitive model weights, e.g. known effects have low feature importance. (Reece Hopkins, Anchorage)

  • Unreproducible, non-deterministic code, e.g. not setting random seeds. (Reece Hopkins again)

  • No description of the train–val–test split, or justification for how it was done. Leakage between training and blind data is easy to introduce with random splits in spatially correlated data; see the sketch after this list. (Justin Gosses, Houston)

  • No discussion of ground truth and how the target labels relate to it. (Justin Gosses again)

  • Less than 80% of the effort spent on preparing the data. (Michael Pyrcz, Austin — who actually said 90%)

  • No discussion of the evaluation metric, e.g. how it was selected or designed. (Dan Buscombe, Flagstaff)

  • No consideration of the precision–recall trade-off, especially in a binary classification task. (Dan Buscombe again)

  • Strong class imbalance and no explicit mention of how it was handled. (Dan Buscombe again)

  • Skewed feature importance (on one or two features) might suggest feature leakage. (John Ramey, Austin)

  • Excuses, excuses — “we need more data”, “the labels are bad”, etc. (Hallgrim Ludvigsen, Stavanger)

  • AutoML, e.g. using a black box service, or an exhaustive automated search of models and hyperparameters.
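On the train–val–test split point in particular, a grouped split is often the simplest fix for spatially correlated data, for example keeping whole wells together. A minimal sketch with scikit-learn, where X, y, and wells are hypothetical arrays of features, labels, and well identifiers:

from sklearn.model_selection import GroupKFold

# GroupKFold keeps all the samples from a well in the same fold, so nearby,
# correlated samples cannot leak between the training and validation sets.
for train_idx, val_idx in GroupKFold(n_splits=5).split(X, y, groups=wells):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    # ...fit and score the model on this fold.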

That’s already a long list, but I’m sure there are others. Or perhaps some of these are really the same thing, or are at least connected. What do you think? What red — or at least yellow — flags do you look out for when reviewing machine learning projects? Let us know in the comments below.


If you enjoyed this post, check out the Machine learning project review checklist I wrote about last year. I’m currently working on a new version of the checklist that includes some tips for things to look for when going over it. Stay tuned for that.


The thumbnail for this post was generated automatically from text (something like “a robot smelling a flower”… but I made so many I can’t remember exactly!). Like a lot of unconstrained image generation by AIs, it’s not great, but I quite like it all the same.

The AI is LXMERT from the Allen Institute. Try it out or read the paper.


Machine learning project review checklist

Imagine being a manager or technical chief whose team has been working on a machine learning project. What questions should you be thinking about when your team tells you about their work?

Here are some suggestions. Some of the questions are getting at reproducibility (for testing, archiving, or sharing the workflow), others at quality assurance. A few of the questions might depend on the particular task in hand, although I’ve tried to keep it pretty generic.

There are a few must-ask questions, highlighted in bold.

High-level questions about the project

  • What question were you trying to answer? How did you frame it as an ML task?

  • What is human-level performance on that task? What level of performance is needed?

  • Is it possible to approach this problem without machine learning?

  • If the analysis focused on deep learning methods, did you try shallow learning methods?

  • What are the ethical and legal aspects of this project?

  • Which domain experts were involved in this analysis?

  • Which data scientists were involved in this analysis?

  • Which tool or framework did you use? (How much of a known quantity is it?)

  • Where is the pipeline published? (E.g. public or internal git repositories.)

  • How thorough is the documentation?

Questions about the data preparation

  • Where did the feature data come from?

  • Where did the labels come from?

  • What kind of data exploration did you do?

  • How did you clean the data? How long did this take?

  • Are the classes balanced? How did the distribution change your workflow?

  • What kind of normalization did you do?

  • What did you do about missing data? E.g. what kind of imputation did you do?

  • What kind of feature engineering did you do?

  • How did you split the data into train, validate and test?

Questions about training and evaluation

  • Which models did you explore and why? Did you also try the simplest models that fit the problem?

  • How did you tune the hyperparameters of the model? Did you try grid search or other methods?

  • What kind of validation did you do? Did you use cross-validation? How did you choose the folds?

  • What evaluation metric are you using? Why is it the most appropriate one?

  • How do training, validation, and test metrics compare?

  • If this was a classification task, how does a dummy classifier score? (See the sketch at the end of this section.)

  • How are errors/residuals distributed? (Ideally normally distributed and homoscedastic.)

  • How interpretable is your model? That is, do the learned parameters mean anything, and can we learn from them? E.g. what is the feature importance?

  • If this was a classification task, are probabilities available in your model and did you use them?

  • If this was a regression task, have you checked the residuals for normality and homoscedasticity?

  • Are there benchmarks for this task, and how well does your model do on them?
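On the dummy classifier question, the baseline takes only a couple of lines with scikit-learn. A minimal sketch, with hypothetical X_train, y_train, X_val, and y_val arrays:

from sklearn.dummy import DummyClassifier

# A 'model' that ignores the features and always predicts the most common class.
# If the real model does not clearly beat this score, something is wrong.
dummy = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
dummy.score(X_val, y_val)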

Next steps for the project

  • How will you improve the model?

  • Would collecting more data help? Can we address the imbalance with more data?

  • Are there human or computing resources you need access to?

  • How will you deploy the model?

Rather than asking them explicitly, a reviewer might check things off while reading a report or listening to a presentation. A thorough review would cover most of the points without being prompted. And I’d go so far as to say that a person or team who has done a rigorous treatment should readily have answers to all of these questions. They aren't supposed to be 'traps' exactly, but they are supposed to get to the heart of the issues the data scientist or team likely faced during their work.

What do you think? Are the questions fair? Are there any you would remove, or others you would add? Let me know in the comments.

Visit a Google Docs version of this checklist.


Thank you to members of the Software Underground Slack channel for discussion of these questions, especially Anton Biryukov, Justin Gosses, and Lukas Mosser.