Machine learning safety measures

Yesterday in Functional but unsafe machine learning I wrote about how easy it is to build machine learning pipelines that yield bad predictions — a clear business risk. Today I want to look at some ways we might reduce this risk.


The diagram I shared yesterday tries to illustrate the idea that it’s easy to find a functional solution in machine learning, but only a few of those solutions are safe or fit for purpose. The question to ask is: what can we do about it?

[Figure: Engineered_system_failure_types.png]

You can’t make bad models safe, so there’s only one thing to do: shrink the field of functional models so that almost all of them are safe:

[Figure: Engineered_system_safer_ML.png]

But before we do this any old way, we should ask why the orange circle is so big, and what we’re prepared to do to shrink it.

Part of the reason is that libraries like scikit-learn, and the Python ecosystem in general, are very easy to use and completely free. So it’s absolutely possible for any numerate person with a bit of training to make sophisticated machine learning models in a matter of minutes. This is a wonderful and powerful thing, unprecedented in history, and it’s part of why machine learning has been so hot for the last 6 or 8 years.
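
To give a sense of how low the barrier is, here’s a minimal sketch using one of scikit-learn’s bundled toy datasets: a ‘working’ classifier in a handful of lines. Whether it’s a good model is exactly the question this post is about.

```python
# A minimal sketch of how low the barrier is: a 'working' classifier in a few
# lines of scikit-learn, using a bundled toy dataset. Functional... but safe?
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier().fit(X_train, y_train)
print(clf.score(X_test, y_test))   # a single number that looks like success
```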

Given that we don’t want to lose this feature, what actions could we take to make it harder to build bad models? How can we improve over time like aviation has, and without premature regulation? Here are some ideas:

  • Fix and maintain the data pipeline (not the data!). We spend most of our time getting training and validation data straight, and it always makes a big difference to the outcomes. But we’re obsessed with fixing broken data by hand (which is not sustainable), when we should be building pipelines that cope with it instead.

  • Raise the digital literacy rate: educate all scientists about machine learning and data-driven discovery. This process starts at grade school, but it must continue at university, through grad school, and at work. It’s not a ‘nice to have’, it’s essential to being a scientist in the 21st century.

  • Build software to support good practice. Many of the problems I’m talking about are quite easy to catch, or at least warn about, during the training and evaluation process: unscaled features, class imbalance, correlated features, non-IID records, and so on. Education is essential, but software can help us notice and act on these things; there’s a sketch of what I mean after this list.

  • Evolve quality assurance processes to detect ML smell. Organizations that are adopting (building or buying) machine learning (i.e. all of them) must get really good at sniffing out problems with machine learning projects — then fixing those problems — and at connecting practitioners so they can learn together and share good practice.

  • Recognize that machine learning models are made from code, and must be subject to similar kinds of quality assurance. We should adopt habits such as testing, documentation, code review, continuous integration, and issue tracking for users to report bugs and request enhancements. We already know how to do these things.
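
As a sketch of the ‘build software to support good practice’ idea, here’s roughly the kind of pre-training check I have in mind, assuming the features arrive as a pandas DataFrame and the labels as a Series. The thresholds are arbitrary illustrations, not standards.

```python
# A minimal sketch of automated pre-training checks. Assumes a pandas
# DataFrame of features `X` and a Series of labels `y`; the thresholds are
# illustrative only, and a real tool would make them configurable.
import warnings

import numpy as np
import pandas as pd


def pre_training_checks(X: pd.DataFrame, y: pd.Series) -> None:
    """Warn about common problems before any model is trained."""
    # Unscaled features: wildly different ranges hurt distance-based models.
    ranges = X.max(numeric_only=True) - X.min(numeric_only=True)
    if ranges.max() / max(ranges.min(), 1e-12) > 100:
        warnings.warn("Feature ranges differ by >100x; consider scaling.")

    # Strongly correlated features: redundant inputs, unstable importances.
    corr = X.corr(numeric_only=True).abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    if (upper > 0.95).any().any():
        warnings.warn("Highly correlated features; consider dropping one of each pair.")

    # Class imbalance: predictions drift towards the majority class.
    if y.value_counts(normalize=True).max() > 0.8:
        warnings.warn("Majority class exceeds 80%; handle the imbalance explicitly.")
```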

I know some of this might sound like I’m advocating command and control, but that approach is not compatible with a lean, agile organization. So if you’re a CTO reading this, the fastest path to success here is not hiring a know-it-all Chief Data Officer from a cool tech giant, then brow-beating your data science practitioners with Best Practice documents. Instead, help your digital professionals create a high-functioning community of practice, connected both inside and outside the organization, and support them as they learn and adapt together. Yes, it takes longer, but it’s much more effective.

What do you think? Are people already doing these things? Do you see people using other strategies to reduce the risk of building poor machine learning models? Share your stories in the comments below.

Functional but unsafe machine learning

There are always more ways to mess something up than to get it right. That’s just statistics, specifically entropy: building things is a fight against the second law of thermodynamics. And while messing up a machine learning model might sound abstract, it could result in poor decisions, leading to wasted resources, environmental risk, or unsafe conditions.

Okay then, bad solutions outnumber good solutions. No problem: we are professionals, we can tell the difference between good ones and bad ones… most of the time. Sometimes, though, bad solutions are difficult to discern — especially when we’re so motivated to find good solutions!

How engineered systems fail

A machine learning pipeline is an engineered system:

Engineered system: a combination of components that work in synergy to collectively perform a useful function

Some engineered systems are difficult to put together badly, because when you do, they very obviously don’t work: not only can they not be used for their intended purpose, but any layperson can tell. Take a poorly assembled aeroplane: it probably won’t fly. If it does fly, it still has safety criteria to meet. So if you have a working system, you’re happy.

There are multiple forces at work here: decades of industrial design narrow the options, physics takes care of a big chunk of failed builds, strong regulation takes care of almost all of the rest, and daily inspections keep it all functional. The result: aeroplane accidents are very rare.

In other domains, systems can be put together badly and still function safely. Take cookery — most of the failures are relatively benign: they just taste horrible. They are not unsafe and they 'function' insofar as they sustain you. So in cookery, if you have a working system, you might not be happy, but at least you're alive.

Where does machine learning fit? Is it like building aeroplanes, or cooking supper? Neither.

[Figure: Engineered_system_failure_types.png]

Machine learning with modern tools combines the worst of both worlds: a great many apparently functional but malignantly unfit failure modes. Broken ML models appear to work — given data \(X\), you get predictions \(\hat{y}\) — so you might think you're happy… but the predictions are bad, so you end up in hospital with food poisoning.

What kind of food poisoning? It ranges from severe and acute malfunction to much more subtle and insidious errors. Here are some examples:

  • Allowing information leakage across features or across records, resulting in erroneously high accuracy claims. For example, splitting related (e.g. nearby) records into the training and validation sets. (See the sketch after this list.)

  • Not accounting for under-represented classes, so that predictions are biased towards over-represented ones. This kind of error was common in models of the McMurray Formation of Alberta, which is about 80% pay.

  • Forgetting to standardize or normalize numerical inputs to a model in production, producing erroneous predictions. For example, training on gamma-ray Z-scores of roughly –3 to +3, then asking for a prediction for a value of 75.

  • Using cost functions that do not reflect the opinions of humans with expertise about ‘good’ vs ‘bad’ predictions.

  • Racial or gender bias in a human resource model, such as might be used for hiring or career mapping.
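
Two of these failure modes are cheap to guard against in code. Here’s a sketch using scikit-learn: a group-aware split so that related records (here grouped by a hypothetical ‘well’ column) can’t leak between training and validation, and a Pipeline that keeps the scaler attached to the model so production inputs get the same treatment as training inputs. The file and column names are assumptions, for illustration only.

```python
# A sketch of two guards against the failure modes above, using scikit-learn.
# The file and column names ('well', 'GR', ...) are hypothetical.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("logs.csv")          # assumed file of well-log records
X, y, groups = df[["GR", "RHOB"]], df["facies"], df["well"]

# Guard 1: split by group (here, by well) so nearby records from the same
# well cannot leak between the training and validation sets.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, val_idx = next(splitter.split(X, y, groups=groups))

# Guard 2: keep the scaler inside the pipeline, so the same transformation
# is applied at training time and at prediction time in production.
model = make_pipeline(StandardScaler(), KNeighborsClassifier())
model.fit(X.iloc[train_idx], y.iloc[train_idx])
print(model.score(X.iloc[val_idx], y.iloc[val_idx]))
```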

Tomorrow I’ll suggest some ways to build safe machine learning models. In the meantime, please share what you think about this idea. Does it help to think about machine learning failures this way? Do you have another perspective? Let me know in the comments below.


UPDATE on 7 January 2021: Here’s the follow-up post, Machine learning safety measures.


Like this? Check out these other posts about quality assurance and machine learning:

Does your machine learning smell?

Martin Fowler and Kent Beck popularized the term ‘code smell’ in the book Refactoring. They were describing the subtle signs of deeper trouble in code — signs that a program’s source code might need refactoring (restructuring and rewriting). There are too many aromas to list here, but here are some examples (remember, these things are not necessarily problems in themselves, but they suggest you need to look more closely):

  • Duplicated code.

  • Contrived complexity (also known as showing off).

  • Functions with many arguments, suggesting overwork (see the sketch after this list).

  • Very long functions, which are hard to read.
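
To make one of these concrete, here’s a contrived Python sketch of the ‘many arguments’ smell, and one common way to refactor it. Nothing here comes from Fowler and Beck’s book; it’s just an illustration.

```python
# A contrived illustration of the 'many arguments' smell: the function isn't
# broken, but its signature hints that it is doing several jobs at once.
def plot_log(df, curve, depth_min, depth_max, color, lw, grid, title,
             fill, fill_color, ax, save, fname, dpi):
    ...


# One common refactoring: group related options into a small settings object,
# so the function's real responsibilities are easier to see (and to test).
from dataclasses import dataclass

@dataclass
class PlotStyle:
    color: str = "black"
    lw: float = 0.5
    fill: bool = False
    fill_color: str = "gold"

def plot_log_refactored(df, curve, depth_range, style=None, ax=None):
    style = style or PlotStyle()
    ...
```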

More recently, data scientist Felienne Hermans applied the principle to the world’s number one programming environment: spreadsheets. The statistics on spreadsheet bugs are quite worrying, and Hermans enumerated the smells that might lead you to them. Here are four of her original five ‘formula’ smells; notice how they correspond to the code smells above:

  • Duplicated formulas.

  • Conditional complexity (e.g. nested IF statements).

  • Multiple references, analogous to the ‘many arguments’ smell.

  • Multiple operations in one cell.

What does a machine learning project smell like?

Most machine learning projects are code projects, so some familiar smells might be emanating from the codebase (if we even have access to it). But machine learning models are themselves functions — machines that map input X to some target y. And even if the statistical model is simple, like a KNN classifier, the workflow is a sort of ‘metamodel’ and can have complexities of its own. So what are the ‘ML smells’ that might alert us to deeper problems in our prediction tools?
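
Here’s a sketch of what I mean by ‘metamodel’, in the scikit-learn idiom: even when the estimator at the end is a plain KNN, the workflow around it contains several more decisions, any of which can go wrong. The steps and parameters are illustrative, not prescriptive.

```python
# A sketch of the 'metamodel' idea: the estimator is a plain KNN, but the
# workflow wraps it in other steps, each one a decision that can go wrong.
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

workflow = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # how were gaps filled?
    ("scale", StandardScaler()),                    # were features scaled?
    ("select", SelectKBest(f_classif, k=5)),        # why these five features?
    ("knn", KNeighborsClassifier(n_neighbors=7)),   # why seven neighbours?
])
```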

I asked this question on Twitter and in the Software Underground, and I got some great responses. Here are some ideas adapted from them, with due credit to the people named:

  • Very high accuracy, especially from a complex model on a novel task. (Ari Hartikainen, Helsinki, and Lukas Mosser, Athens; both mentioned numbers around 0.99, but on earth science problems I start to get suspicious well before that: anything over 0.7 is excellent, and anything over 0.8 suggests ‘special efforts’ have been made.)

  • Excessive precision on hyperparameters might suggest over-tuning. (Chris Dinneen, Perth)

  • Counterintuitive model weights, e.g. known effects have low feature importance. (Reece Hopkins, Anchorage)

  • Unreproducible, non-deterministic code, e.g. not setting random seeds; see the sketch after this list. (Reece Hopkins again)

  • No description of the train–val–test split, or justification for how it was done. Leakage between training and blind data is easy to introduce with random splits in spatially correlated data. (Justin Gosses, Houston)

  • No discussion of ground truth and how the target labels relate to it. (Justin Gosses again)

  • Less than 80% of the effort spent on preparing the data. (Michael Pyrcz, Austin — who actually said 90%)

  • No discussion of the evaluation metric, e.g. how it was selected or designed. (Dan Buscombe, Flagstaff)

  • No consideration of the precision–recall trade-off, especially in a binary classification task. (Dan Buscombe again)

  • Strong class imbalance and no explicit mention of how it was handled. (Dan Buscombe again)

  • Skewed feature importance (on one or two features) might suggest feature leakage. (John Ramey, Austin)

  • Excuses, excuses — “we need more data”, “the labels are bad”, etc. (Hallgrim Ludvigsen, Stavanger)

  • AutoML, e.g. using a black-box service, or an exhaustive automated search of models and hyperparameters.
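
A few of these smells have cheap partial remedies. The sketch below, on a synthetic imbalanced dataset, fixes and records the random seed (reproducibility), stratifies the split (class imbalance), and reports the precision–recall trade-off rather than a single accuracy number. The dataset and model choice are illustrative assumptions, not recommendations.

```python
# A sketch addressing three smells: non-deterministic code, unhandled class
# imbalance, and silence on the precision-recall trade-off.
import random

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

SEED = 42                              # record the seed alongside the results
random.seed(SEED)
np.random.seed(SEED)

# Synthetic binary problem with a 9:1 class imbalance.
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=SEED)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=SEED  # stratified split
)

clf = RandomForestClassifier(random_state=SEED).fit(X_train, y_train)

# Report the whole precision-recall trade-off, not a single accuracy score.
precision, recall, thresholds = precision_recall_curve(
    y_val, clf.predict_proba(X_val)[:, 1]
)
```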

That’s already a long list, but I’m sure there are others. Or perhaps some of these are really the same thing, or are at least connected. What do you think? What red — or at least yellow — flags do you look out for when reviewing machine learning projects? Let us know in the comments below.


If you enjoyed this post, check out the Machine learning project review checklist I wrote about last year. I’m currently working on a new version that includes some tips for things to look for as you go through it. Stay tuned for that.


The thumbnail for this post was generated automatically from text (something like, “a robot smelling a flower”… but I made so many I can’t remember exactly!). Like a lot of unconstrained image generation by AIs, it’s not great, but I quite like it all the same.

The AI is LXMERT from the Allen Institute. Try it out or read the paper.


Back to work

This post first appeared as a chapter in 52 Things You Should Know About Geophysics (Agile Libre, 2012 — also at Amazon). To follow up on Back to school on Tuesday, I thought I'd share it here on the blog. It's aimed at young professionals, but to be honest, I could do with re-reading it myself now and again...


Five things I wish I'd known

For years I struggled under some misconceptions about scientific careers and professionalism. Maybe I’m not particularly enlightened, and haven't really woken up to them yet, and it's all obvious to everyone else, but just in case I am, I have, and it's not, here are five things I wish I'd known at the start of my career.

Always go the extra inch. You don't need to go the extra mile — there often isn't time and there's a risk that no one will notice anyway. An inch is almost always enough. When you do something, like work for someone or give a presentation, people only really remember two things: the best thing you did, and the last thing you did. So make sure those are awesome. It helps to do something unexpected, or something no one has seen before. It is not as hard as you'd think — read a little around the edges of your subject and you'll find something. Which brings me to...

Read, listen, and learn. Subscribe to some periodicals, preferably ones you will actually enjoy reading. You can see my favourites in J is for Journal. Go to talks and conferences, as often as you reasonably can. But, and this is critical, don't just go — take part. Write notes, ask questions, talk to presenters, discuss with others afterwards. And learn: do take courses, but choose them wisely. In my experience, most courses are not memorable or especially effective. So ask for recommendations from your colleagues, and make sure there is plenty of hands-on interaction in the course, preferably on computers or in the field. Good: Dan Hampson talking you through AVO analysis on real data. Bad: sitting in a classroom watching someone derive equations.

Write, talk, and teach. The complement to read, listen, and learn. It's never too early in your career to start — don't fall into the trap of thinking no one will be interested in what you do, or that you have nothing to share. Even new graduates have something in their experience that nobody else has. Technical conference organizers are desperate for stories from the trenches, to dilute the blue-sky research and pseudo-marketing that most conferences are saturated with. Volunteer to help with courses. Organize workshops and lunch-and-learns. Write articles for Recorder, First Break, or The Leading Edge. Be part of your science! You'll grow from the experience, and it will help you to...

Network, inside and outside your organization. Networking is almost a dirty word to some people, but it doesn’t mean taking people to hockey games or connecting with them on LinkedIn. By far the best way to network is to help people. Help people often, for free, and for fun, and it will make you memorable and get you connected. And it's easy: at least 50 percent of the time, the person just needs a sounding board and they quickly solve their own problem. The rest of the time, chances are good that you can help, or know someone who can. Thanks to the Matthew Effect, whereby the rich get richer, your network can grow exponentially this way. And one thing is certain in this business: one day you will need your network.

Learn to program. You don't need to turn yourself into a programmer, but my greatest regret of my first five years out of university is that I didn't learn to read, re-use, and write code. Read Learn to program to find out why, and how.


Do you have any advice for new geoscientists starting out in their careers? What do you wish you'd known on Day 1?

Two ways for Q&A

If you have ever tried to figure something out on your own, you will know that it is a lot harder than doing something that you already know. It is hard because it is new to you. But just because it is new to you, doesn't mean that it is new to everyone else. And now, in a time when it is easier than ever to connect with everyone online, a new kind of scarcity is emerging. Helpfulness.

How not to get an answer to your question

For better or for worse, I follow more than a dozen discussion groups on LinkedIn. Why? I believe that candid discussions are important and enriching, so I sign up eagerly for the action. Signing up to a discussion group is like showing up at a cocktail party: maybe you will get noticed alongside other people and brands worth noticing. There is hoopla, and echoing, but I don’t think there is any real value being created for the members. If anything, it’s a constant distraction you put up with to hedge against the FOMO.

Yet hordes of users flock to these groups with questions that are clearly more appropriate for technical hotlines, or at least an honest attempt at reading the manual. Users helping users is a great way to foster brand loyalty, but not if the technical help desk failed them first. On LinkedIn, even in the rare case that a question is sufficiently articulated, users can’t upload a screenshot or share a snippet of code. Often I think people are just fishing (not phishing, mind you) and haven’t put in enough groundwork to deserve the attention of helpers.

What is in it for me?

Stack Overflow is a ‘language-independent’ question and answer site for programmers. If it is not the first place I land with a Google search, it is consistently the place from which I bounce back to the terminal with my answer. Also, nearly everything that I know about open-source GIS has come from other people taking part in Q&A on GIS Stack Exchange. The reason Stack Exchange works is that there is value and incentive for each of the three types of people who show up: something for the asker, something for the answerer, something for the searcher.

It is easy to see what is in it for the asker: they have a problem, and they are looking for help. Similarly, it’s easy to see what is in it for the searcher: they might find something they are looking for without even having to ask. But what is in it for the answerer? There is no payment, and there is no credit, at least not of the monetary kind. The answerer gets practice being helpful. They willingly put themselves into other people’s business to put themselves to the test. How awesome is that? The site, in turn, helps the helpers by ensuring the questions contain just enough context to garner meaningful answers.

Imagine if applied geoscientists could incorporate a little more of that.

The deliberate search for innovation & excellence

Collaboration, knowledge sharing, and creativity — the soft skills — aren’t important as ends in themselves. They’re really about getting better at two things: excellence (your craft today) and innovation (your craft tomorrow). Soft skills matter not just because they are means to those important ends, but because they are the only means to those ends. So it’s worth getting better at them. Much better.

One small experiment

The Unsession three weeks ago was one small but deliberate experiment in our technical community's search for excellence and innovation. The idea was to get people out of one comfort zone — sitting in the dark sipping coffee and listening to a talk — and into another — animated discussion with a roomful of other subsurface enthusiasts. It worked: there was palpable energy in the room. People were talking and scribbling and arguing about geoscience. It was awesome. You should have been there. If you weren't, you can get a 3-minute hint of what you missed from the feature film...

Go on, share the movie — we want people to see what a great time we had! 

Big thank you to the award-winning Craig Hall Video & Photography (no relation :) of Canmore, Alberta, for putting this video together so professionally. Time lapse, smooth pans, talking heads, it has everything. We really loved working with them. Follow them on Twitter. 

Stop waiting for permission to knock someone's socks off

When I had a normal job, this was the time of year when we set our goals for the coming months. Actually, we sometimes didn't do it till March. Then we'd have the end-of-year review in October... Anyway, when I thought of this, it made me think about my own goals for the year, for Agile, and my career (if you can call it that). Here's my list:

1. Knock someone's socks off.

That's it. That's my goal. I know it's completely stupid. It's not SMART: specific, measurable, attainable, realistic, or timely. I don't believe in SMART. For a start, it's obviously a backronym. That's why there's attainable and realistic in there—what's the difference? They're equally depressing and uninspiring. Measurable, attainable goals are easy, and I'm going to do them anyway: it's called work. It's the corporate equivalent of saying my goals for the day are waking up, getting out of bed, having a shower, making a list of attainable goals... Maybe those are goals if you're in rehab, but if you're a person with a job or a family they're just part of being a person.

I don't mean we should not make plans and share lists of tasks to help get stuff done. It's important to have everyone working at least occasionally in concert. In my experience people tend to do this anyway, but there's no harm in writing them down for everyone to see. Managers can handle this, and everyone should read them.

Why do these goals seem so dry? You love geoscience or engineering or whatever you do. That’s a given. (If you don’t, for goodness’ sake save yourself.) But people keep making you do boring stuff that you don’t like or aren’t much good at, and there’s no time left for the awesomeness you are ready to unleash, if only there was more time, if only someone would just ask.

Stop thinking like this. 

You are not paid to be at work, or really to do your job. Your line manager might think this way, because that’s how hierarchical management works: it’s essentially a system of passing goals and responsibilities down to the workforce. A nameless, interchangeable workforce. But what the executives and shareholders of your company really want from you, what they really pay you for, is Something Amazing. They don’t know what it is, or what you’re capable of — that’s your job. Your job is to systematically hunt and break and try and build until you find the golden insight, the new play, the better way. The real challenge is how you fit the boring stuff alongside this, not the other way around.

Few managers will ever come to you and say, "If you think there's something around here you can transform into the most awesome thing I've ever seen, go ahead and spend some time on it." You will never get permission to take risks, commit to something daring, and enjoy yourself. But secretly, everyone around you is dying to have their socks knocked right off. Every day they sadly go home with their socks firmly on: nothing awesome today.

I guarantee that, in the process of trying to do something no-one has ever done or thought of before, you will still get the boring bits of your job done. The irony is that no-one will notice, because they're blinded by the awesome thing no-one asked you for. And their socks have been knocked off.