The machine learning algo zoo

One of the wonderful, but also baffling, things about machine learning is that there are so many ways to do it. At some very high level, most of them do something like this (highlighting some jargon):

  1. The human settles on a task (“Predict lithology”) and finds a bunch of data relevant to that task (say, some well logs A, B, and C). Then the human has to come up with some known instances or examples where these well log data go with those lithology labels.

  2. Stuff the logs into an equation. Not an equation like A + B + C, because there’s nothing to tweak in that equation. The equation needs parameters or coefficients, like \(\alpha A + \beta B + \gamma C\). The machine can tweak those Greek letters to change the output. At first, they’ll be random guesses.

  3. See how the output of that equation, which is the machine’s prediction, compares to the known labels. Come up with another equation whose output is a good measure of how far away the predictions are from the known labels. This distance is called the cost, and the equation used to compute it is called the cost function.

  4. Now that the machine has something to guess (the Greek parameters) and a way to know how well it’s doing (the cost function), it just needs a way to minimize the cost, or, to put it another way, to optimize the parameters. This optimization process is called learning.

Together, these steps constitute a learning algorithm. An algorithm with a set of optimized weights is usually referred to simply as a model.
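
To make those four steps concrete, here is a minimal sketch in plain NumPy. It is only an illustration, with made-up data and a simple linear model trained by gradient descent; none of the names come from any particular library.

```python
# A minimal sketch of the four steps above, in plain NumPy with made-up data.
# Nothing here comes from a specific library; it is just an illustration.
import numpy as np

rng = np.random.default_rng(42)

# 1. Known examples: three 'well logs' (features) and a continuous target.
X = rng.normal(size=(100, 3))                      # columns play the role of A, B, C
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2]  # the relationship we hope to recover

# 2. An equation with parameters to tweak: alpha*A + beta*B + gamma*C.
params = rng.normal(size=3)                        # random guesses to start with

for step in range(500):
    y_pred = X @ params                            # the machine's prediction

    # 3. The cost function: mean squared distance from the known labels.
    cost = np.mean((y_pred - y) ** 2)

    # 4. 'Learning': nudge the parameters downhill on the cost (gradient descent).
    grad = 2 * X.T @ (y_pred - y) / len(y)
    params -= 0.1 * grad

print(params)  # ends up close to [2.0, -1.0, 0.5]
```

Real algorithms differ in the equation (step 2), the cost function (step 3), and the optimizer (step 4), but the shape of the loop is the same.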

All the algorithms

Every piece of this story is worth a whole blog post on its own, but for today let’s stay high-level.

The problem is that the algorithm zoo can be overwhelming. My post last week was an attempt to compare a lot of regression algorithms, in terms of how they make sense of three synthetic datasets.

Today I’m sharing a Big Giant Spreadsheet™ that attempts to compare some of the most popular ‘shallow’ learning algorithms in terms of their most important characteristics. For example, can they predict probabilities? Are they deterministic? What are the key hyperparameters? And so on.

Here’s a small version of the table (see the links below for other versions):

There’s a PDF version here — and here’s the original spreadsheet.

Eventually, I’m visualizing a poster for the wall. I think it would be nice to have some equations on here. Maybe the plots from the various comparisons too (see last weeks post!). And even more advice, like which ones break when you have too many features. What else would you like to see on there?

Comparing regressors

There are several really nice comparisons between various algorithms in the Scikit-Learn documentation. The most famous, and useful, one is probably the classifier comparison:

A comparison of classification algorithms. Each row is a different dataset; each column (except the first) is a different classifier, each trying to separate the blue and red points. The accuracy score of each classifier is shown in the lower right corner of each plot. There’s so much to look at in this one plot!

There’s also a very nice clustering algorithm comparison, and this anomaly detection comparison. As usual with awesome open source software packages like Scikit-Learn, the really wonderful thing is that all the source code is right there so you can hack these things to show your own data.

What about regression?

Regression problems are the other major kind of machine learning task. If the thing you’re trying to predict is not a category (like ‘blue’ or ‘red’, as above) but a continuous property (like porosity, say), then you’re looking at a regression problem.

I wondered what a comparison plot for the various regressors in Scikit-Learn would look like. I couldn’t find one, so I made one. I made up three one-dimensional datasets — one linear, one polynomial, and one periodic. Then I tried predicting each one with different model types, from linear regression to a deep neural network. Here’s version 1 (well, 0.1 really) of my script; feel free to adapt and improve it!
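
The gist of the script looks something like this sketch. It is a heavily stripped-down version: the dataset, the model choices, and the hyperparameters here are illustrative, not the ones in the actual script linked above.

```python
# A heavily stripped-down sketch of the regressor comparison; the real script
# linked above uses three datasets and many more models.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# One synthetic 1D dataset: periodic, with noise.
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = np.sin(X).ravel() + 0.2 * rng.normal(size=200)

# Train on the left of the x-axis, validate on the right, to expose extrapolation.
X_train, X_val = X[:150], X[150:]
y_train, y_val = y[:150], y[150:]

models = {
    'Linear': LinearRegression(),
    'Ridge (regularized)': Ridge(alpha=1.0),
    'Decision tree': DecisionTreeRegressor(max_depth=5),
    'SVR (RBF kernel)': SVR(kernel='rbf', C=10),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    rmse = mean_squared_error(y_val, model.predict(X_val)) ** 0.5
    print(f'{name:>20s}  RMS error: {rmse:.3f}')
```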

Here’s the plot it produces:

A comparison of most of the regressors in scikit-learn, made with this script. The red lines are unregularized models; the blue have regularization. The pale points are the validation data. The small numbers in each plot are RMS error (lower is better!).

I think this plot repays careful study. Notice the smoothing effect of regularization. See how tree-based methods result in discretized predictions, and kernel-based ones are pretty horrible at extrapolation.

I’m 100% open to feedback on ways to improve this plot… or please improve it and show me how it goes!

Rocks in the Playground

It’s debatable whether neural networks should feature in an introductory course on machine learning. But it’s hard to avoid at least mentioning them, and many people are attracted to machine learning courses because they have heard so much about deep learning. So, reluctantly, we almost always get into neural nets in our Machine learning for geoscientists classes.

Our approach is to build a neural network from scratch, using only standard Python and NumPy data structures — that is, without using a specialist deep-learning framework. The code is all adapted from Gram Ganssle’s awesome Leading Edge tutorial from 2018. I like it because it lays out the components — the data, the activation function, the cost function, the forward pass, and all the steps involved in backpropagation — then combines them into a working neural network.
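
To give a flavour of the from-scratch approach, here is a toy version with one hidden layer, a sigmoid activation, and a mean squared error cost. To be clear, this is not Gram’s code, just a minimal illustration of the same components.

```python
# A toy one-hidden-layer network in plain NumPy: the same components (forward
# pass, cost, backpropagation), but not Gram's actual code.
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Made-up data: two input 'logs' and one continuous target.
X = rng.normal(size=(200, 2))
y = (X[:, 0] * X[:, 1]).reshape(-1, 1)       # a nonlinear relationship

# Weights: one hidden layer of 8 units, one linear output unit.
W1, b1 = 0.5 * rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = 0.5 * rng.normal(size=(8, 1)), np.zeros(1)

lr = 0.1
for epoch in range(2000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)                 # hidden activations
    y_hat = h @ W2 + b2                      # predictions

    cost = np.mean((y_hat - y) ** 2)         # cost function

    # Backpropagation: gradients of the cost with respect to each weight.
    d_out = 2 * (y_hat - y) / len(y)
    dW2 = h.T @ d_out
    db2 = d_out.sum(axis=0)
    d_hidden = (d_out @ W2.T) * h * (1 - h)  # sigmoid derivative is h * (1 - h)
    dW1 = X.T @ d_hidden
    db1 = d_hidden.sum(axis=0)

    # Gradient descent step.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(f'Final cost: {cost:.4f}')
```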

Figure 2 from Gram Ganssle’s 2018 tutorial in the Leading Edge. Licensed CC BY.


One drawback of our approach is that it would be quite fiddly to change some aspects of the network. For example, adding regularization, which almost all networks use, or even just adding another layer, are both beyond the scope of the class. So I like to follow it up with getting the students to build the same network using the scikit-learn library’s multilayer perceptron model. Then we build the same network using the PyTorch framework. (And one could do it in TensorFlow too, of course.) All of these libraries make it easier to play with the various options.
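
For reference, the scikit-learn version of such a network is only a few lines. The data and hyperparameters below are illustrative, not the ones we use in class; the point is how little code is needed once a framework handles the details.

```python
# Roughly what the scikit-learn version looks like; the data and hyperparameters
# here are illustrative, not the ones from the class.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))
y = X[:, 0]**2 + np.sin(X[:, 1]) + 0.1 * rng.normal(size=500)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(32, 32),  # two hidden layers
                 activation='relu',
                 alpha=1e-3,                   # L2 regularization: one argument, not a rewrite
                 max_iter=2000,
                 random_state=42),
)
model.fit(X_train, y_train)
print('Validation R2:', model.score(X_val, y_val))
```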

Introducing the Rocky Playground

Now we have another tool — one that makes it even easier to change parameters, add layers, use regularization, and so on. The students don’t even have to write any code! I invite you to play with it too — check out the Rocky Playground, an interactive deep neural network you can see inside.

Part of the user interface. Click on the image to visit the site.


This tool is a fork of Google’s well-known Neural Network Playground, as described at the bottom of our tool’s page. We made a few changes:

  • Added several new real and synthetic datasets, with descriptions.

  • More activation functions to try, including ELU and Swish (defined in the quick sketch after this list).

  • You can change the regularization during training and watch the weights.

  • Anyone can upload their own dataset! (These stay on your computer, they are not uploaded anywhere.)

  • We added an expression of the network in Python code.
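
In case those activation functions are unfamiliar, here are quick NumPy definitions of ELU and Swish. These are just the standard formulas, nothing specific to our tool.

```python
# Quick NumPy definitions of the two activations mentioned above.
import numpy as np

def elu(z, alpha=1.0):
    """Exponential Linear Unit: linear for z > 0, saturating to -alpha below."""
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))

def swish(z, beta=1.0):
    """Swish (a.k.a. SiLU when beta = 1): z * sigmoid(beta * z)."""
    return z / (1 + np.exp(-beta * z))

z = np.linspace(-4, 4, 9)
print(np.round(elu(z), 2))
print(np.round(swish(z), 2))
```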

One of the datasets we added is the same shear-sonic prediction dataset we use in the neural network class. So students can watch the same neural net they built (more or less) learn the task in real time. It’s really very cool.

I’ve written before about different expressions of mathematical ideas — words, symbols, annotations, code, etc. — and this is really just a natural extension of that thought. When people can hear and see the same idea in three — or five, or ten — different ways, it sticks. Or at least has a better chance of sticking.

What do you think? Does this tool help you? Could you use it for teaching? If you have suggestions, feel free to drop them here in the comments, or submit an issue to the tool’s repo. We’d love your help to make it even more useful.

New virtual training for digital geoscience

Looking to skill up before 2022 hits us with… whatever 2022 is planning? We have quite a few training classes coming up — don’t miss out! Our classes use 100% geoscience data and examples, and are taught exclusively by earth scientists.

We’re also always happy to teach special classes in-house for you and your colleagues. Just get in touch.

Special classes for CSEG in Calgary

Public classes with timing for Americas

  • Geocomputing: week of 22 November

  • Machine Learning: week of 6 December

Public classes with timing for Europe, Africa and Middle East

  • Geocomputing: week of 27 September

  • Machine Learning: week of 8 November

So far we’ve taught 748 people on the Geocomputing class, and 445 on the Machine Learning class — this wave of new digital scientists is already doing fascinating new work and publishing new research. I’m very excited to see what unfolds over the next year or two!

Find out more about Agile’s public classes by clicking this big button:

Future proof

Last week I wrote about the turmoil many subsurface professionals are experiencing today. There’s no advice that will work for everyone, but one thing that changed my life (ok, my career at least) was learning a programming language. Not only because programming computers is useful and fun, but also because of the technology insights it brings. Whether you’re into data management or machine learning, workflow automation or just being a more rounded professional, there really is no faster way to build digital muscles!


Six classes

We have six public classes coming up in the next few weeks. But there are thousands of online and virtual classes you can take — what’s different about ours? Here’s what I think:

  • All of the instructors are geoscientists, and we have experience in sedimentology, geophysics, and structural geology. We’ve been programming in Python for years, but we remember how it felt to learn it for the first time.

  • We refer to subsurface data and typical workflows throughout the class. We don’t use abstract or unfamiliar examples. We focus 100% on scientific computing and data visualization. You can get a flavour of our material from the X Lines of Python blog series.

  • We want you to be self-sufficient, so we give you everything you need to start being productive right away. You’ll walk away with the full scientific Python stack on your computer, and dozens of notebooks showing you how to do all sorts of things from loading data to making a synthetic seismogram.

Let’s look at what we have on offer.


Upcoming classes

We have a total of 6 public classes coming up, in two sets of three classes: one set with timing for North, Central and South America, and one set with timing for Europe, Africa, and the Middle East. Here they are:

  • Intro to Geocomputing, 5 half-days, 15–19 Feb — 🌎 Timing for Americas — 🌍 Timing for Europe & Africa — If you’re just getting started in scientific computing, or are coming to Python from another language, this is the class for you. No prerequisites.

  • Digital Geology with Python, 4 half-days, 22–25 Feb — 🌍 Timing for Europe & Africa — A closer look at geological workflows using Python. This class is for scientists and engineers with some Python experience.

  • Digital Geophysics with Python, 4 half-days, 22–25 Feb — 🌎 Timing for Americas — We get into some geophysical workflows using Python. This class is for quantitative scientists with some Python experience.

  • Machine Learning for Subsurface, 4 half-days in March — 🌎 Timing for Americas (1–4 Mar) — 🌍 Timing for Europe & Africa (8–11 Mar) — The best way into machine learning for earth scientists and subsurface engineers. We give you everything you need to manage your data and start exploring the world of data science and machine learning.

Follow the links above to find out more about each class. We have space for 14 people in each class. You’ll find pricing options for students and those currently out of work. If you are in special circumstances, please get in touch — we don’t want price to be a barrier to these classes.

In-house options

If you have more than about 5 people to train, it might be worth thinking about an in-house class. That way, the class is full of colleagues learning things together — they can speak more openly and share more freely. We can also tailor the content and the examples to your needs more easily.

Get in touch if you want more info about this approach.

Three books about machine learning

I recently finished a Udemy machine learning course, and wrote on LinkedIn afterwards: “While I am no [machine learning] expert, this is one step on the way to better skills with [Python]”. So which other steps have I taken along that route to learn more about machine learning?

Here I share my thoughts on three books: two that I have read cover to cover, and a third that I can hardly put down! When students in our machine learning class ask about books, these are the ones we recommend.

The Hundred-Page Machine Learning Book

Andriy Burkov (2019). Self-published, 141 p, ISBN 978-1-9995795-0-0. List price USD35. $30.83 at Amazon.com, £25.27 at Amazon.co.uk.

Andriy Burkov states right at the start that “[This] book is distributed on the read first, buy later principle.” That is the first time I’ve seen this in a book, despite the fact that you can try a car before buying it or visit a house before taking out a mortgage.

This was the first book I read that is fully dedicated to machine learning. I knew a little about the topic beforehand, but wasn’t yet ready to use any machine learning algorithm at that point, so this was a perfect introduction to the what, the why and the how of machine learning. The mathematics are introduced and explained in a way that is accessible without being overwhelming, although I acknowledge that this is of course a very subjective comment.

When I turned the last page of this book (and there are a few more than 100), I was even keener to explore further, and I still refer back to this book when I want a quick summary of a machine learning concept.

Data Science from Scratch

Joel Grus (2015). O’Reilly, 311 p, ISBN 978-1-492-04113-9. List price USD 41.99 at O’Reilly. $38.85 at Amazon.com, £27.56 at Amazon.co.uk.

I read the 1st edition of this book, which uses Python 2.7 but often refers to Python 3.4; the 2nd edition (2019) uses Python 3.6 throughout.

Joel Grus, of Ten Essays on Fizz Buzz fame amongst many other achievements, has a knack of breaking problems down to their constituent parts and gracefully rebuilding a solution. While I sometimes struggled with the level of mathematics he’s comfortable with, I never felt that I couldn’t follow his journey. This book really gave me the sequence of steps in data science, and a fantastic resource to refer back to whenever an algorithm seems too opaque to me.

Introduction to Machine Learning with Python: A Guide for Data Scientists

Andreas C. Müller and Sarah Guido (2017). O’Reilly, 384 p, ISBN 978-1-449-36941-5. $40.00 at Amazon.com, £31.45 at Amazon.co.uk

At the time of writing I am halfway through this book, but I’ve already gone through Chapter 2 twice: once with the book and a second time to practice with different datasets. This is symptomatic of my experience with this book so far: it’s totally addictive. It is tremendously well explained, it builds on the power of Jupyter notebooks (all the code is available on GitHub), and it explains and illustrates the effects of only the important hyperparameters in each algorithm. This is fast turning into my go-to companion for machine learning.

If you only buy one machine learning book, or don’t know where to start, this is probably the one to go with.

We all have different technical backgrounds and abilities, and as mathematics figures prominently in the implementation of all machine learning solutions, it’s not the most approachable of subjects. I’d love to hear your comments about books you would recommend to other scientists getting started in machine learning.


These prices are Amazon's discounted prices and are subject to change. The links contain a tag that earns us a small commission, but does not change the price to you. You can almost certainly buy these books elsewhere. 

The images on this page are copyright of their respective owners and are used here in accordance with fair use doctrine.

Openness is a two-way street

Last week the Data Analysis Study Group of the SPE Gulf Coast Section announced a new machine learning contest (I’m afraid registration is now closed, even though the contest has not started yet). The task is to predict shear-wave sonic from other logs, similar to the SPWLA PDDA contest last year. This is a valuable problem in the subsurface, because the shear sonic log is essential for computing the elastic properties of rocks, and therefore for predicting rock and fluid properties or processing seismic. Indeed, TGS has built a business on predicted logs with its ARLAS product. There’s money in log prediction!

The task looks great, but there’s one big problem: the dataset is not open.

Why is this a problem?

Before answering that, let’s look at some context.

What’s a machine learning contest?

Good question. Typically, an organization releases a dataset (financial timeseries, Netflix viewer data, medical images, or whatever). They invite people to predict some valuable property (when to sell, which show to recommend, how to treat the illness, or whatever). And they pick the best, measured against known labels on a hidden dataset.

Kaggle is one of the largest platforms hosting such challenges, and they often attract thousands of participants — competing for large prizes. TGS ran a seismic salt-picking contest on the platform, attracting almost 74,000 submissions from 3220 teams with a $100k prize purse. Other contests are more grass-roots, like the one I ran with Brendon Hall in 2016 on lithology prediction, and like this SPE contest. It’s being run by a team of enthusiasts without a lot of resources from SPE, and the prize purse is only $1000 — representing about 3 hours of the fully loaded G&A of an oil industry professional.

What has this got to do with reproducibility?

Contests that award a large prize in return for solving a hard problem are essentially just a kind of RFP-combined-with-consulting-job. It’s brutally inefficient: hundreds or even thousands of people spend hours on the problem for free, and a handful are financially rewarded. These contests attract a lot of attention, but I’m not that interested in them.

Community-oriented events like this SPE contest — and the recent FORCE one that Xeek hosted — are more interesting and I believe they are more impactful. They have lots of great outcomes:

  • Lots of people have fun working on a hard problem and connecting with each other.

  • Solutions are often shared after, or even during, the contest, so that everyone learns and grows their toolbox.

  • The community gains a new open dataset that might even become a much-needed benchmark for the task in hand.

  • Researchers can publish what they did, or do later. (The SEG ML contest tutorial and results article have 136 citations between them, largely from people revisiting the dataset to show new solutions.)

A lot of new open-source machine learning code is always exciting, but if the data is not open then the work is by definition not reproducible. It seems especially unfair — cheeky, even — to ask participants to open-source their code, but to keep the data proprietary. For sure TGS is interested in how these free solutions compare to their own product.

Well, life’s not fair. Why is this a problem?

The data is being shared with the contest participants on the condition that they may not share it. In other words it’s proprietary. That means:

  • Participants are encumbered with the liability of a proprietary dataset. Sure, TGS is sharing this data in good faith today, but who knows how future TGS lawyers will see it after someone accidentally commits it to their GitHub repo? TGS is a billion-dollar company; they will win a legal argument with you. (Having said that, there’s no NDA or anything, just a checkbox in a form. I don’t know how binding it really is… but I don’t want to be the one that finds out.)

  • Participants can’t publish reproducible papers on their own work. They can publish classic oil-industry, non-reproducible work — I did this thing but no-one can check it because I can’t give you the data — but do we really need more of that? (In the contest introductory Zoom, someone asked about publishing plots of the data. The answer: “It should be fine.” Are we really still this naive about data?)

If anyone from TGS is reading this and thinking, “Come on, we’re not going to sue anyone — we’re not GSI! — it’s fine :)” then my response is: Wonderful! In that case, why not just formalize everything by releasing the data under an open licence — preferably Creative Commons Attribution 4.0? (Unmodified! Don’t make the licensing mistakes that Equinor and NAM have made recently.) That way, everyone knows their rights, everyone can safely download the data, and the community can advance. And TGS looks pretty great for contributing an awesome dataset to the subsurface machine learning community.

I hope TGS decides to release the data with an open licence. If they don’t, it feels like a rather one-sided deal to me. And with the arrangement as it stands, there’s no way I would enter this contest.

Machine learning safety measures

Yesterday in Functional but unsafe machine learning I wrote about how easy it is to build machine learning pipelines that yield bad predictions — a clear business risk. Today I want to look at some ways we might reduce this risk.


The diagram I shared yesterday tries to illustrate the idea that it’s easy to find a functional solution in machine learning, but only a few of those solutions are safe or fit for purpose. The question to ask is: what can we do about it?


You can’t make bad models safe, so there’s only one thing to do: shrink the field of functional models so that almost all of them are safe:


But before we do this any old way, we should ask why the orange circle is so big, and what we’re prepared to do to shrink it.

Part of the reason is that libraries like scikit-learn, and the Python ecosystem in general, are very easy to use and completely free. So it’s absolutely possible for any numerate person with a bit of training to make sophisticated machine learning models in a matter of minutes. This is a wonderful and powerful thing, unprecedented in history, and it’s part of why machine learning has been so hot for the last 6 or 8 years.

Given that we don’t want to lose this feature, what actions could we take to make it harder to build bad models? How can we improve over time like aviation has, and without premature regulation? Here are some ideas:

  • Fix and maintain the data pipeline (not the data!). We spend most of our time getting training and validation data straight, and it always makes a big difference to the outcomes. But we’re obsessed with fixing broken things (which is not sustainable), when we should be coping with them instead.

  • Raise the digital literacy rate: educate all scientists about machine learning and data-driven discovery. This process starts at grade school, but it must continue at university, through grad school, and at work. It’s not a ‘nice to have’, it’s essential to being a scientist in the 21st century.

  • Build software to support good practice. Many of the problems I’m talking about are quite easy to catch, or at least warn about, during the training and evaluation process: unscaled features, class imbalance, correlated features, non-IID records, and so on. Education is essential, but software can help us notice and act on these problems (see the sketch after this list).

  • Evolve quality assurance processes to detect ML smell. Organizations that are adopting (building or buying) machine learning (i.e. all of them) must get really good at sniffing out problems with machine learning projects — then fixing those problems — and at connecting practitioners so they can learn together and share good practice.

  • Recognize that machine learning models are made from code, and must be subject to similar kinds of quality assurance. We should adopt habits such as testing, documentation, code review, continuous integration, and issue tracking for users to report bugs and request enhancements. We already know how to do these things.
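
As an example of the kind of software support mentioned above, here is a sketch of a pre-training ‘sniff test’. The thresholds and messages are entirely illustrative; a real tool would be more careful and more configurable.

```python
# A sketch of the kind of automated check that is easy to add to a training
# script. The thresholds and messages are illustrative, not from any library.
import numpy as np

def sniff_for_trouble(X, y):
    """Warn about a few easy-to-catch problems before training a classifier."""
    warnings = []

    # Unscaled features: wildly different ranges often mean someone forgot to scale.
    ranges = X.max(axis=0) - X.min(axis=0)
    if ranges.max() / max(ranges.min(), 1e-12) > 100:
        warnings.append('Feature ranges differ by more than 100x: did you forget to scale?')

    # Strongly correlated features: possible redundancy or leakage.
    corr = np.corrcoef(X, rowvar=False)
    np.fill_diagonal(corr, 0)
    if np.abs(corr).max() > 0.95:
        warnings.append('Two features are more than 0.95 correlated.')

    # Class imbalance.
    _, counts = np.unique(y, return_counts=True)
    if counts.max() / counts.min() > 5:
        warnings.append('Class imbalance exceeds 5:1; consider resampling or class weights.')

    return warnings

# Deliberately unscaled, imbalanced data to trigger the warnings.
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(size=500), 1000 * rng.normal(size=500)])
y = (rng.random(500) > 0.95).astype(int)
for w in sniff_for_trouble(X, y):
    print('WARNING:', w)
```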

I know some of this might sound like I’m advocating command and control, but that approach is not compatible with a lean, agile organization. So if you’re a CTO reading this, the fastest path to success here is not hiring a know-it-all Chief Data Officer from a cool tech giant, then brow-beating your data science practitioners with Best Practice documents. Instead, help your digital professionals create a high-functioning community of practice, connected both inside and outside the organization, and support them as they learn and adapt together. Yes, it takes longer, but it’s much more effective.

What do you think? Are people already doing these things? Do you see people using other strategies to reduce the risk of building poor machine learning models? Share your stories in the comments below.

Functional but unsafe machine learning

There are always more ways to mess something up than to get it right. That’s just statistics, specifically entropy: building things is a fight against the second law of thermodynamics. And while messing up a machine learning model might sound abstract, it could result in poor decisions, leading to wasted resources, environmental risk, or unsafe conditions.

Okay then, bad solutions outnumber good solutions. No problem: we are professionals, we can tell the difference between good ones and bad ones… most of the time. Sometimes, though, bad solutions are difficult to discern — especially when we’re so motivated to find good solutions to things!

How engineered systems fail

A machine learning pipeline is an engineered system:

Engineered system: a combination of components that work in synergy to collectively perform a useful function

Some engineered systems are difficult to put together badly because when you do, they very obviously don't work. Not only can they not be used for their intended purpose, but any lay person can tell this. Take a poorly assembled aeroplane: it probably won’t fly. If it does, it then has safety criteria to meet. So if you have a working system, you're happy.

There are multiple forces at work here: decades of industrial design narrow the options, physics takes care of a big chunk of failed builds, strong regulation takes care of almost all of the rest, and daily inspections keep it all functional. The result: aeroplane accidents are very rare.

In other domains, systems can be put together badly and still function safely. Take cookery: most of the failures are relatively benign — they just taste horrible. They are not unsafe, and they ‘function’ insofar as they sustain you. So in cookery, if you have a working system, you might not be happy, but at least you’re alive.

Where does machine learning fit? Is it like building aeroplanes, or cooking supper? Neither.


Machine learning with modern tools combines the worst of both worlds: a great many apparently functional but malignantly unfit failure modes. Broken ML models appear to work — given data \(X\), you get predictions \(\hat{y}\) — so you might think you're happy… but the predictions are bad, so you end up in hospital with food poisoning.

What kind of food poisoning? It ranges from severe and acute malfunction to much more subtle and insidious errors. Here are some examples:

  • Allowing information leakage across features or across records, resulting in erroneously high accuracy claims. For example, splitting related (e.g. nearby) records into the training and validation sets.

  • Not accounting for under-represented classes, so that predictions are biased towards over-represented ones. This kind of error was common in models of the McMurray Formation of Alberta, which is 80% pay.

  • Forgetting to standardize or normalize numerical inputs to a model in production, producing erroneous predictions. For example, training on gamma-ray Z-scores of roughly –3 to +3, then asking for a prediction for a raw value of 75. (Bundling the scaling step into the model pipeline, as sketched after this list, avoids this.)

  • Using cost functions that do not reflect the opinions of humans with expertise about ‘good’ vs ‘bad’ predictions.

  • Racial or gender bias in a human resource model, such as might be used for hiring or career mapping.
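
On the standardization point, one simple defence is to bundle the scaling step and the model into a single pipeline, so that production inputs in raw units are transformed with the statistics learned during training. Here is a hedged sketch with made-up gamma-ray data; the model choice is arbitrary.

```python
# One way to avoid the scaling mismatch described above: make the scaler part
# of the model, so raw production inputs are transformed automatically.
# The data and model choice here are made up for illustration.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
gr = rng.normal(loc=75, scale=25, size=(500, 1))                  # gamma-ray in API units
is_shale = (gr.ravel() + rng.normal(scale=10, size=500) > 80).astype(int)

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(gr, is_shale)

# In production we pass raw API units; the pipeline applies the scaling it
# learned during training, so a value of 75 is handled correctly.
print(model.predict([[75.0]]))
```

The same idea works for any preprocessing step: if it is part of the model object, it cannot be forgotten in production.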

Tomorrow I’ll suggest some ways to build safe machine learning models. In the meantime, please share what you think about this idea. Does it help to think about machine learning failures this way? Do you have another perspective? Let me know in the comments below.


UPDATE on 7 January 2021: Here’s the follow-up post, Machine learning safety measures >


Like this? Check out these other posts about quality assurance and machine learning:

Does your machine learning smell?

Martin Fowler and Kent Beck popularized the term ‘code smell’ in the book Refactoring. They were describing the subtle signs of deeper trouble in code — signs that a program’s source code might need refactoring (restructuring and rewriting). There are too many aromas to list here, but here are some examples (remember, these things are not necessarily problems in themselves, but they suggest you need to look more closely):

  • Duplicated code.

  • Contrived complexity (also known as showing off).

  • Functions with many arguments, suggesting overwork.

  • Very long functions, which are hard to read.

More recently, data scientist Felienne Hermans applied the principle to the world’s number one programming environment: spreadsheets. The statistics on spreadsheet bugs are quite worrying, and Hermans enumerated the smells that might lead you to them. Here are four of her original five ‘formula’ smells; notice how they correspond to the code smells above:

  • Duplicated formulas.

  • Conditional complexity (e.g. nested IF statements).

  • Multiple references, analogous to the ‘many arguments’ smell.

  • Multiple operations in one cell.

What does a machine learning project smell like?

Most machine learning projects are code projects, so some familiar smells might be emanating from the codebase (if we even have access to it). But machine learning models are themselves functions — machines that map input X to some target y. And even if the statistical model is simple, like a KNN classifier, the workflow is a sort of ‘metamodel’ and can have complexities of its own. So what are the ‘ML smells’ that might alert us to deeper problems in our prediction tools?

I asked this question on Twitter (below) and in the Software Underground.

I got some great responses. Here are some ideas adapted from them, with due credit to the people named:

  • Very high accuracy, especially a complex model on a novel task. (Ari Hartikainen, Helsinki and Lukas Mosser, Athens; both mentioned numbers around 0.99 but on earth science problems I start to get suspicious well before that: anything over 0.7 is excellent, and anything over 0.8 suggests ‘special efforts’ have been made.)

  • Excessive precision on hyperparameters might suggest over-tuning. (Chris Dinneen, Perth)

  • Counterintuitive model weights, e.g. known effects have low feature importance. (Reece Hopkins, Anchorage)

  • Unreproducible, non-deterministic code, e.g. not setting random seeds; see the sketch after this list. (Reece Hopkins again)

  • No description of the train–val–test split, or justification for how it was done. Leakage between training and blind data is easy to introduce with random splits in spatially correlated data. (Justin Gosses, Houston)

  • No discussion of ground truth and how the target labels relate to it. (Justin Gosses again)

  • Less than 80% of the effort spent on preparing the data. (Michael Pyrcz, Austin — who actually said 90%)

  • No discussion of the evaluation metric, e.g. how it was selected or designed. (Dan Buscombe, Flagstaff)

  • No consideration of the precision–recall trade-off, especially in a binary classification task. (Dan Buscombe again)

  • Strong class imbalance and no explicit mention of how it was handled. (Dan Buscombe again)

  • Skewed feature importance (on one or two features) might suggest feature leakage. (John Ramey, Austin)

  • Excuses, excuses — “we need more data”, “the labels are bad”, etc. (Hallgrim Ludvigsen, Stavanger)

  • AutoML, e.g. using a black box service, or an exhaustive automated search of models and hyperparameters.
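
On the reproducibility smell, the basic habit is cheap: pin every source of randomness, so that anyone re-running the code gets the same split and the same model. Here is a minimal sketch, with illustrative data and an arbitrary model choice.

```python
# The basic reproducibility habit: pin every source of randomness so that
# re-running the code gives the same split and the same model.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

SEED = 42                      # one seed, defined once and reused everywhere
rng = np.random.default_rng(SEED)

X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=SEED, stratify=y
)

clf = RandomForestClassifier(n_estimators=100, random_state=SEED)
clf.fit(X_train, y_train)
print('Validation accuracy:', clf.score(X_val, y_val))
```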

That’s already a long list, but I’m sure there are others. Or perhaps some of these are really the same thing, or are at least connected. What do you think? What red — or at least yellow — flags do you look out for when reviewing machine learning projects? Let us know in the comments below.


If you enjoyed this post, check out the Machine learning project review checklist I wrote about last year. I’m currently working on a new version of the checklist that includes some tips for things to look for when going over it. Stay tuned for that.


The thumbnail for this post was generated automatically from text (something like, “a robot smelling a flower”… but I made so many I can’t remember exactly!). Like a lot of unconstrained image generation by AIs, it’s not great, but I quite like it all the same.

The AI is LXMERT from the Allen Institute. Try it out or read the paper.
