What makes a good benchmark dataset?

Last week I mentioned that we need more open benchmark datasets in geoscience. I think benchmarks are important for researchers to work on, as a teaching aid, and as a way for us to objectively measure how well we’re doing on a particular problem. How else can we know how we’re doing, or compare Company X’s claim with Company Y’s?

What makes a good benchmark?

I haven’t unearthed any guides from other domains to help answer this question, and we don’t yet have enough experience to know for ourselves. But here’s what I’m thinking:

  • It must address at least one clear machine learning task. The more obviously useful the task, the more useful (and important) the benchmark. The benchmark dataset should be well suited to the task (but does not have to be comprehensive or definitive).

  • It must be open. That means explicitly licensed with an open, and preferably permissive, license. I think we need to avoid non-permissive (so-called ‘copyleft’) licenses, because it’s not clear how the ‘sharealike’ clause would affect works that depended on the dataset. And we definitely need to avoid restrictive non-commercial clauses.

  • It must be discoverable and accessible. In other words, it needs to be easy to find, and anyone should be able to get it, without registering on a website or waiting for an email or doing anything else that slows down the pace of their research. A properly open dataset can be replicated anywhere, so openness should take care of this.

  • It must have enough features to be interesting. This might mean different things for different tasks, but in general we’d like to see a few physical measurements (e.g. seismic, well logs, RockEval, core photos, field observations, flow rates, and so on). The features should be independent — we can always generate derivatives.

  • It must have labels. As well as some interesting features, the dataset must have some interpretive information with high information value (e.g. seismic facies, lithologies, depositional environment, sequence boundaries, EURs, and so on). Usually, these are expensive to acquire (which is partly why we’d like to be able to predict them).

  • It should name suitable prediction error evaluation methods, with reference implementations, for the intended task. If people are to use it as a score benchmark, they need to know how to score their own implementations of the task.

  • It can be de-localized, but not completely. We don’t need to know the exact whereabouts of the dataset, but if we remove the relative spatial relationships between wells, say, or don’t know which basin we’re in, then the questions we can ask about the data get a lot less interesting, and the whole situation gets much less realistic.

  • It should not be too big. More than about 1GB means unwieldy. It means difficult to download. It means too much room for nuance. And it means it’s probably impossible to explore in the space of a tutorial. It’s also much harder to get a big dataset into shape than a smaller one. A few thousand records, maybe 100,000 in some cases, is probably plenty.

  • It should be clean, but not too clean. No-one wants to spend hours processing a dataset before it can be used, or — worse — be bitten by some esoteric data problem only a domain expert would spot. But, on the other hand, a dataset with no issues at all might be a bit boring. And, in subsurface at least, completely unrepresentative!

  • It should be well documented. The dataset needs to be described to non-technical people, who know little or nothing about the subsurface. Remember that many users will not be proficient programmers either, so…

  • It should have an accompanying demonstration. For example, a script or notebook, preferably in at least a couple of languages, that shows how to load and inspect the data. Ideally this would include a demonstration of how to pose, and answer, a straightforward question as a machine learning task.

I’m not sure we can call this last one a criterion, but maybe in an ideal world…

  • It should be launched with a data science contest. If you’re feeling really brave, what better way to attract attention to the new open dataset than with a Kaggle-style contest?
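
To make the point about evaluation methods concrete, here is a minimal sketch of what a reference scoring function shipped with a benchmark might look like. It assumes a facies classification task scored with macro-averaged F1; the post does not prescribe a metric, so the choice of metric, the function name `macro_f1`, and the toy labels are all illustrative assumptions.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over the classes in y_true."""
    if not y_true:
        raise ValueError("y_true must not be empty")
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    scores = []
    for c in sorted(set(y_true)):
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical facies labels, for illustration only.
truth = ["sand", "sand", "shale", "shale", "lime"]
guess = ["sand", "shale", "shale", "shale", "sand"]
score = macro_f1(truth, guess)
print(round(score, 3))
```

Publishing something this small alongside the data means everyone computes the score the same way, which is what makes results comparable.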

It’s certainly true that there are several datasets around. Unfortunately, the openness criterion eliminates most of them, so they fall at the first hurdle. For example, the very nice dataset that Brendon Hall used in the SEG machine learning contest is not open.

If you know of a dataset that could be coerced into meeting most of these criteria, we’d like to hear about it. I know a small army of people that would love to help get it into the open, and into the hands of machine learning researchers all over the world.

The thumbnail image for this post was adapted from an image by user arg_flickr on Flickr, licensed CC-BY.

Thanks to several people on Software Underground, for the discussion on this topic. In particular, Justin Gosses and Lukas Mosser pointed out the need for transparent error evaluation.

Subsurface Hackathon project round-up, part 2

Following on from Part 1 yesterday, here are the other seven team projects from the hackathon:

Interactive visualization of Water Table heights over many years.

Water, water everywhere

Water Underground: Martin Bentley (NMMU), Joseph Barraud (Rolls Royce), Rabah Cheknoun (UPPA)

The team built readers for the groundwater data available from dinoloket.nl, covering both the groundwater levels and the hydrochemistry. They clustered the data by aggregating by month and then looking for similarities in levels between boreholes, and they built an open Jupyter notebook.
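
The monthly aggregation step can be sketched in a few lines. This is not the team's code: the input format (a list of date and level pairs per borehole), the function names, and the use of mean absolute difference as a stand-in similarity measure are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def monthly_means(readings):
    """Average (date, level) readings per calendar month."""
    buckets = defaultdict(list)
    for day, level in readings:
        buckets[(day.year, day.month)].append(level)
    return {month: mean(levels) for month, levels in sorted(buckets.items())}


def mean_abs_difference(series_a, series_b):
    """Mean absolute difference over the months present in both series."""
    shared = sorted(set(series_a) & set(series_b))
    if not shared:
        raise ValueError("no overlapping months")
    return mean(abs(series_a[m] - series_b[m]) for m in shared)


# Hypothetical water levels (metres) for two boreholes.
borehole_1 = [(date(2017, 1, 5), 1.0), (date(2017, 1, 20), 1.2), (date(2017, 2, 10), 0.8)]
borehole_2 = [(date(2017, 1, 11), 1.3), (date(2017, 2, 2), 0.9), (date(2017, 2, 25), 1.1)]

m1 = monthly_means(borehole_1)
m2 = monthly_means(borehole_2)
distance = mean_abs_difference(m1, m2)
```

A pairwise distance like this could feed any standard clustering method; smaller values mean the two boreholes behave more alike month to month.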




Seismic from noise

OBSNoise: Fernando Villanueva-Robles (IPGP), Yann Huet (Setec-Lerm), Ngoc Huyen Luu (Ecole Polytechnique), Dorian Bagur (Telecom ParisTech), Jonathan Grandjean (Independent)

The OBSNoise project investigated the application of machine learning to coherently stack ambient noise records collected from ocean bottom seismic (OBS) arrays in order to extract reservoir information. The team's results from synthetic data showed promise. If fully developed, this technology could be a virtually real-time monitoring system of dynamic reservoir properties.

The Killers. Killing It. 

Global geochemical data analytics

The Killers: Alexandre Sache, Violaine Delahaye, Karl Sache (all from Institute Polytechnique UniLaSalle), Côme Arvis, Guillaume Ligner (Ecole Polytechnique)

Two geoscience undergrads and one automotive design student (I know, right?) from UniLaSalle hooked up with two data science students from Ecole Polytechnique to interrogate the massive GeoRoc database using some clever data analytics tricks, and did some novel many-dimensional geochemical classifications.

Team LogFix.

Fixing broken well data

LogFix: Guillaume Coffin (Telecom Evolution), Florian Napierala (EISTI), Camille Gimenez (Université Paris-Saclay), Tristan Siméon (Université de Montpellier), Robert Leckenby (Independent)

Truly pristine, calibrated, and corrected petrophysical data is so rare that it has a sort of mythical status. Team LogFix used machine learning to identify bad-data zones, then repair, QC, and fill in missing sections. They got an impressive way into the problem, using a dataset from the Athabasca region of Canada.
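
For a sense of what flagging and filling involves, here is a deliberately simple, non-machine-learning baseline. It is not what Team LogFix built: it assumes regularly sampled data, a null placeholder of -999.25, and a hypothetical valid range for a gamma-ray curve, and it fills gaps by straight-line interpolation.

```python
import math

NULL = -999.25  # assumed null placeholder; check the header of your own files


def flag_bad(samples, lo, hi, null=NULL):
    """Replace nulls and out-of-range values with NaN."""
    return [v if (v != null and lo <= v <= hi) else math.nan for v in samples]


def fill_gaps(samples):
    """Linearly interpolate interior NaN runs; leading and trailing gaps stay NaN."""
    out = list(samples)
    good = [i for i, v in enumerate(out) if not math.isnan(v)]
    for left, right in zip(good, good[1:]):
        step = (out[right] - out[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = out[left] + step * (i - left)
    return out


# Hypothetical gamma-ray samples with a null run and one spike.
gamma = [55.0, 60.0, -999.25, -999.25, 75.0, 900.0, 85.0]
flagged = flag_bad(gamma, lo=0.0, hi=300.0)
repaired = fill_gaps(flagged)
```

A learned model would aim to beat this kind of baseline by using the other curves in the well to predict the missing one, rather than drawing a straight line across the gap.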

Between the hand-drawn lines

Automagical: Louis Poirier (Independent), Maggie Baber (Independent), Georg Semmler (GiGa infosystems), Björn Wieczoreck (GiGa infosystems), Jonas Kopcsek (GiGa infosystems)


You don't need to believe in magic. Team Automagical used machine learning to create 3D geological models from 2D cross-sections. They trained a predictive model using a collection of standardized hand-drawn cross-sections from human geoscientists. The model learns how to propagate rocks throughout a 3D scene. Their goal is to be able to generate cross-sections along any direction through the model. The AI learned how to do geologically realistic interpolation on simple structures. What kind of geologic complexity is possible with more input from more cross-sections?

The document on the left contains a log display with a lithology column. It's a 'hit'. The one on the right has no lithologies and is a 'miss'.


There's rocks in them hills! Hills of paper, that is

Logs on the Rocks: Daniel Stanton (Leeds University), Jack Woolam (Leeds University), Adam Goddard (Leeds University), Henri Blondelle (AgileDD)

If the oil and gas industry is to get more efficient, we'd better get really good at finding lithology and fluid information in the mountains of paper we've collectively built. Team Logs on the Rocks used CNNs to identify graphical depictions of rock types in a sea of unstructured PDFs and TIFFs. They introduced themselves as a team of non-coders, but these guys were doing cloud computing on AWS and using NVIDIA's GPUs before the end of the weekend.

Robot vision for seismic interpretation

It's not our FAULT! Claire Birnie (Leeds University), Carlos Alberto da Costa Filho (Edinburgh University), Matteo Ravasi (Statoil), Filippo Broggini (ETHZ), Gijs Straathof (SGS)

Geologic feature recognition using machine learning. The goal was to assist seismic interpreters in detecting geologic features (faults, folds, traps, etc.) in seismic data. They used Haar cascade classifiers, which are routinely used for identifying faces or kittens or beer bottles in photographs and video streams, specially trained to work on seismic data. They used the awesome OpenCV library to build this technology. At the time of writing, their website appears to be maxed out for the month, so if you're dying to see it, leave them a comment on LinkedIn asking them to increase their capacity. And in the meantime, you can check out their project's repo on GitHub.

Kudos for the open source repo, team!
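
For readers wondering what sits underneath a Haar cascade: the classifier is built from many simple rectangle-difference features, each computed cheaply from an integral image. The sketch below shows that one building block in plain Python. It is not the team's code and does not use or reproduce OpenCV; the function names and the toy patch are assumptions for illustration.

```python
def integral_image(img):
    """Summed-area table with an extra zero row and column."""
    rows, cols = len(img), len(img[0])
    ii = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        row_sum = 0
        for c in range(cols):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii


def box_sum(ii, top, left, height, width):
    """Sum of the pixels in a rectangle, using four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def two_rect_feature(ii, top, left, height, width):
    """Left half minus right half; large magnitude suggests a vertical edge."""
    half = width // 2
    return (box_sum(ii, top, left, height, half)
            - box_sum(ii, top, left + half, height, half))


# Toy 4x4 'amplitude' patch with an abrupt lateral change.
patch = [
    [1, 1, 5, 5],
    [1, 1, 5, 5],
    [1, 1, 5, 5],
    [1, 1, 5, 5],
]
ii = integral_image(patch)
feature = two_rect_feature(ii, 0, 0, 4, 4)
```

A cascade chains thousands of such features, at many positions and scales, into a sequence of quick accept-or-reject tests, which is why it can scan a large image fast.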

It was thrilling to see such a large range of data and applications. Digital thin-sections, ground water maps, seismic data, well logs, cross-sections, information in unstructured documents, and so on. Thanks to each and every individual that showed up with their expertise and enthusiasm. We're all better off because of it.

A quick reminder that our sponsors are awesome! Please high-five them next time you meet them...

Hard things that look easy

After working on a few data science (aka data analytics aka machine learning) problems with geoscientific data, I think we've figured out the 10-step workflow. I'm happy to share it with you now:

  1. Look at all these cool problems, machine learning can solve all of these! I just need to figure out which model to use, parameterize it, and IT'S GONNA BE AWESOME, WE'LL BE RICH. Let's just have a quick look at the data...
  2. Oh, there's no data.
  3. Three months later: we have data! Oh, the data's a bit messy.
  4. Six months later: wow, cleaning the data is gross and/or impossible. I hate my life.
  5. Finally, nice clean data. Now, which model do I choose? How do I set parameters? At least you expected these problems. These are well-known problems.
  6. Wait, maybe there are physical laws governing this natural system... oh well, the model will learn them.
  7. Hmm, the results are so-so. I guess it's harder to make predictions than I thought it would be.
  8. Six months later: OK, this sort of works. And people think it sounds cool. They just need a quick explanation.
  9. No-one understands what I've done.
  10. Where is everybody?

I'm being facetious of course, but only a bit. Modeling natural systems is really hard. Much harder for the earth than for, say, the human body, which is extremely well-known and readily available for inspection. Even the weather is comparatively easy.

Coupled with the extreme difficulty of the problem, we have a challenging data environment. Proprietary, heterogeneous, poor quality, lost, non-digital... There are lots of ways the data goblins can poop on the playground of machine learning.

If the machine learning lark is so hard, why not just leave it to non-artificial intelligence, that is, to humans? We already learned how to interpret data, right? We know the model takes years to train. Of course, but I don't accept that we couldn't use some of the features of intelligently applied big data analytics: objectivity, transparency, repeatability (by me), reproducibility (by others), massive scale, high speed... maybe even error tolerance and improved decisions, but those seem far off right now.

I also believe that AI models, like any software, can encode the wisdom of professionals — before they retire. This seems urgent, as the long-touted Great Crew Change is finally underway.

What will we work on?

There are lots of fascinating and tractable problems for machine learning to attack in geoscience (I hope many of them get attacked at the hackathon in June) and the next 2 to 3 years are going to be very exciting. There will be the usual marketing mêlée to wade through, but it's up to the community of scientists and data analysts to push their way through that with real results based on open data and, ideally, with open code.

To be sure, this is happening already: we've had over 25 entrants publish their solutions to the SEG machine learning contest, and there will be more like this. It's the only way to build transparent problem-solving systems that we can all participate in and, ultimately, trust.

What machine learning problems are most pressing in geoscience?
I'm collecting ideas for projects to tackle in the hackathon. Please visit this Tricider question and contribute your comments, opinions, or ideas of your own. Help the community work on the problems you care about.