Reproducibility Zoo

repro-zoo-main-banner.png

The Repro Zoo was a new kind of event at the SEG Annual Meeting this year. The goal: to reproduce the results from well-known or important papers in GEOPHYSICS or The Leading Edge. By reproduce, we meant that the code and data should be open and accessible. By results, we meant equations, figures, and other scientific outcomes.

And some of the results are scary enough for Hallowe’en :)

What we did

All the work went straight into GitHub, mostly as Jupyter Notebooks. I had a vague goal of hitting 10 papers at the event, and we achieved this (just!). I’ve since added a couple of other papers, since the inspiration for the work came from the Zoo… and I haven’t been able to resist continuing.

The scene at the Repro Zoo. An air of quiet productivity hung over the booth. Yes, that is Sergey Fomel and Jon Claerbout. Thank you to David Holmes of Dell EMC for the picture.

The scene at the Repro Zoo. An air of quiet productivity hung over the booth. Yes, that is Sergey Fomel and Jon Claerbout. Thank you to David Holmes of Dell EMC for the picture.

Here’s what the Repro Zoo team got up to, in alphabetical order:

  • Aldridge (1990). The Berlage wavelet. GEOPHYSICS 55 (11). The wavelet itself, which has also been added to bruges.

  • Batzle & Wang (1992). Seismic properties of pore fluids. GEOPHYSICS 57 (11). The water properties, now added to bruges.

  • Claerbout et al. (2018). Data fitting with nonstationary statistics, Stanford. Translating code from FORTRAN to Python.

  • Claerbout (1975). Kolmogoroff spectral factorization. Thanks to Stewart Levin for this one.

  • Connolly (1999). Elastic impedance. The Leading Edge 18 (4). Using equations from bruges to reproduce figures.

  • Liner (2014). Long-wave elastic attentuation produced by horizontal layering. The Leading Edge 33 (6). This is the stuff about Backus averaging and negative Q.

  • Luo et al. (2002). Edge preserving smoothing and applications. The Leading Edge 21 (2).

  • Yilmaz (1987). Seismic data analysis, SEG. Okay, not the whole thing, but Sergey Fomel coded up a figure in Madagascar.

  • Partyka et al. (1999). Interpretational aspects of spectral decomposition in reservoir characterization.

  • Röth & Tarantola (1994). Neural networks and inversion of seismic data. Kudos to Brendon Hall for this implementation of a shallow neural net.

  • Taner et al. (1979). Complex trace analysis. GEOPHYSICS 44. Sarah Greer worked on this one.

  • Thomsen (1986). Weak elastic anisotropy. GEOPHYSICS 51 (10). Reproducing figures, again using equations from bruges.

As an example of what we got up to, here’s Figure 14 from Batzle & Wang’s landmark 1992 paper on the seismic properties of pore fluids. My version (middle, and in red on the right) is slightly different from that of Batzle and Wang. They don’t give a numerical example in their paper, so it’s hard to know where the error is. Of course, my first assumption is that it’s my error, but this is the problem with research that does not include code or reference numerical examples.

Figure 14 from Batzle & Wang (1992). Left: the original figure. Middle: My attempt to reproduce it. Right: My attempt in red, overlain on the original.

This was certainly not the only discrepancy. Most papers don’t provide the code or data to reproduce their figures, and this is a well-known problem that the SEG is starting to address. But most also don’t provide worked examples, so the reader is left to guess the parameters that were used, or to eyeball results from a figure. Are we really OK with assuming the results from all the thousands of papers in GEOPHYSICS and The Leading Edge are correct? There’s a long conversation to have here.

What next?

One thing we struggled with was capturing all the ideas. Some are on our events portal. The GitHub repo also points to some other sources of ideas. And there was the Big Giant Whiteboard (below). Either way, there’s plenty to do (there are thousands of papers!) and I hope the zoo continues in spirit. I will take pull requests until the end of the year, and I don’t see why we can’t add more papers until then. At that point, we can start a 2019 repo, or move the project to the SEG Wiki, or consider our other options. Ideas welcome!

IMG_20181017_163926.jpg

Thank you!

The following people and organizations deserve accolades for their dedication to the idea and hard work making it a reality. Please give them a hug or a high five when you see them.

  • David Holmes (Dell EMC) and Chance Sanger worked their tails off on the booth over the weekend, as well as having the neighbouring Dell EMC booth to worry about. David also sourced the amazing Dell tech we had at the booth, just in case anyone needed 128GB of RAM and an NVIDIA P5200 graphics card for their Jupyter Notebook. (The lights in the convention centre actually dimmed when we powered up our booths in the morning.)

  • Luke Decker (UT Austin) organized a corps of volunteer Zookeepers to help manage the booth, and provided enthusiasm and coding skills. Karl Schleicher (UT Austin), Sarah Greer (MIT), and several others were part of this effort.

  • Andrew Geary (SEG) for keeping things moving along when I became delinquent over the summer. Lots of others at SEG also helped, mainly with the booth: Trisha DeLozier, Rebecca Hayes, and Beth Donica all contributed.

  • Diego Castañeda got the events site in shape to support the Repro Zoo, with a dashboard showing the latest commits and contributors.

Reproduce this!

logo_simple.png

There’s a saying in programming: untested code is broken code. Is unreproducible science broken science?

I hope not, because geophysical research is — in general — not reproducible. In other words, we have no way of checking the results. Some of it, hopefully not a lot of it, could be broken. We have no way of knowing.

Next week, at the SEG Annual Meeting, we plan to change that. Well, start changing it… it’s going to take a while to get to all of it. For now we’ll be content with starting.

We’re going to make geophysical research reproducible again!

Welcome to the Repro Zoo!

If you’re coming to SEG in Anaheim next week, you are hereby invited to join us in Exposition Hall A, Booth #749.

We’ll be finding papers and figures to reproduce, equations to implement, and data tables to digitize. We’ll be hunting down datasets, recreating plots, and dissecting derivations. All of it will be done in the open, and all the results will be public and free for the community to use.

You can help

There are thousands of unreproducible papers in the geophysical literature, so we are going to need your help. If you’ll be in Anaheim, and even if you’re not, here some things you can do:

That’s all there is to it! Whether you’re a coder or an interpreter, whether you have half an hour or half a day, come along to the Repro Zoo and we’ll get you started.

Figure 1 from Connolly’s classic paper on elastic impedance. This is the kind of thing we’ll be reproducing.

Figure 1 from Connolly’s classic paper on elastic impedance. This is the kind of thing we’ll be reproducing.

Tools for drawing geoscientific figures

This is a response to Boyan Vakarelov's useful post on LinkedIn about tools for creating geological figures. I especially liked his SketchUp tip.

It's a while since we wrote about our toolset, so I thought I'd document what we're currently using for making figures. You won't be surprised to hear that they're mostly open source. 

Our figure creation toolbox

  • QGIS — if it's a map, you should make it in a GIS, it's as simple as that.
  • Inkscape — for most drawing and figure creation tasks. It's just as good as Illustrator.
  • GIMP — for raster editing tasks. Rasters are no good for editable figures or line art though.
  • TimeScale Creator — a little-known tool for making editable chronostratigraphic columns. Here's an example from way back on this very blog. The best thing: you can export SVG files, then edit them in Inkscape.
  • Python, R, etc. — the best way to make reproducible scientific figures is not to draw them at all. Instead, create data visualizations programmatically.

To really appreciate how fantastic the programmatic approach is, check out Sergey Fomel's treasure trove of reproducible documents, in which every figure is really just the output of a little program that anyone can run. Here's one of my own, adapted from a previous post and a sneak peek of an upcoming Leading Edge tutorial:

Different sample interpolation styles give different amplitudes for inter-sample positions, as shown at the red 'horizon' time pick. From upcoming tutorial in the April edition of The Leading Edge

Everything you wanted to know about images

Screenshots often form part of a figure, because they're so much easier than trying to figure out how to export an image, or trying to wrangle the data from scratch. If you find yourself grabbing a screenshot, and any time you're providing an image for someone else — especially if it's destined for print — you need to know all about image resolution. Read my post Save the samples for my advice. 

If you still save your images as JPEG, you also need to read my post about How to choose an image format. One day you might need the fidelity you are throwing away! Here's the short version: save everything as a PNG.

Last thing: know the difference between vector and raster graphics. Make vectors when you can.

Stop using PowerPoint!

The only bit of Boyan's post I didn't like was the bit about PowerPoint. I admit, fifteen years ago I was a bit of a slave to PowerPoint. I'd have preferred to use Illustrator at the time, but it was well beyond corporate IT's ken, and I hadn't yet discovered Inkscape. But I'm over it now — and just as well because it's a horrible drawing tool. The main limitation is not having layers, which is a show-stopper for me, but there's also the generic typography, simplistic spline editing, the inability to handle standard formats like SVG, and no scripting or plug-ins.

Getting good

If you want to learn about making effective scientific figures, I strongly recommend reading anything you can by Edward Tufte, Robert Kosara, Alberto Cairo, and Mike Bostock. For some quick inspiration check out the #dataviz hashtag on Twitter, or feast your eyes on this amazing collection of graphics, or Mike Bostock's interactive examples, or... there are too many resources to choose from.

How about you? Share your favourite tools in the comments or on Boyan's post.

The Blangy equation

After reading Chris Liner's recent writings on attenuation and negative Q — both in The Leading Edge and on his blog — I've been reading up a bit on anisotropy. The idea was to stumble a little closer to writing the long-awaited Q is for Q post in our A to Z series. As usual, I got distracted...

In his 1994 paper AVO in tranversely isotropic media—An overview, Blangy (now the chief geophysicist at Hess) answered a simple question: How does anisotropy affect AVO? Stigler's law notwithstanding, I'm calling his solution the Blangy equation. The answer turns out to be: quite a bit, especially if impedance contrasts are low. In particular, Thomsen's parameter δ affects the AVO response at all offsets (except zero of course), while ε is relatively negligible up to about 30°.

The key figure is Figure 2. Part (a) shows isotropic vs anisotropic Type I, Type II, and Type III responses:

Unpeeling the equation

Converting the published equation to Python was straightforward (well, once Evan pointed out a typo — yay editors!). Here's a snippet, with the output (here's all of it):

For the plot below, I computed the terms of the equation separately for the Type II case. This way we can see the relative contributions of the terms. Note that the 3-term solution is equivalent to the Aki–Richards equation.

Interestingly, the 5-term result is almost the same as the 2-term approximation.

Reproducible results

One of the other nice features of this paper — and the thing that makes it reproducible — is the unambiguous display of the data used in the models. Often, this sort of thing is buried in the text, or not available at all. A table makes it clear:

Last thought: is it just me, or is it mind-blowing that this paper is now over 20 years old?

Reference

Blangy, JP (1994). AVO in tranversely isotropic media—An overview. Geophysics 59 (5), 775–781.

Don't miss the IPython Notebook that goes with this post.

Why we should embrace openness

Openness—open ideas, open data, open teams—can help us build more competitive, higher performing, more sutainable organizations in this industry.

Last week I took this message to the annual convention of the three big applied geoscience organizations in Canada: the Canadian Society of Petroleum Geologists (CSPG), the Canadian Society of Exploration Geophysicist (CSEG), and the Canadian Well Logging Society (CWLS). Evan and I attended the conference as scientists, but also experimented a bit with live tweeting and event blogging.

The talk was a generalization of the talk I did in March about open source software in geoscience. I wasn't sure at all how it would go over, and spent most of the morning sitting in technical talks fretting about how flaky and meta my talk would sound. But it went quite well, and at least served as some light relief from the erudition in the rest of the agenda. It was certainly fun to give an opinion-filled talk, and it started plenty of conversations afterwards.

You can access a PDF of the visuals, with commentary, from the thumbnail (left).

What do you think? Is a competitive, secretive industry like oil and gas capable of seeing value in openness? Might regulators eventually force us to share more as the resources society demands become scarcer? Or are we doomed to more mistrust and secrecy as oil and gas become more expensive to produce?

← Click the image for the PDF (6.8M)

Geo-FLOSS

Newton didn't need open source, so why do you?Free and open source software is catalyzing a revolution in subsurface science. As a key part of the growing movement to open access to data, information, and the very process of doing science, open software is not just for the geeks. It's a party we're all invited to. 

I have been in California this week, attending a conference in Long Beach called Mathematical and Computational Issues in the Geosciences, organized by the Society of Industrial and Applied Mathematicians. In 2009 I started being more active in my search for lectures and courses that lie outside my usual comfort zone. I have done courses in reservoir engineering and Java programming. I have heard talks on radiology and financial forecasting. It's like being back at university; I like it.

How did I end up at this conference? Last spring, I wrote a little review article about open source software (available here at dGB Earth Science's site). It was really just a copy-edited version of notes I had made whilst looking for free geoscience software and reading up on the subject for my own interest. After some brushes with open source, I was curious about the history behind the idea, how projects are built, and how they are licensed. At the same time, I also started a couple of Wikipedia articles about free software in geology and geophysics, as a place to list the projects I had come across. Kristin Flornes, of IRIS in Stavanger, Norway, saw the article and her colleagues got in touch about the conference.

The talk, which you can access via the thumbnail (left) or look at in Google Docs, is part FLOSS primer, part geo-FLOSS advert, part manifesto for a revolution of innovation. I hope the speaker notes are sufficient. 

What do you think? Is software availability or architecture or capable of driving change, or is it just a tool, passive and inert?

← Click the image for the PDF (6.9M)