Reproducibility Zoo


The Repro Zoo was a new kind of event at the SEG Annual Meeting this year. The goal: to reproduce the results from well-known or important papers in GEOPHYSICS or The Leading Edge. By reproduce, we meant that the code and data should be open and accessible. By results, we meant equations, figures, and other scientific outcomes.

And some of the results are scary enough for Hallowe’en :)

What we did

All the work went straight into GitHub, mostly as Jupyter Notebooks. I had a vague goal of hitting 10 papers at the event, and we achieved this (just!). I’ve since added a couple of other papers, since the inspiration for the work came from the Zoo… and I haven’t been able to resist continuing.

The scene at the Repro Zoo. An air of quiet productivity hung over the booth. Yes, that is Sergey Fomel and Jon Claerbout. Thank you to David Holmes of Dell EMC for the picture.

Here’s what the Repro Zoo team got up to, in alphabetical order:

  • Aldridge (1990). The Berlage wavelet. GEOPHYSICS 55 (11). The wavelet itself, which has also been added to bruges. (See the sketch after this list.)

  • Batzle & Wang (1992). Seismic properties of pore fluids. GEOPHYSICS 57 (11). The water properties, now added to bruges.

  • Claerbout et al. (2018). Data fitting with nonstationary statistics, Stanford. Translating code from FORTRAN to Python.

  • Claerbout (1975). Kolmogoroff spectral factorization. Thanks to Stewart Levin for this one.

  • Connolly (1999). Elastic impedance. The Leading Edge 18 (4). Using equations from bruges to reproduce figures.

  • Liner (2014). Long-wave elastic attenuation produced by horizontal layering. The Leading Edge 33 (6). This is the stuff about Backus averaging and negative Q.

  • Luo et al. (2002). Edge preserving smoothing and applications. The Leading Edge 21 (2).

  • Partyka et al. (1999). Interpretational applications of spectral decomposition in reservoir characterization. The Leading Edge 18 (3).

  • Röth & Tarantola (1994). Neural networks and inversion of seismic data. Kudos to Brendon Hall for this implementation of a shallow neural net.

  • Taner et al. (1979). Complex trace analysis. GEOPHYSICS 44. Sarah Greer worked on this one.

  • Thomsen (1986). Weak elastic anisotropy. GEOPHYSICS 51 (10). Reproducing figures, again using equations from bruges.

  • Yilmaz (1987). Seismic data analysis, SEG. Okay, not the whole thing, but Sergey Fomel coded up a figure in Madagascar.
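
Since the Berlage wavelet is a simple closed-form expression, here's a minimal sketch of it in NumPy, following the parameterization in Aldridge (1990). The parameter defaults are purely illustrative, and the version that landed in bruges may differ; treat this as a starting point, not the library function.

```python
import numpy as np

def berlage(duration=0.25, dt=0.001, f=30.0, n=2, alpha=180.0, phi=-np.pi/2):
    """Berlage wavelet: w(t) = H(t) t^n exp(-alpha t) cos(2 pi f t + phi)."""
    t = np.arange(0, duration, dt)  # t >= 0 only, so H(t) = 1 throughout
    w = t**n * np.exp(-alpha * t) * np.cos(2 * np.pi * f * t + phi)
    return t, w / np.abs(w).max()   # normalize the peak amplitude to 1

t, w = berlage()  # ready to plot, or to convolve with a reflectivity series
```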

As an example of what we got up to, here’s Figure 14 from Batzle & Wang’s landmark 1992 paper on the seismic properties of pore fluids. My version (middle, and in red on the right) is slightly different from that of Batzle and Wang. They don’t give a numerical example in their paper, so it’s hard to know where the error is. Of course, my first assumption is that it’s my error, but this is the problem with research that does not include code or reference numerical examples.

Figure 14 from Batzle & Wang (1992). Left: the original figure. Middle: My attempt to reproduce it. Right: My attempt in red, overlain on the original.

This was certainly not the only discrepancy. Most papers don’t provide the code or data to reproduce their figures, and this is a well-known problem that the SEG is starting to address. But most also don’t provide worked examples, so the reader is left to guess the parameters that were used, or to eyeball results from a figure. Are we really OK with assuming the results from all the thousands of papers in GEOPHYSICS and The Leading Edge are correct? There’s a long conversation to have here.
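
To give a flavour of what reproducing equations involves, here's the density of pure water as I transcribe it from Batzle & Wang's paper (their equation 27a). I may well have a coefficient wrong, which is exactly the point: without reference numerical examples, slips like that are hard to catch.

```python
def rho_water(T, P):
    """Density of pure water in g/cm3, after Batzle & Wang (1992), eq. 27a.

    T is temperature in degrees Celsius, P is pressure in MPa.
    Transcribed by hand from the paper, so verify before using in anger.
    """
    return 1 + 1e-6 * (-80*T - 3.3*T**2 + 0.00175*T**3 + 489*P
                       - 2*T*P + 0.016*T**2*P - 1.3e-5*T**3*P
                       - 0.333*P**2 - 0.002*T*P**2)

print(rho_water(T=50, P=20))  # about 0.996 g/cm3
```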

What next?

One thing we struggled with was capturing all the ideas. Some are on our events portal. The GitHub repo also points to some other sources of ideas. And there was the Big Giant Whiteboard (below). Either way, there’s plenty to do (there are thousands of papers!) and I hope the zoo continues in spirit. I will take pull requests until the end of the year, and I don’t see why we can’t add more papers until then. At that point, we can start a 2019 repo, or move the project to the SEG Wiki, or consider our other options. Ideas welcome!

The Big Giant Whiteboard at the Repro Zoo.

Thank you!

The following people and organizations deserve accolades for their dedication to the idea and hard work making it a reality. Please give them a hug or a high five when you see them.

  • David Holmes (Dell EMC) and Chance Sanger worked their tails off on the booth over the weekend, as well as having the neighbouring Dell EMC booth to worry about. David also sourced the amazing Dell tech we had at the booth, just in case anyone needed 128GB of RAM and an NVIDIA P5200 graphics card for their Jupyter Notebook. (The lights in the convention centre actually dimmed when we powered up our booths in the morning.)

  • Luke Decker (UT Austin) organized a corps of volunteer Zookeepers to help manage the booth, and provided enthusiasm and coding skills. Karl Schleicher (UT Austin), Sarah Greer (MIT), and several others were part of this effort.

  • Andrew Geary (SEG) kept things moving along when I became delinquent over the summer. Lots of others at SEG also helped, mainly with the booth: Trisha DeLozier, Rebecca Hayes, and Beth Donica all contributed.

  • Diego Castañeda got the events site in shape to support the Repro Zoo, with a dashboard showing the latest commits and contributors.

Volve: not open after all

Back in June, Equinor made the bold and exciting decision to release all its data from the decommissioned Volve oil field in the North Sea. Although the intent of the release seemed clear, the dataset did not carry a license of any kind. Since you cannot use unlicensed content without permission, this was a problem. I wrote about this at the time.

To its credit, Equinor listened to the concerns from me and others, and considered its options. Sensibly, it chose an off-the-shelf license. It announced its decision a few days ago, and the dataset now carries a Creative Commons Attribution-NonCommercial-ShareAlike license.

Unfortunately, this license is not ‘open’ by any reasonable definition. The non-commercial stipulation means that a lot of people, perhaps most people, will not be able to legally use the data (which is why non-commercial licenses are not open licenses). And the ShareAlike part means that we’re in for some interesting discussion about what derived products are, because any work based on Volve will have to carry the CC BY-NC-SA license too.

Non-commercial licenses are not open

Here are some of the problems with the non-commercial clause:

  • NC licenses come at a high societal cost: they provide broad protection for the copyright owner, but strongly limit the potential for re-use, collaboration, and sharing in ways unexpected by many users.

  • NC licenses are incompatible with CC-BY-SA. This means that the data cannot be used on Wikipedia, SEG Wiki, or AAPG Wiki, or in any openly licensed work carrying that license.

  • NC-licensed data cannot be used commercially. This is obvious, but far-reaching. It means, for example, that nobody can use the data in a course or event for which they charge a fee. It means nobody can use the data as a demo or training data in commercial software. It means nobody can use the data in a book that they sell.

  • The boundaries of the license are unclear. It's arguable whether any business can use the data for any purpose at all, because many of the boundaries of the scope have not been tested legally. What about a course run by AAPG or SEG? What about a private university? What about a government, if it stands to realize monetary gain from, say, a land sale? All of these uses would be illegal, because it's the use that matters, not the commercial status of the user.

Now, it seems likely, given the language around the release, that Equinor will not sue people for most of these use cases. They may even say this. Goodness knows, we have enough nudge-nudge-wink-wink agreements like that already in the world of subsurface data. But these arrangements just shift the onus onto the end user and, as we've seen with GSI, things can change and one day you wake up with lawsuits.

ShareAlike means you must share too

Creative Commons licenses are, as the name suggests, intended for works of creativity. Indeed, the whole concept of copyright depends on creativity: copyright protects works of creative expression. If there's no creativity, there's no basis for copyright. So, for example, a gamma-ray log is unlikely to be copyrightable, but seismic data is (follow the GSI link above to find out why). Non-copyrightable works are not covered by Creative Commons licenses.

All of which is just to help explain some of the language in the CC BY-NC-SA license agreement, which you should read. But the key part is in paragraph 4(b):

You may distribute, publicly display, publicly perform, or publicly digitally perform a Derivative Work only under the terms of this License

What’s a ‘derivative work’? It’s anything ‘based upon’ the licensed material, which is pretty vague and therefore all-encompassing. In short, if you use or show Volve data in your work, no matter how non-commercial it is, then you must attach a CC BY-NC-SA license to your work. This is why SA licenses are sometimes called ‘viral’.

By the way, the much-loved F3 and Penobscot datasets also carry the ShareAlike clause, so any work (e.g. a scientific paper) that uses them is open-access and carries the CC BY-SA license, whether the author of that work likes it or not. I’m pretty sure no-one in academic publishing knows this.

By the way again, everything in Wikipedia is CC BY-SA too. Maybe go and check your papers and presentations now :)


What should Equinor do?

My impression is that Equinor is trying to satisfy some business partner or legal edge case, but they are forgetting that they have orders of magnitude more capacity to deal with edge cases than the potential users of the dataset do. The principle at work here should be “Don’t solve problems you don’t have”.

Encumbering this amazing dataset with such tight restrictions effectively kills it. It more or less guarantees it cannot have the impact I assume they were looking for. I hope they reconsider their options. The best choice for any open data is CC-BY.

Reproduce this!


There’s a saying in programming: untested code is broken code. Is unreproducible science broken science?

I hope not, because geophysical research is — in general — not reproducible. In other words, we have no way of checking the results. Some of it, hopefully not a lot of it, could be broken. We have no way of knowing.

Next week, at the SEG Annual Meeting, we plan to change that. Well, start changing it… it’s going to take a while to get to all of it. For now we’ll be content with starting.

We’re going to make geophysical research reproducible again!

Welcome to the Repro Zoo!

If you’re coming to SEG in Anaheim next week, you are hereby invited to join us in Exposition Hall A, Booth #749.

We’ll be finding papers and figures to reproduce, equations to implement, and data tables to digitize. We’ll be hunting down datasets, recreating plots, and dissecting derivations. All of it will be done in the open, and all the results will be public and free for the community to use.

You can help

There are thousands of unreproducible papers in the geophysical literature, so we are going to need your help. If you'll be in Anaheim, and even if you're not, here are some things you can do:

That’s all there is to it! Whether you’re a coder or an interpreter, whether you have half an hour or half a day, come along to the Repro Zoo and we’ll get you started.

Figure 1 from Connolly's classic paper on elastic impedance. This is the kind of thing we'll be reproducing.

EarthArXiv wants your preprints


If you're into science, and especially physics, you've heard of arXiv, which has revolutionized how research in physics is shared. bioRxiv, SocArXiv, and PaleorXiv followed, among others*.

Well, get excited, because today, at last, there is an open preprint server especially for earth science — EarthArXiv has landed!

I could write a long essay about how great this news is, but the best way to get the full story is to listen to two of the founders — Chris Jackson (Imperial College London and fellow University of Manchester alum) and Tom Narock (University of Maryland, Baltimore) — on Undersampled Radio this morning:

Congratulations to Chris and Tom, and everyone involved in EarthArXiv!

  • Friedrich Hawemann, ETH Zurich, Switzerland
  • Daniel Ibarra, Earth System Science, Stanford University, USA
  • Sabine Lengger, University of Plymouth, UK
  • Angelo Pio Rossi, Jacobs University Bremen, Germany
  • Divyesh Varade, Indian Institute of Technology Kanpur, India
  • Chris Waigl, University of Alaska Fairbanks, USA
  • Sara Bosshart, International Water Association, UK
  • Alodie Bubeck, University of Leicester, UK
  • Allison Enright, Rutgers - Newark, USA
  • Jamie Farquharson, Université de Strasbourg, France
  • Alfonso Fernandez, Universidad de Concepcion, Chile
  • Stéphane Girardclos, University of Geneva, Switzerland
  • Surabhi Gupta, UGC, India

Don't underestimate how important this is for earth science. Indeed, there's another new preprint server coming to the earth sciences in 2018, as the AGU — with Wiley! — prepares to launch ESSOAr. Not as a competitor to EarthArXiv (I hope), but as another piece in the rich open-access ecosystem of reproducible geoscience that's developing. (By the way, AAPG, SEG, SPE: you need to support these initiatives. They want to make your content more relevant and accessible!)

It's very, very exciting to see this new piece of infrastructure for open access publishing. I urge you to join in! You can submit all your published work to EarthArXiv — as long as the journal's policy allows it — so you should make sure your research gets into the hands of the people who need it.

I hope every conference from now on has an EarthArXiv Your Papers party. 


* Including snarXiv, don't miss that one!

GeoConvention highlights

We were in Calgary last week at the Canada GeoConvention 2017. The quality of the talks seemed more variable than usual but, as usual, there were some gems in there too. Here are our highlights from the technical talks...

Filling in gaps

Mauricio Sacchi (University of Alberta) outlined a new reconstruction method for vector field data. In other words, filling in gaps in multicomponent seismic records. I've got a soft spot for Mauricio's relaxed speaking style and the simplicity with which he presents linear algebra, but there are two other reasons that make this talk worthy of a shout-out:

  1. He didn't just show equations in his talk, he used pseudocode to show the algorithm.
  2. He linked to his lab's seismic processing toolkit, SeismicJulia, on GitHub.

I am sure he'd be the first to admit that it is early days for this library and it is very much under construction. But what isn't? All the more reason to showcase it openly. We all need a lot more of that.

Update on 2017-06-07 13:45 by Evan Bianco: Mauricio has posted the slides from his talk.

Learning about errors

Anton Biryukov (University of Calgary & graduate intern at Nexen) gave a great talk in the induced seismicity session. It was a lovely mashing-together of three of our favourite topics: seismology, machine learning, and uncertainty. Anton is researching how to improve microseismic and earthquake event detection by framing it as a machine-learning classification problem. He's using Monte Carlo methods to compute myriad synthetic seismic events by making small velocity variations, and then using those synthetic events to teach a model how to be more accurate about locating earthquakes.

Figure 2 from Anton Biryukov's abstract. An illustration of the signal classification concept. The signals originating from the locations on the grid (a) are then transformed into a feature space and labeled by the class containing the event origin. From Biryukov (2017). Event origin depth uncertainty - estimation and mitigation using waveform similarity. Canada GeoConvention, May 2017.
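
For a feel for the approach, here's a heavily simplified sketch of the idea (mine, not Anton's code): perturb the velocity model, generate synthetic arrival times for sources on a grid, and train a classifier to map arrival-time features back to the source cell. Straight rays and a random forest are stand-ins for whatever he actually used.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
receivers = rng.uniform(0, 1000, size=(8, 2))  # 8 receivers, (x, z) in metres
sources = np.array([(x, z) for x in (250, 500, 750) for z in (250, 500, 750)])

X, y = [], []
for label, src in enumerate(sources):
    for _ in range(200):                                 # Monte Carlo realizations
        v = 2000 * (1 + 0.05 * rng.standard_normal())    # ~5% velocity perturbation
        t = np.linalg.norm(receivers - src, axis=1) / v  # straight-ray travel times
        X.append(t)
        y.append(label)                                  # label = source grid cell

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
```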

The bright lights of geothermal energy
Matt Hall

Two interesting sessions clashed on Wednesday afternoon. I started off in the Value of Geophysics panel discussion, but left after James Lamb's report from the mysterious Chief Geophysicists' Forum. I had long wondered what went on in that secretive organization; it turns out they mostly worry about how to make important people like your CEO think geophysics is awesome. But the large room was a little dark, and — in keeping with the conference in general — so was the mood.

Feeling a little down, I went along to the Diversification of the Energy Industry session instead. The contrast was abrupt and profound. The bright room was totally packed with a conspicuously young audience numbering well over 100. The mood was hopeful, exuberant even. People were laughing, but not wistfully or ironically. I think I saw a rainbow over the stage.

If you missed this uplifting session but are interested in contributing to Canada's geothermal energy scene, which will certainly need geoscientists and reservoir engineers if it's going to get anywhere, there are plenty of ways to find out more or get involved. Start at cangea.ca and follow your nose.

We'll be writing more about the geothermal scene — and some of the other themes in this post — so stay tuned. 



SEG machine learning contest: there's still time

Have you been looking for an excuse to find out what machine learning is all about? Or maybe learn a bit of the Python programming language? If so, you need to check out Brendon Hall's tutorial in the October issue of The Leading Edge. Entitled "Facies classification using machine learning", it's a walk-through of a basic statistical learning workflow, applied to a small dataset from the Hugoton gas field in Kansas, USA.

But it was also the launch of a strictly fun contest to see who can get the best prediction from the available data. The rules are spelled out in the contest's README, but in a nutshell, you can use any reproducible workflow you like in Python, R, Julia or Lua, and you must disclose the complete workflow. The idea is that contestants can learn from each other.

Left: crossplots and histograms of wireline log data, coloured by facies — the idea is to highlight possible data issues, such as highly correlated features. Right: true facies (left) and predicted facies (right) in a validation plot. See the rest of the paper for details.

What's it all about?

The task at hand is to predict sedimentological facies from well logs. Such log-derived facies are sometimes called e-facies. This is a familiar task to many development geoscientists, and there are many, many ways to go about it. In the article, Brendon trains a support vector machine to discriminate between facies. It does a fair job, but the accuracy of the result is less than 50%. The challenge of the contest is to do better.
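
If you want a feel for what that workflow looks like, here's a minimal sketch. It is not Brendon's actual notebook, and the CSV filename and log names are from memory, so check the contest repo before trusting them.

```python
import pandas as pd
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

df = pd.read_csv('facies_vectors.csv').dropna()  # filename as I recall it
features = ['GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'PE']

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df['Facies'], test_size=0.2, random_state=0)

model = make_pipeline(StandardScaler(), SVC())   # scale the logs, then fit the SVM
model.fit(X_train, y_train)

# The contest reports F1; the 'micro' averaging here is my guess.
print(f1_score(y_test, model.predict(X_test), average='micro'))
```

Swapping SVC() for, say, RandomForestClassifier() is a one-line change, which is part of what makes the contest such a good learning exercise.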

Indeed, people have already done better; here are the current standings:

#   Team         F1     Algorithm      Language  Solution
1   gccrowther   0.580  Random forest  Python    Notebook
2   LA_Team      0.568  DNN            Python    Notebook
3   gganssle     0.561  DNN            Lua       Notebook
4   MandMs       0.552  SVM            Python    Notebook
5   thanish      0.551  Random forest  R         Notebook
6   geoLEARN     0.530  Random forest  Python    Notebook
7   CannedGeo    0.512  SVM            Python    Notebook
8   BrendonHall  0.412  SVM            Python    Initial score in article

As you can see, DNNs (deep neural networks) are, in keeping with the amazing recent advances in the problem-solving capability of this technology, doing very well on this task. Of the 'shallow' methods, random forests are quite prominent, and indeed are a great first stop for classification problems as they tend to do quite well with little tuning.

How do I enter?

There are still over six weeks to enter: you have until 31 January. There is a little overhead — you need to learn a bit about git and GitHub, there's some programming, and of course machine learning is a massive field to get up to speed on — but don't be discouraged. The very first entry was from Bryan Page, a self-described non-programmer who dusted off some basic skills to improve on Brendon's notebook. But you can run the notebook right here in mybinder.org (if it's up today — it's been a bit flaky lately) and play around with a few parameters yourself.

The contest aspect is definitely low-key. There's no money on the line — just a goody bag of fun prizes and a shedload of kudos that will surely get the winners into some awesome geophysics parties. My hope is that it will encourage you (yes, you) to have fun playing with data and code, trying to do that magical thing: predict geology from geophysical data.


Reference

Hall, B (2016). Facies classification using machine learning. The Leading Edge 35 (10), 906–909. doi: 10.1190/tle35100906.1. (This paper is open access: you don't have to be an SEG member to read it.)

52 Things... Rock Physics

There's a new book in the 52 Things family! 

52 Things You Should Know About Rock Physics is out today, and available for purchase at Amazon.com. It will appear in their European stores in the next day or two, and in Canada... well, soon. If you can't wait for that, you can buy the book immediately direct from the printer by following this link.

The book mines the same vein as the previous volumes. In some ways, it's a volume 2 of the original 52 Things... Geophysics book, just a little bit more quantitative. It features a few of the same authors — Sven Treitel, Brian Russell, Rachel Newrick, Per Avseth, and Rob Simm — but most of the 46 authors are new to the project. Here are some of the first-timers' essays:

  • Ludmilla Adam, Why echoes fade.
  • Arthur Cheng, How to catch a shear wave.
  • Peter Duncan, Mapping fractures.
  • Paul Johnson, The astonishing case of non-linear elasticity.
  • Chris Liner, Negative Q.
  • Chris Skelt, Five questions to ask the petrophysicist.

It's our best collection of essays yet. We're very proud of the authors and the collection they've created. It stretches from childhood stories to linear algebra, and from the microscope to seismic data. There's no technical book like it. 

Supporting Geoscientists Without Borders

Purchasing the book will not only bring you profound insights into rock physics — there's more! Every sale sends $2 to Geoscientists Without Borders, the SEG charity that supports the humanitarian application of geoscience in places that need it. Read more about their important work.

It's been an extra big effort to get this book out. The project was completely derailed in 2015, as we — like everyone else — struggled with some existential questions. But we jumped back into it earlier this year, and Kara (the managing editor, and my wife) worked her magic. She loves working with the authors on proofs and so on, but she doesn't want to see any more equations for a while.

If you choose to buy the book, I hope you enjoy it. If you enjoy it, I hope you share it. If you want to share it with a lot of people, get in touch — we can help. Like the other books, the content is open access — so you are free to share and re-use it as you wish. 

ORCL vs GOOG: the $9 billion API

What's this? Two posts about the legal intricacies of copyright in the space of a fortnight? Before you unsubscribe from this definitely-not-a-law-blog, please read on because the case of Oracle America, Inc vs Google, Inc is no ordinary copyright fight. For a start, the damages sought by Oracle in this case [edit: could] exceed $9 billion. And if they win, all hell is going to break loose.

The case is interesting for some other reasons besides the money and the hell breaking loose thing. I'm mostly interested in it because it's about open source software. Specifically, it's about Android, Google's open source mobile operating system. The claim is that the developers of Android copied 37 application programming interfaces, or APIs, from the Java software environment that Sun released in 1995 and Oracle acquired in its $7.4 billion acquisition of Sun in 2010. There were also claims that they copied specific code, not just the interface the code presents to the user, but it's the API bit that's interesting.

What's an API then?

You might think of software in terms of applications like the browser you're reading this in, or the seismic interpretation package you use. But this is just one, very high-level, type of software. Other, much lower-level software runs your microwave. Developers use software to build software; these middle levels contain FORTRAN libraries for tasks like signal processing, tools for making windows and menus appear, and still others for drawing, or checking spelling, or drawing shapes. You can think of an API like a user interface for programmers. Where the user interface in a desktop application might have menus and dialog boxes, the interface for a library has classes and methods — pieces of code that hold data or perform tasks. A good API can be a pleasure to use. A bad API can make grown programmers cry. Or at least make them write a new library.

The Android developers didn't think the Java API was bad. In fact, they loved it. They tried to license it from Sun in 2007 and, when Sun was bought by Oracle, from Oracle. When this didn't work out, they locked themselves in a 'cleanroom' and wrote a runtime environment called Dalvik. It implemented the same API as the Java Virtual Machine, but with new code. The question is: does Oracle own the interface — the method names and syntaxes? Are APIs copyrightable?
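
To see why this question matters, here's a toy illustration in Python (mine, nothing to do with Java or the code at issue). The two classes below share an API, that is, the same method names and signatures, but the code behind them is completely different, which is essentially what Dalvik was to the Java Virtual Machine.

```python
class Stack:                        # the 'original' implementation
    def __init__(self):
        self._items = []
    def push(self, x):
        self._items.append(x)
    def pop(self):
        return self._items.pop()

class CleanroomStack:               # an independent rewrite of the same API
    def __init__(self):
        self._head = None
    def push(self, x):
        self._head = (x, self._head)     # a linked list instead of an array
    def pop(self):
        x, self._head = self._head
        return x

# Client code can't tell them apart; that interchangeability is the point of an API.
for stack in (Stack(), CleanroomStack()):
    stack.push(42)
    assert stack.pop() == 42
```

The lawsuit asks whether the names and signatures themselves can be owned, independently of either implementation.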

I thought this case ended years ago?

It did. Google already won the argument once, on 31 May 2012, when the court held that APIs are "a system or method of operations" and therefore not copyrightable. Here's the conclusion of that ruling:

The original 2012 holding that Google did not violate the Copyright Act by copying 37 of Java's interfaces. Click for the full PDF.

But it went to the Federal Circuit Court of Appeals, Google's petition for 'fair use' was denied, and the decision was sent back to the district court for a jury trial to decide on Google's defence. So now the decision will be made by 10 ordinary citizens... none of whom know anything about programming. (There was a computer scientist in the pool, but Oracle sent him home. It's okay: Google sent a free-software hater packing.)

This snippet from one of my favourite podcasts, Leo Laporte's Triangulation, is worth watching. Leo is interviewing James Gosling, the creator of Java, who was involved in some of the early legal discovery process...

Why do we care about this?

The problem with all this is that, when it comes to open source software and the Internet, APIs make the world go round. As the Electronic Frontier Foundation argued on behalf of 77 computer scientists (including Alan Kay, Vint Cerf, Hal Abelson, Ray Kurzweil, Guido van Rossum, and Peter Norvig) in its amicus brief for the Supreme Court... we need uncopyrightable interfaces to get computers to cooperate. This is what drove the personal computer explosion of the 1980s, the Internet explosion of the 1990s, and the cloud computing explosion of the 2000s, and most people seem to think those were awesome. The current bot explosion also depends on APIs, but the jury is out (lol) on how awesome that one is.

The trial continues. Google concluded its case yesterday, and Oracle called its first witness, co-CEO Safra Catz. "We did not buy Sun to file this lawsuit," she said. Reassuring, but if they win there's going to be a lot of that going around. A lot.

For a much more in-depth look at the story behind the trial, this epic article by Sarah Jeong is awesome. Follow the rest of the events over the next few days on Ars Technica, Twitter, or wherever you get your news. Meanwhile on Agile*, we will return to normal geophysical programming, I promise :)


ADDENDUM on 26 May 2016... Google won the case with the "fair use" argument. So the appeal court's decision that APIs are copyrightable stands, but the jury were persuaded that this particular instance qualified as fair use. Oracle will appeal.

Copyright and seismic data

Seismic company GSI has sued a lot of organizations recently for sharing its copyrighted seismic data, undermining its business. A recent court decision found that seismic data is indeed copyrightable, but Canadian petroleum regulations can override the copyright. This allows data to be disclosed by the regulator and copied by others — made public, effectively.


Seismic data is not like other data

Data is uncopyrightable. Like facts and ideas, data is considered objective, uncreative — too cold to copyright. But in an important ruling last month, the Honourable Madam Justice Eidsvik established in the Alberta Court of Queen's Bench that seismic data is not like ordinary data. According to this ruling:

...the creation of field and processed [seismic] data requires the exercise of sufficient skill and judgment of the seismic crew and processors to satisfy the requirements of [copyrightability].

These requirements were established in the case of accounting firm CCH Canadian Limited vs The Law Society of Upper Canada (2004) in the Supreme Court of Canada. Quoting from that ruling:

What is required to attract copyright protection in the expression of an idea is an exercise of skill and judgment. By skill, I mean the use of one’s knowledge, developed aptitude or practised ability in producing the work. By judgment, I mean the use of one’s capacity for discernment or ability to form an opinion or evaluation by comparing different possible options in producing the work.

Interestingly:

There exist no cases expressly deciding whether Seismic Data is copyrightable under the American Copyright Act [in the US].

Fortunately, Justice Eidsvik added this remark to her ruling — just in case there was any doubt:

I agree that the rocks at the bottom of the sea are not copyrightable.

It's really worth reading through some of the ruling, especially sections 7 and 8, entitled Ideas and facts are not protected and Trivial and purely mechanical respectively. 

Why are we arguing about this?

This recent ruling about seismic data was the result of an action brought by Geophysical Service Incorporated against pretty much anyone they could accuse of infringing their rights in their offshore seismic data, by sharing it or copying it in some way. Specifically, the claim was that data they had been required to submit to regulators like the C-NLOPB and the C-NSOPB was improperly shared, undermining its business of shooting seismic data on spec.

You may not have heard of GSI, but the company has a rich history as a technical and business innovator. The company was the precursor to Texas Instruments, a huge player in the early development of computing hardware — seismic processing was the 'big data' of its time. GSI still owns the largest offshore seismic dataset in Canada. Recently, however, the company seems to have focused entirely on litigation.

The Calgary company brought more than 25 lawsuits in Alberta alone against corporations, petroleum boards, and others. There have been other actions in other jurisdictions. This ruling is just the latest one; here's the full list of defendants in this particular suit (there were only 25, but some were multiple entities):

  • Devon Canada Corporation
  • Statoil Canada Ltd.
  • Anadarko Petroleum Corporation
  • Anadarko US Offshore Corporation
  • NWest Energy Corp.
  • Shoal Point Energy Ltd.
  • Vulcan Minerals Inc.
  • Corridor Resources Inc.
  • CalWest Printing and Reproductions
  • Arcis Seismic Solutions Corp.
  • Exploration Geosciences (UK) Limited
  • Lynx Canada Information Systems Ltd.
  • Olympic Seismic Ltd.
  • Canadian Discovery Ltd.
  • Jebco Seismic UK Limited
  • Jebco Seismic (Canada) Company
  • Jebco Seismic, LP
  • Jebco/Sei Partnership LLC
  • Encana Corporation
  • ExxonMobil Canada Ltd.
  • Imperial Oil Limited
  • Plains Midstream Canada ULC
  • BP Canada Energy Group ULC
  • Total S.A.
  • Total E&P Canada Ltd.
  • Edison S.P.A.
  • Edison International S.P.A.
  • ConocoPhillips Canada Resources Corp.
  • Canadian Natural Resources Limited
  • MGM Energy Corp
  • Husky Oil Limited
  • Husky Oil Operations Limited
  • Nalcor Energy – Oil and Gas Inc.
  • Suncor Energy Inc.
  • Murphy Oil Company Ltd.
  • Devon ARL Corporation

Why did people share the data?

According to Section 101 (Disclosure of Information) of the Canada Petroleum Resources Act (1985), geophysical data should be released to regulators — and thus, effectively, the public — five years after acquisition:

(2) Subject to this section, information or documentation is privileged if it is provided for the purposes of this Act [...]

(2.1) Subject to this section, information or documentation that is privileged under subsection 2 shall not knowingly be disclosed without the consent in writing of the person who provided it, except for the purposes of the administration or enforcement of this Act [...]

(7) Subsection 2 does not apply in respect of the following classes of information or documentation obtained as a result of carrying on a work or activity that is authorized under the Canada Oil and Gas Operations Act, namely, information or documentation in respect of

(d) geological work or geophysical work performed on or in relation to any frontier lands,
    (i) in the case of a well site seabed survey [...], or
    (ii) in any other case, after the expiration of five years following the date of completion of the work;

As far as I can tell, this does not necessarily happen, by the way. There seems to be a great deal of confusion in Canada about what 'seismic data' actually is — companies submit paper versions, sometimes with poor processing, or perhaps only every 10th line of a 3D. But the Canada Oil and Gas Geophysical Operations Regulations are quite clear. This is from the extensive and pretty explicit 'Final Report' requirements:

(j) a fully processed, migrated seismic section for each seismic line recorded and, in the case of a 3-D survey, each line generated from the 3-D data set;

The intent is quite clear: the regulators are entitled to the stacked, migrated data. The full list is worth reading; it covers a large amount of data. If this requirement is enforced at all, it is not enforced rigorously. If these datasets ever make it into the hands of the regulators, and I doubt it all ever does, then they are still subject to the haphazard data management practices that this industry has ubiquitously adopted.

GSI argued that 'disclosure', as set out in Section 101 of the Act, does not imply the right to copy, but the court was unmoved:

Nonetheless, I agree with the Defendants that [Section 101] read in its entirety does not make sense unless it is interpreted to mean that permission to disclose without consent after the expiry of the 5 year period [...] must include the ability to copy the information. In effect, permission to access and copy the information is part of the right to disclose.

So this is the heart of the matter: the seismic data was owned and copyrighted by GSI, but the regulations specify that seismic data must be submitted to regulators, and that they can disclose that data to others. There's obvious conflict between these ideas, so which one prevails?

The decision

There is a principle in law called Generalia Specialibus Non Derogant. Quoting from another case involving GSI:

Where two provisions are in conflict and one of them deals specifically with the matter in question while the other is of more general application, the conflict may be avoided by applying the specific provision to the exclusion of the more general one. The specific prevails over the more general: it does not matter which was enacted first.

Quoting again from the recent ruling in GSI vs Encana et al.:

Parliament was aware of the commercial value of seismic data and attempted to take this into consideration in its legislative drafting. The considerations balanced in this regard are the same as those found in the Copyright Act, i.e. the rights of the creator versus the rights of the public to access data. To the extent that GSI feels that this policy is misplaced, its rights are political ones – it is not for this Court to change the intent of Parliament, unfair as it may be to GSI’s interests.

Finally:

[...the Regulatory Regime] is a complete answer to the suggestion that the Boards acted unlawfully in disclosing the information and documentation to the public. The Regulatory Regime is also a complete answer to whether the copying companies and organizations were entitled to receive and copy the information and documentation for customers. For the oil companies, it establishes that there is nothing unlawful about accessing or copying the information from the Boards [...]

So that's it: the data was copyrighted, but the regulations effectively override the copyright. The regulations were legal, and — while GSI might find the result unfair — it must operate under them.

The decision must be another step towards the end of this ugly matter. Maybe it's the end. I'm sure those (non-lawyers) involved can't wait for it to be over. I hope GSI finds a way back to its technical core and becomes a great company again. And I hope the regulators find ways to better live up to the fundamental idea behind releasing data in the first place: that the availability of the data to the public should promote better science and better decisions for Canada's offshore. As things stand today, the whole issue of 'public subsurface data' in Canada is, frankly, a mess.

Why I don't flout copyright

Lots of people download movies illegally. Or spoof their IP addresses to get access to sports fixtures. Or use random images they found on the web in publications and presentations (I've even seen these with the watermark of the copyright owner on them!). Or download PDFs for people who aren't entitled to access (#icanhazpdf). Or use sketchy Russian paywall-crumbling hacks. It's kind of how the world works these days. And I realize that some of these things don't even sound illegal.

This might surprise some people, because I go on so much about sharing content, open geoscience, and so on. But I am an annoying stickler for copyright rules. I want people to be able to re-use any content they like, without breaking the law. And if people don't want to share their stuff, then I don't want to share it.

Maybe I'm just getting old and cranky, but FWIW here are my reasons:

  1. I'm a content producer. I would like to set some boundaries to how my stuff is shared. In my case, the boundaries amount to nothing more than attribution, which is only fair. But still, it's my call, and I think that's reasonable, at least until the material is, say, 5 years old. But some people don't understand that open is good, that shareable content is better than closed content, that this is the way the world wants it. And that leads to my second reason:
  2. I don't want to share closed stuff as if it was open. If someone doesn't openly license their stuff, they don't deserve the signal boost — they told the world to keep their stuff secret. Why would I give them the social and ethical benefits of open access while they enjoy the financial benefits of closed content? This monetary benefit comes from a different segment of the audience, obviously. At least half the people who download a movie illegally would not, I submit, have bought the movie at a fair price.

So make a stand for open content! Don't share stuff that the creator didn't give you permission to share. They don't deserve your gain filter.