May 20, 2021

Equinor should change its open data licence

May 20, 2021/ Matt Hall

This is an open letter to Equinor to appeal for a change to the licence used on Volve, Northern Lights, and other datasets. If you wish to co-sign, please add a supportive comment below. (Or if you disagree, please speak up too!)

Open data has had a huge impact on science and society. Whether the driving purpose is innovation, transparency, engagement, or something else, open data can make a difference. Underpinning the dataset itself is its licence, which grants permission to others to re-use and distribute open data. Open data licences are licences that meet the Open Definition.

In 2018, Equinor generously released a very large dataset from the decommissioned field Volve. Initially it was released with no licence. Later in 2018, a licence was added but it was a non-open licence, CC BY-NC-SA (open licences cannot be limited to non-commercial use, which is what the NC stands for). Then, in 2020, the licence was changed to a modified CC BY licence, which you can read here.

As far as I know, Volve and other projects still carry this licence. I’ll refer to this licence as “the Equinor licence”. I assume it applies to the collection of data, and to the contents of the collection (where applicable).

There are 3 problems with the licence as it stands:

The licence is not open.
Modified CC licences have issues.
The licence is not clear and exposes licensees to risk of infringement.

Let's look at these in turn.

The licence is not open

The Equinor licence is not an open licence. It does not meet the Open Definition, section 2.1.2 of which states:

“The license must allow redistribution of the licensed work, including sale, whether on its own or as part of a collection made from works from different sources.”

The licence does not allow sale and therefore does not meet this criterion. Non-open licences are not compatible with open licences, therefore these datasets cannot be remixed and re-used with open content. This greatly limits the usefulness of the dataset.

Modified CC licences have issues

The Equinor licence states:

“This license is based on CC BY 4.0 license ”

I interpret this to mean that it is intended to act as a modified CC BY licence. There are two issues with this:

The copyright lawyers at Creative Commons strongly advise against modifying (in particular, adding restrictions to) their licences.
If you do modify one, you may not refer to it as a CC BY licence or use Creative Commons trademarks; doing so violates their trademarks.

Both of these issues are outlined in the Creative Commons Wiki. According to that document, these issues arise because modified licences confuse the public. In my opinion (and I am not a lawyer, etc), the Equinor licence is confusing, and it appears to violate the Creative Commons organization's trademark policy.

Note that 'modify' really means 'add restrictions to' here. It is easier to legally and clearly remove restrictions from CC licences, using the CCPlus licence extension pattern.

The licence is not clear

The Equinor licence contains five restrictions:

You may not sell the Licensed Material.
You must give Equinor and the Volve license partners credit, and provide a link to these terms and conditions, as well as a copyright notice if applicable.
You may not share Adapted Material under a license that prevents recipients from complying with these terms and conditions.
You shall not use the Licensed Material in a manner that appears misleading nor present the Licensed Material in a distorted or incorrect manner.
The license covers all data in the dataset whether or not it is by law covered by copyright.

Looking at the points in turn:

Point 1 is, I believe, the main issue for Equinor. For some reason, this is paramount for them.

Point 2 seems like a restatement of the BY restriction that is the main feature of the CC-BY licence and is extensively described in Section 3.a of that licence.

Point 3 is already covered by CC BY in Section 3.a.4.

Point 4 is ambiguous and confusing. Who is the arbiter of this potentially subjective criterion? How will it be applied? Will Equinor examine every use of the data? The scenario this point is trying to prevent seems already to be covered by standard professional ethics and 'errors and omissions'. It's a bit like saying you can't use the data to commit a crime: it doesn't need saying because committing crimes is already illegal.

Point 5 is strange. I don’t know why Equinor wants to licence material that no-one owns, but licences are legal contracts, and you can bind people into anything you can agree on. One note here: the rights in the database (so-called 'database rights') are separate from the rights in the contents. It is possible in many jurisdictions to claim sui generis rights in a collection of non-copyrightable elements; maybe this is what was intended? Importantly, sui generis database rights are explicitly covered by CC BY 4.0.

Finally, I recently received an email communication from Equinor that stated the following:

“[...] nothing in our present licencing inhibits the fair and widespread use of our data for educational, scientific, research and commercial purposes. You are free to download the Licensed Material for non-commercial and commercial purposes. Our only requirement is that you must add value to the data if you intend to sell them on.”

The last sentence (“Our only requirement…”) states that there is only one added restriction. But, as I just pointed out, this is not what the licence document states. The Equinor licence states that one may not sell the licensed material, period. The email states that I can sell it if I add value. Then the questions are, "What does 'add value' mean?", and "Who decides?". (It seems self-evident to me that it would be very hard to sell open material if one wasn't adding value!)

My recommendations

In its current state, I would not recommend anyone to use the Volve or Northern Lights data for any purpose. I know this sounds extreme, but it’s important to appreciate the huge imbalance in the relationship between Equinor and its licensees. If Equinor's future counsel — maybe in a decade — decides that lots of people have violated this licence, what happens next could be quite unjust. Equinor can easily put a small company out of business with a lawsuit. I know that might seem unlikely today, but I urge you to read about GSI's extensive lawsuits in Canada — this is a real situation that cost many companies a lot of money. You can read about it in my blog post, Copyright and seismic data.

When it comes to licences, and legal contracts in general, I believe that less is more. Taking a standard licence and adding words to solve problems you don’t have but can imagine having — and lawyers have very good imaginations — just creates confusion.

I therefore recommend the following:

Adopt an unmodified CC BY 4.0 licence for the collection as a whole.
Adopt an unmodified CC BY 4.0 licence for the contents of the collection, where copyrightable.
Include copyright notices that clearly state the copyright owners, in all relevant places in the collection (e.g. data folders, file headers) and at least at the top level. This way, it's clear how attribution should be done.
Quell the fear of people selling the dataset by removing as many barriers as possible to using the free version, and generally continuing to be a conspicuous champion for open data.

If Equinor opts to keep a version of the current licence, I recommend at least removing any mention of CC BY, because it only adds to the confusion. The Equinor licence is not a CC BY licence, and mentioning Creative Commons violates their policy. I also suggest simplifying the licence if possible, and clarifying any restrictions that remain. Use plain language, give examples, and provide a set of Frequently Asked Questions.

The best path forward for fostering a community around these wonderful datasets, which Equinor has generously shared, is to adopt a standard open licence as soon as possible.

May 19, 2021

How can technical societies support openness?

May 19, 2021/ Matt Hall

There’s an SPE conference on openness happening this week. Around 60 people paid the $400 registration fee — does that seem like a lot for a virtual conference? — and it’s mostly what you’d expect: talks and panel discussions. But there’s 20 minutes per day for open discussion, and we must be grateful for small things! For sure, it is always good to see the technical societies pay attention to open data, open source code, and open access content.

But what really matters is action, and in my breakout room today I asked about SPE’s role in raising the community’s level of literacy around openness. Someone asked in turn what sorts of things the organization could do. I said my answer needed to be written down 😄 so here it is.

To save some breath, I’m going to use the word openness to talk about open access content, open source code, and open data. And when I say ‘open’, I mean that something meets the Open Definition. In a nutshell, this states:

“Open data and content can be freely used, modified, and shared by anyone for any purpose”

Remember that ‘free’ here means many things, but not necessarily ‘free of charge’.

So that we don’t lose sight of the forest for the trees, my advice boils down to this: I would like to see all of the technical societies understand and embrace the idea that openness is an important way for them to increase their reach, improve their accessibility, become more equitable, increase engagement, and better serve their communities of practice.

No, ‘increase their revenue’ is not on that list. Yes, if they do those things, their revenue will go up. (I’ve written about the societies’ counterproductive focus on revenue before.)

Okay, enough preamble. What can the societies do to better serve their members? I can think of a few things:

Advocate for producers of the open content and technology that benefits everyone in the community.
Help member companies understand the role openness plays in innovation and help them find ways to support it.
Take a firm stance on expectations of reproducibility for journal articles and conference papers.
Provide reasonable, affordable options for authors to choose open licences for their work (and such options must not require a transfer of copyright).
When open access papers are published, be clear about the licence. (I could not figure out the licence on the current most read paper in SPE Journal, although it says ‘open access’.)
Find ways to get well-informed legal advice about openness to members (this advice is hard to find; most lawyers are not well informed about copyright law, never mind openness).
Offer education on openness to members.
Educate editors, associate editors, and meeting convenors on openness so that they can coach authors, reviewers, and contributors.
Improve peer review machinery to better support the review of code and data submissions.
Highlight exemplary open research projects, and help project maintainers improve over time. (For example, what would it take to accelerate MRST’s move to an open language? Could SPE help create those conditions?)
Recognize that open data benchmarks are badly needed and help organize labour around them.
Stop running data science contests that depend on proprietary data.
Put an open licence on PetroWiki. I believe this was Apache’s intent when they funded it, hence the open licences on AAPG Wiki and SEG Wiki. (Don’t get me started on the missed opportunity of the SEG/AAPG/SPE wikis.)
Allow more people from more places to participate in events, with sympathetic pricing, asynchronous activities, recorded talks, etc. It is completely impossible for a great many engineers to participate in this openness workshop.
Organize more events around openness!

I know that SPE, like the other societies, has some way to go before they really internalize all of this. That’s normal — change takes time. But I’m afraid there is some catching up to do. The petroleum industry is well behind here, and none of this is really new — I’ve been banging on about it for a decade and I think of myself as a newcomer to the openness party. Jon Claerbout and Paul de Groot must be utterly exhausted by the whole thing!

The virtual conference this week is an encouraging step in the right direction, as are the recent SPE datathons (notwithstanding what I said about the data). Although it’s a late move — making me wonder if it’s an act of epiphany or of desperation — I’m cautiously encouraged. I hope the trend continues and picks up pace. And I’m looking forward to more debate and inspiration as the week goes on.

February 17, 2021

Which open licence should I choose?

February 17, 2021/ Matt Hall

I’ve written about open data a few times recently. And not-so-recently. And there’s been quite a bit of chat about open subsurface benchmarks in the Software Underground recently. As more people consider openly releasing data (or code, or other content), one question that comes up fairly often is: Which licence should I choose?

I’ll start at the beginning. I am not a lawyer, and this is going to be very high level, so do click on the links to read more.

What is copyright?

You automatically own the copyright to anything original that you create. You don’t have to register it, but the thing you made — and it must be a thing, you can’t copyright ideas — must be original. It could be a photo, a song, or a seismic interpretation. Physical measurements with no creative input, such as well logs, are not copyrightable… but a database consisting of such data is (so-called database rights). Your rights are exclusive, worldwide, and last until some years after you die (it varies).

If someone wants to use your work, even if they just found it on the Internet, they must either claim Fair Use, or seek permission from you. Giving permission means granting a licence; it can be as restrictive and arcane as you want.

If you don’t want people bothering you about licences, or if you want to actively encourage people to use and adapt your work, you can preemptively grant an open licence.

What is openness?

Before you start thinking about licences, there are two more big things to learn about:

What is open? Not all licences, not even all Creative Commons licences, meet the Open Definition. In brief, this states that “Open data and content can be freely used, modified, and shared by anyone for any purpose” — you can’t restrict people based on their use case or location. So licences that forbid commercial application are not open.
What is permissiveness? Once you’ve decided to go open, you need to decide where you stand on permissiveness. Some licences, notably those advocated by the GNU Free Software Movement, compel licensees (users) to preserve the openness of the work in any future redistribution. This ‘viral’ condition is sometimes called copyleft.
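
As a rough illustration of the first point, here is a minimal sketch that flags Creative Commons licence codes that cannot meet the Open Definition. The helper name is made up, and the rule it encodes is my shorthand rather than an official test: any NC (non-commercial) element restricts use, and I have assumed the same for ND (no derivatives), since the Open Definition requires that work can be modified.

```python
# Sketch only: a shorthand check, not an official conformance test.
# Assumption: codes look like "CC BY-NC-SA" or "CC-BY 4.0"; is_open_cc is a made-up name.

def is_open_cc(code: str) -> bool:
    """Return True if a CC licence code carries no NC or ND element."""
    # Drop the leading "CC" and any separators around it.
    cleaned = code.upper().replace("CC", "", 1).strip(" -")
    # Split the remainder into elements such as BY, NC, SA, ND (and a version number).
    elements = set(cleaned.replace(" ", "-").split("-"))
    return not ({"NC", "ND"} & elements)


for code in ["CC BY", "CC BY-SA", "CC BY-NC-SA", "CC BY-ND"]:
    print(code, "->", "open" if is_open_cc(code) else "not open")
```

This says nothing about modified licences: a licence that is only 'based on' CC BY needs to be read in full.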

In some circles, a near-religious war smoulders on the permissiveness issue. You need to make up your own mind where you stand, or at least understand the issues.

By the way, granting a licence does not mean giving up your rights. In fact, you must own the copyright in order to grant the licence. Many scientists don’t realize we’ve been giving away the copyright in our work for decades, as a (completely unnecessary and made up) condition of publication.

Another source of confusion: open licences are also not the same thing as public domain. Public domain means that the work is free from copyright restrictions. In general though, it cannot be applied to a copyrighted work (though CC0 tries to relinquish copyright where possible). For example, On The Origin of Species is public domain, as is most work produced by the United States government (for example, by the USGS).

One last thing: an often overlooked aspect of licensing is protection for you, the licensor. All common licences include language that indemnifies you from misuse or misinterpretation of your work. So be careful about putting your stuff ‘out there’ with anything other than a standard licence: you may be leaving yourself open to liability issues later.

Open licences

Rather than writing a lot of stuff that’s been written by smarter people than me, I thought I’d draw a diagram to try to explain the differences between some common licences (there are certainly a lot more than the ones I mention here).

Just to re-iterate: there are a lot more licences than the ones mentioned here, these are just examples.

What do I recommend?

For content, my personal belief is that CC-BY most aptly captures the way science works. Scientists 'build on the shoulders of giants' by re-using the work of others with fastidious attribution, usually by citation. Accordingly, the CC-BY protects the licensor, ensures attribution, and that's it. If you prefer copyleft licences, the equivalent licence is CC-BY-SA.

But Creative Commons recommend against using CC licences for source code, so what should you do then?

For code, the permissive licence closest to CC-BY is the MIT/BSD/Apache family of licences, of which only the Apache 2.0 licence offers some specific protections with respect to patents (in particular, it protects licensees from ‘upstream’ patent infringements). The equivalent copyleft licences are the GPL (for applications) and LGPL (for libraries).

For data, I tend to use CC-BY, but there are some specialist data licences (beware, they are poorly named in my opinion: the seemingly ‘vanilla’ ODbL is copyleft; the permissive equivalent is ODC-By).
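
To summarize the recommendations above in one place, here is a small sketch of a lookup table. The function and table names are invented for illustration, and the table only restates my personal preferences from this post; it is not legal advice and not an exhaustive list.

```python
# Sketch only: a lookup of the preferences described in this post.
# Keys are (kind of work, whether you want copyleft); all names are hypothetical.

RECOMMENDED = {
    ("content", False): "CC-BY",
    ("content", True): "CC-BY-SA",
    ("code", False): "Apache 2.0 (or MIT/BSD, without the patent language)",
    ("code", True): "GPL for applications, LGPL for libraries",
    ("data", False): "CC-BY or ODC-By",
    ("data", True): "ODbL",
}


def suggest_licence(kind: str, copyleft: bool = False) -> str:
    """Return the licence suggested in this post for a kind of work."""
    key = (kind.lower(), copyleft)
    if key not in RECOMMENDED:
        raise ValueError(f"No suggestion for kind {kind!r}; try content, code, or data.")
    return RECOMMENDED[key]


print(suggest_licence("data"))
print(suggest_licence("code", copyleft=True))
```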

What about mixed content, like a Jupyter Notebook? You have to be practical; maybe it depends on whether you consider your notebooks to be 'content' or 'source code'. I sometimes put at the bottom of a notebook something like "Open source content. Text is CC-BY, code is Apache 2.0", and I think this makes my intent clear.
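
If you have a lot of notebooks, adding that notice by hand gets tedious. Here is a minimal sketch that appends it as a Markdown cell, working directly on the notebook's JSON structure with the standard library. The function name is made up, and I'm assuming a version 4 notebook; from minor version 5 onwards cells also carry an id, so the sketch adds one in that case.

```python
import json
import uuid

NOTICE = "Open source content. Text is CC-BY, code is Apache 2.0."


def add_licence_cell(nb: dict, notice: str = NOTICE) -> dict:
    """Append a Markdown cell holding the notice, unless one is already there."""
    for cell in nb.get("cells", []):
        # A cell's source may be a string or a list of strings; join handles both.
        if cell.get("cell_type") == "markdown" and notice in "".join(cell.get("source", [])):
            return nb
    cell = {"cell_type": "markdown", "metadata": {}, "source": [notice]}
    if nb.get("nbformat", 4) == 4 and nb.get("nbformat_minor", 0) >= 5:
        cell["id"] = uuid.uuid4().hex[:8]
    nb.setdefault("cells", []).append(cell)
    return nb


# An empty in-memory notebook stands in for json.load() on a real .ipynb file.
nb = {"cells": [], "metadata": {}, "nbformat": 4, "nbformat_minor": 5}
add_licence_cell(nb)
print(json.dumps(nb["cells"][-1]["source"]))
```

For a real file you would read it with json.load, call the function, and write it back with json.dump.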

Tools

There are some tools around to help you make a choice of licence:

Licence selector for open source software or data.
Choose a CC Licence for open content (images, text, perhaps data).
Choose a Licence for open source software.
TL;DR Legal has plain English summaries of popular software licences.

Last thing

Note that open licences are just one piece of the jigsaw puzzle of reproducible science and reusable content. You also need to think about open and accessible data formats (e.g. CSV not XLS), accessible content (DOIs and open indexes), and documentation.
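
On the data format point, here is a tiny sketch of what 'CSV not XLS' means in practice: the table is written as plain text that any tool can read back, using only the standard library. The column names and values are invented purely for illustration.

```python
import csv
import io

# Made-up example rows; a real table would come from your own data.
rows = [
    {"depth_m": 1000.0, "gr_api": 85.2},
    {"depth_m": 1000.5, "gr_api": 90.1},
]

# Write to an in-memory buffer; swap in open("logs.csv", "w", newline="") for a file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["depth_m", "gr_api"])
writer.writeheader()
writer.writerows(rows)
text = buffer.getvalue()

# Anyone can read it back without proprietary software.
back = list(csv.DictReader(io.StringIO(text)))
print(text)
```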

Although insufficient on their own, open licences are a necessary component. And while licences can be changed, they cannot be revoked… so it’s worth putting some thought into your choices before you start pushing your content out into the world.

If it seems hard to navigate, do get in touch, we’d be happy to help if and where we can (notwithstanding IANAL). If your situation is at all complicated I recommend seeking professional legal advice, but do go out of your way to find a lawyer who understands both the motivation for, and the legal issues around, open licensing.

December 08, 2020

An update on Volve

December 08, 2020/ Matt Hall

Writing about the new almost-open dataset at Groningen yesterday reminded me that things have changed a little on Equinor’s Volve dataset in Norway. Illustrating the principle that there are more ways to get something wrong than to get it right, here’s the situation there.

In 2018, Equinor generously released a very large dataset from the decommissioned field Volve. The data is undoubtedly cool, but initially it was released with no licence. Later in 2018, a licence was added but it was a non-open licence, CC BY-NC-SA. Then, earlier this year, the licence was changed to a modified CC BY licence. Progress, sort of.

I think CC BY is an awesome licence for open data. But modifying licences is always iffy and in this case the modifications mean that the licence can no longer be called ‘open’, because the restrictions they add are not permitted by the Open Definition. For me, the problematic clauses in the modification are:

You can’t sell the dataset. This is almost as ambiguous as the previous “non-commercial” clause. What if it’s a small part of a bigger offering that adds massive value, for example as demo data for a software package? Or as one piece in a large data collection? Or as the basis for a large and expensive analysis? Or if it was used to train a commercial neural network?

The license covers all data in the dataset whether or not it is by law covered by copyright. It's a bit weird that this is tucked away in a footnote, but okay. I don't know how it would work in practice because CC licenses depend on copyright. (The whole point of uncopyrightable content is that you can't own rights in it, never mind license it.)

It’s easy to say, “It’s fine, that’s not what Equinor meant.” My impression is that the subsurface folks in Equinor have always said, "This is open," and their motivation is pure and good, but then some legal people get involved and so now we have what we have. Equinor is an enormous company with (compared to me) infinite resources and a lot of lawyers. Who knows how their lawyers in a decade will interpret these terms, and my motivations? Can you really guarantee that I won’t be put in an awkward situation, or bankrupted, by a later claim — like some of GSI’s clients were when they decided to get tough on their seismic licenses?

Personally, I’ve decided not to touch Volve until it has a proper open licence that does not carry this risk.

December 07, 2020

A big new almost-open dataset: Groningen

December 07, 2020/ Matt Hall

Open data enthusiasts rejoice! There’s a large new openly licensed subsurface dataset. And it’s almost awesome.


The dataset has been released by Dutch oil and gas operator Nederlandse Aardolie Maatschappij (NAM), which is a 50–50 joint venture between Shell and ExxonMobil. They have operated the giant Groningen gas field since 1963, producing from the Permian Rotliegend Group, a 50 to 225 metre-thick sandstone with excellent reservoir properties. The dataset consists of a static geological model and its various components: data from over [edit: 6000 well logs], a prestack-depth migrated seismic volume, plus seismic horizons, and a large number of interpreted faults. It’s 4.4GB in total — not ginormous.

Induced seismicity

There’s a great deal of public interest in the geology of the area: Groningen has been plagued by induced seismicity for over 30 years. The cause has been identified as subsidence resulting from production, and became enough of a concern that the government took steps to limit production in 2014, and has imposed a plan to shut down the field completely by 2030. There are also pressure maintenance measures in place, as well as a lot of monitoring. However, the earthquakes continue, and have been as large as magnitude 3.6 — a big worry for people living in the area. I assume this issue is one of the major reasons for NAM releasing the data.*

In the map of the Top Rotliegendes (right, from Kortekaas & Jaarsma 2017), the elevation varies from –2442 m (red) to –3926 m. Major faults are shown in blue, along with seismic events of local magnitude 1.3 to 3.6. The Groningen field outline is shown in red.

Can you use the data? Er, maybe.

Anyone can access the data. NAM and Utrecht University, who have published the data, have selected a Creative Commons Attribution 4.0 licence, which is (in my opinion) the best licence to use. And unlike certain other data owners (see below!) they have resisted the temptation to modify the licence and confuse everyone. (It seems like they were tempted though, as the metadata contains the plea, “If you intend on using the model, please let us know […]”, but it’s not a requirement.)

However, the dataset does not meet the Open Definition (see section 1.4). As the owners themselves point out, there’s a rather major flaw in their dataset:

This model can only be used in combination with Petrel software • The model has taken years of expert development. Please use only if you are a skilled Petrel user.

I’ll assume this is a statement of fact, as opposed to a formal licence restriction. It’s clear that requiring (de facto or otherwise) the use of proprietary software (let alone software costing more than USD 100,000!) is not ‘open’ at all. No normal person has access to Petrel, and the annoying thing is that there’s absolutely no reason to make the dataset this inconvenient to use. The obvious format for seismic data is SEG-Y (although there is a ZGY reader out now), and there’s LAS 2 or even DLIS for wireline logs. There are no open standard formats for seismic horizons or formation tops, but some sort of text file would be fine. All of these formats have open source file readers, or can be parsed as text. Admittedly the geomodel is a tricky one; I don’t know about any open formats. [UPDATE: see the note below from EPOS-NL.]
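
To show what 'can be parsed as text' means for well logs, here is a minimal sketch that reads curve names and data rows from a LAS 2.0 string. It assumes an unwrapped file (WRAP. NO) and ignores everything except the ~C and ~A sections. The function name and the sample values are made up; a real project would more likely use an existing open source reader.

```python
def read_las_curves(text: str):
    """Return (curve mnemonics, numeric rows) from an unwrapped LAS 2.0 string."""
    section = None
    names, rows = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("~"):
            # Section headers are identified by the letter after the tilde.
            section = line[1].upper() if len(line) > 1 else None
            continue
        if section == "C":
            # Curve lines look like "DEPT.M : Depth"; the mnemonic precedes the first dot.
            names.append(line.split(".", 1)[0].strip())
        elif section == "A":
            rows.append([float(value) for value in line.split()])
    return names, rows


# A made-up two-curve example, not real data.
SAMPLE = """~Version
VERS. 2.0 : CWLS LOG ASCII STANDARD - VERSION 2.0
WRAP. NO  : One line per depth step
~Curve
DEPT.M    : Depth
GR  .GAPI : Gamma ray
~ASCII
1000.0 85.2
1000.5 90.1
"""

names, rows = read_las_curves(SAMPLE)
print(names, rows)
```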

Happily, even if the data owners do nothing, I think this problem will be remedied by the community. Some kind soul with access to Petrel will export the data into open formats, and then this dataset really will be a remarkable addition to the open subsurface data family. Stay tuned for more on this.

References

NAM (2020). Petrel geological model of the Groningen gas field, the Netherlands. Open access through EPOS-NL. Yoda data publication platform Utrecht University. DOI 10.24416/UU01-1QH0MW.

M Kortekaas & B Jaarsma (2017). Improved definition of faults in the Groningen field using seismic attributes. Netherlands Journal of Geosciences — Geologie en Mijnbouw 96 (5), p 71–85, 2017 DOI 10.1017/njg.2017.24.

UPDATE on 7 December 2020

* According to Henk Kombrink’s sources, the dataset release is “an initiative from NAM itself, driven primarily by a need from the research community for a model of the field.” Check out Henk’s article about the dataset:

Kombrink, H (2020). Static model giant Groningen field publicly available. Article in Expro News. https://expronews.com/technology/static-model-giant-groningen-field-publicly-available/

UPDATE 10 December 2020

I got the following information from EPOS-NL:

“EPOS-NL and NAM are happy to see the enthusiasm for this most recent data publication. Petrel is one of the most commonly used software among geologists in both academia and industry, and so provides a useful platform for many users worldwide. For those without a Petrel license, the data publication includes a RESCUE 3d grid data export of the model. RESCUE data can be read by a number of open source software. This information was not yet very clearly provided in the data description, so thanks for pointing this out. Finally, the well log data and seismic data used in the Petrel model are also openly accessible, without having to use Petrel software, on the NLOG website (https://www.nlog.nl/en/data), i.e. the Dutch oil and gas portal. Hope this helps!”

October 31, 2018

Reproducibility Zoo

October 31, 2018/ Matt Hall

The Repro Zoo was a new kind of event at the SEG Annual Meeting this year. The goal: to reproduce the results from well-known or important papers in GEOPHYSICS or The Leading Edge. By reproduce, we meant that the code and data should be open and accessible. By results, we meant equations, figures, and other scientific outcomes.

And some of the results are scary enough for Hallowe’en :)

What we did

All the work went straight into GitHub, mostly as Jupyter Notebooks. I had a vague goal of hitting 10 papers at the event, and we achieved this (just!). I’ve since added a couple of other papers, since the inspiration for the work came from the Zoo… and I haven’t been able to resist continuing.

The scene at the Repro Zoo. An air of quiet productivity hung over the booth. Yes, that is Sergey Fomel and Jon Claerbout. Thank you to David Holmes of Dell EMC for the picture.

Here’s what the Repro Zoo team got up to, in alphabetical order:

Aldridge (1990). The Berlage wavelet. GEOPHYSICS 55 (11). The wavelet itself, which has also been added to bruges.
Batzle & Wang (1992). Seismic properties of pore fluids. GEOPHYSICS 57 (11). The water properties, now added to bruges.
Claerbout et al. (2018). Data fitting with nonstationary statistics, Stanford. Translating code from FORTRAN to Python.
Claerbout (1975). Kolmogoroff spectral factorization. Thanks to Stewart Levin for this one.
Connolly (1999). Elastic impedance. The Leading Edge 18 (4). Using equations from bruges to reproduce figures.
Liner (2014). Long-wave elastic attenuation produced by horizontal layering. The Leading Edge 33 (6). This is the stuff about Backus averaging and negative Q.
Luo et al. (2002). Edge preserving smoothing and applications. The Leading Edge 21 (2).
Yilmaz (1987). Seismic data analysis, SEG. Okay, not the whole thing, but Sergey Fomel coded up a figure in Madagascar.
Partyka et al. (1999). Interpretational aspects of spectral decomposition in reservoir characterization.
Röth & Tarantola (1994). Neural networks and inversion of seismic data. Kudos to Brendon Hall for this implementation of a shallow neural net.
Taner et al. (1979). Complex trace analysis. GEOPHYSICS 44. Sarah Greer worked on this one.
Thomsen (1986). Weak elastic anisotropy. GEOPHYSICS 51 (10). Reproducing figures, again using equations from bruges.

As an example of what we got up to, here’s Figure 14 from Batzle & Wang’s landmark 1992 paper on the seismic properties of pore fluids. My version (middle, and in red on the right) is slightly different from that of Batzle and Wang. They don’t give a numerical example in their paper, so it’s hard to know where the error is. Of course, my first assumption is that it’s my error, but this is the problem with research that does not include code or reference numerical examples.

Figure 14 from Batzle & Wang (1992). Left: the original figure. Middle: My attempt to reproduce it. Right: My attempt in red, overlain on the original.

This was certainly not the only discrepancy. Most papers don’t provide the code or data to reproduce their figures, and this is a well-known problem that the SEG is starting to address. But most also don’t provide worked examples, so the reader is left to guess the parameters that were used, or to eyeball results from a figure. Are we really OK with assuming the results from all the thousands of papers in GEOPHYSICS and The Leading Edge are correct? There’s a long conversation to have here.

What next?

One thing we struggled with was capturing all the ideas. Some are on our events portal. The GitHub repo also points to some other sources of ideas. And there was the Big Giant Whiteboard (below). Either way, there’s plenty to do (there are thousands of papers!) and I hope the zoo continues in spirit. I will take pull requests until the end of the year, and I don’t see why we can’t add more papers until then. At that point, we can start a 2019 repo, or move the project to the SEG Wiki, or consider our other options. Ideas welcome!

Thank you!

The following people and organizations deserve accolades for their dedication to the idea and hard work making it a reality. Please give them a hug or a high five when you see them.

David Holmes (Dell EMC) and Chance Sanger worked their tails off on the booth over the weekend, as well as having the neighbouring Dell EMC booth to worry about. David also sourced the amazing Dell tech we had at the booth, just in case anyone needed 128GB of RAM and an NVIDIA P5200 graphics card for their Jupyter Notebook. (The lights in the convention centre actually dimmed when we powered up our booths in the morning.)
Luke Decker (UT Austin) organized a corps of volunteer Zookeepers to help manage the booth, and provided enthusiasm and coding skills. Karl Schleicher (UT Austin), Sarah Greer (MIT), and several others were part of this effort.
Andrew Geary (SEG) kept things moving along when I became delinquent over the summer. Lots of others at SEG also helped, mainly with the booth: Trisha DeLozier, Rebecca Hayes, and Beth Donica all contributed.
Diego Castañeda got the events site in shape to support the Repro Zoo, with a dashboard showing the latest commits and contributors.

October 16, 2018

Volve: not open after all

October 16, 2018/ Matt Hall

Back in June, Equinor made the bold and exciting decision to release all its data from the decommissioned Volve oil field in the North Sea. Although the intent of the release seemed clear, the dataset did not carry a license of any kind. Since you cannot use unlicensed content without permission, this was a problem. I wrote about this at the time.

To its credit, Equinor listened to the concerns from me and others, and considered its options. Sensibly, it chose an off-the-shelf license. It announced its decision a few days ago, and the dataset now carries a Creative Commons Attribution-NonCommercial-ShareAlike license.

Unfortunately, this license is not ‘open’ by any reasonable definition. The non-commercial stipulation means that a lot of people, perhaps most people, will not be able to legally use the data (which is why non-commercial licenses are not open licenses). And the ShareAlike part means that we’re in for some interesting discussion about what derived products are, because any work based on Volve will have to carry the CC BY-NC-SA license too.

Non-commercial licenses are not open

Here are some of the problems with the non-commercial clause:

NC licenses are not 'open'. They do not meet the Open Definition, because open data must be available for anyone to use, for any purpose. For example, here’s a quote from Hagedorn et al (2011):

NC licenses come at a high societal cost: they provide a broad protection for the copyright owner, but strongly limit the potential for re-use, collaboration and sharing in ways unexpected by many users

NC licenses are incompatible with CC-BY-SA. This means that the data cannot be used on Wikipedia, SEG Wiki, or AAPG Wiki, or in any openly licensed work carrying that license.
NC-licensed data cannot be used commercially. This is obvious, but far-reaching. It means, for example, that nobody can use the data in a course or event for which they charge a fee. It means nobody can use the data as a demo or training data in commercial software. It means nobody can use the data in a book that they sell.
The boundaries of the license are unclear. It's arguable whether any business can use the data for any purpose at all, because many of the boundaries of the scope have not been tested legally. What about a course run by AAPG or SEG? What about a private university? What about a government, if it stands to realize monetary gain from, say, a land sale? All of these uses would be illegal, because it’s the use that matters, not the commercial status of the user.

Now, it seems likely, given the language around the release, that Equinor will not sue people for most of these use cases. They may even say this. Goodness knows, we have enough nudge-nudge-wink-wink agreements like that already in the world of subsurface data. But these arrangements just shift the onus onto the end user and, as we’ve seen with GSI, things can change and one day you wake up with lawsuits.

ShareAlike means you must share too

Creative Commons licenses are, as the name suggests, intended for works of creativity. Indeed, the whole concept of copyright depends on creativity: copyright protects works of creative expression. If there’s no creativity, there’s no basis for copyright. So for example, a gamma-ray log is unlikely to be copyrightable, but seismic data is (follow the GSI link above to find out why). Non-copyrightable works are not covered by Creative Commons licenses.

All of which is just to help explain some of the language in the CC BY-NC-SA license agreement, which you should read. But the key part is in paragraph 4(b):

You may distribute, publicly display, publicly perform, or publicly digitally perform a Derivative Work only under the terms of this License

What’s a ‘derivative work’? It’s anything ‘based upon’ the licensed material, which is pretty vague and therefore all-encompassing. In short, if you use or show Volve data in your work, no matter how non-commercial it is, then you must attach a CC BY-NC-SA license to your work. This is why SA licenses are sometimes called ‘viral’.

By the way, the much-loved F3 and Penobscot datasets also carry the ShareAlike clause, so any work (e.g. a scientific paper) that uses them is open-access and carries the CC BY-SA license, whether the author of that work likes it or not. I’m pretty sure no-one in academic publishing knows this.

By the way again, everything in Wikipedia is CC BY-SA too. Maybe go and check your papers and presentations now :)

What should Equinor do?

My impression is that Equinor is trying to satisfy some business partner or legal edge case, but they are forgetting that they have orders of magnitude more capacity to deal with edge cases than the potential users of the dataset do. The principle at work here should be “Don’t solve problems you don’t have”.

Encumbering this amazing dataset with such tight restrictions effectively kills it. It more or less guarantees it cannot have the impact I assume they were looking for. I hope they reconsider their options. The best choice for any open data is CC-BY.

October 09, 2018

Reproduce this!

October 09, 2018/ Matt Hall

There’s a saying in programming: untested code is broken code. Is unreproducible science broken science?

I hope not, because geophysical research is — in general — not reproducible. In other words, we have no way of checking the results. Some of it, hopefully not a lot of it, could be broken. We have no way of knowing.

Next week, at the SEG Annual Meeting, we plan to change that. Well, start changing it… it’s going to take a while to get to all of it. For now we’ll be content with starting.

We’re going to make geophysical research reproducible again!

Welcome to the Repro Zoo!

If you’re coming to SEG in Anaheim next week, you are hereby invited to join us in Exposition Hall A, Booth #749.

We’ll be finding papers and figures to reproduce, equations to implement, and data tables to digitize. We’ll be hunting down datasets, recreating plots, and dissecting derivations. All of it will be done in the open, and all the results will be public and free for the community to use.

You can help

There are thousands of unreproducible papers in the geophysical literature, so we are going to need your help. If you’ll be in Anaheim, and even if you’re not, here are some things you can do:

Vote on the papers and figures to reproduce, or propose new ones. Click here!
Show up at Booth #749 to take part. Bring your laptop, or use one of ours (kindly provided by Dell EMC, our amazing booth neighbours). We’ll mostly be coding in Python and Julia, but any open language is welcome.
Tell people about the Repro Zoo and bring them along too.

That’s all there is to it! Whether you’re a coder or an interpreter, whether you have half an hour or half a day, come along to the Repro Zoo and we’ll get you started.

Figure 1 from Connolly’s classic paper on elastic impedance. This is the kind of thing we’ll be reproducing.

October 25, 2017

EarthArXiv wants your preprints

October 25, 2017/ Matt Hall

If you're into science, and especially physics, you've heard of arXiv, which has revolutionized how research in physics is shared. bioRxiv, SocArXiv and PaleorXiv followed, among others*.

Well get excited, because today, at last, there is an open preprint server especially for earth science — EarthArXiv has landed!

I could write a long essay about how great this news is, but the best way to get the full story is to listen to two of the founders — Chris Jackson (Imperial College London and fellow University of Manchester alum) and Tom Narock (University of Maryland, Baltimore) — on Undersampled Radio this morning:

Congratulations to Chris and Tom, and everyone involved in EarthArXiv!

Friedrich Hawemann, ETH Zurich, Switzerland
Daniel Ibarra, Earth System Science, Stanford University, USA
Sabine Lengger, University of Plymouth, UK
Angelo Pio Rossi, Jacobs University Bremen, Germany
Divyesh Varade, Indian Institute of Technology Kanpur, India
Chris Waigl, University of Alaska Fairbanks, USA

Sara Bosshart, International Water Association, UK
Alodie Bubeck, University of Leicester, UK
Allison Enright, Rutgers - Newark, USA
Jamie Farquharson, Université de Strasbourg, France
Alfonso Fernandez, Universidad de Concepcion, Chile
Stéphane Girardclos, University of Geneva, Switzerland
Surabhi Gupta, UGC, India

Don't underestimate how important this is for earth science. Indeed, there's another new preprint server coming to the earth sciences in 2018, as the AGU — with Wiley! — prepare to launch ESSOAr. Not as a competitor for EarthArXiv (I hope), but as another piece in the rich open-access ecosystem of reproducible geoscience that's developing. (By the way, AAPG, SEG, SPE: you need to support these initiatives. They want to make your content more relevant and accessible!)

It's very, very exciting to see this new piece of infrastructure for open access publishing. I urge you to join in! You can submit all your published work to EarthArXiv — as long as the journal's policy allows it — so you should make sure your research gets into the hands of the people who need it.

I hope every conference from now on has an EarthArXiv Your Papers party.

* Including snarXiv, don't miss that one!

May 22, 2017

GeoConvention highlights

May 22, 2017/ Evan Bianco

We were in Calgary last week at the Canada GeoConvention 2017. The quality of the talks seemed more variable than usual but, as usual, there were some gems in there too. Here are our highlights from the technical talks...

Filling in gaps

Mauricio Sacchi (University of Alberta) outlined a new reconstruction method for vector field data. In other words, filling in gaps in multi-component seismic records. I've got a soft spot for Mauricio's relaxed speaking style and the simplicity with which he presents linear algebra, but there are two other reasons that make this talk worthy of a shout out:

He didn't just show equations in his talk, he used pseudocode to show the algorithm.
He linked to his lab's seismic processing toolkit, SeismicJulia, on GitHub.

I am sure he'd be the first to admit that it is early days for this library and it is very much under construction. But what isn't? All the more reason to showcase it openly. We all need a lot more of that.

Update on 2017-06-7 13:45 by Evan Bianco: Mauricio has posted the slides from his talk.

Learning about errors

Anton Biryukov (University of Calgary & graduate intern at Nexen) gave a great talk in the induced seismicity session. It was a lovely mashing-together of three of our favourite topics: seismology, machine-learning, and uncertainty. Anton is researching how to improve microseismic and earthquake event detection by framing it as a machine-learning classification problem. He's using Monte Carlo methods to compute myriad synthetic seismic events by making small velocity variations, and then using those synthetic events to teach a model how to be more accurate about locating earthquakes.

Figure 2 from Anton Biryukov's abstract. An illustration of the signal classification concept. The signals originating from the locations on the grid (a) are then transformed into a feature space and labeled by the class containing the event origin. From Biryukov (2017). Event origin depth uncertainty - estimation and mitigation using waveform similarity. Canada GeoConvention, May 2017.
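
To make the Monte Carlo idea a bit more concrete, here is a toy sketch in Python. It is not Anton's method: his work uses waveform similarity, whereas this sketch assumes a constant-velocity medium, straight rays, travel times as the only features, and a 1-nearest-neighbour classifier. The receiver positions, candidate source locations, base velocity and perturbation size are all made up for illustration.

```python
import math
import random


def travel_times(source, receivers, velocity):
    """Straight-ray travel times (s) from a source to each receiver,
    assuming a constant-velocity medium (a strong simplification)."""
    return [math.dist(source, r) / velocity for r in receivers]


def make_training_set(sources, receivers, v0=3000.0, rel_sigma=0.02,
                      n_draws=50, seed=0):
    """Monte Carlo step: for each candidate source location, draw many
    slightly perturbed velocities and record the resulting travel times.
    The label is the index of the source location (the 'class')."""
    rng = random.Random(seed)
    features, labels = [], []
    for label, src in enumerate(sources):
        for _ in range(n_draws):
            v = v0 * (1.0 + rng.gauss(0.0, rel_sigma))
            features.append(travel_times(src, receivers, v))
            labels.append(label)
    return features, labels


def classify(sample, features, labels):
    """1-nearest-neighbour classification in travel-time feature space."""
    best = min(range(len(features)),
               key=lambda i: math.dist(sample, features[i]))
    return labels[best]


# Hypothetical geometry: three surface receivers, two candidate origins (x, z) in metres.
receivers = [(0.0, 0.0), (1000.0, 0.0), (2000.0, 0.0)]
sources = [(500.0, 1000.0), (1500.0, 2000.0)]

features, labels = make_training_set(sources, receivers)

# An 'observed' event from the second location, at the unperturbed velocity.
observed = travel_times(sources[1], receivers, 3000.0)
predicted = classify(observed, features, labels)
print(f"Predicted origin class: {predicted}")
```

The point of the velocity perturbations is that the classifier sees a cloud of plausible observations for each origin, rather than a single idealized one, so the spread of each cloud says something about location uncertainty.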

The bright lights of geothermal energy
Matt Hall

Two interesting sessions clashed on Wednesday afternoon. I started off in the Value of Geophysics panel discussion, but left after James Lamb's report from the mysterious Chief Geophysicists' Forum. I had long wondered what went on in that secretive organization; it turns out they mostly worry about how to make important people like your CEO think geophysics is awesome. But the large room was a little dark, and — in keeping with the conference in general — so was the mood.

Feeling a little down, I went along to the Diversification of the Energy Industry session instead. The contrast was abrupt and profound. The bright room was totally packed with a conspicuously young audience numbering well over 100. The mood was hopeful, exuberant even. People were laughing, but not wistfully or ironically. I think I saw a rainbow over the stage.

If you missed this uplifting session but are interested in contributing to Canada's geothermal energy scene, which will certainly need geoscientists and reservoir engineers if it's going to get anywhere, there are plenty of ways to find out more or get involved. Start at cangea.ca and follow your nose.

We'll be writing more about the geothermal scene — and some of the other themes in this post — so stay tuned.
