May 20, 2021

Equinor should change its open data licence

May 20, 2021/ Matt Hall

This is an open letter to Equinor to appeal for a change to the licence used on Volve, Northern Lights, and other datasets. If you wish to co-sign, please add a supportive comment below. (Or if you disagree, please speak up too!)

Open data has had a huge impact on science and society. Whether the driving purpose is innovation, transparency, engagement, or something else, open data can make a difference. Underpinning the dataset itself is its licence, which grants permission to others to re-use and distribute open data. Open data licences are licences that meet the Open Definition.

In 2018, Equinor generously released a very large dataset from the decommissioned field Volve. Initially it was released with no licence. Later in 2018, a licence was added but it was a non-open licence, CC BY-NC-SA (open licences cannot be limited to non-commercial use, which is what the NC stands for). Then, in 2020, the licence was changed to a modified CC BY licence, which you can read here.

As far as I know, Volve and other projects still carry this licence. I’ll refer to this licence as “the Equinor licence”. I assume it applies to the collection of data, and to the contents of the collection (where applicable).

There are 3 problems with the licence as it stands:

The licence is not open.
Modified CC licences have issues.
The licence is not clear and exposes licensees to risk of infringement.

Let's look at these in turn.

The licence is not open

The Equinor licence is not an open licence. It does not meet the Open Definition, section 2.1.2 of which states:

“The license must allow redistribution of the licensed work, including sale, whether on its own or as part of a collection made from works from different sources.”

The licence does not allow sale and therefore does not meet this criterion. Non-open licences are not compatible with open licences, therefore these datasets cannot be remixed and re-used with open content. This greatly limits the usefulness of the dataset.

Modified CC licences have issues

The Equinor licence states:

“This license is based on CC BY 4.0 license ”

I interpret this to mean that it is intended to act as a modified CC BY licence. There are two issues with this:

The copyright lawyers at Creative Commons strongly advise against modifying (in particular, adding restrictions to) their licences.
If you do modify one, you may not refer to it as a CC BY licence or use Creative Commons trademarks; doing so violates their trademarks.

Both of these issues are outlined in the Creative Commons Wiki. According to that document, these issues arise because modified licences confuse the public. In my opinion (and I am not a lawyer, etc), the Equinor licence is confusing, and it appears to violate the Creative Commons organization's trademark policy.

Note that 'modify' really means 'add restrictions to' here. It is easier to legally and clearly remove restrictions from CC licences, using the CCPlus licence extension pattern.

The licence is not clear

The Equinor licence contains five restrictions:

1. You may not sell the Licensed Material.
2. You must give Equinor and the Volve license partners credit, and provide a link to these terms and conditions, as well as a copyright notice if applicable.
3. You may not share Adapted Material under a license that prevents recipients from complying with these terms and conditions.
4. You shall not use the Licensed Material in a manner that appears misleading nor present the Licensed Material in a distorted or incorrect manner.
5. The license covers all data in the dataset whether or not it is by law covered by copyright.

Looking at the points in turn:

Point 1 is, I believe, the main issue for Equinor. For some reason, this is paramount for them.

Point 2 seems like a restatement of the BY restriction that is the main feature of the CC-BY licence and is extensively described in Section 3.a of that licence.

Point 3 is already covered by CC BY in Section 3.a.4.

Point 4 is ambiguous and confusing. Who is the arbiter of this potentially subjective criterion? How will it be applied? Will Equinor examine every use of the data? The scenario this point is trying to prevent seems already to be covered by standard professional ethics and 'errors and omissions'. It's a bit like saying you can't use the data to commit a crime; it doesn't need saying because committing crimes is already illegal.

Point 5 is strange. I don’t know why Equinor wants to licence material that no-one owns, but licences are legal contracts, and you can bind people into anything you can agree on. One note here: the rights in the database (so-called 'database rights') are separate from the rights in the contents. It is possible in many jurisdictions to claim sui generis rights in a collection of non-copyrightable elements; maybe this is what was intended? Importantly, sui generis database rights are explicitly covered by CC BY 4.0.

Finally, I recently received an email communication from Equinor that stated the following:

“[...] nothing in our present licencing inhibits the fair and widespread use of our data for educational, scientific, research and commercial purposes. You are free to download the Licensed Material for non-commercial and commercial purposes. Our only requirement is that you must add value to the data if you intend to sell them on.”

The last sentence (“Our only requirement…”) states that there is only one added restriction. But, as I just pointed out, this is not what the licence document states. The Equinor licence states that one may not sell the licensed material, period. The email states that I can sell it if I add value. Then the questions are, "What does 'add value' mean?", and "Who decides?". (It seems self-evident to me that it would be very hard to sell open material if one wasn't adding value!)

My recommendations

With the licence in its current state, I would not recommend that anyone use the Volve or Northern Lights data for any purpose. I know this sounds extreme, but it’s important to appreciate the huge imbalance in the relationship between Equinor and its licensees. If Equinor's future counsel, maybe in a decade, decides that lots of people have violated this licence, what happens next could be quite unjust. Equinor can easily put a small company out of business with a lawsuit. I know that might seem unlikely today, but I urge you to read about GSI's extensive lawsuits in Canada: this is a real situation that cost many companies a lot of money. You can read about it in my blog post, Copyright and seismic data.

When it comes to licences, and legal contracts in general, I believe that less is more. Taking a standard licence and adding words to solve problems you don’t have but can imagine having — and lawyers have very good imaginations — just creates confusion.

I therefore recommend the following:

Adopt an unmodified CC BY 4.0 licence for the collection as a whole.
Adopt an unmodified CC BY 4.0 licence for the contents of the collection, where copyrightable.
Include copyright notices that clearly state the copyright owners, in all relevant places in the collection (e.g. data folders, file headers) and at least at the top level. This way, it's clear how attribution should be done.
Quell the fear of people selling the dataset by removing as many barriers as possible to using the free version, and generally continuing to be a conspicuous champion for open data.

If Equinor opts to keep a version of the current licence, I recommend at least removing any mention of CC BY; it only adds to the confusion. The Equinor licence is not a CC BY licence, and mentioning Creative Commons violates their policy. We also suggest simplifying the licence if possible, and clarifying any restrictions that remain. Use plain language, give examples, and provide a set of Frequently Asked Questions.

The best path forward for fostering a community around these wonderful datasets that Equinor has generously shared is to adopt a standard open licence as soon as possible.

May 19, 2021

How can technical societies support openness?

May 19, 2021/ Matt Hall

There’s an SPE conference on openness happening this week. Around 60 people paid the $400 registration fee — does that seem like a lot for a virtual conference? — and it’s mostly what you’d expect: talks and panel discussions. But there’s 20 minutes per day for open discussion, and we must be grateful for small things! For sure, it is always good to see the technical societies pay attention to open data, open source code, and open access content.

But what really matters is action, and in my breakout room today I asked about SPE’s role in raising the community’s level of literacy around openness. Someone asked in turn what sorts of things the organization could do. I said my answer needed to be written down 😄 so here it is.

To save some breath, I’m going to use the word openness to talk about open access content, open source code, and open data. And when I say ‘open’, I mean that something meets the Open Definition. In a nutshell, this states:

“Open data and content can be freely used, modified, and shared by anyone for any purpose”

Remember that ‘free’ here means many things, but not necessarily ‘free of charge’.

So that we don’t lose sight of the forest for the trees, my advice boils down to this: I would like to see all of the technical societies understand and embrace the idea that openness is an important way for them to increase their reach, improve their accessibility, become more equitable, increase engagement, and better serve their communities of practice.

No, ‘increase their revenue’ is not on that list. Yes, if they do those things, their revenue will go up. (I’ve written about the societies’ counterproductive focus on revenue before.)

Okay, enough preamble. What can the societies do to better serve their members? I can think of a few things:

Advocate for producers of the open content and technology that benefits everyone in the community.
Help member companies understand the role openness plays in innovation and help them find ways to support it.
Take a firm stance on expectations of reproducibility for journal articles and conference papers.
Provide reasonable, affordable options for authors to choose open licences for their work (and such options must not require a transfer of copyright).
When open access papers are published, be clear about the licence. (I could not figure out the licence on the current most read paper in SPE Journal, although it says ‘open access’.)
Find ways to get well-informed legal advice about openness to members (this advice is hard to find; most lawyers are not well informed about copyright law, never mind openness).
Offer education on openness to members.
Educate editors, associate editors, and meeting convenors on openness so that they can coach authors, reviewers, and contributors.
Improve peer review machinery to better support the review of code and data submissions.
Highlight exemplary open research projects, and help project maintainers improve over time. (For example, what would it take to accelerate MRST’s move to an open language? Could SPE help create those conditions?)
Recognize that open data benchmarks are badly needed and help organize labour around them.
Stop running data science contests that depend on proprietary data.
Put an open licence on PetroWiki. I believe this was Apache’s intent when they funded it, hence the open licences on AAPG Wiki and SEG Wiki. (Don’t get me started on the missed opportunity of the SEG/AAPG/SPE wikis.)
Allow more people from more places to participate in events, with sympathetic pricing, asynchronous activities, recorded talks, etc. It is completely impossible for a great many engineers to participate in this openness workshop.
Organize more events around openness!

I know that SPE, like the other societies, has some way to go before they really internalize all of this. That’s normal — change takes time. But I’m afraid there is some catching up to do. The petroleum industry is well behind here, and none of this is really new — I’ve been banging on about it for a decade and I think of myself as a newcomer to the openness party. Jon Claerbout and Paul de Groot must be utterly exhausted by the whole thing!

The virtual conference this week is an encouraging step in the right direction, as are the recent SPE datathons (notwithstanding what I said about the data). Although it’s a late move — making me wonder if it’s an act of epiphany or of desperation — I’m cautiously encouraged. I hope the trend continues and picks up pace. And I’m looking forward to more debate and inspiration as the week goes on.

February 17, 2021

Which open licence should I choose?

February 17, 2021/ Matt Hall

I’ve written about open data a few times recently. And not-so-recently. And there’s been quite a bit of chat about open subsurface benchmarks in the Software Underground recently. As more people consider openly releasing data (or code, or other content), one question that comes up fairly often is: Which licence should I choose?

I’ll start at the beginning. I am not a lawyer, and this is going to be very high level, so do click on the links to read more.

What is copyright?

You automatically own the copyright to anything original that you create. You don’t have to register it, but the thing you made — and it must be a thing, you can’t copyright ideas — must be original. It could be a photo, a song, or a seismic interpretation. Physical measurements with no creative input, such as well logs, are not copyrightable… but a database consisting of such data is (so-called database rights). Your rights are exclusive, worldwide, and last until some years after you die (it varies).

If someone wants to use your work, even if they just found it on the Internet, they must either claim Fair Use, or seek permission from you. Giving permission means granting a licence; it can be as restrictive and arcane as you want.

If you don’t want people bothering you about licences, or if you want to actively encourage people to use and adapt your work, you can preemptively grant an open licence.

What is openness?

Before you start thinking about licences, there are two more big things to learn about:

What is open? Not all licences, not even all Creative Commons licences, meet the Open Definition. In brief, this states that “Open data and content can be freely used, modified, and shared by anyone for any purpose” — you can’t restrict people based on their use case or location. So licences that forbid commercial application are not open.
What is permissiveness? Once you’ve decided to go open, you need to decide where you stand on permissiveness. Some licences, notably those advocated by the GNU Free Software Movement, compel licensees (users) to preserve the openness of the work in any future redistribution. This ‘viral’ condition is sometimes called copyleft.

In some circles, a near-religious war smoulders on the permissiveness issue. You need to make up your own mind where you stand, or at least understand the issues.

By the way, granting a licence does not mean giving up your rights. In fact, you must own the copyright in order to grant the licence. Many scientists don’t realize we’ve been giving away the copyright in our work for decades, as a (completely unnecessary and made up) condition of publication.

Another source of confusion: open licences are also not the same thing as public domain. Public domain means that the work is free from copyright restrictions. In general though, it cannot be applied to a copyrighted work (though CC0 tries to relinquish copyright where possible). For example, On The Origin of Species is public domain, as is most work produced by the United States government (for example, by the USGS).

One last thing: an often overlooked aspect of licensing is protection for you, the licensor. All common licences include language that indemnifies you from misuse or misinterpretation of your work. So be careful about putting your stuff ‘out there’ with anything other than a standard licence: you may be leaving yourself open to liability issues later.

Open licences

Rather than writing a lot of stuff that’s been written by smarter people than me, I thought I’d draw a diagram to try to explain the differences between some common licences (there are certainly a lot more than the ones I mention here).

Just to re-iterate: there are a lot more licences than the ones mentioned here, these are just examples.

What do I recommend?

For content, my personal belief is that CC-BY most aptly captures the way science works. Scientists 'build on the shoulders of giants' by re-using the work of others with fastidious attribution, usually by citation. Accordingly, CC-BY protects the licensor, ensures attribution, and that's it. If you prefer copyleft licences, the equivalent licence is CC-BY-SA.

But Creative Commons recommend against using CC licences for source code, so what should you do then?

For code, the permissive licence closest to CC-BY is the MIT/BSD/Apache family of licences, of which only the Apache 2.0 licence offers some specific protections with respect to patents (in particular, it protects licensees from ‘upstream’ patent infringements). The equivalent copyleft licences are the GPL (for applications) and LGPL (for libraries).

For data, I tend to use CC-BY, but there are some specialist data licences (beware, they are poorly named in my opinion: the seemingly ‘vanilla’ ODbL is copyleft; the permissive equivalent is ODC-By).
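The groupings above can be written down as a small lookup table. This is only a sketch in Python: the `LICENCES` dictionary and the `describe` and `permissive_for` helpers are hypothetical names invented for illustration, and the entries cover only licences mentioned on this blog, not the full landscape.

```python
# Hypothetical lookup table summarizing licences mentioned in this post.
# "open" means the licence meets the Open Definition; "copyleft" means
# redistributed or adapted work must stay under the same terms.
LICENCES = {
    "CC-BY":       {"use": "content", "open": True,  "copyleft": False},
    "CC-BY-SA":    {"use": "content", "open": True,  "copyleft": True},
    "CC-BY-NC-SA": {"use": "content", "open": False, "copyleft": True},
    "MIT":         {"use": "code",    "open": True,  "copyleft": False},
    "BSD":         {"use": "code",    "open": True,  "copyleft": False},
    "Apache-2.0":  {"use": "code",    "open": True,  "copyleft": False},
    "GPL":         {"use": "code",    "open": True,  "copyleft": True},
    "LGPL":        {"use": "code",    "open": True,  "copyleft": True},
    "ODC-By":      {"use": "data",    "open": True,  "copyleft": False},
    "ODbL":        {"use": "data",    "open": True,  "copyleft": True},
}


def describe(name):
    """Return a one-line plain-language summary of a licence in the table."""
    info = LICENCES[name]
    if not info["open"]:
        return f"{name}: not open"
    kind = "copyleft" if info["copyleft"] else "permissive"
    return f"{name}: open, {kind}, typically used for {info['use']}"


def permissive_for(use):
    """List the open, permissive licences in the table for a kind of work."""
    return sorted(
        name
        for name, info in LICENCES.items()
        if info["use"] == use and info["open"] and not info["copyleft"]
    )


print(describe("ODbL"))
print(permissive_for("code"))
```

The table makes the naming trap easy to see: ODbL comes out as copyleft, while ODC-By is the permissive data licence.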

What about mixed content, like a Jupyter Notebook? You have to be practical; maybe it depends on whether you consider your notebooks to be 'content' or 'source code'. I sometimes put at the bottom of a notebook something like “Open source content. Text is CC-BY, code is Apache 2.0”, and I think this makes my intent clear.

Tools

There are some tools around to help you make a choice of licence:

Licence selector for open source software or data.
Choose a CC Licence for open content (images, text, perhaps data).
Choose a Licence for open source software.
TL;DR Legal has plain English summaries of popular software licences.

Last thing

Note that open licences are just one piece of the jigsaw puzzle of reproducible science and reusable content. You also need to think about open and accessible data formats (e.g. CSV not XLS), accessible content (DOIs and open indexes), and documentation.

Although insufficient on their own, open licences are a necessary component. And while licences can be changed, they cannot be revoked… so it’s worth putting some thought into your choices before you start pushing your content out into the world.

If it seems hard to navigate, do get in touch; we’d be happy to help if and where we can (notwithstanding IANAL). If your situation is at all complicated I recommend seeking professional legal advice, but do go out of your way to find a lawyer who understands both the motivation for, and the legal issues around, open licensing.

January 18, 2021

Openness is a two-way street

January 18, 2021/ Matt Hall

Last week the Data Analysis Study Group of the SPE Gulf Coast Section announced a new machine learning contest (I’m afraid registration is now closed, even though the contest has not started yet). The task is to predict shear-wave sonic from other logs, similar to the SPWLA PDDA contest last year. This is a valuable problem in the subsurface, because the shear sonic log is essential for computing elastic properties of rocks and therefore in predicting rock and fluid properties or processing seismic. Indeed, TGS have built a business on predicted logs with their ARLAS product. There’s money in log prediction!

The task looks great, but there’s one big problem: the dataset is not open.

Why is this a problem?

Before answering that, let’s look at some context.

What’s a machine learning contest?

Good question. Typically, an organization releases a dataset (financial timeseries, Netflix viewer data, medical images, or whatever). They invite people to predict some valuable property (when to sell, which show to recommend, how to treat the illness, or whatever). And they pick the best, measured against known labels on a hidden dataset.
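To make the judging step concrete, here is a minimal sketch in Python of ranking submissions against hidden labels. Everything in it is an assumption for illustration: the team names and labels are synthetic, the `accuracy` and `leaderboard` functions are hypothetical, and simple accuracy stands in for whatever metric a real contest would choose (a log-prediction contest would more likely use a regression error).

```python
def accuracy(predicted, truth):
    """Fraction of predictions that match the hidden labels."""
    if not truth or len(predicted) != len(truth):
        raise ValueError("need non-empty prediction and label lists of equal length")
    matches = sum(p == t for p, t in zip(predicted, truth))
    return matches / len(truth)


def leaderboard(submissions, hidden_labels):
    """Score every team and return (team, score) pairs, best first."""
    scores = {
        team: accuracy(preds, hidden_labels)
        for team, preds in submissions.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Synthetic example: the organizers hold these labels back from participants.
hidden_labels = ["sand", "shale", "sand", "lime"]

submissions = {
    "team_a": ["sand", "shale", "shale", "lime"],    # 3 of 4 correct
    "team_b": ["sand", "shale", "sand", "lime"],     # 4 of 4 correct
    "team_c": ["shale", "shale", "shale", "shale"],  # 1 of 4 correct
}

for team, score in leaderboard(submissions, hidden_labels):
    print(f"{team}: {score:.2f}")
```

The key design point is that participants never see `hidden_labels`; they only submit predictions, and the organizers compute the scores.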

Kaggle is one of the largest platforms hosting such challenges, and they often attract thousands of participants — competing for large prizes. TGS ran a seismic salt-picking contest on the platform, attracting almost 74,000 submissions from 3220 teams with a $100k prize purse. Other contests are more grass-roots, like the one I ran with Brendon Hall in 2016 on lithology prediction, and like this SPE contest. It’s being run by a team of enthusiasts without a lot of resources from SPE, and the prize purse is only $1000 — representing about 3 hours of the fully loaded G&A of an oil industry professional.

What has this got to do with reproducibility?

Contests that award a large prize in return for solving a hard problem are essentially just a kind of RFP-combined-with-consulting-job. It’s brutally inefficient: hundreds or even thousands of people spend hours on the problem for free, and a handful are financially rewarded. These contests attract a lot of attention, but I’m not that interested in them.

Community-oriented events like this SPE contest — and the recent FORCE one that Xeek hosted — are more interesting and I believe they are more impactful. They have lots of great outcomes:

Lots of people have fun working on a hard problem and connecting with each other.
Solutions are often shared after, or even during, the contest, so that everyone learns and grows their toolbox.
A new open dataset that might even become a much-needed benchmark for the task in hand.
Researchers can publish what they did, or do later. (The SEG ML contest tutorial and results article have 136 citations between them, largely from people revisiting the dataset to show new solutions.)

A lot of new open-source machine learning code is always exciting, but if the data is not open then the work is by definition not reproducible. It seems especially unfair — cheeky, even — to ask participants to open-source their code, but to keep the data proprietary. For sure TGS is interested in how these free solutions compare to their own product.

Well, life’s not fair. Why is this a problem?

The data is being shared with the contest participants on the condition that they may not share it. In other words it’s proprietary. That means:

Participants are encumbered with the liability of a proprietary dataset. Sure, TGS is sharing this data in good faith today, but who knows how future TGS lawyers will see it after someone accidentally commits it to their GitHub repo? TGS is a billion-dollar company; they will win a legal argument with you. (Having said that, there’s no NDA or anything, just a checkbox in a form. I don’t know how binding it really is… but I don’t want to be the one that finds out.)
Participants can’t publish reproducible papers on their own work. They can publish classic oil-industry, non-reproducible work (“I did this thing but no-one can check it because I can’t give you the data”), but do we really need more of that? (In the contest introductory Zoom, someone asked about publishing plots of the data. The answer: “It should be fine.” Are we really still this naive about data?)

If anyone from TGS is reading this and thinking, “Come on, we’re not going to sue anyone — we’re not GSI! — it’s fine :)” then my response is: Wonderful! In that case, why not just formalize everything by releasing the data under an open licence — preferably Creative Commons Attribution 4.0? (Unmodified! Don’t make the licensing mistakes that Equinor and NAM have made recently.) That way, everyone knows their rights, everyone can safely download the data, and the community can advance. And TGS looks pretty great for contributing an awesome dataset to the subsurface machine learning community.

I hope TGS decides to release the data with an open licence. If they don’t, it feels like a rather one-sided deal to me. And with the arrangement as it stands, there’s no way I would enter this contest.

December 08, 2020

An update on Volve

December 08, 2020/ Matt Hall

Writing about the new almost-open dataset at Groningen yesterday reminded me that things have changed a little on Equinor’s Volve dataset in Norway. Illustrating the principle that there are more ways to get something wrong than to get it right, here’s the situation there.

In 2018, Equinor generously released a very large dataset from the decommissioned field Volve. The data is undoubtedly cool, but initially it was released with no licence. Later in 2018, a licence was added but it was a non-open licence, CC BY-NC-SA. Then, earlier this year, the licence was changed to a modified CC BY licence. Progress, sort of.

I think CC BY is an awesome licence for open data. But modifying licences is always iffy and in this case the modifications mean that the licence can no longer be called ‘open’, because the restrictions they add are not permitted by the Open Definition. For me, the problematic clauses in the modification are:

You can’t sell the dataset. This is almost as ambiguous as the previous “non-commercial” clause. What if it’s a small part of a bigger offering that adds massive value, for example as demo data for a software package? Or as one piece in a large data collection? Or as the basis for a large and expensive analysis? Or if it was used to train a commercial neural network?

The license covers all data in the dataset whether or not it is by law covered by copyright. It's a bit weird that this is tucked away in a footnote, but okay. I don't know how it would work in practice because CC licenses depend on copyright. (The whole point of uncopyrightable content is that you can't own rights in it, never mind license it.)

It’s easy to say, “It’s fine, that’s not what Equinor meant.” My impression is that the subsurface folks in Equinor have always said, "This is open," and their motivation is pure and good, but then some legal people get involved and so now we have what we have. Equinor is an enormous company with (compared to me) infinite resources and a lot of lawyers. Who knows how their lawyers in a decade will interpret these terms, and my motivations? Can you really guarantee that I won’t be put in an awkward situation, or bankrupted, by a later claim — like some of GSI’s clients were when they decided to get tough on their seismic licenses?

Personally, I’ve decided not to touch Volve until it has a proper open licence that does not carry this risk.

December 07, 2020

A big new almost-open dataset: Groningen

December 07, 2020/ Matt Hall

Open data enthusiasts rejoice! There’s a large new openly licensed subsurface dataset. And it’s almost awesome.

Go to the dataset

The dataset has been released by Dutch oil and gas operator Nederlandse Aardolie Maatschappij (NAM), which is a 50–50 joint venture between Shell and ExxonMobil. They have operated the giant Groningen gas field since 1963, producing from the Permian Rotliegend Group, a 50 to 225 metre-thick sandstone with excellent reservoir properties. The dataset consists of a static geological model and its various components: data from over [edit: 6000 well logs], a prestack-depth migrated seismic volume, plus seismic horizons, and a large number of interpreted faults. It’s 4.4GB in total — not ginormous.

Induced seismicity

There’s a great deal of public interest in the geology of the area: Groningen has been plagued by induced seismicity for over 30 years. The cause has been identified as subsidence resulting from production, and became enough of a concern that the government took steps to limit production in 2014, and has imposed a plan to shut down the field completely by 2030. There are also pressure maintenance measures in place, as well as a lot of monitoring. However, the earthquakes continue, and have been as large as magnitude 3.6 — a big worry for people living in the area. I assume this issue is one of the major reasons for NAM releasing the data.*

In the map of the Top Rotliegendes (right, from Kortekaas & Jaarsma 2017), the elevation varies from –2442 m (red) to –3926 m. Major faults are shown in blue, along with seismic events of local magnitude 1.3 to 3.6. The Groningen field outline is shown in red.

Can you use the data? Er, maybe.

Anyone can access the data. NAM and Utrecht University, who have published the data, have selected a Creative Commons Attribution 4.0 licence, which is (in my opinion) the best licence to use. And unlike certain other data owners (see below!) they have resisted the temptation to modify the licence and confuse everyone. (It seems like they were tempted though, as the metadata contains the plea, “If you intend on using the model, please let us know […]”, but it’s not a requirement.)

However, the dataset does not meet the Open Definition (see section 1.4). As the owners themselves point out, there’s a rather major flaw in their dataset:

This model can only be used in combination with Petrel software • The model has taken years of expert development. Please use only if you are a skilled Petrel user.

I’ll assume this is a statement of fact, as opposed to a formal licence restriction. It’s clear that requiring (de facto or otherwise) the use of proprietary software (let alone software costing more than USD 100,000!) is not ‘open’ at all. No normal person has access to Petrel, and the annoying thing is that there’s absolutely no reason to make the dataset this inconvenient to use. The obvious format for seismic data is SEG-Y (although there is a ZGY reader out now), and there’s LAS 2 or even DLIS for wireline logs. There are no open standard formats for seismic horizons or formation tops, but some sort of text file would be fine. All of these formats have open source file readers, or can be parsed as text. Admittedly the geomodel is a tricky one; I don’t know about any open formats. [UPDATE: see the note below from EPOS-NL.]
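To show what "can be parsed as text" means in practice, here is a minimal sketch in Python that reads curve names and data from LAS 2.0 text using only the standard library. It rests on several assumptions: the `read_las_curves` function is a hypothetical helper, the sample log is synthetic, and the parser only handles simple unwrapped files, ignoring null values, wrapped mode, and the well and parameter sections. For real work, an established open source LAS reader would be a better choice.

```python
def read_las_curves(text):
    """Parse curve names and data columns from unwrapped LAS 2.0 text."""
    section = None
    names = []
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("~"):
            # Section headers start with a tilde, e.g. ~Curve or ~A
            section = line[1:2].upper()
            continue
        if section == "C":
            # Curve lines look like "MNEM.UNIT  API CODE : DESCRIPTION"
            names.append(line.split(".", 1)[0].strip())
        elif section == "A":
            # Data rows are whitespace-separated numbers, one column per curve
            rows.append([float(value) for value in line.split()])
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


# Synthetic sample, invented for illustration only.
SAMPLE = """~VERSION INFORMATION
 VERS.   2.0 : CWLS LOG ASCII STANDARD - VERSION 2.0
 WRAP.   NO  : ONE LINE PER DEPTH STEP
~CURVE INFORMATION
 DEPT.M      : Depth
 GR  .GAPI   : Gamma ray
 DT  .US/M   : Sonic
~A  DEPT   GR    DT
 1000.0  45.2  310.5
 1000.5  50.1  305.0
"""

curves = read_las_curves(SAMPLE)
print(sorted(curves))
print(curves["GR"])
```

A couple of dozen lines of plain Python is enough to get at the measurements, which is the whole point of an open, text-based format.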

Happily, even if the data owners do nothing, I think this problem will be remedied by the community. Some kind soul with access to Petrel will export the data into open formats, and then this dataset really will be a remarkable addition to the open subsurface data family. Stay tuned for more on this.

References

NAM (2020). Petrel geological model of the Groningen gas field, the Netherlands. Open access through EPOS-NL. Yoda data publication platform Utrecht University. DOI 10.24416/UU01-1QH0MW.

M Kortekaas & B Jaarsma (2017). Improved definition of faults in the Groningen field using seismic attributes. Netherlands Journal of Geosciences — Geologie en Mijnbouw 96 (5), p 71–85, 2017 DOI 10.1017/njg.2017.24.

UPDATE on 7 December 2020

* According to Henk Kombrink’s sources, the dataset release is “an initiative from NAM itself, driven primarily by a need from the research community for a model of the field.” Check out Henk’s article about the dataset:

Kombrink, H (2020). Static model giant Groningen field publicly available. Article in Expro News. https://expronews.com/technology/static-model-giant-groningen-field-publicly-available/

UPDATE 10 December 2020

I got the following information from EPOS-NL:

“EPOS-NL and NAM are happy to see the enthusiasm for this most recent data publication. Petrel is one of the most commonly used software among geologists in both academia and industry, and so provides a useful platform for many users worldwide. For those without a Petrel license, the data publication includes a RESCUE 3d grid data export of the model. RESCUE data can be read by a number of open source software. This information was not yet very clearly provided in the data description, so thanks for pointing this out. Finally, the well log data and seismic data used in the Petrel model are also openly accessible, without having to use Petrel software, on the NLOG website (https://www.nlog.nl/en/data), i.e. the Dutch oil and gas portal. Hope this helps!”

April 03, 2019

What makes a good benchmark dataset?

April 03, 2019/ Matt Hall

Last week I mentioned that we need more open benchmark datasets in geoscience. I think benchmarks are important for researchers to work on, as a teaching aid, and as a way for us to objectively measure how well we’re doing on a particular problem. How else can we know how we’re doing, or compare Company X’s claim with Company Y’s?

What makes a good benchmark?

I haven’t unearthed any guides from other domains to help answer this question, and we don’t yet have enough experience to know for ourselves. But here’s what I’m thinking:

It must address at least one clear machine learning task. The more obviously useful the task, the more useful (and important) the benchmark. The benchmark dataset should be well suited to the task (but does not have to be comprehensive or definitive).
It must be open. That means explicitly licensed with an open, and preferably permissive, license. I think we need to avoid non-permissive (so-called ‘copyleft’) licenses, because it’s not clear how the ‘sharealike’ clause would affect works that depended on the dataset. And we definitely need to avoid restrictive non-commercial clauses.
It must be discoverable and accessible. In other words, it needs to be easy to find, and anyone should be able to get it, without registering on a website or waiting for an email or doing anything else that slows down the pace of their research. A properly open dataset can be replicated anywhere, so openness should take care of this.
It must have enough features to be interesting. This might mean different things for different tasks, but in general we’d like to see a few physical measurements (e.g. seismic, well logs, RockEval, core photos, field observations, flow rates, and so on). The features should be independent — we can always generate derivatives.
It must have labels. As well as some interesting features, the dataset must have some interpretive information with high information value (e.g. seismic facies, lithologies, depositional environment, sequence boundaries, EURs, and so on). Usually, these are expensive to acquire (which is partly why we’d like to be able to predict them).
It should name suitable prediction error evaluation methods, with reference implementations, for the intended task. If people are to use it as a score benchmark, they need to know how to score their own implementations of the task.
It can be de-localized, but not completely. We don’t need to know the exact whereabouts of the dataset, but if we remove the relative spatial relationships between wells, say, or don’t know which basin we’re in, then the questions we can ask about the data get a lot less interesting, and the whole situation gets much less realistic.
It should not be too big. More than about 1GB means unwieldy. It means difficult to download. It means too much room for nuance. And it means it’s probably impossible to explore in the space of a tutorial. It’s also much harder to get a big dataset into shape than a smaller one. A few thousand records, maybe 100,000 in some cases, is probably plenty.
It should be clean, but not too clean. No-one wants to spend hours processing a dataset before it can be used, or — worse — be bitten by some esoteric data problem only a domain expert would spot. But, on the other hand, a dataset with no issues at all might be a bit boring. And, in subsurface at least, completely unrepresentative!
It should be well documented. The dataset needs to be described to non-technical people, who know little or nothing about the subsurface. Remember that many users will not be proficient programmers either, so…
It should have an accompanying demonstration. For example, a script or notebook, preferably in at least a couple of languages, that shows how to load and inspect the data. Ideally this would include a demonstration of how to pose, and answer, a straightforward question as a machine learning task.

I’m not sure we can call this last one a criterion, but maybe in an ideal world…

It should be launched with a data science contest. If you’re feeling really brave, what better way to attract attention to the new open dataset than with a Kaggle-style contest?
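As a sketch of what a "reference implementation" of the error evaluation criterion might look like, here are two scoring functions for a facies classification task. The choice of accuracy and macro-averaged F1 is an assumption made for illustration, not something any particular benchmark prescribes, and the labels are made up.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    scores = []
    # Every label occurs in y_true or y_pred, so the denominator is never zero
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)


# Hypothetical labels for five samples
y_true = ["sand", "sand", "shale", "shale", "lime"]
y_pred = ["sand", "shale", "shale", "shale", "sand"]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f"accuracy = {acc:.3f}, macro F1 = {f1:.3f}")
```

Shipping something like this with the dataset means everyone scores their models the same way, so reported results can actually be compared.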

It’s certainly true that there are several datasets around. Unfortunately, the openness criterion eliminates most of them, so they fall at the first hurdle. For example, the very nice dataset that Brendon Hall used in the SEG machine learning contest is not open.

If you know of a dataset that could be coerced into meeting most of these criteria, we’d like to hear about it. I know a small army of people that would love to help get it into the open, and into the hands of machine learning researchers all over the world.

The thumbnail image for this post was adapted from an image by user arg_flickr on Flickr, licensed CC-BY.

Thanks to several people on Software Underground, for the discussion on this topic. In particular, Justin Gosses and Lukas Mosser pointed out the need for transparent error evaluation.

November 05, 2018

The next thing

November 05, 2018/ Matt Hall

Over the last several years, Agile has been testing some of the new ways of collaborating, centered on digital connections:

It all started with this blog, which began in 2010 with my move from Calgary to Nova Scotia. It’s become a central part of my professional life, but we’re all about collaboration and blogs are almost entirely one-way, so…
In 2011 we launched SubSurfWiki. It didn’t really catch on, although it was a good basis for some other experiments and I still use it sometimes. Still, we realized we had to do more to connect the community, so…
In 2012 we launched our 52 Things collaborative, open access book series. There are well over 5000 of these out in the wild now, but it made us crave a real-life, face-to-face collaboration, so…
In 2013 we held the first ‘unsession’, a mini-unconference, at the Canada GeoConvention. Over 50 people came to chat about unsolved problems. We realized we needed a way to actually work on problems, so…
Later that year, we followed up with the first geoscience hackathon. Around 15 or so of us gathered in Houston for a weekend of coding and tacos. We realized that the community needed more coding skills, so…
In 2014 we started teaching a one-day Python course aimed squarely at geoscientists. We only teach with subsurface data and algorithms, and the course is now 5 days long. We now needed a way to connect all these new hackers and coders, so…
In 2014, together with Duncan Child, we also launched Software Underground, a chat room for discussing topics related to the earth and computers. Initially it was a Google Group but in 2015 we relaunched it as an open Slack team. We wanted to double down on scientific computing, so…
In 2015 and 2016 we launched a new web app, Pick This (returning soon!), and grew our bruges and welly open source Python projects. We also started building more machine learning projects, and getting really good at it.

Growing and honing

We have spent the recent years growing and honing these projects. The blog gets about 10,000 readers a month. The sixth 52 Things book is on its way. We held two public unsessions this year. The hackathons have now grown to 60 or so hackers and have had about 400 participants in total, with five of them held this year already (plus three to come!). We have also taught Python to 400 geoscientists, including 250 this year alone. And the Software Underground has over 1000 members.

In short, geoscience has gone digital, and we at Agile are grateful and excited to be part of it. At no point in my career have I been more optimistic and energized than I am right now.

So it’s time for the next thing.

The next thing is starting with a new kind of event. The first one is 5 to 11 May 2019, and it’s happening in France. I’ll tell you all about it tomorrow.

October 16, 2018

Volve: not open after all

October 16, 2018/ Matt Hall

Back in June, Equinor made the bold and exciting decision to release all its data from the decommissioned Volve oil field in the North Sea. Although the intent of the release seemed clear, the dataset did not carry a license of any kind. Since you cannot use unlicensed content without permission, this was a problem. I wrote about this at the time.

To its credit, Equinor listened to the concerns from me and others, and considered its options. Sensibly, it chose an off-the-shelf license. It announced its decision a few days ago, and the dataset now carries a Creative Commons Attribution-NonCommercial-ShareAlike license.

Unfortunately, this license is not ‘open’ by any reasonable definition. The non-commercial stipulation means that a lot of people, perhaps most people, will not be able to legally use the data (which is why non-commercial licenses are not open licenses). And the ShareAlike part means that we’re in for some interesting discussion about what derived products are, because any work based on Volve will have to carry the CC BY-NC-SA license too.

Non-commercial licenses are not open

Here are some of the problems with the non-commercial clause:

NC licenses are not 'open'. They do not meet the Open Definition, because open data must be available for anyone to use, for any purpose. For example, here’s a quote from Hagedorn et al (2011):

NC licenses come at a high societal cost: they provide a broad protection for the copyright owner, but strongly limit the potential for re-use, collaboration and sharing in ways unexpected by many users

NC licenses are incompatible with CC-BY-SA. This means that the data cannot be used on Wikipedia, SEG Wiki, or AAPG Wiki, or in any openly licensed work carrying that license.
NC-licensed data cannot be used commercially. This is obvious, but far-reaching. It means, for example, that nobody can use the data in a course or event for which they charge a fee. It means nobody can use the data as a demo or training data in commercial software. It means nobody can use the data in a book that they sell.
The boundaries of the license are unclear. It's arguable whether any business can use the data for any purpose at all, because many of the boundaries of the scope have not been tested legally. What about a course run by AAPG or SEG? What about a private university? What about a government, if it stands to realize monetary gain from, say, a land sale? All of these uses would be illegal, because it’s the use that matters, not the commercial status of the user.

Now, it seems likely, given the language around the release, that Equinor will not sue people for most of these use cases. They may even say this. Goodness knows, we have enough nudge-nudge-wink-wink agreements like that already in the world of subsurface data. But these arrangements just shift the onus onto the end user and, as we’ve seen with GSI, things can change and one day you wake up with lawsuits.

ShareAlike means you must share too

Creative Commons licenses are, as the name suggests, intended for works of creativity. Indeed, the whole concept of copyright depends on creativity: copyright protects works of creative expression. If there’s no creativity, there’s no basis for copyright. So for example, a gamma-ray log is unlikely to be copyrightable, but seismic data is (follow the GSI link above to find out why). Non-copyrightable works are not covered by Creative Commons licenses.

All of which is just to help explain some of the language in the CC BY-NC-SA license agreement, which you should read. But the key part is in paragraph 4(b):

You may distribute, publicly display, publicly perform, or publicly digitally perform a Derivative Work only under the terms of this License

What’s a ‘derivative work’? It’s anything ‘based upon’ the licensed material, which is pretty vague and therefore all-encompassing. In short, if you use or show Volve data in your work, no matter how non-commercial it is, then you must attach a CC BY-NC-SA license to your work. This is why SA licenses are sometimes called ‘viral’.

By the way, the much-loved F3 and Penobscot datasets also carry the ShareAlike clause, so any work (e.g. a scientific paper) that uses them is open-access and carries the CC BY-SA license, whether the author of that work likes it or not. I’m pretty sure no-one in academic publishing knows this.

By the way again, everything in Wikipedia is CC BY-SA too. Maybe go and check your papers and presentations now :)

What should Equinor do?

My impression is that Equinor is trying to satisfy some business partner or legal edge case, but they are forgetting that they have orders of magnitude more capacity to deal with edge cases than the potential users of the dataset do. The principle at work here should be “Don’t solve problems you don’t have”.

Encumbering this amazing dataset with such tight restrictions effectively kills it. It more or less guarantees it cannot have the impact I assume they were looking for. I hope they reconsider their options. The best choice for any open data is CC-BY.

June 17, 2018

Big open data... or is it?

June 17, 2018/ Matt Hall

Huge news for data scientists and educators. Equinor, the company formerly known as Statoil, has taken a bold step into the open data arena. On Thursday last week, it 'disclosed' all of its subsurface and production data for the Volve oil field, located in the North Sea.

What's in the data package?

A lot! The 40,000-file package contains 5TB of data, that's 5,000GB!

This collection is substantially larger, both deeper and broader, than any other open subsurface dataset I know of. Most excitingly, Equinor has released a broad range of data types, from reports to reservoir models: 3D and 4D seismic, well logs and real-time drilling records, and everything in between. The only slight problem is that the seismic data are bundled in very large files at the moment; we've asked for them to be split up.

Questions about usage rights

Regular readers of this blog will know that I like open data. One of the cornerstones of open data is access, and there's no doubt that Equinor have done something incredible here. It would be preferable not to have to register at all, but free access to this dataset — which I'm guessing cost more than USD500 million to acquire — is an absolutely amazing gift to the subsurface community.

Another cornerstone is the right to use the data for any purpose. This involves the owner granting certain privileges, such as the right to redistribute the data (say, for a class exercise) or to share derived products (say, in a paper). I'm almost certain that Equinor intends the data to be used this way, but I can't find anything actually granting those rights. Unfortunately, if they aren't explicitly granted, the only safe assumption is that you cannot share or adapt the data.

For reference, here's the language in the CC-BY 4.0 licence:

Subject to the terms and conditions of this Public License, the Licensor hereby grants You a worldwide, royalty-free, non-sublicensable, non-exclusive, irrevocable license to exercise the Licensed Rights in the Licensed Material to:

reproduce and Share the Licensed Material, in whole or in part; and
produce, reproduce, and Share Adapted Material.

You can dig further into the requirements for open data in the Open Data Handbook.

The last thing we need is yet another industry dataset with unclear terms, so I hope Equinor attaches a clear licence to this dataset soon. Or, better still, just uses a well-known licence such as CC-BY (this is what I'd recommend). This will clear up the matter and we can get on with making the most of this amazing resource.

More about Volve

The Volve field was discovered in 1993, but not developed until 15 years later. It produced oil and gas for 8.5 years, starting on 12 February 2008 and ending on 17 September 2016, though about half of that came in the first 2 years (see below). The facility was the Maersk Inspirer jack-up rig, standing in 80 m of water, with an oil storage vessel in attendance. Gas was piped to Sleipner A. In all, the field produced 10 million Sm³ (63 million barrels) of oil, so is small by most standards, with a peak rate of 56,000 barrels per day.

Volve production over time in standard m³ (i.e. at 20°C). Multiply by 6.29 for barrels.
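The unit conversion in the caption can be checked against the figures quoted above with a few lines of Python. This is just a worked example of the arithmetic, using the 6.29 factor from the caption; the function names are illustrative.

```python
SM3_TO_BBL = 6.29  # barrels per standard cubic metre, as quoted in the caption


def sm3_to_barrels(volume_sm3):
    """Convert a volume in standard cubic metres to barrels."""
    return volume_sm3 * SM3_TO_BBL


def barrels_to_sm3(volume_bbl):
    """Convert a volume in barrels to standard cubic metres."""
    return volume_bbl / SM3_TO_BBL


# Total oil production quoted in the text: 10 million Sm3
total_oil_bbl = sm3_to_barrels(10e6)

# Peak rate quoted in the text: 56,000 barrels per day
peak_rate_sm3 = barrels_to_sm3(56_000)

print(f"Total oil: {total_oil_bbl / 1e6:.1f} million barrels")
print(f"Peak rate: {peak_rate_sm3:.0f} Sm3 per day")
```

The total comes out at about 62.9 million barrels, consistent with the rounded 63 million in the text, and the peak rate is roughly 8,900 Sm³ per day.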

The production was from the Jurassic Hugin Formation, a shallow-marine sandstone with good reservoir properties, at a depth of about 3000 m. The top reservoir depth map from the discovery report in the data package is shown here. (I joined Statoil in 1997, not long after this report was written, and the sight of this page brings back a lot of memories.)

The top reservoir depth map from the discovery report. The Volve field (my label) is the small closure directly north of Sleipner East, with 15/9-19 well on it.

Get the data

To explore the dataset, you must register in the 'data village', which Equinor has committed to maintaining for 2 years. It only takes a moment. You can get to it through this link.

Let us know in the comments what you think of this move, and do share what you get up to with the data!
