Equinor should change its open data licence

This is an open letter to Equinor to appeal for a change to the licence used on Volve, Northern Lights, and other datasets. If you wish to co-sign, please add a supportive comment below. (Or if you disagree, please speak up too!)


Open data has had huge impact on science and society. Whether the driving purpose is innovation, transparency, engagement, or something else, open data can make a difference. Underpinning the dataset itself is its licence, which grants permission to others to re-use and distribute open data. Open data licences are licences that meet the Open Definition.

In 2018, Equinor generously released a very large dataset from the decommissioned field Volve. Initially it was released with no licence. Later in 2018, a licence was added but it was a non-open licence, CC BY-NC-SA (open licences cannot be limited to non-commercial use, which is what the NC stands for). Then, in 2020, the licence was changed to a modified CC BY licence, which you can read here.

As far as I know, Volve and other projects still carry this licence. I’ll refer to this licence as “the Equinor licence”. I assume it applies to the collection of data, and to the contents of the collection (where applicable).

There are 3 problems with the licence as it stands:

  1. The licence is not open.

  2. Modified CC licences have issues.

  3. The licence is not clear and exposes licencees to risk of infringement.

Let's look at these in turn.

The licence is not open

The Equinor licence is not an open licence. It does not meet the Open Definition, section 2.1.2 of which states:

The license must allow redistribution of the licensed work, including sale, whether on its own or as part of a collection made from works from different sources.

The licence does not allow sale and therefore does not meet this criterion. Non-open licences are not compatible with open licences, therefore these datasets cannot be remixed and re-used with open content. This greatly limits the usefulness of the dataset.

Modified CC licences have issues

The Equinor licence states:

This license is based on CC BY 4.0 license 

I interpret this to mean that it is intended to act as a modified CC BY licence. There are two issues with this:

  1. The copyright lawyers at Creative Commons strongly advises against modifying (in particular, adding restrictions to) their licences.

  2. If you do modify one, you may not refer to it as a CC BY licence or use Creative Commons trademarks; doing so violates their trademarks.

Both of these issues are outlined in the Creative Commons Wiki. According to that document, these issues arise because modified licences confuse the public. In my opinion (and I am not a lawyer, etc), the Equinor licence is confusing, and it appears to violate the Creative Commons organization's trademark policy.

Note that 'modify' really means 'add restrictions to' here. It is easier to legally and clearly remove restrictions from CC licences, using the CCPlus licence extension pattern

The licence is not clear

The Equinor licence contains five restrictions:

  1. You may not sell the Licensed Material.

  2. You must give Equinor and the Volve license partners credit, and provide a link to these terms and conditions, as well as a copyright notice if applicable.

  3. You may not share Adapted Material under a license that prevents recipients from complying with these terms and conditions.

  4. You shall not use the Licensed Material in a manner that appears misleading nor present the Licensed Material in a distorted or incorrect manner. 

  5. The license covers all data in the dataset whether or not it is by law covered by copyright.

Looking at the points in turn:

Point 1 is, I believe, the main issue for Equinor. For some reason, this is paramount for them.

Point 2 seems like a restatement of the BY restriction that is the main feature of the CC-BY licence and is extensively described in Section 3.a of that licence

Point 3 is already covered by CC BY in Section 3.a.4.

Point 4 is ambiguous and confusing. Who is the arbiter of this potentially subjective criterion? How will it be applied? Will Equinor examine every use of the data? The scenario this point is trying to prevent seems already to be covered by standard professional ethics and 'errors and omissions'. It's a bit like saying you can't use the data to commit a crime — it doesn't need saying because commiting crimes is already illegal. 

Point 5 is strange. I don’t know why Equinor wants to licence material that no-one owns, but licences are legal contracts, and you can bind people into anything you can agree on. One note here — the rights in the database (so-called 'database rights') are separate from the rights in the contents: it is possible in many jurisdictions to claim sui generis rights in a collection of non-copyrightable elements; maybe this is what was intended? Importantly, Sui generis database rights are explicitly covered by CC BY 4.0.

Finally, I recently received an email communication from Equinor that stated the following:

[...] nothing in our present licencing inhibits the fair and widespread use of our data for educational, scientific, research and commercial purposes. You are free to download the Licensed Material for non-commercial and commercial purposes. Our only requirement is that you must add value to the data if you intend to sell them on.

The last sentence (“Our only requirement…”) states that there is only one added restriction. But, as I just pointed out, this is not what the licence document states. The Equinor licence states that one may not sell the licensed material, period. The email states that I can sell it if I add value. Then the questions are, "What does 'add value' mean?", and "Who decides?". (It seems self-evident to me that it would be very hard to sell open material if one wasn't adding value!)

My recommendations

In its current state, I would not recommend anyone to use the Volve or Northern Lights data for any purpose. I know this sounds extreme, but it’s important to appreciate the huge imbalance in the relationship between Equinor and its licensees. If Equinor's future counsel — maybe in a decade — decides that lots of people have violated this licence, what happens next could be quite unjust. Equinor can easily put a small company out of business with a lawsuit. I know that might seem unlikely today, but I urge you to read about GSI's extensive lawsuits in Canada — this is a real situation that cost many companies a lot of money. You can read about it in my blog post, Copyright and seismic data.

When it comes to licences, and legal contracts in general, I believe that less is more. Taking a standard licence and adding words to solve problems you don’t have but can imagine having — and lawyers have very good imaginations — just creates confusion.

I therefore recommend the following:

  • Adopt an unmodifed CC BY 4.0 licence for the collection as a whole.

  • Adopt an unmodifed CC BY 4.0 licence for the contents of the collection, where copyrightable.

  • Include copyright notices that clearly state the copyright owners, in all relevant places in the collection (e.g. data folders, file headers) and at least at the top level. This way, it's clear how attribution should be done.

  • Quell the fear of people selling the dataset by removing as many possible barriers to using the free version as possible, and generally continuing to be a conspicuous champion for open data.

If Equinor opts to keep a version of the current licence, I recommend at least removing any mention of CC BY, it only adds to the confusion. The Equinor licence is not a CC BY licence, and mentioning Creative Commons violates their policy. We also suggest simplifying the licence if possible, and clarifying any restrictions that remain. Use plain language, give examples, and provide a set of Frequently Asked Questions.

The best path forward for fostering a community around these wonderful datasets that Equinor has generously shared with the community, is to adopt a standard open licence as soon as possible.

Which open licence should I choose?

I’ve written about open data a few times recently. And not-so-recently. And there’s been quite a bit of chat about open subsurface benchmarks in the Software Underground recently. As more people consider openly releasing data — or code, or other content — one question comes up fairly often is: Which licence should I choose?

I’ll start at the beginning, and I am not a lawyer, but this is going to be very high level. So do click on the links to read more.

What is copyright?

You automatically own the copyright to anything original that you create. You don’t have to register it, but the thing you made — and it must be a thing, you can’t copyright ideas — must be original. It could be a photo, a song, or a seismic interpretation. Physical measurements with no creative input, such as well logs, are not copyrightable… but a database consisting of such data is (so-called database rights). Your rights are exclusive, worldwide, and last until some years after you die (it varies).

If someone wants to use your work, even if they just found it on the Internet, they must either claim Fair Use, or seek permission from you. Giving permission means granting a licence; it can be as restrictive and arcane as you want.

If you don’t want people bothering you about licences, or if you want to actively encourage people to use and adapt your work, you can preemptively grant an open licence.

What is openness?

Before you start thinking about licences, there are two more big things to learn about:

  1. What is open? Not all licences, not even all Creative Commons licences, meet the Open Definition. In brief, this states that “Open data and content can be freely used, modified, and shared by anyone for any purpose” — you can’t restrict people based on their use case or location. So licences that forbid commercial application are not open.

  2. What is permissiveness? Once you’ve decided to go open, you need to decide where you stand on permissiveness. Some licences, notably those advocated by the GNU Free Software Movement, compel licensees (users) to preserve the openness of the work in any future redistribution. This ‘viral’ condition is sometimes called copyleft.

In some circles, a near-religious war smoulders on the permissiveness issue. You need to make up your own mind where you stand, or at least understand the issues.

By the way, granting a licence does not mean giving up your rights. In fact, you must own the copyright in order to grant the licence. Many scientists don’t realize we’ve been giving away the copyright in our work for decades, as a (completely unnecessary and made up) condition of publication.

Another source of confusion: open licences are also not the same thing as public domain. Public domain means that the work is free from copyright restrictions. In general though, it cannot be applied to a copyrighted work (though CC0 tries to relinquish copyright where possible). For example, On The Origin of Species is public domain, as is most work produced by the United States government (for example, by the USGS).

One last thing: an often overlooked aspect of licensing is protection for you, the licensor. All common licences include language that indemnifies you from misuse or misinterpretation of your work. So be careful about putting your stuff ‘out there’ with anything other than a standard licence: you may be leaving yourself open to liability issues later.

Open licences

Rather than writing a lot of stuff that’s been written by smarter people than me, I thought I’d draw a diagram to try to explain the differences between some common licences (there are certainly a lot more than the ones I mention here).

Just to re-iterate: there are a lot more licences than the ones mentioned here, these are just examples.

What do I recommend?

For content, my personal belief is that CC-BY most aptly captures the way science works. Scientists 'build on the shoulders of giants' by re-using the work of others with fastidious attribution, usually by citation. Accordingly, the CC-BY protects the licensor, ensures attribution, and that's it. If you prefer copyleft licences, the equivalent licence is CC-BY-SA.

But Creative Commons recommend against using CC licences for source code, so what should you do then?

For code, the permissive licence closest to CC-BY is the MIT/BSD/Apache family of licences, of which only the Apache 2.0 licence offers some specific protections with respect to patents (in particular, it protects licensees from ‘upstream’ patent infringements). The equivalent copyleft licences are the GPL (for applications) and LGPL (for libraries).

For data, I tend to use CC-BY, but there are some specialist data licences (beware, they are poorly named in my opinion: the seemingly ‘vanilla’ ODbL is copyleft; the permissive equivalent is ODC-By).

What about mixed content, like a Jupyter Notebook? You have to be practical; maybe it depends on whether you consider your notebooks to be 'content' or 'source code'. I sometimes put at the bottom of a notebook something like Open source content. Text is CC-BY, code is Apache 2.0 and I think this makes my intent clear.

Tools

There are some tools around to help you make a choice of licence:

Last thing

Note that open licences are just one piece of the jigsaw puzzle of reproducible science and reusable content. You also need to think about open and accessible data formats (e.g. CSV not XLS), accessible content (DOIs and open indexes), and documentation.

Although insufficient, open licences are a necessary component though. And while licences can be changed, they cannot be revoked… so it’s worth putting some thought into your choices before you start pushing your content out into the world.

If it seems hard to navigate, do get in touch, we’d be happy to help if and where we can (notwithstadning IANAL). If your situation is at all complicated I recommend seeking professional legal advice — but do go out of your way to find one who understands both the motivation for, and the legal issues around, open licensing.

An update on Volve

Writing about the new almost-open dataset at Groningen yesterday reminded me that things have changed a little on Equinor’s Volve dataset in Norway. Illustrating the principle that there are more ways to get something wrong than to get them right, here’s the situation there.


In 2018, Equinor generously released a very large dataset from the decommissioned field Volve. The data is undoubtedly cool, but initially it was released with no licence. Later in 2018, a licence was added but it was a non-open licence, CC BY-NC-SA. Then, earlier this year, the licence was changed to a modified CC BY licence. Progress, sort of.

I think CC BY is an awesome licence for open data. But modifying licences is always iffy and in this case the modifications mean that the licence can no longer be called ‘open’, because the restrictions they add are not permitted by the Open Definition. For me, the problematic clauses in the modification are:

  • You can’t sell the dataset. This is almost as ambiguous as the previous “non-commercial” clause. What if it’s a small part of a bigger offering that adds massive value, for example as demo data for a software package? Or as one piece in a large data collection? Or as the basis for a large and expensive analysis? Or if it was used to train a commercial neural network?

  • The license covers all data in the dataset whether or not it is by law covered by copyright. It's a bit weird that this is tucked away in a footnote, but okay. I don't know how it would work in practice because CC licenses depend on copyright. (The whole point of uncopyrightable content is that you can't own rights in it, nevermind license it.)

It’s easy to say, “It’s fine, that’s not what Equinor meant.” My impression is that the subsurface folks in Equinor have always said, "This is open," and their motivation is pure and good, but then some legal people get involved and so now we have what we have. Equinor is an enormous company with (compared to me) infinite resources and a lot of lawyers. Who knows how their lawyers in a decade will interpret these terms, and my motivations? Can you really guarantee that I won’t be put in an awkward situation, or bankrupted, by a later claim — like some of GSI’s clients were when they decided to get tough on their seismic licenses?

Personally, I’ve decided not to touch Volve until it has a proper open licence that does not carry this risk.

A big new almost-open dataset: Groningen

Open data enthusiasts rejoice! There’s a large new openly licensed subsurface dataset. And it’s almost awesome.

The dataset has been released by Dutch oil and gas operator Nederlandse Aardolie Maatschappij (NAM), which is a 50–50 joint venture between Shell and ExxonMobil. They have operated the giant Groningen gas field since 1963, producing from the Permian Rotliegend Group, a 50 to 225 metre-thick sandstone with excellent reservoir properties. The dataset consists of a static geological model and its various components: data from over [edit: 6000 well logs], a prestack-depth migrated seismic volume, plus seismic horizons, and a large number of interpreted faults. It’s 4.4GB in total — not ginormous.

Induced seismicity

There’s a great deal of public interest in the geology of the area: Groningen has been plagued by induced seismicity for over 30 years. The cause has been identified as subsidence resulting from production, and became enough of a concern that the government took steps to limit production in 2014, and has imposed a plan to shut down the field completely by 2030. There are also pressure maintenance measures in place, as well as a lot of monitoring. However, the earthquakes continue, and have been as large as magnitude 3.6 — a big worry for people living in the area. I assume this issue is one of the major reasons for NAM releasing the data.*

In the map of the Top Rotliegendes (right, from Kortekaas & Jaarsma 2017), the elevation varies from –2442 m (red) to –3926 m. Major faults are shown in blue, along with seismic events of local magnitude 1.3 to 3.6. The Groningen field outline is shown in red.

Rotliegendes_at_Groningen.png

Can you use the data? Er, maybe.

Anyone can access the data. NAM and Utrecht University, who have published the data, have selected a Creative Commons Attribution 4.0 licence, which is (in my opinion) the best licence to use. And unlike certain other data owners (see below!) they have resisted the temptation to modify the licence and confuse everyone. (It seems like they were tempted though, as the metadata contains the plea, “If you intend on using the model, please let us know […]”, but it’s not a requirement.)

However, the dataset does not meet the Open Definition (see section 1.4). As the owners themselves point out, there’s a rather major flaw in their dataset:

 

This model can only be used in combination with Petrel software • The model has taken years of expert development. Please use only if you are a skilled Petrel user.

 

I’ll assume this is a statement of fact, as opposed to a formal licence restriction. It’s clear that requiring (de facto or otherwise) the use of proprietary software (let alone software costing more than USD 100,000!) is not ‘open’ at all. No normal person has access to Petrel, and the annoying thing is that there’s absolutely no reason to make the dataset this inconvenient to use. The obvious format for seismic data is SEG-Y (although there is a ZGY reader out now), and there’s LAS 2 or even DLIS for wireline logs. There are no open standard formats for seismic horizons or formation tops, but some sort of text file would be fine. All of these formats have open source file readers, or can be parsed as text. Admittedly the geomodel is a tricky one; I don’t know about any open formats. [UPDATE: see the note below from EPOS-NL.]

petrel_logo.png

Happily, even if the data owners do nothing, I think this problem will be remedied by the community. Some kind soul with access to Petrel will export the data into open formats, and then this dataset really will be a remarkable addition to the open subsurface data family. Stay tuned for more on this.


References

NAM (2020). Petrel geological model of the Groningen gas field, the Netherlands. Open access through EPOS-NL. Yoda data publication platform Utrecht University. DOI 10.24416/UU01-1QH0MW.

M Kortekaas & B Jaarsma (2017). Improved definition of faults in the Groningen field using seismic attributes. Netherlands Journal of Geosciences — Geologie en Mijnbouw 96 (5), p 71–85, 2017 DOI 10.1017/njg.2017.24.


UPDATE on 7 December 2020

* According to Henk Kombrink’s sources, the dataset release is “an initiative from NAM itself, driven primarily by a need from the research community for a model of the field.” Check out Henk’s article about the dataset:

Kombrink, H (2020). Static model giant Groningen field publicly available. Article in Expro News. https://expronews.com/technology/static-model-giant-groningen-field-publicly-available/


UPDATE 10 December 2020

I got the following information from EPOS-NL:

EPOS-NL and NAM are happy to see the enthusiasm for this most recent data publication. Petrel is one of the most commonly used software among geologists in both academia and industry, and so provides a useful platform for many users worldwide. For those without a Petrel license, the data publication includes a RESCUE 3d grid data export of the model. RESCUE data can be read by a number of open source software. This information was not yet very clearly provided in the data description, so thanks for pointing this out. Finally, the well log data and seismic data used in the Petrel model are also openly accessible, without having to use Petrel software, on the NLOG website (https://www.nlog.nl/en/data), i.e. the Dutch oil and gas portal. Hope this helps!

Volve: not open after all

Back in June, Equinor made the bold and exciting decision to release all its data from the decommissioned Volve oil field in the North Sea. Although the intent of the release seemed clear, the dataset did not carry a license of any kind. Since you cannot use unlicensed content without permission, this was a problem. I wrote about this at the time.

To its credit, Equinor listened to the concerns from me and others, and considered its options. Sensibly, it chose an off-the-shelf license. It announced its decision a few days ago, and the dataset now carries a Creative Commons Attribution-NonCommercial-ShareAlike license.

Unfortunately, this license is not ‘open’ by any reasonable definition. The non-commercial stipulation means that a lot of people, perhaps most people, will not be able to legally use the data (which is why non-commercial licenses are not open licenses). And the ShareAlike part means that we’re in for some interesting discussion about what derived products are, because any work based on Volve will have to carry the CC BY-NC-SA license too.

Non-commercial licenses are not open

Here are some of the problems with the non-commercial clause:

NC licenses come at a high societal cost: they provide a broad protection for the copyright owner, but strongly limit the potential for re-use, collaboration and sharing in ways unexpected by many users

  • NC licenses are incompatible with CC-BY-SA. This means that the data cannot be used on Wikipedia, SEG Wiki, or AAPG Wiki, or in any openly licensed work carrying that license.

  • NC-licensed data cannot be used commercially. This is obvious, but far-reaching. It means, for example, that nobody can use the data in a course or event for which they charge a fee. It means nobody can use the data as a demo or training data in commercial software. It means nobody can use the data in a book that they sell.

  • The boundaries of the license are unclear. It's arguable whether any business can use the data for any purpose at all, because many of the boundaries of the scope have not been tested legally. What about a course run by AAPG or SEG? What about a private university? What about a government, if it stands to realize monetary gain from, say, a land sale? All of these uses would be illiegal, because it’s the use that matters, not the commercial status of the user.

Now, it seems likely, given the language around the release, that Equinor will not sue people for most of these use cases. They may even say this. Goodness knows, we have enough nudge-nudge-wink-wink agreements like that already in the world of subsurface data. But these arrangements just shift the onus onto the end use and, as we’ve seen with GSI, things can change and one day you wake up with lawsuits.

ShareAlike means you must share too

Creative Commons licenses are, as the name suggests, intended for works of creativity. Indeed, the whole concept of copyright, depends on creativity: copyright protects works of creative expression. If there’s no creativity, there’s no basis for copyright. So for example, a gamma-ray log is unlikely to be copyrightable, but seismic data is (follow the GSI link above to find out why). Non-copyrightable works are not covered by Creative Commons licenses.

All of which is just to help explain some of the language in the CC BY-NC-SA license agreement, which you should read. But the key part is in paragraph 4(b):

You may distribute, publicly display, publicly perform, or publicly digitally perform a Derivative Work only under the terms of this License

What’s a ‘derivative work’? It’s anything ‘based upon’ the licensed material, which is pretty vague and therefore all-encompassing. In short, if you use or show Volve data in your work, no matter how non-commercial it is, then you must attach a CC BY-NC-SA license to your work. This is why SA licenses are sometimes called ‘viral’.

By the way, the much-loved F3 and Penobscot datasets also carry the ShareAlike clause, so any work (e.g. a scientific paper) that uses them is open-access and carries the CC BY-SA license, whether the author of that work likes it or not. I’m pretty sure no-one in academic publishing knows this.

By the way again, everything in Wikipedia is CC BY-SA too. Maybe go and check your papers and presentations now :)

problems-dont-have.png

What should Equinor do?

My impression is that Equinor is trying to satisfy some business partner or legal edge case, but they are forgetting that they have orders of magnitude more capacity to deal with edge cases than the potential users of the dataset do. The principle at work here should be “Don’t solve problems you don’t have”.

Encumbering this amazing dataset with such tight restrictions effectively kills it. It more or less guarantees it cannot have the impact I assume they were looking for. I hope they reconsider their options. The best choice for any open data is CC-BY.

Why I don't flout copyright

Lots of people download movies illegally. Or spoof their IP addresses to get access to sports fixtures. Or use random images they found on the web in publications and presentations (I've even seen these with the watermark of the copyright owner on them!). Or download PDFs for people who aren't entitled to access (#icanhazpdf). Or use sketchy Russian paywall-crumbling hacks. It's kind of how the world works these days. And I realize that some of these things don't even sound illegal.

This might surprise some people, because I go on so much about sharing content, open geoscience, and so on. But I am an annoying stickler for copyright rules. I want people to be able to re-use any content they like, without breaking the law. And if people don't want to share their stuff, then I don't want to share it.

Maybe I'm just getting old and cranky, but FWIW here are my reasons:

  1. I'm a content producer. I would like to set some boundaries to how my stuff is shared. In my case, the boundaries amount to nothing more than attribution, which is only fair. But still, it's my call, and I think that's reasonable, at least until the material is, say, 5 years old. But some people don't understand that open is good, that shareable content is better than closed content, that this is the way the world wants it. And that leads to my second reason:
  2. I don't want to share closed stuff as if it was open. If someone doesn't openly license their stuff, they don't deserve the signal boost — they told the world to keep their stuff secret. Why would I give them the social and ethical benefits of open access while they enjoy the financial benefits of closed content? This monetary benefit comes from a different segment of the audience, obviously. At least half the people who download a movie illegally would not, I submit, have bought the movie at a fair price.

So make a stand for open content! Don't share stuff that the creator didn't give you permission to share. They don't deserve your gain filter.