A forensic audit of seismic data

The SEG-Y “standard” is famously non-standard. (Those air quotes are actually part of the “standard”.)

For example, the inline and crossline location of a given trace — two things that you must have in order to load the data vaguely properly — are “recommended” (remember, it’s a “standard”) to be given in the trace’s header, at byte locations 189 and 193 respectively. Indeed, they might well be there. Or 1 and 5 (well, 5 or 9). Or somewhere else. Or not there at all.

Don Robinson at Resolve told me recently that he has seen more than 180 byte-location combinations, and he said another service company had seen more than 300.

All this can make loading seismic data really, really annoying.

I’d like to propose that the community performs a kind of forensic audit of SEG-Y files. I have 5 main questions:

  1. What proportion of files claim to be Rev 0, Rev 1, and Rev 2? And what standard are they actually? (If any!)

  2. What proportion of files in the wild use IBM vs IEEE floats? What about integers?

  3. What proportion of files in the wild use little-endian vs big-endian byte order? (Please tell me there's no middle-endian data out there!)

  4. What proportion of files in the wild use EBCDIC vs ASCII encoded textual file headers? (Again, I would hope there are no other encodings in use, but I bet there are.)

  5. What proportion of files use the Strongly recommended and Recommended byte locations for trace numbers, sample counts, sample interval, coordinates and inline–crossline numbers?

For each of these <things> it would also be interesting to know:

  • How does <thing> vary with the other things? That is, what's the cross-correlation matrix?

  • How does <thing> vary with the age of the file? Is there a temporal trend?

  • How does <thing> vary with the provenance of the file? What's the geographic trend? (For example, Don told me that the prevalence of PC-based interpretation packages in Canada led to widespread early adoption of IEEE floats and little-endian byte order; even so, he says that 90% of the SEG-Y he sees in the wild still uses IBM-formatted floats!)

While we’re at it, I'm also interested in some more esoteric things:

  • How many files have cornerpoints in the text header, and/or trace locations in trace headers?

  • How many files have an unambiguous CRS in the text header?

  • How many files have information about the processing sequence in the text header? (E.g. imaging details, filters, etc.)

  • How many files have incorrect information in the headers (e.g. locations, sample interval, byte format, etc.)?

  • How many processors bother putting useful things like elevation, filters, sweeps, fold at target, etc, in the trace headers?

I don’t quite know how such a survey would happen. Most of these things are obviously detectable from the files themselves. Perhaps some of the many seismic data management systems already track these things. Or maybe you’re a data manager and you have some anecdotal data you can share.
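
For what it's worth, a lot of this sniffing only takes a few lines of Python with the standard struct module. Here's a rough sketch: the byte positions follow the SEG-Y Rev 1 layout, and the endianness and encoding checks are crude heuristics, not a robust parser.

import struct

def sniff_segy(path):
    """Return a few header attributes from a SEG-Y file."""
    with open(path, 'rb') as f:
        text_header = f.read(3200)    # 3200-byte textual file header
        binary_header = f.read(400)   # 400-byte binary file header

    # EBCDIC text headers usually begin with 'C' encoded as 0xC3;
    # in ASCII the same character is 0x43.
    encoding = 'EBCDIC' if text_header[:1] == b'\xc3' else 'ASCII'

    # The data sample format code lives at bytes 3225-3226, i.e. offset 24
    # in the binary header. 1 means IBM float, 5 means IEEE float.
    fmt = struct.unpack('>h', binary_header[24:26])[0]
    if 1 <= fmt <= 16:
        endian = 'big'
    else:
        endian = 'little'
        fmt = struct.unpack('<h', binary_header[24:26])[0]

    # The revision number is at bytes 3501-3502 (offset 300); Rev 1 is
    # encoded as 0x0100, so the major revision is the high byte.
    prefix = '>' if endian == 'big' else '<'
    revision = struct.unpack(prefix + 'H', binary_header[300:302])[0] >> 8

    return dict(encoding=encoding, endian=endian,
                format_code=fmt, revision=revision)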

What do you think? I’d love to hear your thoughts in the comments. Maybe there’s a good hackathon project here!

x lines of Python: load curves from LAS

Welcome to the latest x lines of Python post, in which we have a crack at some fundamental subsurface workflows... in as few lines of code as possible. Ideally, x < 10.

We've met curves once before in the series — in the machine learning edition, in which we cheated by loading the data from a CSV file. Today, we're going to get it from an LAS file — the popular standard for wireline log data.

Just as we previously used the pandas library to load CSVs, we're going to save ourselves a lot of bother by using an existing library — lasio by Kent Inverarity. Indeed, we'll go even further by also using Agile's library welly, which uses lasio behind the scenes.

The actual data loading is only 1 line of Python, so we have plenty of extra lines to try something more ambitious. Here's what I go over in the Jupyter notebook that goes with this post:

  1. Load an LAS file with lasio.
  2. Look at its header.
  3. Look at its curve data.
  4. Inspect the curves as a pandas DataFrame.
  5. Load the LAS file with welly.
  6. Look at welly's Curve objects.
  7. Plot part of a curve.
  8. Smooth a curve.
  9. Export a set of curves as a matrix.
  10. BONUS: fix some broken things in the file header.

Each one of those steps is a single line of Python. Together, I think they cover many of the things we'd like to do with well data once we get our hands on it. Have a play with the notebook and explore what you can do.
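
If you want a taste before opening the notebook, the loading steps look something like this (the file name is made up; the calls are plain lasio and welly):

import lasio
from welly import Well

# Read the LAS file with lasio (step 1)...
las = lasio.read('my_well.las')

# ...and get the curves as a pandas DataFrame, indexed by depth (step 4).
df = las.df()

# Read the same file with welly, which wraps lasio (step 5).
well = Well.from_las('my_well.las')

# A welly Curve object (step 6), e.g. the gamma-ray log, if the file has one.
gr = well.data['GR']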

Next time we'll take things a step further and dive into some seismic petrophysics.

x lines of Python: read and write CSV

A couple of weeks ago, in Murphy's Law for Excel, I wrote about the dominance of spreadsheets in applied analysis, and how they may be getting out of hand. Then in Organizing spreadsheets I wrote about how — if you are going to store data in spreadsheets — to organize your data so that you do the least amount of damage. The general goal being to make your data machine-readable. Or, to put it another way, to allow you to save your data as comma-separated values or CSV files.

CSV files are the de facto standard way to store data in text files. They are human-readable, easy to parse with multiple tools, and they compress easily. So you need to know how to read and write them in your analysis tool of choice; in our case, that's the Python language. Today I present a few different ways to get at data stored in CSV files.

How many ways can I read thee?

In the accompanying Jupyter Notebook, we read a CSV file into Python in six different ways:

  1. Using the pandas data analysis library. It's the easiest way to read CSV and XLS data into your Python environment...
  2. ...and can happily consume a file on the web too. Another nice thing about pandas. It also writes CSV files very easily.
  3. Using the built-in csv package. There are a couple of standard ways to do this — csv.reader...
  4. ...and csv.DictReader. This library is handy for when you don't have (or don't want) pandas.
  5. Using numpy, the numeric library for Python. If you just have a CSV full of numbers and you want an array in the end, you can skip pandas.
  6. OK, it's not really a CSV file, but for the finale we read a spreadsheet directly from Google Sheets.

I usually count my lines diligently in these posts, but not this time. With pandas you're looking at a one-liner to read your data:

df = pd.read_csv("myfile.csv")

and a one-liner to write it out again. With csv.DictReader you're looking at 3 lines to get a list of dicts (but watch out: your numbers will be strings). Reading a Google Sheet is a little more involved, not least because you'll need to set up an app and get an API key to handle authentication.
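
To give a flavour of a couple of the other routes (the file names here are placeholders, and df is the DataFrame from the pandas one-liner above):

import csv
import numpy as np

# csv.DictReader: a few lines, but note every value comes back as a string.
with open('myfile.csv') as f:
    rows = list(csv.DictReader(f))

# numpy: handy when the file is all numbers and you just want an array.
data = np.loadtxt('numbers.csv', delimiter=',', skiprows=1)

# And the pandas one-liner to write a DataFrame back out again.
df.to_csv('myfile_out.csv', index=False)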

That's all there is to CSV files. Go forth and wield data like a pro! 

Next time in the xlines of Python series we'll look at reading seismic station data from the web, and doing a bit of time-series analysis on it. No more stuff about spreadsheets and CSV files, I promise :)


The thumbnail image is based on the possibly apocryphal Banksy image of an armed panda, and one of texturepalace.com's CC-BY textures.

Organizing spreadsheets

A couple of weeks ago I alluded to ill-formed spreadsheets in my post Murphy's Law for Excel. Spreadsheets are clearly indispensable, and are definitely great for storing data and checking CSV files. But some spreadsheets need to die a horrible death. I'm talking about spreadsheets that look like this (click here for the entire sheet):

Bad_spreadsheet_3.png

This spreadsheet has several problems. Among them:

  • The position of a piece of data changes how I interpret it. E.g. a blank row means 'new sheet' or 'new well'.
  • The cells contain a mixture of information (e.g. 'Site' and the actual data) and appear in varying units.
  • Some information is encoded by styles (e.g. using red to denote a mineral species). If you store your sheet as a CSV (which you should), this information will be lost.
  • Columns are hidden, there are footnotes, it's just a bit gross.

Using this spreadsheet to make plots, or reading it with software, will be a horrible experience. I will probably swear at my computer, suffer a repetitive strain injury, and go home early with a headache, cursing the muppet that made the spreadsheet in the first place. (Admittedly, I am the muppet that made this spreadsheet in this case, but I promise I did not invent these pathologies. I have seen them all.)

Let's make the world a better place

Consider making separate sheets for the following:

  • Raw data. This is important. See below.
  • Computed columns. There may be good reasons to keep these with the data.
  • Charts.
  • 'Tabulated' data, like my bad spreadsheet above, with tables meant for summarization or printing.
  • Some metadata, either in the file properties or a separate sheet. Explain the purpose of the dataset, any major sources, important assumptions, and your contact details.
  • A rich description of each column, with its caveats and assumptions.

The all-important data sheet has its own special requirements. Here's my guide for a pain-free experience:

  • No computed fields or plots in the data sheet.
  • No hidden columns.
  • No semantic meaning in formatting (e.g. highlighting cells or bolding values).
  • Headers in the first row, only data in all the other rows.
  • The column headers should contain only a unique name and [units], e.g. Depth [m], Porosity [v/v].
  • Only one type of data per column: text OR numbers, discrete categories OR continuous scalars.
  • No units in numeric data cells, only quantities. Record depth as 500, not 500 m.
  • Avoid keys or abbreviations: use Sandstone, Limestone, Shale, not Ss, Ls, Sh.
  • Zero means zero, empty cell means no data.
  • Only one unit per column. (You only use SI units, right?)
  • Attribution! Include a citation or citations for every record.
  • If you have two distinct types or sources of data, e.g. grain size from sieve analysis and grain size from photomicrographs, then use two different columns.
  • Personally, I like the data sheet to be the first sheet in the file, but maybe that's just me.
  • Check that it turns into a valid CSV so you can use this awesome format.
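
One way to do that last check, as a sketch with a made-up file name: save the data sheet as CSV, read it back with pandas, and make sure each column comes through as a single, sensible data type.

import pandas as pd

df = pd.read_csv('my_data.csv')
print(df.dtypes)           # 'object' columns may be hiding mixed text and numbers
print(df.isnull().sum())   # empty cells become NaN, which is what we want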

      After all that, here's what we have (click here for the entire sheet):

    The same data as the first image, but improved. The long strings in columns 3 and 4 are troublesome, but we can tolerate them.

    Maybe the 'clean' analysis-friendly sheet looks boring to you, but to me it looks awesome. Above all, it's easy to use for SCIENCE! And I won't have to go home with a headache.


    The data in this post came from this Cretaceous shale dataset [XLS file] from the government of Manitoba. Their spreadsheet is pretty good and only breaks a couple of my golden rules. Here's my version, with both the broken and the fixed spreadsheets. Let me know if you spot something else that should be fixed!

    x lines of Python: read and write a shapefile

    Shapefiles are a sort-of-open format for geospatial vector data. They can encode points, lines, and polygons, plus attributes of those objects, optionally bundled into groups. I say 'sort-of-open' because the format is well-known and widely used, but it is maintained and policed, so to speak, by ESRI, the company behind ArcGIS. It's a slightly weird (annoying) format because 'a shapefile' is actually a collection of files, only one of which is the eponymous SHP file. 

    Today we're going to read a SHP file, change its Coordinate Reference System (CRS), add a new attribute, and save a new file in two different formats. All in x lines of Python, where x is a small number. To do all this, we need to add a new toolbox to our xlines virtual environment: geopandas, which is a geospatial flavour of the popular data management tool pandas.

    Here's the full rundown of the workflow, where each item is a line of Python:

    1. Open the shapefile with fiona (i.e. not using geopandas yet).
    2. Inspect its contents.
    3. Open the shapefile again, this time with geopandas.
    4. Inspect the resulting GeoDataFrame in various ways.
    5. Check the CRS of the data.
    6. Change the CRS of the GeoDataFrame.
    7. Compute a new attribute.
    8. Write the new shapefile.
    9. Write the GeoDataFrame as a GeoJSON file too.

    By the way, if you have not come across EPSG codes yet for CRS descriptions, they are the only way to go. This dataset is initially in EPSG 4267 (NAD27 geographic coordinates) but we change it to EPSG 26920 (NAD83 UTM20N projection).

    Several bits of our workflow are optional. The core part of the code, items 3, 6, 7, and 8, are just a few lines of Python:

        import geopandas as gpd

        gdf = gpd.read_file('data_in.shp')                    # item 3: read with geopandas
        gdf = gdf.to_crs({'init': 'epsg:26920'})              # item 6: reproject to NAD83 / UTM zone 20N
        gdf['seafl_twt'] = 2 * 1000 * gdf.Water_Dept / 1485   # item 7: water-bottom two-way time in ms at 1485 m/s
        gdf.to_file('data_out.shp')                           # item 8: write the new shapefile

    That's it! 
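
    The optional steps aren't much more work either. Here's a sketch of items 1, 2, and 9: peeking inside the shapefile with fiona, then writing the GeoDataFrame out as GeoJSON (the exact lines in the notebook may differ slightly).

        import fiona

        # Items 1 and 2: peek inside the shapefile without geopandas.
        with fiona.open('data_in.shp') as c:
            print(c.schema)   # attribute names and types
            print(c.crs)      # the file's CRS

        # Item 9: write the GeoDataFrame out as GeoJSON as well.
        gdf.to_file('data_out.geojson', driver='GeoJSON')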

    As in all these posts, you can follow along with the code in the Jupyter Notebook.

    What should national data repositories do?

    Right now there's a conference happening in Stavanger, Norway: National Data Repository 2017. My friend David Holmes of Dell EMC, a long-time supporter of Agile's recent hackathons and a general geocomputing infrastructure superhero, is there. He's giving a talk, I think, and chairing at least one session. He asked a question today on Software Underground:

    If anyone has any thoughts or ideas as to what the regulators should be doing differently now is a good time to speak up :)

    My response

    For me it's about raising their aspirations. Collectively, they are sitting on one of the most valuable — or invaluable — datasets in the world, comparable to Hubble, or the LHC. Better yet, the data are (in most cases) already open and they actually want to share it. And the community (us) is better tooled than ever, and perhaps also more motivated, to get cracking. So the possibility is there to see a revolution in subsurface science and exploration (in the broadest sense of the word) and my challenge to them is:

    Can they now create the conditions for this revolution in earth science?

    Some things I think they can do right now:

    • Properly fund the development of an open data platform. I'll expand on this topic below.
    • Don't get too twisted off on formats (go primitive), platforms (pick one), licenses (go generic), and other busy work that committees love to fret over. Articulate some principles (e.g. public first, open source, small footprint, no lock-in, componentize, no single provider, let-users-choose, or what have you), and stay agile. 
    • Lobby NOCs and IOCs hard to embrace integrated and high-quality open data as an advantage that society, as well as industry, can share in. It's an important piece in the challenge we face to modernize the industry. Not so that it can survive for survival's sake, but so that it can serve society for as long as it's needed. 
    • Get involved in the community: open up their processes and collaborate a lot more with the technical societies — like show up and talk about their programs. (How did I not hear about the CDA's unstructured data challenge — a subject I'm very much into — till it was over? How many other potential participants just didn't know about it?)

    An open data platform

    The key piece here is the open data platform. Here are the features I'd like to see of such a platform:

    • Optimized for users, not the data provider, hosting provider, or system administrator.
    • Clear rights: well-known, documented, obvious, clearly expressed open licenses for re-use.
    • Meaningful levels of access that are free of charge for most users and most use cases.
    • Access for humans (a nice mappy web interface) with no awkward or slow registration processes.
    • Access for machines (a nice API, perhaps even a couple of libraries expressing it).
    • Tools for query, discovery, and retrieval; ideally with user feedback paths ('more like this, less like that').
    • Ways to report, or even fix, problems in the data. This relieves you of "the data's not ready" procrastination.
    • Good documentation of all of this, ideally in a wiki or something that people can improve.
    • Support for a community of users and developers that want to do things with the data.

    Building this platform is not trivial. There's massive file storage, a database back end, a web front end, licensing, and so on. Then there's the community of developers and users to engage and support. It will take years, and never be finished. It sounds hard... but people are doing it. Prototypes for seismic data exist already, and there are countless models in other verticals (just check out the Registry of Research Data Repositories, or look at the list on PLOS). 

    The contract to build data infrastructure is often awarded to the likes of Schlumberger, Halliburton or CGG. In theory, these companies have the engineering depth to pull it off (though this too is debatable, especially in today's web-first, native-never world). But they completely lack the culture required: there's no corporate understanding of what 'open' means. So the model is broken in subtle but fatal ways and the whole experiment fails. 

    I'm excited to hear what comes out of this conference. If you're there, please tell!

    Hard things that look easy

    After working on a few data science (aka data analytics aka machine learning) problems with geoscientific data, I think we've figured out the 10-step workflow. I'm happy to share it with you now:

    1. Look at all these cool problems, machine learning can solve all of these! I just need to figure out which model to use, parameterize it, and IT'S GONNA BE AWESOME, WE'LL BE RICH. Let's just have a quick look at the data...
    2. Oh, there's no data.
    3. Three months later: we have data! Oh, the data's a bit messy.
    4. Six months later: wow, cleaning the data is gross and/or impossible. I hate my life.
    5. Finally, nice clean data. Now, which model do I choose? How do I set parameters? At least you expected these problems. These are well-known problems.
    6. Wait, maybe there are physical laws governing this natural system... oh well, the model will learn them.
    7. Hmm, the results are so-so. I guess it's harder to make predictions than I thought it would be.
    8. Six months later: OK, this sort of works. And people think it sounds cool. They just need a quick explanation.
    9. No-one understands what I've done.
    10. Where is everybody?

    I'm being facetious of course, but only a bit. Modeling natural systems is really hard. Much harder for the earth than for, say, the human body, which is extremely well-known and readily available for inspection. Even the weather is comparatively easy.

    Coupled with the extreme difficulty of the problem, we have a challenging data environment. Proprietary, heterogeneous, poor quality, lost, non-digital... There are lots of ways the data goblins can poop on the playground of machine learning.

    If the machine learning lark is so hard, why not just leave it to non-artificial intelligence — humans? We already learned how to interpret data, right? We know the model takes years to train. Of course, but I don't accept that we couldn't use some of the features of intelligently applied big data analytics: objectivity, transparency, repeatability (by me), reproducibility (by others), massive scale, high speed... maybe even error tolerance and improved decisions, but those seem far off right now.

    I also believe that AI models, like any software, can encode the wisdom of professionals — before they retire. This seems urgent, as the long-touted Great Crew Change is finally underway.

    What will we work on?

    There are lots of fascinating and tractable problems for machine learning to attack in geoscience — I hope many of them get attacked at the hackathon in June — and the next 2 to 3 years are going to be very exciting. There will be the usual marketing melée to wade through, but it's up to the community of scientists and data analysts to push their way through that with real results based on open data and, ideally with open code.

    To be sure, this is happening already — over 25 entrants have published their solutions to the SEG machine learning contest, and there will be more efforts like this. It's the only way to build transparent problem-solving systems that we can all participate in and, ultimately, trust.

    What machine learning problems are most pressing in geoscience?
    I'm collecting ideas for projects to tackle in the hackathon. Please visit this Tricider question and contribute your comments, opinions, or ideas of your own. Help the community work on the problems you care about.

    Welly to the wescue

    I apologize for the widiculous title.

    Last week I described some headaches I was having with well data, and I introduced welly, an open source Python tool that we've built to help cure the migraine. The first versions of welly were built — along with the first versions of striplog — for the Nova Scotia Department of Energy, to help with their various data wrangling efforts.

    Aside — all software projects funded by government should in principle be open source.

    Today we're using welly to get data out of LAS files and into so-called feature vectors for a machine learning project we're doing for Canstrat (kudos to Canstrat for their support for open source software!). In our case, the features are wireline log measurements. The workflow looks something like this:

    1. Read LAS files into a welly 'project', which contains all the wells. This bit depends on lasio.
    2. Check what curves we have with the project table I showed you on Thursday.
    3. Check curve quality by passing a test suite to the project, and making a quality table (see below).
    4. Fix problems with curves with whatever tricks you like. I'm not sure how to automate this.
    5. Export as the X matrix, all ready for the machine learning task.

    Let's look at these key steps as Python code.

    1. Read LAS files

     
    from welly import Project
    p = Project.from_las('data/*.las')

    2. Check what curves we have

    Now we have a project full of wells and can easily make the table we saw last week. This time we'll use aliases to simplify things a bit — this trick allows us to refer to all GR curves as 'Gamma', so for a given well, welly will take the first curve it finds in the list of alternatives we give it. We'll also pass a list of the curves (called keys here) we are interested in:

    The project table. The name of the curve selected for each alias is shown, along with its mean and units as a quick QC. A couple of those RHOB curves definitely look dodgy, and they turned out to be DRHO correction curves.

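    By the way, the alias dictionary and the keys list aren't spelled out above, but they might look something like this (the mnemonics here are illustrative, not the actual ones from this project):

    alias = {
        'Gamma': ['GR', 'GAM', 'SGR'],
        'Density': ['RHOB', 'DEN', 'RHOZ'],
        'Sonic': ['DT', 'AC', 'DT4P'],
    }
    keys = ['Gamma', 'Density', 'Sonic']  # these get passed to p.curve_table_html()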

    3. Check curve quality

    Now we have to define a suite of tests. Lists of tests to run on each curve are held in a Python data structure called a dictionary. As well as tests for specific curves, there are two special test lists: Each and All, which are run on each curve encountered, and on all curves together, respectively. (The latter is required to, for example, compare the curves to each other to look for duplicates). The welly module quality contains some predefined tests, but you can also define your own test functions — these functions take a curve as input, and return either True (for a test pass) or False.

     
    from IPython.display import HTML  # to render the table in the notebook
    import welly.quality as qty

    tests = {
        'All': [qty.no_similarities],
        'Each': [qty.no_monotonic],
        'Gamma': [
            qty.all_positive,
            qty.mean_between(10, 100),
        ],
        'Density': [qty.mean_between(1000, 3000)],
        'Sonic': [qty.mean_between(180, 400)],
    }

    html = p.curve_table_html(keys=keys, alias=alias, tests=tests)
    HTML(html)

    The green dot means that all tests passed for that curve. Orange means some tests failed. If all tests fail, the dot is red. The quality score shows a normalized score for all the tests on that well. In this case, RHOB and DT are failing the 'mean_between' test because they have Imperial units.

    4. Fix problems

    Now we can fix any problems. This part is not yet automated, so it's a fairly hands-on process. Here's a very high-level example of how I fix one issue, just as an example:

     
    import numpy as np

    def fix_negs(c):
        c[c < 0] = np.nan
        return c
    
    # Glossing over some details, we give a mnemonic, a test
    # to apply, and the function to apply if the test fails.
    fix_curve_if_bad('GAM', qty.all_positive, fix_negs)

    What I like about this workflow is that the code itself is the documentation. Everything is fully reproducible: load the data, apply some tests, fix some problems, and export or process the data. There's no need for intermediate files called things like DT_MATT_EDIT or RHOB_DESPIKE_FINAL_DELETEME. The workflow is completely self-contained.

    5. Export

    The data can now be exported as a matrix, specifying a depth step that all data will be interpolated to:

     
    X, _ = p.data_as_matrix(X_keys=keys, step=0.1, alias=alias)

    That's it. We end up with a 2D array of log values that will go straight into, say, scikit-learn*. I've omitted here the process of loading the Canstrat data and exporting that, because it's a bit more involved. I will try to look at that part in a future post. For now, I hope this is useful to someone. If you'd like to collaborate on this project in the future — you know where to find us.
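
    Just to show where X might go next, here's a sketch (not the actual Canstrat workflow; the label vector y would come from the Canstrat lithology data, which I haven't shown):

    from sklearn.ensemble import RandomForestClassifier

    # X: one row per depth sample, one column per curve (from data_as_matrix).
    # y: lithology labels aligned with the rows of X.
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(X, y)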

    * For more on scikit-learn, don't miss Brendon Hall's tutorial in October's Leading Edge.


    I'm happy to let you know that agilegeoscience.com and agilelibre.com are now served over HTTPS — so connections are private and secure by default. This is just a matter of principle for the Web, and we go to great pains to ensure our web apps modelr.io and pickthis.io are served over HTTPS. Find out more about SSL from DigiCert, the provider of Squarespace's (and Agile's) certs, which are implemented with the help of the non-profit Let's Encrypt, who we use and support with dollars.

    Well data woes

    I probably shouldn't be telling you this, but we've built a little tool for wrangling well data. I wanted to mention it, because it's doing some really useful things for us — and maybe it can help you too. But I probably shouldn't because it's far from stable and we're messing with it every day.

    But hey, what software doesn't have a few or several or loads of bugs?

    Buggy data?

    It's not just software that's buggy. Data is as buggy as heck, and subsurface data is, I assert, the buggiest data of all. Give units or datums or coordinate reference systems or filenames or standards or basically anything at all a chance to get corrupted in cryptic ways, and they take it. Twice if possible.

    By way of example, we got a package of 10 wells recently. It came from a "data management" company. There are issues... Here are some of them:

    • All of the latitude and longitude data were in the wrong header fields. No coordinate reference system in sight anywhere. This is normal of course, and the only real side-effect is that YOU HAVE NO IDEA WHERE THE WELL IS.
    • Header chaos aside, the files were non-standard LAS sort-of-2.0 format, because tops had been added in their own little completely illegal section. But the LAS specification has a section for stuff like this (it's called OTHER in LAS 2.0).
    • Half the porosity curves had units of v/v, and half %. No big deal...
    • ...but a different half of the porosity curves were actually v/v. Nice.
    • One of the porosity curves couldn't make its mind up and changed scale halfway down. I am not making this up.
    • Several of the curves were repeated with other names, e.g. GR and GAM, DT and AC. Always good to have a spare, if only you knew if or how they were different. Our tool curvenam.es tries to help with this, but it's far from perfect.
    • One well's RHOB curve was actually the PEF curve. I can't even...

    The remarkable thing is not really that I have this headache. It's that I expected it. But this time, I was out of paracetamol.

    Cards on the table

    Our tool welly, which I stress is very much still in development, tries to simplify the process of wrangling data like this. It has a project object for collecting a lot of wells into a single data structure, so we can get a nice overview of everything: 


    Our goal is to include these curves in the training data for a machine learning task to predict lithology from well logs. The trained model can make really good lithology predictions... if we start with non-terrible data. Next time I'll tell you more about how welly has been helping us get from this chaos to non-terrible data.

    Copyright and seismic data

    Seismic company GSI has sued a lot of organizations recently for sharing its copyrighted seismic data, undermining its business. A recent court decision found that seismic data is indeed copyrightable, but Canadian petroleum regulations can override the copyright. This allows data to be disclosed by the regulator and copied by others — made public, effectively.


    Seismic data is not like other data

    Data is uncopyrightable. Like facts and ideas, data is considered objective, uncreative — too cold to copyright. But in an important ruling last month, the Honourable Madam Justice Eidsvik established in the Alberta Court of Queen's Bench that seismic data is not like ordinary data. According to this ruling:

     

    ...the creation of field and processed [seismic] data requires the exercise of sufficient skill and judgment of the seismic crew and processors to satisfy the requirements of [copyrightability].

     

    These requirements were established in the case of CCH Canadian Limited vs The Law Society of Upper Canada (2004) in the Supreme Court of Canada. Quoting from that ruling:

     

    What is required to attract copyright protection in the expression of an idea is an exercise of skill and judgment. By skill, I mean the use of one’s knowledge, developed aptitude or practised ability in producing the work. By judgment, I mean the use of one’s capacity for discernment or ability to form an opinion or evaluation by comparing different possible options in producing the work.

     

    Interestingly:

     

    There exist no cases expressly deciding whether Seismic Data is copyrightable under the American Copyright Act [in the US].

     

    Fortunately, Justice Eidsvik added this remark to her ruling — just in case there was any doubt:

     

    I agree that the rocks at the bottom of the sea are not copyrightable.

    It's really worth reading through some of the ruling, especially sections 7 and 8, entitled Ideas and facts are not protected and Trivial and purely mechanical respectively. 

    Why are we arguing about this?

    This recent ruling about seismic data was the result of an action brought by Geophysical Service Incorporated against pretty much anyone they could accuse of infringing their rights in their offshore seismic data, by sharing it or copying it in some way. Specifically, the claim was that data they had been required to submit to regulators like the C-NLOPB and the C-NSOPB was improperly shared, undermining its business of shooting seismic data on spec.

    You may not have heard of GSI, but the company has a rich history as a technical and business innovator. The company was the precursor to Texas Instruments, a huge player in the early development of computing hardware — seismic processing was the 'big data' of its time. GSI still owns the largest offshore seismic dataset in Canada. Recently, however, the company seems to have focused entirely on litigation.

    The Calgary company brought more than 25 lawsuits in Alberta alone against corporations, petroleum boards, and others. There have been other actions in other jurisdictions. This ruling is just the latest one; here's the full list of defendants in this particular suit (there were only 25, but some were multiple entities):

    • Devon Canada Corporation
    • Statoil Canada Ltd.
    • Anadarko Petroleum Corporation
    • Anadarko US Offshore Corporation
    • NWest Energy Corp.
    • Shoal Point Energy Ltd.
    • Vulcan Minerals Inc.
    • Corridor Resources Inc.
    • CalWest Printing and Reproductions
    • Arcis Seismic Solutions Corp.
    • Exploration Geosciences (UK) Limited
    • Lynx Canada Information Systems Ltd.
    • Olympic Seismic Ltd.
    • Canadian Discovery Ltd.
    • Jebco Seismic UK Limited
    • Jebco Seismic (Canada) Company
    • Jebco Seismic, LP
    • Jebco/Sei Partnership LLC
    • Encana Corporation
    • ExxonMobil Canada Ltd.
    • Imperial Oil Limited
    • Plains Midstream Canada ULC
    • BP Canada Energy Group ULC
    • Total S.A.
    • Total E&P Canada Ltd.
    • Edison S.P.A.
    • Edison International S.P.A.
    • ConocoPhillips Canada Resources Corp.
    • Canadian Natural Resources Limited
    • MGM Energy Corp
    • Husky Oil Limited
    • Husky Oil Operations Limited
    • Nalcor Energy – Oil and Gas Inc.
    • Suncor Energy Inc.
    • Murphy Oil Company Ltd.
    • Devon ARL Corporation

    Why did people share the data?

    According to Section 101 (Disclosure of Information) of the Canada Petroleum Resources Act (1985), geophysical data should be released to regulators — and thus, effectively, the public — five years after acquisition:

     

    (2) Subject to this section, information or documentation is privileged if it is provided for the purposes of this Act [...]
    (2.1) Subject to this section, information or documentation that is privileged under subsection 2 shall not knowingly be disclosed without the consent in writing of the person who provided it, except for the purposes of the administration or enforcement of this Act [...]

    (7) Subsection 2 does not apply in respect of the following classes of information or documentation obtained as a result of carrying on a work or activity that is authorized under the Canada Oil and Gas Operations Act, namely, information or documentation in respect of

    (d) geological work or geophysical work performed on or in relation to any frontier lands,
        (i) in the case of a well site seabed survey [...], or
        (ii) in any other case, after the expiration of five years following the date of completion of the work;

     

    As far as I can tell, this does not necessarily happen, by the way. There seems to be a great deal of confusion in Canada about what 'seismic data' actually is — companies submit paper versions, sometimes with poor processing, or perhaps only every 10th line of a 3D. But the Canada Oil and Gas Geophysical Operations Regulations are quite clear. This is from the extensive and pretty explicit 'Final Report' requirements: 

     

    (j) a fully processed, migrated seismic section for each seismic line recorded and, in the case of a 3-D survey, each line generated from the 3-D data set;

     

    The intent is quite clear: the regulators are entitled to the stacked, migrated data. The full list is worth reading; it covers a large amount of data. If this is enforced, it is not enforced very rigorously. And if these datasets ever make it into the hands of the regulators (I doubt all of it ever does), then it's still subject to the haphazard data management practices that this industry has ubiquitously adopted.

    GSI argued that 'disclosure', as set out in Section 101 of the Act, does not imply the right to copy, but the court was unmoved:

     

    Nonetheless, I agree with the Defendants that [Section 101] read in its entirety does not make sense unless it is interpreted to mean that permission to disclose without consent after the expiry of the 5 year period [...] must include the ability to copy the information. In effect, permission to access and copy the information is part of the right to disclose.

     

    So this is the heart of the matter: the seismic data was owned and copyrighted by GSI, but the regulations specify that seismic data must be submitted to regulators, and that they can disclose that data to others. There's obvious conflict between these ideas, so which one prevails? 

    The decision

    There is a principle in law called Generalia Specialibus Non Derogant. Quoting from another case involving GSI:

     

    Where two provisions are in conflict and one of them deals specifically with the matter in question while the other is of more general application, the conflict may be avoided by applying the specific provision to the exclusion of the more general one. The specific prevails over the more general: it does not matter which was enacted first.

     

    Quoting again from the recent ruling in GSI vs Encana et al.:

     

    Parliament was aware of the commercial value of seismic data and attempted to take this into consideration in its legislative drafting. The considerations balanced in this regard are the same as those found in the Copyright Act, i.e. the rights of the creator versus the rights of the public to access data. To the extent that GSI feels that this policy is misplaced, its rights are political ones – it is not for this Court to change the intent of Parliament, unfair as it may be to GSI’s interests.

     

    Finally:

     

    [...the Regulatory Regime] is a complete answer to the suggestion that the Boards acted unlawfully in disclosing the information and documentation to the public. The Regulatory Regime is also a complete answer to whether the copying companies and organizations were entitled to receive and copy the information and documentation for customers. For the oil companies, it establishes that there is nothing unlawful about accessing or copying the information from the Boards [...]

     

    So that's it: the data was copyrighted, but the regulations effectively override the copyright. The regulations were legal, and — while GSI might find the result unfair — it must operate under them. 

    The decision must be another step towards the end of this ugly matter. Maybe it's the end. I'm sure those (non-lawyers) involved can't wait for it to be over. I hope GSI finds a way back to its technical core and becomes a great company again. And I hope the regulators find ways to better live up to the fundamental idea behind releasing data in the first place: that the availability of the data to the public should promote better science and better decisions for Canada's offshore. As things stand today, the whole issue of 'public subsurface data' in Canada is, frankly, a mess.