Openness is a two-way street

Last week the Data Analysis Study Group of the SPE Gulf Coast Section announced a new machine learning contest (I’m afraid registration is now closed, even though the contest has not started yet). The task is to predict shear-wave sonic from other logs, similar to the SPWLA PDDA contest last year. This is a valuable problem in the subsurface, because shear sonic log is essential for computing elastic properties of rocks and therefore in predicting rock and fluid properties or processing seismic. Indeed, TGS have built a business on predicted logs with their ARLAS product. There’s money in log prediction!

The task looks great, but there’s one big problem: the dataset is not open.

Why is this a problem?

Before answering that, let’s look at some context.

What’s a machine learning contest?

Good question. Typically, an organization releases a dataset (financial timeseries, Netflix viewer data, medical images, or whatever). They invite people to predict some valuable property (when to sell, which show to recommend, how to treat the illness, or whatever). And they pick the best, measured against known labels on a hidden dataset.

Kaggle is one of the largest platforms hosting such challenges, and they often attract thousands of participants — competing for large prizes. TGS ran a seismic salt-picking contest on the platform, attracting almost 74,000 submissions from 3220 teams with a $100k prize purse. Other contests are more grass-roots, like the one I ran with Brendon Hall in 2016 on lithology prediction, and like this SPE contest. It’s being run by a team of enthusiasts without a lot of resources from SPE, and the prize purse is only $1000 — representing about 3 hours of the fully loaded G&A of an oil industry professional.

What has this got to do with reproducibility?

Contests that award a large prize in return for solving a hard problem are essentially just a kind of RFP-combined-with-consulting-job. It’s brutally inefficient: hundreds or even thousands of people spend hours on the problem for free, and a handful are financially rewarded. These contests attract a lot of attention, but I’m not that interested in them.

Community-oriented events like this SPE contest — and the recent FORCE one that Xeek hosted — are more interesting and I believe they are more impactful. They have lots of great outcomes:

  • Lots of people have fun working on a hard problem and connecting with each other.

  • Solutions are often shared after, or even during, the contest, so that everyone learns and grows their toolbox.

  • A new open dataset that might even become a much-needed benchmark for the task in hand.

  • Researchers can publish what they did, or do later. (The SEG ML contest tutorial and results article have 136 citations between them, largely from people revisiting the dataset to show new solutions.)

A lot of new open-source machine learning code is always exciting, but if the data is not open then the work is by definition not reproducible. It seems especially unfair — cheeky, even — to ask participants to open-source their code, but to keep the data proprietary. For sure TGS is interested in how these free solutions compare to their own product.

Well, life’s not fair. Why is this a problem?

The data is being shared with the contest participants on the condition that they may not share it. In other words it’s proprietary. That means:

  • Participants are encumbered with the liability of a proprietary dataset. Sure, TGS is sharing this data in good faith today, but who knows how future TGS lawyers will see it after someone accidentally commits it to their GitHub repo? TGS is a billion-dollar company, they will win a legal argument with you. (Having said that, there’s no NDA or anything, just a checkbox in a form. I don’t know how binding it really is… but I don’t want to be the one that finds out.)

  • Participants can’t publish reproducible papers on their own work. They can publish classic oil-indsutry, non-reproducible work — I did this thing but no-one can check it because I can’t give you the data — but do we really need more of that? (In the contest introductory Zoom, someone asked about publishing plots of the data. The answer: “It should be fine.” Are we really still this naive about data?)

If anyone from TGS is reading this and thinking, “Come on, we’re not going to sue anyone — we’re not GSI! — it’s fine :)” then my response is: Wonderful! In that case, why not just formalize everything by releasing the data under an open licence — preferably Creative Commons Attribution 4.0? (Unmodified! Don’t make the licensing mistakes that Equinor and NAM have made recently.) That way, everyone knows their rights, everyone can safely download the data, and the community can advance. And TGS looks pretty great for contributing an awesome dataset to the subsurface machine learning community.

I hope TGS decides to release the data with an open licence. If they don’t, it feels like a rather one-sided deal to me. And with the arrangement as it stands, there’s no way I would enter this contest.

x lines of Python: load curves from LAS

Welcome to the latest x lines of Python post, in which we have a crack at some fundamental subsurface workflows... in as few lines of code as possible. Ideally, x < 10.

We've met curves once before in the series — in the machine learning edition, in which we cheated by loading the data from a CSV file. Today, we're going to get it from an LAS file — the popular standard for wireline log data.

Just as we previously used the pandas library to load CSVs, we're going to save ourselves a lot of bother by using an existing library — lasio by Kent Inverarity. Indeed, we'll go even further by also using Agile's library welly, which uses lasio behind the scenes.

The actual data loading is only 1 line of Python, so we have plenty of extra lines to try something more ambitious. Here's what I go over in the Jupyter notebook that goes with this post:

  1. Load an LAS file with lasio.
  2. Look at its header.
  3. Look at its curve data.
  4. Inspect the curves as a pandas DataFrame.
  5. Load the LAS file with welly.
  6. Look at welly's Curve objects.
  7. Plot part of a curve.
  8. Smooth a curve.
  9. Export a set of curves as a matrix.
  10. BONUS: fix some broken things in the file header.

Each one of those steps is a single line of Python. Together, I think they cover many of the things we'd like to do with well data once we get our hands on it. Have a play with the notebook and explore what you can do.

Next time we'll take things a step further and dive into some seismic petrophysics.

The Rock Property Catalog again

Do you like data? Data about rocks? Open, accessible data that you can use for any purpose without asking? Read on.

After writing about anisotropy back in February, and then experimenting with storing rock properties in SubSurfWiki later that month, a few things happened:

  • The server I run the wiki on — legacy Amazon AWS infrastructure — crashed, and my backup strategy turned out to be <cough> flawed. It's now running on state-of-the-art Amazon servers. So my earlier efforts were mostly wiped out... Leaving the road clear for a new experiment!
  • I came across an amazing resource called Mudrock Anisotropy, or — more appealingly — Mr Anisotropy. Compiled by Steve Horne, it contains over 1000 records of rocks, gathered from the literature. It is also public domain and carries only a disclaimer. But it's a spreadsheet, and emailing a spreadsheet around is not sustainable.
  • The Common Ground database that was built by John A. Scales, Hans Ecke and Mike Batzle at Colorado School of Mines in the late 1990s, is now defunct and has been officially discontinued, as of about two weeks ago. It contains over 4000 records, and is public domain. The trouble is, you have to restore a SQLite database to use it.

All this was pointing towards a new experiment. I give you: the Rock Property Catalog again! This time it contains not 66 rocks, but 5095 rocks. Most of them have \(V_\mathrm{P}\), \(V_\mathrm{S}\) and  \(\rho\). Many of them have Thomsen's parameters too. Most have a lithology, and they all have a reference. Looking for Cretaceous shales in North America to use as analogs on your crossplots? There's a rock for that.

As before, you can query the catalog in various ways, either via the wiki or via the web API. Let's say we want to find shales with a velocity over 5000 m/s. You have a few options:

  1. Go to the semantic search form on the wiki and type [[lithology::shale]][[vp::>5000]]
  2. Make a so-called inline query on your own wiki page (you need an account for this).
  3. Make a query via the web API with a rather long URL: http://www.subsurfwiki.org/api.php?action=ask&query=[[RPC:%2B]][[lithology::shale]][[Vp::>5000]]|%3FVp|%3FVs|%3FRho&format=jsonfm

I updated the Jupyter Notebook I published last time with a new query. It's pretty hacky. I'll work on this to produce a more robust method, with some error handling and cleaner code — stay tuned.

The database supports lots of properties, including:

  • Citation and reference
  • Description, lithology, colour (you can have pictures if you want!)
  • Location, lat/lon, basin, age, depth
  • Vp, Vs, \(\rho\), as well as \(\rho_\mathrm{dry}\) and \(\rho_\mathrm{grain}\)
  • Thomsen's \(\epsilon\), \(\delta\), and \(\gamma\)
  • Static and dynamic Young's modulus and Poisson ratio
  • Confining pressure, pore pressure, effective stress, axial stress
  • Frequency
  • Fluid, saturation type, saturation
  • Porosity, permeability, temperature
  • Composition

There is more from the Common Ground data to add, especially photographs. But for now, I'd love some feedback: is this the right set of properties? Do we need more? I want this to be useful — what kind of data and metadata would you like to see? 

I'll end with the usual appeal — I'm open to any kind of suggestions or help with this. Perhaps you can contribute new rocks, or a paper containing data? Or maybe you have some wiki skills, or can help write bots to improve the data? What can you bring? 

Submitting assumptions for meaningful answers

The best talk of the conference was Ran Bachrach's on seismics for unconventionals. He enthusiastically described the physics to his spectators with conviction and duty, and explained why they should care. Isotropic, VTI, and orthorhombic media anisotropy models are used not because they are right, but because they are simple. If the assumptions you bring to the problem are reasonable, the answers can be considered meaningful. If you haven't considered and tested your assumptions, you haven't subscribed to reason. In a sense, you haven't held up your end of the bargain, and there will never be agreement. This talk should be mandatory viewing for anyone working seismic for unconventionals. Advocacy for reason. Too bad it wasn't recorded.

I am both privileged and obliged to celebrate such nuggets of awesomeness. That's a big reason why I blog. And on the contrary, we should call out crappy talks when we see them to raise the bar. Indeed, to quote Zen Faulkes, "...we should start creating more of an expectation that scientific talks will be reviewed and critiqued. And names will be named."

The talk from HEF Petrophysical entitled, Towards modelling three-dimensional oil sands permeability distribution using borehole image logs, drew me in. I was curious enough to show up. But as the talk unfolded, my curiosity was left unsatisfied. A potentially interesting workflow of transforming high-resolution resistivity measurements into flow permeability was obfuscated with a pointless upscaling step. The meat of anything like this is in the transform itself, but it was missing. It's also the most trivial bit; just cross-plot one property with another and show people. So I am guessing they didn't have any permeability data. If that was the case, how can you stand up and talk about permeability? It was a sandwich without the filling. The essential thing that defines a piece of work is the creativity. The thing you add that wasn't there before. I was disappointed. Disappointed that it was accepted, and that no one else piped up. 

I will paraphrase a conversation I had with Ran at the coffee break: Some are not aware, some choose to ignore, and some forget that works of geoscience are problems of extreme complexity. In fact, the only way we can cope with complexity is to make certain assumptions that make our problem solvable. If all you do is say "here is my solution", you suck. But if instead you ask, "Have I convinced you that my assumptions are reasonable?", it entirely changes the conversation. It entirely changes the specialist's role. Only when you understand your assumptions can we talk about whether the results are reasonable. 

Have you ever felt conflicted on whether or not you should say something?

Interpreting spectral gamma-ray logs

Before you can start interpreting spectral gamma-ray logs (or, indeed, any kind of data), you need to ask about quality.

Calibrate your tool...

The main issues affecting the quality of the logs are tool calibration and drilling mud composition. I think there's a tendency to assume that delivered logs have been rigorously quality checked, but... they haven't. The only safe assumption is that nobody cares about your logs as much as you. (There is a huge opportunity for service companies here — but in my experience they tend to be focused on speed and quantity, not quality.)

Calibration is critical. The measurement device in the tool consists of a thallium-laced NaI crystal and a photomultiplier. Both of these components are sensitive to temperature, so calibration is especially important when the temperature of the tool is changing often. If the surface temperature is very different from the downhole—winter in Canada—calibrate often.

Drilling mud containing KCl (to improve borehole stability) increases the apparent potassium content of the formation, while barite acts as a gamma-ray absorber and reduces the count rates, especially in the low energies (potassium).

One of the key quality control indicators is negative readings on the uranium log. A few negative values are normal, but many zero-crossings may indicate that the tool was improperly calibrated. It is imperative to quality control all of the logs, for bad readings and pick-up effects, before doing any quantitative work.

...and your interpretation

Most interpretations of spectral-gamma ray logs focus on the relationships between the three elemental concentrations. In particular, Th/K and Th/U are often used for petrophysical interpretation and log correlation. In calculating these ratios, Schlumberger uses the following cut-offs: if uranium < 0.5 then uranium = 0.5; if potassium < 0.004 then potassium = 0.001 (according to my reference manual for the natural gamma tool).

In general, high K values may be caused by the presence of potassium feldspars or micas. Glauconite usually produces a spike in the K log. High Th values may be associated with the presence of heavy minerals, particularly in channel deposits. Increased Th values may also be associated with an increased input of terrigenous clays. Increases in U are frequently associated with the presence of organic matter. For example, according to the ODP, particularly high U concentrations (> 5 ppm) and low Th/U ratios (< 2) often occur in black shale deposits.

The logs here, from Kansas Geological Survey open file 90-27 by Macfarlane et al. shows a quite overt interpretive approach, with the Th/K log labelled with minerals (feldspar, mica, illite–smectite) and the Th/U log in uranium 'fixedness', a proxy for organic matter.

Sounds useful. But really, you can probably find just a paper to support just about any interpretation you want to make. Which isn't to say that spectral gamma-ray is no use — it's just not diagnostic on its own. You need to calibrate it to your own basin and your own stratigraphy. This means careful, preferably quantitative, comparison of core and logs. 

Further reading 

What is spectral gamma-ray?

The spectral gamma-ray log is a measure of the natural radiation in rocks. The amplitude of the signal from the gamma-ray tool, which is just a sensor with no active source, is proportional to the energy of the gamma-ray photons it encounters. Being able to differentiate between photons of different energies turns out to be very handy Compared to the ordinary gamma-ray log, which ignores the energies and only counts the photons, it's like seeing in colour instead of black and white.

Why do we care about gamma radiation?

First, what are gamma rays? Highly energetic photons: electromagnetic radiation with very short wavelengths. 

Being able to see different energies, or 'colours', means we can differentiate between the radioactive decay of different elements. Elements decay by radiating energy, and the 'colour' of that energy is characteristic of that element (actually, of each isotope). So, we can tell by looking at the energy of a photon if we are seeing a potassium atom (40K) or a uranium atom (238U) decay. These are very different isotopes, with very different habits. We can do geology!

In fact, all sorts of radioisotopes occur naturally in the earth. By far the most abundant are potassium 40K, thorium 232Th and uranium 238U. Of these, potassium is the most abundant in sedimentary rocks, but thorium and uranium are present in small quantities, and have particular sedimentological implications.

What exactly are we measuring?

Potassium 40K decays to argon about 10% of the time, with γ-emission at 1.46 MeV (the other 90% of the time it decays to calcium). However, all of the decay in the 232Th and 238U decay series occurs by α- and β-particle decay, which don't always result in photon emission. The tool in fact measures γ-radiation from the decay of thallium 208Tl in the 232Th series (right), and from bismuth 214Bi in the 238U series. The spectral gamma-ray tool must be calibrated to known samples to give concentrations of 232Th and 238U from its readings. Proper calibration is vital, and is temperature-sensitive (of note in Canada!).

The concentrations of the three elements are estimated from the spectral measure­ments. The concentration of potassium is usually measured in percent (%) or per mil (‰), or sometimes in kilograms per tonne, which is equivalent to per mil. The other two elements are measured in parts per million (ppm).

Here is the gamma-ray spectrum from a single sample from 509 m below the sea-floor at ODP Site 1201. The final spectrum (heavy black line) is shown after removing the background spectrum (gray region) and applying a three-point mean boxcar filter. The thin black line shows the raw spectrum. Vertical lines mark the interval boundaries defined by Peter Blum (an ODP scientist at Texas A&M). Prominent energy peaks relating to certain elements are identified at the top of the figure. The inset shows the spectrum for energies >1500 keV at an expanded scale. 

We wouldn't normally look at these spectra. Instead, the tool provides logs for K, Th, and U. Next time, I'll look at the logs.

Spectrum illustration by Wikipedia user Inductiveload, licensed GFDL; decay chain by Wikipedia user BatesIsBack, licensed CC-BY-SA.

Rocks, pores and fluids

At an SEG seismic rock physics conference in China several years ago, I clearly remember a catch phrase used by one of the presenters, "It's all about rocks, pores, and fluids." He used it several times throughout his talk as an invocation for geophysicists to translate their seismic measurements of the earth into terms that are more appealing to others. Nobody cares about the VP/VS ratio in a reservoir. Even though I found the repetition slightly off-putting, he succeeded—the phrase stuck. It's all about rock, pores, and fluids.

Fast forward to the SEG IQ Earth Forum a few months ago. The message reared its head again, but in a different form. After dinner one evening, I was speaking with Ran Bachrach about advances in seismic rock physics technology: the glamour and the promise of the state-of-the-art. It was a topic right up his alley, but suprisingly, he seemed ambivalent and under-enthused. Which was unusual for him. "More often than not," he said, "we can get all the information we need from the triple combo." 

What is the triple combo? 

I felt embarrased that I had never heard of the term. Like I had been missing something this whole time. The triple combo is the standard set of measurements used in formation evaluation and wireline logging: gamma-ray, porosity, and resistivity. Simply put, the triple combo tells us about rocks, pores, and fluids. 

I find it curious that the very things we are interested in are impossible to measure directly. For example:

  • A gamma-ray log measures naturally occuring radioactive minerals. We use this to make inferences about lithology.
  • A neutron log measures Compton scattering in proportion to the number of hydrogen atoms. This is a proxy for pores.
  • A resistivity log measures the conductivity of electrical current. We use this to tell us about fluid type and saturation.

Subsurface geotechnology isn't only about recording the earth's constituents in isolation. Some measurements, the sonic log for instance, are useful because of the fact that they are an aggregate of all three.

The well log is a section of the Thebaud_E-74 well available from the offshore Nova Scotia Play Fairway Analysis.

Cope don't fix

Some things genuinely are broken. International financial practices. Intellectual property law. Most well tie software. 

But some things are the way they are because that's how people like them. People don't like sharing files, so they stash their own. Result: shared-drive cancer — no, it's not just your shared drive that looks that way. The internet is similarly wild, chaotic, and wonderful — but no-one uses Yahoo! Directory to find stuff. When chaos is inevitable, the only way to cope is fast, effective search

So how shall we deal with the chaos of well log names? There are tens of thousands — someone at Schlumberger told me last week that they alone have over 50,000 curve and tool names. But these names weren't dreamt up to confound the geologist and petrophysicist — they reflect decades of tool development and innovation. There is meaning in the morasse.

Standards are doomed

Twelve years ago POSC had a go at organizing everything. I don't know for sure what became of the effort, but I think it died. Most attempts at standardization are doomed. Standards are awash with compromise, so they aren't perfect for anything. And they can't keep up with changes in technology, because they take years to change. Doomed.

Instead of trying to fix the chaos, cope with it.

A search tool for log names

We need a search tool for log names. Here are some features it should have:

  • It should be free, easy to use, and fast
  • It should contain every log and every tool from every formation evaluation company
  • It should provide human- and machine-readable output to make it more versatile
  • You should get a result for every search, never drawing a blank
  • Results should include lots of information about the curve or tool, and links to more details
  • Users should be able to flag or even fix problems, errors, and missing entries in the database

To my knowledge, there are only two tools a little like this: Schlumberger's Curve Mnemonic Dictionary, and the SPWLA's Mnemonics Data Search. Schlumberger's widget only includes their tools, naturally. The SPWLA database does at least include curves from Baker Hughes and Halliburton, but it's at least 10 years out of date. Both fail if the search term is not found. And they don't provide machine-readable output, only HTML tables, so it's difficult to build a service on them.

Introducing fuzzyLAS

We don't know how to solve this problem, but we're making a start. We have compiled a database containing 31,000 curve names, and a simple interface and web API for fuzzily searching it. Our tool is called fuzzyLAS. If you'd like to try it out, please get in touch. We'd especially like to hear from you if you often struggle with rogue curve mnemonics. Help us build something useful for our community.

The digital well scorecard

In my last post, I ranted about the soup of acronyms that refer to well log curves; a too-frequent book-keeping debacle. This pain, along with others before it, has motivated me to design a solution. At this point all I have is this sketch, a wireframe of should-be software that allows you visualize every bit of borehole data you can think of:

The goal is, show me where the data is in the domain of the wellbore. I don't want to see the data explicitly (yet), just its whereabouts in relation to all other data. Data from many disaggregated files, reports, and so on. It is part inventory, part book-keeping, part content management system. Clear the fog before the real work can begin. Because not even experienced folks can see clearly in a fog.

The scorecard doesn't yield a number or a grade point like a multiple choice test. Instead, you build up a quantitative display of your data extents. With the example shown above, I don't even have to look at the well log to tell you that you are in for a challenging well tie, with the absence of sonic measurements in the top half of the well. 

The people that I showed this to immediately undestood what was being expressed. They got it right away, so that bodes well for my preliminary sketch. Can you imagine using a tool like this, and if so, what features would you need? 

Swimming in acronym soup

In a few rare instances, an abbreviation can become so well-known that it is adopted into everyday language; more familar than the words it used to stand for. It's embarrasing, but I needed to actually look up LASER, and you might feel the same way with SONAR. These acronyms are the exception. Most are obscure barriers to entry in technical conversations. They can be constructs for wielding authority and exclusivity. Welcome to the club, if you know the password.

No domain of subsurface technology is riddled with more acronyms than well log analysis and formation evaluation. This is a big part of — perhaps too much of a part of — why petrophysics is hard. Last week, I came across a well. It has an extended suite of logs, and I wanted make a synthetic. Have a glance at the image and see which curve names you recognize (the size represents the frequency the names are encountered across many files of the same well).

I felt like I was being spoken to by some earlier deliquent: I got yer well logs right here buddy. Have fun sorting this mess out.

The log ASCII standard (*.LAS file) file format goes a long way to exposing descriptive information in the header. But this information is often incomplete, missing, and says nothing about the quality or completeness of the data. I had to scan 5 files to compile this soup. A micro-travesty and a failure, in my opinion. How does one turn this into meaningful information for geoscience?

Whose job is it to sort this out? The service company that collected the data? The operator that paid for it? A third party down the road?

What I need is not only an acronym look-up table, but also a data range tool to show me what I've got in the file (or files), and at which locations and depths I've got it. A database to give me more information about these acronyms would be nice too, and a feature that allows me to compare multiple files, wells, and directories at once. It would be like a life preserver. Maybe we should build it.

I made the word cloud by pasting text into wordle.net. I extracted the text from the data files using the wonderful LASReader written by Warren Weckesser. Yay, open source!