An attribute analysis primer

A question on Stack Exchange the other day reminded me of the black magic feeling I used to have about attribute analysis. It was all very meta: statistics of combinations of attributes, with shifted windows and crazy colourbars. I realized I haven't written much about the subject, despite the fact that many of us spend a lot of time trying to make sense of attributes.

Time slices, horizon slices, and windows

One of the first questions a new attribute-analyser has is, "Where should the window be?" Like most things in geoscience: it depends. There are lots of ways of doing it, so think about what you're after...

  • Timeslice. Often the most basic top-down view is a timeslice, because it is so easy to make. This is often where attribute analysis begins but, since timeslices cut across stratigraphy, it's not usually where it ends.
  • Horizon. If you're interested in the properties of a strong reflector, such as a hard, karsted unconformity, maybe you just want the instantaneous attribute from the horizon itself.
  • Zone. If the horizon was hard to interpret, or is known to be a gradual facies transition, you may want to gather statistics from a zone around it. Or perhaps you couldn't interpret the thing you really wanted, but only that nice strong reflection right above it... maybe you can bootstrap yourself from there. 
  • Interval. If you're interested in a stratigraphic interval, you can bookend it with existing horizons, perhaps with a constant shift on one or both of them.
  • Proportional. If seismic geomorphology is your game, then you might get the most reasonable inter-horizon slices from proportional slicing between stratigraphic surfaces. Most volume interpretation software supports this.

There are some caveats to simply choosing the stratigraphic interval you are after. Beware of choosing an interval that strong reflectors come into and out of. They may have an unduly large effect on most statistics, and could look 'geological'. And if you're after spectral attributes, do remember that the Fourier transform needs time! The only way to get good frequency resolution is to provide long windows: a 100 ms window gives you frequency information every 10 Hz.
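
For a quick sanity check on that last point, here's a minimal NumPy sketch (assuming a 4 ms sample interval) showing how the frequency bin spacing depends on window length:

```python
import numpy as np

dt = 0.004                    # assumed sample interval: 4 ms
n = int(0.100 / dt)           # samples in a 100 ms window

# Frequency bins of the Fourier transform of that window.
print(np.fft.rfftfreq(n, d=dt)[:4])        # [ 0. 10. 20. 30.] -> 10 Hz spacing

# Doubling the window to 200 ms halves the spacing to 5 Hz.
print(np.fft.rfftfreq(2 * n, d=dt)[:4])    # [ 0.  5. 10. 15.]
```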

Extraction depends on sample interpolation

When you extract an attribute, say amplitude, from a trace, it's easy to forget that the software has to do some approximation to give you an answer. This is because seismic traces are not continuous curves but discrete series, with samples typically every 1, 2, or 4 milliseconds. Asking for the amplitude at some arbitrary time, like the point at which a horizon crosses a trace, means the software has to interpolate between samples somehow. Different packages do this in different ways (linear, spline, polynomial, etc.), and the methods give quite different results in some parts of the trace. Here are some samples interpolated with a spline (black curve) and linearly (blue). The nearest sample gives the 'no interpolation' result.
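
Here's a small sketch of the comparison, with made-up samples and an arbitrary pick time (not the data in the figure):

```python
import numpy as np
from scipy.interpolate import CubicSpline

t = np.arange(6) * 0.004                          # sample times, 4 ms apart
amp = np.array([0.1, 0.6, 0.9, 0.2, -0.5, -0.8])  # made-up amplitudes

t_pick = 0.0063                                   # a horizon pick between samples

nearest = amp[np.argmin(np.abs(t - t_pick))]      # 'no interpolation'
linear = np.interp(t_pick, t, amp)                # linear interpolation
spline = float(CubicSpline(t, amp)(t_pick))       # cubic spline

print(nearest, linear, spline)                    # three different 'amplitudes'
```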

As well as deciding how to handle non-sampled parts of the trace, we have to decide how to represent attributes computed over many samples. What options are available, and how do we choose? Do we take the average? The maximum? Something else? In a future post, we'll give some guidance on using statistics to extract information from the entire window.
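
To make the question concrete, here's a sketch of a few common windowed statistics, with a random series standing in for a trace:

```python
import numpy as np

rng = np.random.default_rng(42)
trace = rng.normal(size=500)        # stand-in for a trace, 2 ms sampling
dt = 0.002

top, base = 0.350, 0.450            # window bounds in seconds
w = trace[int(top / dt):int(base / dt)]

print('mean:   ', w.mean())
print('max abs:', np.abs(w).max())
print('rms:    ', np.sqrt(np.mean(w**2)))
```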

There's a lot more to come!

As I wrote this post, I realized that this is a massive subject. Here are some aspects I have not covered today:

  • Calibration is a gaping void in many published workflows. How can we move past "that red blob looks like a point bar so I drew a line around it in PowerPoint" to "there's a 70% chance of finding reservoir quality sand at that location"?
  • This article was about single-trace attributes at single instants or over static windows. Multi-trace and volume attributes, like semblance, curvature, and spectral decomposition, need a post of their own.
  • There are a million attributes (though only a few that count, just ask Art Barnes) so choosing which ones to use can be a challenge. Criteria range from what software licenses you have to what is physically reasonable.
  • Because there are a million attributes, the art of combining attributes with statistical methods like principal component analysis or multi-linear regression needs a look. This gets into seismic inversion.

We'll return to these ideas over the next few weeks. If you have specific questions or workflows to share, please leave a comment below, or get in touch by email or Twitter.

To view and run the code that I used in creating the figures for this post, grab the IPython/Jupyter Notebook.

Corendering more attributes

My recent post on multi-attribute data visualization painted two seismic attributes on a timeslice. Let's look now at corendering attributes extracted on a seismic horizon. I'll reproduce the example Matt gave in his post on colouring maps.

Although colour choices come down to personal preference, there are some points to keep in mind:

  • Data that varies relatively gradually across the canvas — e.g. elevation here — should use a colour scale that varies monotonically in hue and luminance, e.g. CubeHelix or Matteo Niccoli's colourmaps.
  • Data that varies relatively quickly across the canvas — e.g. my similarity data (a member of the family that includes coherence, semblance, and so on) — should use a monochromatic colour scale, e.g. black–white.
  • If we've chosen our colourmaps wisely, there should be some unused hues left for rendering additional attributes. In this case, there are no red hues in the elevation colourmap, so we can map redness to instantaneous amplitude.

Adding a light source

Without wanting to get too gimmicky, we can sometimes enliven the appearance of an attribute, accentuating its texture, by simulating a bumpy surface and shining a virtual light onto it. This isn't the same as casting a light source on the composite display. We can make our light source act on only one of our attributes and leave the others unchanged. 

Similarity attribute displayed using a greyscale colourbar (left). Bump mapping of the similarity attribute using a light source positioned at azimuth 350 degrees, inclination 20 degrees (right).

The technique is called hill-shading. The terrain doesn't have to be a physical surface; it can be a slice. And unlike physical bumps, we're not actually making a new surface with relief, we are merely modifying the surface's luminance from an artificial light source. The result is a more pronounced texture.
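
If you want to play with this yourself, matplotlib's LightSource does the job; here's a minimal sketch, with a random array standing in for the similarity slice:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LightSource

# Stand-in for a similarity (coherence-like) slice; use your own 2D array here.
rng = np.random.default_rng(0)
similarity = rng.random((200, 300))

# Light source at azimuth 350°, inclination 20°, as in the figure above.
ls = LightSource(azdeg=350, altdeg=20)

# Treat the attribute as a bumpy surface and shade it with a greyscale colourmap.
shaded = ls.shade(similarity, cmap=plt.cm.gray, blend_mode='overlay', vert_exag=2)

plt.imshow(shaded)
plt.axis('off')
plt.show()
```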

One view, two dimensions, three attributes

Constructing this display takes a bit of trial and error. It wasn't immediately clear where to position the light source to get the most pronounced view. Furthermore, the amplitude extraction looked quite noisy, so I softened it a little using a Gaussian filter. Plus, I wanted to show only the brightest of the bright spots, so it all took a bit of fiddling.

Even though 3D data visualization is relatively common, my assertion is that it is much harder to get 3D visualization right than 2D. Looking at the three colour bars I've placed in the legend, I'm reminded of this difficulty of adding a third dimension: it's much harder to produce a colour cube for the legend than a series of colour bars. Maybe the best we can achieve is a colour square like last time, with a colour bar for the overlay on the side.

Check out the IPython notebook for the code used to create these figures.

Corendering attributes and 2D colourmaps

The reason we use colourmaps is to facilitate the human eye in interpreting the morphology of the data. There are no hard and fast rules when it comes to choosing a good colourmap, but a poorly chosen colourmap can make you see features in your data that don't actually exist. 

Colourmaps are typically implemented in visualization software as 1D lookup tables. Given a value, what colour should I plot it? But most spatial data is multi-dimensional, and it's useful to look at more than one aspect of the data at one time. Previously, Matt asked, "how many attributes can a seismic interpreter show with colour on a single display?" He did this by stacking up a series of semi-opaque layers, each one assigned its own 1D colourbar. 

Another way to add more dimensions to the display is corendering. This effectively adds another dimension to the colourmap itself: instead of a 1D colour line for a single attribute, for two attributes we're defining a colour square; for 3 attributes, a colour cube, and so on.

Let's illustrate this by looking at a time-slice through a portion of the F3 seismic volume. A simple way of displaying two attributes is to decrease the opacity of one, and lay it on top of the other. In the figure below, I'm setting the opacity of the continuity to 75% in the third panel. At first glance, this looks pretty good; you can see both attributes, and because they have different hues, they complement each other without competing for visual bandwidth. But the approach is flawed. The vividness of each dataset is diminished; we don't see the same range of colours as we do in the colour palette shown above.

Overlaying one map on top of the other is one way to look at multiple attributes within a scene. It's not ideal however.

Instead of overlaying maps, we can improve the result by modulating the lightness of the amplitude image according to the magnitude of the continuity attribute. This time the corendered result is one image, instead of two. I prefer it, because it preserves the original colours we see in the amplitude image. If anything, it seems to deepen the contrast:

The lightness value of the seismic amplitude time slice has been modulated by the continuity attribute.

Such a composite display needs a two-dimensional colourmap for a legend. Just like a 1D colourbar, it's a lookup table: each position in the scene corresponds to a unique pair of values in the colourmap plane.
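
Here's a sketch of the modulation idea in matplotlib, using the HSV value channel as a stand-in for lightness, and random arrays in place of the amplitude and continuity data:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import rgb_to_hsv, hsv_to_rgb

def norm(a):
    """Scale an array to the range [0, 1]."""
    return (a - a.min()) / (a.max() - a.min())

rng = np.random.default_rng(1)
amplitude = rng.normal(size=(200, 300))    # stand-ins for the two attributes
continuity = rng.random((200, 300))

# Colour the amplitude, then scale its 'value' channel by the continuity.
rgb = plt.cm.RdBu(norm(amplitude))[..., :3]
hsv = rgb_to_hsv(rgb)
hsv[..., 2] *= norm(continuity)

plt.imshow(hsv_to_rgb(hsv))
plt.axis('off')
plt.show()
```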

We can go one step further. Say we want to emphasize only the largest discontinuities in the data. We can modulate the opacity with a non-linear function. In this example, I'm using a sigmoid function:
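
The exact curve isn't important, but the opacity function looks something like this (centre and sharpness are knobs I've made up for illustration):

```python
import numpy as np

def sigmoid_opacity(x, centre=0.7, sharpness=15):
    """Map a normalized attribute to opacity, keeping only the high end visible."""
    return 1 / (1 + np.exp(-sharpness * (x - centre)))

discontinuity = np.linspace(0, 1, 6)
print(sigmoid_opacity(discontinuity).round(2))
# Small discontinuities become nearly transparent; the largest stay opaque.
```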

In order to achieve this effect in most conventional software, you usually have to copy the attribute, colour it black, apply an opacity curve, then position it just above the base amplitude layer. Software companies call this workaround a 'workflow'. 

Are there data visualizations you want to create, but you're stuck with software limitations? In a future post, I'll recreate some cool corendering effects, like bump mapping and hill-shading.

To view and run the code that I used in creating the images for this post, grab the IPython/Jupyter Notebook.


You can do it too!

If you're in Calgary, Houston, New Orleans, or Stavanger, listen up!

If you'd like to gear up on coding skills and explore the benefits of scientific computing, we're going to be running the 2-day version of the Geocomputing Course several times this fall in select cities. To buy tickets or for more information about our courses, check out the courses page.

None of these times or locations good for you? Consider rounding up your colleagues for an in-house training option. We'll come to your turf, we can spend more than 2 days, and customize the content to suit your team's needs. Get in touch.

The curse of hunting rare things

What are the chances of intersecting features with a grid of cross-sections? I often wonder about this when interpreting 2D seismic data, but I think it also applies to outcrops, or any other transects. I want to know:

  1. If there are only a few of these features, how many should I see?
  2. What's the probability of the lines missing them all? 
  3. Conversely, if I interpret x of them, then how many are there really?
  4. How is the detectability affected by the reliability of the data or my skills?

I used to have a spreadsheet for computing all this stuff, but spreadsheets are dead to me so here's an IPython Notebook :)

An example

I'm interpreting seep locations on 2D data at the moment. So I'm looking for subvertical pipes and chimneys, mud volcanoes, seafloor pockmarks and pingos, that sort of thing (see Løseth et al., 2009 for a great overview). Here are some similar features on the Norwegian continental shelf, from Hustoft et al. (2010):

Figure 3 from Hustoft et al. (2010), showing the 3D expression of some hydrocarbon leakage features in Norway. © The Authors.

As Hustoft et al. show, these can be rather small features — most pockmarks are in the 100–800 m diameter range, so let's call it 500 m. The dataset I have is an orthogonal grid of decent quality 2D lines with a 3 km spacing. The area is about 120,000 km². For the sake of argument (and a forward model), let's imagine there are 120 features I'm interested in — one per 1000 km². Here's a zoomed-in view showing a subset of the problem:

Zoomed-in view of part of my example. A grid of 2D seismic lines, 3 km apart, and randomly distributed features, each 500 m in diameter. If a feature's centre falls inside a grey square, then the feature is not intersected by the data. The grey squares are 2.5 km across.
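
Before getting to the numbers, here's a sketch of the geometry (the notebook linked above has the full version):

```python
spacing = 3.0       # line spacing, km
diameter = 0.5      # feature diameter, km
n_features = 120

# A feature is missed if its centre falls inside one of the grey squares.
p_miss = ((spacing - diameter) / spacing)**2    # ~0.69
p_hit = 1 - p_miss                              # ~0.31

print(round(n_features * p_hit))    # ~37 features intersected
print(p_miss**n_features)           # ~1e-19: essentially certain to hit at least one
print(p_miss**5)                    # ~0.16: with only 5 features, missing all is plausible
```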

According to my calculations...

  1. Of the 120 features in the area, we expect 37 to be intersected by the data. Of course, some of those intersections might be very subtle, if they are right at the edge of the feature.
  2. The probability of intersecting a given feature is 0.31. There are 120 features, so the probability of the whole dataset intersecting at least one is essentially 1 (certain). That's good! Conversely, the probability of missing them all is effectively 0. (If there were only 5 features, then there'd be about a 16% chance of missing them all.)
  3. Clearly, if I interpret 37 features, there are about 120 in total (that was my a priori). It's a linear relationship, so if I interpret 10 features, I can expect there to be about 33 altogether, and if I see 100 then I can expect that there are almost 330 in total. (I think the probability distribution would be log-normal, but would appreciate others' insights here.)
  4. Reliability? That sounds like a job for Bayes' theorem...

It's far from certain that I will interpret everything the data intersects, for all sorts of reasons:

  • I am human and therefore inconsistent, biased, and fallible.
  • The feature may be cryptic in the section, because of how it was intersected.
  • The data may be poor quality at that point, or everywhere.

Let's assume that if a feature has been intersected by the data, then I have a 75% chance of actually interpreting it. Bayes' theorem tells us how to update the prior probability of 0.31 (for a given feature; point 2 above) to get a posterior probability. Here's the table:

                                    Interpreted    Not interpreted
  Intersected by a 2D line               28                9
  Not intersected by any lines           21               63
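
Here's roughly how that table comes together from the prior and the assumed reliability (the numbers differ from the table only by rounding):

```python
p_hit = 0.31          # prior: probability a given feature is intersected (point 2)
reliability = 0.75    # chance I interpret a feature, given the data intersects it
n_features = 120

intersected = p_hit * n_features                 # ~37
missed = n_features - intersected                # ~83

interpreted_real = reliability * intersected          # ~28: true positives
overlooked = (1 - reliability) * intersected          # ~9:  false negatives
interpreted_unreal = (1 - reliability) * missed       # ~21: false positives
left_alone = reliability * missed                     # ~62: true negatives

precision = interpreted_real / (interpreted_real + interpreted_unreal)
print(round(precision, 2))    # ~0.57: only 57% of my interpretations are real
```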

What do the numbers mean?

  • Of the 37 intersected features, I interpret 28.
  • I fail to interpret 9 features that are intersected by the data. These are Type II errors, false negatives.
  • I interpret another 21 features which are not real! These are Type I errors: false positives. 
  • Therefore I interpret 49 features, of which only 57% are real. This seems like a lot, but it's a function of my imperfect reliability (75%) and the poor sampling, resulting in a large number of 'missed' features.

Interestingly, my 75% reliability translates into a 57% chance of being right about the existence of a feature. We've seen this effect before — it's the curse of hunting rare things: with imperfect knowledge, we are often wrong.


References

Hustoft, S, S Bünz, and J Mienert (2010). Three-dimensional seismic analysis of the morphology and spatial distribution of chimneys beneath the Nyegga pockmark field, offshore mid-Norway. Basin Research 22, 465–480. DOI 10.1111/j.1365-2117.2010.00486.x

Løseth, H, M Gading, and L Wensaas (2009). Hydrocarbon leakage interpreted on seismic data. Marine & Petroleum Geology 26, 1304–1319. DOI 10.1016/j.marpetgeo.2008.09.008 

Test-driven development geoscience

Sometimes I wonder how much of what we do in applied geoscience is really science. Is it really about objective enquiry? Do we form hypotheses, then test them? The scientific method is largely a caricature — science is more accidental and more fun than a step-by-step recipe — but I think our field sometimes falls short of even basic rigour. Go and sit through a conference session on seismic attribute analysis some time and you'll see what I mean. Let's just say there's a lot of arm-waving and shape-ology. 

Learning from geeks

We've written before about the virtues of the software engineering community. Innovation has been so rapid recently that I think it's a great place to find interpretation hacks like pair picking. Learning about and experiencing the amazing productivity of programmers is one of the reasons I think all scientists should learn to program (but not learn to be a programmer). You'll find out about concepts like version control, user-centered design, and test-driven development. Programmers embrace these ideas to a greater or lesser degree, depending on their goals and those of the project they're working on. But all programmers know them.

I'm especially into test-driven development at the moment. The idea is that before implementing a new module or feature, you write a test — a short program that gives the new thing some input, inspects the output, and compares it to a known answer. The first version of the code will likely fail the test. The idea is to refactor the code until it passes the test. Then you add that test to a suite that runs every time you build anything in the same project, so you know your new thing doesn't get broken by something else later. And you aren't tempted to implement features that weren't part of the test.

Fail — Refactor — Pass
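
In Python, a test might look like the sketch below, using pytest; rms() is just a stand-in for whatever new thing you're building:

```python
import numpy as np
import pytest

def rms(trace):
    """Root-mean-square amplitude of a trace: the new 'feature' under test."""
    trace = np.asarray(trace, dtype=float)
    return np.sqrt(np.mean(trace**2))

def test_rms():
    # Written first; it fails until rms() does the right thing.
    assert rms([3, 3, 3]) == pytest.approx(3.0)
    assert rms([1, -1, 1, -1]) == pytest.approx(1.0)
    assert rms([0, 0, 0]) == 0.0
```

Running pytest picks the test up automatically, and keeps running it every time the project changes.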

Imagine test-driven development geology (or any other kind of geoscience). What would that look like?

  • When planning wells, we often do write tests — they're called prognoses. But the comparison with the result is rarely formalized or quantified, especially outside the target zone. Once the well is drilled, it becomes data and we move on. No-one likes to dwell on the poorly understood or error-prone, but naturally that's where the greatest room for improvement is.  
  • When designing a new seismic attribute, or embarking on a seismic processing project, we often have a vague idea of success in our heads, and that's about it. What if we explicitly defined an input test dataset, some wells or bits of wells, and set 'passing' performance criteria on those? "I won't interpret the reprocessed seismic until it improves those synthetic correlation coefficients by 40%."
  • When designing a seismic survey, we could establish acceptable criteria for trace density, minimum offset, azimuth distribution, and recording time, then use these as a cost function to find the best possible survey for our dollars. Wait, perhaps we actually do this one. Is seismic acquisition unusually scientific? Or is it an inherently more linear problem?

What do you think? Can you see ways to define 'success' before you begin, then somewhat quantitatively compare your results with that? Ideas wanted!

Why don't people use viz rooms?

Matteo Niccoli asked me why I thought the use of immersive viz rooms had declined. Certainly, most big companies were building them in about 1998 to 2002, but it's rare to see them today. My stock answer was always "Linux workstations", but of course there's more to it than that.

What exactly is a viz room?

I am not talking about 'collaboration rooms', which are really just meeting rooms with a workstation and a video conference phone, a lot of wires, and wireless mice with low batteries. These were among the collaboration technologies that replaced viz rooms, and they seem to be ubiquitous (and also under-used).

The Viz Lab at Wisconsin–Madison. Thanks to Harold Tobin for permission.

A 'viz room', for our purposes here, is a dark room with a large screen, at least 3 m wide, probably projected from behind. There's a Crestron controller with greasy fingerprints on it. There's a week-old coffee cup because not even the cleaners go in there anymore. There's probably a weird-looking 3D mouse and some clunky stereo glasses. There might be some dusty haptic equipment that would work if you still had an SGI.

Why did people stop using them?

OK, let's be honest... why didn't most people use them in the first place?

  1. The rise of the inexpensive Linux workstation. The Sun UltraSPARC workstations of the late 1990s couldn't render 3D graphics quickly enough for spinning views or volume-rendered displays, so viz rooms were needed for volume interpretation and well-planning. But fast machines with up to 16GB of RAM and high-end nVidia or AMD graphics cards came along in about 2002. A full dual-headed set-up cost 'only' about $20k, compared to about 50 times that for an SGI with similar capabilities (for practical purposes). By about 2005, everyone had power and pixels on the desktop, so why bother with a viz room?
  2. People never liked the active stereo glasses. They were certainly clunky and ugly, and some people complained of headaches. It took some skill to drive the software, and to avoid nauseating spinning around, so the experience was generally poor. But the real problem was that nobody cared much for the immersive experience, preferring the illusion of 3D that comes from motion. You can interactively spin a view on a fast Linux PC, and this provides just enough immersion for most purposes. (As soon as the motion stops, the illusion is lost, and this is why 3D views are so poor for print reproduction.)
  3. They were expensive. Early adoption was throttled by expense (as with most new technology). The room renovation might cost $250k, the SGI Onyx double that, and the projectors were $100k each. But even if the capex was affordable, everyone forgot to include operating costs — all this gear was hard to maintain. The pre-DLP cathode-ray-tube projectors needed daily calibration, and even DLP bulbs cost thousands. All this came at a time when companies were letting techs go and curtailing IT functions, so lots of people had a bad experience with machines crashing, or equipment failing.
  4. Intimidation and inconvenience. The rooms, and the volume interpretation workflow generally, had an aura of 'advanced'. People tended to think their project wasn't 'worth' the viz room. It didn't help that lots of companies made the rooms almost completely inaccessible, with a locked door and onerous booking system, perhaps with a gatekeeper admin deciding who got to use it.
  5. Our culture of PowerPoint. Most of the 'collaboration' action these rooms saw was PowerPoint, because presenting with live data in interpretation tools is a scary prospect and takes practice.
  6. Volume interpretation is hard and mostly a solitary activity. When it comes down to it, most interpreters want to interpret on their own, so you might as well be at your desk. But you can interpret on your own in a viz room too. I remember Richard Beare, then at Landmark, sitting in the viz room at Statoil, music blaring, EarthCube buzzing. I carried on this tradition when I was at Landmark as I prepared demos for people, and spent many happy hours at ConocoPhillips interpreting 3D seismic on the largest display in Canada.  

What are viz rooms good for?

Don't get me wrong. Viz rooms are awesome. I think they are indispensable for some workflows: 

  • Well planning. If you haven't experienced planning wells with geoscientists, drillers, and reservoir engineers, all looking at an integrated subsurface dataset, you've been missing out. It's always worth the effort, and I'm convinced these sessions will always plan a better well than passing plans around by email. 
  • Team brainstorming. Cracking open a new 3D with your colleagues, reviewing a well program, or planning the next year's research projects, are great ways to spend a day in a viz room. The broader the audience, as long as it's no more than about a dozen people, the better. 
  • Presentations. Despite my dislike of PowerPoint, I admit that viz rooms are awesome for presentations. You will blow people away with a bit of live data. My top tip: make PowerPoint slides with an aspect ratio to fit the entire screen: even PowerPoint haters will enjoy 10-metre-wide slides.

What do you think? Are there still viz rooms where you work? Are there 'collaboration rooms'? Do people use them? Do you?

10 ways to improve your data store

When I look at the industry's struggle with the data mess, I see a parallel with science's struggle with open data. I've written lots about that before, but the basic idea is simple: scientists need discoverable, accessible, documented, usable data. Does that sound familiar?

I wrote yesterday that I think we have to get away from the idea that we can manage data like we might manage a production line. Instead, we need to think about more organic, flexible strategies that cope with and even thrive on chaos. I like, or liked until yesterday, the word 'curation', because it implies ongoing care and a focus on the future. But my friend Eric Marchand was right in his comment yesterday — the dusty connotation is too strong, and dust is bad for data. I like his supermarket analogy: packaged, categorized items, each with a cost of production and a price. A more lively, slightly chaotic market might match my vision better — multiple vendors maintaining their own realms. One can get carried away with analogies, but I like this better than a library or museum.

The good news is that lots of energetic and cunning people have been working on this idea of open data markets. So there are plenty of strategies we can try, alongside the current strategy of giving giant service companies millions of dollars for their TechCloud® Integrated ProSIGHT™ Data Management Solutions.

Serve your customer:

  • Above all else, build what people need. It's amazing that this needs to be said, but ask almost anyone what they think of IT at their company and you'll see this is not how it works today. Everything you build should be in response to the business pulling.
  • This means you have to get out of the building and talk to your customers. In person, one-on-one. Watch them use your systems. Listen to them. Respond to them.

Knock down the data walls:

  • Learn and implement open data practices inside the organization. Focus on discoverability, accessibility, and documentation of good-enough data, not on building The One True Database.
  • Encourage and fund open data practices among providers of public data. There is a big role here for our technical societies, I believe, but I don't think they have seen it yet.

I've said it before: hire loads of geeks:

  • The web (well, intranet) is your pipeline. Build and maintain proper machine interfaces (APIs and web APIs) for data. What, you don't know how to do this? I know; it means hiring web-savvy data-obsessed programmers.
  • Bring back the hacker technologists that I think I remember from the nineties. Maybe it's a myth memory, but sprinkled around big companies there used to be super-geeks with degrees in astrophysics, mad UNIX skills, and the Oracle admin password. Now it's all data managers with Petroleum Technology certificates who couldn't write an awk script if your data depended on it (it does). 
  • Institute proper data wrangling and analysis training for scientists. I think this is pretty urgent. Anecdotal evidence: the top data integration tool in our business is PowerPoint... or an Excel chart with two y-axes if we're talking about engineers. (What does E&P mean?)

Three more things:

  • Let data live where it wants to live (databases, spreadsheets, wikis, SharePoint if you must). Focus on connecting data with APIs and data translators. It's pointless trying to move data to where you want it to be — you're just making it worse. ("Oh, you moved my spreadsheet? Fine, I will copy my spreadsheet.")
  • Get out of the company and find out what other people are doing. Not the other industry people struggling with data — they are just as clueless as we are. Find out what the people who are doing amazing things with data are doing: Google, Twitter, Facebook, data.gov, Wikipedia, Digital Science, The New York Times, The Guardian,... there are so many to choose from. We should invite these people to our conferences; they can help us.
  • If you only do one thing, fix search in your company. Stop tinkering with semantic ontologies and smart catalogs, just buy Google Search Appliance and fix it. You can get this one done by Christmas.

Last thing. If there's one mindset that will really get in the way, it's the project mindset. If we want to go beyond coping with the data mess, far beyond it to thriving on it, then we have to get comfortable with the idea that this is not a project. The word is banned, along with 'initiative', 'governance', and Gantt charts. The requirements you write on the back of a napkin with three colleagues will be just as useful as the ones you get back from three months of focus groups.

No, this is the rest of your career. This is never done; next year there are better ideas, more flexible libraries, faster hardware, and new needs. It's like getting fit: this ain't an 8-week get-fit program, it's an eternity of crunches.

The photograph of Covent Market in London, Ontario is from Boris Kasimov on Flickr.

Data management fairy tales

On Tuesday I read this refreshing post in LinkedIn by Jeffrey Maskell of Westheimer Energy Consultants. It's a pretty damning assessment of the current state of data management in the petroleum industry:

The fact is that no major technology advances have been seen in the DM sector for some 20 years. The [data management] gap between acquisition and processing/interpretation is now a void and is impacting the industry across the board...

I agree with him. But I don't think he goes far enough on the subject of what we should do about it. Maskell is, I believe, advocating more effort (and more budget) developing what the data management crowd have been pushing for years. In a nutshell:

I agree that standards, process, procedures, workflows, data models are all important; I also agree that DM certification is a long term desirable outcome. 

These words make me sad. I'd go so far as to say that it's the pursuit of these mythical ideas that's brought about today's pitiful scene. If you need proof, just look around you. Go look at your shared drive. Go ask someone for a well file. Go and (a) find then (b) read your IT policies and workflow documents — they're all fairy tales.

Maskell acknowledges at least that these are not enough; he goes on:

However I believe the KEY to achieving a breakthrough is to prove positively that data management can make a difference and that the cost of good quality data management is but a very small price to pay...

No, the key to achieving a breakthrough is a change of plan. Another value-of-information study just adds to the misery.

Here's what I think: 'data management' is an impossible fiction. A fairy tale.

You can't manage data

I'm talking to you, big-company-data-management-person.

Data is a mess, and it's distributed across your organization (and your partners, and your government, and your data vendors), and it's full of inconsistencies, and people store local copies of everything because of your broken SharePoint permissions, and... data is a mess.

The terrible truth you're fighting is that subsurface data wants to be a mess. Subsurface geoscience is not accounting. It's multi-dimensional. It's interdependent. Some of it is analog. There are dozens, maybe hundreds of formats, many of which are proprietary. Every single thing is unquantifiably uncertain. There are dozens of units. Interpretation generates more data, often iteratively. Your organization won't fund anything to do with IT properly — "We're an oil company, not a technology company!" — but it's OK because VPs only last 2 years. Well, subsurface facts last for decades.

You can't manage data. Try something else.

The principle here is: cope don't fix.

People earnestly trying to manage data reminds me of Yahoo trying to catalog the Internet in 1995. Bizarrely, they're still doing it... for 3 more months anyway. But we all know there's only one way to find things on the web today: search. Search transcends the catalog. 

So what transcends data management? I've got my own ideas, but first I really, really want to know what you think. What's one thing we could do — or stop doing — to make things a bit better?

Not picking parameters

I like socks. Bright ones. I've liked bright socks since Grade 6. They were the only visible garment not governed by school uniform, or at least not enforced, and I think that was probably the start of it. The tough boys wore white socks, and I wore odd red and green socks. These days, my favourites are Cole & Parker, and the only problem is: how to choose?

Last Tuesday I wrote about choosing parameters for geophysical algorithms — window lengths, velocities, noise levels, and so on. Like choosing socks, it's subjective, and it's hard to find a pair for every occasion. The comments from Matteo, Toastar, and GuyM raised an interesting question: maybe the best way to pick parameters is to not pick them? I'm not talking about automatically optimizing parameters, because that's still choosing. I'm talking about not choosing at all.

How many ways can we think of to implement this non-choice? I can think of four approaches, but I'm not 100% sure they're all different, or if I can even describe them...

Is the result really optimal, or just a hard-to-interpret patchwork?

Adaptivity

Well, okay, we still choose, but we choose a different value everywhere depending on local conditions. A black pair for a formal function, white for tennis, green for work, and polka dots for special occasions. We can adapt to any property (rather like automatic optimization), along any dimension of our data: spatially, azimuthally, temporally, or frequentially (there's a word you don't see every day).

Imagine computing seismic continuity. At each sample, we might evaluate some local function — such as contrast — for a range of window sizes. Or, when smoothing, we might specify some maximum acceptable signal loss compared to the original. We end up using a different value everywhere, and expect an optimal result.

One problem is that we still have to choose a cost function. And to be at all useful, we would need to produce two new data products, besides the actual result: a map of the parameter's value, and a map of the residual cost, so to speak. In other words, we need a way to know what was chosen, and how satisfactory the choice was.
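
Here's a crude sketch of the bookkeeping: smooth a trace with the longest window that keeps a made-up 'signal loss' measure below a threshold, and return those two extra products alongside the result:

```python
import numpy as np

def adaptive_smooth(trace, windows=(8, 16, 32, 64), max_loss=0.1):
    """Smooth each sample with the longest window whose 'loss' is acceptable."""
    n = len(trace)
    result, chosen, cost = np.zeros(n), np.zeros(n, dtype=int), np.zeros(n)
    for i in range(n):
        for w in sorted(windows, reverse=True):
            seg = trace[max(0, i - w // 2):i + w // 2 + 1]
            loss = abs(seg.mean() - trace[i])      # crude proxy for signal loss
            if loss <= max_loss or w == min(windows):
                result[i], chosen[i], cost[i] = seg.mean(), w, loss
                break
    return result, chosen, cost   # result, parameter map, residual cost map

rng = np.random.default_rng(0)
trace = np.sin(np.linspace(0, 20, 500)) + 0.3 * rng.normal(size=500)
smoothed, window_map, cost_map = adaptive_smooth(trace)
```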

Stochastic shotgun

We could fall back on that geostatistical favourite and pick the parameter values stochastically, grabbing socks at random out of the drawer. This works, but I need a lot of socks to have a chance of getting even a local maximum. And we run into the old problem of really not knowing what to do with all the realizations. Common approaches are to take the P50, P10, and P90, or to average them. Both of these approaches make me want to ask: Why did I generate all those realizations?

Experimental design methods

The design of experiments is a big deal in the life sciences, but for some reason it's rarely (never?) talked about in geoscience. Applying a cost function, or even just visual judgment, to a single parameter is perhaps trivial, but what if you have two variables? Three? What if they are non-linear and covariant? Then the optimization process amounts to a sticky inverse problem.

Fortunately, lots of clever people have thought about these problems. I've even seen them implemented in subsurface software. Cool-sounding combinatorial reduction techniques like Greco-Latin squares or Latin hypercubes offer ways to intelligently sample the parameter space and organize the results. We could do the same with socks, evaluating pattern and toe colour separately...
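
For what it's worth, SciPy now ships a Latin hypercube sampler; here's a sketch of drawing 20 combinations from a made-up three-parameter space:

```python
from scipy.stats import qmc

# Hypothetical parameters: window length (ms), Gaussian sigma, threshold.
l_bounds = [8, 0.5, 0.1]
u_bounds = [64, 4.0, 0.9]

sampler = qmc.LatinHypercube(d=3, seed=42)
samples = qmc.scale(sampler.random(n=20), l_bounds, u_bounds)

for window, sigma, threshold in samples:
    pass   # compute the attribute with these parameters and record a cost or score
```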

The mixing board

There is another option: the mixing board. Like a music producer, a film editor, or the Lytro camera, I can leave the raw data in place, and always work from the masters. Given the right tools, I can make myself just the right pair of socks whenever I like.

This way we can navigate the parameter space, applying views, processes, or other tools on the fly. Clearly this would mean changing everything about the way we work. We'd need a totally different approach not just to interpretation, but to the entire subsurface characterization workflow.

Are there other ways to avoid choosing? What are people using in other industries, or other sciences? I think we need to invite some experimental design and machine learning people to SEG...

Cole & Parker socks are awesome. The quilt image is by missvancamp on Flickr and licensed CC-BY. The spools are by surfzone on Flickr, licensed CC-BY. Many thanks to Cole & Parker for permission to use the sock images, despite not knowing what on earth I was going to do with them. Buy their socks! They're Canadian and everything.

Picking parameters

One of the reasons I got interested in programming was to get smarter about broken workflows like this one from a generic seismic interpretation tool (I'm thinking of Poststack-PAL, but does that even exist any more?)...

  1. I want to make a coherence volume, which requires me to choose a window length.
  2. I use the default on a single line and see how it looks, then try some other values at random.
  3. I can't remember what I did so I get systematic: I try 8 ms, 16 ms, 32 ms, and 64 ms, saving each one as a result with _XXms appended so I can remember what I did
  4. I display them side by side but the windows are annoying to line up and resize, so instead I do it once, display them one at a time, grab screenshots, and import the images into PowerPoint because let's face it I'll need that slide eventually anyway
  5. I can't decide between 16 ms and 32 ms so I try 20 ms, 24 ms, and 28 ms as well, and do it all again, and gaaah I HATE THIS STUPID SOFTWARE

There has to be a better way.

Stumbling towards optimization

Regular readers will know that this is the time to break out the IPython Notebook. Fear not: I will focus on the outcomes here — for the real meat, go to the Notebook. Or click on these images to see larger versions, and code.

Let's run through using the Canny edge detector in scikit-image, a brilliant image processing Python library. The algo uses the derivative of a Gaussian to compute gradient, and I have to choose 3 parameters. First, we'll try to optimize 'sigma', the width of the Gaussian. Let's try the default value of 1:

Clearly, there is too much noise in the result. Let's try the interval method that drove me crazy in desktop software:
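
The real thing is in the notebook; here's the gist of that sweep, with scikit-image's built-in test image standing in for the data:

```python
import matplotlib.pyplot as plt
from skimage import data, feature

image = data.camera()   # stand-in image; swap in your own data

sigmas = [1, 2, 4, 8, 16]
fig, axs = plt.subplots(1, len(sigmas), figsize=(15, 3))
for ax, sigma in zip(axs, sigmas):
    ax.imshow(feature.canny(image, sigma=sigma), cmap='gray')
    ax.set_title(f'sigma = {sigma}')
    ax.set_axis_off()
plt.show()
```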

Well, I think something between 8 and 16 might work. I could compute the average intensity of each image, choose a value in between them, and then use the sigma that gives that result. OK, it's a horrible hack, but it turns out to be 10:
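
One way to implement that hack — not necessarily how the notebook does it — is with a root finder:

```python
from scipy.optimize import brentq
from skimage import data, feature

image = data.camera()   # stand-in image again

def edge_density(sigma):
    """Fraction of pixels flagged as edges: a stand-in for 'average intensity'."""
    return feature.canny(image, sigma=sigma).mean()

# Aim halfway between the sigma=8 and sigma=16 results, then solve for sigma.
target = (edge_density(8) + edge_density(16)) / 2
best_sigma = brentq(lambda s: edge_density(s) - target, 8, 16)
print(best_sigma)
```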

But the whole point of scientific computing is the efficient application of informed human judgment. So let's try adding some interactivity — then we can explore the 3D parameter space in a near-parallel instead of purely serial way:
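
In a notebook, ipywidgets makes this almost a one-liner; a sketch of the interactive version (again with the stand-in image):

```python
import matplotlib.pyplot as plt
from ipywidgets import interact
from skimage import data, feature

image = data.camera()   # stand-in image again

@interact(sigma=(1.0, 32.0))
def show_edges(sigma=10.0):
    # Add low_threshold and high_threshold sliders the same way to cover
    # all three parameters of the Canny detector.
    plt.imshow(feature.canny(image, sigma=sigma), cmap='gray')
    plt.axis('off')
    plt.show()
```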

I finally feel like we're getting somewhere... But it still feels a bit arbitrary. I still don't know if I'm getting the optimal result.

What can I try next? I could try to extend the 'goal seek' option, and come up with a more sophisticated cost function. If I could define something well enough — for edge detection, like coherence, I might be interested in contrast — then I could potentially just find the best answers, in much the same way that a digital camera autofocuses (indeed, many of them look for the highest contrast image). But goal seeking, if the cost function is too literal, in a way begs the question. I mean, you sort of have to know the answer — or something about the answer — before you find it.

Social machines

Social machines are the hot new thing in computing (Big Data is so 2013). Perhaps instead I can turn to other humans, in my social and professional networks. I could...

  • Ask my colleagues — perhaps my company has a knowledge sharing network I can go to.
  • Ask t'Internet — I could ask Twitter, or my friends on Facebook, or a seismic interpretation group in LinkedIn. Better yet, Earth Science Stack Exchange!
  • What if the software I was using just told me what other people had used for these parameters? Maybe this is only one step up from the programmer's default... especially if most people just use the programmer's default.
  • But what if people could rate the outcome of the algorithm? What if their colleagues or managers could rate the outcome? Then I could weight the results with these ratings.
  • What if there was a game that involved optimizing images (OK, maybe a bit of a stretch... maybe more like a Mechanical Turk). Then we might have a vast crowd of people all interested in really pushing the edge of what is intuitively reasonable, and maybe exploring the part of the parameter space I'm most interested in.

What if I could combine the best of all these approaches? Interactive exploration, with guided optimization, constrained by some cost function or other expectation. That could be interesting, but unfortunately I have absolutely no idea how that would work. I do think the optimization workflow of the future will contain all of these elements.

What do you think? Do you have an awesome way to optimize the parameters of seismic attributes? Do you have a vision for how it could be better? It occurs to me this could be a great topic for a future hackathon...

Click here for an IPython Notebook version of this blog post. If you don't have it, IPython is easy to install. The easiest way is to install all of scientific Python, or use Canopy or Anaconda.