# Test driven development geoscience

Sometimes I wonder how much of what we do in applied geoscience is really science. Is it really about objective enquiry? Do we form hypotheses, then test them? The scientific method is largely a caricature — science is more accidental and more fun than a step-by-step recipe — but I think our field sometimes falls short of even basic rigour. Go and sit through a conference session on seismic attribute analysis some time and you'll see what I mean. Let's just say there's a lot of arm-waving and shape-ology.

### Learning from geeks

We've written before about the virtues of the software engineering community. Innovation has been so rapid recently, that I think it's a great place to find interpretation hacks like pair picking. Learning about and experiencing the amazing productivity of programmers is one of the reasons I think all scientists should learn to program (but not learn to be a programmer). You'll find out about concepts like version control, user-centered design, and test-driven development. Programmers embrace these ideas to a greater or lesser degree, depending on their goals and those of the project they're working on. But all programmers know them.

I'm especially into test-driven development at the moment. The idea is that before implementing a new module or feature, you write a test — a short program that gives the new thing some input, inspects the output, and compares it to a known answer. The first version of the code will likely fail the test. The idea is to refactor the code until it passes the test. Then you add that test to a suite that runs every time you build anything in the same project, so you know your new thing doesn't get broken by something else later. And you aren't tempted to implement features that weren't part of the test.

### Fail — Refactor — Pass

Imagine test-driven development geology (or any other kind of geoscience). What would that look like?

• When planning wells, we often do write tests — they're called prognoses. But the comparison with the result is rarely formalized or quantified, especially outside the target zone. Once the well is drilled, it becomes data and we move on. No-one likes to dwell on the poorly understood or error-prone, but naturally that's where the greatest room for improvement is.
• When designing a new seismic attribute, or embarking on a seismic processing project, we often have a vague idea of success in our heads, and that's about it. What if we explicitly defined an input test dataset, some wells or bits of wells, and set 'passing' performance criteria on those? "I won't interpret the reprocessed seismic until it improves those synthetic correlation coefficients by 40%."
• When designing a seismic survey, we could establish acceptable criteria for trace density, minimum offset, azimuth distribution, and recording time, then use these as a cost function to find the best possible survey for our dollars. Wait, perhaps we actually do this one. Is seismic acquisition unusually scientific? Or is it an inherently more linear problem?

What do you think? Can you see ways to define 'success' before you begin, then somewhat quantitatively compare your results with that? Ideas wanted!

1 Comment

### Matt Hall

Matt is a geoscientist in Nova Scotia, Canada. Founder of Agile Scientific, co-founder of The HUB South Shore. Matt is into geology, geophysics, and machine learning.

# Fitting a model to data

In studying the earth, we can't afford to take enough observations, and they will never be free of noise. So if you say you do geoscience, I hereby challenge you to formulate your work as a mathematical inverse problem. Inversion is a question: given the data, the physical equations, and details of the experiment, what is the distribution of physical properties? To answer this question we must address three more fundamental ones (Scales, Smith, and Treitel, 2001):

• How accurate is the data? Or what does fit mean?
• How accurately can we model the response of the system? Have we included all the physics that can contribute signifcantly to the data?
• What is known about the system independent of the data? There must be a systematic procedure for rejecting unreasonable models that fit the data as well.

Setting up an inverse problem means coming up with the equations that contain the physics and geometry of the system under study. The method for solving it depends on the nature of the system of equations. The simplest is the minimum norm solution, and you've heard of it before, but perhaps under a different name.

### To fit is to optimize a system of equations

For problems where the number of observations is greater than the number of unknowns, we want to find which unknowns fit the best. One case you're already familiar with is the method of least squares — you've used it fitting a line of through a set of points. A line is unambiguously described by only two parameters: slope a and y-axis intercept b. These are the unknowns in the problem, they are the model m that we wish to solve for. The problem of line-fitting through a set of points can be written out like this,

$\large \mathbf{d}=\begin{pmatrix}y_{1} \\ y_{2} \\ ... \\ y_{n} \end{pmatrix}, \mathbf{G}=\begin{pmatrix} x_{1} & 1\\ x_{2} & 1\\ ... & ...\\ x_{n} & 1 \end{pmatrix}, \mathbf{m}=\begin{pmatrix} a\\ b \end{pmatrix}$

As I described in a previous post, the system of the problem takes the form d = Gm, where each row links a data point to an equation of a line. The model vector m (M × 1), is smaller than the data d (N × 1) which makes it an over-determined problem, and G is a N × M matrix holding the equations of the system.

Why cast a system of equations in this matrix form? Well, it turns out that the the best-fit line is precisely,

$\large \mathbf{m}=(G^{T}G)^{-1}G^{T}\mathbf{d}$

which are trivial matrix operations, once you've written out G.  T means to take the transpose, and –1 means the inverse, the rest is matrix multiplication. Another name for this is the minimum norm solution, because it defines the model parameters (slope and intercept) for which the lengths (vector norm) between the data and the model are a minimum.

One benefit that comes from estimating a best-fit model is that you get the goodness-of-fit for free. Which is convenient because making sense of the earth doesn't just mean coming up with models, but also expressing their uncertainty, in terms of the errors with which they are linked.

I submit to you that every problem in geology can be formulated as a mathematical inverse problem. The benefit of doing so is not just to do math for math's sake, but it is only through quantitatively portraying ambiguous inferences and parameterizing non-uniqueness that we can do better than interpreting or guessing.