# The order of stratigraphic sequences

Much of stratigraphic interpretation depends on a simple idea:

Usually, geologists are faced with only the stratigraphic picture, and are challenged with reconstructing the geographic picture.

One interpretation strategy might be to look at which rocks tend to occur together in the stratigraphy. The idea is that rock types tend to be associated with geographic environments — maybe fine sand on the shoreface, coarse sand on the beach; massive silt in the tidal channel, rhythmically laminated mud in the mud-flats. Since if two rocks tend to occur together, their environments were probably adjacent, we can start to understand associations between the rock types, and thus piece together the geographic picture.

So which rock types tend to occur together, and which juxtapositions are spurious — perhaps the result of allocyclic mechanisms like changes in relative sea-level, or sediment supply? To get at this question, some stratigraphers turn to Markov chain analysis.

### What is a Markov chain?

Markov chains are sequences of events, or states, resulting from a Markov process. Here’s how Wikipedia describes a Markov process:

So if we believe that a stratigraphic sequence (I’m using ‘sequence’ here in the most general sense) can be modeled by a process like this — i.e. that its next state depends substantially on its present state — then perhaps we can model it as a Markov chain.

For example, we might have a hunch that we can model a shallow marine system as a sequence like:

offshore mudstone > lower shoreface siltstone > upper shoreface sandstone > foreshore sandstone

Then we might expect to see these transitions occur more often than other, non-successive transitions. In other words — if we compare the transition frequencies we observe to the transition frquencies we would expect from a random sequence of the same beds in the same proportions, then autocyclic or genetic transitions might happen unusually frequently.

### The Powers & Easterling method

Several workers have gone down this path. The standard approach seems to be that of Powers & Easterling (1982). Here are the steps they describe:

• Count the upwards transitions for each rock type. This results in a matrix of counts. Here’s the transition frequency matrix for the example used in the Powers & Easterling paper, in turn take from Gingerich (1969):

data = [[ 0, 37,  3,  2],
[21,  0, 41, 14],
[20, 25,  0,  0],
[ 1, 14,  1,  0]]
• Compute the expected counts by an iterative process, which usually converges in a few steps. The expected counts represent what Goodman (1968) called a ‘quasi-independence’ model — a random sequence:

array([[ 0. , 31.3,  8.2,  2.6],
[31.3,  0. , 34.1, 10.7],
[ 8.2, 34. ,  0. ,  2.8],
[ 2.6, 10.7,  2.8,  0. ]])
• Now we can compare our observed frequencies with the expected ones in two ways. First, we can inspect the $$\chi^2$$ statistic, and compare it with the $$\chi^2$$ distribution, given the degrees of freedom (5 in this case). In this example, it’s 35.7, which is beyond the 99.999th percentile of the chi-squared distribution. This rejects the hypothesis of quasi-independence. In other words: the succession appears to be organized. Phew!

• Secondly, we can compute a matrix of so-called normalized differences. This lets us compare the observed and expected data. By calculating Z-scores, which are approximately normally distributed; since 95% of the distribution falls between −2 and +2, any value greater in magnitude than 2 is ‘fairly unusual’, in the words of Powers & Easterling. In the example, we can see that the large number of transitions from C (third row) to A (first column) is anomalous:


array([[ 0. ,  1. , -1.8, -0.3],
[-1.8,  0. ,  1.2,  1. ],
[ 4.1, -1.6,  0. , -1.7],
[-1. ,  1. , -1.1,  0. ]])
• The normalized difference matrix can also be interpreted as a directed graph, indicating the ‘strengths’ of the connections (edges) between rock types (nodes):

It would be all too easy to over-interpret this graph — B and D seem to go together, as do A and C, and C tends to pass into A, which tends to pass into a B/D system before passing back into C — and one could get carried away. But as a complement to sedimentological interpretation, knowledge of processes and the succession in hand, perhaps inspecting Markov chains can help understand the stratigraphic story.

One last thing… there is another use for Markov chains. We can also use the model to produce stochastic realizations of stratigraphy. These will share the same statistics as the original data, but are otherwise quite random. Here are 20 random beds generated from our model:

'ABABCBABABCABDABABCA'

The code to build your own Markov chains is all in this notebook. It’s very much a work in progress. Eventually I hope to merge it into the striplog library, but for now it . Stay tuned for more on that.

⇐   Launch the notebook right here in your browser!

### References

Gingerich, PD (1969). Markov analysis of cyclic alluvial sediments. Journal of Sedimentary Petrology, 39, p. 330-332. https://doi.org/10.1306/74D71C4E-2B21-11D7-8648000102C1865D

Goodman, LA (1968), The analysis of cross-classified data: independence, quasi-independence, and interactions in contingency tables with or without missing entries. Journal of American Statistical Association 63, p. 1091-1131. https://doi.org/10.2307/2285873

Powers, DW and RG Easterling (1982). Improved methodology for using embedded Markov chains to describe cyclical sediments. Journal of Sedimentary Petrology 52 (3), p. 0913-0923. https://doi.org/10.1306/212F808F-2B24-11D7-8648000102C1865D

Comment

### Matt Hall

Matt is a geoscientist in Nova Scotia, Canada. Founder of Agile Scientific, co-founder of The HUB South Shore. Matt is into geology, geophysics, and machine learning.

# The right writing tools

Scientists write, it's part of the job. If writing feels laborious, it might be because you haven't found the right tools yet.

The wrong tools <cough>Word</cough> feel like a lot of work. You spend a lot of time fiddling with font sizes and not being sure whether to use italic or bold. You're constantly renumbering sections after edits. Everything moves around when you resize a figure. Tables are a headache. Table of contents? LOL.

If this sounds familiar, check out the following tools — arranged more or less in order of complexity.

### Markdown

If you've never experienced writing with a markup language, you're in for a treat. At first it might feel clunky, but it quickly gets out of the way, leaving you to focus on the writing. Markdown was invented by John Gruber in about 2004; it is now almost ubiquitous in tools for developers. It's very lightweight, but compatible with HTML and LaTeX math, so it has plenty of features. Styling is absent from the document itself, being applied enitrely in post-production, as it were. With help from pandoc, you can compile Markdown documents to almost any format (e.g. PDF or Word). As a result, Markdown is sufficient for at least 70% of my writing projects. Here's a sampling of Markdown markup, rendered on the right with no styling:

### Jupyter Notebook

If you've been following along with our X Lines of Python series, or any of our other code-centric content, you'll have come across Jupyter Notebooks. These documents combine Markdown with code (in more or less any language you can think of) and the outputs of code — data, charts, images, etc. More than containing code, a so-called kernel can also run the code: Notebooks are fully computable documents. Not only could you write a paper or book in a Notebook, many people use them to give presentations with fully interactive, live code blocks and widgets.

### LaTeX

I discovered LaTeX in about 1993 and it was love at first sight. I've always been a bit of a typography nerd, and LaTeX — like TeX, around which LaTeX is wrapped — really cares about typography. So you get ligatures, hyphenation, sentence spacing, and kerning for free. It also cares about mathematics, cross-references, bibliographies, page numbering, tables of contents, and everything else you need for publication-ready documents.

You can install LaTeX locally, but there are several ways to use LaTeX online, without installing anything — and you get the best of both worlds: markup with WYSIWYG editing. OverleafShareLaTex (which is merging with Overleaf), Authorea, and Papeeria are all worth a look, especially if you write scientific papers.

### When WYSISYG works

Sometimes you just want a couple of headings and some text, or you need to share a document with others. I often go for WYSISYG in those situations too — Google Docs is the best WYSIWYG editor I've used. When it supports Markdown too, which is surely only a matter of time, it will be perfect.

What about you, do you have a favourite writing tool? Share it in the comments.

7 Comments

### Matt Hall

Matt is a geoscientist in Nova Scotia, Canada. Founder of Agile Scientific, co-founder of The HUB South Shore. Matt is into geology, geophysics, and machine learning.

# Digitalization... of what?

I've been hearing a lot about 'digitalization', or 'digital transformation', recently. What is this buzzword?

The general idea seems to be: exploit lots and lots of data (which we already have), with analytics and machine learning probably, to do a better job finding and producing fuel and energy safely and responsibly.

At the centre of it all is usually data. Lots of data, usually in a lake. And this is where it all goes wrong. Digitalization is not about data. And it's not about technology either. Or cloud. Or IoT.

Interest in the terms "digital transformation" and "digitalization" since 2004, according to Google Trends. The data reveal a slight preference for the term "digitalization" in central and northern Europe. Google Ngram Viewer indicates that the term "digitalization" has been around for over 100 years, but it is also a medical term connected with the therapeutic use of digitalis. Just to be clear, that's not what we're talking about.

### It's about people

Oh no, here I go with the hand-wavy, apple-pie "people not process" nonsense... well, yes. I'm convinced that it's humans we're transforming, not data or technology. Or clouds. Or Things.

I think it's worth spelling out, because I think most corporations have not grasped the human aspect yet. And I don't think it's unreasonable to say that petroleum has a track record of not putting people at the centre of its activities, so I worry that this will happen again. This would be bad, because it might mean that digitalization not only fails to get traction — which would be bad enough because this revolution is long overdue — but also that it causes unintended problems.

Without people, digital transformation is just another top-down 'push' effort, with too much emphasis on supply. I think it's smarter to create demand, or 'pull', so that professionals are asking for support, and tools, and databases, and are engaged in how those things are created.

# Put technical professionals at the heart of the revolution, and the rest will follow. The inverse is not true.

### Strategies

This is far from an exhaustive list, but here are some ideas for ways to get ahead in digital transformation:

• Make it easy for digitally curious people to dip a toe in. Build a beginner-friendly computing environment, and encourage people to use it. Challenge your IT people to support a culture of experimentation and creativity.
• Give those curious professionals access to professional development channels, whether it's our courses, other courses, online channels like Lynda.com or Coursera, or whatever.
• Build a community of practice for 'scientific computing'. Whether it's a Yammer group or something more formal, be sure to encourage frequent face-to-face meetups, and perhaps an intranet portal.
• Start to connect subsurface professionals with software engineers, especially web programmers and data scientists, elsewhere in the organization. I think the best way is to embed programmers into technical teams.
• Encourage participation in external channels like conferences and publications, data science contests, hackathons, open source projects, and so on. I guarantee you'll see a step change in skills and enthusiasm.

The bottom line is that we're looking for a profound culutral change. It will be slow. More than that, it needs to be slow. It might only take a year or two to get traction for an idea like "digital first". But deeper concepts, like "machine readable microservices" or "data-driven decisions" or "reproducible workflows", must take longer because you can't build that high without a solid foundation. Successfully applying specific technologies like deep learning, augmented reality, or blockchain, will certainly require a high level of technology literacy, and will take years to get right.

What's going on with scientific computing in your organization? Are you 'digitally curious'? Do you feel well-supported? Do you know others in your organization like you?

The circuit board image in the thumbnail for this post is by Carl Drougge, licensed CC-BY-SA.

2 Comments

### Matt Hall

Matt is a geoscientist in Nova Scotia, Canada. Founder of Agile Scientific, co-founder of The HUB South Shore. Matt is into geology, geophysics, and machine learning.

# No more rainbows!

"the rainbow color map can significantly reduce a person’s accuracy and efficiency"
Borkin et al. (2011)

File under "Aaarrrrrrgghhhhhhh"

The world has known for at least 20 years that the rainbow colourmap is A Bad Thing, perhaps even A Very Bad Thing. IBM researchers Bernice Rogowitz and Lloyd Treinish — whose research on the subject goes back to the early 90s — wrote their famous article Why should engineers and scientists worry about color? in 1996. Visualization guru Edward Tufte highlighted the problems with it in his 1997 book Visual Explanations (if you haven't read this book, you must buy it immediately).

This isn't a matter of taste, or opinion. We know — for sure, with science! — that the rainbow is a bad choice for the visualization of data. And yet people use it every day, even in peer-reviewed literature. And — purely anecdotally — it seems to be especially rife in geoscience <citation needed>.

### Why are we talking about this?

The rainbow colourmap suffers from a number of severe problems:

• It's been linked to inferior image interpretation by professionals (Borkin et al 2011).
• It introduces ambiguity into the display: are we looking at the data's distribution, or the colourmap's?
• It introduces non-existent structure into the display — notice the yellow and cyan stripes, which manifest as contours:
• Colourblind people cannot read the colours properly — I made this protanopic simulation with Coblis
• It does not have monotonically increasing lightness, so you can't reproduce it in greyscale.
• There's no implicit order to hues, so it's hard to interpret meaning intuitively.
• On a practical note, it uses every available colour, leaving you none for annotation.

For all of these reasons, MATLAB and Matplotlib no longer use rainbow-like colourmaps by default. And neither should you.

### But I like rainbows!

People tend like things that are bad for them. Chris Jackson (Imperial, see here and here) and Bert Bril (dGB, in Slack) have both expressed an appreciation for rainbow-like colourmaps, or at least an indifference. Bert went so far as to say he doesn't like 'perceptual' colourmaps — those that monotonically and linearly increase in brightness.

I don't think indifference is allowed. Research with professional image interpreters has shown us that rainbow colourmaps impair the quality of their work. We know that these colours are hard for colourblind people to use. The practical issues of not being readable in greyscale and leaving no colours for annotation are always present. There's just no way we can ask, "Does it matter?" — at least not without offering some evidence that goes beyond mere anecdote.

I think what people like is the colour variance — it acts like contours, highlighting subtle features in the surface. Some of this extra detail is probably noise, but some is certainly signal, maybe even opportunity.

See what you think of these renderings of the seafloor pick on the Penobscot dataset, offshore Nova Scotia (licensed CC-BY-SA by dGB Earth Sciences and The Government of Nova Scotia). The top row are some rainbow-like colourmaps, all bad. The others are a selection of (more-or-less) perceptually awesome colourmaps. The names under each map are the names of the colourmaps in Python's matplotlib package.

### The solution

We know what kind of colourmaps are good for interpretation: those that increase linearly and monotonically in brightness, with no jumps or stripes of luminance. I've linked to lots of places where you can read about these — see the end of the post. You already know one perceptual colourmap: the humble Greyscale. But there are lots of others, so let's start with one of them.

Next, instead of using something that acts like contours, let's try using contours!

I think that's a big improvement already. Some tips for contouring:

1. Make them thin and black, with opacity at about 0.2 to 0.5. Transparency is essential.
2. Choose a fairly small interval; use index contours if there are more than about 10.
3. Label the contours directly on a large map. State the contour interval in the caption.

Let's try hillshading instead:

Also really nice.

Given that this is a water-bottom horizon, I like the YlGnBu colourmap, which resembles the thing it is modeling. (I think this is also a good basis for selecting a colourmap, by the way, all else being equal.)

I must admit I do find a lot of these perceptual colormaps get too dark at the 'low' end, which can make annotation (or seeing contours) hard. So we will fix that with a function (see the notebook) that generates perceptually linear colourmaps.

Now tell me the spectrum beats a perceptual colourmap...

Let's check that it is indeed colourblind-safe and grey-safe:

There you have it. If you care about your data and your readers, avoid rainbow-like colourmaps in the lab and in publications. Go perceptual!

The Python code and data to generate these images is available on GitHub.

Better yet, click here to play with the data right in your browser!

What do you think? Are rainbow colourmaps here to stay?

### Still not convinced?

For goodness sake, just listen to Kristin Thyng for 20 minutes:

29 Comments

### Matt Hall

Matt is a geoscientist in Nova Scotia, Canada. Founder of Agile Scientific, co-founder of The HUB South Shore. Matt is into geology, geophysics, and machine learning.

# Not getting hacked

This kind of password is horrible for lots of reasons. The real solution to password madness is a password manager.

The end of the year is a great time to look around at your life and sort stuff out. One of the things you almost certainly need to sort out is your online security. Because if you haven't been hacked already (you probably have), you're just about to be.

Just look at some recent stories from the world of data security:

There are plenty of others; Wired has been keeping track of them — read more here. Or check out Wikipedia's list.

Despite all this, I see hardly anyone using a password manager, and anecdotally I hear that hardly anyone uses two-factor authentication either. This tells me that at least 80% of smart people, inlcuding lots of my friends and relatives, are in daily peril. Oh no!

After reading this post, I hope you do two things:

• Start using a password manager. If you only do one thing, do this.
• Turn on two-factor authentication for your most vulnerable accounts.

### Start using a password manager

Please, right now, download and install LastPass on every device and in every browser you use. It's awesome:

• It stores all your passwords! This way, they can all be different, and each one can be highly secure.
• It generates secure, random passwords for new accounts you create.
• It scores you on the security level of your passwords, and lets you easily change insecure ones.
• The free version is awesome, and the premium version is only \$2/month.

There are other password managers, of course, but I've used this one for years and it's excellent. Once you're set up, you can start changing passwords that are insecure, or re-used on multiple sites... or which are at Uber, Yahoo, or Equifax.

One surprise from using LastPass is being able to count the number of accounts I have created around the web over the years. I have 473 accounts stored in LastPass! That's 473 places to get hacked... how many places are you exposed?

The one catch: you need a bulletproof key for your password manager. Best advice: use a long pass-phrase instead.

The obligatory password cartoon, by xkcd and licensed CC-BY-NC

### Two-factor authentication

Sure, it's belt and braces — but you don't want your security trousers to fall down, right?

Er, anyway, the point is that even with a secure password, your password can still be stolen and your account compromised. But it's much, much harder if you use two-factor authentication, aka 2FA. This requires you to enter a code — from a hardware key or an app, or received via SMS — as well as your password. If you use an app, it introduces still another layer of security, because your phone should be locked.

I use Google's Authenticator app, and I like it. There's a little bit of hassle the first time you set it up, but after that it's plain sailing. I have 2FA turned on for all my 'high risk' accounts: Google, Twitter, Facebook, Apple, AWS, my credit card processor, my accounting software, my bank, my domain name provider, GitHub, and of course LastPass. Indeed, LastPass even lets me specify that logins must originate in Canada.

### What else can you do?

There are some other easy things you can do to make yourself less hackable:

• Install updates on your phones, tablets, and other computers. Keep browsers and operating systems up to date.
• Be on high alert for phishing attempts. Don't follow links to sites like your bank or social media sites — type them into your browser if possible. Be very suspicious of anyone contacting you, especially banks.
• Don't use USB sticks. The cloud is much safer — I use Dropbox myself, it's awesome.

For more tips, check out this excellent article from Motherboard on not getting hacked.

4 Comments

### Matt Hall

Matt is a geoscientist in Nova Scotia, Canada. Founder of Agile Scientific, co-founder of The HUB South Shore. Matt is into geology, geophysics, and machine learning.

# x lines of Python: load curves from LAS

Welcome to the latest x lines of Python post, in which we have a crack at some fundamental subsurface workflows... in as few lines of code as possible. Ideally, x < 10.

We've met curves once before in the series — in the machine learning edition, in which we cheated by loading the data from a CSV file. Today, we're going to get it from an LAS file — the popular standard for wireline log data.

Just as we previously used the pandas library to load CSVs, we're going to save ourselves a lot of bother by using an existing library — lasio by Kent Inverarity. Indeed, we'll go even further by also using Agile's library welly, which uses lasio behind the scenes.

The actual data loading is only 1 line of Python, so we have plenty of extra lines to try something more ambitious. Here's what I go over in the Jupyter notebook that goes with this post:

1. Load an LAS file with lasio.
2. Look at its header.
3. Look at its curve data.
4. Inspect the curves as a pandas DataFrame.
5. Load the LAS file with welly.
6. Look at welly's Curve objects.
7. Plot part of a curve.
8. Smooth a curve.
9. Export a set of curves as a matrix.
10. BONUS: fix some broken things in the file header.

Each one of those steps is a single line of Python. Together, I think they cover many of the things we'd like to do with well data once we get our hands on it. Have a play with the notebook and explore what you can do.

Next time we'll take things a step further and dive into some seismic petrophysics.

Comment

### Matt Hall

Matt is a geoscientist in Nova Scotia, Canada. Founder of Agile Scientific, co-founder of The HUB South Shore. Matt is into geology, geophysics, and machine learning.

# x lines of Python: read and write CSV

A couple of weeks ago, in Murphy's Law for Excel, I wrote about the dominance of spreadsheets in applied analysis, and how they may be getting out of hand. Then in Organizing spreadsheets I wrote about how — if you are going to store data in spreadsheets — to organize your data so that you do the least amount of damage. The general goal being to make your data machine-readable. Or, to put it another way, to allow you to save your data as comma-separated values or CSV files.

CSV is the de facto standard way to store data in text files. They are human-readable, easy to parse with multiple tools, and they compress easily. So you need to know how to read and write them in your analysis tool of choice. In our case, this is the Python language. So today I present a few different ways to get at data stored in CSV files.

### How many ways can I read thee?

In the accompanying Jupyter Notebook, we read a CSV file into Python in six different ways:

1. Using the pandas data analysis library. It's the easiest way to read CSV and XLS data into your Python environment...
2. ...and can happily consume a file on the web too. Another nice thing about pandas. It also writes CSV files very easily.
3. Using the built-in csv package. There are a couple of standard ways to do this — csv.reader...
4. ...and csv.DictReader. This library is handy for when you don't have (or don't want) pandas.
5. Using numpy, the numeric library for Python. If you just have a CSV full of numbers and you want an array in the end, you can skip pandas.
6. OK, it's not really a CSV file, but for the finale we read a spreadsheet directly from Google Sheets.

I usually count my lines diligently in these posts, but not this time. With pandas you're looking at a one-liner to read your data:

df = pd.read_csv("myfile.csv")

and a one-liner to write it out again. With csv.DictReader you're looking at 3 lines to get a list of dicts (but watch out: your numbers will be strings). Reading a Google Doc is a little more involved, not least because you'll need to set up an app and get an API key to handle authentication.

That's all there is to CSV files. Go forth and wield data like a pro!

Next time in the xlines of Python series we'll look at reading seismic station data from the web, and doing a bit of time-series analysis on it. No more stuff about spreadsheets and CSV files, I promise :)

The thumbnail image is based on the possibly apocryphal banksy image of an armed panda, and one of texturepalace.com's CC-BY textures.

3 Comments

### Matt Hall

Matt is a geoscientist in Nova Scotia, Canada. Founder of Agile Scientific, co-founder of The HUB South Shore. Matt is into geology, geophysics, and machine learning.

# Organizing spreadsheets

A couple of weeks ago I alluded to ill-formed spreadsheets in my post Murphy's Law for Excel. Spreadsheets are clearly indispensable, and are definitely great for storing data and checking CSV files. But some spreadsheets need to die a horrible death. I'm talking about spreadsheets that look like this (click here for the entire sheet):

This spreadsheet has several problems. Among them:

• The position of a piece of data changes how I interpret it. E.g. a blank row means 'new sheet' or 'new well'.
• The cells contain a mixture of information (e.g. 'Site' and the actual data) and appear in varying units.
• Some information is encoded by styles (e.g. using red to denote a mineral species). If you store your sheet as a CSV (which you should), this information will be lost.
• Columns are hidden, there are footnotes, it's just a bit gross.

Using this spreadsheet to make plots, or reading it with software, with be a horrible experience. I will probably swear at my computer, suffer a repetitive strain injury, and go home early with a headache, cursing the muppet that made the spreadsheet in the first place. (Admittedly, I am the muppet that made this spreadsheet in this case, but I promise I did not invent these pathologies. I have seen them all.)

### Let's make the world a better place

Consider making separate sheets for the following:

• Raw data. This is important. See below.
• Computed columns. There may be good reasons to keep these with the data.
• Charts.
• 'Tabulated' data, like my bad spreadsheet above, with tables meant for summarization or printing.
• Some metadata, either in the file properties or a separate sheet. Explain the purpose of the dataset, any major sources, important assumptions, and your contact details.
• A rich description of each column, with its caveats and assumptions.

The all-important data sheet has its own special requirements. Here's my guide for a pain-free experience:

• No computed fields or plots in the data sheet.
• No hidden columns.
• No semantic meaning in formatting (e.g. highlighting cells or bolding values).
• Headers in the first row, only data in all the other rows.
• The column headers should contain only a unique name and [units], e.g. Depth [m], Porosity [v/v].
• Only one type of data per column: text OR numbers, discrete categories OR continuous scalars.
• No units in numeric data cells, only quantities. Record depth as 500, not 500 m.
• Avoid keys or abbreviations: use Sandstone, Limestone, Shale, not Ss, Ls, Sh.
• Zero means zero, empty cell means no data.
• Only one unit per column. (You only use SI units right?)
• Attribution! Include a citation or citations for every record.
• If you have two distinct types or sources of data, e.g. grain size from sieve analysis and grain size from photomicrographs, then use two different columns.
• Personally, I like the data sheet to be the first sheet in the file, but maybe that's just me.
• Check that it turns into a valid CSV so you can use this awesome format.

After all that, here's what we have (click here for the entire sheet):

The same data as the first image, but improved. The long strings in columns 3 and 4 are troublesome, but we can tolerate them. Click to enlarge.

Maybe the 'clean' analysis-friendly sheet looks boring to you, but to me it looks awesome. Above all, it's easy to use for SCIENCE! And I won't have to go home with a headache.

The data in this post came from this Cretaceous shale dataset [XLS file] from the government of Manitoba. Their spreadsheet is pretty good and only breaks a couple of my golden rules. Here's my version with the broken and fixed spreadsheets shown here. Let me know if you spot something else that should be fixed!

8 Comments

### Matt Hall

Matt is a geoscientist in Nova Scotia, Canada. Founder of Agile Scientific, co-founder of The HUB South Shore. Matt is into geology, geophysics, and machine learning.

# Murphy's Law for Excel

Where would scientists and engineers be without Excel? Far, far behind where they are now, I reckon. Whether it's a quick calculation, or making charts for a thesis, or building elaborate numerical models, Microsoft Excel is there for you. And it has been there for 32 years, since Douglas Klunder — now a lawyer at ACLU — gave it to us (well, some of us: the first version was Mac only!).

We can speculate about reasons for its popularity:

• It's relatively easy to use, and most people started long enough ago that they don't have to think too hard about it.
• You have access to it, and you know that your collaborators (boss, colleagues, future self) have access to it.
• It's flexible enough that it can do almost anything.

Figure 1 from 'Predicting bed thickness with cepstral decomposition'.

For instance, all the computation and graphics for my two 2006 articles on signal processing were done in Excel (plus the FFT add-on). I've seen reservoir simulators, complete with elaborate user interfaces, in Excel. An infinity of business-critical documents are stored in Excel (I just filled out a vendor registration form for a gigantic multinational in an Excel spreadsheet). John Nelson at ESRI made a heatmap in Excel. You can even play Pac Man.

Maybe it's gone too far:

### So what's wrong with Excel?

Nothing is wrong with it, but it's not the best tool for every number-crunching task. Why?

• Excel files are just that — files. Sometimes you want to do analysis across datasets, and a pool of data (a database) becomes more useful. And sometimes you wish nine different people didn't have nine different versions of your spreadsheet, each emailing their version to nine other people...
• The charts are rather clunky and static. They don't do well with large datasets, or in data you'd like to filter or slice dynamically.
• In large datasets, scrolling around a spreadsheet gets old pretty quickly.
• The tool is so flexible that people get carried away with pretty tables, annotating their sheets in ways that make the printed page look nice, but analysis impossible.

### What are the alternatives?

Excel is a wonder-tool, but it's not the only tool. There are alternatives, and you should at least know about them.

For everyday spreadsheeting needs, I now use Google Sheets. Collaboration is built-in. Being able to view and edit a sheet at the same time as someone else is a must-have (probably Office 365 does this now too, so if you're stuck with Excel I urge you to check). Version control — another thing I'm not sure I can live without — is built in. For real nerds, there's even a complete API. I also really like the native 'webbiness' of Google Docs, for example being able to use web API calls natively, for example getting the current CAD–USD exchange rate with GoogleFinance("CURRENCY:CADUSD").

If it's graphical analysis you want, try Tableau or Spotfire. I'm especially looking at you, reservoir engineers — you are seriously missing out if you're stuck in Excel, especially if you have a lot of columns of different types (time series, categories and continuous variables for example). The good news is that the fastest way to get data into Spotfire is... Excel. So it's easy to get started.

If you're gathering information from people, like registering the financial details of vendors for instance, then a web form is your best bet. You can set one up in Google Forms in minutes, and there are lots of similar services. If you want to use your own servers, no problem: any dev worth their wages can throw one together in a few hours.

If you're doing geoscience in Excel, like my 2006 self — filtering logs, or generating synthetics, or computing spectrums — your mind will be blown by spending a few hours learning a programming language. Your first day in Python (or Julia or Octave or R) will change your quantitative life forever.

Excel is great at some things, but for most things, there's a better way. Take some time to explore them the next time you have some slack in your schedule.

### References

Hall, M (2006). Resolution and uncertainty in spectral decomposition. First Break 24, December 2006, p 43–47.

Hall, M (2006). Predicting stratigraphy with cepstral decomposition. The Leading Edge 25 (2, Special Issue on Spectral Decomposition). doi:10.1190/1.2172313

### UPDATE

As a follow-up example, I couldn't resist sharing this recent story about an artist that draws anime characters in Excel.

3 Comments

### Matt Hall

Matt is a geoscientist in Nova Scotia, Canada. Founder of Agile Scientific, co-founder of The HUB South Shore. Matt is into geology, geophysics, and machine learning.

# Machine learning meets seismic interpretation

Agile has been reverberating inside the machine learning echo chamber this past week at EAGE. The hackathon's theme was machine learning, Monday's workshop was all about machine learning. And Matt was also supposed to be co-chairing the session on Applications of machine learning for seismic interpretation with Victor Aare of Schlumberger, but thanks to a power-cut and subsequent rescheduling, he found himself double-booked so, lucky me, he invited me to sit in his stead. Here are my highlights, from the best seat in the house.

Before I begin, I must mention the ambivalence I feel towards the fact that 5 of the 7 talks featured the open-access F3 dataset. A round of applause is certainly due to dGB Earth Sciences for their long time stewardship of open data. On the other hand, in the sardonic words of my co-chair Victor Aarre, it would have been quite valid if the session was renamed The F3 machine learning session. Is it really the only quality attribute research dataset our industry can muster? Let's do better.

### Using seismic texture attributes for salt classification

Ghassan AlRegib ruled the stage throughout the session with not one, not two, but three great talks on behalf of himself and his grad students at Georgia Institute of Technology (rather than being a show of bravado, this was a result of problems with visas). He showed some exciting developments in shallow learning methods for predicting facies in seismic data. In addition to GLCM attributes, he also introduced a couple of new (to me anyway) attributes for salt classification. Namely, textural gradient and a thing he called seismic saliency, a metric modeled after the human visual system describing the 'reaction' between relative objects in a 3D scene.

Twelve Seismic attributes used for multi-attribute salt-boundary classification. (a) is RMS Amplitude, (B) to (M) are TEXTURAL attributes. See abstract for details. This figure is copyright of Ghassan AlRegib and licensed CC-BY-SA by virtue of being generated from the F3 dataset of dGB and TNO.

Ghassan also won the speakers' lottery, in a way. Due to the previous day's power outage and subsequent reshuffle, the next speaker in the schedule was a no-show. As a result, Ghassan had an extra 20 minutes to answer questions. Now for most speakers that would be a public-speaking nightmare, but Ghassan hosted the onslaught of inquiring minds beautifully. If we hadn't had to move on to the next next talk, I'm sure he could have entertained questions all afternoon. I find it fascinating how unpredictable events like power outages can actually create the conditions for really effective engagement.

### Salt classification without using attributes (using deep learning)

Matt reported on Anders Waldeland's work a year ago, and it was interesting to see how his research has progressed, as he nears the completion of his thesis.

Anders successfully demonstrated how convolutional neural networks (CNNs) can classify salt bodies in seismic datasets. So, is this a big deal? I think it is. Indeed, Anders's work seems like a breakthough in seismic interpretation, at least of salt bodies. To be clear, I don't think this means that it is time for seismic interpreters to pack up and go home. But maybe we can start looking forward to spending our time doing less tedious things than picking complex salt bodies.

One slice of a 3d seismic volume with two CLASS LABELS: Salt (red) and Not SALT (GREEN). This is the training data. On the right: Extracted 3D salt body in the same dataset, coloured by elevation. Copyright of A Waldeland, used with permission.

He trained a CNN on one manually labeled slice of a 3D cube and used the network to automatically classify the full 3D salt body (on the right in the figure). Conventional algorithms for salt picking, such as that used by AlRegib (see above), typically rely on seismic attributes to define a feature space. This requires professional insight and judgment, and is prone to error and bias. Nicolas Audebert mentioned the same shortcoming in his talk in the workshop Matt wrote about last week. In contrast, the CNN algorithm works directly on the seismic data, learning the most discriminative filters on its own, no attributes needed

### Intuition training

Machine learning isn't just useful for computing in the inverse direction such as with inversion, seismic interpretation, and so on. Johannes Amtmann showed us how machine learning can be useful for ranking the performance of different clustering methods using forward models. It was exciting to see: we need to get back into the habit of forward modeling, each and every one of us. Interpreters build synthetics to hone their seismic intuition. It's time to get insanely good at building forward models for machines, to help them hone theirs.

There were so many fascinating problems being worked on in this session. It was one of the best half-day sessions of technical content I've ever witnessed at a subsurface conference. Thanks and well done to everyone who presented.