# FORCE ML 2019: project round-up

The FORCE Machine Learning Hackathon and Symposium were a great success again this year (read all about last year). Kudos to Peter Bormann of ConocoPhillips Norge, who put the programme together — held over 3 days at the NPD in Stavanger, Norway, together. Here’s a round-up of the projects.

A visualization of how human-generated rock descriptions were distributed with respect to porosity measured from the core plug.

### from.cr.dscrptn.to.clssfctn

The team took up Peter’s challenge of translating abbreviated core descriptions (hence the strange team name) into something useful. Overall, the pipeline was clean > translate > classify. Cleaning was required to deal with a lot of ‘as above’ and other expediencies. As a first pass for translation, they tried simply substituting complete words for abbreviations: sandstone for ss, limestone for ls, and so on, but had more success with a bidirectional LSTM.

### Find it clean it analyse it

Given a pile of undifferentiated well files containing over 40,000 curves including LAS and DLIS, the team wanted to find and analyse image log data, especially FMIs. They successfully read the data they wanted with the new dlisio library from Equinor, then threw some texture analysis at it after interpolating across the data gaps and resampling to 360 bins. They then applied a k-means clustering with 6 clusters, to find some key textures in the data. GitHub repo.

### Just Surf

Using a synthetic dataset, the team (mostly coders from Emerson) set out to use convolutional deep neural networks to check if the structural model seems sensible, quantify the uncertainty, and validate the gridding algorithm used. The team brought 100 realizations for each map, and tried various combinations of single realizations and statistics from the cohort. They found that transfer learning on ResNet-50 did better than training from scratch. They said they looked forward to building on the work to produce tools for quality assurance, and they hope to use seismic data next time.

### Siamese seismic

The team applied a Siamese network, normally used on human faces, to the problem of classifying 3D seismic facies. The method is semi-supervised: the network is trained on the entire dataset, with some labeled subimages. This establises a latent space (a 3D latent space of the F3 seismic data is shown to the right) with semantically meaningful norms (i.e. distance between points means something useful), in which clusters can be found. Classification on unseen subimages is done in the latent space. The team almost had an app working, and also produced the start of a new open dataset of labels for the F3 seismic volume. The team was rewarded with a prize for innovation. GitHub repo.

### Lost Frequencies

This team formed spontaneously at the Tuesday meetup when it looked like there might not be any seismic projects! They set out to estimate attenuation using neural networks. This involved learning to pick maximum frequency from the peak frequency plus the seismic trace. They found that a 1D CNN did best out of all the methods they tried, and that including well logs somehow would likely improve the result quite a bit.

### Rock Pandas

A creenshot from the app the team built. Each circle is a collection of documents that can be filtered dynamically.

Geolocalizing documents is a much-needed task in any pile of PDF files. This team got lots of documents from Peter, with the goal to put them on a map. The characteristically diverse team extracted keywords from an NPD corpus, with preprocessing and regular expressions for well names and so on. They built a nice-looking slippy map app allowing a user to click on a well or field entity, and see the documents associated with the location. Documents hitting multiple keywords were tagged on many entities. The Rock Pandas team won the coveted People's Choice Award, for making a great start on a hard problem, and producing a working app in limited time. GitHub repo.

### Core team

In a reprise of a project last year, the team set out to get grain size from core photos. But then they thought: why not cut out the middle man and go straight for reservoir parameters? So they tried to get permeability from core photos. Using simple models, they got an accuracy of 60% with linear regression, and 69% with a neural network. Although they had some glitches in their approach (using porosity and not using depth, for example), they built a first pipeline for an interesting problem.

Some Unsupervised team members clustering around a problem.

### Somehow Unsupervised

Unsupervised learning has been a theme in a coupe of previous hackathons (Copenhagen and FORCE 2018), and it was good to see another iteration of these exciting ideas. The team used the very nice Geolink dataset. After filtering out poor quality data (based on caliper and local statistics), the team applied dimensionality reduction methods like UMAP and t-SNE (these are conceptually like PCA, but much more effective) to reduce the dataset to just 2 dimensions — allowing them to make lots of crossplots. Coloring points by lithology, sand type, GR, or fluid type allowed them to look at all sorts of trends and patterns. The team won a prize for the amount of ground they covered and the attractive plots. GitHub repo.

### Rock Stars

The Rock Stars took on Peter’s Make me that rock project. He wants an app which provides plausible rock properties and uncertainty for any location, depth, and formation on the Norwegian shelf. This gigantic team (12 of them!) decided to cluster the data first, then build a model for each cluster. They built an app which could indeed provide porosity and permeability given a location and depth. That such a huge team managed to converge on anything was an achievement, and they won a prize for taking on a tough project and getting a good way into it.

That’s it for this year! Thanks to all the participants for a fun week, and thank you to the sponsors (below) for supporting the event. Hope to see you in 2020.

More pictures from the event. Thanks to Alex Schaaf and the others that took photos.

### Matt Hall

Matt is a geoscientist in Nova Scotia, Canada. Founder of Agile Scientific, co-founder of The HUB South Shore. Matt is into geology, geophysics, and machine learning.

# Superpowers for striplogs

In between recent courses and hackathons, I’ve been chipping away at some new features in striplog. An open-source Python package, striplog handles irregularly sampled data, like lithologic intervals, chronostratigraphic zones, or anything that isn’t regularly sampled like, say, a well log. Instead of defining what is present at every depth location, you define intervals with a top and a base. The interval can contain whatever you like: names of rocks, images, or special core analyses, or anything at all.

You can read about all of the newer features in the changelog, but let’s look at a couple of the more interesting ones…

### Binary morphology filters

Sometimes we’d like to simplify a striplog a bit, for example by ‘weeding out’ the thin beds. The tool has long had a method prune to systematically remove all intervals (e.g. beds) thinner than some cutoff; one can then optionally anneal the gaps, and merge the resulting striplog to combine similar neighbours. The result of this sequence of operations (prune, anneal, merge, or ‘PAM’) is shown below on the left.

If the intervals of a striplog have at least one property of a binary nature — with only two states, like sand and shale, or pay and non-pay — one can also use binary morphological operations. This well-known image processing technique aims to simplify data by eliminating small things. The result of opening vs closing operations is shown above.

### Markov chains

I wrote about Markov chains earlier this year; they offer a way to identify bias in the order of units in a stratigraphic column. I’ve now put all the code into striplog — albeit not in a very fancy way. You can import the Markov_chain class from striplog.markov, then use it in exactly the same way as in the notebook I shared in that Markov chain post:

I started with some pseudorandom data (top) representing a known succession of Mudstone (M), Siltstone (S), Fine Sandstone (F) and coarse sandstone (C). Then I generate a Markov chain model of the succession. The chi-squared test indicates that the succession is highly unlikely to be unordered. We can look at the normalized difference matrix, generate a synthetic sequence of lithologies, or plot the difference matrix as a heatmap or a directed graph. The graph illustrates the order we originally imposed: M-S-F-C.

There is one additional feature compared to the original implementation: multi-step Markov chains. Previously, I was only looking at immediately adjacent intervals (beds or whatever). Now you can look at actual vs expected transition frequencies for next-but-one interval, or next-but-two. Don’t ask me how to interpret that information though…

### Other new things

• New ways to anneal. Now the user can choose whether the gaps in the log are filled in by flooding upwards (that is, by extending the interval below the gap upwards), flooding downwards (extending the upper interval), or flooding symmetrically into the middle from both above and below, meeting in the middle. (Note, you can also fill gaps with another component, using the fill() method.)

• New merging strategies. Now you can merge overlapping intervals by precedence, rather than by blending the contents of the intervals. Precedence is defined however you like; for example, you can choose to keep the thickest interval in all overlaps, or if intervals have a date, you could keep the latest interval.

• Improved bar charts. The histogram is easier to use, and there is a new bar chart summary of intervals. The bars can be sorted by any property you like.

### Try it out and help add new stuff

You can install the latest version of striplog using pip. It’s as easy as:

 pip install striplog

Start by checking out the tutorial notebooks in the repo, especially Striplog_basics.ipynb. Let me know how you get on, or jump on the Software Underground Slack to ask for help.

Here are some things I’d like striplog to support in the future:

• Stratigraphic prediction.

• Well-to-well correlation.

• More interactions with well logs.

What ideas do you have? Or maybe you can help define how these things should work? Either way, do get in touch or check out the Striplog repository on GitHub.

1 Comment

### Matt Hall

Matt is a geoscientist in Nova Scotia, Canada. Founder of Agile Scientific, co-founder of The HUB South Shore. Matt is into geology, geophysics, and machine learning.

### Difficulty rating: Beginner

We'd often like to load images into Python. Once loaded, we might want to treat them as images, for example cropping them, saving in another format, or adjusting brightness and contrast. Or we might want to treat a greyscale image as a two-dimensional NumPy array, perhaps so that we can apply a custom filter, or because the image is actually seismic data.

This image-or-array duality is entirely semantic — there is really no difference between images and arrays. An image is a regular array of numbers, or, in the case of multi-channel rasters like full-colour images, a regular array of several numbers: one for each channel. So each pixel location in an RGB image contains 3 numbers:

In general, you can go one of two ways with images:

1. Load the image using a library that 'knows about' (i.e. uses language related to) images. The preeminent tool here is pillow (which is a fork of the grandparent of all Python imaging solutions, PIL).
2. Load the image using a library that knows about arrays, like matplotlib or scipy. These wrap PIL, making it a bit easier to use, but potentially losing some options on the way.

The Jupyter Notebook accompanying this post shows you how to do both of these things. I recommend learning to use some of PIL's power, but knowing about the easier options too.

Here's the way I generally load an image:

from PIL import Image
im = Image.open("my_image.png")

(One strange thing about pillow is that, while you install it with pip install pillow, you still actually import and use PIL in your code.) This im is an instance of PIL's Image class, which is a data structure especially for images. It has some handy methods, like im.crop(), im.rotate(), im.resize(), im.filter(), im.quantize(), and lots more. Doing some of these operations with NumPy arrays is fiddly — hence PIL's popularity.

But if you just want your image as a NumPy array:

import numpy as np
arr = np.array(im)

Note that arr is a 3-dimensional array, the dimensions being row, column, channel. You can go off with arr and do whatever you need, then cast back to an Image with Image.fromarray(arr).

All this stuff is demonstrated in the Notebook accompanying this post, or you can use one of these links to run it right now in your browser:

Run the accompanying notebook in MyBinder

Comment

### Matt Hall

Matt is a geoscientist in Nova Scotia, Canada. Founder of Agile Scientific, co-founder of The HUB South Shore. Matt is into geology, geophysics, and machine learning.

# Advice for a new hacker

So you’ve signed up for a hackathon — or maybe you’ve seen an event and you’re still thinking about it.

First thing: I can almost guarantee that you will not regret it, so if you haven’t committed yet, I challenge you to go and sign up now.

But even once you’ve chosen to go, maybe you feel nervous about your skills, or are worried about spending two days with strangers, or aren’t sure about the idea of competitive coding. Someone asked me recently how to prepare — technically and mentally — for the event.

I should say that I’ve only participated in a couple of hackathons, so I definitely don’t know everything there is to know. But I have organized more than 20 hackathons, and helped people skill up for them and (I hope!) enjoy them. Here are the top 10-ish things you can to do to get the most out of the event:

1. Brush up on your coding. Before the event, find out a bit about what kinds of projects are in the offing. If it’s a machine learning theme, brush up on your data science. Maybe image processing or text processing will be needed. Data management skills and database manipulation are always appreciated. Familiarty with a cloud environment, e.g. AWS, will help.

2. Find a friend. Either take someone with you, or find a friendly face when you get there. It’s 100% possible to navigate the experience on your own, but much more fun with a partner.

3. Dive in. You get out of the event what you put in. It’s like most learning experiences. You need an open mind, an enthusiastic demeanour, and a can-do attitude.

4. Contribute. There’s never enough time, so you are a much-needed part of your team, but unless there’s a strong effort to coordinate the project, it’ll be a bit unstructured. You’ll have to take the initiative on things.

5. Use a kanban. To help team members see the big picture and select tasks for themselves, put them on stickies on a nearby board. Make 3 areas: ‘to do’, ‘in progress’ and ‘done’. The goal is to move them from left to right.

6. Ask for help. Every event Agile runs has non-hackers around to help out with stuff — anything from dietary needs to datasets to coding advice. Don’t get stuck on something, find someone to help you.

7. Take breaks. You and your team should go for a short walk every 90 minutes or so. Relax a bit, but also get caught up: get progress reports from everyone, re-evaluate the goals, identify issues. You will find more clarity away from your keyboards.

8. Work backwards from the demo. A good strategy is to outline what would make a killer demo of the project you have selected. Include at least one “Wow” feature if at all possible. Then work out what you need to either fake or build to make that demo. Build what you can, fake the rest.

9. Check in with the other teams. This might not fly at highly competitive events, but at more casual affairs or if everyone is working on different projects, try chatting to some other teams, especially during breaks.

10. Label your equipment. Hackathons are pretty chaotic, and although 99.9% of hackers are awesome, it’s still a roomful of strangers, so label the gear you care about. And of course keep your phone and computer locked.

11. Reciprocate. Almost all these bits of advice have corollaries: be friendly and welcoming, accept contributions from others, give help if asked, and so on. Hackathons are social events as much as technical ones — enjoy meeting and collaborating with others.

If you have signed up for an event — I hope you love it! Do let us know how you get along. Or, if you’ve already been to a hackathon and have some advice to share — leave a comment below.

If you’re looking for an event to go to and you’re in western Europe — here’s one! It’s the FORCE Machine Learning Hackathon in Stavanger, Norway. I recently wrote about it — check it out.

If you’re looking for subsurface or geoscience project ideas, then I have a lot of reading for you. Check out the long list of hackathons reports on this blog. You can also dive into the Software Underground Slack to discuss project ideas there.

### Matt Hall

Matt is a geoscientist in Nova Scotia, Canada. Founder of Agile Scientific, co-founder of The HUB South Shore. Matt is into geology, geophysics, and machine learning.

# x lines of Python: Physical units

### Difficulty rating: Intermediate

Have you ever wished you could carry units around with your quantities — and have the computer figure out the best units and multipliers to use?

pint is a nice, compact library for doing just this, handling all your dimensional analysis needs. It can also detect units from strings. We can define our own units, it knows about multipliers (kilo, mega, etc), and it even works with numpy and pandas.

To use it in its typical mode, we import the library then instantiate a UnitRegistry object. The registry contains lots of physical units:

import pint
units = pint.UnitRegistry()
thickness = 68 * units.m

Now thickness is a Quantity object with the value <Quantity(68, 'meter')>, but in Jupyter we see a nice 68 meter (as far as I know, you're stuck with US spelling).

Let's make another quantity and multiply the two:

area = 60 * units.km**2
volume = thickness * area

This results in volume having the value <Quantity(4080, 'kilometer ** 2 * meter')>, which pint can convert to any units you like, as long as they are compatible:

>>> volume.to('pint')
8622575788969.967 pint

More conveniently still, you can ask for 'compact' units. For example, volume.to_compact('pint') returns 8.622575788969966 terapint. (I guess that's why we don't use pints for field volumes!)

There are lots and lots of other things you can do with pint; some of them — dealing with specialist units, NumPy arrays, and Pandas dataframes — are demonstrated in the Notebook accompanying this post. You can use one of these links to run this right now in your browser if you like:

Run the accompanying notebook in MyBinder

Run the notebook in Google Colaboratory (note the install cell at the beginning)

That's it for pint. I hope you enjoy using it in your scientific computing projects. If you have your own tips for handling units in Python, let us know in the comments!

There are some other options for handling units in Python:

1 Comment

### Matt Hall

Matt is a geoscientist in Nova Scotia, Canada. Founder of Agile Scientific, co-founder of The HUB South Shore. Matt is into geology, geophysics, and machine learning.

# The hack returns to Norway

Last autumn Agile helped Peter Bormann (ConocoPhillips Norge) and the FORCE consortium host the first geo-flavoured hackathon in Norway. Maybe you were there, or maybe you read about the nine fascinating machine learning projects here on the blog. If so, you’ll know it was a great event, so we’re doing it again!

### Hackthon: 18 and 19 SeptemberSymposium: 20 September

Check out last year’s projects here. Projects included Biostrat!, Virtual Metering, sketch2seis, and AVO ML — a really interesting AVO approach exploiting latent spaces (see image, right). Most of them are on GitHub and could be extended this year.

Part of what I love about these things is that we have no idea what the projects will be. As last year, there’ll be a pre-hackathon meetup in Storhaug the evening before Day 1 (on 17 September) — we’ll figure it all out there. In the meantime, if you have an idea check out the link at the end of this post where you can share and discuss it with others.

The hackathon will be followed by a one-day symposium on machine learning in the subsurface (left). This well attended event was also excellent last year, and promises to deliver again in 2019. Peter did a briliant job of keeping things rooted in real results from real research, so you won’t be subjected to the parade of marketing talks you might have been subjected to at certain other conferences.

Find out more and sign up on NPD.no! Don’t delay; places are limited.

Submit and discuss project ideas on Agile’s Events page. Note that this does not sign you up for the event.

Get on softwareunderground.com/slack to discuss the event in the #force-hack-2019 channel.

See you there!

Comment

### Matt Hall

Matt is a geoscientist in Nova Scotia, Canada. Founder of Agile Scientific, co-founder of The HUB South Shore. Matt is into geology, geophysics, and machine learning.

# Impact structures in seismic

I saw this lovely tweet from PGS yesterday:

Kudos to them for sharing this. It’s always great to see seismic data and interpretations on Twitter — especially of weird things. And impact structures are just cool. I’ve interpreted them in seismic myself. Then uninterpreted them.

I wish PGS were able to post a little more here, like a vertical profile, maybe a timeslice. I’m sure there would be tons of debate if we could see more. But not all things are possible when it comes to commercial seismic data.

It’s crazy to say more about it without more data (one-line interpretation, yada yada). So here’s what I think.

### Impact craters are rare

There are at least two important things to think about when considering an interpretation:

1. How well does this match the model? (In this case, how much does it look like an impact structure?)

2. How likely are we to see an instance of this model in this dataset? (What’s the base rate of impact structures here?)

Interpreters often forget about the second part. (There’s another part too: How reliable are my interpretations? Let’s leave that for another day, but you can read Bond et al. 2007 as homework if you like.)

The problem is that impact structures, or astroblemes, are pretty rare on Earth. The atmosphere takes care of most would-be meteorites, and then there’s the oceans, weather, tectonics and so on. The result is that the earth’s record of surface events is quite irregular compared to, say, the moon’s. But they certainly exist, and occasionally pop up in seismic data.

In my 2011 post Reliable predictions of unlikely geology, I described how skeptical we have to be when predicting rare things (‘wotsits’). Bayes’ theorem tells us that we must modify our assigned probability (let’s say I’m 80% sure it’s a wotsit) with the prior probability (let’s pretend a 1% a priori chance of there being a wotsit in my dataset). Here’s the maths:

$$\ \ \ P = \frac{0.8 \times 0.01}{0.8 \times 0.01\ +\ 0.2 \times 0.99} = 0.0388$$

In other words, the conditional probability of the feature being a rare wotsit, given my 80%-sure interpretation, is 0.0388 or just under 4%.

As cool as it would be to find a rare wotsit, I probably need a back-up hypothesis. Now, what’s that base rate for astroblemes? (Spoiler: it’s much less than 1%.)

### Just how rare are astroblemes?

First things first. If you’re interpreting circular structures in seismic, you need to read Simon Stewart’s paper on the subject (Stewart 1999), and his follow-up impact crater paper (Stewart 2003), which expands on the topic. Notwithstanding Stewart’s disputed interpretation of the Silverpit not-a-crater structure in the North Sea, these two papers are two of my favourites.

According to Stewart, the probability P of encountering r craters of diameter d or more in an area A over a time period t years is given by:

$$\ \ \ P(r) = \mathrm{e}^{-\lambda A}\frac{(\lambda A)^r}{r!}$$

where

$$\ \ \ \lambda = t n$$

and

$$\ \ \ \log n = - (11.67 \pm 0.21) - (2.01 \pm 0.13) \log d$$

We can use these equations to compute the probability plot on the right. It shows the probability of encountering an astrobleme of a given diameter on a 2400 km² seismic survey spanning the Phanerozoic. (This doesn’t take into account anything to do with preservation or detection.) I’ve estimated that survey size from PGS’s tweet, and I’ve highlighted the 7.5 km diameter they mentioned. The probability is very small: about 0.00025. So Bayes tells us that an 80%-confident interpretation has a conditional probability of about 0.001. One in a thousand.

Here’s the Jupyter notebook I used to make that chart using Python.

### So what?

My point here isn’t to claim that this structure is not an astrobleme. I haven’t seen the data, I’ve no idea. The PGS team mentioned that they considered the possiblity of influence by salt or shale, and fluid escape, and rejected these based on the evidence.

My point is to remind interpreters that when your conclusion is that something is rare, you need commensurately more and better evidence to support the claim. And it’s even more important than usual to have multiple working hypotheses.

Last thing: if I were PGS and this was my data (i.e. not a client’s), I’d release a little cube (anonymized, time-shifted, bit-reduced, whatever) to the community and enjoy the engagement and publicity. With a proper license, obviously.

### References

Hughes, D, 1998, The mass distribution of the crater-producing bodies. In Meteorites: Flux with time and impact effects, Geological Society of London Special Publication 140, 31–42.

Davis, J, 1986, Statistics and data analysis in geology, John Wiley & Sons, New York.

Stewart, SA (1999). Seismic interpretation of circular geological structures. Petroleum Geoscience 5, p 273–285.

Stewart, SA (2003). How will we recognize buried impact craters in terrestrial sedimentary basins? Geology 31 (11), p 929–932.

Comment

### Matt Hall

Matt is a geoscientist in Nova Scotia, Canada. Founder of Agile Scientific, co-founder of The HUB South Shore. Matt is into geology, geophysics, and machine learning.

# Is your data digital or just pseudodigital?

A rite of passage for a geologist is the making of an original geological map, starting from scratch. In the UK, this is known as the ‘independent mapping project’ and is usually done at the end of the second year of an undergrad degree. I did mine on the eastern shore of the Embalse de Santa Ana, just north of Alfarras in Catalunya, Spain. (I wrote all about it back in 2012.)

The map I drew was about as analog as you can get. I drew it with Rotring Rapidograph pens on drafting film. Mistakes had to be painstakingly scraped away with a razor blade. Colour had to be added in pencil after the map had been transferred onto paper. There is only one map in existence. The data is gone. It is absolutely unreproducible.

### Digitize!

In order to show you the map, I had to digitize it. This word makes it sound like the map is now ‘digital data’, but it’s really not useful for anything scientific. In other words, while it is ‘digital’ in the loosest sense — it’s a bunch of binary bits in the cloud — it is not digital in the sense of organized data elements with semantic meaning. Let’s call this non-useful format palaeodigital. The lowest rung on the digital ladder.

You can get palaeodigital files from many state and national data repositories. For example, it’s how the Government of Nova Scotia stores its offshore seismic ‘data’ files — as TIFF files representing scans of paper sections submitted by operators. Wiggle trace, obviously, making them almost completely useless.

### Protodigital

Nobody draws map by hand anymore, that would be crazy. Adobe Illustrator and (better) Inkscape mean we can produce beautifully rendered maps with about the same amount of effort as the hand-drawn version. But… this still isn’t digital. This is nothing more than a computerized rip-off of the analog workflow. The result is almost as static and difficult to edit as it was on film. (Wish you’d used a thicker line for your fault traces on those 20 maps? Have fun editing those files!)

Let’s call the computerization of analog workflows or artifacts protodigital. I’m thinking of Word and Powerpoint. Email. SeisWorks. Techlog. We can think of data in the same way… LAS files are really just a text-file manifestation of a composite log (plus their headers are often garbage). SEG-Y is nothing more than a bunch of traces with a sidelabel.

Together, palaeodigital and protodigital data might be called pseudodigital. They look digital, but they’re not quite there.

(Just to be clear, I made all these words up. They are definitely silly… but the point is that there’s a lot of room between analog and useful, machine-learning-ready digital.)

### Digital data

So what’s at the top of the digital ladder? In the case of maps, it’s shapefiles or, better yet, GeoJSON. In these files, objects are described in terms of real geographic parameters, such at latitiude and longitude. The file contains the CRS (you know you need that, right?) and other things you might need like units, data provenance, attributes, and so on.

What makes these things truly digital? I think the following things are important:

• They can all be self-documenting

• …and can carry more or less arbitrary amounts of metadata.

• They depend on open formats, some text and some binary, that are widely used.

• There is free, open-source tooling for reading and writing these formats, usually with reference implementations in major languages (e.g. C/C++, Python, Java).

• They are composable. Without too much trouble, you could write a script to process batches of these files, adapting to their content and context.

Here’s how non-digital versions of a document, e.g. a scholoarly article, compare to digital data:

And pseudodigital well logs:

Some more examples:

• Photographs with EXIF data and geolocation.

• GIS tools like QGIS let us make beautiful maps with data.

• Drawing striplogs with a data-driven tool like Python striplog.

• A fully-labeled HDF5 file containing QC’d, machine-learning-ready well logs.

• Structured, metadata-rich documents, perhaps in JSON format.

### Watch out for pseudodigital

Why does all this matter? It matters because we need digital data before we can do any analysis, or any machine learning. If you give me pseudodigital data for a project, I’m going to spend at least 50% of my time, probably more, making it digital before I can even get started. So before embarking on a machine learning project, you really, really need to know what you’re dealing with: digital or just pseudodigital?

1 Comment

### Matt Hall

Matt is a geoscientist in Nova Scotia, Canada. Founder of Agile Scientific, co-founder of The HUB South Shore. Matt is into geology, geophysics, and machine learning.

# Training digital scientists

Gulp. My first post in… a while. Life, work, chaos, ideas — it all caught up with me recently. I’ve missed the blog greatly, and felt a regular pang of guilt at letting it gather dust. But I’m back! The 200+ draft posts in my backlog ain’t gonna write themselves. Thank you for returning and reading this one.

Recently I wrote about our continuing adventures in training; since I wrote that post in April, we’ve taught another 166 people. It occurred to me that while teaching scientists to code, we’ve also learned a bit about how to teach, and I wanted to share that too. Perhaps you will be inspired to share your skills, and together we can have exponential impact.

### Wanting to get better

As usual, it all started with not knowing how to do something, doing it anyway, then wanting to get better.

We started teaching in 2014 as rank amateurs, both as coders and as teachers. But we soon discovered the ‘teaching tech’ subculture among computational scientists. In particular, we found Greg Wilson and the Software Carpentry movement he started. By that point, it had been around for many, many years. Incredibly, Software Carpentry has helped more than 34,000 researchers ‘go digital’. The impact on science can’t be measured.

Eager as ever, we signed up for the instructor’s course. It was fantastic. The course, taught by Greg Wilson himself, perfectly modeled the thing it was offering to teach you: “Do what I say, and what I do”. This is, of course, critically important in all things, especially teaching. We accepted the content so completely that I’m not even sure we graduated. We just absorbed it and ran with it, no doubt corrupting it on the way. But it works for us.

I should preface what follows by telling you that I haven’t taken any other courses on the subject of teaching. For all I know, there’s nothing new here. That said, I have never experienced a course like Greg Wilson’s, so either the methods he promotes are not widely known, or they’re widely ignored, or I’ve been really unlucky.

The easiest way to get Greg Wilson’s wisdom is probably to read his book-slash-website, Teaching Tech Together. (It’s free, but you can get a hard copy if you prefer.) It’s really good. You can get the vibe — and much of the most important advice — from the ten Teaching Tech Together rules laid out on the main page of that site (box, right).

As you can probably tell, most of it is about parking your ego, plus most of your knowledge (for now), and orientating everything — every single thing — around the learner.

If you want to go deeper, I also recommend reading the excellent, if rather academic, How Learning Works, by Susan Ambrose (Northeastern University) and others. It’s strongly research-driven, and contains a lot of great advice. In particular, it does a great job of listing the factors that motivate students to learn (and those that demotivate them), and spelling out the various ways in which students acquire mastery of a subject.

### How to practice

It goes without saying that you’ll need to teach. A lot. Not surprisingly, we find we get much better if we teach several courses in a short period. If you’re diligent, take a lot of notes and study them before the next class, maybe it’s okay if a few weeks or months go by. But I highly doubt you can teach once or twice a year and get good at it.

Something it took us a while to get comfortable with is what Evan calls ‘mistaking’. If you’re a master coder, you might not make too many mistakes (but your expertise means you will have other problems). If you’re not a master (join the club), you will make a lot of mistakes. Embracing everything as a learning opportunity is less awkward for you, and for the students — dealing with mistakes is a core competency for all programmers.

Reflective practice means asking for, and then acting on, student feedback — every day. We ask students to write it on sticky notes. Reading these back to the class the next morning is a good way to really read it. One of the many benefits of ‘never teach alone’ is always having someone to give you feedback from another teacher’s perspective too. Multi-day courses let us improve in real time, which is good for us and for the students.

• Keep the student:instructor ratio to no more than ten; seven or eight is better.

• Take a packet of orange and a packet of green Post-It notes. Use them for names, as ‘help me’ flags, and for feedback.

• When teaching programming, the more live coding — from scratch — you can do, the better. While you code, narrate your thought process. This way, students are able to make conections between ideas, code, and mistakes.

• To explain concepts, draw on a whiteboard. Avoid slides whenever possible.

• Our co-teacher John Leeman likes to say, “I just showed you something new, what questions do you have?” This beats “Any questions?” for opening the door to engagement.

• “No-one left behind” is a nice idea, but it’s not always practical. If students can’t devote 100% to the class and then struggle because of it, you owe it to the the others to politely suggest they pick the class up again next time.

• Devote some time to the practical application of the skills you’re teaching, preferably in areas of the participants’ own choosing. In our 5-day class, we devote a whole day to getting students started on their own projects.

• Don’t underestimate the importance of a nice space, natural light, good food, and frequent breaks.

• Recognize everyone’s achievement with a small gift at the end of the class.

• Learning is hard work. Finish early every day.

### Give it a try

If you’re interested in help people learn to code, the most obvious way to start is to offer to assist or co-teach in someone else’s class. Or simply start small, offering a half-day session to a few co-workers. Even if you only recently got started yourself, they’ll appreciate the helping hand. If you’re feeling really confident, or have been coding for a year or two at least, try something bolder — maybe offer a one-day class at a meeting or conference. You will find plenty of interest.

There are few better ways to improve your own skills than to teach. And the feeling of helping people develop a valuable skill is addictive. If you give it a try, let us know how you get on!

Comment

### Matt Hall

Matt is a geoscientist in Nova Scotia, Canada. Founder of Agile Scientific, co-founder of The HUB South Shore. Matt is into geology, geophysics, and machine learning.

# TRANSFORM happened!

How do you describe the indescribable?

Last week, Agile hosted the TRANSFORM unconference in Normandy, France. We were there to talk about the open suburface stack — the collection of open-source Python tools for earth scientists. We also spent time on the state of the Software Underground, a global community of practice for digital subsurface scientists and engineers. In effect, this was the first annual Software Underground conference. This was SwungCon 1.

### The space

I knew the Château de Rosay was going to be nice. I hoped it was going to be very nice. But it wasn’t either of those things. It exceeded expectations by such a large margin, it seemed a little… indulgent, Excessive even. And yet it was cheaper than a Hilton, and you couldn’t imagine a more perfect place to think and talk about the future of open source geoscience, or a more productive environment in which to write code with new friends and colleagues.

It turns out that a 400-year-old château set in 8 acres of parkland in the heart of Normandy is a great place to create new things. I expect Gustave Flaubert and Guy de Maupassant thought the same when they stayed there 150 years ago. The forty-two bedrooms house exactly the right number of people for a purposeful scientific meeting.

This is frustrating, I’m not doing the place justice at all.

### The work

This was most people’s first experience of an unconference. It was undeniably weird walking into a week-long meeting with no schedule of events. But, despite being inexpertly facilitated by me, the 26 participants enthusiastically collaborated to create the agenda on the first morning. With time, we appreciated the possibilities of the open space — it lets the group talk about exactly what it needs to talk about, exactly when it needs to talk about it.

The topics ranged from the governance and future of the Software Underground, to the possibility of a new open access journal, interesting new events in the Software Underground calendar, new libraries for geoscience, a new ‘core’ library for wells and seismic, and — of course — machine learning. I’ll be writing more about all of these topics in the coming weeks, and there’s already lots of chatter about them on the Software Underground Slack (which hit 1500 members yesterday!).

### The food

I can’t help it. I have to talk about the food.

…but I’m not sure where to start. The full potential of food — to satisfy, to delight, to start conversations, to impress, to inspire — was realized. The food was central to the experience, but somehow not even the most wonderful thing about the experience of eating at the chateau. Meals were prefaced by a presentation by the professionals in the kitchen. No dish was repeated… indeed, no seating arrangement was repeated. The cheese was — if you are into cheese — off the charts.

There was a professionalism and thoughtfulness to the dining that can perhaps only be found in France.

Sorry everyone. This was one of those occasions when you had to be there. If you weren’t there, you missed out. I wish you’d been there. You would have loved it.

The good news is that it will happen again. Stay tuned.