Training digital scientists

Gulp. My first post in… a while. Life, work, chaos, ideas — it all caught up with me recently. I’ve missed the blog greatly, and felt a regular pang of guilt at letting it gather dust. But I’m back! The 200+ draft posts in my backlog ain’t gonna write themselves. Thank you for returning and reading this one.


Recently I wrote about our continuing adventures in training; since I wrote that post in April, we’ve taught another 166 people. It occurred to me that while teaching scientists to code, we’ve also learned a bit about how to teach, and I wanted to share that too. Perhaps you will be inspired to share your skills, and together we can have exponential impact.

Wanting to get better

As usual, it all started with not knowing how to do something, doing it anyway, then wanting to get better.

We started teaching in 2014 as rank amateurs, both as coders and as teachers. But we soon discovered the ‘teaching tech’ subculture among computational scientists. In particular, we found Greg Wilson and the Software Carpentry movement he started. By that point, it had been around for many, many years. Incredibly, Software Carpentry has helped more than 34,000 researchers ‘go digital’. The impact on science can’t be measured.

Eager as ever, we signed up for the instructor’s course. It was fantastic. The course, taught by Greg Wilson himself, perfectly modeled the thing it was offering to teach you: “Do what I say, and what I do”. This is, of course, critically important in all things, especially teaching. We accepted the content so completely that I’m not even sure we graduated. We just absorbed it and ran with it, no doubt corrupting it on the way. But it works for us.

What to read

TTT_rules.png

I should preface what follows by telling you that I haven’t taken any other courses on the subject of teaching. For all I know, there’s nothing new here. That said, I have never experienced a course like Greg Wilson’s, so either the methods he promotes are not widely known, or they’re widely ignored, or I’ve been really unlucky.

The easiest way to get Greg Wilson’s wisdom is probably to read his book-slash-website, Teaching Tech Together. (It’s free, but you can get a hard copy if you prefer.) It’s really good. You can get the vibe — and much of the most important advice — from the ten Teaching Tech Together rules laid out on the main page of that site (box, right).

As you can probably tell, most of it is about parking your ego, plus most of your knowledge (for now), and orientating everything — every single thing — around the learner.

If you want to go deeper, I also recommend reading the excellent, if rather academic, How Learning Works, by Susan Ambrose (Northeastern University) and others. It’s strongly research-driven, and contains a lot of great advice. In particular, it does a great job of listing the factors that motivate students to learn (and those that demotivate them), and spelling out the various ways in which students acquire mastery of a subject.

How to practice

It goes without saying that you’ll need to teach. A lot. Not surprisingly, we find we get much better if we teach several courses in a short period. If you’re diligent, take a lot of notes and study them before the next class, maybe it’s okay if a few weeks or months go by. But I highly doubt you can teach once or twice a year and get good at it.

Something it took us a while to get comfortable with is what Evan calls ‘mistaking’. If you’re a master coder, you might not make too many mistakes (but your expertise means you will have other problems). If you’re not a master (join the club), you will make a lot of mistakes. Embracing everything as a learning opportunity is less awkward for you, and for the students — dealing with mistakes is a core competency for all programmers.

Reflective practice means asking for, and then acting on, student feedback — every day. We ask students to write it on sticky notes. Reading these back to the class the next morning is a good way to really read it. One of the many benefits of ‘never teach alone’ is always having someone to give you feedback from another teacher’s perspective too. Multi-day courses let us improve in real time, which is good for us and for the students.

Some other advice:

  • Keep the student:instructor ratio to no more than ten; seven or eight is better.

  • Take a packet of orange and a packet of green Post-It notes. Use them for names, as ‘help me’ flags, and for feedback.

  • When teaching programming, the more live coding you can do from scratch, the better. While you code, narrate your thought process. This way, students are able to make connections between ideas, code, and mistakes.

  • To explain concepts, draw on a whiteboard. Avoid slides whenever possible.

  • Our co-teacher John Leeman likes to say, “I just showed you something new, what questions do you have?” This beats “Any questions?” for opening the door to engagement.

  • “No-one left behind” is a nice idea, but it’s not always practical. If students can’t devote 100% to the class and then struggle because of it, you owe it to the others to politely suggest they pick the class up again next time.

  • Devote some time to the practical application of the skills you’re teaching, preferably in areas of the participants’ own choosing. In our 5-day class, we devote a whole day to getting students started on their own projects.

  • Don’t underestimate the importance of a nice space, natural light, good food, and frequent breaks.

  • Recognize everyone’s achievement with a small gift at the end of the class.

  • Learning is hard work. Finish early every day.

Give it a try

If you’re interested in helping people learn to code, the most obvious way to start is to offer to assist or co-teach in someone else’s class. Or simply start small, offering a half-day session to a few co-workers. Even if you only recently got started yourself, they’ll appreciate the helping hand. If you’re feeling really confident, or have been coding for a year or two at least, try something bolder: maybe offer a one-day class at a meeting or conference. You will find plenty of interest.

There are few better ways to improve your own skills than to teach. And the feeling of helping people develop a valuable skill is addictive. If you give it a try, let us know how you get on!

TRANSFORM happened!

transform_sticker.jpg

How do you describe the indescribable?

Last week, Agile hosted the TRANSFORM unconference in Normandy, France. We were there to talk about the open subsurface stack: the collection of open-source Python tools for earth scientists. We also spent time on the state of the Software Underground, a global community of practice for digital subsurface scientists and engineers. In effect, this was the first annual Software Underground conference. This was SwungCon 1.

The space

I knew the Château de Rosay was going to be nice. I hoped it was going to be very nice. But it wasn’t either of those things. It exceeded expectations by such a large margin, it seemed a little… indulgent. Excessive, even. And yet it was cheaper than a Hilton, and you couldn’t imagine a more perfect place to think and talk about the future of open source geoscience, or a more productive environment in which to write code with new friends and colleagues.

It turns out that a 400-year-old château set in 8 acres of parkland in the heart of Normandy is a great place to create new things. I expect Gustave Flaubert and Guy de Maupassant thought the same when they stayed there 150 years ago. The forty-two bedrooms house exactly the right number of people for a purposeful scientific meeting.

This is frustrating; I’m not doing the place justice at all.

The work

This was most people’s first experience of an unconference. It was undeniably weird walking into a week-long meeting with no schedule of events. But, despite being inexpertly facilitated by me, the 26 participants enthusiastically collaborated to create the agenda on the first morning. With time, we appreciated the possibilities of the open space — it lets the group talk about exactly what it needs to talk about, exactly when it needs to talk about it.

The topics ranged from the governance and future of the Software Underground, to the possibility of a new open access journal, interesting new events in the Software Underground calendar, new libraries for geoscience, a new ‘core’ library for wells and seismic, and — of course — machine learning. I’ll be writing more about all of these topics in the coming weeks, and there’s already lots of chatter about them on the Software Underground Slack (which hit 1500 members yesterday!).

The food

I can’t help it. I have to talk about the food.

…but I’m not sure where to start. The full potential of food — to satisfy, to delight, to start conversations, to impress, to inspire — was realized. The food was central to the experience, but somehow not even the most wonderful thing about the experience of eating at the chateau. Meals were prefaced by a presentation by the professionals in the kitchen. No dish was repeated… indeed, no seating arrangement was repeated. The cheese was — if you are into cheese — off the charts.

There was a professionalism and thoughtfulness to the dining that can perhaps only be found in France.

Sorry everyone. This was one of those occasions when you had to be there. If you weren’t there, you missed out. I wish you’d been there. You would have loved it.

The good news is that it will happen again. Stay tuned.

Feel superhuman: learning and teaching geocomputing

Diego teaching in Houston in 2018.


It’s five years since we started teaching Python to geoscientists. To be honest, it might have been premature. At the time, Evan and I were maybe only two years into serious, daily use of Python. But the first class, at the Atlantic Geological Society’s annual meeting in February 2014, was free so the pressure was not too high. And it turns out that only being a step or two ahead of your students can be an advantage. Your ‘expert blind spot’ is partially sighted, not completely blind, because you can clearly remember being a noob.

Being a noob is a weird, sometimes very uncomfortable, even scary, feeling for some people. Many of us are used to feeling like experts, at least some of the time. Happily, feeling like a noob is a core competency in programming. Learning new things is a more or less hourly experience for coders. Even a mature language like Python evolves fast enough that it’s hard to keep up. Instead of feeling threatened or exhausted by this, I think the best strategy is to enjoy it. You’ll never be done, there are (way) more questions than answers, and you can learn forever!

One of the bootcamp groups at the Copenhagen hackathon in 2018


This week we’re teaching our 40th course. Last year alone we gave digital superpowers to 325 people, mostly geoscientists. Not all of them learned to code, as such: some people already could, and some found out they didn’t like it… coding really isn’t for everyone. But I think all of them learned something new about technology, and how it can serve them and their science. I hope all of them look at spreadsheets, and Petrel, and websites differently now. I think most of them want, at some point, to learn more. And everyone is excited about machine learning.

The expanding community of quantitative earth scientists

This year we’ve already spent 50 days teaching, and taught 174 people. Imagine that! I get emotional when I think about what these hundreds of new digital geoscientists and engineers will go and do with their new skills. I get really excited when I see what they are already doing — when they come to hackathons, send us screenshots, or write papers with beautiful figures. If the joy of sharing code and collaborating with peers has also rubbed off on them, there’s no telling where it could lead.

Matt teaching in Aberdeen in October 2018


The last nine months or so have been an adventure. Teaching is not supposed to be what Agile is about. We’re a consulting company, a technology company. But for now we’re mostly a training company — it’s where we’re needed. And it makes sense... Programming is fundamentally about knowledge sharing. Teaching is about helping, collaborating. It’s perfect for us.

Besides, it’s a privilege and a thrill to meet all these fantastically smart, motivated people and to hear about their projects and their plans. Sometimes I wish it didn’t mean leaving my family in Nova Scotia and flying to Houston and London and Kuala Lumpur and Kalamazoo… but mostly I wish we could do more of it. Especially when we get comments like these:

“Given how ‘dry’ programming can be, it was DYNAMIC.”
“Excellent teachers with geoscience background.”
“Great instructors, so so approachable, even for newbies like me.”
“Great course [...] Made me realize what could be done in a short time.”
“My only regret was not taking a class like this sooner.”
“Very positive, feel superhuman.”

How many times have you felt superhuman at work recently?

The courses we teach are evolving and expanding in scope. But they all come back to the same thing: growing digital skills in our profession. This is critical because using computers for earth science is really hard. Why? The earth is weird. We’ve spent hundreds of years honing conceptual models, understanding deep time, and figuring out complex spatial relationships.

If data science eats the subsurface without us, we’re all going to get indigestion. Society needs to better understand the earth — for all sorts of reasons — and it’s our duty to build and adopt the most powerful analytical tools available so that we can help.


Learning resources

If you can’t wait to get started, here are some suggestions:

Classroom courses are a big investment in dollars and time, but they can get you a long way really quickly. Our courses are built especially for subsurface scientists and engineers. As far as I know, they are the only ones of their kind. If you think you’d like to take one, talk to us, or look out for a public course. You can find out more or sign up for email alerts here >> https://agilescientific.com/training/

Last thing: I suggest avoiding DataCamp, because of sexual misconduct by an executive, compounded by total inaction, dishonest obfuscation, and basically failing spectacularly. Even their own trainers have boycotted them. Steer clear.

The order of stratigraphic sequences

Much of stratigraphic interpretation depends on a simple idea:

Depositional environments that are adjacent in a geographic sense (like the shoreface and the beach, or a tidal channel and tidal mudflats) are adjacent in a stratigraphic sense, unless separated by an unconformity.

Usually, geologists are faced with only the stratigraphic picture, and are challenged with reconstructing the geographic picture.

One interpretation strategy might be to look at which rocks tend to occur together in the stratigraphy. The idea is that rock types tend to be associated with geographic environments: maybe fine sand on the shoreface, coarse sand on the beach; massive silt in the tidal channel, rhythmically laminated mud in the mud-flats. If two rocks tend to occur together, their environments were probably adjacent, so we can start to understand associations between the rock types, and thus piece together the geographic picture.

So which rock types tend to occur together, and which juxtapositions are spurious — perhaps the result of allocyclic mechanisms like changes in relative sea-level, or sediment supply? To get at this question, some stratigraphers turn to Markov chain analysis.

What is a Markov chain?

Markov chains are sequences of events, or states, resulting from a Markov process. Here’s how Wikipedia describes a Markov process:

A stochastic process that satisfies the Markov property (sometimes characterized as “memorylessness”). Roughly speaking, a process satisfies the Markov property if one can make predictions for the future of the process based solely on its present state just as well as one could knowing the process’s full history, hence independently from such history; i.e., conditional on the present state of the system, its future and past states are independent.

So if we believe that a stratigraphic sequence (I’m using ‘sequence’ here in the most general sense) can be modeled by a process like this — i.e. that its next state depends substantially on its present state — then perhaps we can model it as a Markov chain.

For example, we might have a hunch that we can model a shallow marine system as a sequence like:

offshore mudstone > lower shoreface siltstone > upper shoreface sandstone > foreshore sandstone

Then we might expect to see these transitions occur more often than other, non-successive transitions. In other words, if we compare the transition frequencies we observe to the transition frequencies we would expect from a random sequence of the same beds in the same proportions, then autocyclic or genetic transitions might happen unusually frequently.
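To make the counting concrete, here is a minimal sketch in Python. It assumes the succession is recorded as a string of single-letter rock codes listed from base to top, and it ignores self-transitions (a rock type passing into itself), which is the usual ‘embedded’ chain convention. The function name and the example succession are hypothetical, made up for illustration only.

```python
from collections import Counter

def count_transitions(sequence, states):
    """Count upward transitions between different adjacent states.

    Rows are the lower (older) state, columns the upper (younger) one.
    Self-transitions are forced to zero (an assumption of this sketch).
    """
    pairs = Counter(zip(sequence, sequence[1:]))
    return [[0 if lower == upper else pairs[(lower, upper)]
             for upper in states]
            for lower in states]

# Hypothetical succession, base first: A, then B, then C, back to B, then A.
succession = "ABCBA"
counts = count_transitions(succession, "ABC")

for state, row in zip("ABC", counts):
    print(state, row)
```

Each row of the result says how often that rock type was overlain by each of the others; a matrix like this is the raw material for the comparison with a random sequence.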

The Powers & Easterling method

Several workers have gone down this path. The standard approach seems to be that of Powers & Easterling (1982). Here are the steps they describe:

  • Count the upwards transitions for each rock type. This results in a matrix of counts. Here’s the transition frequency matrix for the example used in the Powers & Easterling paper, in turn taken from Gingerich (1969):

 
data = [[ 0, 37,  3,  2],
        [21,  0, 41, 14],
        [20, 25,  0,  0],
        [ 1, 14,  1,  0]]
  • Compute the expected counts by an iterative process, which usually converges in a few steps. The expected counts represent what Goodman (1968) called a ‘quasi-independence’ model — a random sequence:

 
array([[ 0. , 31.3,  8.2,  2.6],
       [31.3,  0. , 34.1, 10.7],
       [ 8.2, 34. ,  0. ,  2.8],
       [ 2.6, 10.7,  2.8,  0. ]])
  • Now we can compare our observed frequencies with the expected ones in two ways. First, we can inspect the \(\chi^2\) statistic, and compare it with the \(\chi^2\) distribution, given the degrees of freedom (5 in this case). In this example, it’s 35.7, which is beyond the 99.999th percentile of the chi-squared distribution. This rejects the hypothesis of quasi-independence. In other words: the succession appears to be organized. Phew!

  • Secondly, we can compute a matrix of so-called normalized differences, which lets us compare the observed and expected data cell by cell. These are Z-scores, which are approximately normally distributed; since about 95% of the distribution falls between −2 and +2, any value greater in magnitude than 2 is ‘fairly unusual’, in the words of Powers & Easterling. In the example, we can see that the large number of transitions from C (third row) to A (first column) is anomalous:

 
 
array([[ 0. ,  1. , -1.8, -0.3],
       [-1.8,  0. ,  1.2,  1. ],
       [ 4.1, -1.6,  0. , -1.7],
       [-1. ,  1. , -1.1,  0. ]])
powers_easterling_normdiff.png
  • The normalized difference matrix can also be interpreted as a directed graph, indicating the ‘strengths’ of the connections (edges) between rock types (nodes):

powers_easterling_graph.png

It would be all too easy to over-interpret this graph — B and D seem to go together, as do A and C, and C tends to pass into A, which tends to pass into a B/D system before passing back into C — and one could get carried away. But as a complement to sedimentological interpretation, knowledge of processes and the succession in hand, perhaps inspecting Markov chains can help understand the stratigraphic story.
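For readers who want to try the steps above, here is a rough NumPy sketch. It is my reading of the method, not the notebook’s code: it assumes the quasi-independence model has the form E[i, j] = a[i] * b[j] off the diagonal, fitted by alternately rescaling a and b until the row and column totals match the observed ones; it takes the normalized difference to be (observed − expected) / √expected; and it uses (m − 1)² − m for the degrees of freedom. All function and variable names are made up for illustration.

```python
import numpy as np

data = [[ 0, 37,  3,  2],
        [21,  0, 41, 14],
        [20, 25,  0,  0],
        [ 1, 14,  1,  0]]

def expected_counts(observed, tol=1e-8, max_iter=200):
    """Fit E[i, j] = a[i] * b[j] for i != j, with zeros on the diagonal."""
    obs = np.asarray(observed, dtype=float)
    m = obs.shape[0]
    off_diag = 1.0 - np.eye(m)            # 1 off the diagonal, 0 on it
    row_totals, col_totals = obs.sum(axis=1), obs.sum(axis=0)
    a = row_totals / (m - 1)              # starting guess
    expected = np.zeros_like(obs)
    for _ in range(max_iter):
        b = col_totals / (off_diag @ a)   # match the column totals
        a = row_totals / (off_diag @ b)   # match the row totals
        updated = np.outer(a, b) * off_diag
        if np.abs(updated - expected).max() < tol:
            return updated
        expected = updated
    return expected

observed = np.array(data)
expected = expected_counts(observed)
m = observed.shape[0]
mask = ~np.eye(m, dtype=bool)             # ignore the diagonal everywhere

diff = observed - expected
chi2 = (diff[mask] ** 2 / expected[mask]).sum()
dof = (m - 1) ** 2 - m

normdiff = np.zeros_like(expected)
normdiff[mask] = diff[mask] / np.sqrt(expected[mask])

print(np.round(expected, 1))
print(round(chi2, 1), dof)
print(np.round(normdiff, 1))
```

If the assumptions hold, the printed matrices should come out close to the ones quoted above, with the C to A cell standing out in the normalized differences.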

One last thing… there is another use for Markov chains. We can also use the model to produce stochastic realizations of stratigraphy. These will share the same statistics as the original data, but are otherwise quite random. Here are 20 random beds generated from our model:

 
'ABABCBABABCABDABABCA'
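One way such a realization could be generated is sketched below, using only the standard library. It assumes a first-order chain in which each row of the observed count matrix is used directly as a set of (unnormalized) weights for choosing the next bed; the function name, the starting state, and the seed are arbitrary choices for illustration, and this is not necessarily how the notebook does it.

```python
import random

states = ["A", "B", "C", "D"]
data = [[ 0, 37,  3,  2],
        [21,  0, 41, 14],
        [20, 25,  0,  0],
        [ 1, 14,  1,  0]]

def simulate(counts, states, n, start, seed=None):
    """Draw n beds, choosing each one from the count row of the bed below."""
    rng = random.Random(seed)
    index = {state: i for i, state in enumerate(states)}
    beds = [start]
    while len(beds) < n:
        weights = counts[index[beds[-1]]]  # zero weight means never chosen
        beds.append(rng.choices(states, weights=weights)[0])
    return "".join(beds)

beds = simulate(data, states, n=20, start="A", seed=42)
print(beds)
```

Because the diagonal of the count matrix is zero, a bed never passes into the same rock type, and transitions that were never observed (such as C to D) never appear in the output.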

The code to build your own Markov chains is all in this notebook. It’s very much a work in progress. Eventually I hope to merge it into the striplog library, but for now it’s a ‘minimum viable product’. Stay tuned for more on striplog.



References

Gingerich, PD (1969). Markov analysis of cyclic alluvial sediments. Journal of Sedimentary Petrology, 39, p. 330-332. https://doi.org/10.1306/74D71C4E-2B21-11D7-8648000102C1865D

Goodman, LA (1968). The analysis of cross-classified data: independence, quasi-independence, and interactions in contingency tables with or without missing entries. Journal of the American Statistical Association 63, p. 1091-1131. https://doi.org/10.2307/2285873

Powers, DW and RG Easterling (1982). Improved methodology for using embedded Markov chains to describe cyclical sediments. Journal of Sedimentary Petrology 52 (3), p. 913-923. https://doi.org/10.1306/212F808F-2B24-11D7-8648000102C1865D

Machine learning project review checklist

Imagine being a manager or technical chief whose team has been working on a machine learning project. What questions should you be thinking about when your team tells you about their work?

Here are some suggestions. Some of the questions are getting at reproducibility (for testing, archiving, or sharing the workflow), others at quality assurance. A few of the questions might depend on the particular task in hand, although I’ve tried to keep it pretty generic.

There are a few must-ask questions, highlighted in bold.

High-level questions about the project

  • What question were you trying to answer? How did you frame it as an ML task?

  • What is human-level performance on that task? What level of performance is needed?

  • Is it possible to approach this problem without machine learning?

  • If the analysis focused on deep learning methods, did you try shallow learning methods?

  • What are the ethical and legal aspects of this project?

  • Which domain experts were involved in this analysis?

  • Which data scientists were involved in this analysis?

  • Which tools or framework did you use? (How much of a known quantity is it?)

  • Where is the pipeline published? (E.g. public or internal git repositories.)

  • How thorough is the documentation?

Questions about the data preparation

  • Where did the feature data come from?

  • Where did the labels come from?

  • What kind of data exploration did you do?

  • How did you clean the data? How long did this take?

  • Are the classes balanced? How did the distribution change your workflow?

  • What kind of normalization did you do?

  • What did you do about missing data? E.g. what kind of imputation did you do?

  • What kind of feature engineering did you do?

  • How did you split the data into train, validate and test?

Questions about training and evaluation

  • Which models did you explore and why? Did you also try the simplest models that fit the problem?

  • How did you tune the hyperparameters of the model? Did you try grid search or other methods?

  • What kind of validation did you do? Did you use cross-validation? How did you choose the folds?

  • What evaluation metric are you using? Why is it the most appropriate one?

  • How do training, validation, and test metrics compare?

  • If this was a classification task, how does a dummy classifier score?

  • How are errors/residuals distributed? (Ideally normally distributed and homoscedastic.)

  • How interpretable is your model? That is, do the learned parameters mean anything, and can we learn from them? E.g. what is the feature importance?

  • If this was a classification task, are probabilities available in your model and did you use them?

  • If this was a regression task, have you checked the residuals for normality and homoscedasticity?

  • Are there benchmarks for this task, and how well does your model do on them?
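As an illustration of the dummy classifier question, here is a small sketch of the simplest possible baseline: always predict the most common class seen in training. The labels and proportions are invented for the example, and the function name is hypothetical; a library implementation would do the same job.

```python
from collections import Counter

def majority_baseline(train_labels, test_labels):
    """Return the majority training class and its accuracy on the test set."""
    majority, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(label == majority for label in test_labels)
    return majority, hits / len(test_labels)

# Hypothetical, imbalanced lithology labels.
train = ["shale"] * 8 + ["sand"] * 2
test = ["shale"] * 9 + ["sand"] * 1

label, accuracy = majority_baseline(train, test)
print(label, accuracy)
```

With classes this imbalanced, a model that has learned nothing still scores 90% accuracy, which is why a reported score only means something next to the baseline.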

Next steps for the project

  • How will you improve the model?

  • Would collecting more data help? Can we address the imbalance with more data?

  • Are there human or computing resources you need access to?

  • How will you deploy the model?

Rather than asking them explicitly, a reviewer might check things off while reading a report or listening to a presentation. A thorough review would cover most of the points without being prompted. And I’d go so far as to say that a person or team who has done a rigorous treatment should readily have answers to all of these questions. They aren't supposed to be 'traps' exactly, but they are supposed to get to the heart of the issues the data scientist or team likely faced during their work.

What do you think? Are the questions fair? Are there any you would remove, or others you would add? Let me know in the comments.

Visit a Google Docs version of this checklist.


Thank you to members of the Software Underground Slack channel for discussion of these questions, especially Anton Biryukov, Justin Gosses, and Lukas Mosser.

What makes a good benchmark dataset?

Last week I mentioned that we need more open benchmark datasets in geoscience. I think benchmarks are important for researchers to work on, as a teaching aid, and as a way for us to objectively measure how well we’re doing on a particular problem. How else can we know how we’re doing, or compare Company X’s claim with Company Y’s?

What makes a good benchmark?

I haven’t unearthed any guides from other domains to help answer this question, and we don’t yet have enough experience to know for ourselves. But here’s what I’m thinking:

  • It must address at least one clear machine learning task. The more obviously useful the task, the more useful (and important) the benchmark. The benchmark dataset should be well suited to the task (but does not have to be comprehensive or definitive).

  • It must be open. That means explicitly licensed with an open, and preferably permissive, license. I think we need to avoid non-permissive (so-called ‘copyleft’) licenses, because it’s not clear how the ‘sharealike’ clause would affect works that depended on the dataset. And we definitely need to avoid restrictive non-commercial clauses.

  • It must be discoverable and accessible. In other words, it needs to be easy to find, and anyone should be able to get it, without registering on a website or waiting for an email or doing anything else that slows down the pace of their research. A properly open dataset can be replicated anywhere, so openness should take care of this.

  • It must have enough features to be interesting. This might mean different things for different tasks, but in general we’d like to see a few physical measurements (e.g. seismic, well logs, RockEval, core photos, field observations, flow rates, and so on). The features should be independent — we can always generate derivatives.

  • It must have labels. As well as some interesting features, the dataset must have some interpretive information with high information value (e.g. seismic facies, lithologies, depositional environment, sequence boundaries, EURs, and so on). Usually, these are expensive to acquire (which is partly why we’d like to be able to predict them).

  • It should name suitable prediction error evaluation methods, with reference implementations, for the intended task. If people are to use it as a score benchmark, they need to know how to score their own implementations of the task.

  • It can be de-localized, but not completely. We don’t need to know the exact whereabouts of the dataset, but if we remove the relative spatial relationships between wells, say, or don’t know which basin we’re in, then the questions we can ask about the data get a lot less interesting, and the whole situation gets much less realistic.

  • It should not be too big. More than about 1GB means unwieldy. It means difficult to download. It means too much room for nuance. And it means it’s probably impossible to explore in the space of a tutorial. It’s also much harder to get a big dataset into shape than a smaller one. A few thousand records, maybe 100,000 in some cases, is probably plenty.

  • It should be clean, but not too clean. No-one wants to spend hours processing a dataset before it can be used, or — worse — be bitten by some esoteric data problem only a domain expert would spot. But, on the other hand, a dataset with no issues at all might be a bit boring. And, in subsurface at least, completely unrepresentative!

  • It should be well documented. The dataset needs to be described to non-technical people, who know little or nothing about the subsurface. Remember that many users will not be proficient programmers either, so…

  • It should have an accompanying demonstration. For example, a script or notebook, preferably in at least a couple of languages, that shows how to load and inspect the data. Ideally this would include a demonstration of how to pose, and answer, a straightforward question as a machine learning task.

I’m not sure we can call this last one a criterion, but maybe in an ideal world…

  • It should be launched with a data science contest. If you’re feeling really brave, what better way to attract attention to the new open dataset than with a Kaggle-style contest?

It’s certainly true that there are several datasets around. Unfortunately, the openness criterion eliminates most of them, so they fall at the first hurdle. For example, the very nice dataset that Brendon Hall used in the SEG machine learning contest is not open.

If you know of a dataset that could be coerced into meeting most of these criteria, we’d like to hear about it. I know a small army of people that would love to help get it into the open, and into the hands of machine learning researchers all over the world.


The thumbnail image for this post was adapted from an image by user arg_flickr on Flickr, licensed CC-BY.

Thanks to several people on Software Underground, for the discussion on this topic. In particular, Justin Gosses and Lukas Mosser pointed out the need for transparent error evaluation.

Closing the gap: what next?

I wrote recently about closing the gap between data science and the subsurface domain, naming some strategies that I think will speed up this process of digitalization.

But even after the gap has closed in your organization, you’re really just getting started. It’s not enough to have contact between the two worlds; you need most of your activity to be there. This means moving it from wherever it is now. This means time, and effort, and mistakes.

Strategies for 2020+

Hardly any organizations have got to this point yet. And I certainly don’t know what it looks like when we get there as a discipline. But nonetheless I think I’m starting to see what’s going to be required to continue to build into the gap. Soon, we’re going to need to think about these things.

  • We’re bad at hiring; we need to be awesome at it*. We need to stop listening to the pop psychology peddled by HR departments (Myers-Briggs, lol) and get serious about hiring brilliant scientific and technical talent. It’s going to take some work because a lot of brilliant scientists and technical talent aren’t that interested in subsurface.

  • You need to get used to the idea that digital scientists can do amazing things quickly. These are not your corporate timelines. There are no weekly meetings. Prototyping a digital technology, or proving a concept, takes days. Give me a team of 3 people and I can give you a prototype this time next week.

  • You don’t have to do everything yourself. In fact, you don’t want to, because that would be horribly slow. For example, if you have a hard problem at hand, Kaggle can get 3000 teams of fantastically bright people to look at it for you. Next week.

  • We need benchmark datasets. If anyone is going to be able to test anything, or believe any claims about machine learning results, then we need benchmark data. Otherwise, what are we to make of claims like “98% accuracy”? (Nothing, it’s nonsense.)

  • We need faster research. We need to stop asking people for static, finished work — that they can only present with PowerPoint — months ahead of a conference… then present it as if it’s bleeding edge. And do you know how long it takes to get a paper into GEOPHYSICS?

  • You need Slack and Stack Overflow in your business. These two tools have revolutionized how technical people communicate and help each other. If you have a large organization, then your digital scientists need to talk to each other. A lot. Skype and Yammer won’t do. Check out the Software Underground if you want to see how great Slack is.

Even if your organization is not quite ready for these ideas yet, you can start laying the groundwork. Maybe your team is ready. You only need a couple of allies to get started; there’s always something you can do right now to bring change a little sooner. For example, I bet you can:

  • List 3 new places you could look for amazing, hireable scientists to start conversations with.

  • Find out who’s responsible for technical communities of practice and ask them about Slack and SO.

  • Find 3 people who’d like to help organize a hackathon for your department before the summer holidays.

  • Do some research about what it takes to organize a Kaggle-style contest.

  • Get with a colleague and list 3 datasets you could potentially de-locate and release publicly.

  • Challenge the committee to let you present in a new way at your next technical conference.

I guarantee that if you pick up one of these ideas and run with it for a bit, it’ll lead somewhere fun and interesting. And if you need help at some point, or just want to talk about it, you know where to find us!


* I’m not being flippant here. Next time you’re at a conference, go and talk to the grad students, all sweaty in their suits, getting fake interviews from recruiters. Look at their CVs and resumes. Visit the recruitment room. Go and look at LinkedIn. The whole thing is totally depressing. We’ve trained them to present the wrong versions of themselves.

X lines of Python: Ternary diagrams

Difficulty rating: beginner-friendly

(I just realized that calling the more approachable tutorials ‘easy’ is perhaps not the most sympathetic way to put it. But I think this one is fairly approachable.)

If you’re new to Python, plotting is a great way to get used to data structures, and even syntax, because you get immediate visual feedback. Plots are just fun.

Data loading

The first thing is to load the data, which is contained in a Google Sheets spreadsheet. If you make a sheet public, it’s easy to make a URL that provides a CSV. Happily, the Python data management library pandas can read URLs directly, so loading the data is quite easy — the only slightly ugly thing is the long URL:

    import pandas as pd
    uid = "1r7AYOFEw9RgU0QaagxkHuECvfoegQWp9spQtMV8XJGI"
    url = f"https://docs.google.com/spreadsheets/d/{uid}/export?format=csv"
    df = pd.read_csv(url) 
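
If the spreadsheet has more than one tab, a small helper can build the export URL for you. This is only a sketch: `sheet_csv_url` is a hypothetical name, and the `gid` query parameter (which, as far as I know, selects a tab by its numeric ID) is an assumption that the URL above doesn’t use.

```python
from urllib.parse import urlencode


def sheet_csv_url(uid, gid=None):
    """Build a CSV export URL for a public Google Sheet (hypothetical helper)."""
    params = {"format": "csv"}
    if gid is not None:
        # Assumption: gid identifies one tab within the spreadsheet.
        params["gid"] = gid
    base = f"https://docs.google.com/spreadsheets/d/{uid}/export"
    return f"{base}?{urlencode(params)}"


uid = "1r7AYOFEw9RgU0QaagxkHuECvfoegQWp9spQtMV8XJGI"
url = sheet_csv_url(uid)  # Same URL as the f-string version above.
```

The result can be passed straight to `pd.read_csv(url)`, exactly as before.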

This dataset contains results from point-counting 51 shallow marine sandstones from the Eocene Sobrarbe Formation. We’re going to plot normalized volume percentages of quartz grains, detrital carbonate grains, and undifferentiated matrix. Three parameters? Two degrees of freedom? Let’s make a ternary plot!
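
To make the ‘two degrees of freedom’ point concrete, here is a minimal sketch of how three components collapse onto a flat triangle. Once the components are normalized to sum to one, the third is fixed by the other two, so an (x, y) pair is enough. The function name `ternary_to_xy` and the choice of which component sits at which corner are my own assumptions for illustration, not something taken from the notebook.

```python
import math


def ternary_to_xy(a, b, c):
    """Project three non-negative components onto an equilateral triangle.

    Assumed corner layout: pure a at (1, 0), pure b at the apex
    (0.5, sqrt(3)/2), and pure c at the origin (0, 0).
    """
    total = a + b + c
    if total <= 0:
        raise ValueError("Components must sum to a positive number.")
    # Normalize so that a + b + c == 1; c is then implied by a and b.
    a, b = a / total, b / total
    x = a + b / 2
    y = b * math.sqrt(3) / 2
    return x, y


# An equal three-way mixture lands at the centroid of the triangle.
x, y = ternary_to_xy(1, 1, 1)
```

Note that the inputs don’t need to be percentages: counts, fractions and percentages all give the same position, because only the proportions matter.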

Data exploration

Once you have the data in pandas, and before getting to the triangular stuff, we should have a look at it. Seaborn, a popular statistical plotting library, has a nifty ‘pairplot’ which plots the numerical parameters against each other to help reveal patterns in the data. On the diagonal, it shows kernel density estimations to reveal the distribution of each property:

    import seaborn as sns

    # Numerical columns to compare; the name avoids shadowing the built-in vars().
    columns = ['Matrix', 'Quartz', 'Carbonate', 'Bioclasts', 'Authigenic']
    sns.pairplot(df, vars=columns, hue='Facies Association')
ternary_data_pairplot.png

Normalization is fairly straightforward. For each column, e.g. df['Carbonate'], we make a new column, e.g. df['C'], which is normalized to the sum of the three components, given by df[cols].sum(axis=1):

    cols = ['Carbonate', 'Quartz', 'Matrix']
    for col in cols:
        df[col[0]] = df[col] * 100 / df[cols].sum(axis=1)
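
To check that the normalization does what we expect, here is the same loop run on a tiny, made-up table (these numbers are invented for illustration and are not from the Sobrarbe dataset). The extra ‘Bioclasts’ column is there to show that components outside `cols` are ignored, so the three new columns always sum to 100.

```python
import pandas as pd

# Made-up point counts for three samples; not real data.
df = pd.DataFrame({
    'Carbonate': [10, 30, 5],
    'Quartz': [50, 30, 15],
    'Matrix': [20, 40, 5],
    'Bioclasts': [20, 0, 75],
})

cols = ['Carbonate', 'Quartz', 'Matrix']
total = df[cols].sum(axis=1)  # Row-wise sum of the three components only.
for col in cols:
    # New column named after the first letter: 'C', 'Q', 'M'.
    df[col[0]] = df[col] * 100 / total

row_sums = df[['C', 'Q', 'M']].sum(axis=1)
```

One thing to watch: naming the new columns with `col[0]` only works because the three components start with different letters.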

The ternary plot

For the ternary plot itself I’m using the python-ternary library, which is pretty hands-on in that most plots take quite a bit of code. But the upside of this is that you can do almost anything you want. (There’s one other option for Python, the ever-reliable plotly, and there’s a solid-looking package for R too in ggtern.)

We just need a few lines of plotting code to pull a ternary diagram together:

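    import ternary  # The python-ternary package.

    # Scale of 100 because the components are percentages.
    fig, tax = ternary.figure(scale=100)
    fig.set_size_inches(5, 4.5)

    tax.scatter(df[['M', 'Q', 'C']].values)
    tax.gridlines(multiple=20)
    tax.get_axes().axis('off')  # Hide the default matplotlib axes.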
ternary_tiny.png

But here you see what I mean about this being quite a low-level library: each element of the plot has to be added explicitly. So if we want axis labels, titles, and other annotations, we need more code… all of which is laid out in the accompanying notebook. You can download this from GitHub, or run it right now, right in your browser, with these links:

Binder   Run the accompanying notebook in MyBinder

Open In Colab   Run the notebook in Google Colaboratory (note you need to install python-ternary)

Give it a go, and have fun making your own ternary plots in Python! Share them on LinkedIn or Twitter.

Quartz, carbonate and matrix quantities (normalized to 100%) for 51 calcareous sandstones from the Eocene Sobrarbe Formation. The ternary plot was made with the python-ternary library for Python and matplotlib.

Closing the analytics–domain gap

I recently figured out where Agile lives. Or at least where we strive to live. We live on the isthmus — the thin sliver of land — between the world of data science and the domain of the subsurface.

We’re not alone. A growing number of others live there with us. There’s an encampment; I wrote about it earlier this week.

Backman’s Island, one of my favourite kayaking destinations, is a passable metaphor for the relationship between machine learning and our scientific domain.

Closing the gap in your organization

In some organizations, there is barely a connection. Maybe a few rocks at low tide, so you can hop from one to the other. But when we look more closely we find that the mysterious and/or glamorous data science team, and the stories that come out of it, seem distinctly at odds with the daily reality of the subsurface professionals. The VP talks about a data-driven business, deep learning, and 98% accuracy (whatever that means), while the geoscientists and engineers battle with raster logs, giant spreadsheets, and trying to get their data from Petrel into ArcGIS (or, help us all, PowerPoint) so they can just get on with their day.

We’re not going to learn anything from those organizations, except maybe marketing skills.

We can learn, however, from the handful of organizations, or parts of them, that are serious about not only closing the gap, but building new paths, and infrastructure, and new communities out there in the middle. If you’re in a big company, they almost certainly exist somewhere in the building — probably keeping their heads down because they are so productive and don’t want anyone messing with what they’ve achieved.

Here are some of the things they are doing:

  • Blending data science teams into asset teams, sitting machine learning specialists with subsurface scientists and engineers. Don’t make the same mistake with machine learning that our industry made with innovation — giving it to a VP and trying to bottle it. Instead, treat it like Marmite: spread it very thinly on everything.*

  • Treating software like knowledge sharing. Code is, hands down, the best way to share knowledge: it’s unambiguous, tested (we hope anyway), and — above all — you can actually use it. Best practice documents are, I’m afraid, not worth the paper they would be printed on, if anyone even knew how to find them.

  • Learning to code. OK, I’m biased because we train people… but it seriously works. When you have 300 geoscientists in your organization that embrace computational thinking, that can write a function in Python, that know what a support vector machine is for — that changes things. It changes every conversation.

  • Providing infrastructure for digital science. Once you have people with skills, you need people with powers. The power to install software, instantiate a virtual machine, or recruit a coder. You need people with tools, like version control, continuous integration, and communities of practice.

  • Realizing that they need to look in new places. Those much-hyped conversations everyone is having with Google or Amazon are, admittedly, pretty cool to see in the extractive industries (though if you really want to live on the cutting edge of geospatial analytics, you should probably be talking to Uber). You will find more hope and joy in Kaggle, Stack Overflow, and any given hackathon than you will in any of the places you’ve been looking for ‘innovation’ for the last 20 years.

This machine learning bandwagon we’re on is not about being cool, or giving keynotes, or saying ‘deep learning’ and ‘we’re working with Google’ all the time. It’s about equipping subsurface professionals to make better and safer scientific, industrial, and business decisions with more evidence and more certainty.

And that means getting serious about closing that gap.


I thought about this gap, and Agile’s place in it — along with the several hundred other digital subsurface scientists in the world — after attempting to draw the ‘big picture’ of data science on one of our courses recently. Here’s a rendering of that drawing, without further comment. It didn’t quite fit with my ‘sliver of land’ analogy somehow…

On the left, the world of ‘advanced analytics’; on the right, how the disciplines of data science and earth science overlap on and intersect the computational world. We live in the green belt. (Yes, we could argue for hours about these terms, but let’s not.)


* If you don’t know what Marmite is, it’s not too late to catch up.

The digital subsurface water-cooler

swung_round_orange.png

Back in August 2016 I told you about the Software Underground, an informal, grass-roots community of people who are into rocks and computers. At its heart is a public Slack group (Slack is a bit like Yammer or Skype but much more awesome). At the time, the Underground had 130 members. This morning, we hit ten times that number: there are now 1300 enthusiasts in the Underground!

If you’re one of them, you already know that it’s easily the best place there is to find and chat to people who are involved in researching and applying machine learning in the subsurface — in geoscience, reservoir engineering, and anything else to do with the hard parts of the earth. And it’s not just about AI… it’s about data management, visualization, Python, and web applications. Here are some things that have been shared in the last 7 days:

  • News about the upcoming Software Underground hackathon in London.

  • A new Udacity course on TensorFlow.

  • Questions to ask when reviewing machine learning projects.

  • A Dockerfile to make installing Seismic Unix a snap.

  • Mark Zoback’s new geomechanics course.

It gets better. One of the most interesting conversations recently has been about starting a new online-only, open-access journal for the geeky side of geo. Look for the #journal channel.

Another emerging feature is the ‘real life’ meetup. Several social+science gatherings have happened recently in Aberdeen, Houston, and Calgary… and more are planned; check #meetups for details. If you’d like to organize a meetup where you live, Software Underground will support it financially.

softwareunderground_merch.png

We’ve also gained a website, softwareunderground.org, where you’ll find a link to sign up for the Slack group, some recommended reading, and fantastic Software Underground T-shirts and mugs! There are also other ways to support the community with a subscription or sponsorship.

If you’ve been looking for the geeks, data-heads, coders and makers in geoscience and engineering, you’ve found them. It’s free to sign up — I hope we see you in there soon!


Slack has nice desktop, web and mobile clients. Check out all the channels — they are listed on the left:

swung_convo.png