A machine learning safety net

A while back, I wrote about machine learning safety measures. I was thinking about how easy it is to accidentally make terrible models (e.g. training a support vector machine on unscaled data), or misuse good models (e.g. forgetting to scale data before making a prediction). I suggested that one solution might be to make tools that help spot these kinds of mistakes:

[We should build] software to support good practice. Many of the problems I’m talking about are quite easy to catch, or at least warn about, during the training and evaluation process. Unscaled features, class imbalance, correlated features, non-IID records, and so on. Education is essential, but software can help us notice and act on them.

Introducing redflag

I’m pleased, and a bit nervous, to introduce redflag, a new Python library to help find the sorts of issues I’m describing. The vision for this tool is as a kind of safety net, or ‘entrance exam for data’ (a phrase Evan coined several years ago). It should be able to look at an array (or Pandas DataFrame), and flag potential issues, perhaps generating a report. And it should be able to sit in your Scikit-Learn pipeline, watching for issues.

The current version, 0.1.9 is still rather rough and experimental. The code is far from optimal, with quite a bit of repetition. But it does a few useful things. For example, suppose we have a DataFrame with a column, Lithology, which contains strings denoting 9 rock types (‘sandstone’, ‘limestone’, etc). We’d like to know if the classes are ‘balanced’ — present in roughly similar numbers — or not. If they are not, we will have to be careful with how we split this dataset up for our model evaluation workflow.

>>> import redflag as rf
>>> rf.imbalance_degree(df['Lithology'])
3.37859304086633
>>> rf.imbalance_ratio([df['Lithology'])
8.347368421052632

The imbalance degree, defined by Ortigosa-Hernandez et al. (2017), tells us that there are 4 minority classes (the next integer above this number), and that the imbalance severity is somewhere in the middle (3.1 would be well balanced, 3.9 would be strongly imbalanced). The simpler imbalance ratio tells us that there’s about 8 times as much of the biggest majority class as of the smallest minority class. Conclusion: depending on the size of this dataset, the class imbalance is probably not a show-stopper, but we need to pay attention.

Our dataset contains well log data. Unless they are very far apart, well log samples are usually not independent — they are correlated in depth — and this means we can’t split the data randomly in our evaluation workflow. Redflag has a function to help detect features that are correlated to themselves in this way:

>>> rf.is_correlated(df['GR'])
True

We need to be careful!

Another function, rf.wasserstein() computes the Wasserstein distance, aka the earth mover’s distance, between distributions. This can help us figure out if our data splits all have similar distributions or not — an important condition of our evaluation workflow. I’ll feed it 3 splits in which I have forgotten to scale the first feature (i.e. the first column) in the X_test dataset:

>>> rf.wasserstein([X_train, X_val, X_test])
array([[32.108,  0.025,  0.043,  0.034],
       [16.011,  0.025,  0.039,  0.057],
       [64.127,  0.049,  0.056,  0.04 ]])

The large distances in the first column are the clue that the distribution of the data in this column varies a great deal between the three datasets. Plotting the distributions make it clear what happened.

Working with sklearn

Since we’re often already working with scikit-learn pipelines, and because I don’t really want to have to remember all these extra steps and functions, I thought it would be useful to make a special redflag pipeline that runs “all the things”. It’s called rf.pipeline and it might be all you need. Here’s how to use it:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = make_pipeline(StandardScaler(), rf.pipeline, SVC())

Here’s what this object contains:

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('pipeline',
                 Pipeline(steps=[('rf.imbalance', ImbalanceDetector()),
                                 ('rf.clip', ClipDetector()),
                                 ('rf.correlation', CorrelationDetector()),
                                 ('rf.outlier', OutlierDetector()),
                                 ('rf.distributions',
                                  DistributionComparator())])),
                ('svc', SVC())])

Those redflag items in the inner pipeline are just detectors — think of them like smoke alarms — they do not change any data. Some of them acquire statistics during model fitting, then apply them during prediction. For example, the DistributionComparator learns the feature distributions from the training data, then compares the prediction data to them, to help ensure that you aren’t trying to extrapolate with your model. For example, it will warn you if you train a model on low-GR sandstones then try to predict on high-GR shales.

Here’s what happens when I fit my data with this pipeline:

These are just warnings, and it’s up to me to act on them. I can adjust detection thresholds and other aspects of the algorithms under the hood, but the goal is for redflag to wave its little flag, but not to get in the way. Apart from the warnings, this pipeline works exactly as it did before.


If this project sounds interesting or useful to you, please give it a look. The documentation is here, and contains more examples like those above. If you find bugs or want to request enhancements, there’s the GitHub Issues page. And if you use it for anything you can share, I’d love to hear how you get along!

Love Python and seismic? You need xarray

If you use Python on a regular basis, you might have noticed that Pandas is eating everything, not just things made out of bamboo.

But one thing Pandas is not useful for is 2-D or 3-D (or more-D) seismic data. Its DataFrames are implicitly two-dimensional tables, and shine when the columns represent all sorts of different quantities, e.g. different well logs. But, the really powerful thing about a DataFrame is its index. Instead of having row 0, row 1, etc, we can use an index that makes sense for the data. For example, we can use depth for the index and have row 13.1400 and row 13.2924, etc. Much easier than trying to figure out which index goes with 13.1400 metres!

Seismic data, on the other hand, does not fit comfortably in a table. But it would certainly be convenient to have real-world coordinates for the data as well as NumPy indices — maybe inline and crossline numbers, or (UTMx, UTMy) position, or the actual time position of the timeslices in milliseconds. With xarray we can.

In essence, xarray brings you Pandas-like indexing for n-dimensional arrays. Let’s see why this is useful.

Make an xarray.DataArray

The DataArray object is analogous to a NumPy array, but with the Pandas-like indexing added to each dimension via a property called coords. Furthermore, each dimension has a name.

Imagine we have a seismic volume called seismic with shape (inlines, xlines, time) and the inlines and crosslines are both numbered like 1000, 1001, 1002, etc. The time samples are 4 ms apart. All we need is an array of numbers representing inlines, another for the xlines, and another for the time samples. Then we can construct the DataArray like so:

 
import xarray as xr

il, xl, ts = seismic.shape  # seismic is a NumPy array.

inlines = np.arange(il) + 1000
xlines = np.arange(xl) + 1000
time = np.arange(ts) * 0.004

da = xr.DataArray(seismic,
                  name='amplitude',
                  coords=[inlines, xlines, time],
                  dims=['inline', 'xline', 'twt'],
                  )

This object da has some nice superpowers. For example, we can visualize the timeslice at 400 ms with a single line:

da.sel(twt=0.400).plot()

to produce this plot:

Plotting the same thing from the NumPy array would require us to figure out which index corresponds to 400 ms, and then write at least 8 lines of matplotlib code. Nice!

What else?

Why else might we prefer this data structure over a simple NumPy array? Here are some reasons:

  • Extracting amplitude (or whatever) on a horizon. In particular, if the horizon is another DataArray with the same dims as the volume, this is a piece of cake.

  • Extracting amplitude along a wellbore (again, use the same dims for the well path).

  • Making an arbitrary line (a view not aligned with the rows/columns) is easier, because interpolation is ‘free’ with xarray.

  • Exchanging data with Pandas DataFrame objects is easy, making reading from certain formats (e.g. OpendTect horizon export, which gives you inline, xline coords, not a grid) much easier. Even better: use our library gio for this! It makes xarray objects for you.

  • If you have multiple attributes, then xarray.Dataset is a nice data structure to keep things straight. You can think of it as a dictionary of DataArray objects, but it’s a lot more than that.

  • You can store arbitrary metadata in xarray objects, so it’s easy to attach things you’d like to retain like data rights and permissions, seismic attribute parameters, data acquisition details, and so on.

  • Adding more dimensions is easy, e.g. offset or frequency, because you can just add more dims and associated coords. It can get hard to keep these straight in NumPy (was time on the 3rd or 4th axis?).

  • Taking advantage of Dask, which gives you access to parallel processing and out-of-core operations.

All in all, definitely worth a look for anyone who uses NumPy arrays to represent data with real-world coordinates.


Do you already use xarray? What kind of data does it help you with? What are your reasons? Did I miss anything from my list? Let us know in the comments!

Build an app with Python

Do you have an idea for an app?

Or maybe a useful bit of code you want to share with others, but you’re not sure where to start?

Lots of people come to our Geocomputing class — which is for outright beginners — saying, "I want to build an app". Most of them are thinking of a mobile or desktop app, but most beginners don't know much about the alternatives. Getting useful software into other people’s hands doesn’t necessarily mean making a desktop application. Alternatives include programming libraries, command line tools, and web applications with or without public machine interfacecs (so-called APIs) — and it’s hard to discover and learn about things you don’t know exist.

Now, coming up with a streamlined set of questions to figure out which kind of tool might best match your goals for ‘an app’ is probably impossible. So I gave it a try:

There’s a lot of undocumented nuance in this flowchart. For example:

  • There are a lot of other ways to achieve all of the things I mention in the orange boxes. I picked on some examples, but you could also make a web app — with an API — with Flask or Django. You can make a library or CLI (command line interface) tool with modules from the standard library. There are lots of ways to build a desktop app with a GUI (none of them exactly easy). Indeed, you can run a web app on the desktop in various ways.

  • You might be wondering, “where is ‘Build a mobile app’?” It’s not there. I don’t think building native mobile apps is usually the best idea, especially for relative beginners to Python. Web apps are easier to make, they work on any platform, and are easier to maintain. It helps if you’re online of course, but it is possible to write web apps that work offline too.

  • The main idea is to make something. You want to build the easiest or fastest thing that solves the problem for a few important users and use cases. Because if you can make something they will at least test a few times, you can get some awesome feedback for the next iteration — which might be a completely different thing.

So, take it with a large grain of salt! I hope it’s a tiny bit useful, and at least gives you some things to Google the next time you’re thinking about building ‘an app’.


I tweeted about this flowchart. If you want to adapt it for your own purposes, the original is here.

Thank you to Software Undergrounders Rafael, Lukas, Leo, Kent, Rob, Martin and Evan for helping me improve it. I’m still responsible for its many flaws.

Comparing regressors

There are several really nice comparisons between various algorithms in the Scikit-Learn documentation. The most famous, and useful, one is probably the classifier comparison:

A comparison of classification algorithms. Each row is a different dataset; each column (except the first) is a different classifier, each trying to separate the blue and red points. The accuracy score of each classifier is show in the lower right corner of each plot. There’s so much to look at in this one plot!

There’s also a very nice clustering algorithm comparison, and this anomaly detection comparison. As usual with awesome open source software packages like Scikit-Learn, the really wonderful thing is that all the source code is right there so you can hack these things to show your own data.

What about regression?

Regression problems are the other major kind of machine learning task. If the thing you’re trying to predict is not a category (like ‘blue’ or ‘red’, as above) but a continuous property (like porosity, say), then you’re looking at a regression problem.

I wondered what a comparison plot for the various regressors in Scikit-Learn would look like. I couldn’t find one, so I made one. I made up three one-dimensional datasets — one linear, one polynomial, and one periodic. Then I tried predicting each one with various different model types, from linear regression to a deep neural network. Here’s version 1 (well, 0.1 really) of my script; feel free to adapt and improve it!

Here’s the plot it produces:

A comparison of most of the regressors in scikit-learn, made with this script. The red lines are unregularized models; the blue have regularization. The pale points are the validation data. The small numbers in each plot are RMS error (lower is better!).

I think this plot repays careful study. Notice the smoothing effect of regularization. See how tree-based methods result in discretized predictions, and kernel-based ones are pretty horrible at extrapolation.

I’m 100% open to feedback on ways to improve this plot… or please improve it and show me how it goes!

Take one, make one

There’s a teaching method originating in medicine known as “see one, do one, teach one”. I like it because it underscores hands-on practice and knowledge sharing as essential steps in developing a craft — and it works. Today, I want to urge you to take a challenge, then make one for others.

First, what’s the challenge?

A couple of years ago, inspired by the annual Advent of Code challenges, we introduced the kata, a set of coding challenges especially for geoscientists. For a long time we sent them to students in our Geocomputing class, to encourage them to keep coding. Now we just tell everyone about them.

At the time we announced the kata, there were five puzzles. Today, there are 11: four beginner-friendly challenges, four intermediate ones, and three quite hard ones. Topics range from data munging to map indexing, and from digital elevation models to fractures.

💡 If you want to try one, this Colab is the easiest way to get started: https://ageo.co/kata-live

Now make one!

Once you’ve got an idea of how these things work, you might want to try your hand at making one. Once you have an idea for a short task, you need a way to generate a random dataset. For example, for the sample-names challenge, I have a function that generates a random set of sample names, composed of several parts (a number, a basin, a formation, a data, etc).

When you have a dataset, you can ask some questions about it. Start with an one, and build from there. The last question (there can be 3 or 4), should be a somewhat realistic challenge for this kind of data. Each question needs a hint, and each question must have only one possible answer (this is the tricky bit!).

If you fancy trying your hand at it, check out our new kata-dev repository on GitHub. There is a demo challenge there, which is also live on the kata server, so you can see how it all works. Good luck!


Whether or not you try making a challenge for your peers, so let us know how you get on in the #kata-challenges channel on the Software Underground. We’re always ready to answer questions about them.

An open source wish list

After reviewing a few code-dependent scientific papers recently, I’ve been thinking about reproducibility. Is there a minimum requirement for scientific code, or should we just be grateful for any code at all?

The sky’s the limit

Click to enlarge

I’ve come to the conclusion that there are a few things that are essential if you want anyone to be able to do more than simply read your code. (If that’s all you want, just add a code listing to your supplementary material.)

The number one thing is an open licence. (I recently wrote about how to choose one). Assuming the licence is consistent with everything you have used (e.g. you haven’t used a library with the GPL, then put an Apache licence on it), then you are protected by the indeminity clauses and other people can re-use your code on your terms.

After that, good practice is to improve the quality of your code. Most of us write horrible code a lot of the time. But after bit of review, some refactoring, some input from colleagues, you will have something that is less buggy, more readable, and more reusable (even by you!).

If this was a one-off piece of code, providing figures for a paper for instance, you can stop here. But if you are going to keep developing this thing, and especially if you want others to use it to, you should keep going.

Best practice is to start using continuous integration, to help ensure that the code stays in good shape as you continue to develop it. And after that, you can make your tool more citable, maybe write a paper about it, and start developing a user/contributor community. The sky’s the limit — and now you have help!

Other models

When I shared this on Twitter, Simon Waldman mentioned that he had recently co-authored a paper on this topic. Harrison et al (2021) proposed that there are three priorities for scientific software: to be correct, to be reusable, and to be documented. From there, they developed a hierachy of research software projects:

  • Level 0 — Barely repeatable: the code is clear and tested in a basic way.

  • Level 1 — Publication: code is tested, readable, available and ideally openly licensed.

  • Level 2 — Tool: code is installable and managed by continuous integration.

  • Level 3 — Infrastructure: code is reviewed, semantically versioned, archived, and sustainable.

There are probably still other models out there.— if you know if a good one, please drop it in the Comments.

References

Sam Harrison, Abhishek Dasgupta, Simon Waldman, Alex Henderson & Christopher Lovell (2021, May 14). How reproducible should research software be? Zenodo. DOI: 10.5281/zenodo.4761867

Projects from the Geothermal Hackathon 2021

geothermal_skull_thumb.png

The second Geothermal Hackathon happened last week. Timed to coincide with the Geosciences virtual event of the World Geothermal Congress, our 2-day event brought about 24 people together in the famous Software Underground Chateau (I’m sorry if I missed anyone!). For comparison, last year we were 13 people, so we’re going in the right direction! Next time I hope we’re as big as one of our ‘real world’ events — maybe we’ll even be able to meet up in local clusters.

Here’s a rundown of the projects at this year’s event:

Induced seismicity at Espoo, Finland

Alex Hobé, Mohsen Bazagan and Matteo Niccoli

Alex’s original workflow for creating dynamic displays of microseismic events was to create thousands of static images then stack them into a movie, so the first goal was something more interactive. On Day 1 Alex built a Plotly widget with a time zoomer/slider in a Jupyter Notebook. On day 2 he and Matteo tried Panel for a dynamic 3D plot. Alex then moved the data into LLNL Visit for fully interactive 3D plots. The team continues to hack on the idea.

geothermal_hack_2021_seismic.png

Fluid inclusions at Coso, USA

Diana Acero-Allard, Jeremy Zhao, Samuel Price, Lawrence Kwan, Jacqueline Floyd, Brendan, Gavin, Rob Leckenby and Martin Bentley

Diana had the idea of a gas analysis case study for Coso Field, USA. The team’s specific goal was to develop visualization tools for interetpaton of fluid inclusion gas data to identify fluid types, regions of permeability, and geothermal processes. They had access to analyses from 29 wells, requiring the usual data science workflow: find and load the data, clean the data, make some visualizations and maps, and finally analyse the permeability. GitHub repo here.

geothermal_hack_2021_fluid-incl.png

Utah Forge data pipeline

Andrea Balza, Evan Bianco, and Diego Castañeda

Andrea was driven to dive into the Utah FORGE project. Navigating the OpenEI data portal was a bit hit-and-miss, having to download files to get into ZIP files and so on (this is a common issue with open data repositories). The team eventually figured out how to programmatically access the files to explore things more easily — right from a Jupyter Notebook. Their code for any data on the OpenEI site, not just Utah FORGE, so it’s potentially a great research tool. GitHub repo here.

geothermal_hack_2021_forge.png

Pythonizing a power density estimation tool

Irene Wallis, Jan Niederau, Hannah Wood, Will Middlebrook, Jeff Jex, and Bill Cummings

Like a lot of cool hackathon projects, this one started with spreadsheet that Bill created to simplify the process of making power density estimates for geothermal fields under some statistical assumptions. Such a clear goal always helps focus the mind and the team put together some Python notebooks and then a Streamlit app — which you can test-drive here! From this solid foundation, the team has plenty of plans for new directions to take the tool. GitHub repo here.

geothermal_hack_2021_streamlit2.png
geothermal_hack_2021_streamlit1.png

Computing boiling point for depth

Thorsten Hörbrand, Irene Wallis, Jan Niederau and Matt Hall

Irene identified the need for a Python tool to generate boiling-point-for-depth curves, accommodating various water salinities and chemistries. As she showed during her recent TRANSFORM tutorial (which you must watch!), so-called BPD curves are an important part of geothermal well engineering. The team produced some scripts to compute various scenarios, based on corrections in the IAPWS standards and using the PHREEQC aqueous geochemistry modeling software. GitHub repo here.

geothermal_hack_2021_bpd-curves.png

A big Thank You to all of the hackers that came along to this virtual event. Not quite the same as a meatspace hackathon, admittedly, but Gather.town + Slack was definitely an improvement over Zoom + Slack. At least we have an environment in which people can arrive and immediately get a sense of what is happening in the event. When you realize that people at the tables are actually sitting in Canada, the US, the UK, Switzerland, South Africa, and Auckland — it’s clear that this could become an important new way to collaborate across large distances.

geothermal_hack_2021_chateau.png

Do check out all these awesome and open-source projects — and check out the #geothermal channel in the Software Underground to keep up with what happens next. We’ll be back in the future — perhaps the near future! — with more hackathons and more geothermal technology. Hopefully we’ll see you there! 🌋

A (useless) map of geo-mathematics

Most scientific problems involve at least a bit of maths, even if it’s just adding things up or finding averages.

But some problems require quite a bit of maths, like solving an equation, or throwing vectors around, or even a Fourier transform or two. A lot of people switch off at this point.

Yet other problems require a lot of maths. Maybe we need a finite difference model, a volume integral, or a deep neural network. Most of us back away from the problem at this point and look for a collaborator with a lot of equations on their whiteboard.

But it’s pretty hard to find new collaborators right now. And what if you run into these problems a lot? Maybe you need to be the one with the equationy whiteboard!

What then? Where do you start? We need a map!

A roadmap for learning…

…is what I set out to draw. I failed. Possibly there exists a map, with START HERE in one corner and a whiteboard full of equations in the other. But I doubt it.

I ended up drawing this:

math_map_2400px.png

It was fun to draw, but I highly doubt that it’s any practical use. The conclusion I came to is: there is no path. In fact, this artificially flattened projection of the n-dimensional mathiverse — no doubt reflecting my own weak grasp on half of these topics — is probably a unique, personal perspective. It reflects my interests and my nonlinear journey from A-level calculus (which I loved) to undergraduate maths (which I found very hard) to… whatever half-truths I know today.

But I want to learn, where do I start?

So if there is no path, what can you do to improve? Where should you start? How can you learn? Easy: follow your nose. Start with a project — something that interests you, something you’ll stick with. Maybe it’s a spreadsheet you have, or a plot you want to make. When you get to the maths, as you inevitably will, dig in. Read around. Google things. Get a whiteboard.

As an example, when I worked on the tricky (and unsolved!) task of recovering data from pseudocolour images, my maths journey looked something like this:

Images ➡ clustering ➡ RGB vectors ➡ distance metrics ➡ k-d trees ➡ graphs ➡ Hamiltonian pathsTSP solvers

Admittedly, there’s some computer science in there too, but hey, this is applied maths.

As I described recently in Illuminated equations, there are several ways to serve, and consume, mathematical ideas: words, pictures, plots, symbols, annotations, and code (and probably some others). Seek out sources that give you three or more of these things. For me, being able to run some code makes a huge difference. Indeed, learning Python has directly led to me reading entire books on graph theory, linear algebra, deep learning, Fourier transforms, and all sorts of other things.

Well, learning Python and watching Numberphile.

I think it’s a myth that you have to be good at maths to learn to code. Instead, I think learning to code can — if you want — help make you good, or at least better, at maths. By giving you a way to try things without fully knowing what you are doing (after all, np.fft.fft(x) is pretty easy to type!), code gives you a way to peek at the answer. If you do it often enough, and follow up with some reading, understanding follows eventually.

Which programming language should you learn first?

The question I get asked most often is:

I want to learn to code, where should I start?

To which there’s really no perfect answer. It depends on a lot of things… Why do you want to learn to program? What domain are you in? Have you tried before? Do you like computers? Do your colleagues use anything in particular?

Undeterred by the futility, and inspired by an awesome blog post on freecodecamp.org, which advises you to learn JavaScript (not terrible advice), I thought I’d try to answer the question — for scientists. I adapted a rather old decision tree in that blog post to the specific needs of scientists. Here is the latest version:

which.png

Yes, it does have a lot of Python on it. Yes, I am biased.

These sorts of things are necessarily rather one-dimensional. In an effort to give general advice, all the interesting corners are sanded down. For instance, there is definitely some domain specificity to languages. In the subsurface domain, a lot of geophysicists learned their craft in MATLAB, but today are excited about Julia. I think environmental folks are more into R. Geologists mostly still like coloured pencils best. And of course reservoir engineers mastered VBA years ago. Clearly, if you’re learning to code to start a postgraduate degree, you should probably find out what language others in your lab are using before you crack into that old copy of FORTRAN in a Weekend.*

Anyway, this decision tree thing provoked quite a bit of discussion on Software Underground and Twitter. Some people felt challenged, although my purpose was to suggest a starting point for people, not to say “Never touch Java” (although, seriously, never touch Java). It’s natural — learning to master a language takes years and people are sensitive to perceived criticism of how they spent their time. But this misses the point a bit — programming is really just about getting things done, preferably in an open language (<cough> not MATLAB). So what’s the quickest path for a new programmer to start getting things done?

I appreciated this thoughtful comment from Kris Kuhlman:

I think it worked out more or less that way for me. I learned a bit of BASIC as a 12-year-old, and knew enough assembly to crash a BBC Micro. Then I learned awk in 1993, and used it for basically everything — including many things it certainly was not designed for. I tried and failed to learn Java in 2002, instead picking up MATLAB… which led to Python in about 2008. I was a slow learner though; it took years to be convinced that I needed NumPy. (Yes, you can load seismic as a list of lists.)

In the end, you need several tools in your belt. Several people pointed out that SQL (a so-called ‘domain specific language’ rather than a full-blown programming language) is incredibly useful to know. I think you could say the same for HTML and maybe even XML — or perhaps JSON these days. Then again, maybe these stretch the definition of ‘programming language’ a bit too far. Besides, if you write code, you’ll meet them eventually.

In the end, the point is to get things done. Every language on that tree will enable you to get things done. (Admittedly, Scratch, Processing, ChucK have rather narrow domains.) Fortran has been around for 70 years (not a typo) and is still in the top 20 languages. So don’t sweat it — if Kris is right, you’ll need to learn 2 languages before one sticks anyway.


* There is no such book, lol.

The hot rock hack is back

shirt_skull.png

Last year we ran the first ever Geothermal Hackathon. As with all things, we started small, but energetic: fourteen of us worked on six projects. Topics ranged from project management to geological mapping to natural language processing. It was a fun two days not thinking about coronavirus.

This year we’ll be meeting up on Thursday 13 and Friday 14 May, starting right after the Geoscience Virtual Event of the World Geothermal Congress. Everyone is invited — geoscientists, engineers, data nerds, programmers. No experience of geothermal is necessary, just creativity and curiosity.

Projects are already being discussed on the Software Underground; here are some of the ideas:

  • Data-munging project for Utah Forge, especially well 58-32.

  • Update the Awesome list Thomas Martin started last year.

  • Implementing classic, or newly published, equations and algorthims from the literature.

I expect the preceeding WGC event will spark some last-minute projects too. But for the time being, you’re welcome to add or vote on ideas on the event page. What tools or visualizations would you find useful?


Build some digital geo skills

📣 If you’re looking to build up your coding skills before the hackathon — or for a research project or an idea at work — join us for a Python class. We teach the fundamentals of Python, NumPy and matplotlib using geological and geophysical examples and geo-familiar datasets. There are two classes coming up in May (Digital Geology) and June (Digital Geophysics).