X lines of Python: Gridding map data

Difficulty rating: moderate.

Welcome to the latest in the X lines of Python series. You probably thought it had died, gawn to ‘eaven, was an x-series. Well, it’s back!

Today we’re going to fit a regularly sampled surface — a grid — to an irregular set of points in (x, y) space. The points represent porosity, measured in volume percent.

Here’s what we’re going to do; it all comes to only 9 lines of code!

  1. Load the data from a text file (needs 1 line of code).

  2. Compute the extents and then the coordinates of the new grid (2 lines).

  3. Make a radial basis function interpolator using SciPy (1 line).

  4. Perform the interpolation (1 line).

  5. Make a plot (4 lines).

As usual, there’s a Jupyter Notebook accompanying this blog post, and you can run it right now without installing anything.


Binder Run the accompanying notebook in MyBinder

Open In Colab Run the notebook in Google Colaboratory

Just the juicy bits

The notebook goes over the workflow in a bit more detail — with more plots and a few different ways of doing the interpolation. For example, we try out triangulation and demonstrate using scikit-learn’s Gaussian process model to show how we might use kriging (turns out kriging was machine learning all along!).

If you don’t have time for all that, and just want the meat of the notebook, here it is:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.interpolate import Rbf

# Load the data.
df = pd.read_csv('../data/ZoneA.dat',
                 sep=' ',
                 usecols=[0, 1, 2, 3],
                 names=['x', 'y', 'thick', 'por']

# Build a regular grid with 500-metre cells.
extent = x_min, x_max, y_min, y_max = [df.x.min()-1000, df.x.max()+1000,
                                       df.y.min()-1000, df.y.max()+1000]
grid_x, grid_y = np.mgrid[x_min:x_max:500, y_min:y_max:500]

# Make the interpolator and do the interpolation.
rbfi = Rbf(df.x, df.y, df.por)
di = rbfi(grid_x, grid_y)

# Make the plot.
plt.figure(figsize=(15, 15))
plt.imshow(di.T, origin="lower", extent=extent)
cb = plt.scatter(df.x, df.y, s=60, c=df.por, edgecolor='#ffffff66')
plt.colorbar(cb, shrink=0.67)

This results in the following plot, in which the points are the original data, plotted with the same colourmap as the surface itself (so they should be the same colour, more or less, as their background).


x lines of Python: contour maps

Difficulty rating: EASY

Following on from the post a couple of weeks ago about colourmaps, I wanted to poke into contour maps a little more. Ostensibly, making a contour plot in matplotlib is a one-liner:


But making a contour plot look nice takes a little more work than most of matplotlib's other plotting functions. For example, to change the contour levels you need to make an array containing the levels you want... another line of code. Adding index contours needs another line. And then there's all the other plotty stuff.

Here's what we'll do:

  1. Load the data from a binary NumPy file.
  2. Check the data looks OK.
  3. Get the min and max values from the map.
  4. Generate the contour levels.
  5. Make a filled contour map and overlay contour lines.
  6. Make a map with index contours and contour labels.

The accompanying notebook sets out all the code you will need. You can even run the code right in your browser, no installation required.

Here's the guts of the notebook:

import numpy as np
import matplotlib.pyplot as plt

seabed = np.load('../data/Penobscot_Seabed.npy')
seabed *= -1
mi, ma = np.floor(np.nanmin(seabed)), np.ceil(np.nanmax(seabed))
step = 2
levels = np.arange(10*(mi//10), ma+step, step)
lws = [0.5 if level % 10 else 1 for level in levels]

# Make the plot
fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(1,1,1)
im = ax.imshow(seabed, cmap='GnBu_r', aspect=0.5, origin='lower')
cb = plt.colorbar(im, label="TWT [ms]")
cb.set_clim(mi, ma)
params = dict(linestyles='solid', colors=['black'], alpha=0.4)
cs = ax.contour(seabed, levels=levels, linewidths=lws, **params)
ax.clabel(cs, fmt='%d')

This produces the following plot:


x lines of Python: Let's play golf!

Normally in the x lines of Python series, I'm trying to do something useful in as few lines of code as possible, but — and this is important — without sacrificing clarity. Code golf, on the other hand, tries solely to minimize the number of characters used, and to heck with clarity. This might, and probably will, result in rather obfuscated code.

So today in x lines, we set x = 1 and see what kind of geophysics we can express. Follow along in the accompanying notebook if you like.

A Ricker wavelet

One of the basic building blocks of signal processing and therefore geophysics, the Ricker wavelet is a compact, pulse-like signal, often employed as a source in simulation of seismic and ground-penetrating radar problems. Here's the equation for the Ricker wavelet:

$$ A = (1-2 \pi^2 f^2 t^2) e^{-\pi^2 f^2 t^2} $$

where \(A\) is the amplitude at time \(t\), and \(f\) is the centre frequency of the wavelet. Here's one way to translate this into Python, more or less as expressed on SubSurfWiki:

import numpy as np 
def ricker(length, dt, f):
    """Ricker wavelet at frequency f Hz, length and dt in seconds.
    t = np.arange(-length/2, length/2, dt)
    y = (1.0 - 2.0*(np.pi**2)*(f**2)*(t**2)) * np.exp(-(np.pi**2)*(f**2)*(t**2))
    return t, y

That is alredy pretty terse at 261 characters, but there are lots of obvious ways, and some non-obvious ways, to reduce it. We can get rid of the docstring (the long comment explaining what the function does) for a start. And use the shortest possible variable names. Then we can exploit the redundancy in the repeated appearance of \(\pi^2f^2t^2\)... eventually, we get to:

def r(l,d,f):import numpy as n;t=n.arange(-l/2,l/2,d);k=(n.pi*f*t)**2;return t,(1-2*k)/n.exp(k)

This weighs in at just 95 characters. Not a bad reduction from 261, and it's even not too hard to read. In the notebook accompanying this post, I check its output against the version in our geophysics package bruges, and it's legit:

The 95-character Ricker wavelet in green, with the points computed by the function in BRuges.

The 95-character Ricker wavelet in green, with the points computed by the function in BRuges.

What else can we do?

In the notebook for this post, I run through some more algorithms for which I have unit-tested examples in bruges:

To give you some idea of why we don't normally code like this, here's what the Aki–Richards solution looks like:

def r(a,c,e,b,d,f,t):import numpy as n;w=f-e;x=f+e;y=d+c;p=n.pi*t/180;s=n.sin(p);return w/x-(y/a)**2*w/x*s**2+(b-a)/(b+a)/n.cos((p+n.arcsin(b/a*s))/2)**2-(y/a)**2*(2*(d-c)/y)*s**2

A bit hard to debug! But there is still some point to all this — I've found I've had to really understand Python's order of mathematical operations, and find different ways of doing familiar things. Playing code golf also makes you think differently about repetition and redundancy. All good food for developing the programming brain.

Do have a play with the notebook, which you can even run in Microsoft Azure, right in your browser! Give it a try. (You'll need an account to do this. Create one for free.)

Many thanks to Jesper Dramsch and Ari Hartikainen for helping get my head into the right frame of mind for this silliness!

x lines of Python: load curves from LAS

Welcome to the latest x lines of Python post, in which we have a crack at some fundamental subsurface workflows... in as few lines of code as possible. Ideally, x < 10.

We've met curves once before in the series — in the machine learning edition, in which we cheated by loading the data from a CSV file. Today, we're going to get it from an LAS file — the popular standard for wireline log data.

Just as we previously used the pandas library to load CSVs, we're going to save ourselves a lot of bother by using an existing library — lasio by Kent Inverarity. Indeed, we'll go even further by also using Agile's library welly, which uses lasio behind the scenes.

The actual data loading is only 1 line of Python, so we have plenty of extra lines to try something more ambitious. Here's what I go over in the Jupyter notebook that goes with this post:

  1. Load an LAS file with lasio.
  2. Look at its header.
  3. Look at its curve data.
  4. Inspect the curves as a pandas DataFrame.
  5. Load the LAS file with welly.
  6. Look at welly's Curve objects.
  7. Plot part of a curve.
  8. Smooth a curve.
  9. Export a set of curves as a matrix.
  10. BONUS: fix some broken things in the file header.

Each one of those steps is a single line of Python. Together, I think they cover many of the things we'd like to do with well data once we get our hands on it. Have a play with the notebook and explore what you can do.

Next time we'll take things a step further and dive into some seismic petrophysics.

x lines of Python: Global seismic data

Today we'll look at finding and analysing global seismology data with Python and the wonderful seismology package ObsPy, from Moritz Beyreuther, Lion Krischer, and others originally at the Geophysical Observatory in Munich.

We've used ObsPy before to load SEG-Y files into Python, but that's not its core purpose. These tools are typically used by global seismologists and earthquake scientists, but we're going to download and analyse data from three non-earthquakes:

  1. A curious landslide and tsunami in Greenland.
  2. The recent nuclear bomb test in North Korea.
  3. Hurricane Irma's passage through the Caribbean.

We'll also look at an actual earthquake. This morning there was a very large earthquake off Mexico, killing at least 15 people. It's the first M8+ earthquake anywhere since the Illapel event, Chile, on 16 September 2015.

Only 4 lines?

Once you have ObsPy, only 4 lines of code (not counting imports) are needed to download and plot a seismic trace. Here's how to instantiate the ObsPy client using the IRIS data service, then get 5 minutes of waveform data from the Mudanjiang or MDJ station on the IC network, the New China Digital Seismograph Network, and finally plot it:

from obspy.clients.fdsn import Client
client = Client("IRIS")

from obspy import UTCDateTime
t = UTCDateTime("2017-09-03_03:30:00")
st = client.get_waveforms("IC", "MDJ", "00", "BHZ", t, t + 5*60)

Pretty awesome, right? One day getting seismic and well data will be this simple! LOL

Check out the Jupyter Notebook! I cannot get this notebook to run on Azure Notebooks I'm afraid, so the only way to run it is to set up Python and Jupyter (best way: install Canopy or Anaconda) on your machine. I urge you to give it a go, because what could be more fun than playing around with decades of seismic data from all over the world?

x lines of Python: read and write CSV

A couple of weeks ago, in Murphy's Law for Excel, I wrote about the dominance of spreadsheets in applied analysis, and how they may be getting out of hand. Then in Organizing spreadsheets I wrote about how — if you are going to store data in spreadsheets — to organize your data so that you do the least amount of damage. The general goal being to make your data machine-readable. Or, to put it another way, to allow you to save your data as comma-separated values or CSV files.

CSV is the de facto standard way to store data in text files. They are human-readable, easy to parse with multiple tools, and they compress easily. So you need to know how to read and write them in your analysis tool of choice. In our case, this is the Python language. So today I present a few different ways to get at data stored in CSV files.

How many ways can I read thee?

In the accompanying Jupyter Notebook, we read a CSV file into Python in six different ways:

  1. Using the pandas data analysis library. It's the easiest way to read CSV and XLS data into your Python environment...
  2. ...and can happily consume a file on the web too. Another nice thing about pandas. It also writes CSV files very easily.
  3. Using the built-in csv package. There are a couple of standard ways to do this — csv.reader...
  4. ...and csv.DictReader. This library is handy for when you don't have (or don't want) pandas.
  5. Using numpy, the numeric library for Python. If you just have a CSV full of numbers and you want an array in the end, you can skip pandas.
  6. OK, it's not really a CSV file, but for the finale we read a spreadsheet directly from Google Sheets.

I usually count my lines diligently in these posts, but not this time. With pandas you're looking at a one-liner to read your data:

df = pd.read_csv("myfile.csv")

and a one-liner to write it out again. With csv.DictReader you're looking at 3 lines to get a list of dicts (but watch out: your numbers will be strings). Reading a Google Doc is a little more involved, not least because you'll need to set up an app and get an API key to handle authentication.

That's all there is to CSV files. Go forth and wield data like a pro! 

Next time in the xlines of Python series we'll look at reading seismic station data from the web, and doing a bit of time-series analysis on it. No more stuff about spreadsheets and CSV files, I promise :)

The thumbnail image is based on the possibly apocryphal banksy image of an armed panda, and one of texturepalace.com's CC-BY textures.

x lines of Python: read and write a shapefile

Shapefiles are a sort-of-open format for geospatial vector data. They can encode points, lines, and polygons, plus attributes of those objects, optionally bundled into groups. I say 'sort-of-open' because the format is well-known and widely used, but it is maintained and policed, so to speak, by ESRI, the company behind ArcGIS. It's a slightly weird (annoying) format because 'a shapefile' is actually a collection of files, only one of which is the eponymous SHP file. 

Today we're going to read a SHP file, change its Coordinate Reference System (CRS), add a new attribute, and save a new file in two different formats. All in x lines of Python, where x is a small number. To do all this, we need to add a new toolbox to our xlines virtual environment: geopandas, which is a geospatial flavour of the popular data management tool pandas.

Here's the full rundown of the workflow, where each item is a line of Python:

  1. Open the shapefile with fiona (i.e. not using geopandas yet).
  2. Inspect its contents.
  3. Open the shapefile again, this time with geopandas.
  4. Inspect the resulting GeoDataFrame in various ways.
  5. Check the CRS of the data.
  6. Change the CRS of the GeoDataFrame.
  7. Compute a new attribute.
  8. Write the new shapefile.
  9. Write the GeoDataFrame as a GeoJSON file too.

By the way, if you have not come across EPSG codes yet for CRS descriptions, they are the only way to go. This dataset is initially in EPSG 4267 (NAD27 geographic coordinates) but we change it to EPSG 26920 (NAD83 UTM20N projection).

Several bits of our workflow are optional. The core part of the code, items 3, 6, 7, and 8, are just a few lines of Python:

    import geopandas as gpd
    gdf = gpd.read_file('data_in.shp')
    gdf = gdf.to_crs({'init': 'epsg:26920'})
    gdf['seafl_twt'] = 2 * 1000 * gdf.Water_Dept / 1485

That's it! 

As in all these posts, you can follow along with the code in the Jupyter Notebook.

x lines of Python: machine learning

You might have noticed that our web address has changed to agilescientific.com, reflecting our continuing journey as a company. Links and emails to agilegeoscience.com will redirect for the foreseeable future, but if you have bookmarks or other links, you might want to change them. If you find anything that's broken, we'd love it if you could let us know.

Artificial intelligence in 10 lines of Python? Is this really the world we live in? Yes. Yes it is.

After reminding you about the SEG machine learning contest just before Christmas, I thought I could show you how you train a model in a supervised learning problem, then use it to make predictions on unseen data. So we'll just break a simple contest entry down into ten easy steps (note that you could do this on anything, doesn't have to be this problem). 

A machine learning primer

Before we start, let's review quickly what a machine learning problem looks like, and introduct a bit of jargon. To begin, we have a dataset (e.g. the 'Old' well in the diagram below). This consists of records, called instances. In this problem, each instance is a depth location. Each instance is a feature vector: a row vector comprising attributes or features, which in our case are wireline log values for GR, ILD, and so on. Each feature vector is a row in a matrix we conventionally call \(X\). Associated with each instance is some target label — the thing we want to predict — which is a continuous quantity in a regression problem, discrete in a classification problem. The vector of labels is usually called \(y\). In the problem below, the labels are integers representing 9 different facies.

You can read much more about the dataset I'm using in Brendon Hall's tutorial (The Leading Edge, October 2016).

The ten steps to glory

Well, maybe not glory, but something. A prediction of facies at two wells, based on measurements made at 10 other wells. You can follow along in the notebook, but all the highlights are included here. We start by loading the data into a 'dataframe', which you can think of like a spreadsheet:

Now we specify the features we want to use, and make the matrix \(X\) and label vector \(y\):

  features = ['GR', 'ILD_log10', 'DeltaPHI', 'PHIND', 'PE']
  X = df[features].values
  y = df.Facies.values

Since this dataset is all we have, we'd like to set aside some data to test our model on. The library we're using, scikit-learn, has functions to do this sort of thing; by default, it'll split \(X\) and \(y\) into train and test datasets, with 25% of the data going into the test part:

  X_train, X_test, y_train, y_test = train_test_split(X, y)

Now we're ready to choose a model, instantiate it (with some parameters if we want), and train the model (i.e. 'fit' the data). I am calling the trained model augur, because I like that word.

  from sklearn.ensemble import ExtraTreesClassifier
  model = ExtraTreesClassifier()
  augur = model.fit(X_train, y_train)

Now we're ready to take the part of the dataset we reserved for validation, X_test, and predict its labels. Then we can compare those with the known labels, y_test, to see how well we did:

  y_pred = augur.predict(X_test)

We can get a quick idea of the quality of prediction with sklearn.metrics.accuracy_score(y_test, y_pred), but it's more interesting to look at the classification report, which shows us the precision and recall for each class, along with their harmonic mean, the F1 score:

  from sklearn.metrics import classification_report
  print(classification_report(y_test, y_pred))

Each row is a facies (facies 1, facies 2, etc.). The support is the number of instances representing that label. The key number here is 0.63 — we can regard this as an expression of the accuracy of our prediction. If that sounds low to you, I encourage you to enter the machine learning contest! If it sounds high, that's because it is — it's much too high. In fact, the instances of our dataset are not independent: they are spatially correlated (in depth). It would be smarter not to remove some random samples for validation, but to reserve entire wells. After all, this is how we typically collect subsurface data: one well at a time.

But now we're getting into the weeds of data science. I'll let you venture in there on your own...

x lines of Python: AVO plot

Amplitude vs offset (or, more properly, angle) analysis is a core component of quantitative interpretation. The AVO method is based on the fact that the reflectivity of a geological interface does not depend only on the acoustic rock properties (velocity and density) on both sides of the interface, but also on the angle of the incident ray. Happily, this angular reflectivity encodes elastic rock property information. Long story short: AVO is awesome.

As you may know, I'm a big fan of forward modeling — predicting the seismic response of an earth model. So let's model the response the interface between a very simple model of only two rock layers. And we'll do it in only a few lines of Python. The workflow is straightforward:

  1. Define the properties of a model shale; this will be the upper layer.
  2. Define a model sandstone with brine in its pores; this will be the lower layer.
  3. Define a gas-saturated sand for comparison with the wet sand. 
  4. Define a range of angles to calculate the response at.
  5. Calculate the brine sand's response at the interface, given the rock properties and the angle range.
  6. For comparison, calculate the gas sand's response with the same parameters.
  7. Plot the brine case.
  8. Plot the gas case.
  9. Add a legend to the plot.

That's it — nine lines! Here's the result:





Once we have rock properties, the key bit is in the middle:

    θ = range(0, 31)
    shuey = bruges.reflection.shuey2(vp0, vs0, ρ0, vp1, vs1, ρ1, θ)

shuey2 is one of the many functions in bruges — here it provides the two-term Shuey approximation, but it contains lots of other useful equations. Virtually everything else in our AVO plotting routine is just accounting and plotting.

As in all these posts, you can follow along with the code in the Jupyter Notebook. You can view this on GitHub, or run it yourself in the increasingly flaky MyBinder (which is down at the time of writing... I'm working on an alternative).

What would you like to see in x lines of Python? Requests welcome!

x lines of Python: web scraping and web APIs

The Web is obviously an incredible source of information, and sometimes we'd like access to that information from within our code. Indeed, if the information keeps changing — like the price of natural gas, say — then we really have no alternative.

Fortunately, Python provides tools to make it easy to access the web from within a program. In this installment of x lines of Python, I look at getting information from Wikipedia and requesting natural gas prices from Yahoo Finance. All that in 10 lines of Python — total.

As before, there's a completely interactive, live notebook version of this post for you to run, right in your browser. Quick tip: Just keep hitting Shift+Enter to run the cells. There's also a static repo if you want to run it locally.

Geological ages from Wikipedia

Instead of writing the sentences that describe the code, I'll just show you the code. Here's how we can get the duration of the Jurassic period fresh from Wikipedia:

url = "http://en.wikipedia.org/wiki/Jurassic"
r = requests.get(url).text
start, end = re.search(r'<i>([\.0-9]+)–([\.0-9]+)&#160;million</i>', r.text).groups()
duration = float(start) - float(end)
print("According to Wikipedia, the Jurassic lasted {:.2f} Ma.".format(duration))

The output:

According to Wikipedia, the Jurassic lasted 56.30 Ma.

There's the opportunity for you to try writing a little function to get the age of any period from Wikipedia. I've given you a spot of help, and you can even complete it right in your browser — just click here to launch your own copy of the notebook.

Gas price from Yahoo Finance

url = "http://download.finance.yahoo.com/d/quotes.csv"
params = {'s': 'HHG17.NYM', 'f': 'l1'}
r = requests.get(url, params=params)
price = float(r.text)
print("Henry Hub price for Feb 2017: ${:.2f}".format(price))

Again, the output is fast, and pleasingly up-to-the-minute:

Henry Hub price for Feb 2017: $2.86

I've added another little challenge in the notebook. Give it a try... maybe you can even adapt it to find other live financial information, such as stock prices or interest rates.

What would you like to see in x lines of Python? Requests welcome!