The norm and simple solutions

Last time I wrote about different ways of calculating distance in a vector space — say, a two-dimensional Euclidean plane like the streets of Portland, Oregon. I showed three ways to reckon the distance, or norm, between two points (i.e. vectors). As a reminder, using the distance between points u and v on the map below this time:

$$ \|\mathbf{u} - \mathbf{v}\|_1 = |u_x - v_x| + |u_y - v_y| $$

$$ \|\mathbf{u} - \mathbf{v}\|_2 = \sqrt{(u_x - v_x)^2 + (u_y - v_y)^2} $$

$$ \|\mathbf{u} - \mathbf{v}\|_\infty = \mathrm{max}(|u_x - v_x|, |u_y - v_y|) $$

Let's think about all the other points on Portland's streets that are the same distance away from u as v is. Again, we have to think about what we mean by distance. If we're walking, or taking a cab, we'll need to think about \(\ell_1\) — the sum of the distances in x and y. This is shown on the left-most map, below.

For simplicity, imagine u is the origin, or (0, 0) in Cartesian coordinates. Then v is (0, 4). The sum of the distances is 4. Looking for points with the same sum, we find the pink points on the map.

If we're thinking about how the crow flies, or \(\ell_2\) norm, then the middle map sums up the situation: the pink points are all equidistant from u. All good: this is what we usually think of as 'distance'.

norms_equidistant_L0.png

The \(\ell_\infty\) norm, on the other hand, only cares about the maximum distance in any direction, or the maximum element in the vector. So all points whose maximum coordinate is 4 meet the criterion: (1, 4), (2, 4), (4, 3) and (4, 0) all work.

You might remember there was also a weird definition for the \(\ell_0\) norm, which basically just counts the non-zero elements of the vector. So, again treating u as the origin for simplicity, we're looking for all the points that, like v, have only one non-zero Cartesian coordinate. These points form an upright cross, like a + sign (right).

So there you have it: four ways to draw a circle.

Wait, what?

A circle is just a set of points that are equidistant from the centre. So, depending on how you define distance, the shapes above are all 'circles'. In particular, if we normalize the (u, v) distance as 1, we have the following unit circles:

It turns out we can define any number of norms (if you like the sound of \(\ell_{2.4}\) or \(\ell_{240}\) or \(\ell_{0.024}\)...) but most of the time, these will suffice. You can probably imagine the shapes of the unit circles defined by these other norms.
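If you'd rather plot them than imagine them, here's a minimal matplotlib sketch (not from the original post) that contours the unit 'circle' for a handful of values of p:

    import numpy as np
    import matplotlib.pyplot as plt

    # Evaluate each norm on a grid, then contour the level set where it equals 1.
    x, y = np.meshgrid(np.linspace(-1.5, 1.5, 500), np.linspace(-1.5, 1.5, 500))

    fig, ax = plt.subplots(figsize=(5, 5))
    for p in (0.5, 1, 2, 4, np.inf):
        if np.isinf(p):
            norm = np.maximum(np.abs(x), np.abs(y))
        else:
            norm = (np.abs(x)**p + np.abs(y)**p)**(1 / p)
        ax.contour(x, y, norm, levels=[1])  # the unit 'circle' for this p
    ax.set_aspect('equal')
    plt.show()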

What can we do with this stuff?

Let's think about solving equations. For example, take this one:

$$ x + 2y = 8 $$

norms_line.png

I'm sure you can come up with a solution in your head, x = 6 and y = 1 maybe. But one equation and two unknowns means that this problem is underdetermined, and consequently has an infinite number of solutions. The solutions can be visualized geometrically as a line in the Euclidean plane (right).

But let's say I don't want solutions like (3.141590, 2.429205) or (2742, –1367). Let's say I want the simplest solution. What's the simplest solution?

norms_line_l2.png

This is a reasonable question, but how we answer it depends how we define 'simple'. One way is to ask for the nearest solution to the origin. Also reasonable... but remember that we have a few different ways to define 'nearest'. Let's start with the everyday definition: the shortest crow-flies distance from the origin. The crow-flies, \(\ell_2\) distances all lie on a circle, so you can imagine starting with a tiny circle at the origin, and 'inflating' it until it touches the line \(x + 2y - 8 = 0\). This is usually called the minimum norm solution, minimized on \(\ell_2\). We can find it in Python like so:

    import numpy.linalg as la
    A = [[1, 2]]
    b = [8]
    x, *_ = la.lstsq(A, b, rcond=None)  # the solution is the first item returned

The result is the vector (1.6, 3.2). You could almost have worked that out in your head, but imagine having 1000 equations to solve and you start to appreciate numpy.linalg. Admittedly, it's even easier in Octave (or MATLAB if you must) and Julia:

    A = [1 2]
    b = [8]
    A \ b
norms_line_all.png

But remember we have lots of norms. It turns out that minimizing other norms can be really useful. For example, minimizing the \(\ell_1\) norm — growing a diamond out from the origin — results in (0, 4). The \(\ell_0\) norm gives the same sparse* result. Minimizing the \(\ell_\infty\) norm leads to \( x = y = 8/3 \approx 2.67\).
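If you want to check those numbers yourself, here's a minimal sketch using scipy's general-purpose solver; the \(\ell_1\) and \(\ell_\infty\) objectives are non-smooth (they're really linear programs), so treat this as a rough check rather than the proper way to do it:

    import numpy as np
    from scipy.optimize import minimize

    # The constraint x + 2y = 8, written as a function that must equal zero.
    constraint = {'type': 'eq', 'fun': lambda v: v[0] + 2 * v[1] - 8}

    for p in (1, 2, np.inf):
        res = minimize(lambda v: np.linalg.norm(v, ord=p),
                       x0=[1.0, 1.0],
                       constraints=[constraint])
        print(p, res.x.round(2))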

This was the diagram I wanted to get to when I started with the 'how far away is the supermarket' business. So I think I'll stop now... have fun with Norm!


* I won't get into sparsity now, but it's a big deal. People doing big computations are always looking for sparse representations of things. They use less memory, are less expensive to compute with, and are conceptually 'neater'. Sparsity is really important in compressed sensing, which has been a bit of a buzzword in geophysics lately.

The norm: kings, crows and taxicabs

How far away is the supermarket from your house? There are lots of ways of answering this question:

  • As the crow flies. This is the green line from \(\mathbf{a}\) to \(\mathbf{b}\) on the map below.

  • The 'city block' driving distance. If you live on a grid of streets, all possible routes are the same length — represented by the orange lines on the map below.

  • In time, not distance. This is usually a more useful answer... but not one we're going to discuss today.

Don't worry about the mathematical notation on this map just yet. The point is that there's more than one way to think about the distance between two points, or indeed any measure of 'size'.

norms.png

Higher dimensions

The map is obviously two-dimensional, but it's fairly easy to conceive of 'size' in any number of dimensions. This is important, because we often deal with more than the 2 dimensions on a map, or even the 3 dimensions of a seismic stack. For example, we think of raw so-called 3D seismic data as having 5 dimensions (x position, y position, offset, time, and azimuth). We might even formulate a machine learning task with a hundred or more dimensions (or 'features').

Why do we care about measuring distances in high dimensions? When we're dealing with data in these high-dimensional spaces, 'distance' is a useful way to measure the similarity between two points. For example, I might want to select those samples that are close to a particular point of interest. Or, from among the points satisfying some constraint, select the one that's closest to the origin.
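For example, here's a minimal sketch, with made-up random data, of using Euclidean distance to find the samples nearest a point of interest in a 5-dimensional feature space:

    import numpy as np

    rng = np.random.default_rng(42)
    samples = rng.normal(size=(1000, 5))   # 1000 samples with 5 'features'
    point = np.zeros(5)                    # a point of interest

    d = np.linalg.norm(samples - point, axis=1)  # Euclidean distance to every sample
    nearest = np.argsort(d)[:3]                  # indices of the 3 closest samples
    print(nearest, d[nearest])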

Definitions and nomenclature

We'll define norms in the context of linear algebra, which is the study of vector spaces (think of multi-dimensional 'data spaces' like the 5D space of seismic data). A norm is a function that assigns a positive scalar size to a vector \(\mathbf{v}\), with a size of zero reserved for the zero vector (in the Cartesian plane, the zero vector has coordinates (0, 0) and is usually called the origin). Any norm \(\|\mathbf{v}\|\) of this vector satisfies the following conditions:

  1. Absolutely homogeneous. The norm of \(\alpha\mathbf{v}\) is equal to \(|\alpha|\) times the norm of \(\mathbf{v}\).

  2. Subadditive. The norm of \( (\mathbf{u} + \mathbf{v}) \) is less than or equal to the norm of \(\mathbf{u}\) plus the norm of \(\mathbf{v}\). In other words, the norm satisfies the triangle inequality.

  3. Positive. The first two conditions imply that the norm is non-negative.

  4. Definite. Only the zero vector has a norm of 0.
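In symbols, for any vectors \(\mathbf{u}\) and \(\mathbf{v}\) and any scalar \(\alpha\), the key conditions boil down to (non-negativity follows from the first two):

$$ \|\alpha\mathbf{v}\| = |\alpha|\,\|\mathbf{v}\|, \qquad \|\mathbf{u} + \mathbf{v}\| \leq \|\mathbf{u}\| + \|\mathbf{v}\|, \qquad \|\mathbf{v}\| = 0 \iff \mathbf{v} = \mathbf{0} $$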

Kings, crows and taxicabs

Let's return to the point about lots of ways to define distance. We'll start with the most familiar definition of distance on a map — the Euclidean distance, aka the \(\ell_2\) or \(L_2\) norm (confusingly, sometimes the two is written as a superscript), the 2-norm, or sometimes just 'the norm' (who says maths has too much jargon?). This is the 'as-the-crow-flies distance' on the map above, and we can calculate it using Pythagoras:

$$ \|\mathbf{v}\|_2 = \sqrt{(a_x - b_x)^2 + (a_y - b_y)^2} $$

You can extend this to an arbitrary number of dimensions: just keep adding the squared elementwise differences. We can also calculate the norm of a single vector in n-space, which is really just the distance between the origin and the vector:

$$ \|\mathbf{u}\|_2 = \sqrt{u_1^2 + u_2^2 + \ldots + u_n^2}  = \sqrt{\mathbf{u} \cdot \mathbf{u}} $$

As shown here, the 2-norm of a vector is the square root of its dot product with itself.

So the crow-flies distance is fairly intuitive... what about that awkward city block distance? This is usually referred to as the Manhattan distance, the taxicab distance, the \(\ell_1\) or \(L_1\) norm, or the 1-norm. As you can see on the map, it's just the sum of the absolute distances in each dimension, x and y in our case:

$$ \|\mathbf{v}\|_1 = |a_x - b_x| + |a_y - b_y| $$

What's this magic number 1 all about? It turns out that the distance metric can be generalized as the so-called p-norm, where p can take any positive value up to infinity. The definition of the p-norm is consistent with the two norms we just met:

$$ \| \mathbf{u} \|_p = \left( \sum_{i=1}^n | u_i | ^p \right)^{1/p} $$

[EDIT, May 2021: This generalized version is sometimes called the Minkowski distance, e.g. in the scipy documentation.]

In practice, I've only ever seen p = 1, 2, or infinity (and 0, but we'll get to that). Let's look at the meaning of the \(\infty\)-norm, aka the \(\ell_\infty\) or \(L_\infty\) norm, which is sometimes called the Chebyshev distance or chessboard distance (because it defines the minimum number of moves for a king to any given square):

$$ \|\mathbf{v}\|_\infty = \mathrm{max}(|a_x - b_x|, |a_y - b_y|) $$

In other words, the Chebyshev distance is simply the largest absolute element in a given vector. In a nutshell, the infinitieth root of the sum of a bunch of numbers raised to the infinitieth power is the same as the infinitieth root of the largest of those numbers raised to the infinitieth power — because infinity is weird like that.
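You can check this numerically: here's a tiny sketch comparing the p-norm from the definition above with NumPy's version as p grows; the values creep towards the max:

    import numpy as np

    def p_norm(u, p):
        """The p-norm, straight from the definition above."""
        return np.sum(np.abs(u)**p)**(1 / p)

    v = np.array([3.0, -4.0, 1.0])
    for p in (1, 2, 4, 10, 100):
        print(p, p_norm(v, p), np.linalg.norm(v, ord=p))

    print('inf', np.linalg.norm(v, ord=np.inf))  # the limit the values above approach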

What about p = 0?

Infinity is weird, but so is zero sometimes. Taking the zeroeth root of a lot of ones doesn't make a lot of sense, so mathematicians often redefine the \(\ell_0\) or \(L_0\) "norm" (not a true norm) as a simple count of the number of non-zero elements in a vector. In other words, we toss out the 0th root, define \(0^0 := 0 \) and do:

$$ \| \mathbf{u} \|_0 = |u_1|^0 + |u_2|^0 + \cdots + |u_n|^0 $$

(Or, if we're thinking about the points \(\mathbf{a}\) and \(\mathbf{b}\) again, just remember that \(\mathbf{v}\) = \(\mathbf{a}\) - \(\mathbf{b}\).)

Computing norms

Let's take a quick look at computing the norm of some vectors in Python:

 
>>> import numpy as np

>>> a = np.array([1, 1]).T
>>> b = np.array([6, 5]).T

>>> L_0 = np.count_nonzero(a - b)
2

>>> L_1 = np.sum(np.abs(a - b))
9

>>> L_2 = np.sqrt((a - b) @ (a - b))
6.4031242374328485

>>> L_inf = np.max(np.abs(a - b))
5

>>> # Using NumPy's `linalg` module:
>>> import numpy.linalg as la
>>> for p in (0, 1, 2, np.inf):
...     print("L_{} norm = {}".format(p, la.norm(a - b, p)))
L_0 norm = 2.0
L_1 norm = 9.0
L_2 norm = 6.4031242374328485
L_inf norm = 5.0

What can we do with all this?

So far, so good. But what's the point of these metrics? How can we use them to solve problems? We'll get into that in a future post, so don't go too far!

For now I'll leave you to play with this little interactive demo of the effect of changing p-norms on a Voronoi triangle tiling — it's by Sarah Greer, a geophysics student at UT Austin. 


UPDATE — The next post is The norm and simple solutions, which looks at how these different norms can be used to solve real-world problems.

Hacking in Houston

geohack_2017_banner.png

Houston 2013
Houston 2014
Denver 2014
Calgary 2015
New Orleans 2015
Vienna 2016
Paris 2017
Houston 2017... The eighth geoscience hackathon landed last weekend!

We spent last weekend in hot, humid Houston, hacking away with a crowd of geoscience and technology enthusiasts. Thirty-eight hackers joined us at the top-floor coworking space, Station Houston, for fun and games and code. And tacos.

Here's a rundown of the teams and what they worked on.

Seismic Imagers

Jingbo Liu (CGG), Zohreh Souri (University of Houston).

Tech — DCGAN in Tensorflow, Amazon AWS EC2 compute.

The team looked for patterns that make seismic data different from other images, using a deep convolutional generative adversarial network (DCGAN). Using a seismic volume and a set of 2D lines, they made 121,000 sub-images (tiles) for their training set.

The Young And The RasLAS

William Sanger (Schlumberger), Chance Sanger (Museum of Fine Arts, Houston), Diego Castañeda (Agile), Suman Gautam (Schlumberger), Lanre Aboaba (University of Arkansas).

State of the art text detection by Google Cloud Vision API

Tech — Google Cloud Vision API, Python flask web app, Scatteract (sort of). Repo on GitHub.

Digitizing well logs is a common industry task, and current methods require a lot of manual intervention. The team's automated pipeline: convert PDF files to images, perform OCR with Google Cloud Vision API to extract headers and log track labels, pick curves using a CNN in TensorFlow. The team implemented the workflow in a Python flask front-end. Check out their slides.

Hutton Rocks

Kamal Hami-Eddine (Paradigm), Didi Ooi (University of Bristol), James Lowell (GeoTeric), Vikram Sen (Anadarko), Dawn Jobe (Aramco).

hutton.png

Tech — Amazon Echo Dot, Amazon AWS (RDS, Lambda).

The team built Hutton, a cloud-based cognitive assistant for gaining more efficient, better insights from geologic data. The project includes an integrated cloud-hosted database, an interactive web application for uploading new data, and a cognitive assistant for voice queries. Hutton builds upon existing Amazon Alexa skills. Check out their GitHub repo, and slides.

Big data > Big Lore 

Licheng Zhang (CGG), Zhenzhen Zhong (CGG), Justin Gosses (Valador/NASA), Jonathan Parker (Marathon)

The team used machine learning to predict formation tops on wireline logs, which would allow rapid generation of structure maps for exploration play evaluation, save man-hours, and assist with difficult formation-top correlations. The team used the AER Athabasca open dataset of 2193 wells (yay, open data!).

Tech — Jupyter Notebooks, SciPy, scikit-learn. Repo on GitHub.

Free near surface

free_surface.png

Tien-Huei Wang, Jing Wu, Clement Zhang (Schlumberger).

Multiples are a kind of undesired seismic signal and take expensive modeling to remove. The project used machine learning to identify multiples in seismic images. They attempted to use GAN frameworks, but found it difficult to formulate their problem, turning instead to the simpler problem of binary classification. Check out their slides.

Tech — CNN... I don't know the framework.

The Cowboyz

Mingliang Liu, Mohit Ayani, Xiaozheng Lang, Wei Wang (University of Wyoming), Vidal Gonzalez (Universidad Simón Bolívar, Venezuela).

A tight group of researchers joined us from the University of Wyoming at Laramie, and snagged one of the most enthusiastic hackers at the event, a student from Venezuela called Vidal. The team attempted acceleration of geostatistical seismic inversion using TensorFlow, a central theme in Mingliang's research.

Tech — TensorFlow.

Augur.ai

Altay Sensal (Geokinetics), Yan Zaretskiy (Aramco), Ben Lasscock (Geokinetics), Colin Sturm (Apache), Brendon Hall (Enthought).

augur.ai.JPG

Electrical submersible pumps (ESPs) are critical components for oil production. When they fail, they can cause significant downtime. Augur.ai provides tools to analyze pump sensor data to predict when pumps are behaving irregularly. Check out their presentation!

Tech — Amazon AWS EC2 and EFS, Plotly Dash, SigOpt, scikit-learn. Repo on GitHub.

disaster_input.png

The Disaster Masters

Joe Kington (Planet), Brendan Sullivan (Chevron), Matthew Bauer (CSM), Michael Harty (Oxy), Johnathan Fry (Chevron)

Hydrologic models predict floodplain flooding, but not local street flooding. Can we predict street flooding from LiDAR elevation data, conditioned with citizen-reported street and house flooding from U-Flood? Maybe! Check out their slides.

Tech — Python geospatial and machine learning stacks: rasterio, shapely, scipy.ndimage, scikit-learn. Repo on GitHub.

The structure does WHAT?!

Chris Ennen (White Oak), Nanne Hemstra (dGB Earth Sciences), Nate Suurmeyer (Shell), Jacob Foshee (Durwella).

Inspired by the concept of an iPhone 'face ageing' app, Nate recruited a team to poke at applying the concept to maps of the subsurface. Think of a simple map of a structural field early in its life, compared to how it looks after years of interpretation and drilling. Maybe we can preview the 'aged' appearance to help plan where best to drill next to reduce uncertainty!

Tech — OpendTect, Azure ML Studio, C#, self-boosting forest cluster. Repo on GitHub.


Thank you!

Massive thanks to our sponsors — including Pioneer Natural Resources — for their part in bringing the event to life! 

sponsors_tight.png

More thank-yous

Apart from the participants themselves, Evan and I benefitted from a team of technical support, mentors, and judges — huge thanks to all these folks:

  • The indefatigable David Holmes from Dell EMC. The man is a legend.
  • Andrea Cortis from Pioneer Natural Resources.
  • Francois Courteille and Issam Said of NVIDIA.
  • Carlos Castro, Sunny Sunkara, Dennis Cherian, Mike Lapidakis, Jit Biswas, and Rohan Mathews of Amazon AWS.
  • Maneesh Bhide and Steven Tartakovsky of SigOpt.
  • Dave Nichols and Aria Abubakar of Schlumberger.
  • Eric Jones from Enthought.
  • Emmanuel Gringarten from Paradigm.
  • Frances Buhay and Brendon Hall for help with catering and logistics.
  • The team at Station for accommodating us.
  • Frank's Pizza, Tacos-a-Go-Go, Cali Sandwich (banh mi), Abby's Cafe (bagels), and Freebird (burritos) for feeding us.

Finally, megathanks to Gram Ganssle, my Undersampled Radio co-host. Stalwart hack supporter and uber-fixer, Gram came over all the way from New Orleans to help teams make sense of deep learning architectures and generally smooth things over. We recorded an episode of UR at the hackathon, talking to Dawn Jobe, Joe Kington, and Colin Sturm about their respective projects. Check it out!


[Update, 29 Sep & 3 Nov] Some statistics from the event:

  • 39 participants, including 7 women (way too few, but better than 4 out of 63 in Paris)
  • 9 students (and 0 professors!).
  • 12 people from petroleum companies.
  • 18 people from service and technology companies, including 5 from Schlumberger!
  • 13 no-shows, not including folk who cancelled ahead of time; a bit frustrating because we had a long wait list.
  • Furthest travelled: James Lowell from Newcastle, UK — 7560 km!
  • 98 tacos, 67 burritos, 96 slices of pizza, 55 kolaches, and an untold number of banh mi.

Looking ahead to SEG

SEGAM-logo-2017.jpg

The SEG Annual Meeting is coming up. Next week sees the festival of geophysics return to the global energy capital, shaken and damp but undefeated after its recent battle with Hurricane Harvey. Even though Agile will not be at the meeting this year, I wanted to point out some highlights of the week.

The Annual Meeting

The meeting will be big, as usual: 108 talk sessions, and 50 poster and e-presentation sessions. I have no idea how many presentations we're talking about but suffice to say that there's a lot. Naturally, there's a machine learning session, with the following talks:

The Geophysics Hackathon

Even though we're not at the conference, we are in Houston this weekend — for the latest edition of the Geophysics Hackathon! The focus was set to be firmly on 'machine learning', but after the hurricane, we added the theme of 'disaster recovery and mitigation'. People are completely free to choose whatever project they'd like to work on; we'll be ready to help and advise on both topics. We also have some cool gear to play with: a Dell C4130 with 4 x NVIDIA P100s, NVIDIA Jetson TX1s, Amazon Echo Dots, and a Raspberry Shake. Many, many thanks to Dell EMC and Pioneer Natural Resources and all our other sponsors:

sponsors_tight.png

If you're one of the 70 or so people coming to this event, I'm looking forward to seeing you there... if you're not, then I'm looking forward to telling you all about it next week.


Petrel User Group

icons-petrel.png

Jacob Foshee and Durwella are hosting a Petrel User Group meetup at The Dogwood, which is in midtown (not far from downtown). If you're a user of Petrel — power user or beginner, it doesn't matter — and you're interested in making the most of technology, it'd be good to see you there. Apart from anything else, you'll get to meet Jacob, who is one of those people with technology superpowers that you never know when you might need.


Rock Physics Reception

It's on Tuesday. If you've never been to the famous Rock Physics Reception, then you're missing out. It's your best shot at bumping into the luminaries of rock physics — Colin Sayers, Stefan Gelinsky, Per Avseth, Marco Perez, Bill Goodway, Tad Smith — you know the sort of thing. If the first thing you think about when you wake up in the morning is Lamé's second parameter, RSVP right now. Hurry: there are only a handful of spots left.


There's more! Don't miss:

  • The Women's Network Breakfast on Wednesday.
  • The Wiki Committee meeting on Wednesday, 8:00 am, Hilton Room 344B.
  • If you're an SEG member, you can go to any committee meeting you like! Find one that matches your interests.

If you know of any other events, please drop them in the comments!

 

Isn't everything on the internet free?

A couple of weeks ago I wrote about a new publication from Elsevier. The book seems to contain quite a bit of unlicensed copyrighted material, collected without proper permission from public and private groups on LinkedIn, SPE papers, and various websites. I had hoped to have an update for you today, but the company is still "looking into" the matter.

The comments on that post, and on Twitter, raised some interesting views. Like most views, these views usually come in pairs. There is a segment of the community that feels quite enraged by the use of (fully attributed) LinkedIn comments in a book; but many people hold the opposing view, that everything on the Internet is fair game.

I sympathise with this permissive view, to an extent. If you put stuff on the web, people are (one hopes) going to see it, interpret it, and perhaps want to re-use it. If they do re-use it, they may do so in ways you did not expect, or perhaps even disagree with. This is okay — this is how ideas develop. 

I mean, if I can't use a properly attributed LinkedIn post as the basis for a discussion, or a YouTube video to illustrate a point, then what's really the point of those platforms? It would undermine the idea of the web as a place for interaction and collaboration, for cultural or scientific evolution. 

Freely accessible but not free

Not to labour the point, but I think we all understand that what we put on the Internet is 'out there'. Indeed, some security researchers suggest you should assume that every email you type will be in the local newspaper tomorrow morning. This isn't just 'a feeling', it's built into how the web works. Most websites are exclusively composed of strictly copyrighted content, but most websites also have conspicuous buttons to share that copyrighted content — Tweet this, Pin that, or whatever. The signals are confusing... do you want me to share this or not?

One can definitely get carried away with the idea that everything should be free. There's a spectrum of infractions. On the 'everyday abuse' end of things, we have the point of view that grabbing random images from the web and putting the URL at the bottom is 'good enough'. Based on papers at conferences, I suspect that most people think this and, as I explained before, it's definitely not true: you usually need permission.

At the other end of the scale, you end up with Sci-Hub (which sounds like it's under pressure to close at the moment) and various book-sharing sites, both of which I think are retrograde and anti-open-access (as well as illegal). I believe we should respect the copyright of others — even that of supposedly evil academic publishers — if we want others to respect ours.

So what's the problem with a bookful of LinkedIn posts and other dubious content? Leaving aside for now the possibility of more serious plagiarism, I think the main problem is simply that the author went too far — it is a wholesale rip-off of 350 people's work, not especially well done, with no added value, and sold for a hefty sum.

Best practice for re-using stuff on the web

So how do we know what is too far? Is it just a value judgment? How do you re-use stuff on the web properly? My advice:

  • Stop it. Resist the temptation to Google around, grabbing whatever catches your eye.
  • Re-use sparingly, only using one or two of the real gems. Do you really need that picture of a casino on your slide entitled "Risk and reward"? (No, you definitely don't.)
  • Make your own. Ideas are not copyrightable, so it might be easier to copy the idea and make the thing you want yourself (giving credit where it's due, of course).
  • Ask for permission from the creator if you do use someone's stuff. Like I said before, this is only fair and right.
  • Go open! Preferentially share things by people who seem to be into sharing their stuff.
  • Respect the work. Make other people's stuff look awesome. You might even...
  • ...improve the work if you can — redraw a diagram, fix a typo — then share it back to them and the community.
  • Add value. Add real insight, combine things in new ways, surprise and delight the original creators.
  • And finally, if you're not doing any of these things, you better not be trying to profit from it. 

Everything on the Internet is not free. My bet is that you'll be glad of this fact when you start putting your own stuff out there. We can all do our homework and model good practice. This is especially important for those people in influential positions in academia, because their behaviours rub off on so many impressionable people. 


We talked to Fernando Enrique Ziegler on the Undersampled Radio podcast last week. He was embroiled in the 'bad book' furore too, in fact he brought it to many people's attention. So this topic came up in the show, as well as a lot of stuff about pore pressure and hurricanes. Check it out...

x lines of Python: Global seismic data

Today we'll look at finding and analysing global seismology data with Python and the wonderful seismology package ObsPy, from Moritz Beyreuther, Lion Krischer, and others originally at the Geophysical Observatory in Munich.

We've used ObsPy before to load SEG-Y files into Python, but that's not its core purpose. These tools are typically used by global seismologists and earthquake scientists, but we're going to download and analyse data from three non-earthquakes:

  1. A curious landslide and tsunami in Greenland.
  2. The recent nuclear bomb test in North Korea.
  3. Hurricane Irma's passage through the Caribbean.

We'll also look at an actual earthquake. This morning there was a very large earthquake off Mexico, killing at least 15 people. It's the first M8+ earthquake anywhere since the Illapel event, Chile, on 16 September 2015.

Only 4 lines?

Once you have ObsPy, only 4 lines of code (not counting imports) are needed to download and plot a seismic trace. Here's how to instantiate the ObsPy client using the IRIS data service, then get 5 minutes of waveform data from the Mudanjiang or MDJ station on the IC network, the New China Digital Seismograph Network, and finally plot it:

from obspy.clients.fdsn import Client
client = Client("IRIS")

from obspy import UTCDateTime
t = UTCDateTime("2017-09-03_03:30:00")
st = client.get_waveforms("IC", "MDJ", "00", "BHZ", t, t + 5*60)
st.plot()  
ObsPy_IC-MDJ.png

Pretty awesome, right? One day getting seismic and well data will be this simple! LOL
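If you want to poke at the signal a bit more, here's one possible next step, a minimal sketch (not in the original notebook) that carries on with the st Stream we just downloaded; the bandpass corner frequencies are only a guess:

    st_filt = st.copy()                                   # keep the raw data intact
    st_filt.filter("bandpass", freqmin=1.0, freqmax=8.0)  # emphasize the short periods
    st_filt.plot()
    st_filt[0].spectrogram()                              # spectrogram of the first trace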


Check out the Jupyter Notebook! I cannot get this notebook to run on Azure Notebooks I'm afraid, so the only way to run it is to set up Python and Jupyter (best way: install Canopy or Anaconda) on your machine. I urge you to give it a go, because what could be more fun than playing around with decades of seismic data from all over the world?

90 years of well logs

Today is the 90th anniversary of the first well log. On 5 September 1927, three men from Schlumberger logged the Diefenbach [sic] well 2905 at Dieffenbach-lès-Wœrth in the Pechelbronn heavy oil field in the Alsace region of France.

The site of the Diefenbach 2905 well. © Google, according to terms.

 
Pechelbronn_log_plot.png

The geophysical services company Société de Prospection Électrique (Procédés Schlumberger), or PROS, had only formed in July 1926 but already had sixteen employees. Headquartered in Paris at 42, rue Saint-Dominique, the company was attempting to turn its resistivity technology to industrial applications, especially mining and petroleum. Having had success with horizontal surface measurements, the Diefenbach well was the first attempt to measure resistivity in a wellbore. PROS went on to become Schlumberger.

The resistivity prospecting system had been designed by the Schlumberger brothers, Conrad (1878–1936, a professor at École des Mines) and Maurice (1884–1953, a mining engineer), over the period from about 1912 until 1923. The task of adapting the technology was given to Henri Doll (1902–1991), Conrad's son-in-law since 1923, and the Alsatian well was to be the first field test of the so-called "electrical coring" method. The client was Deutsche Erdöl Aktiengesellschaft, now DEA of Hamburg, Germany.

As far as I can tell, the well — despite usually being called "the Pechelbronn well" — was located at the site of a monument at the intersection of Route de Wœrth with Rue de Preuschdorf in Dieffenbach-lès-Wœrth, about 3 km west of Merkwiller-Pechelbronn. Henri Doll logged the well with Roger Jost and Charles Scheibli. Using rudimentary equipment, they logged about 145 m of the 488-metre hole, starting at 279 m MD, taking a reading every metre and plotting the log by hand. Yesterday I digitized this log; download it in LAS format here.
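If you download it, here's a minimal sketch of loading the log with the lasio library (the filename is just whatever you saved it as):

    import lasio

    las = lasio.read("pechelbronn.las")  # hypothetical filename
    print(las.curves)                    # the curves the file contains
    df = las.df()                        # a pandas DataFrame indexed by depth
    df.plot(subplots=True)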


Pechelbronn_thumbnail.png

The story of what the Schlumberger brothers and Henri Doll achieved is fascinating; I recommend reading Don Hill's brief history (2012) — it's free to read at Wiley. The period of invention that followed the Pechelbronn success was inspiring.

If you're looking at well logs today, take a second to thank Conrad, Maurice, and Henri for their remarkable idea.

PS If you're interested in petroleum history, the AOGHS page This Week is worth a look.


The French television programme Midi en France recorded this segment about the Pechelbronn field in 2014. The narration is in French, "The fields of maize gorge on sunshine, the pumps on petroleum...", but there are some nice pictures to look at.

References and bibliography

Clapp, Frederick G (1932). Oil and gas possibilities of France. AAPG Bulletin 16 (11), 1092–1143. Contains a good history of exploration and production from the Oligocene sands in Pechelbronn, up to about 1931 (the field produced up to 1970). AAPG Datapages.

Delacour, Jacques (2003). Une technique de prospection minière et pétrolière née en Pays d'Auge. SABIX 34, September 2003. Available online.

École des Mines page on Conrad Schlumberger at annales.org.

Hill, DG (2012). Appendix A: Historical Review (Milestone Developments in Petrophysics). In: Buryakovsky, L, Chilingar, GV, Rieke, HH, and Shin, S (2012). Petrophysics: Fundamentals of the Petrophysics of Oil and Gas Reservoirs, John Wiley & Sons, Inc., Hoboken, NJ, USA. doi: 10.1002/9781118472750.app1. A nice potted history of well logging, including important dates.

Musée Français du Pétrole website, http://www.musee-du-petrole.com/historique/

Pike, B and Duey, R (2002). Logging history rich with innovation. Hart's E&P Magazine. September 2002. Available online. Interesting article, but beware: there are one or two inaccuracies in this article, and I believe the image of the well log is incorrect.

Attribution is not permission

Onajite_cover.png

This morning a friend of mine, Fernando Enrique Ziegler, a pore pressure researcher and practitioner in Houston, let me know about an "interesting" new book from Elsevier: Practical Solutions to Integrated Oil and Gas Reservoir Analysis, by Enwenode Onajite, a geophysicist in Nigeria... And about 350 other people.

What's interesting about the book is that the majority of the content was not written by Onajite, but was copy-and-pasted from discussions on LinkedIn. A novel way to produce a book, certainly, but is it... legal?

Who owns the content?

Before you read on, you might want to take a quick look at the way the book presents the LinkedIn material. Check it out, then come back here. By the way, if LinkedIn wasn't so damn difficult to search, or if the book included a link or some kind of proper citation of the discussion, I'd show you a conversation in LinkedIn too. But everything is completely untraceable, so I'll leave it as an exercise to the reader.

LinkedIn's User Agreement is crystal clear about the ownership of content its users post there:

[...] you own the content and information that you submit or post to the Services and you are only granting LinkedIn and our affiliates the following non-exclusive license: A worldwide, transferable and sublicensable right to use, copy, modify, distribute, publish, and process, information and content that you provide through our Services [...]

This is a good user agreement [Edit: see UPDATE, below]. It means everything you write on LinkedIn is © You — unless you choose to license it to others, e.g. under the terms of Creative Commons (please do!).

Fernando — whose material was used in the book — tells me that none of the several other authors he has asked gave, or were even asked for, permission to re-use their work. So I think we can say that this book represents a comprehensive infringement of copyright of the respective authors of the discussions on LinkedIn.

Roles and responsibilities

Given the scale of this infringement, I think there's a clear lack of due diligence here on the part of the publisher and the editors. Having said that, while publishers are quick to establish their copyright on the material they publish, I would say that this lack of diligence is fairly normal. Publishers tend to leave this sort of thing to the author, hence the standard "Every effort has been made..." disclaimer you often find in non-fiction books... though not, apparently, in this book (perhaps because zero effort has been made!).

But this defence doesn't wash: Elsevier is the copyright holder here (Onajite signed it over to them, as most authors do), so I think the buck stops with them. Indeed, you can be sure that the company will make most of the money from the sale of this book — the author will be lucky to get 5% of gross sales, so the buck is both figurative and literal.

Incidentally, in Agile's publishing house, Agile Libre, authors retain copyright, but we take on the responsibility (and cost!) of seeking permissions for re-use. We do this because I consider it to be our reputation at stake, as much as the author's.

OK, so we should blame Elsevier for this book. Could Elsevier argue that it's really no different from quoting from a published research paper, say? Few researchers ask publishers or authors if they can do this — especially in the classroom, "for educational purposes", as if it is somehow exempt from copyright rules (it isn't). It's just part of the culture — an extension of the uneducated (uninterested?) attitude towards copyright that prevails in academia and industry. Until someone infringes your copyright, at least.

Seek permission not forgiveness

I notice that in the Acknowledgments section of the book, Onajite does what many people do — he gives acknowledgement ("for their contributions", he doesn't say they were unwitting) to some of the authors of the content. Asking for forgiveness, as it were (but not really). He lists the rest at the back. It's normal to see this sort of casual hat tip in presentations at conferences — someone shows an unlicensed image they got from Google Images, slaps "Courtesy of A Scientist" or a URL at the bottom, and calls it a day. It isn't good enough: attribution is not permission. The word "courtesy" implies that you had some.

Indeed, most of the figures in Onajite's book seem to have been procured from elsewhere, with "Courtesy ExxonMobil" or whatever passing as a pseudolicense. If I was a gambler, I would bet that the large majority were used without permission.

OK, you're thinking, where's this going? Is it just a rant? Here's the bottom line:

The only courteous, professional and, yes, legal way to re-use copyrighted material — which is "anything someone created", more or less — is to seek written permission. It's that simple.

A bit of a hassle? Indeed it is. Time-consuming? Yep. The good news is that you'll usually get a "Sure! Thanks for asking". I can count on one hand the number of times I've been refused.

The only exceptions to the rule are when:

  • The copyrighted material already carries a license for re-use (as Agile does — read the footer on this page).
  • The copyright owner explicitly allows re-use in their terms and conditions (for example, allowing the re-publication of single figures, as some journals do).
  • The law allows for some kind of fair use, e.g. for the purposes of criticism.

In these cases, you do not need to ask, just be sure to attribute everything diligently.

A new low in scientific publishing?

What now? I believe Elsevier should retract this potentially useful book and begin the long process of asking the 350 authors for permission to re-use the content. But I'm not holding my breath.

By a very rough count of the preview of this $130 volume in Google Books, it looks like the ratio of LinkedIn chat to original text is about 2:1. Whatever the copyright situation, the book is definitely an uninspiring turn for scientific publishing. I hope we don't see more like it, but let's face it: if a massive publishing conglomerate can make $87 from comments on LinkedIn, it's gonna happen.

What do you think about all this? Does it matter? Should Elsevier do something about it? Let us know in the comments.


UPDATE Friday 1 September

Since this is a rather delicate issue, and events are still unfolding, I thought I'd post some updates from Twitter and the comments on this post:

  • Elsevier is aware of these questions and is looking into it.
  • Re-read the user agreement quote carefully. As Ronald points out below, I was too hasty — it's really not a good user agreement, LinkedIn have a lot of scope to re-use what you post there. 
  • It turns out that some people were asked for permission, though it seems it was unclear what they were agreeing to. So the author knew that seeking permission was a good idea.
  • It also turns out that at least one SPE paper was reproduced in the book, in a rather inconspicuous way. I don't know if SPE granted rights for this, but the author at least was not identified.
  • Some people are throwing the word 'plagiarism' around, which is rather a serious word. I'm personally willing to ascribe it to 'normal industry practices' and sloppy editing and reviewing (the book was apparently reviewed by no fewer than 5 people!). And, at least in the case of the LinkedIn content, proper attribution was made. For me, this is more about honesty, quality, and value in scientific publishing than about misconduct per se.
  • It's worth reading the comments on this post. People are raising good points.

Part of the thumbnail image was created by Jannoon028 — Freepik.com — and licensed CC-BY.

x lines of Python: read and write CSV

A couple of weeks ago, in Murphy's Law for Excel, I wrote about the dominance of spreadsheets in applied analysis, and how they may be getting out of hand. Then in Organizing spreadsheets I wrote about how — if you are going to store data in spreadsheets — to organize your data so that you do the least amount of damage. The general goal being to make your data machine-readable. Or, to put it another way, to allow you to save your data as comma-separated values or CSV files.

CSV is the de facto standard way to store data in text files. CSV files are human-readable, easy to parse with multiple tools, and they compress easily. So you need to know how to read and write them in your analysis tool of choice. In our case, this is the Python language. So today I present a few different ways to get at data stored in CSV files.

How many ways can I read thee?

In the accompanying Jupyter Notebook, we read a CSV file into Python in six different ways:

  1. Using the pandas data analysis library. It's the easiest way to read CSV and XLS data into your Python environment...
  2. ...and can happily consume a file on the web too. Another nice thing about pandas. It also writes CSV files very easily.
  3. Using the built-in csv package. There are a couple of standard ways to do this — csv.reader...
  4. ...and csv.DictReader. This library is handy for when you don't have (or don't want) pandas.
  5. Using numpy, the numeric library for Python. If you just have a CSV full of numbers and you want an array in the end, you can skip pandas.
  6. OK, it's not really a CSV file, but for the finale we read a spreadsheet directly from Google Sheets.

I usually count my lines diligently in these posts, but not this time. With pandas you're looking at a one-liner to read your data:

df = pd.read_csv("myfile.csv")

and a one-liner to write it out again. With csv.DictReader you're looking at 3 lines to get a list of dicts (but watch out: your numbers will be strings). Reading a Google Doc is a little more involved, not least because you'll need to set up an app and get an API key to handle authentication.
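For reference, here's a minimal sketch of a few of these approaches, assuming a file called data.csv with a header row and numeric columns:

    import csv
    import numpy as np
    import pandas as pd

    # pandas: read, then write back out (index=False skips the row index).
    df = pd.read_csv("data.csv")
    df.to_csv("copy_of_data.csv", index=False)

    # csv.DictReader: a list of dicts, one per row (values are strings).
    with open("data.csv") as f:
        records = list(csv.DictReader(f))

    # numpy: skip the header row and load the numbers as a float array.
    arr = np.loadtxt("data.csv", delimiter=",", skiprows=1)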

That's all there is to CSV files. Go forth and wield data like a pro! 

Next time in the xlines of Python series we'll look at reading seismic station data from the web, and doing a bit of time-series analysis on it. No more stuff about spreadsheets and CSV files, I promise :)


The thumbnail image is based on the possibly apocryphal banksy image of an armed panda, and one of texturepalace.com's CC-BY textures.

Organizing spreadsheets

A couple of weeks ago I alluded to ill-formed spreadsheets in my post Murphy's Law for Excel. Spreadsheets are clearly indispensable, and are definitely great for storing data and checking CSV files. But some spreadsheets need to die a horrible death. I'm talking about spreadsheets that look like this (click here for the entire sheet):

Bad_spreadsheet_3.png

This spreadsheet has several problems. Among them:

  • The position of a piece of data changes how I interpret it. E.g. a blank row means 'new sheet' or 'new well'.
  • The cells contain a mixture of information (e.g. 'Site' and the actual data) and appear in varying units.
  • Some information is encoded by styles (e.g. using red to denote a mineral species). If you store your sheet as a CSV (which you should), this information will be lost.
  • Columns are hidden, there are footnotes, it's just a bit gross.

Using this spreadsheet to make plots, or reading it with software, will be a horrible experience. I will probably swear at my computer, suffer a repetitive strain injury, and go home early with a headache, cursing the muppet that made the spreadsheet in the first place. (Admittedly, I am the muppet that made this spreadsheet in this case, but I promise I did not invent these pathologies. I have seen them all.)

Let's make the world a better place

Consider making separate sheets for the following:

  • Raw data. This is important. See below.
  • Computed columns. There may be good reasons to keep these with the data.
  • Charts.
  • 'Tabulated' data, like my bad spreadsheet above, with tables meant for summarization or printing.
  • Some metadata, either in the file properties or a separate sheet. Explain the purpose of the dataset, any major sources, important assumptions, and your contact details.
  • A rich description of each column, with its caveats and assumptions.

The all-important data sheet has its own special requirements. Here's my guide for a pain-free experience:

  • No computed fields or plots in the data sheet.
  • No hidden columns.
  • No semantic meaning in formatting (e.g. highlighting cells or bolding values).
  • Headers in the first row, only data in all the other rows.
  • The column headers should contain only a unique name and [units], e.g. Depth [m], Porosity [v/v].
  • Only one type of data per column: text OR numbers, discrete categories OR continuous scalars.
  • No units in numeric data cells, only quantities. Record depth as 500, not 500 m.
  • Avoid keys or abbreviations: use Sandstone, Limestone, Shale, not Ss, Ls, Sh.
  • Zero means zero, empty cell means no data.
  • Only one unit per column. (You only use SI units, right?)
  • Attribution! Include a citation or citations for every record.
  • If you have two distinct types or sources of data, e.g. grain size from sieve analysis and grain size from photomicrographs, then use two different columns.
  • Personally, I like the data sheet to be the first sheet in the file, but maybe that's just me.
  • Check that it turns into a valid CSV so you can use this awesome format (a quick check is sketched below).
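Here's that quick check as a minimal pandas sketch, assuming you've exported the data sheet to data.csv:

    import pandas as pd

    df = pd.read_csv("data.csv")
    print(df.dtypes)        # numeric columns that show up as 'object' usually mean
                            # units, notes, or keys have crept into the data
    print(df.isna().sum())  # empty cells (missing data) per column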

After all that, here's what we have (click here for the entire sheet):

The same data as the first image, but improved. The long strings in columns 3 and 4 are troublesome, but we can tolerate them. Click to enlarge.

Maybe the 'clean' analysis-friendly sheet looks boring to you, but to me it looks awesome. Above all, it's easy to use for SCIENCE! And I won't have to go home with a headache.


The data in this post came from this Cretaceous shale dataset [XLS file] from the government of Manitoba. Their spreadsheet is pretty good and only breaks a couple of my golden rules. Here's my version, with the broken and fixed spreadsheets shown here. Let me know if you spot something else that should be fixed!