Looking ahead to SEG

SEGAM-logo-2017.jpg

The SEG Annual Meeting is coming up. Next week sees the festival of geophysics return to the global energy capital, shaken and damp but undefeated after its recent battle with Hurricane Harvey. Even though Agile will not be at the meeting this year, I wanted to point out some highlights of the week.

The Annual Meeting

The meeting will be big, as usual: 108 talk sessions, and 50 poster and e-presentation sessions. I have no idea how many presentations we're talking about but suffice to say that there's a lot. Naturally, there's a machine learning session, with the following talks:

The Geophysics Hackathon

Even though we're not at the conference, we are in Houston this weekend — for the latest edition of the Geophysics Hackathon! The focus was set to be firmly on 'machine learning', but after the hurricane, we added the theme of 'disaster recovery and mitigation'. People are completely free to choose whatever project they'd like to work on; we'll be ready to help and advise on both topics. We also have some cool gear to play with: a Dell C4130 with 4 x NVIDIA P100s, NVIDIA Jetson TX1s, Amazon Echo Dots, and a Raspberry Shake. Many, many thanks to Dell EMC and Pioneer Natural Resources and all our other sponsors:

sponsors_tight.png

If you're one of the 70 or so people coming to this event, I'm looking forward to seeing you there... if you're not, then I'm looking forward to telling you all about it next week.


Petrel User Group

icons-petrel.png

Jacob Foshee and Durwella are hosting a Petrel User Group meetup at The Dogwood, which is in midtown (not far from downtown). If you're a user of Petrel — power user or beginner, it doesn't matter — and you're interested in making the most of technology, it'd be good to see you there. Apart from anything else, you'll get to meet Jacob, who is one of those people with technology superpowers that you never know when you might need.


Rock Physics Reception

Tuesday If you've never been to the famous Rock Physics Reception, then you're missing out. It's your best shot at bumping into the luminaries of rock physics — Colin Sayers, Stefan Gelinsky, Per Avseth, Marco Perez, Bill Goodway, Tad Smith — you know the sort of thing. If the first thing you think about when you wake up in the morning is Lamé's second parameter, RSVP right now. Hurry: there are only a handful of spots left.


There's more! Don't miss:

  • The Women's Network Breakfast on Wednesday.
  • The Wiki Committee meeting on Wednesday, 8:00 am, Hilton Room 344B.
  • If you're an SEG member, you can go to any committee meeting you like! Find one that matches your interests.

If you know of any other events, please drop them in the comments!

 

Isn't everything on the internet free?

A couple of weeks ago I wrote about a new publication from Elsevier. The book seems to contain quite a bit of unlicensed copyrighted material, collected without proper permission from public and private groups on LinkedIn, SPE papers, and various websites. I had hoped to have an update for you today, but the company is still "looking into" the matter.

The comments on that post, and on Twitter, raised some interesting views. Like most views, these views usually come in pairs. There is a segment of the community that feels quite enraged by the use of (fully attributed) LinkedIn comments in a book; but many people hold the opposing view, that everything on the Internet is fair game.

I sympathise with this permissive view, to an extent. If you put stuff on the web, people are (one hopes) going to see it, interpret it, and perhaps want to re-use it. If they do re-use it, they may do so in ways you did not expect, or perhaps even disagree with. This is okay — this is how ideas develop. 

I mean, if I can't use a properly attributed LinkedIn post as the basis for a discussion, or a YouTube video to illustrate a point, then what's really the point of those platforms? It would undermine the idea of the web as a place for interaction and collaboration, for cultural or scientific evolution. 

Freely accessible but not free

Not to labour the point, but I think we all understand that what we put on the Internet is 'out there'. Indeed, some security researchers suggest you should assume that every email you type will be in the local newspaper tomorrow morning. This isn't just 'a feeling', it's built into how the web works. most websites are exclusively composed of strictly copyrighted content, but most websites also have conspicuous buttons to share that copyrighted content — Tweet this, Pin that, or whatever. The signals are confusing... do you want me to share this or not? 

One can definitely get carried away with the idea that everything should be free. There's a spectrum of infractions. On the 'everyday abuse' end of things, we have the point of view that grabbing randoms images from the web and putting the URL at the bottom is 'good enough'. Based on papers at conferences, I suspect that most people think this and, as I explained before, it's definitely not true: you usually need permission. 

At the other end of the scale, you end up with Sci-Hub (which sounds like it's under pressure to close at the moment) and various book-sharing sites, both of which I think are retrograde and anti-open-access (as well as illegal). I believe we should respect the copyright of others — even that of supposedly evil academic publishers — if we want others to respect ours.

So what's the problem with a bookful of LinkedIn posts and other dubious content? Leaving aside for now the possibility of more serious plagiarism, I think the main problem is simply that the author went too far — it is a wholesale rip-off of 350 people's work, not especially well done, with no added value, and sold for a hefty sum.

Best practice for re-using stuff on the web

So how do we know what is too far? Is it just a value judgment? How do you re-use stuff on the web properly? My advice:

  • Stop it. Resist the temptation to Google around, grabbing whatever catches your eye.
  • Re-use sparingly, only using one or two of the real gems. Do you really need that picture of a casino on your slide entitled "Risk and reward"? (No, you definitely don't.)
  • Make your own. Ideas are not copyrightable, so it might be easier to copy the idea and make the thing you want yourself (giving credit where it's due, of course).
  • Ask for permission from the creator if you do use someone's stuff. Like I said before, this is only fair and right.
  • Go open! Preferentially share things by people who seem to be into sharing their stuff.
  • Respect the work. Make other people's stuff look awesome. You might even...
  • ...improve the work if you can — redraw a diagram, fix a typo — then share it back to them and the community.
  • Add value. Add real insight, combine things in new ways, surprise and delight the original creators.
  • And finally, if you're not doing any of these things, you better not be trying to profit from it. 

Everything on the Internet is not free. My bet is that you'll be glad of this fact when you start putting your own stuff out there. We can all do our homework and model good practice. This is especially important for those people in influential positions in academia, because their behaviours rub off on so many impressionable people. 


We talked to Fernando Enrique Ziegler on the Undersampled Radio podcast last week. He was embroiled in the 'bad book' furore too, in fact he brought it to many people's attention. So this topic came up in the show, as well as a lot of stuff about pore pressure and hurricanes. Check it out...

x lines of Python: Global seismic data

Today we'll look at finding and analysing global seismology data with Python and the wonderful seismology package ObsPy, from Moritz Beyreuther, Lion Krischer, and others originally at the Geophysical Observatory in Munich.

We've used ObsPy before to load SEG-Y files into Python, but that's not its core purpose. These tools are typically used by global seismologists and earthquake scientists, but we're going to download and analyse data from three non-earthquakes:

  1. A curious landslide and tsunami in Greenland.
  2. The recent nuclear bomb test in North Korea.
  3. Hurricane Irma's passage through the Caribbean.

We'll also look at an actual earthquake. This morning there was a very large earthquake off Mexico, killing at least 15 people. It's the first M8+ earthquake anywhere since the Illapel event, Chile, on 16 September 2015.

Only 4 lines?

Once you have ObsPy, only 4 lines of code (not counting imports) are needed to download and plot a seismic trace. Here's how to instantiate the ObsPy client using the IRIS data service, then get 5 minutes of waveform data from the Mudanjiang or MDJ station on the IC network, the New China Digital Seismograph Network, and finally plot it:

from obspy.clients.fdsn import Client
client = Client("IRIS")

from obspy import UTCDateTime
t = UTCDateTime("2017-09-03_03:30:00")
st = client.get_waveforms("IC", "MDJ", "00", "BHZ", t, t + 5*60)
st.plot()  
ObsPy_IC-MDJ.png

Pretty awesome, right? One day getting seismic and well data will be this simple! LOL


Check out the Jupyter Notebook! I cannot get this notebook to run on Azure Notebooks I'm afraid, so the only way to run it is to set up Python and Jupyter (best way: install Canopy or Anaconda) on your machine. I urge you to give it a go, because what could be more fun than playing around with decades of seismic data from all over the world?

90 years of well logs

Today is the 90th anniversary of the first well log. On 5 September 1927, three men from Schlumberger logged the Diefenbach [sic] well 2905 at Dieffenbach-lès-Wœrth in the Pechelbronn heavy oil field in the Alsace region of France.

The site of the Diefenbach 2905 well. © Google, according to terms.

The site of the Diefenbach 2905 well. © Google, according to terms.

 
Pechelbronn_log_plot.png

The geophysical services company Société de Prospection Électrique (Processes Schlumberger), or PROS, had only formed in July 1926 but already had sixteen employees. Headquartered in Paris at 42, rue Saint-Dominique, the company was attempting to turn its resistivity technology to industrial applications, especially mining and petroleum. Having had success with horizontal surface measurements, the Diefenbach well was the first attempt to measure resistivity in a wellbore. PROS went on to become Schlumberger.

The resistivity prospecting system had been designed by the Schlumberger brothers, Conrad (1878–1936, a professor at École des Mines) and Maurice (1884–1953, a mining engineer), over the period from about 1912 until 1923. The task of adapting the technology was given to Henri Doll (1902–1991), Conrad's son-in-law since 1923, and the Alsatian well was to be the first field test of the so-called "electrical coring" method. The client was Deutsche Erdöl Aktiengesellschaft, now DEA of Hamburg, Germany.

As far as I can tell, the well — despite usually being called "the Pechelbronn well" — was located at the site of a monument at the intersection of Route de Wœrth with Rue de Preuschdorf in Dieffenbach-lès-Wœrth, about 3 km west of Merkwiller-Pechelbronn. Henri Doll logged the well with Roger Jost and Charles Scheibli. Using rudimentary equipment, they logged about 145 m of the 488-metre hole, starting at 279 m MD, taking a reading every metre and plotting the log by hand. Yesterday I digitized this log; download it in LAS format here


Pechelbronn_thumbnail.png

The story of what the Schlumberger brothers and Henri Doll achieved is fascinating; I recommend reading Don Hill's brief history (2012) — it's free to read at Wiley. The period of invention that followed the Pechelbronn success was inspiring.

If you're looking at well logs today, take a second to thank Conrad, Maurice, and Henri for their remarkable idea.

PS If you're interested in petroleum history, the AOGHS page This Week is worth a look.


The French television programme Midi en France recorded this segment about the Pechelbronn field in 2014. The narration is in French, "The fields of maize gorge on sunshine, the pumps on petroleum...", but there are some nice pictures to look at.

References and bibliography

Clapp, Frederick G (1932). Oil and gas possibilities of France. AAPG Bulletin 16 (11), 1092–1143. Contains a good history of exploration and production from the Oligocene sands in Pechelbronn, up to about 1931 (the field produced up to 1970). AAPG Datapages.

Delacour, Jacques (2003). Une technique de prospection minière et pétrolière née en Pays d'Auge. SABIX 34, September 2003. Available online.

École des Mines page on Conrad Schlumberger at annales.org.

Hill, DG (2012). Appendix A: Historical Review (Milestone Developments in Petrophysics). In: Buryakovsky, L, Chilingar, GV, Rieke, HH, and Shin, S (2012). Petrophysics: Fundamentals of the Petrophysics of Oil and Gas Reservoirs, John Wiley & Sons, Inc., Hoboken, NJ, USA. doi: 10.1002/9781118472750.app1. A nice potted history of well logging, including important dates.

Musée Français du Pétrole website, http://www.musee-du-petrole.com/historique/

Pike, B and Duey, R (2002). Logging history rich with innovation. Hart's E&P Magazine. September 2002. Available online. Interesting article, but beware: there are one or two inaccuracies in this article, and I believe the image of the well log is incorrect.

Attribution is not permission

Onajite_cover.png

This morning a friend of mine, Fernando Enrique Ziegler, a pore pressure researcher and practitioner in Houston, let me know about an "interesting" new book from Elsevier: Practical Solutions to Integrated Oil and Gas Reservoir Analysis, by Enwenode Onajite, a geophysicist in Nigeria... And about 350 other people.

What's interesting about the book is that the majority of the content was not written by Onajite, but was copy-and-pasted from discussions on LinkedIn. A novel way to produce a book, certainly, but is it... legal?

Who owns the content?

Before you read on, you might want to take a quick look at the way the book presents the LinkedIn material. Check it out, then come back here. By the way, if LinkedIn wasn't so damn difficult to search, or if the book included a link or some kind of proper citation of the discussion, I'd show you a conversation in LinkedIn too. But everything is completely untraceable, so I'll leave it as an exercise to the reader.

LinkedIn's User Agreement is crystal clear about the ownership of content its users post there:

[...] you own the content and information that you submit or post to the Services and you are only granting LinkedIn and our affiliates the following non-exclusive license: A worldwide, transferable and sublicensable right to use, copy, modify, distribute, publish, and process, information and content that you provide through our Services [...]

This is a good user agreement [Edit: see UPDATE, below]. It means everything you write on LinkedIn is © You — unless you choose to license it to others, e.g. under the terms of Creative Commons (please do!).

Fernando — whose material was used in the book — tells me that none of the several other authors he has asked gave, or were even asked for, permission to re-use their work. So I think we can say that this book represents a comprehensive infringement of copyright of the respective authors of the discussions on LinkedIn.

Roles and reponsibilities

Given the scale of this infringement, I think there's a clear lack of due diligence here on the part of the publisher and the editors. Having said that, while publishers are quick to establish their copyright on the material they publish, I would say that this lack of diligence is fairly normal. Publishers tend to leave this sort of thing to the author, hence the standard "Every effort has been made..." disclaimer you often find in non-fiction books... though not, apparently, in this book (perhaps because zero effort has been made!).

But this defence doesn't wash: Elsevier is the copyright holder here (Onajite signed it over to them, as most authors do), so I think the buck stops with them. Indeed, you can be sure that the company will make most of the money from the sale of this book — the author will be lucky to get 5% of gross sales, so the buck is both figurative and literal.

Incidentally, in Agile's publishing house, Agile Libre, authors retain copyright, but we take on the responsibility (and cost!) of seeking permissions for re-use. We do this because I consider it to be our reputation at stake, as much as the author's.

OK, so we should blame Elsevier for this book. Could Elsevier argue that it's really no different from quoting from a published research paper, say? Few researchers ask publishers or authors if they can do this — especially in the classroom, "for educational purposes", as if it is somehow exempt from copyright rules (it isn't). It's just part of the culture — an extension of the uneducated (uninterested?) attitude towards copyright that prevails in academia and industry. Until someone infringes your copyright, at least.

Seek permission not forgiveness

I notice that in the Acknowledgments section of the book, Onajite does what many people do — he gives acknowledgement ("for their contributions", he doesn't say they were unwitting) to some the authors of the content. Asking for forgiveness, as it were (but not really). He lists the rest at the back. It's normal to see this sort of casual hat tip in presentations at conferences — someone shows an unlicensed image they got from Google Images, slaps "Courtesy of A Scientist" or a URL at the bottom, and calls it a day. It isn't good enough: attribution is not permission. The word "courtesy" implies that you had some.

Indeed, most of the figures in Onajite's book seem to have been procured from elsewhere, with "Courtesy ExxonMobil" or whatever passing as a pseudolicense. If I was a gambler, I would bet that the large majority were used without permission.

OK, you're thinking, where's this going? Is it just a rant? Here's the bottom line:

The only courteous, professional and, yes, legal way to re-use copyrighted material — which is "anything someone created", more or less — is to seek written permission. It's that simple.

A bit of a hassle? Indeed it is. Time-consuming? Yep. The good news is that you'll usually get a "Sure! Thanks for asking". I can count on one hand the number of times I've been refused.

The only exceptions to the rule are when:

  • The copyrighted material already carries a license for re-use (as Agile does — read the footer on this page).
  • The copyright owner explicitly allows re-use in their terms and conditions (for example, allowing the re-publication of single figures, as some journals do).
  • The law allows for some kind of fair use, e.g. for the purposes of criticism.

In these cases, you do not need to ask, just be sure to attribute everything diligently.

A new low in scientific publishing?

What now? I believe Elsevier should retract this potentially useful book and begin the long process of asking the 350 authors for permission to re-use the content. But I'm not holding my breath.

By a very rough count of the preview of this $130 volume in Google Books, it looks like the ratio of LinkedIn chat to original text is about 2:1. Whatever the copyright situation, the book is definitely an uninspiring turn for scientific publishing. I hope we don't see more like it, but let's face it: if a massive publishing conglomerate can make $87 from comments on LinkedIn, it's gonna happen.

What do you think about all this? Does it matter? Should Elsevier do something about it? Let us know in the comments.


UPDATE Friday 1 September

Since this is a rather delicate issue, and events are still unfolding, I thought I'd post some updates from Twitter and the comments on this post:

  • Elsevier is aware of these questions and is looking into it.
  • Re-read the user agreement quote carefully. As Ronald points out below, I was too hasty — it's really not a good user agreement, LinkedIn have a lot of scope to re-use what you post there. 
  • It turns out that some people were asked for permission, though it seems it was unclear what they were agreeing to. So the author knew that seeking permission was a good idea.
  • It also turns out that at least one SPE paper was reproduced in the book, in a rather inconspicuous way. I don't know if SPE granted rights for this, but the author at least was not identified.
  • Some people are throwing the word 'plagiarism' around, which is rather a serious word. I'm personally willing to ascribe it to 'normal industry practices' and sloppy editing and reviewing (the book was apparently reviewed by no fewer than 5 people!). And, at least in the case of the LinkedIn content, proper attribution was made. For me, this is more about honesty, quality, and value in scientific publishing than about misconduct per se.
  • It's worth reading the comments on this post. People are raising good points.

Part of the thumbnail image was created by Jannoon028 — Freepik.com — and licensed CC-BY.

x lines of Python: read and write CSV

A couple of weeks ago, in Murphy's Law for Excel, I wrote about the dominance of spreadsheets in applied analysis, and how they may be getting out of hand. Then in Organizing spreadsheets I wrote about how — if you are going to store data in spreadsheets — to organize your data so that you do the least amount of damage. The general goal being to make your data machine-readable. Or, to put it another way, to allow you to save your data as comma-separated values or CSV files.

CSV is the de facto standard way to store data in text files. They are human-readable, easy to parse with multiple tools, and they compress easily. So you need to know how to read and write them in your analysis tool of choice. In our case, this is the Python language. So today I present a few different ways to get at data stored in CSV files.

How many ways can I read thee?

In the accompanying Jupyter Notebook, we read a CSV file into Python in six different ways:

  1. Using the pandas data analysis library. It's the easiest way to read CSV and XLS data into your Python environment...
  2. ...and can happily consume a file on the web too. Another nice thing about pandas. It also writes CSV files very easily.
  3. Using the built-in csv package. There are a couple of standard ways to do this — csv.reader...
  4. ...and csv.DictReader. This library is handy for when you don't have (or don't want) pandas.
  5. Using numpy, the numeric library for Python. If you just have a CSV full of numbers and you want an array in the end, you can skip pandas.
  6. OK, it's not really a CSV file, but for the finale we read a spreadsheet directly from Google Sheets.

I usually count my lines diligently in these posts, but not this time. With pandas you're looking at a one-liner to read your data:

df = pd.read_csv("myfile.csv")

and a one-liner to write it out again. With csv.DictReader you're looking at 3 lines to get a list of dicts (but watch out: your numbers will be strings). Reading a Google Doc is a little more involved, not least because you'll need to set up an app and get an API key to handle authentication.

That's all there is to CSV files. Go forth and wield data like a pro! 

Next time in the xlines of Python series we'll look at reading seismic station data from the web, and doing a bit of time-series analysis on it. No more stuff about spreadsheets and CSV files, I promise :)


The thumbnail image is based on the possibly apocryphal banksy image of an armed panda, and one of texturepalace.com's CC-BY textures.

Organizing spreadsheets

A couple of weeks ago I alluded to ill-formed spreadsheets in my post Murphy's Law for Excel. Spreadsheets are clearly indispensable, and are definitely great for storing data and checking CSV files. But some spreadsheets need to die a horrible death. I'm talking about spreadsheets that look like this (click here for the entire sheet):

Bad_spreadsheet_3.png

This spreadsheet has several problems. Among them:

  • The position of a piece of data changes how I interpret it. E.g. a blank row means 'new sheet' or 'new well'.
  • The cells contain a mixture of information (e.g. 'Site' and the actual data) and appear in varying units.
  • Some information is encoded by styles (e.g. using red to denote a mineral species). If you store your sheet as a CSV (which you should), this information will be lost.
  • Columns are hidden, there are footnotes, it's just a bit gross.

Using this spreadsheet to make plots, or reading it with software, with be a horrible experience. I will probably swear at my computer, suffer a repetitive strain injury, and go home early with a headache, cursing the muppet that made the spreadsheet in the first place. (Admittedly, I am the muppet that made this spreadsheet in this case, but I promise I did not invent these pathologies. I have seen them all.)

Let's make the world a better place

Consider making separate sheets for the following:

  • Raw data. This is important. See below.
  • Computed columns. There may be good reasons to keep these with the data.
  • Charts.
  • 'Tabulated' data, like my bad spreadsheet above, with tables meant for summarization or printing.
  • Some metadata, either in the file properties or a separate sheet. Explain the purpose of the dataset, any major sources, important assumptions, and your contact details.
  • A rich description of each column, with its caveats and assumptions.

The all-important data sheet has its own special requirements. Here's my guide for a pain-free experience:

  • No computed fields or plots in the data sheet.
  • No hidden columns.
  • No semantic meaning in formatting (e.g. highlighting cells or bolding values).
  • Headers in the first row, only data in all the other rows.
  • The column headers should contain only a unique name and [units], e.g. Depth [m], Porosity [v/v].
  • Only one type of data per column: text OR numbers, discrete categories OR continuous scalars.
  • No units in numeric data cells, only quantities. Record depth as 500, not 500 m.
  • Avoid keys or abbreviations: use Sandstone, Limestone, Shale, not Ss, Ls, Sh.
  • Zero means zero, empty cell means no data.
  • Only one unit per column. (You only use SI units right?)
  • Attribution! Include a citation or citations for every record.
  • If you have two distinct types or sources of data, e.g. grain size from sieve analysis and grain size from photomicrographs, then use two different columns.
  • Personally, I like the data sheet to be the first sheet in the file, but maybe that's just me.
  • Check that it turns into a valid CSV so you can use this awesome format.

      After all that, here's what we have (click here for the entire sheet):

    The same data as the first image, but improved. The long strings in columns 3 and 4 are troublesome, but we can tolerate them. Click to enlarge.

    Maybe the 'clean' analysis-friendly sheet looks boring to you, but to me it looks awesome. Above all, it's easy to use for SCIENCE! And I won't have to go home with a headache.


    The data in this post came from this Cretaceous shale dataset [XLS file] from the government of Manitoba. Their spreadsheet is pretty good and only breaks a couple of my golden rules. Here's my version with the broken and fixed spreadsheets shown here. Let me know if you spot something else that should be fixed!

    x lines of Python: read and write a shapefile

    Shapefiles are a sort-of-open format for geospatial vector data. They can encode points, lines, and polygons, plus attributes of those objects, optionally bundled into groups. I say 'sort-of-open' because the format is well-known and widely used, but it is maintained and policed, so to speak, by ESRI, the company behind ArcGIS. It's a slightly weird (annoying) format because 'a shapefile' is actually a collection of files, only one of which is the eponymous SHP file. 

    Today we're going to read a SHP file, change its Coordinate Reference System (CRS), add a new attribute, and save a new file in two different formats. All in x lines of Python, where x is a small number. To do all this, we need to add a new toolbox to our xlines virtual environment: geopandas, which is a geospatial flavour of the popular data management tool pandas.

    Here's the full rundown of the workflow, where each item is a line of Python:

    1. Open the shapefile with fiona (i.e. not using geopandas yet).
    2. Inspect its contents.
    3. Open the shapefile again, this time with geopandas.
    4. Inspect the resulting GeoDataFrame in various ways.
    5. Check the CRS of the data.
    6. Change the CRS of the GeoDataFrame.
    7. Compute a new attribute.
    8. Write the new shapefile.
    9. Write the GeoDataFrame as a GeoJSON file too.

    By the way, if you have not come across EPSG codes yet for CRS descriptions, they are the only way to go. This dataset is initially in EPSG 4267 (NAD27 geographic coordinates) but we change it to EPSG 26920 (NAD83 UTM20N projection).

    Several bits of our workflow are optional. The core part of the code, items 3, 6, 7, and 8, are just a few lines of Python:

        import geopandas as gpd
        gdf = gpd.read_file('data_in.shp')
        gdf = gdf.to_crs({'init': 'epsg:26920'})
        gdf['seafl_twt'] = 2 * 1000 * gdf.Water_Dept / 1485
        gdf.to_file('data_out.shp')

    That's it! 

    As in all these posts, you can follow along with the code in the Jupyter Notebook.

    Murphy's Law for Excel

    Where would scientists and engineers be without Excel? Far, far behind where they are now, I reckon. Whether it's a quick calculation, or making charts for a thesis, or building elaborate numerical models, Microsoft Excel is there for you. And it has been there for 32 years, since Douglas Klunder — now a lawyer at ACLU — gave it to us (well, some of us: the first version was Mac only!).

    We can speculate about reasons for its popularity:

    • It's relatively easy to use, and most people started long enough ago that they don't have to think too hard about it.
    • You have access to it, and you know that your collaborators (boss, colleagues, future self) have access to it.
    • It's flexible enough that it can do almost anything.
    Figure 1 from 'Predicting bed thickness with cepstral decomposition'.

    Figure 1 from 'Predicting bed thickness with cepstral decomposition'.

    For instance, all the computation and graphics for my two 2006 articles on signal processing were done in Excel (plus the FFT add-on). I've seen reservoir simulators, complete with elaborate user interfaces, in Excel. An infinity of business-critical documents are stored in Excel (I just filled out a vendor registration form for a gigantic multinational in an Excel spreadsheet). John Nelson at ESRI made a heatmap in Excel. You can even play Pac Man.

    Maybe it's gone too far:


    So what's wrong with Excel?

    Nothing is wrong with it, but it's not the best tool for every number-crunching task. Why?

    • Excel files are just that — files. Sometimes you want to do analysis across datasets, and a pool of data (a database) becomes more useful. And sometimes you wish nine different people didn't have nine different versions of your spreadsheet, each emailing their version to nine other people...
    • The charts are rather clunky and static. They don't do well with large datasets, or in data you'd like to filter or slice dynamically.
    • In large datasets, scrolling around a spreadsheet gets old pretty quickly.
    • The tool is so flexible that people get carried away with pretty tables, annotating their sheets in ways that make the printed page look nice, but analysis impossible.

    What are the alternatives?

    Excel is a wonder-tool, but it's not the only tool. There are alternatives, and you should at least know about them.

    For everyday spreadsheeting needs, I now use Google Sheets. Collaboration is built-in. Being able to view and edit a sheet at the same time as someone else is a must-have (probably Office 365 does this now too, so if you're stuck with Excel I urge you to check). Version control — another thing I'm not sure I can live without — is built in. For real nerds, there's even a complete API. I also really like the native 'webbiness' of Google Docs, for example being able to use web API calls natively, for example getting the current CAD–USD exchange rate with GoogleFinance("CURRENCY:CADUSD").

    If it's graphical analysis you want, try Tableau or Spotfire. I'm especially looking at you, reservoir engineers — you are seriously missing out if you're stuck in Excel, especially if you have a lot of columns of different types (time series, categories and continuous variables for example). The good news is that the fastest way to get data into Spotfire is... Excel. So it's easy to get started.

    If you're gathering information from people, like registering the financial details of vendors for instance, then a web form is your best bet. You can set one up in Google Forms in minutes, and there are lots of similar services. If you want to use your own servers, no problem: any dev worth their wages can throw one together in a few hours.

    If you're doing geoscience in Excel, like my 2006 self — filtering logs, or generating synthetics, or computing spectrums — your mind will be blown by spending a few hours learning a programming language. Your first day in Python (or Julia or Octave or R) will change your quantitative life forever.

    Excel is great at some things, but for most things, there's a better way. Take some time to explore them the next time you have some slack in your schedule.

    References

    Hall, M (2006). Resolution and uncertainty in spectral decomposition. First Break 24, December 2006, p 43–47.

    Hall, M (2006). Predicting stratigraphy with cepstral decomposition. The Leading Edge 25 (2, Special Issue on Spectral Decomposition). doi:10.1190/1.2172313


    UPDATE

    As a follow-up example, I couldn't resist sharing this recent story about an artist that draws anime characters in Excel.

    Newsflash: the Geophysics Hackathon is back!

    Mark your calendar: 22–24 September (right before SEG), at a downtown Houston location to be confirmed.

    We're filling the room with 50 geoscientists of all stripes. Interpreters, programmers, students, professionals... everyone is welcome. The plan: to imagine, design, and prototype some new tools in geophysics — all around the theme of machine learning. It's going to be awesome. 

    The schedule: we'll get started at 6 pm on Friday 22 September, and go till 10 pm. Then we pick it up again on Saturday morning, and go till 6 pm, and the same again on Sunday. Teams will present a demo to everyone on Sunday after 3 pm. There will be a few prizes, a few drinks, lots of food, and a lot of new geophysical tools and widgets. 

    If you want to know more about what a hackathon is, read my summary from the last one: Le grand hack! Or check out the project round-up posts, part 1 and part 2.

    If you're not sure you belong, I promise that you do. One of the prize-winning teams in Paris had no coding experience! And every team needs help with brainstorming, design, testing, and presentation. Absolutely anyone can contribute, and absolutely everyone will learn something.

    If you have some like-minded friends, bring them along! We need teams of 5 people, so if there are already 5 of you, you can start coding as soon as you walk in the door!

    If you can't be there yourself, please share this post with someone you know.

    When you're ready, click here to buy a ticket.


    Thank you as always to our sponsors so far: Dell EMC and Amazon AWS. If you'd like to sponsor the Houston event, please check this page out, or just get in touch.