Organizing spreadsheets

A couple of weeks ago I alluded to ill-formed spreadsheets in my post Murphy's Law for Excel. Spreadsheets are clearly indispensable, and are definitely great for storing data and checking CSV files. But some spreadsheets need to die a horrible death. I'm talking about spreadsheets that look like this (click here for the entire sheet):

Bad_spreadsheet_3.png

This spreadsheet has several problems. Among them:

  • The position of a piece of data changes how I interpret it. E.g. a blank row means 'new sheet' or 'new well'.
  • The cells contain a mixture of information (e.g. 'Site' and the actual data) and appear in varying units.
  • Some information is encoded by styles (e.g. using red to denote a mineral species). If you store your sheet as a CSV (which you should), this information will be lost.
  • Columns are hidden, there are footnotes, it's just a bit gross.

Using this spreadsheet to make plots, or reading it with software, with be a horrible experience. I will probably swear at my computer, suffer a repetitive strain injury, and go home early with a headache, cursing the muppet that made the spreadsheet in the first place. (Admittedly, I am the muppet that made this spreadsheet in this case, but I promise I did not invent these pathologies. I have seen them all.)

Let's make the world a better place

Consider making separate sheets for the following:

  • Raw data. This is important. See below.
  • Computed columns. There may be good reasons to keep these with the data.
  • Charts.
  • 'Tabulated' data, like my bad spreadsheet above, with tables meant for summarization or printing.
  • Some metadata, either in the file properties or a separate sheet. Explain the purpose of the dataset, any major sources, important assumptions, and your contact details.
  • A rich description of each column, with its caveats and assumptions.

The all-important data sheet has its own special requirements. Here's my guide for a pain-free experience:

  • No computed fields or plots in the data sheet.
  • No hidden columns.
  • No semantic meaning in formatting (e.g. highlighting cells or bolding values).
  • Headers in the first row, only data in all the other rows.
  • The column headers should contain only a unique name and [units], e.g. Depth [m], Porosity [v/v].
  • Only one type of data per column: text OR numbers, discrete categories OR continuous scalars.
  • No units in numeric data cells, only quantities. Record depth as 500, not 500 m.
  • Avoid keys or abbreviations: use Sandstone, Limestone, Shale, not Ss, Ls, Sh.
  • Zero means zero, empty cell means no data.
  • Only one unit per column. (You only use SI units right?)
  • Attribution! Include a citation or citations for every record.
  • If you have two distinct types or sources of data, e.g. grain size from sieve analysis and grain size from photomicrographs, then use two different columns.
  • Personally, I like the data sheet to be the first sheet in the file, but maybe that's just me.
  • Check that it turns into a valid CSV so you can use this awesome format.

      After all that, here's what we have (click here for the entire sheet):

    The same data as the first image, but improved. The long strings in columns 3 and 4 are troublesome, but we can tolerate them. Click to enlarge.

    Maybe the 'clean' analysis-friendly sheet looks boring to you, but to me it looks awesome. Above all, it's easy to use for SCIENCE! And I won't have to go home with a headache.


    The data in this post came from this Cretaceous shale dataset [XLS file] from the government of Manitoba. Their spreadsheet is pretty good and only breaks a couple of my golden rules. Here's my version with the broken and fixed spreadsheets shown here. Let me know if you spot something else that should be fixed!

    Murphy's Law for Excel

    Where would scientists and engineers be without Excel? Far, far behind where they are now, I reckon. Whether it's a quick calculation, or making charts for a thesis, or building elaborate numerical models, Microsoft Excel is there for you. And it has been there for 32 years, since Douglas Klunder — now a lawyer at ACLU — gave it to us (well, some of us: the first version was Mac only!).

    We can speculate about reasons for its popularity:

    • It's relatively easy to use, and most people started long enough ago that they don't have to think too hard about it.
    • You have access to it, and you know that your collaborators (boss, colleagues, future self) have access to it.
    • It's flexible enough that it can do almost anything.
    Figure 1 from 'Predicting bed thickness with cepstral decomposition'.

    Figure 1 from 'Predicting bed thickness with cepstral decomposition'.

    For instance, all the computation and graphics for my two 2006 articles on signal processing were done in Excel (plus the FFT add-on). I've seen reservoir simulators, complete with elaborate user interfaces, in Excel. An infinity of business-critical documents are stored in Excel (I just filled out a vendor registration form for a gigantic multinational in an Excel spreadsheet). John Nelson at ESRI made a heatmap in Excel. You can even play Pac Man.

    Maybe it's gone too far:


    So what's wrong with Excel?

    Nothing is wrong with it, but it's not the best tool for every number-crunching task. Why?

    • Excel files are just that — files. Sometimes you want to do analysis across datasets, and a pool of data (a database) becomes more useful. And sometimes you wish nine different people didn't have nine different versions of your spreadsheet, each emailing their version to nine other people...
    • The charts are rather clunky and static. They don't do well with large datasets, or in data you'd like to filter or slice dynamically.
    • In large datasets, scrolling around a spreadsheet gets old pretty quickly.
    • The tool is so flexible that people get carried away with pretty tables, annotating their sheets in ways that make the printed page look nice, but analysis impossible.

    What are the alternatives?

    Excel is a wonder-tool, but it's not the only tool. There are alternatives, and you should at least know about them.

    For everyday spreadsheeting needs, I now use Google Sheets. Collaboration is built-in. Being able to view and edit a sheet at the same time as someone else is a must-have (probably Office 365 does this now too, so if you're stuck with Excel I urge you to check). Version control — another thing I'm not sure I can live without — is built in. For real nerds, there's even a complete API. I also really like the native 'webbiness' of Google Docs, for example being able to use web API calls natively, for example getting the current CAD–USD exchange rate with GoogleFinance("CURRENCY:CADUSD").

    If it's graphical analysis you want, try Tableau or Spotfire. I'm especially looking at you, reservoir engineers — you are seriously missing out if you're stuck in Excel, especially if you have a lot of columns of different types (time series, categories and continuous variables for example). The good news is that the fastest way to get data into Spotfire is... Excel. So it's easy to get started.

    If you're gathering information from people, like registering the financial details of vendors for instance, then a web form is your best bet. You can set one up in Google Forms in minutes, and there are lots of similar services. If you want to use your own servers, no problem: any dev worth their wages can throw one together in a few hours.

    If you're doing geoscience in Excel, like my 2006 self — filtering logs, or generating synthetics, or computing spectrums — your mind will be blown by spending a few hours learning a programming language. Your first day in Python (or Julia or Octave or R) will change your quantitative life forever.

    Excel is great at some things, but for most things, there's a better way. Take some time to explore them the next time you have some slack in your schedule.

    References

    Hall, M (2006). Resolution and uncertainty in spectral decomposition. First Break 24, December 2006, p 43–47.

    Hall, M (2006). Predicting stratigraphy with cepstral decomposition. The Leading Edge 25 (2, Special Issue on Spectral Decomposition). doi:10.1190/1.2172313


    UPDATE

    As a follow-up example, I couldn't resist sharing this recent story about an artist that draws anime characters in Excel.