Organizing spreadsheets

August 15, 2017 Matt Hall

A couple of weeks ago I alluded to ill-formed spreadsheets in my post Murphy 's Law for Excel. Spreadsheets are clearly indispensable, and are definitely great for storing data and checking CSV files. But some spreadsheets need to die a horrible death. I'm talking about spreadsheets that look like this (click here for the entire sheet):

This spreadsheet has several problems. Among them:

The position of a piece of data changes how I interpret it. E.g. a blank row means 'new sheet' or 'new well'.
The cells contain a mixture of information (e.g. 'Site' and the actual data) and appear in varying units.
Some information is encoded by styles (e.g. using red to denote a mineral species). If you store your sheet as a CSV (which you should), this information will be lost.
Columns are hidden, there are footnotes, it's just a bit gross.

Using this spreadsheet to make plots, or reading it with software, with be a horrible experience. I will probably swear at my computer, suffer a repetitive strain injury, and go home early with a headache, cursing the muppet that made the spreadsheet in the first place. (Admittedly, I am the muppet that made this spreadsheet in this case, but I promise I did not invent these pathologies. I have seen them all.)

Let's make the world a better place

Consider making separate sheets for the following:

Raw data. This is important. See below.
Computed columns. There may be good reasons to keep these with the data.
Charts.
'Tabulated' data, like my bad spreadsheet above, with tables meant for summarization or printing.
Some metadata, either in the file properties or a separate sheet. Explain the purpose of the dataset, any major sources, important assumptions, and your contact details.
A rich description of each column, with its caveats and assumptions.

The all-important data sheet has its own special requirements. Here's my guide for a pain-free experience:

No computed fields or plots in the data sheet.
No hidden columns.
No semantic meaning in formatting (e.g. highlighting cells or bolding values).
Headers in the first row, only data in all the other rows.
The column headers should contain only a unique name and [units], e.g. Depth [m], Porosity [v/v].
Only one type of data per column: text OR numbers, discrete categories OR continuous scalars.
No units in numeric data cells, only quantities. Record depth as 500, not 500 m.
Avoid keys or abbreviations: use Sandstone, Limestone, Shale, not Ss, Ls, Sh.
Zero means zero, empty cell means no data.
Only one unit per column. (You only use SI units right?)
Attribution! Include a citation or citations for every record.
If you have two distinct types or sources of data, e.g. grain size from sieve analysis and grain size from photomicrographs, then use two different columns.
Personally, I like the data sheet to be the first sheet in the file, but maybe that's just me.
Check that it turns into a valid CSV so you can use this awesome format.
After all that, here's what we have (click here for the entire sheet):

The same data as the first image, but improved. The long strings in columns 3 and 4 are troublesome, but we can tolerate them. Click to enlarge.

Maybe the 'clean' analysis-friendly sheet looks boring to you, but to me it looks awesome. Above all, it's easy to use for SCIENCE! And I won't have to go home with a headache.

The data in this post came from this Cretaceous shale dataset [XLS file] from the government of Manitoba. Their spreadsheet is pretty good and only breaks a couple of my golden rules. Here's my version with the broken and fixed spreadsheets shown here. Let me know if you spot something else that should be fixed!