A couple of weeks ago I alluded to ill-formed spreadsheets in my post Murphy's Law for Excel. Spreadsheets are clearly indispensable, and are definitely great for storing data and checking CSV files. But some spreadsheets need to die a horrible death. I'm talking about spreadsheets that look like this (click here for the entire sheet):
This spreadsheet has several problems. Among them:
- The position of a piece of data changes how I interpret it. E.g. a blank row means 'new sheet' or 'new well'.
- The cells contain a mixture of information (e.g. 'Site' and the actual data) and appear in varying units.
- Some information is encoded by styles (e.g. using red to denote a mineral species). If you store your sheet as a CSV (which you should), this information will be lost.
- Columns are hidden, there are footnotes, it's just a bit gross.
Using this spreadsheet to make plots, or reading it with software, with be a horrible experience. I will probably swear at my computer, suffer a repetitive strain injury, and go home early with a headache, cursing the muppet that made the spreadsheet in the first place. (Admittedly, I am the muppet that made this spreadsheet in this case, but I promise I did not invent these pathologies. I have seen them all.)
Let's make the world a better place
Consider making separate sheets for the following:
- Raw data. This is important. See below.
- Computed columns. There may be good reasons to keep these with the data.
- 'Tabulated' data, like my bad spreadsheet above, with tables meant for summarization or printing.
- Some metadata, either in the file properties or a separate sheet. Explain the purpose of the dataset, any major sources, important assumptions, and your contact details.
- A rich description of each column, with its caveats and assumptions.
The all-important data sheet has its own special requirements. Here's my guide for a pain-free experience:
- No computed fields or plots in the data sheet.
- No hidden columns.
- No semantic meaning in formatting (e.g. highlighting cells or bolding values).
- Headers in the first row, only data in all the other rows.
- The column headers should contain only a unique name and [units], e.g. Depth [m], Porosity [v/v].
- Only one type of data per column: text OR numbers, discrete categories OR continuous scalars.
- No units in numeric data cells, only quantities. Record depth as 500, not 500 m.
- Avoid keys or abbreviations: use Sandstone, Limestone, Shale, not Ss, Ls, Sh.
- Zero means zero, empty cell means no data.
- Only one unit per column. (You only use SI units right?)
- Attribution! Include a citation or citations for every record.
- If you have two distinct types or sources of data, e.g. grain size from sieve analysis and grain size from photomicrographs, then use two different columns.
- Personally, I like the data sheet to be the first sheet in the file, but maybe that's just me.
- Check that it turns into a valid CSV so you can use this awesome format.
After all that, here's what we have (click here for the entire sheet):
Maybe the 'clean' analysis-friendly sheet looks boring to you, but to me it looks awesome. Above all, it's easy to use for SCIENCE! And I won't have to go home with a headache.
The data in this post came from this Cretaceous shale dataset [XLS file] from the government of Manitoba. Their spreadsheet is pretty good and only breaks a couple of my golden rules. Here's my version with the broken and fixed spreadsheets shown here. Let me know if you spot something else that should be fixed!