An open source wish list

May 24, 2021 Matt Hall

After reviewing a few code-dependent scientific papers recently, I’ve been thinking about reproducibility. Is there a minimum requirement for scientific code, or should we just be grateful for any code at all?

The sky’s the limit

I’ve come to the conclusion that there are a few things that are essential if you want anyone to be able to do more than simply read your code. (If that’s all you want, just add a code listing to your supplementary material.)

The number one thing is an open licence. (I recently wrote about how to choose one.) Assuming the licence is consistent with everything you have used (e.g. you haven’t used a GPL-licensed library and then put an Apache licence on your own code), you are protected by the indemnity clauses and other people can re-use your code on your terms.

After that, good practice is to improve the quality of your code. Most of us write horrible code a lot of the time. But after a bit of review, some refactoring, and some input from colleagues, you will have something that is less buggy, more readable, and more reusable (even by you!).

If this was a one-off piece of code, providing figures for a paper for instance, you can stop here. But if you are going to keep developing this thing, and especially if you want others to use it too, you should keep going.

Best practice is to start using continuous integration, to help ensure that the code stays in good shape as you continue to develop it. And after that, you can make your tool more citable, maybe write a paper about it, and start developing a user/contributor community. The sky’s the limit — and now you have help!
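To make the continuous integration step a bit more concrete, here is a minimal sketch of the kind of automated check a CI service could run on every commit. The function porosity_from_density and its default densities are hypothetical, chosen only for illustration; the test functions follow the naming convention that test runners such as pytest discover automatically.

```python
import math


def porosity_from_density(rho_bulk, rho_matrix=2650.0, rho_fluid=1000.0):
    """Hypothetical example function: density porosity from bulk density.

    Densities are in kg/m3; the defaults are illustrative assumptions.
    """
    return (rho_matrix - rho_bulk) / (rho_matrix - rho_fluid)


def test_porosity_endpoints():
    # Pure matrix should give zero porosity; pure fluid should give one.
    assert math.isclose(porosity_from_density(2650.0), 0.0, abs_tol=1e-12)
    assert math.isclose(porosity_from_density(1000.0), 1.0)


def test_porosity_midpoint():
    # Halfway between matrix and fluid density should give 50% porosity.
    assert math.isclose(porosity_from_density(1825.0), 0.5)
```

The point is not the function itself but the habit: once checks like these exist, a CI service can run them for you whenever the code changes.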

Other models

When I shared this on Twitter, Simon Waldman mentioned that he had recently co-authored a paper on this topic. Harrison et al. (2021) proposed that there are three priorities for scientific software: to be correct, to be reusable, and to be documented. From there, they developed a hierarchy of research software projects:

Level 0 — Barely repeatable: the code is clear and tested in a basic way.
Level 1 — Publication: code is tested, readable, available and ideally openly licensed.
Level 2 — Tool: code is installable and managed by continuous integration.
Level 3 — Infrastructure: code is reviewed, semantically versioned, archived, and sustainable.
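One way to read the levels above is as a cumulative checklist. The sketch below encodes that reading in Python. The attribute names are hypothetical labels invented for this example, and the assumption that each level also requires everything below it is my interpretation of the list, not something taken from the paper.

```python
from enum import IntEnum


class Level(IntEnum):
    BARELY_REPEATABLE = 0
    PUBLICATION = 1
    TOOL = 2
    INFRASTRUCTURE = 3


# Criteria paraphrased from the list above. "Ideally openly licensed" is
# left out of Level 1 because the list treats it as desirable, not required.
CRITERIA = {
    Level.BARELY_REPEATABLE: {"clear", "basic_tests"},
    Level.PUBLICATION: {"tested", "readable", "available"},
    Level.TOOL: {"installable", "continuous_integration"},
    Level.INFRASTRUCTURE: {
        "reviewed", "semantic_versioning", "archived", "sustainable",
    },
}


def highest_level(attributes):
    """Return the highest level reached, treating levels as cumulative.

    Returns None if the Level 0 criteria are not met.
    """
    attributes = set(attributes)
    reached = None
    for level in Level:  # IntEnum members iterate in ascending order
        if CRITERIA[level] <= attributes:
            reached = level
        else:
            break
    return reached
```

For example, a project that is clear and has basic tests, but is not yet readable or available, would come out at Level.BARELY_REPEATABLE under this reading, even if it already has continuous integration.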

There are probably still other models out there. If you know of a good one, please drop it in the Comments.

References

Sam Harrison, Abhishek Dasgupta, Simon Waldman, Alex Henderson & Christopher Lovell (2021, May 14). How reproducible should research software be? Zenodo. DOI: 10.5281/zenodo.4761867