A couple of weeks ago, in Murphy's Law for Excel, I wrote about the dominance of spreadsheets in applied analysis, and how they may be getting out of hand. Then in Organizing spreadsheets I wrote about how, if you are going to store data in spreadsheets, to organize your data so that you do the least amount of damage. The general goal is to make your data machine-readable or, to put it another way, to allow you to save your data as comma-separated values (CSV) files.
CSV is the de facto standard way to store data in text files. CSV files are human-readable, easy to parse with multiple tools, and they compress easily. So you need to know how to read and write them in your analysis tool of choice. In our case, this is the Python language. So today I present a few different ways to get at data stored in CSV files.
- Using the pandas data analysis library. It's the easiest way to read CSV and XLS data into your Python environment...
- ...and can happily consume a file on the web too. Another nice thing about pandas: it also writes CSV files very easily.
- Using the built-in csv package. There are a couple of standard ways to do this: csv.reader and csv.DictReader. This library is handy for when you don't have (or don't want) pandas.
- Using numpy, the numeric library for Python. If you just have a CSV full of numbers and you want an array in the end, you can skip pandas.
- OK, it's not really a CSV file, but for the finale we read a spreadsheet directly from Google Sheets.
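To make the pandas and numpy routes from the list concrete, here is a minimal sketch. The column names and values are made up for illustration, and the data lives in an in-memory buffer rather than a real file, so treat it as a starting point, not as the notebook's actual code.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample data standing in for a real CSV file.
csv_text = """depth,porosity,permeability
1000.0,0.21,150.0
1000.5,0.19,95.0
1001.0,0.24,310.0
"""

# pandas: one line in, one line out. read_csv accepts a path, a URL,
# or any file-like object; here we pass an in-memory buffer.
df = pd.read_csv(io.StringIO(csv_text))
round_trip = df.to_csv(index=False)  # returns a string when no path is given

# numpy: skip the header row and get a plain 2D float array.
arr = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)

print(df.dtypes)
print(arr.shape)
```

With a file on disk you would pass the filename instead of the buffer, for example pd.read_csv("myfile.csv") and df.to_csv("out.csv", index=False). Note that the numpy route throws away the column names, which is fine if all you want is the array.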
I usually count my lines diligently in these posts, but not this time. With pandas you're looking at a one-liner to read your data:
df = pd.read_csv("myfile.csv")
and a one-liner to write it out again. With csv.DictReader you're looking at 3 lines to get a list of dicts (but watch out: your numbers will be strings). Reading a Google Sheet is a little more involved, not least because you'll need to set up an app and get an API key to handle authentication.
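The csv.DictReader route, including the strings-for-numbers gotcha, might look something like the sketch below. The sample data and the choice of which columns to treat as numeric are assumptions for illustration only.

```python
import csv
import io

# Hypothetical sample data; with a real file you would use
# open("myfile.csv", newline="") instead of io.StringIO.
csv_text = """well,depth,porosity
A-1,1000.0,0.21
A-1,1000.5,0.19
B-2,1001.0,0.24
"""

# The three lines: open, read, collect into a list of dicts.
with io.StringIO(csv_text) as f:
    rows = list(csv.DictReader(f))

# Every value comes back as a string, so convert the numeric columns.
numeric = ("depth", "porosity")  # assumed to be the numeric columns
records = [
    {key: float(value) if key in numeric else value for key, value in row.items()}
    for row in rows
]

print(rows[0])
print(records[0])
```

The conversion step is where this approach starts to cost more lines than pandas, which infers column types for you.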
That's all there is to CSV files. Go forth and wield data like a pro!
Next time in the x lines of Python series we'll look at reading seismic station data from the web, and doing a bit of time-series analysis on it. No more stuff about spreadsheets and CSV files, I promise :)
The thumbnail image is based on the possibly apocryphal Banksy image of an armed panda, and one of texturepalace.com's CC-BY textures.