10 ways to improve your data store

When I look at the industry's struggle with the data mess, I see a parallel with science's struggle with open data. I've written lots about that before, but the basic idea is simple: scientists need discoverable, accessible, documented, usable data. Does that sound familiar?

I wrote yesterday that I think we have to get away from the idea that we can manage data like we might manage a production line. Instead, we need to think about more organic, flexible strategies that cope with and even thrive on chaos. I like, or liked until yesterday, the word 'curation', because it implies ongoing care and a focus on the future. But my friend Eric Marchand was right in his comment yesterday — the dusty connotation is too strong, and dust is bad for data. I like his supermarket analogy: packaged, categorized items, each with a cost of production and a price. A more lively, slightly chaotic market might match my vision better — multiple vendors maintaining their own realms. One can get carried away with analogies, but I like this better than a library or museum.

The good news is that lots of energetic and cunning people have been working on this idea of open data markets. So there are plenty of strategies we can try, alongside the current strategy of giving giant service companies millions of dollars for their TechCloud® Integrated ProSIGHT™ Data Management Solutions.

Serve your customer:

  • Above all else, build what people need. It's amazing that this needs to be said, but ask almost anyone what they think of IT at their company and you will know that this is not how things work today. Everything you build should be in response to the business pulling.
  • This means you have to get out of the building and talk to your customers. In person, one-on-one. Watch them use your systems. Listen to them. Respond to them.

Knock down the data walls:

  • Learn and implement open data practices inside the organization. Focus on discoverability, accessibility, and documentation of good-enough data, not on building The One True Database. 
  • Encourage and fund open data practices among providers of public data. There is a big role here for our technical societies, I believe, but I don't think they have seen it yet.

I've said it before: hire loads of geeks:

  • The web (well, intranet) is your pipeline. Build and maintain proper machine interfaces (APIs and web APIs) for data; there's a sketch of what I mean after this list. What, you don't know how to do this? I know; it means hiring web-savvy, data-obsessed programmers.
  • Bring back the hacker technologists that I think I remember from the nineties. Maybe it's a myth memory, but sprinkled around big companies there used to be super-geeks with degrees in astrophysics, mad UNIX skills, and the Oracle admin password. Now it's all data managers with Petroleum Technology certificates who couldn't write an awk script if your data depended on it (it does). 
  • Institute proper data wrangling and analysis training for scientists. I think this is pretty urgent. Anecdotal evidence: the top data integration tool in our business is PowerPoint... or an Excel chart with two y-axes if we're talking about engineers. (What does E&P mean?)
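
To make 'machine interfaces for data' concrete, here's a minimal sketch of a read-only web API for well headers, using Flask. The route, fields, and records are all invented for illustration; a real service would sit on top of your actual data stores.

```python
from flask import Flask, abort, jsonify

app = Flask(__name__)

# Invented records, standing in for a real database or file store.
WELLS = {
    'W-001': {'name': 'W-001', 'kb_elevation_m': 114.2, 'td_m': 3240.0},
    'W-002': {'name': 'W-002', 'kb_elevation_m': 98.7, 'td_m': 2875.5},
}

@app.route('/api/wells')
def list_wells():
    """Return the list of well identifiers."""
    return jsonify({'wells': sorted(WELLS)})

@app.route('/api/wells/<well_id>')
def get_well(well_id):
    """Return the header record for one well, or 404 if we don't have it."""
    well = WELLS.get(well_id)
    if well is None:
        abort(404)
    return jsonify(well)

if __name__ == '__main__':
    app.run(port=5000)
```

Anything on the network (a script, a spreadsheet macro, a mapping app) can now ask for /api/wells/W-001 and get JSON back, without caring where the data actually lives.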

Three more things:

  • Let data live where it wants to live (databases, spreadsheets, wikis, SharePoint if you must). Focus on connecting data with APIs and data translators. It's pointless trying to move data to where you want it to be — you're just making it worse. ("Oh, you moved my spreadsheet? Fine, I will copy my spreadsheet.")
  • Get out of the company and find out what other people are doing. Not the other industry people struggling with data — they are just as clueless as we are. Find out what the people who are doing amazing things with data are doing: Google, Twitter, Facebook, data.gov, Wikipedia, Digital Science, The New York Times, The Guardian,... there are so many to choose from. We should invite these people to our conferences; they can help us.
  • If you only do one thing, fix search in your company. Stop tinkering with semantic ontologies and smart catalogs, just buy a Google Search Appliance and fix it. You can get this one done by Christmas.

Last thing. If there's one mindset that will really get in the way, it's the project mindset. If we want to go beyond coping with the data mess, far beyond it to thriving on it, then we have to get comfortable with the idea that this is not a project. The word is banned, along with 'initiative', 'governance', and Gantt charts. The requirements you write on the back of a napkin with three colleagues will be just as useful as the ones you get back from three months of focus groups.

No, this is the rest of your career. This is never done, next year there are better ideas, more flexible libraries, faster hardware, and new needs. It's like getting fit: this ain't an 8-week get-fit program, it's an eternity of crunches.

The photograph of Covent Market in London, Ontario is from Boris Kasimov on Flickr.

Data management fairy tales

On Tuesday I read this refreshing post in LinkedIn by Jeffrey Maskell of Westheimer Energy Consultants. It's a pretty damning assessment of the current state of data management in the petroleum industry:

The fact is that no major technology advances have been seen in the DM sector for some 20 years. The [data management] gap between acquisition and processing/interpretation is now a void and is impacting the industry across the board...

I agree with him. But I don't think he goes far enough on the subject of what we should do about it. Maskell is, I believe, advocating more effort (and more budget) developing what the data management crowd have been pushing for years. In a nutshell:

I agree that standards, process, procedures, workflows, data models are all important; I also agree that DM certification is a long term desirable outcome. 

These words make me sad. I'd go so far as to say that it's the pursuit of these mythical ideas that's brought about today's pitiful scene. If you need proof, just look around you. Go look at your shared drive. Go ask someone for a well file. Go and (a) find then (b) read your IT policies and workflow documents — they're all fairy tales.

Maskell acknowledges at least that these are not enough; he goes on:

However I believe the KEY to achieving a breakthrough is to prove positively that data management can make a difference and that the cost of good quality data management is but a very small price to pay...

No, the key to achieving a breakthrough is a change of plan. Another value of information study just adds to the misery.

Here's what I think: 'data management' is an impossible fiction. A fairy tale.

You can't manage data

I'm talking to you, big-company-data-management-person.

Data is a mess, and it's distributed across your organization (and your partners, and your government, and your data vendors), and it's full of inconsistencies, and people store local copies of everything because of your broken SharePoint permissions, and... data is a mess.

The terrible truth you're fighting is that subsurface data wants to be a mess. Subsurface geoscience is not accounting. It's multi-dimensional. It's interdependent. Some of it is analog. There are dozens, maybe hundreds of formats, many of which are proprietary. Every single thing is unquantifiably uncertain. There are dozens of units. Interpretation generates more data, often iteratively. Your organization won't fund anything to do with IT properly — "We're an oil company, not a technology company!" — but it's OK because VPs only last 2 years. Well, subsurface facts last for decades.

You can't manage data. Try something else.

The principle here is: cope don't fix.

People earnestly trying to manage data reminds me of Yahoo trying to catalog the Internet in 1995. Bizarrely, they're still doing it... for 3 more months anyway. But we all know there's only one way to find things on the web today: search. Search transcends the catalog. 

So what transcends data management? I've got my own ideas, but first I really, really want to know what you think. What's one thing we could do — or stop doing — to make things a bit better?

Not picking parameters

I like socks. Bright ones. I've liked bright socks since Grade 6. They were the only visible garment not governed by school uniform, or at least not enforced, and I think that was probably the start of it. The tough boys wore white socks, and I wore odd red and green socks. These days, my favourites are Cole & Parker, and the only problem is: how to choose?

Last Tuesday I wrote about choosing parameters for geophysical algorithms — window lengths, velocities, noise levels, and so on. Like choosing socks, it's subjective, and it's hard to find a pair for every occasion. The comments from Matteo, Toastar, and GuyM raised an interesting question: maybe the best way to pick parameters is to not pick them? I'm not talking about automatically optimizing parameters, because that's still choosing. I'm talking about not choosing at all.

How many ways can we think of to implement this non-choice? I can think of four approaches, but I'm not 100% sure they're all different, or if I can even describe them...

Is the result really optimal, or just a hard-to-interpret patchwork?

Adaptivity

Well, okay, we still choose, but we choose a different value everywhere depending on local conditions. A black pair for a formal function, white for tennis, green for work, and polka dots for special occasions. We can adapt to any property (rather like automatic optimization), along any dimension of our data: spatially, azimuthally, temporally, or frequentially (there's a word you don't see every day).

Imagine computing seismic continuity. At each sample, we might evaluate some local function — such as contrast — for a range of window sizes. Or, when smoothing, we might specify a maximum acceptable signal loss compared to the original. We end up using a different value everywhere, and expect an optimal result.

One problem is that we still have to choose a cost function. And to be at all useful, we would need to produce two new data products, besides the actual result: a map of the parameter's value, and a map of the residual cost, so to speak. In other words, we need a way to know what was chosen, and how satisfactory the choice was.
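
Here's a toy sketch of the idea on a 1D signal. The cost function (relative signal loss under a running mean) and the thresholds are invented for illustration, but note that it hands back the parameter map and the residual cost alongside the result:

```python
import numpy as np

def adaptive_smooth(signal, windows=(3, 5, 9, 17), max_loss=0.1):
    """Smooth a 1D signal, choosing the window independently at each sample.

    Returns the result plus two QC products: the window chosen at each
    sample, and the cost (relative signal loss) that choice incurred.
    """
    out = signal.astype(float)
    chosen = np.zeros(signal.size, dtype=int)
    cost = np.zeros(signal.size)
    for i in range(signal.size):
        for w in windows:                       # try short to long windows
            half = w // 2
            candidate = signal[max(0, i - half):i + half + 1].mean()
            loss = abs(candidate - signal[i]) / (abs(signal[i]) + 1e-9)
            if loss <= max_loss:                # keep the longest acceptable one
                out[i], chosen[i], cost[i] = candidate, w, loss
    return out, chosen, cost

signal = np.sin(np.linspace(0, 10, 200)) + 0.1 * np.random.randn(200)
smoothed, window_map, cost_map = adaptive_smooth(signal)
```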

Stochastic shotgun

We could fall back on that geostatistical favourite and pick the parameter values stochastically, grabbing socks at random out of the drawer. This works, but I need a lot of socks to have a chance of getting even a local maximum. And we run into the old problem of really not knowing what to do with all the realizations. Common approaches are to take the P50, P10, and P90, or to average them. Both of these approaches make me want to ask: Why did I generate all those realizations?
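
In code, the shotgun looks something like the sketch below, with a running mean standing in for the real algorithm and made-up bounds on the window length; it ends in the usual percentile summaries:

```python
import numpy as np

rng = np.random.default_rng(42)

def run_process(signal, window):
    """Stand-in for the real algorithm: a simple running mean."""
    kernel = np.ones(window) / window
    return np.convolve(signal, kernel, mode='same')

signal = np.sin(np.linspace(0, 10, 200)) + 0.2 * rng.standard_normal(200)

# Grab 100 window lengths at random from the sock drawer.
windows = rng.integers(3, 33, size=100)
realizations = np.array([run_process(signal, w) for w in windows])

# The usual, unsatisfying summaries at each sample.
p10, p50, p90 = np.percentile(realizations, [10, 50, 90], axis=0)
```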

Experimental design methods

The design of experiments is a big deal in the life sciences, but for some reason rarely (never?) talked about in geoscience. Applying a cost function, or even just visual judgment, to a single parameter is perhaps trivial, but what if you have two variables? Three? What if they are non-linear and covariant? Then the optimization process amounts to a sticky inverse problem.

Fortunately, lots of clever people have thought about these problems. I've even seen them implemented in subsurface software. Cool-sounding combinatorial reduction techniques like Greco-Latin squares or Latin hypercubes offer ways to intelligently sample the parameter space and organize the results. We could do the same with socks, evaluating pattern and toe colour separately...
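
Recent versions of SciPy ship a Latin hypercube sampler; here's a sketch of sampling a two-parameter space, with invented parameter names and bounds:

```python
from scipy.stats import qmc

# Latin hypercube sample of a 2D parameter space:
# (window length in ms, noise threshold as a fraction).
sampler = qmc.LatinHypercube(d=2, seed=42)
unit_samples = sampler.random(n=8)

lower, upper = [8, 0.1], [64, 0.9]
params = qmc.scale(unit_samples, lower, upper)

for window, threshold in params:
    # Run the algorithm for this combination and record the cost.
    pass
```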

The mixing board

There is another option: the mixing board. Like a music producer, a film editor, or the Lytro camera, I can leave the raw data in place, and always work from the masters. Given the right tools, I can make myself just the right pair of socks whenever I like.

This way we can navigate the parameter space, applying views, processes, or other tools on the fly. Clearly this would mean changing everything about the way we work. We'd need a totally different approach not just to interpretation, but to the entire subsurface characterization workflow.

Are there other ways to avoid choosing? What are people using in other industries, or other sciences? I think we need to invite some experimental design and machine learning people to SEG...

Cole & Parker socks are awesome. The quilt image is by missvancamp on Flickr and licensed CC-BY. The spools are by surfzone on Flickr, licensed CC-BY. Many thanks to Cole & Parker for permission to use the sock images, despite not knowing what on earth I was going to do with them. Buy their socks! They're Canadian and everything.

The hackathon is coming

The Geophysics Hackathon is one month away! Signing up is not mandatory — you can show up on the day if you like — but it does help with the planning. It's 100% free, and I guarantee you'll enjoy yourself. You'll also learn tons about geophysics and about building software. Deets: Thrive, Denver, 8 am, 25–26 October. Bring a laptop.

Need more? Here's all the info you could ask for. Even more? Ask by email or in the comments.

Send your project ideas

The theme this year is RESOLUTION. Participants are encouraged to post projects to hackathon.io ahead of time — especially if you want to recruit others to help. And even if you're not coming to the event, we'd love to hear your project ideas. Here are some of the proto-ideas we have so far: 

  • Compute likely spatial and temporal resolution from some basic acquisition info: source, design, etc.
  • Do the same but from information from the stack: trace spacing, apparent bandwidth, etc.
  • Find and connect literature about seismic and log resolution using online bibliographic data.
  • What does the seismic spectrum look like, given STFT limitations, or Gabor uncertainty?

If you have a bright idea, get in touch by email or in the comments. We'd love to hear from you.

Thank you to our sponsors

Three forward-thinking companies have joined us in making the hackathon as much a geophysics party as a scientific workshop (a real workshop). I think this industry may have trained us to take event sponsorship for granted, but it's easy to throw $5000 at the Marriott for Yet Another Coffee Break. Handing over money to a random little company in Nova Scotia to buy coffee, tacos, and cool swag for hungry geophysicists and programmers takes real guts! 

Please take a minute to check out our sponsors and reward them for supporting innovation in our community. 

Our sponsors: dGB, GeoTeric, and OGS.

Students: we are offering $250 bursaries to anyone looking for help with travel or accommodation. Just drop me a line with a project idea. If you know a student who might enjoy the event, please forward this to them.

Picking parameters

One of the reasons I got interested in programming was to get smarter about broken workflows like this one from a generic seismic interpretation tool (I'm thinking of Poststack-PAL, but does that even exist any more?)...

  1. I want to make a coherence volume, which requires me to choose a window length.
  2. I use the default on a single line and see how it looks, then try some other values at random.
  3. I can't remember what I did so I get systematic: I try 8 ms, 16 ms, 32 ms, and 64 ms, saving each one as a result with _XXms appended so I can remember what I did.
  4. I display them side by side but the windows are annoying to line up and resize, so instead I do it once, display them one at a time, grab screenshots, and import the images into PowerPoint because let's face it I'll need that slide eventually anyway.
  5. I can't decide between 16 ms and 32 ms so I try 20 ms, 24 ms, and 28 ms as well, and do it all again, and gaaah I HATE THIS STUPID SOFTWARE

There has to be a better way.

Stumbling towards optimization

Regular readers will know that this is the time to break out the IPython Notebook. Fear not: I will focus on the outcomes here — for the real meat, go to the Notebook. Or click on these images to see larger versions, and code.

Let's run through using the Canny edge detector in scikit-image, a brilliant image processing Python library. The algo uses the derivative of a Gaussian to compute gradient, and I have to choose 3 parameters. First, we'll try to optimize 'sigma', the width of the Gaussian. Let's try the default value of 1:
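
For concreteness, here's roughly what that call looks like. This is a minimal sketch; the array called seismic is just a random stand-in for a real 2D section:

```python
import numpy as np
from skimage import feature

# Stand-in data: any 2D array of amplitudes will do for this sketch.
seismic = np.random.randn(500, 500)

# Canny edge detection with the default width of the Gaussian.
edges = feature.canny(seismic, sigma=1)
```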

Clearly, there is too much noise in the result. Let's try the interval method that drove me crazy in desktop software:

Well, I think something between 8 and 16 might work. I could compute the average intensity of each image, choose a value in between them, and then use the sigma that gives that result. OK, it's a horrible hack, but it turns out to be 10.
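
The hack might look something like this: a sketch using SciPy's root finder, with edge density as a crude stand-in for 'average intensity':

```python
import numpy as np
from scipy.optimize import brentq
from skimage import feature

seismic = np.random.randn(500, 500)   # stand-in for a real section

def edge_density(sigma):
    """Fraction of pixels flagged as edges at a given sigma."""
    return feature.canny(seismic, sigma=sigma).mean()

# Aim halfway between the densities at sigma=8 and sigma=16, then
# solve for the sigma that hits that target.
target = 0.5 * (edge_density(8) + edge_density(16))
best_sigma = brentq(lambda s: edge_density(s) - target, 8, 16)
```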

But the whole point of scientific computing is the efficient application of informed human judgment. So let's try adding some interactivity — then we can explore the 3D parameter space in a near-parallel instead of purely serial way:
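
Something along these lines, using ipywidgets (in 2014 this lived in IPython.html.widgets); the slider ranges are arbitrary:

```python
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact
from skimage import feature

seismic = np.random.randn(500, 500)   # stand-in for a real section

@interact(sigma=(1.0, 16.0), low=(0.0, 0.5), high=(0.5, 1.0))
def show_edges(sigma, low, high):
    """Recompute and display the edges as the sliders move."""
    edges = feature.canny(seismic, sigma=sigma,
                          low_threshold=low, high_threshold=high,
                          use_quantiles=True)
    plt.figure(figsize=(6, 6))
    plt.imshow(edges, cmap='gray')
    plt.axis('off')
    plt.show()
```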

I finally feel like we're getting somewhere... But it still feels a bit arbitrary. I still don't know I'm getting the optimal result.

What can I try next? I could try to extend the 'goal seek' option, and come up with a more sophisticated cost function. If I could define something well enough — for edge detection, like coherence, I might be interested in contrast — then I could potentially just find the best answers, in much the same way that a digital camera autofocuses (indeed, many of them look for the highest contrast image). But goal seeking, if the cost function is too literal, in a way begs the question. I mean, you sort of have to know the answer — or something about the answer — before you find it.

Social machines

Social machines are the hot new thing in computing (Big Data is so 2013). Perhaps instead I can turn to other humans, in my social and professional networks. I could...

  • Ask my colleagues — perhaps my company has a knowledge sharing network I can go to.
  • Ask t'Internet — I could ask Twitter, or my friends on Facebook, or a seismic interpretation group in LinkedIn. Better yet, Earth Science Stack Exchange!
  • What if the software I was using just told me what other people had used for these parameters? Maybe this is only one step up from the programmer's default... especially if most people just use the programmer's default.
  • But what if people could rate the outcome of the algorithm? What if their colleagues or managers could rate the outcome? Then I could weight the results with these ratings.
  • What if there was a game that involved optimizing images (OK, maybe a bit of a stretch... maybe more like a Mechanical Turk). Then we might have a vast crowd of people all interested in really pushing the edge of what is intuitively reasonable, and maybe exploring the part of the parameter space I'm most interested in.

What if I could combine the best of all these approaches? Interactive exploration, with guided optimization, constrained by some cost function or other expectation. That could be interesting, but unfortunately I have absolutely no idea how that would work. I do think the optimization workflow of the future will contain all of these elements.

What do you think? Do you have an awesome way to optimize the parameters of seismic attributes? Do you have a vision for how it could be better? It occurs to me this could be a great topic for a future hackathon...

Click here for an IPython Notebook version of this blog post. If you don't have it, IPython is easy to install. The easiest way is to install all of scientific Python, or use Canopy or Anaconda.

Julia in a nutshell

Julia is the most talked-about language in the scientific Python community. Well, OK, maybe second to Python... but only just. I noticed this at SciPy in July, and again at EuroSciPy last weekend.

As promised, here's my attempt to explain why scientists are so excited about it.

Why is everyone so interested in Julia?

At some high level, Julia seems to solve what Steven Johnson (MIT) described at EuroSciPy on Friday as 'the two-language problem'. It's also known as Ousterhout's dichotomy. Basically, there are system languages (hard to use, fast), and scripting languages (easy to use, slow). Attempts to get the best of both worlds have tended to result in a bit of a mess. Until Julia.

Really though, why?

Cheap speed. Computer scientists adore C because it's rigorous and fast. Scientists and web developers worship Python because it's forgiving and usually fast enough. But the trade-off has led to various cunning ploys to get the best of both worlds, e.g. PyPy and Cython. Julia is perhaps the cunningest ploy of all, achieving speeds that compare with C, but with readable code, dynamic typing, garbage collection, multiple dispatch, and some really cool tricks like Unicode variable names that you enter in pure LaTeX. And check out the function definition shorthand: you can define a whole function in one line, like f(x) = 2x^2 + 1.

Why is Julia so fast?

Machines don't understand programming languages — the code written by humans has to be translated into machine language in a process called 'compiling'. There are three approaches:

  • Compile everything ahead of execution, like C or Fortran.
  • Interpret the code as it runs, like pure Python.
  • Compile 'just in time', at runtime: Julia's approach.

Compiling makes languages fast, because the executed code is tuned to the task (e.g. in terms of the types of variables it handles), and to the hardware it's running on. Indeed, it's only by building special code for, say, integers, that compiled languages achieve the speeds they do.

Julia is compiled, like C or Fortran, so it's fast. However, unlike C and Fortran, which are compiled before execution, Julia is compiled at runtime ('just in time' for execution). So it looks a little like an interpreted language: you can write a script, hit 'run' and it just works, just like you can with Python.

You can even see what the generated machine code looks like:

Don't worry, I can't read it either.

But how is it still dynamically typed?

Because the compiler can only build machine code for specific types — integers, floats, and so on — most compiled languages have static typing. The upshot of this is that the programmer has to declare the type of each variable, making the code rather brittle. Compared to dynamically typed languages like Python, in which any variable can be any type at any time, this makes coding a bit... tricky. (A computer scientist might say it's supposed to be tricky — you should know the type of everything — but we're just trying to get some science done.)

So how does Julia cope with dynamic typing and still compile everything before execution? This is the clever bit: Julia scans the instructions and compiles for the types it finds — a process called type inference — then generates the machine code, and caches it. If you then call the same instructions but with a different type, Julia recompiles for that type, and caches the new code in another location. Subsequent runs use the appropriate cached code, without recompilation.

Metaprogramming

It gets better. By employing metaprogramming — on-the-fly code generation for special tasks — it's possible for Julia to be even faster than highly optimized Fortran code, in which metaprogramming is unpleasantly difficult. So, for example, in Fortran one might tolerate a relatively slow loop that can only be avoided with code generation tricks; in Julia the faster route is much easier. Steven showed an example of exactly this.

Interoperability and parallelism

It gets even better. Julia has been built with interoperability in mind, so calling C — or Python — from inside Julia is easy. Projects like Jupyter will only push this further, and I expect Julia to soon be the friendliest way to speed up that stubborn inner NumPy loop. And I'm told a lot of thoughtful design has gone into Julia's parallel processing libraries... I have never found an easy way into that world, so I hope it's true.


I'm not even close to being able to describe all the other cool things Julia, which is still a young language, can do. Much of it will only be of interest to 'real' programmers. In many ways, Julia seems to be 'Python for C programmers'.

If you're really interested, read Steven's slides and especially his notebooks. Better yet, just install Julia and IJulia, and play around. Here's another tutorial and a cheatsheet to get you started.

Highlights from EuroSciPy

In July, Agile reported from SciPy in Austin, Texas, one of several annual conferences for people writing scientific software in the Python programming language. I liked it so much I was hungry for more, so at the end of my recent trip to Europe I traveled to the city of Cambridge, UK, to participate in EuroSciPy.

The conference was quite a bit smaller than its US parent, but still offered 2 days of tutorials, 2 days of tech talks, and a day of sprints. It all took place in the impressive William Gates Building, just west of the beautiful late Medieval city centre, and just east of Schlumberger's cool-looking research centre. What follows are my highlights...

Okay you win, Julia

Steven Johnson, an applied mathematician at MIT, gave the keynote on the first morning. His focus was Julia, the current darling of the scientific computing community, and part of a new ecosystem of languages that seek to cooperate, not compete. I'd been sort of ignoring Julia, in the hope that it might go away and let me focus on Python, the world's most useful language, and JavaScript, the world's most useful pidgin... but I don't think scientists can ignore Julia much longer.

I started writing about what makes Julia so interesting, but it turned into another post — up next. Spoiler: it's speed. [Edit: Here is that post! Julia in a nutshell.]

Learning from astrophysics

The Astropy project is a truly inspiring community — in just 2 years it has synthesized a dozen or so disparate astronomy libraries into an increasingly coherent and robust toolbox for astronomers and astrophysicists. What does this mean?

  • The software is well-tested and reliable.
  • Datatypes and coordinate systems are rich and consistent.
  • Documentation is useful and evenly distributed.
  • There is a tangible project to rally developers and coordinate funding.

Geophysicists might even be interested in some of the components of Astropy and the related SunPy project, for example:

  • astropy.units, just one part of the ever-growing astropy library, is a unit conversion and quantity handler to compare with pint; there's a small sketch after this list.
  • The sunpy datatypes map and spectra, for kinds of data that need special methods.
  • asv is a code-agnostic benchmarking library, a bit like freebench.
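
To give a flavour of astropy.units, here's a minimal sketch with arbitrary values, showing quantities carrying their units through arithmetic and conversion:

```python
from astropy import units as u

# Attach units to numbers and let arithmetic carry them along.
depth = 2350 * u.m
slowness = 400 * u.us / u.m                 # sonic slowness, microseconds per metre
velocity = (1 / slowness).to(u.m / u.s)     # 2500 m / s
twt = (2 * depth / velocity).to(u.ms)       # two-way time, 1880 ms
print(velocity, twt)
```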

Speed dating for scientists

Much of my work is about connecting geoscientists in meaningful collaboration. There are several ways to achieve this, other than through project work: unsessions, wikis, hackathons, and so on. Now there's another way: speed dating.

Okay, it doesn't quite get to the collaboration stage, but Vaggi and his co-authors shared an ingenious way to connect people and give their professional relationship the best chance of success (an amazing insight, a new algorithm, or some software). They asked everyone at a small 40-person conference to complete a questionnaire that asked, among other things, what they knew about, who they knew, and, crucially, what they wanted to know about. Then they applied graph theory to find the most desired new connections (the matrix shows the degree of similarity of interests, red is high), and gave the scientists five 10-minute 'dates' with scientists whose interests overlapped with theirs, and five more with scientists who knew about fields that were new to them. Brilliant! We have to try this at SEG.

Vaggi, F, T Schiavinotto, A Csikasz-Nagy, and R Carazo-Salas (2014). Mixing scientists at conferences using speed dating. Poster presentation at EuroSciPy, Cambridge, UK, August 2014. Code on GitHub.

Vaggi, F, T Schiavinotto, J Lawson, A Chessel, J Dodgson, M Geymonat, M Sato, R Carazo Salas, A Csikasz Nagy (2014). A network approach to mixing delegates at meetings. eLife, 3. DOI: 10.7554/eLife.02273

Other highlights

  • sumatra to generate and keep track of simulations.
  • vispy, an OpenGL-based visualization library, now has higher-level, more Pythonic components.
  • Ian Osvald's IPython add-on for RAM usage.
  • imageio for lightweight I/O of image files.
  • nbagg backend for matplotlib version 1.4, bringing native (non-JS) interactivity.
  • An on-the-fly kernel chooser in upcoming IPython 3 (currently in dev).

All in all, the technical program was a great couple of days, filled with the usual note-taking and hand-shaking. I had some good conversations around my poster on modelr. I got a quick tour of the University of Cambridge geophysics department (thanks to @lizzieday), which made me a little nostalgic for British academic institutions. A fun week!

The hack is back: An invitation to get creative

We're organizing another hackathon! It's free, and it's for everyone — not just programmers. So mark your calendar for the weekend of 25 and 26 October, sign up with a friend, and come to Denver for the most creative 48 hours you'll spend this year. Then stay for the annual geophysics fest that is the SEG Annual Meeting!

First things first: what is a hackathon? Don't worry, it's not illegal, and it has nothing to do with security. It has to do with ideas and collaborative tool creation. Here's a definition from Wikipedia:

A hackathon (also known as a hack day, hackfest, or codefest) is an event in which computer programmers and others involved in software development, including graphic designers, interface designers and project managers, collaborate intensively on software projects.

I would add that we also need a lot of scientists — you can bring your knowledge of workflows, attributes, wave theory, or rock physics. We need all of that.

Creativity in geophysics

The best thing we can do with our skills — and to acquire new ones — is create things. And if we create things with and alongside others, we learn from them and they learn from us, and we make lasting connections with people. We saw all this last year, when we built several new geophysics apps.


The event is at the THRIVE coworking space in downtown Denver, less than 20 minutes' walk from the convention centre — a Manhattan distance of under 1 mile. They are opening up especially for us — so we'll have the place to ourselves. Just us, our laptops, high-speed WiFi, and lots of tacos. 

Sign up here. It's going to be awesome.

The best in the biz


This business is blessed with some forward-looking companies that know all about innovation in subsurface geoscience. We're thrilled to have some of them as sponsors of our event, and I hope they will also be providing coders and judges for the event itself. So far we have generous support from dGB — creators of the OpendTect seismic interpretation platform — and ffA — creators of the GeoTeric seismic attribute analysis toolbox. A massive Thank You to them both.

If you think your organization might be up for supporting the event, please get in touch! And remember, a fantastic way to support the event — for free! — is just to come along and take part. Sign your team up here!

Student grants

We know there's a lot going on at SEG on this same weekend, and we know it's easier to get money for traditional things like courses. So... We promise that this hackathon will bring you at least as much lasting joy, insight, and skill development as any course. And, if you'll write and tell us what you'd build, we'll consider you for one of four special grants of $250 to help cover your extra costs. No strings. Send your ideas to matt@agilegeoscience.com.

Update

on 2014-09-07 12:17 by Matt Hall

OpenGeoSolutions, the Calgary-based tech company that's carrying the FreeUSP torch and exploring the frequency domain so thoroughly, has sponsored the hackathon again this year. Thank you to Jamie and Chris and everyone else at OGS!

What I learned at Wikimania

As you may know, I like going to conferences outside the usual subsurface circuit. For this year's amusement, I spent part of last week at the annual Wikimania conference, which this year was in London, UK. I've been to Wikimania before, but this year the conference promised to be bigger and/or better than ever. And I was looking for an excuse to visit the motherland...

What is Wikimania?

Wikipedia, one of humanity's greatest achievements, has lots of moving parts:

  • All the amazing content on Wikipedia.org — the best encyclopedia the world has ever seen (according to a recent study by Rodrigues and Silvério).
  • The huge, diverse, distributed community of contributors and editors that writes and maintains the content.
  • The free, open source software it runs on, MediaWiki, and the community of developers that built it.
  • The family of sister projects: Wikimedia Commons for images, Wikidata for facts, WikiSource for references, Wiktionary for definitions, and several others.
  • The Wikimedia Foundation, the non-profit that makes all this amazing stuff happen.

Wikimania is the gathering for all of these pieces. And this year the event drew over 2000 people: employees of the Foundation, software contributors, editors, and consultants like me. I can't summarize it all, so here are a few highlights...

Research reviews

My favourite session, The state of WikiMedia scholarship, was hosted by Benjamin Mako Hill, Tilman Bayer, and Aaron Shaw. These guys are academics conducting research into the sociological side of wikis. They took it upon themselves to survey most of the 800 papers that appeared in the last 12 months, and to pick a few themes and highlight them for everyone. A little like the Geophysics Bright Spots column in The Leading Edge, but for the entire discipline. Very cool — definitely an idea to borrow!

A definition of community

Communities are one thing, but what sets the Wikipedia community apart is its massive productivity. It has created one of the premier intellectual works in history, and done so in under 10 years, and without a leader or a Gantt chart. So it's interesting to hear about what makes this community work. What would you guess? Alignment? Collaboration? Altruism?

No, it seems to be conflict. Conflict, centered firmly on content—specifically sources, wording, accuracy, and article structure—is more prevalent in the community than collaboration (Kim Osman, WikiSym 2013). It's called 'generative friction', and it underlines something I think is intuitively obvious: communities thrive on diversity, not homogeneity.

How to make a difference

The most striking talk, illustrating perfectly how the world today is a new and wonderful place, was by one of the most inspiring leaders I've ever seen in action: Clare Sutcliffe. In 2012, she discovered that kids weren't getting a chance to give computers instructions (other than 'post this', or 'buy that') in most UK primary schools. Instead of writing a paper about it, or setting up a research institute, or indeed blogging about it, she immediately started doing something about it. Her program, Code Club, is now running in more than 2000 schools. Today, less than 3 years after starting, Code Club is teaching teachers too, and has spread internationally. Amazing and inspiring.

Amusingly, here's a (paraphrased) comment she got from a computer science professor at the end:

I teach computer science at university, where we have to get the kids to unlearn all the stuff they think they know about programming. What are you teaching them about computer science and ethics, or is it all about making games?

Some people are beyond help.

The product is not the goal

I'll finish off with a remark by the new Executive Director of the Wikimedia Foundation, Lila Tretikov. Wikipedia's quality issues are well and truly behind it — the enemy now is bias. At least 87% of edits are by men. She wondered if it might be time to change the goal of the community from 'the greatest possible article', to 'the greatest possible participation'. By definition, the greatest article is also presumably unbiased.

In other words, instead of imagining a world where everyone has free access to the sum of all human knowledge, she is asking us to imagine a world where everyone contributes to the sum of all human knowledge. If you can think of a more profound idea than this — let's hear it in the comments!

The next Wikimania will be in Mexico City, in July 2015. See you there!

Here's a thought. All this stuff is free — yay! But happy thoughts aren't enough to get stuff done. So if you value this corner of the Internet, please consider donating to the Foundation. Better still, if your company values it — maybe it uses the MediaWiki software for its own wiki — then it can help with the software's development by donating. Instead of giving Microsoft $1M for a rubbish SharePoint pseudowiki, download MediaWiki for free and donate $250k to the foundation. It's a win-win... and it's tax-deductible!

Graphics that repay careful study

The Visual Display of Quantitative Information by Edward Tufte (2nd ed., Graphics Press, 2001) celebrates communication through data graphics. The book provides a vocabulary and practical theory for data graphics, and Tufte pulls no punches — he suggests why some graphics are better than others, and even condemns failed ones as lost opportunities. The book outlines empirical measures of graphical performance, and describes the pursuit of graphic-making as one of sequential improvement through revision and editing. I see this book as a sort of moral authority on visualization, and as the reference book for developing graphical taste.

Through design, the graphic artist allows the viewer to enter into a transaction with the data. High performance graphics, according to Tufte, 'repay careful study'. They support discovery, probing questions, and a deeper narrative. These kinds of graphics take a lot of work, but they do a lot of work in return. In later books Tufte writes, 'To clarify, add detail.'

A stochastic AVO crossplot

Consider this graphic from the stochastic AVO modeling section of modelr. Its elements are constructed with code, and since it is a program, it is completely reproducible.

Let's dissect some of the conceptual high points. This graphic shows all the data simultaneously across 3 domains, one in each panel. The data points are sampled from probability density estimates of the physical model. It is a large dataset from many calculations of angle-dependent reflectivity at an interface. The data is revealed with a semi-transparent overlay, so that areas of certainty are visually opaque, and areas of uncertainty are harder to see.

At the same time, you can still see every data point that makes up the graphic, giving a broad overview (the range and additive intensity of the lines and points) as well as the finer structure. We place the two modeled dimensions with templates in the background, alongside the physical model histograms. We can see, for instance, how likely we are to see a phase reversal, or a Class 3 response, given the physical probability estimates. The statistical and site-specific nature of subsurface modeling is represented in spirit. All the data has context, and all the data has uncertainty.
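
The spirit of the overlay is easy to reproduce. Here's a sketch in matplotlib using two-term Shuey-style curves, with made-up distributions for intercept and gradient standing in for modelr's actual physics:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
theta = np.radians(np.linspace(0, 40, 50))

# Made-up distributions for AVO intercept and gradient.
intercepts = rng.normal(0.05, 0.02, 500)
gradients = rng.normal(-0.15, 0.05, 500)

fig, ax = plt.subplots(figsize=(6, 4))
for a, b in zip(intercepts, gradients):
    # Low alpha: dense (likely) regions read as opaque, sparse ones stay faint.
    ax.plot(np.degrees(theta), a + b * np.sin(theta)**2,
            color='steelblue', alpha=0.03)
ax.axhline(0, color='k', lw=0.5)
ax.set_xlabel('Angle of incidence (degrees)')
ax.set_ylabel('Reflectivity')
plt.show()
```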

Rules for graphics that work

Tufte summarizes that excellent data graphics should:

  • Show all the data.
  • Provoke the viewer into thinking about meaning.
  • Avoid distorting what the data have to say.
  • Present many numbers in a small space.
  • Make large data sets coherent.
  • Encourage the eye to compare different pieces of the data.
  • Reveal the data at several levels of detail, from a broad overview to the fine structure.
  • Serve a reasonably clear purpose: description, exploration, tabulation, or decoration.
  • Be closely integrated with the statistical and verbal descriptions of a data set.

The data density, or data-to-ink ratio, looks reasonably high in my crossplot, but it could likely still be optimized. What would you remove? What would you add? What elements need revision?