Digging for data gold in soil
Pacific Northwest National Laboratory, USA
I frequently hear my field and lab colleagues disparaging their data sets: “Oh, this core is constantly giving high values for no reason!” or “We just couldn’t get any methane out of this core, and then we found a rock.” As a researcher, you know all the warts of your own data set, and it’s sometimes hard to see the gold underneath.
And your data set is gold. The time, money, and effort invested in designing the experiment or survey, and then in collecting data that capture a single point in time that will never be exactly replicated, make data incredibly valuable. In a very real sense, scientific data are the Truth that we as scientists are trying to understand, especially when viewed in aggregate with other data to extract general scientific insight.
Data deserve respect, both yours and your colleagues’. This is already common practice with field and lab notebooks. Who hasn’t opened a senior mentor’s old field notebook with a certain amount of awe and reverence, to look at sketches of soil profiles or coffee stains over detailed protocol notes? But all too often researchers see publication as the end of the road for their data. While many data sets don’t even make it that far (publication bias is another post), this shortchanges the enormous potential for re-use and the additional insights that could be gained from most data sets.
It goes without saying that soils are incredibly heterogeneous. Move over 10 centimeters, resample, and you can get huge variations in your soil measurements. This makes large data sets critical for extracting generalizable insights. But most soil measurements are laborious, limiting sample sizes to numbers so small that statisticians typically throw up their hands in despair. Individual studies try to get around this by restricting the scope of their conclusions, homogenizing soil samples, or other methods. Another way around it is to pool data sets after the original studies conclude.
The more data you have, the more valuable they become. But it takes a lot of work to get to a harmonized, multi-data-set database. There are several hurdles to data re-use, and they loosely fall into availability, discoverability, and harmonization.
Is the data set available? Data locked in a basement filing cabinet are literally inaccessible unless you happen to have the key. Increasingly, ‘contact the PI’ is an inadequate data policy for many funding agencies and academic journals, and for good reason: individual researchers move around, leave the field entirely, or simply lose data. Making your data available through a university library, society archive, or some other long-term institution is critical for preservation. But it is only the first step.
Once a data set is archived, it needs to be discoverable. In the age of Google, it still surprises many people how hard it is to find data once they are archived. Since data sets come in many different formats, they are not indexed on the data themselves but on the meta-data provided with the archived submission. This meta-data description is thus critical, and frequently impossible to standardize: if you are investigating a new phenomenon, there may be no standard way to describe it in the meta-data at the time of submission. There is active research in semantics and dynamically developed controlled vocabularies to try to solve this problem. In the meantime, in my opinion, the associated manuscript makes the best advertisement for an archived data set, so it is important to link that manuscript to the archived data at publication.
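To make this concrete, here is a minimal sketch of what a meta-data record might contain. The field names, values, and DOI are entirely illustrative (real archives each define their own schema); the point is that the searchable description, not the data file itself, is what makes the data set findable.

```python
# Hypothetical meta-data record for an archived soil data set.
# All field names and values below are illustrative, not from any
# particular archive's schema.
metadata = {
    "title": "Soil incubation CO2 flux measurements",
    "creators": ["A. Researcher"],
    "keywords": ["soil carbon", "incubation", "CO2 flux"],
    "methods": "Lab incubation at 20 C; headspace CO2 by gas chromatography.",
    "related_publication_doi": "10.0000/example-doi",  # link the manuscript!
    "units": {"co2_flux": "ug C g-1 soil d-1"},
}

# A record is only discoverable if the searchable fields are filled in,
# so a simple completeness check like this is worth running before submission.
required = {"title", "creators", "keywords", "related_publication_doi"}
missing = required - metadata.keys()
print("missing fields:", sorted(missing))
```

Even a lightweight check like this catches the most common discoverability failure: an archived file with an empty or uninformative description.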
Finally, if you manage to get the data out of basement storage and adequately described, there remains the third hurdle: harmonization. Is there enough information about the measurements and methods to intelligently compare the data to other data sets? As mentioned previously, data sets frequently come in unique formats and may use new or otherwise distinct measurement protocols that make automatic data ingestion intractable. Harmonizing a data set to make it comparable to a broader data collection requires expert understanding of the context of the data.
This final harmonization effort can be particularly tricky because you need to maintain a direct link to the original data set. Much like a field notebook or a set of lab protocols, a script or computer program robustly preserves data provenance. Manual entry, transcribing data from one template to another, is error prone and can be difficult to reproduce when reviewing the resulting meta-analysis. Sometimes manual transcription can’t be avoided, but hand-crafting a script to process each individual data set is generally the best way to harmonize them. Scripting data translation instead of entering it by hand is both explicit and reproducible, providing a clear line of provenance from the original data file to the final data product.
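A minimal sketch of what such a translation script might look like, assuming an incoming CSV with its own column names and units (the file name, columns, and unit conversion here are hypothetical stand-ins for whatever the original data set actually uses):

```python
import csv
import io

# Hypothetical raw data as it might arrive from a collaborator:
# their column names, their units (percent carbon rather than g C / kg soil).
raw = io.StringIO(
    "SampleID,Depth_cm,PctC\n"
    "core1_A,5,2.1\n"
    "core1_B,15,1.4\n"
)

# Every renaming and conversion lives in code, not in a hand-edited
# spreadsheet, so the path from original file to final product is explicit.
COLUMN_MAP = {"SampleID": "sample_id", "Depth_cm": "depth_cm", "PctC": "percent_c"}

harmonized = []
for row in csv.DictReader(raw):
    rec = {COLUMN_MAP[k]: v for k, v in row.items()}
    # Convert percent carbon to g C per kg soil (multiply by 10),
    # the (assumed) target unit of the broader collection.
    rec["carbon_g_kg"] = float(rec.pop("percent_c")) * 10
    rec["depth_cm"] = float(rec["depth_cm"])
    rec["source_file"] = "core1_chemistry.csv"  # provenance: where each row came from
    harmonized.append(rec)

print(harmonized[0])
```

Because the mapping and conversions are written down rather than done by hand, a reviewer of the resulting meta-analysis can re-run the script against the archived original and reproduce the harmonized product exactly.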
All three stages (archiving, discovery, and harmonization) are laborious. Finding a repository to take the data, adequately describing what is there in the meta-data, and writing harmonization scripts is not glamorous work. No one will win a Nobel Prize for their beautifully complete meta-data, but it is necessary.
Global soil maps, for example, can only be constructed by combining the results from soil surveys of different nations and regions. These maps are critical to benchmarking the land-carbon cycle of Earth system models used to inform anthropogenic emissions targets. The land-carbon models themselves rely on other data collections for parameterization and validation during model development. If you want your data to be of broader use to the community and live on beyond an individual project, it has to be preserved intelligently.
Data are scientific gold whose value increases as more data are collected. This is particularly true for a highly heterogeneous system like soil. Contributing data to the broader scientific collection requires that they be available, discoverable, and harmonizable. While meeting these requirements can be laborious, the long-term rewards to the broader community are significant.
Kathe Todd-Brown is a computational biogeochemist at the Pacific Northwest National Laboratory in Richland, Washington. She initially trained as a mathematician but found it a bit too clean. She transitioned to soil carbon cycling and couldn’t be happier to work in one of the most interesting systems on the planet.