Peer review is widely considered to be the best way to check the validity and correctness of a piece of published research. It’s unsurprising, then, that there is a drive to apply the principles of peer review to other research outputs, such as software, workflows, and of course data.
I won’t be saying anything revolutionary when I say that data is complicated. Journal articles come in a (pretty much) standardized format, are easily accessible for review, and don’t require special tools to interrogate what is in the paper. This doesn’t mean that reviewing papers is easy, just that data carry an extra layer of complication which needs to be addressed before review can happen.
For a start, data are hugely heterogeneous. In just the earth sciences, data can come in forms as varied as:
- Time series, some still being updated, e.g. meteorological measurements
- Large 4D synthesised datasets, e.g. Climate, Oceanographic, Hydrological and Numerical Weather Prediction model data generated on a supercomputer
- 2D scans, e.g. satellite data, weather radar data
- 2D snapshots, e.g. cloud camera images
- Traces through a changing medium, e.g. radiosonde launches, aircraft flights, ocean salinity and temperature
- Datasets consisting of data from multiple instruments as part of the same measurement campaign
- Physical samples, e.g. fossils
Of course, all these types of data come in their own (sometimes specialized and proprietary) formats, often requiring specialized software to open the files and read the data. It’s often not as simple as clicking on a link and reading an HTML or PDF document. And in cases where data are stored in commonly used formats, like spreadsheets, extra information is still needed to determine what the numbers in the cells actually represent, and how to use them.
Data are also big. A reviewer may be happy to review a 20-page paper on the train, but printing out 2 TB of data tables is not really an option (and even if you did, it would be very difficult to make sense of it all!) The hard copy of the Human Genome, at the Wellcome Collection in London, is an excellent example of the sheer physical size of some datasets, and also of the impossibility of using it, or reviewing it, in any meaningful way solely from the hard copy.
It can also be very difficult to determine when a dataset is “finished.” Some data are collected over the course of years, decades, or centuries, with the community wanting to use the most up-to-date data now. If an ongoing dataset is reviewed, obviously the review only applies to the data collected up to that point. But what of the data collected after? Is that still covered by the review? My answer is yes, to a certain extent, because the review doesn’t just look at the data – it looks at the methodologies of collection, the data management and archiving processes, and the metadata published along with the data – all of which remain valid for data collected after the review.
While peer review evaluates the quality of a piece of research, evaluating the quality of data is not as easy as it seems. For example, a dataset might be collected for one purpose (e.g. studying atmospheric fading on an Earth-to-satellite communication channel) and turn out to have too much noise in the received signal. However, that noise is exactly the sort of data that researchers studying the effects of scintillation (changes in the refractive index of the atmosphere) are interested in. So a dataset with too much noise is “bad” for a researcher working on fading, while “good” for a researcher working on scintillation. In some cases, a dataset might be only marginally useful now, but in twenty or a hundred years it could be exceptionally valuable. For example, in the eighteenth century, ships’ captains regularly made meteorological measurements and recorded them in the ship’s log. They would have had no idea at the time that their measurements would become an incredibly valuable resource for investigating the effects of climate change several hundred years later!
With all these complications, it seems that reviewing data should be a horrible, difficult job – a perception that puts reviewers off even attempting to review data. From personal experience, I can say that reviewing data isn’t as hard as you’d expect – provided guidance is given on what the review is trying to determine.
As I said earlier, the quality of a dataset is often determined by what use one wants to make of it. But regardless of how useful a dataset might be in the future, if it is not documented and archived properly, all its potential is wasted. My thinking is that data peer review should focus on the fundamental question of “Can this dataset be used by others, sometime in the future?” That question leads you to ask about the supporting information (metadata) provided with the dataset, the formats the data is stored in, and the longevity and trustworthiness of the repository it is archived in. Datasets can rely on domain-specific knowledge for their use (though implicit domain knowledge may well disappear over time), but it is better and safer to provide as much metadata and associated documentation as possible.
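To make that question concrete, here is a minimal sketch of the kind of descriptive metadata record it points to. The field names loosely echo common conventions (e.g. Dublin Core or DataCite), but this exact schema, and all the values in it, are illustrative assumptions, not any real dataset or standard:

```python
# Hypothetical metadata record for an imaginary dataset; every value here
# is a placeholder, not a real identifier or archive.
dataset_metadata = {
    "title": "Example radiosonde launches, 2010-2020",
    "creator": "A. Researcher",
    "identifier": "doi:10.0000/example",          # placeholder DOI
    "description": "Temperature and humidity traces from weekly launches.",
    "format": "NetCDF-4",                         # a community-standard format
    "repository": "a certified long-term archive",
    "license": "CC-BY 4.0",
}

# A reviewer (or an automated check) might verify the essentials are present:
required = {"title", "creator", "identifier", "description", "format", "license"}
missing = required - dataset_metadata.keys()
print("missing fields:", sorted(missing))
```

The point is not this particular schema, but that “usable by others in the future” can be turned into checkable questions about what accompanies the data.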
Providing this supporting information takes time and effort, but there is an increasing recognition by funders and researchers that properly archiving and documenting data is worth the effort and should be funded. Making data open for peer review brings other benefits besides error-checking and protection against fraud. It also opens data for collaboration and exploitation by industry and out-of-discipline researchers, promotes the dataset, and makes it easier to track the data’s impact.
Peer review of data will require different skills from peer review of articles. One way to deal with this is to split the review into different types, each asking different questions, for example:
- Editorial review – “Does the dataset have a permanent identifier?” “Is the dataset stored in a trusted repository?” “Are the access conditions clearly laid out, and do they follow journal guidance?”
- Technical review – “Is the data in an appropriate, community-standard format?” “Are tools and services provided to facilitate visualisation and manipulation of the data?” “Is there enough metadata for non-specialist users to understand what the data is and how it was collected?”
- Scientific review – “Is the metadata provided accurate?” “Are there suspicious values in the data that shouldn’t be there (e.g. negative values for rain rate)?” “Is the dataset fit for the purpose it was collected for?” “Is the dataset suitable for any other uses?”
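Parts of the scientific review can even be automated. As a minimal sketch of the “suspicious values” check, here is a hypothetical function that flags physically impossible rain rates; the function name, the upper bound of 400 mm/hr, and the sample values are all my own illustrative assumptions, not part of any real review tool:

```python
def find_suspicious_rain_rates(values):
    """Return indices of physically implausible rain-rate values (mm/hr).

    Rain rate cannot be negative, and sustained rates above ~400 mm/hr
    would exceed recorded extremes, so both are flagged for a human
    reviewer to inspect.
    """
    suspicious = []
    for i, value in enumerate(values):
        if value < 0 or value > 400:
            suspicious.append(i)
    return suspicious

# Hypothetical sample data: indices 2 and 4 should be flagged.
rain_rate_mm_hr = [0.0, 2.5, -1.0, 12.3, 999.9, 0.4]
print(find_suspicious_rain_rates(rain_rate_mm_hr))  # -> [2, 4]
```

Checks like this don’t replace the reviewer’s judgement – they simply surface the values a human should look at.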
We are still in the very early stages of peer review for data, so we’re only just starting to come up with solutions for many of these problems. Of course, given the heterogeneity of data, and the broad spread of academic disciplines, we simply won’t be able to create a generic solution!
This is an interesting time for data, and for the whole academic publication process. I’m looking forward to seeing what solutions we develop.
Caption (first image) Hard copy of the Human Genome at the Wellcome Collection, London Source: Sarah Callaghan
Caption (second image) A look inside one of the books of the hard copy of the Human Genome at the Wellcome Collection, London Source: Sarah Callaghan