Science’s data secrecy problem
In 2006, amid growing skepticism about the reliability of psychology studies, a group of researchers decided to figure out just how solidly grounded those studies were. They looked at 141 major psychology papers and emailed their authors to request the original data.
Four hundred emails and six months later, they’d received the data for only a quarter of those studies. The rest were unavailable. And so, instead of the question they’d set out to answer, they wrote a different article—titled, pointedly, “The poor availability of psychological research data for reanalysis.”
What went wrong? Given how important data is in scientific research, and how much of it is publicly funded, one might think research data is easily available for examination – for other researchers to kick the tires, so to speak. But actually, only a small minority of papers are published with the data available.
Those psych researchers in 2006 aren’t the only team to encounter such frustration. In 2009, a group looking at studies related to modeling in cancer, malaria, and other diseases found only 20 percent of datasets could be accessed. Other researchers who looked specifically at high-impact studies, those published in the most prestigious journals, found that only 10 percent of publications contained the raw data on which their findings were based.
This might come as a surprise. The entire scientific enterprise is, in theory, built on sharing data – it’s how researchers convince skeptics, how they pressure-test one another’s theories. Unlike the secretive world of private-sector invention, science is largely funded with federal or nonprofit money, adding a public-interest component to the basic scientific principle of transparency.
The reasons for the lack of data sharing are sometimes quite simple: Providing data can be a nuisance, taking time and money away from running experiments. And sometimes published datasets vanish over time, a function of non-standard archival mechanisms and poor enforcement of data sharing. (This was documented by a research group in 2013; as one author described it, some datasets are simply being “lost to science.”)
But secrecy is another problem. Data helps researchers publish, and publications are the currency of scientists, earning them grants and promotions. Thus, researchers often cling jealously to their most important data, treating it more like proprietary information than a public resource.
This secrecy, especially troubling given the public funding of most research, has given rise to a movement for open data and open science more broadly. The movement calls for data sharing and for open-access publishing (that is, research published in non-paywalled forums). It builds upon the mandate by the Obama administration, implemented in 2013, that all federally funded research articles be made available to read for free within one year of publication.
Such a movement is supported by the scientific community in principle, but not often followed in practice. Over 16,000 researchers have signed a pledge to not publish in Elsevier journals; Elsevier is the world’s largest publisher and one that is known for expensive paywalls and other closed-door practices. But four years after the pledge started to circulate, more than one-third of signers who’ve published have already broken it.
(The movement has also triggered something of a data-sharing backlash: An op-ed in the New England Journal of Medicine last year coined the term “research parasite” to describe scientists who reuse and adapt others’ data without any explicit benefit to those who collected it.)
Today, mandates from research funders, federal and private, are starting to change this process—whether researchers like it or not. The Wellcome Trust and the Gates Foundation, two of the biggest independent sources of medical research funding, require any researcher receiving funding to post data openly.