Should reviewers be expected to review supporting datasets and code?

by John Helliwell, Emeritus Professor of Chemistry, University of Manchester, and DSc Physics, University of York (@HelliwellJohn)

Introduction

For the meeting entitled “Transparency, Reward, and Innovation in Peer Review in the Life Sciences” to be held on Feb. 7-9, 2018 at the Howard Hughes Medical Institute in Chevy Chase, Maryland (http://asapbio.org/peer-review) I have been asked by The Wellcome Trust to open the discussion on the question in my title.

In my view, peer reviewing research article submissions to journals is arguably one of the most important roles we scientists play. Through this process we seek to improve the research of our peers, highlighting errors and omissions, and work to ensure that scientifically flawed research does not get published. To perform this work effectively, however – especially in our new data-driven age – it is crucial that peer reviewers be given unfettered access to the data and code underlying the research being reviewed. Unfortunately, while many journals provide access to these data after an article’s publication,* most do not provide access during the refereeing process, making it almost impossible to perform the peer review function effectively.

In this blog post I will discuss why peer review of the underpinning data of a research article is important – using examples from my field of crystallography – and outline some steps which funders and publishers could take to implement peer review of data.

Some history

As an editor for a learned society’s suite of journals (those of the International Union of Crystallography, IUCr) I have handled around 1000 research article submissions. As a researcher I never kept count of the refereeing tasks I accepted, but at, say, ten per year these total around 400, for a wide variety of journals, during my research career so far.

For the last 15 years I have campaigned within my field, biological crystallography, on the grounds that a peer reviewer who has access only to the narrative of an article is without the facts of the underpinning data. Indeed, even further back, I recall including my diffraction data in my Oxford University DPhil thesis at the time of submission in 1977, so that my examiners could view these underpinning data if they wished.

I regard as the biggest failure of my career the vote I lost at the IUCr Commission on Biological Macromolecules open meeting, held at the IUCr World Congress in Geneva in 2002, on a proposal to require open access to diffraction data and coordinates for the referees of biology articles submitted to IUCr journals; I was Editor-in-Chief at the time.

So, what were the arguments made by those present at the open Commission meeting against my proposal of referee access to the data for a submitted article, and how can they be refuted?

Firstly, I was told, biological structure results may have industrial potential, and authors should be allowed by the Protein Data Bank an embargo period of up to one year after publication before public release. The very idea of referees having access to the data before publication was therefore inadmissible. My reply, that such submissions could be treated as a special category exempt from the requirement, was evidently not reassuring enough.

Secondly, at the pre-publication stage, authors were worried about referees gaining an advantage in their own research while having the power to delay the authors’ publication; of course, by necessity the editor would have had to select referees who knew the topic of the article and who were therefore potential competitors. My reply, that authors could list in their submission letter those referees they wished the editor not to consult, was again not reassuring enough.

Interestingly, an argument that might have been made against my proposal – that reviewers would be overloaded with work if they had to assess the underpinning data as well – was not made. Some years later this issue did start to be raised in general, i.e. across all of science. It is worth noting that at this time other journals, such as IUCr’s Acta Cryst C Crystal Structure Communications, for chemists, were already requiring that the underlying data be shared with reviewers, with no obvious problems. Moreover, through computer-generated validation reports – such as that provided by checkCIF, today comprising more than 400 automated checks of the structural chemistry and diffraction data underpinning an article (see http://checkcif.iucr.org/) – reviewers could be assured that the data are valid. The advantage of this type of data check is consistency of approach. The IUCr Managing Editor and I had therefore already proposed that the Protein Data Bank instigate an equivalent validation report for protein crystallography database entries, which the PDB has since successfully introduced. Referees today are greatly aided by the PDB’s validation report, which also highlights in yellow any problem portions picked up by its automated checks.

Present day

Today, 15 years later, there are still no journals publishing biological crystal structures that mandate that these data be shared with reviewers. At most, puzzlingly, the instructions to a reviewer might require the reviewer to attest that the article’s conclusions are supported by the data, yet with no indication that the data are routinely made available to the reviewer! The emergence of the general science data journal Scientific Data is refreshing, as it does require access to the research data for its referees (https://www.nature.com/articles/sdata201633).

With a small band of like-minded colleagues concerned about the growing number of cases of irreproducible results, we recently published two articles (here and here) exhorting journal editors to take on board the need to scrutinise the underpinning biological crystallography data. [Please note that these two opinion pieces are behind subscriber paywalls, because the opinions are unfunded efforts on our part.]

In the last five years great strides have been made in the storage capacities of digital archives. Funding agencies are increasingly stringent about their funded researchers having research data management plans. Within the IUCr I was charged with leading a Diffraction Data Deposition Working Group to evaluate the need for, and possibilities of, raw diffraction data archiving. Our work over the last six years culminated in August 2017 with our final report, available at http://forums.iucr.org/viewtopic.php?f=21&t=396 .

Our top two recommendations are:

  • Authors should provide a permanent and prominent link from their article to the raw data sets that underpin their journal publication and its associated database deposition of processed diffraction data (e.g. structure factor amplitudes and intensities) and coordinates. These raw diffraction data sets should obey the ‘FAIR’ principles: they should be Findable, Accessible, Interoperable and Re-usable (https://www.force11.org/group/fairgroup/fairprinciples).
  • A registered Digital Object Identifier (DOI) should be the persistent identifier of choice (rather than a Uniform Resource Locator, URL) as the most sustainable way to identify and locate a raw diffraction data set.
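The preference for a DOI over a bare URL can be sketched in a few lines: a DOI is resolved through the doi.org service, which adds a level of indirection, so a data set can move hosts without the published identifier breaking. The DOI below is a hypothetical example, not a real data set.

```python
# Minimal sketch of DOI-based identification. The doi.org resolver
# redirects a DOI to the data set's current landing page, so the
# identifier printed in an article stays stable even if the data move.

def doi_resolver_url(doi: str) -> str:
    """Return the doi.org resolver URL for a DOI of the form 10.<registrant>/<suffix>."""
    prefix, sep, suffix = doi.partition("/")
    if not (prefix.startswith("10.") and sep and suffix):
        raise ValueError(f"not a valid DOI: {doi!r}")
    return f"https://doi.org/{doi}"

# Hypothetical DOI for a raw diffraction data set (of the kind minted
# by, e.g., a data archive or institutional repository).
print(doi_resolver_url("10.5281/zenodo.0000000"))
```

A reader or referee who follows the resolver URL is redirected to wherever the archive currently hosts the data, which is precisely the sustainability property the recommendation is after.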

There is now the possibility that raw data archiving can bring the referee’s evaluation to the very earliest stages of the authors’ analyses. This also removes possible doubts about an overly selective use of data by authors or, conversely, about the use of data measured from a sample badly affected by X-ray damage. [The same concerns apply to electron damage of samples in cryoEM.]

Software

For a field like biological crystallography there are several open-source software packages freely available, so referees can check for reproducibility across more than one package. Linked with software is the need for guidance to referees on the computational and data science skills they should have, which in turn is linked with the need for certified training of our referees, something I outline further here.
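The kind of cross-package reproducibility check described above can be illustrated with a small sketch. The numbers below are hypothetical merged intensities, and the agreement metric is a simple R-factor-style sum; real comparisons would use the full reflection lists produced by the respective packages.

```python
# Minimal sketch (hypothetical data) of a consistency check a referee
# might run: compare merged intensities for the same reflections as
# processed by two different software packages, using an R-factor-style
# agreement metric, R = sum(|Ia - Ib|) / sum(|Ia|).

def r_factor(intensities_a, intensities_b):
    """Agreement between two matched lists of reflection intensities."""
    if len(intensities_a) != len(intensities_b):
        raise ValueError("reflection lists must be matched in length")
    numerator = sum(abs(a - b) for a, b in zip(intensities_a, intensities_b))
    denominator = sum(abs(a) for a in intensities_a)
    return numerator / denominator

# Hypothetical merged intensities for the same five reflections from
# two processing packages; a small R value suggests the result does
# not depend on the choice of software.
package_1 = [120.5, 98.2, 301.7, 54.0, 210.3]
package_2 = [119.8, 99.0, 300.2, 55.1, 209.7]
print(f"R = {r_factor(package_1, package_2):.4f}")
```

The point of the exercise is not the particular metric but that, with open data and open code, such a check is cheap for a referee to perform.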

Conclusions

For the last two years I have only accepted a refereeing task if I can be provided with the underpinning data for the submitted article. So far this has comprised approximately ten refereeing commissions from various journals. Fortunately, only one case involved a refusal to give me access to the underpinning data; I had to recommend rejection in that case because I could not attest to the article’s conclusions. In the other cases, where I did have access to the data, I most often required major revisions, including improvements to the diffraction data processing and the biological structure model refinements. Overall I felt I had done a much better and more proper job as a reviewer than previously.

I still firmly believe that a specialist referee, with the appropriate data science skills, is a considerable help in making the very best possible judgement calls on what becomes the version of record of the data sets accompanying an article. The ‘crowd’ will of course come later, possibly to offer its own style of refereeing or critique of those data sets and articles. But if such critiques are then published without the revised data sets being properly scrutinised in the way I have described above, then the critique’s accompanying data may themselves be incorrect!

So, what should publishers and funders be doing?

Clearly, as well as requiring data availability after publication, they should state in their policies that they value peer reviewers who sincerely try to attest to a submitted article’s conclusions by truly having access to the underpinning data.

What should professional associations be doing?

They should ensure that training courses are provided, with proper examination at the end of each course. These should be available to early-, mid- and late-career researchers. I realise now, from the retraining and reskilling I have undertaken in my retirement, that in my late career I had become a manager of research with rather out-of-date data skills. Continual professional development is important at all stages of a researcher’s career, not least if they are going to undertake proper peer review including the data.


* Footnote

For example, whilst The Wellcome Trust’s Outputs sharing policy doesn’t make any specific comments about making data sets available to peer reviewers, their Open Research publishing platform does include this in its research data policy:

4.2 Data repository requirements

In order to host data linked to a Wellcome Open Research article, a repository must be actively managed. Repositories must:

4.2.1 Enable access to the dataset

  • Access to the data should normally be completely open, unless there are genuine concerns over security/privacy of the data. Information should be provided about who can access the data, terms and conditions of access, and a clear point of contact.
  • The repository must have a policy for data that do require additional protection. This includes appropriate access for peer reviewers, as required as part of the data peer-review process. (In the context of data, peer reviewers are experienced researchers who produce or use data in the same field as the data being published.)

Reference: https://wellcomeopenresearch.org/for-authors/data-guidelines . Most recently accessed 12 October 2017.

General reading

J. R. Helliwell, B. McMahon, M. Guss and L. M. J. Kroon-Batenburg, “The Science is in the Data”, IUCrJ (2017) 4, 714–722. Open access at: https://journals.iucr.org/m/issues/2017/06/00/ah5002/index.html

  • Rebecca Lawrence

    Great piece John, and completely agree with you that it is very hard to see how a referee can judge the conclusions of a paper without seeing the underpinning data. Just to add a point of clarity about Wellcome Open Research (and the other Open Research platforms that F1000 operate – and to declare my COI, I am Managing Director of F1000): you are absolutely correct that we require the data to be made openly available for all our articles, but because our peer review process takes place after publication, this means that all our referees can indeed look at the data as part of their evaluation process. We also specifically prompt our referees to do exactly that in the referee report form, with some examples of really valuable reanalyses and subsequent insights.