The Protein Data Bank (PDB) was established as the first open access repository for biological data, and the datasets it hosts have been invaluable to research in fundamental biology and the understanding of health and disease. Just this month, we witnessed the announcement of the AlphaFold2 structure prediction results, made possible by the more than 170,000 freely accessible structures in the PDB, which provided “training data” for the structure prediction software.

It was not always the case that such structural biology data were freely available, even upon journal publication. From the founding of the PDB in 1971 until the late 1980s, most journals did not require deposition of structures in a public database. A key moment was a petition, circulated in 1987 by a group of leading structural biologists, demanding that the data created be made openly available upon journal publication. This petition led to major journals adopting data deposition standards. In the early 1990s, the National Institute of General Medical Sciences (NIGMS) imposed similar requirements on all grantees. 

The revolution in publishing made possible by preprints calls for a re-evaluation of data disclosure practices in structural biology. While journal review processes take weeks, months, or even years, preprints allow researchers to rapidly communicate their findings to the community. However, withholding access to PDB files that accompany preprints inhibits the progress towards scientific discovery which preprints can enable. 

Commitment

We pledge to publicly release our PDB files (and associated structure factor, restraint, and map files) with deposition of our preprints.

We encourage all structural biologists to also deposit raw data in appropriate resources (e.g., EMPIAR, proteindiffraction.org, or https://data.sbgrid.org/).

ASAPpdb Signatories


Next steps

Funders could also play a role in encouraging data deposition through their guidance to grantees and applicants. Preprint servers could likewise encourage users to share their data during the submission process (with appropriate citation in accordance with the FORCE11 data citation principles) and encourage affiliates to check for the availability of such data during the screening process. ASAPbio will share this letter and its signatories with these entities to advance the conversation about other ways to encourage data availability.

While this letter is focused on structural data, we hope other communities will follow with their own support for data sharing upon preprint deposition, particularly those with a strong culture of data sharing and established dedicated repositories, for example for gene sequences (GenBank), gene expression (GEO), microscopy data (EMDB), NMR assignments (BMRB), and similar datasets.

We invite these communities to develop their own call for support for data sharing with preprints and we encourage them to contact us if they would like to pursue a similar call.

Frequently asked questions about preprints and structural biology

For more information about preprints, including additional FAQ, check the info center.

What is a preprint?

A preprint is a scientific manuscript that is uploaded by the authors to a public server. The preprint contains data and methods, but has not yet been accepted by a journal. While some servers perform brief quality-control inspections (for more details on the practices of individual servers, see asapbio.org/preprint-servers), the author’s manuscript is typically posted online within a day or so without peer review and can be viewed (and possibly translated, reposted, or used in other ways, depending on the license) without charge by anyone in the world. Most preprint servers support versioning, that is, the posting of updated versions of your paper based upon feedback and/or new data. However, to preserve the scholarly record, most servers also retain prior preprint versions, which typically cannot be removed. Preprints allow scientists to directly control the dissemination of their work to the worldwide scientific community.

Are preprints compatible with journals?

Yes. While both preprints and journal articles enable researchers to disseminate their findings to the research community, they are complementary in that preprints represent an opportunity to disseminate at an early stage. 

In most cases, the same work posted as a preprint is also submitted for peer review at a journal. Thus, preprints (rapid, but not validated through peer review) and journal publication (slow, but providing validation through peer review) work in parallel as a communication system for scientific research.

In many fields, the majority of journals allow submission and citation of preprints. To get a sense of preprint policies, you can check SHERPA/RoMEO, Transpose, or Wikipedia’s List of academic journals by preprint policy. However, before submitting a manuscript, always check the journal’s website for recent changes or any nuances of its policy.

How does the PDB interact with preprint servers?

The PDB considers papers posted on a preprint server to be publications (https://www.wwpdb.org/documentation/policy#toc_release) and will release PDB data associated with a preprint once the preprint is posted.

Will my preprint be rejected from a preprint server if posted without PDB data?

We are not aware of preprint servers that screen on this basis at this time, but we hope that preprint servers or community projects might highlight preprints that contain complete data. 

How will a preprint affect my patent application?

Preprints, like journal articles, are considered public disclosures, which can affect a patent application. If you intend to file an application to patent work disclosed in your paper, discuss the situation with your technology transfer office before posting your preprint.

Can I still link my PDB record to the journal version?

Yes. Versioning in the PDB makes it possible for authors to update their files and journal information while preserving the unique PDB identifiers. This system ensures that the public always has the “up to date” version.

What about CASP, which relies on embargoes? 

There are three easy ways to share your protein information with CASP. You can directly submit the sequence of your protein and the related information through the web form; mark your PDB deposition as ‘CASP target’ (check box) within the PDB deposition system; or send CASP an email (casp AT predictioncenter.org). Any of these steps can be taken before preprint disclosure.

Should I use “REL” or “HPUB” as my author-requested status codes for PDB entries?

REL entries are released as soon as the authors have approved the processed files, whereas HPUB (Hold until PUBlication) entries are placed on hold until publication or until one year from the date of deposition, whichever comes first. In both cases, the authors need to approve the final validated structure before release. Choosing REL will bring the release of the data as close as possible to the posting of the preprint. It is also possible to use the HPUB status code by proactively associating the PDB entry with the preprint DOI (and then subsequently creating an updated version with the journal DOI).
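The HPUB hold rule described above ("until publication or until one year from the date of deposition, whichever comes first") can be illustrated with a small sketch. The function name hpub_hold_ends is hypothetical and not part of any PDB tool, and reading "one year" as the same calendar date in the following year is an assumption. The sketch only computes when the hold ends; actual release still depends on author approval of the processed files.

```python
from datetime import date
from typing import Optional


def hpub_hold_ends(deposited: date, published: Optional[date] = None) -> date:
    """Date on which an HPUB hold ends (hypothetical helper, for illustration).

    The hold ends at publication or one year after deposition,
    whichever comes first. Because the PDB treats a posted preprint
    as a publication, `published` may be the preprint posting date.
    """
    try:
        # Assumption: "one year" means the same calendar date next year.
        one_year_limit = deposited.replace(year=deposited.year + 1)
    except ValueError:
        # Deposited on 29 February: use 28 February of the following year.
        one_year_limit = deposited.replace(year=deposited.year + 1, day=28)

    if published is None:
        return one_year_limit
    return min(published, one_year_limit)
```

For example, an entry deposited on a given day with a preprint posted a week later would come off hold on the preprint date, while an entry with no associated publication would come off hold at the one-year limit.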

What should I do if I read a preprint that does not include the underlying structural data?

You can contact the authors to query the availability of the dataset and encourage them to deposit and release the data to the PDB. You can also share this letter and resources with the authors and invite them to join the commitment to release their PDB file with their future preprints.

Header image: The structure of the SARS-CoV-2 macrodomain bound to its substrate ADP-ribose (PDB ID: 7KQP; https://www.biorxiv.org/content/10.1101/2020.11.24.393405v1.full)