The Benefits of a “Central Service” for Biology Preprints

Preprints are complete and public manuscripts with associated data shared before undergoing peer review. Physicists, mathematicians, and computer scientists post 100,000 preprints per year to arXiv, a scientist-governed preprint server that has been in operation for over a quarter of a century. Preprints in the life sciences are in a more embryonic stage, with fewer than 10,000 manuscripts posted per year. However, several meetings hosted by ASAPbio have ended with the conclusion that preprints, in conjunction with journals, hold great potential for enhancing scholarly communication in biology.

Recently, eleven major international funding agencies (Wellcome Trust, National Institutes of Health, Medical Research Council (UK), Helmsley Trust, Howard Hughes Medical Institute (HHMI), European Research Council, Simons Foundation, Canadian Institutes of Health Research, Alfred P. Sloan Foundation, Department of Biotechnology (Government of India), Laura and John Arnold Foundation) released a statement calling for further technology development and the creation of a central resource for preprints, which is being provisionally called the Central Service (CS). The CS will be a database that aggregates preprints from multiple sources and makes them easier for humans and machines to read. These features will enable scientists to find new knowledge that can accelerate their research. The CS will be overseen by a scientist-led governing body, which will ensure that it fulfills its mission of serving the scientific community and the public good.

ASAPbio (a scientist-driven organization to promote the productive use of preprints in biology) has released a Request for Applications (RFA) for the development of this service, which is open to all. After independent reviewers select the preferred applicant(s), and pending commitment of funders, the CS is expected to launch in 2018. Here we discuss why the Central Service is needed and its potential for advancing knowledge dissemination in the life sciences.

Preprints in the life sciences come from multiple sources, making them difficult to discover

Preprints in physics have coalesced into one location – arXiv.org. The critical mass and “one stop shopping” offered by a single distribution site have been central to arXiv’s success. The relatively few preprints in biology are distributed across several servers (bioRxiv, PeerJ, F1000, the q-bio section of arXiv), with bioRxiv receiving the most submissions. In addition, more parties are considering developing preprint services, including prominent journal publishers such as PLOS. Thus, preprint entities in biology are more likely to expand than to collapse into one source. The expansion of entry points could dramatically increase preprint numbers and offer unique features for authors. However, fragmentation of preprint sources makes the knowledge harder to find and potentially creates more ambiguity about preprint quality, reuse, and preservation. Collecting preprints from multiple intake sources into a unified database (the Central Service) would provide a single repository for searching and mining preprints.

Preprints represent a potentially rich source of data but are difficult to read by humans and machines

The scientific knowledge base is growing at a staggering rate, making it increasingly difficult for scientists to find information that might be relevant to their work. Currently, preprints take the form of an author-submitted PDF, which can be cumbersome to read on the web and is difficult for machines to search. To circumvent these limitations, the CS will develop a conversion tool that will receive manuscripts in the word processing formats currently used by life scientists and convert them into XML-based formats that are easier for humans and computers to read. This advance will make it easier for scientists to find the content that they need.
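To illustrate why structured XML is easier for machines to work with than a PDF, here is a minimal sketch in Python. It is not the CS conversion tool, whose design is left to the applicants. The function names (manuscript_to_xml, section_titles) are hypothetical, the element names are only loosely modeled on JATS-style article markup, and the input is assumed to have already been split into a title, an abstract, and sections.

```python
import xml.etree.ElementTree as ET


def manuscript_to_xml(title, abstract, sections):
    """Build a small XML document from already-separated manuscript parts.

    Assumption: element names are illustrative (loosely JATS-like), not a
    format specified for the Central Service.
    `sections` is a list of (heading, [paragraph, ...]) pairs.
    """
    article = ET.Element("article")
    front = ET.SubElement(article, "front")
    ET.SubElement(front, "article-title").text = title
    ET.SubElement(front, "abstract").text = abstract

    body = ET.SubElement(article, "body")
    for heading, paragraphs in sections:
        sec = ET.SubElement(body, "sec")
        ET.SubElement(sec, "title").text = heading
        for text in paragraphs:
            ET.SubElement(sec, "p").text = text

    # Serialize to a text string rather than bytes.
    return ET.tostring(article, encoding="unicode")


def section_titles(xml_text):
    """Return the section headings, in document order, from the XML above."""
    root = ET.fromstring(xml_text)
    return [t.text for t in root.findall("./body/sec/title")]
```

Once a manuscript is in this form, a program can pull out the Methods section or every section heading with a one-line query, which is not reliably possible with a PDF.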

Preprints should be a starting point for innovation

The corpus of preprints should be made available for creative use, discovery and innovation. The CS will provide interfaces for programmatic access (via open APIs) so that third parties have access to all content and can build innovative tools for scientists. Such tools could include better search algorithms, tools that aggregate knowledge and customize it for individual use, links to data or reagents, or annotation. The CS will provide a platform for the emergence of innovations that can help scientists, which in turn would help to drive the adoption of preprint communication.
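As a rough illustration of the kind of third-party tool that open programmatic access could enable, the sketch below builds a simple keyword index over preprint records. Everything here is an assumption made for the example: the CS API does not yet exist, so the record fields (id, source, title, abstract), the sample records, and the function names are invented, and the JSON string merely stands in for whatever an API response might look like.

```python
import json
from collections import defaultdict

# Made-up records standing in for a hypothetical API response.
SAMPLE_RESPONSE = json.dumps([
    {"id": "p1", "source": "server-a",
     "title": "Kinesin motility in vitro",
     "abstract": "We measure kinesin stepping."},
    {"id": "p2", "source": "server-b",
     "title": "Dynein motility in cells",
     "abstract": "We track dynein cargo."},
])


def build_index(records):
    """Map each lowercased word in a title or abstract to the ids containing it."""
    index = defaultdict(set)
    for rec in records:
        words = (rec["title"] + " " + rec["abstract"]).lower().split()
        for word in words:
            # Strip common punctuation so "stepping." matches "stepping".
            index[word.strip(".,;:()")].add(rec["id"])
    return index


def search(index, query):
    """Return the ids of records that contain every word in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


if __name__ == "__main__":
    idx = build_index(json.loads(SAMPLE_RESPONSE))
    print(sorted(search(idx, "kinesin motility")))
```

Because the records arrive from different intake servers but share one structure, a single index covers all of them, which is the practical benefit of aggregation.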

The future development of preprints should be overseen by scientists

Preprints are emerging as a new method of communication in the life sciences, and questions related to standards, licensing, and best practices of use will continuously arise. These issues should be addressed by an international, scientist-led governance body that acts in the best interests of the research community and the public. The creation of the CS will be accompanied by the simultaneous creation of a Governance Body whose function is to oversee the work of the CS and define standards for work that should be included in the CS. Without a CS, overseeing many preprint sources and mandating standard practices across them will be difficult, as is the case in the present journal system.

Funding agencies, universities, and scientific societies want clarity on a “respected preprint source”

To be useful to scientists for career advancement, preprints should be citable in grants and promotion packages. However, funding agencies, universities, and scientific societies have expressed concern about the quality of preprints, especially if they come from multiple sources. Material in the CS will adhere to common standards for ethics, metadata, and scholarly preservation. This will simplify the “definition of a preprint” for funding agencies, universities, and scientific societies.

Preprints need to be stored in perpetuity

For preprints to serve a major role in scholarly communication, they need to be stably preserved. However, because most preprint servers are currently run without profit as a service to the community, their indefinite continuation cannot be assumed. The Central Service will maintain stable backups of all content to ensure that the knowledge contained in preprints is always available.

Preprint software development should be open source and provide community resources

Preprints embody the philosophy of open communication and knowledge sharing. The infrastructure for preprints should reflect these values. The creation of a CS with an open source mandate will ensure that the software and APIs developed for it are shared, with the goal of lowering barriers to innovation in scholarly communication.

Explanation of parties contributing to the Central Service for Preprints

The Central Service: Is a provisional name for the database of preprints that will provide 1) intake from multiple sources according to standards (format, content, and licensing) and ethical guidelines established by the governing body, 2) document conversion through open source software, 3) data storage, 4) search tools and limited display for scientists, and 5) an open API to make its content available to third-party innovators.

The Central Service Provider(s): Is/are the entity or entities that receive grants or contracts from ASAPbio to implement the Central Service.

ASAPbio: Is a non-profit corporation whose mission is to advocate for the productive use of preprints in biology and to administer grants/contracts for the operation of the CS. ASAPbio will collect funds from the funders’ consortium and distribute them to CS providers. It will act as a secretariat to the governing body and organize its meetings and reports. ASAPbio.org will serve as an information portal for the scientific community.

The Governing Body (GB): Is an independent, international body supported by funds and administrative services provided by ASAPbio but with its own bylaws. It will set standards for intake and operations of the CS, evaluate the CS, and prepare reports for the funders.

Funders: We anticipate that an international consortium of funders will provide support for the CS, with commitments to be established later in 2017.