Document 4: What does IT infrastructure for a next generation preprint service look like?

Authored by Jo McEntyre and Phil Bourne

Goal: To satisfy the fundamental requirement of establishing scientific priority rapidly and cheaply by providing the ability to publish and access open preprints, balanced with the desire to support open science innovation around publishing workflows.

Approach: An internationally supported, open archive (or platform) for preprints as infrastructure is ideal because (a) should the use of preprints become widespread, there is potential to reap long-term open science benefits, as is the case for public data resources, and (b) some core functions only need to be done once, not over and over (think: CrossRef, ORCID, INSDC, PDB, PMC/Europe PMC). Ideally this would involve working with existing preprint servers to provide a core platform and archival support.

Some assumptions

  • No point-of-service cost to post a preprint for the author.
  • Licenses that support reuse (i.e. CC BY) of posted articles.
  • Preprints will be citable (have DOIs).
  • Should be embedded with related infrastructures such as Europe PMC/PMC, ORCID, CrossRef and public data resources.
  • Reuse and integration as core values, embraced by various stakeholder groups including publishers, algorithm developers, text miners, and other service providers
  • Standard implementations of key requirements across multiple stakeholders, e.g. version control, events notification (such as publication in a journal or a preprint citation), and article format standards (JATS)
  • All preprints discoverable and minable through a single search portal
  • Transparent reporting/management builds trust and authority around priority
  • International and representative governance
  • Metrics to provide data on meaningful use of the content
  • Tools to manage submissions e.g. triage, communication etc. in keeping with existing manuscript submission systems
  • Public commentary on submissions
  • Linkage with final published version of the article (when it exists)

Data Ingest

The preprint server should be considered an active archive. This means that all content can be accessed at any time and certain core services are provided to enable access by both people and machines.

  1. Basic submission support and support for standard automated screening.
  2. Possible limited branding on submission portals.
  3. Existing preprint servers (or others) could compete on screening methods or other author services.
  4. Advantages: simplified content flow, standards implementation, content in one place for future use.

Basic Services

  • A stand-alone archive. Initial submission needs to be very quick for the author: i.e. basic metadata plus files establish priority.
  • Files rapidly published as PDFs with DOI and posted after screening/author services.
  • Ingest mechanisms could be diversified through existing preprint servers, but some basic automated criteria would always need to be met (automated to retain speed). For example, it could require an ORCID for the submitting author as a simple trust mechanism, with further validation against grant IDs. Algorithms could also screen content automatically, for example for plagiarism, evidence of poor animal welfare, scope, or obscenity. Such automated screening could be phased in and improved over time.
  • This model provides the opportunity for innovation around screening algorithms by the platform as well as third parties. It also provides business opportunities around author services.
  • Importantly, it also provides opportunities for innovation around coordinated submission of other materials relevant to the article, for example data or software. But any integration of this nature would need to be lightweight for submitting authors, as the speed of publication is a non-negotiable feature of the preprint service.
  • Core version management would be required, both regarding new versions of the same article and linking with any future published versions of the article in journals.
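As one illustration of the automated criteria above, the ORCID trust check can run with no human in the loop: the iD's format and check digit can be verified at submission time. The sketch below is hypothetical (the `screen_submission` metadata fields and file rules are invented for illustration), but the checksum routine follows the ISO 7064 11,2 algorithm that ORCID iDs use.

```python
import re

def orcid_checksum(base15: str) -> str:
    """Compute the ISO 7064 11,2 check character for the first 15 ORCID digits."""
    total = 0
    for ch in base15:
        total = (total + int(ch)) * 2
    remainder = total % 11
    result = (12 - remainder) % 11
    return "X" if result == 10 else str(result)

def validate_orcid(orcid: str) -> bool:
    """Return True if the string is a well-formed ORCID iD with a valid checksum."""
    if not re.fullmatch(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]", orcid):
        return False
    digits = orcid.replace("-", "")
    return orcid_checksum(digits[:15]) == digits[15]

def screen_submission(metadata: dict, files: list) -> list:
    """Hypothetical pre-publication screen: returns a list of blocking problems.

    The metadata keys and file rules here are illustrative assumptions,
    not a real submission schema.
    """
    problems = []
    if not validate_orcid(metadata.get("submitter_orcid", "")):
        problems.append("missing or invalid submitter ORCID")
    if not any(f.endswith(".pdf") for f in files):
        problems.append("no manuscript PDF supplied")
    return problems
```

Checks of this kind keep the ingest path fast because they run in milliseconds; heavier screening (plagiarism, scope) would layer on top in the same fashion.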

Authenticated Content

  • After basic services get content in and published, the preprint service could generate JATS XML for authenticated content. Authenticated content could be defined by a number of criteria, for example submissions from PIs funded by organisations that support the infrastructure, or popular preprints.
  • There is a cost to generating JATS XML. Limiting this added value to authenticated content could help control costs and give some confidence around that content for promoting discoverability via existing infrastructures.
  • Conversion to JATS XML will take some time, and would require the submitter to sign off on the resulting converted article. However, it has the bonus of being integrity-checked (e.g. all the figures are present), available for deep indexing and integration, and more widely discoverable via Europe PMC/PMC. Wider discoverability could be an incentive for authors to take the modest amount of extra time required to provide this data quality.
  • Note this could be an ingest point into the archive for XML content from other services/platforms.
  • In the future this more rigorous treatment may be extended to basic services, as methods that directly convert Word to Scholarly HTML and JATS XML mature and costs fall. However, publishing speed is likely to remain an issue for some time, and a degree of submitter involvement will always be required.
  • The availability of content in JATS XML provides many opportunities for innovation around adding more structure to articles for integration purposes (e.g. tagging reagents, data citations and other deep linking mechanisms).
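The figure-integrity check mentioned above becomes mechanical once content is in JATS XML, since figure citations and figure elements are explicit in the markup. A minimal sketch, with the simplifying assumption that each `rid` attribute holds a single id (JATS allows space-separated lists, which a real check would split):

```python
import xml.etree.ElementTree as ET

def missing_figures(jats_xml: str) -> list:
    """Return ids of figures cited via <xref ref-type="fig"> but absent as <fig> elements."""
    root = ET.fromstring(jats_xml)
    present = {fig.get("id") for fig in root.iter("fig")}
    cited = {x.get("rid") for x in root.iter("xref") if x.get("ref-type") == "fig"}
    return sorted(cited - present)

# Illustrative fragment: two figures cited, only one supplied.
sample = (
    '<article><body>'
    '<p>See <xref ref-type="fig" rid="f1"/> and <xref ref-type="fig" rid="f2"/>.</p>'
    '<fig id="f1"><caption><p>First figure.</p></caption></fig>'
    '</body></article>'
)
```

A conversion pipeline could block sign-off until `missing_figures` returns an empty list, giving the submitter a precise fix-it message rather than a vague rejection.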

Post publication screening, filtering and sorting on the preprint platform

  • All content would be available for post-publication human screening such as user reporting of problematic content, commenting and so on.
  • More sophisticated algorithms that rank, sort and filter search results based on trust, content or other criteria could be developed by the platform and, most importantly, by third parties.
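A third-party ranking of the kind envisaged could start as simply as combining recency with trust signals and user reports. The sketch below is purely illustrative: the record fields (`posted`, `orcid_validated`, `reports`), the weights, and the fixed reference date are invented assumptions, not a real platform schema.

```python
from datetime import date

def rank_results(results, today=date(2017, 1, 1)):
    """Rank search results by a toy trust/recency score (illustrative only)."""
    def score(record):
        age_days = (today - record["posted"]).days
        recency = 1.0 / (1.0 + age_days / 30.0)     # decays over months
        trust = 1.0 if record["orcid_validated"] else 0.5
        penalty = 0.2 * record.get("reports", 0)    # user-flagged problems
        return recency * trust - penalty
    return sorted(results, key=score, reverse=True)
```

Because the content and its metadata are openly available, competing third-party rankers could apply entirely different scoring models to the same corpus.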

Data Out

  • All content available for bulk download (PDFs and XML) and via APIs, as well as through website search and browse
  • Authenticated content could be made available via established archives (e.g. Europe PMC/PMC) as a clear subset.
  • Core services managed centrally, for example content sharing with journals (possibly in collaboration with CrossRef, which already has some infrastructure around this)
  • There are possibilities for sharing article XML, comments and reviews across publication workflows with journals and other platforms, thus saving processing costs.
  • There are countless opportunities to support further innovation on the content by both commercial and academic parties with an open platform approach.
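The linkage between a preprint and its final published version, and its sharing with services like CrossRef, could be expressed as a small relation record attached to the preprint's DOI. The sketch below borrows CrossRef's "is-preprint-of" relation type, but the surrounding JSON shape and the example DOIs are illustrative assumptions, not CrossRef's actual deposit schema.

```python
def preprint_relation(preprint_doi: str, journal_doi: str) -> dict:
    """Build a minimal record asserting that a preprint has a published version.

    The "is-preprint-of" relation type follows CrossRef's convention; the
    overall record shape here is an illustrative sketch only.
    """
    return {
        "doi": preprint_doi,
        "relation": {
            "is-preprint-of": [
                {"id": journal_doi, "id-type": "doi"},
            ],
        },
    }

# Hypothetical DOIs, for illustration only.
record = preprint_relation("10.9999/preprint.123", "10.9999/journal.456")
```

Exposing such records through the platform's APIs would let any consumer (journals, indexers, metrics services) resolve the version chain without scraping.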