Blog post by Christine Ferguson and Martin Fenner
> Information overload is the difficulty in understanding an issue and effectively making decisions when one has too much information about that issue, and is generally associated with the excessive quantity of daily information. – Wikipedia [1]
Information overload is a common problem, and it is an old problem. It is not a problem of the internet age, and it is not specific to scholarly literature, but the growth of preprints in the last five years presents us with a proximal example of the challenge.
We want to tackle this information overload problem and have some ideas on how to do this, presented at the end of this post. Are you willing to help? This post tells some of the back story of how preprints solve part of the problem – speedy access to academic information – yet add to the growing volume of information that we need to filter to find results we can build on. It is written to inspire the problem solvers in our community to step forward and help us realise some practical solutions.
## Using journals to find relevant information
In a classic presentation in 2008 [2], the writer Clay Shirky argued that while information overload may be a problem as old as the 15th century, when Gutenberg invented the printing press, the rise of the internet had, for the first time, radically changed how we address it. Publishing used to be expensive, complicated and therefore risky, and this was addressed by only publishing content that the publisher judged to be “worth publishing”. Scientific publishing worked – and still works – in similar ways. One important change came with the dramatic growth of scientific publishing after World War II: filtering by staff editors became unsustainable, and external peer review by academic experts slowly became the norm from the 1960s to the 1990s (e.g. Nature in 1973 and The Lancet in 1976) [3].
Clay Shirky coined the phrase “It’s Not Information Overload. It’s Filter Failure” in that 2008 presentation and made the point that publishing in the internet age has become so cheap that publication no longer needs to be the critical filtering step; instead, filtering can happen after publication. We can see this pattern in many mainstream industries, from movies to online shopping, with organizations such as Netflix and Amazon [4] investing heavily in recommender systems that contribute substantially to their revenues.
Cameron Neylon applied these considerations to scholarly communication and found the community at an early stage in the transition to “publish first, filter later” [5]. Ten years later his findings largely still hold true, as scholarly discovery services still focus for the most part on publications that have passed through a “filter” step applied by a scholarly publisher.
## Preprints: an alternative to the ‘journal as a filter’
Preprints are the most visible implementation of the “publish first, filter later” approach. In some disciplines, including high-energy physics, astrophysics, mathematics, and computer science, preprints have increasingly become the norm over the last 25 years, and the majority of high-energy physics papers are now first published as preprints on the arXiv. In the life sciences, the preprint server E-Biomed [6] was proposed by NIH director Harold Varmus in 1999, but the project was killed after a few months, not least because of strong and vocal opposition from biomedical publishers and societies. PubMed Central launched in 2000 to host open access journal publications instead of preprints [7]. After a delay of more than 15 years, preprints in the life sciences finally took off, and although they have grown considerably in number in the last five years, they still represent only a small fraction (6.4% in Figure 1) of all publications in biology:
![Figure 1: Yearly preprints/all-papers in Microsoft Academic Graph, trend by domain, reproduced from Xie B, Shen Z, and Wang K 2021 [8]](https://asapbio.org/wp-content/uploads/2021/07/Screenshot-2021-07-01-at-10.03.25-1024x664.png)
The notion of “publish first, filter later” is now being promoted by a range of publishers who no longer penalise authors for publicising their submissions as preprints, but instead encourage submitting authors to post their manuscripts as preprints while they undergo peer review at the journal. Some publishers go further and treat preprints as the publication model of the future [9,10].
Coming back to the original problem, preprints now add to the journal articles that researchers have to filter. That information overload poses a problem was recognised in a survey of stakeholders (librarians, journalists, publishers, funders, research administrators, students, clinicians, and more) conducted last year by ASAPbio. The problem is exacerbated by the number of servers involved: ASAPbio’s preprint platform directory lists 56 preprint servers hosting potentially relevant material.
## Filtering for relevant content (to get around the information overload issue)
While preprints in general, and in the life sciences specifically, are lowering the cost of and delays in sharing information, filtering for relevant content is still at a relatively early stage. To go deeper into how relevant preprints can be discovered, it is important to distinguish between:
- Discovering relevant preprints at any point in time independent of peer review status
- Discovering relevant preprints that have undergone peer review
- Discovering relevant preprints immediately (days) after posting
The first category covers discovery services that index preprints alongside other content, for example Europe PMC [11] and Meta [12]. Discovery strategies used for journal content, e.g. search by keyword and/or author, can also be applied to preprints; a small example follows below.
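As a small illustration of this first category, the following sketch runs a keyword search restricted to preprints against the Europe PMC REST API. The SRC:PPR source filter and the response field names reflect our reading of Europe PMC’s documented search syntax and should be double-checked; the keyword itself is just a placeholder.

```python
# Minimal sketch: keyword search for preprints via the Europe PMC REST API.
# Assumptions: SRC:PPR selects preprint records, and results are returned
# under resultList.result with title/doi/firstPublicationDate fields.
import requests

BASE = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"
params = {
    "query": 'SRC:PPR AND "single-cell RNA-seq"',  # placeholder keyword
    "format": "json",
    "pageSize": 25,
}

results = requests.get(BASE, params=params, timeout=30).json()
for hit in results["resultList"]["result"]:
    print(hit.get("firstPublicationDate"), hit.get("doi"), "-", hit.get("title"))
```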
The second category focuses on peer-reviewed preprints, and is covered extensively elsewhere [13].
The third category is the focus of this post – discovery of preprints of interest to a researcher right after posting, which rules out traditional peer review. The following filter strategies are possible:
1. Filter by subject area, keyword or author name
2. Filter by personal publication history
3. Filter by attention immediately after publication: social media (Twitter, Mendeley, etc.) and usage stats
4. Filter by recommendations, e.g. from subject matter experts
These filters can of course also be combined. The particular challenge is that they must work almost immediately (within days) after the preprint has been posted, which demands a high level of automation. A combination of filters 1 and 3 works well here: the information required for filter 1 is available in the metadata (e.g. via Crossref) as soon as the content is posted, and attention (filter 3) can be measured immediately afterwards – Twitter is widely used for sharing links to bioRxiv/medRxiv preprints (see examples in Figure 2). The Crossref Event Data service found 15,598 tweets for bioRxiv/medRxiv preprints in the week starting June 7, 2021.
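To make this concrete, here is a minimal sketch of how tweet counts for recently posted bioRxiv/medRxiv preprints might be gathered from the Crossref Event Data API. The endpoint, the source and obj-id.prefix parameters, and the response fields follow our reading of the public Event Data documentation and should be verified against the current API version; the contact address is a placeholder.

```python
# Sketch: count tweets per bioRxiv/medRxiv preprint collected in the past week
# via the Crossref Event Data API. The DOI prefix 10.1101 covers both servers.
from collections import Counter
from datetime import date, timedelta

import requests

API = "https://api.eventdata.crossref.org/v1/events"
since = (date.today() - timedelta(days=7)).isoformat()

params = {
    "source": "twitter",           # only Twitter events
    "obj-id.prefix": "10.1101",    # bioRxiv/medRxiv DOI prefix
    "from-collected-date": since,  # events collected in the last 7 days
    "rows": 1000,
    "mailto": "you@example.org",   # contact address (placeholder)
}

tweet_counts = Counter()
cursor = None
while True:
    if cursor:
        params["cursor"] = cursor
    page = requests.get(API, params=params, timeout=30).json()["message"]
    for event in page["events"]:
        # obj_id looks like "https://doi.org/10.1101/..."
        doi = event["obj_id"].replace("https://doi.org/", "")
        tweet_counts[doi] += 1
    cursor = page.get("next-cursor")
    if not cursor or not page["events"]:
        break

# Preprints with the most tweets in the past week
for doi, n in tweet_counts.most_common(20):
    print(f"{n:4d}  https://doi.org/{doi}")
```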
For filter 3 we have also considered bookmarks of preprints in Mendeley, but these cannot currently be tracked via open APIs such as the Crossref Event Data service. Usage stats are another alternative, but they are currently not available via API in the first days after posting.
Another consideration is how best to inform researchers of these potentially relevant preprints. Given that cost and speed are the primary concerns, we consider the most appropriate approach to be dissemination of the filtering results via a regular (daily or weekly) RSS feed or newsletter.
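As an illustration of this dissemination step, the sketch below renders a filtered preprint list as a minimal RSS 2.0 feed using only the Python standard library. The feed title, link, and the two example items are placeholders; in practice the items would come from the tweet-count filter above, and a newsletter could be generated from the same data.

```python
# Sketch: write a filtered preprint list out as a minimal RSS 2.0 feed.
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime

# Placeholder items; in practice these come from the filtering step above.
items = [
    {"title": "Example preprint on X", "doi": "10.1101/2021.06.01.000001", "tweets": 42},
    {"title": "Example preprint on Y", "doi": "10.1101/2021.06.02.000002", "tweets": 17},
]

rss = ET.Element("rss", version="2.0")
channel = ET.SubElement(rss, "channel")
ET.SubElement(channel, "title").text = "Tweeted preprints, past 7 days"
ET.SubElement(channel, "link").text = "https://example.org/preprint-filter"
ET.SubElement(channel, "description").text = "bioRxiv/medRxiv preprints with Twitter attention"
ET.SubElement(channel, "lastBuildDate").text = format_datetime(datetime.now(timezone.utc))

for it in items:
    item = ET.SubElement(channel, "item")
    ET.SubElement(item, "title").text = f'{it["title"]} ({it["tweets"]} tweets)'
    ET.SubElement(item, "link").text = f'https://doi.org/{it["doi"]}'
    ET.SubElement(item, "guid").text = f'https://doi.org/{it["doi"]}'

ET.ElementTree(rss).write("preprint-feed.xml", encoding="utf-8", xml_declaration=True)
```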
In summary, a list of biomedical preprints that have received a minimal number of tweets in the days after posting, broken down by subject area, is a good initial strategy for identifying relevant preprints immediately after they appear. Interested researchers can then access this filtered corpus via a newsletter.
## Existing efforts to discover relevant preprints right after posting
A few examples:
- https://twitter.com/PreprintBot – new this year, “a bot that tweets preprints and comments from BioRxiv and MedRxiv” [14]
- https://twitter.com/PromPreprint – this has been running for a while; “A bot tweeting @biorxivpreprint publications reaching the top 10% Altmetric score within their first month after publication” [15]
- http://arxiv-sanity.com/toptwtr – this started as a new way to list all arXiv preprints, and social media data was added later [16]
- https://scirate.com – a free and open access scientific collaboration network that allows users to follow arXiv.org categories and see the highest ranked new papers [17]
- https://rxivist.org – a free and open website that enables users to identify preprints from bioRxiv and medRxiv based on download count or mentions on Twitter. One can, for example, pick the most tweeted preprints in the last 7 days – and this presents a list of preprints that may have been posted at any point since the servers began [18].
Our strategy for filtering life science preprints builds on these existing efforts, but it picks up only those preprints posted in the past week that have received tweets, and it uses a newsletter as the primary communication channel (a rough sketch follows below). We propose to run this newsletter as a community experiment, iterating on the implementation based on researcher feedback about how well it addresses information overload. Other considerations: should we focus more on who is tweeting rather than on the number of tweets, or add an element of human curation? Can we filter life science preprints from additional servers?
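Here is a rough sketch of that weekly filter, under the assumptions above: it pulls preprints posted to bioRxiv in the past seven days from the public bioRxiv details API, keeps those whose DOIs pass a minimal tweet threshold (reusing the tweet counts from the Event Data sketch), and groups the survivors by subject category. The API route and field names reflect the documented bioRxiv endpoint but should be verified; MIN_TWEETS is an arbitrary placeholder and the obvious first knob to iterate on with reader feedback.

```python
# Sketch: weekly filter of bioRxiv preprints by recent Twitter attention,
# grouped by subject category for a newsletter.
from collections import defaultdict
from datetime import date, timedelta

import requests

MIN_TWEETS = 3  # minimal attention threshold; an arbitrary starting point

# DOI -> tweet count over the past week, e.g. produced by the
# Crossref Event Data sketch above (left empty here as a placeholder).
tweet_counts: dict = {}

start = (date.today() - timedelta(days=7)).isoformat()
end = date.today().isoformat()

by_category = defaultdict(list)
cursor = 0
while True:
    url = f"https://api.biorxiv.org/details/biorxiv/{start}/{end}/{cursor}"
    batch = requests.get(url, timeout=30).json().get("collection", [])
    if not batch:
        break
    for paper in batch:
        n_tweets = tweet_counts.get(paper["doi"], 0)
        if n_tweets >= MIN_TWEETS:
            by_category[paper["category"]].append((n_tweets, paper["title"], paper["doi"]))
    cursor += len(batch)  # the endpoint pages through results in blocks

# One newsletter section per subject category, most-tweeted preprints first
for category, papers in sorted(by_category.items()):
    print(f"\n## {category}")
    for n_tweets, title, doi in sorted(papers, reverse=True):
        print(f"- {title} ({n_tweets} tweets) https://doi.org/{doi}")
```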
## Call to action
If you want to help tackle the information overload problem in the life sciences, leave a comment below or DM us. If enough people are interested in working with us, we could form a community group under the auspices of ASAPbio to work on information overload.
We thank Rich Abdill (University of Minnesota), Iratxe Puebla and Jessica Polka (both ASAPbio) for providing valuable feedback when writing this blog post.
## References
- Information overload. In: Wikipedia; 2021. Accessed May 18, 2021. https://en.wikipedia.org/w/index.php?title=Information_overload&oldid=1023377809
- O’Reilly. Web 2.0 Expo NY: Clay Shirky (Shirky.Com) It’s Not Information Overload. It’s Filter Failure.; 2008. Accessed May 18, 2021. https://www.youtube.com/watch?v=LabqeJEOQyI
- A brief history of peer review. F1000 Blogs. Published January 31, 2020. Accessed May 18, 2021. https://blog.f1000.com/2020/01/31/a-brief-history-of-peer-review/
- Amazon’s Recommendation Engine: The Secret To Selling More Online. Accessed May 18, 2021. https://rejoiner.com/resources/amazon-recommendations-secret-selling-online/
- Neylon C. It’s not filter failure, it’s a discovery deficit. Serials. 2011;24(1):21-25. doi:10.1629/2421
- Homan JM. E-biomed. Bull Med Libr Assoc. 1999;87(4):485-486. Accessed May 18, 2021. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC226626/
- Kling R, Spector LB, Fortuna J. The real stakes of virtual publishing: the transformation of E-biomed into PubMed central. J Am Soc Inf Sci Technol. 2004;55(2):127-148. doi:10.1002/asi.10352
- Xie B, Shen Z, Wang K. Is preprint the future of science? A thirty year journey of online preprint services. arXiv:2102.09066. Published online February 17, 2021. Accessed May 18, 2021. http://arxiv.org/abs/2102.09066
- Eisen MB, Akhmanova A, Behrens TE, Harper DM, Weigel D, Zaidi M. Implementing a “publish, then review” model of publishing. eLife. 2020;9:e64910. doi:10.7554/eLife.64910
- About Wellcome Open Research | How It Works | Beyond A Research Journal. Accessed June 27, 2021. https://wellcomeopenresearch.org/about#section-box-4
- Preprints – About – Europe PMC. Accessed May 18, 2021. https://europepmc.org/Preprints#preprint-servers
- Meta | Expand Your Research. Accessed June 27, 2021. https://www.meta.org/
- Polka J, Strasser C, Taraborelli D. Shared Technology Needs for Preprints. Zenodo; 2021. doi:10.5281/ZENODO.4700570
- Preprint Bot (@PreprintBot). Twitter. Accessed June 30, 2021. https://twitter.com/PreprintBot
- PromisingPreprints (@PromPreprint). Twitter. Accessed June 30, 2021. https://twitter.com/PromPreprint
- arXiv Sanity Preserver. Accessed June 30, 2021. http://arxiv-sanity.com/
- Top arXiv papers. SciRate. Accessed June 30, 2021. https://scirate.com/
- Abdill RJ, Blekhman R. Meta-Research: Tracking the popularity and outcomes of all bioRxiv preprints. eLife. 2019;8:e45133. doi:10.7554/eLife.45133