An informal history of how coders and researchers have been trying to answer one of academic life’s biggest questions — how can we stay on top of new publications?
A regenerative medicine professor at King’s College London once told me that “A PhD student should be reading at least five papers a week, and expect to find that four of the five aren’t actually relevant to their work.” My personal experience is more akin to 24 of 25 papers not being directly relevant to my PhD thesis—though obviously I had to read them to make sure they weren’t.
The Internet has enabled instantaneous, low-cost dissemination of content, but with more and more information available, and increasingly multidisciplinary approaches to academic research, the problem becomes one of finding needles in ever-growing haystacks.
Researchers, junior and senior alike, are perpetually haunted by the worry that they may have missed new publications of importance to their own work. What’s being done to keep us afloat in this sea of information we find ourselves in?
Those of you lucky enough to have experienced the ’90s may remember the rom-com classics and era-defining boy bands, but also the popularisation of Internet usage and the advent of Google Search, which celebrated its 20th birthday this September.
Academics have, in some ways, been spoilt ever since with digital publication formats, electronic databases, and a concerted effort by innovators across the academic, publishing and technology communities to build increasingly sophisticated tools (400 and counting, according to this study by librarians Bianca Kramer and Jeroen Bosman of Utrecht University).
Looking back over the last two decades, I’d say technologies to tackle the information influx have evolved in a number of overlapping spurts.
Think librarianship in overdrive.
Online digital libraries with basic search interfaces emerged in the ’90s, with the pioneering arXiv launching in 1991 as a repository of pre-prints (papers not yet formally published in academic journals) for the quantitative sciences, followed by JSTOR in 1995, covering subjects ranging from literature to the plant sciences, and then PubMed in 1996 for the biomedical sciences. All of these platforms are still widely used by academics today.
An obvious, though labour-intensive, way to improve search results for basic keyword-matching algorithms is to improve the quality of keywords associated with content items. Many literature databases to this day rely on tagging articles with a system of pre-approved keywords, such as PubMed’s MeSH (Medical Subject Headings). Most academic journals still require their authors to submit keywords for their publications.
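To see why curated keywords beat raw text matching, here is a minimal Python sketch. The article records and tags below are invented for illustration; real systems built on controlled vocabularies like MeSH are far richer than this.

```python
# Minimal sketch: searching by curated tags vs. naive text matching.
# The records and tags are invented; real databases like PubMed use the
# MeSH controlled vocabulary and far more sophisticated indexing.

articles = [
    {"title": "A trial of a novel anticoagulant",
     "text": "We enrolled 200 patients with atrial fibrillation...",
     "tags": {"anticoagulants", "atrial fibrillation", "clinical trial"}},
    {"title": "Heart rhythm in zebrafish embryos",
     "text": "Cardiac development was imaged over 72 hours...",
     "tags": {"zebrafish", "heart", "developmental biology"}},
]

def search_by_tags(query_tags):
    """Return articles whose curated tags overlap the query tags."""
    query = {t.lower() for t in query_tags}
    return [a for a in articles if a["tags"] & query]

def search_by_text(keyword):
    """Naive full-text matching: misses synonyms, catches noise."""
    return [a for a in articles if keyword.lower() in a["text"].lower()]

# Tag search finds the trial even though the abstract never says "anticoagulant":
print([a["title"] for a in search_by_tags({"anticoagulants"})])
```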
What can I trust a computer to do without me?
The meteoric rise of Google Search can be attributed in large part to PageRank, the algorithm that ranked web pages by the links between them rather than by their text alone; fittingly, the idea drew on academic citation analysis.
The launch of Google Scholar in 2004 gave researchers the convenience of the Google search algorithm paired with a formidable content database of tens of millions of scholarly publications and legal cases, dwarfing its open-access predecessor, CiteSeer, the 1997 brainchild of American and Australian researchers at the NEC Research Institute.
The digitisation of journals and literature repositories meant that it was now possible to set up RSS feeds to track new content matching your keywords of interest, or to subscribe to emails listing the latest table of contents (eTOCs) of your favourite journals. A great help, until you move institutes and realise you have to update the email address on over 50 journal eTOC subscriptions, one by one.
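If you would rather roll your own alert than manage dozens of subscriptions, the general idea is easy to sketch. The snippet below uses the third-party feedparser library to scan a journal’s RSS feed for keywords; the feed URL and keywords are placeholders, not real endpoints.

```python
# Sketch of a DIY keyword alert over a journal RSS feed, using the
# third-party feedparser library (pip install feedparser). The feed URL
# and keywords are placeholders -- substitute your own journal feeds.
import feedparser

FEED_URL = "https://example.org/journal/rss"  # hypothetical feed URL
KEYWORDS = {"organoid", "crispr", "regenerative"}

feed = feedparser.parse(FEED_URL)
for entry in feed.entries:
    blob = (entry.get("title", "") + " " + entry.get("summary", "")).lower()
    if any(kw in blob for kw in KEYWORDS):
        print(entry.get("title"), "->", entry.get("link"))
```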
What’s the most relevant of them all?
Coinciding with the 1960 launch of the first-ever citation index for journal papers as a measure of impact by Eugene Garfield’s Institute for Scientific Information, researchers in the ’60s were already toying with the idea of using computers to understand the relationships between academic papers via their citations. Many widely used academic search engines — including the free-to-use Google Scholar and subscription-only Web of Science and Scopus — still rely on such citation graphs to display related papers to their users.
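As a toy illustration of how a citation graph can surface related papers, the sketch below scores two papers as related when later papers cite them together (co-citation). The graph is invented, and production systems blend many more signals than this.

```python
# Toy co-citation relatedness: two papers are scored as related if many
# later papers cite them together. The tiny graph below is invented.
from collections import Counter
from itertools import combinations

# citing paper -> set of papers it references
citations = {
    "p1": {"a", "b"},
    "p2": {"a", "b", "c"},
    "p3": {"b", "c"},
    "p4": {"a", "b"},
}

cocited = Counter()
for refs in citations.values():
    for pair in combinations(sorted(refs), 2):
        cocited[pair] += 1

def related_to(paper, top=3):
    """Rank papers by how often they are cited alongside `paper`."""
    scores = Counter()
    for (x, y), n in cocited.items():
        if paper == x:
            scores[y] += n
        elif paper == y:
            scores[x] += n
    return scores.most_common(top)

print(related_to("a"))  # -> [('b', 3), ('c', 1)]
```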
The recent artificial intelligence (AI) boom might make it difficult to imagine the poetically named “AI winter” of the early ’90s — a widespread loss of faith in AI. Scientists eventually shifted away from attempting to build an all-purpose, superintelligent AI and started tackling more specific, more tractable problems. Momentum picked up towards the turn of the century with the application of statistical techniques to help machines understand natural language, leading to successes like spam email filters as early as 1998.
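Those early spam filters were typically simple statistical classifiers, naive Bayes among the best known. Here is a minimal sketch using scikit-learn; the training emails are invented for illustration.

```python
# Minimal naive Bayes text classifier, the statistical workhorse behind
# early spam filters (requires scikit-learn; the training data is invented).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = [
    "win a free prize now", "cheap meds limited offer",          # spam
    "draft of the grant application", "lab meeting moved to 3pm",  # ham
]
labels = ["spam", "spam", "ham", "ham"]

# Turn each email into word counts, then fit the naive Bayes model.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)

print(model.predict(["free prize offer"]))  # -> ['spam']
```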
The last decade has witnessed the rise of scientific recommendation tools that emphasise analysing the full text of papers, rather than just their metadata and citations.
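To give a flavour of how full-text analysis can drive recommendations, here is one common approach sketched with scikit-learn: represent each document as a TF-IDF vector and rank candidates by cosine similarity. The paper snippets are invented stand-ins for full texts, and real tools combine many more signals.

```python
# One common full-text recommendation recipe: embed documents as TF-IDF
# vectors and rank by cosine similarity. The snippets below are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

papers = [
    "crispr gene editing in human stem cells",
    "efficient crispr gene editing in mouse embryos",
    "archival practices in medieval manuscript collections",
]

tfidf = TfidfVectorizer().fit_transform(papers)
sims = cosine_similarity(tfidf[0], tfidf)  # similarity of paper 0 to all
print(sims.round(2))  # the two CRISPR papers score far closer to each other
```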
Though machine learning has evolved in leaps and bounds in the past decade, the most likely way to be notified of a new, useful publication is still an ad hoc recommendation from a well-meaning colleague or collaborator.
The frontier right now is how to harness human intelligence at scale and combine this with machine learning algorithms to enhance the accuracy of computer-generated recommendations.
F1000Prime (launched in 2002) takes the idea of expert recommendations to the extreme by enlisting over 10,000 handpicked academics to publish recommendations of articles in the fields of biology and medicine and opine on why these articles are relevant: an innovative method, a good teaching resource, a genuine breakthrough?
Sparrho, on the other hand, encourages the crowd to curate their own public collections of research articles (pinboards) and write short summaries to explain why these papers belong together, tapping into the unique ability of humans to make unexpected connections between research in different fields. The result is a new way for experts as well as newcomers to explore the literature and hear directly from researchers.
But alas, unavoidable bias?
Reddit’s science forum (fondly named “The New Reddit Journal of Science” to rival the highly regarded New England Journal of Medicine) has almost 18 million subscribers and 1,500 moderators, all contributing for free.
Imagine the power of peer review and recommendation at this scale.
But the simplicity of this system is also its biggest flaw: anybody can recommend articles to the forum, anybody can promote or demote a recommended article, and anybody can comment on the article.
Whenever you introduce an element of popularity into any algorithm, there’s the chance that something undeserving will get recommended merely by being popular. The same happens with citations (the most highly cited papers continue to get more citations) and ‘fake news’ on social networks (the most shared posts continue to get more shares).
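The feedback loop is easy to demonstrate. In the toy simulation below (all numbers invented), each new vote goes to an item with probability proportional to the votes it already has, and a handful of early leaders end up hoarding the total.

```python
# Toy simulation of the rich-get-richer feedback: each new vote goes to an
# item with probability proportional to its current votes. The numbers are
# invented purely to illustrate the dynamic.
import random

random.seed(42)
votes = [1] * 10          # ten items, one seed vote each
for _ in range(10_000):   # new votes arrive one at a time
    winner = random.choices(range(len(votes)), weights=votes)[0]
    votes[winner] += 1

print(sorted(votes, reverse=True))
# A few items hoard most of the votes, regardless of underlying quality.
```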
An extension of this is bias towards popular references or subjects. In genetics, say, when a researcher writes about a lesser-known gene, there’s a tendency to relate the work to a better-known gene to help others understand how the work fits into the bigger picture.
This leads to certain topics and entities amassing a disproportionate number of mentions within the literature. As the existing literature is our biggest and richest data set for training recommendation algorithms, we have to be careful that we’re not preventing people from discovering the more esoteric streams of research.
Ethical algorithm developers need to consider at least these two biases to avoid creating tunnel vision for their users.
What’s clear so far is that although academics today are armed with more intelligent technologies, none of the above have fully resolved the issue at hand: how can I be sure that I haven’t missed a crucial publication?
There’s no supercomputer out there that can replace speaking to a seasoned scientist about what they think are seminal, must-read papers and what are the most interesting publications of late.
What recommendation tools can do, however, is to help make your haystacks smaller, so that you can sift through them more effectively, and show you other promising haystacks that could hold the needles you want. With more researchers sharing their knowledge publicly, the hope is that soon we’ll no longer be looking for these needles alone.
Given my own experience, and what I heard at the 2017 SpotOn Festival last Saturday, here’s the minimal quartet of things that we all need to be doing to reduce our ‘unknown unknowns’ and stay on top of the latest science:
1. Search the literature databases regularly, with well-chosen keywords.
2. Set up alerts: eTOC emails and RSS feeds for the journals and topics you care about.
3. Let recommendation tools shrink your haystacks and point you to promising new ones.
4. Keep asking colleagues and expert communities what you should be reading; human recommendations still catch what the machines miss.
SW: This article is by no means a comprehensive study, so please get in touch with us if you feel that I’ve missed or misinterpreted something important.
Partnerships @Synthace / biochemist / occasionally writes