NIST and OSTP Launch Effort to Improve Search Engines for COVID-19 Research

April 20, 2020

GAITHERSBURG, Md. — Today, the U.S. Department of Commerce’s National Institute of Standards and Technology (NIST) and the White House Office of Science and Technology Policy (OSTP) launched a joint effort to support the development of search engines for research that will help in the fight against COVID-19. The project was developed in response to the March 16 White House Call to Action to the Tech Community on New Machine Readable COVID-19 Dataset.

“Our nation’s scientific enterprise is mobilized to defeat the invisible enemy that is COVID-19,” said Secretary of Commerce Wilbur Ross. “Our scientists — and the businesses and institutions that provide them with advanced digital research technologies — are to be commended for their unwavering dedication to finding a cure for this insidious disease.”

“AI experts worldwide are responding to the White House’s call to action, developing approaches that help scientists gain insights from thousands of articles of COVID-19 scholarly literature,” said Michael Kratsios, U.S. chief technology officer. “The TREC-COVID program expands upon these efforts by creating powerful and accurate search engines that extract knowledge from this literature, tailored to the needs of the health-care and medical research communities. We thank NIST for this valuable contribution as part of the Trump administration’s whole-of-America response to the coronavirus.”

In this effort, NIST will work initially with the Allen Institute for Artificial Intelligence, the National Library of Medicine, Oregon Health & Science University (OHSU), and the University of Texas Health Science Center at Houston (UT Health). The team will apply the successful, long-running program of expert engagement and technology assessment called the Text Retrieval Conference, or TREC, to the COVID-19 Open Research Dataset (CORD-19), a resource of more than 44,000 research articles and related data about COVID-19 and the coronavirus family of viruses. The TREC-COVID program goals include creating datasets and using an independent assessment process that will help search engine developers evaluate and optimize their systems to meet the needs of the research and health-care communities.

“The TREC program has provided an effective way to evaluate and advance search engine technologies since 1992, and has led directly to the powerful search capabilities and internet-based efficiencies we now often take for granted,” said Under Secretary of Commerce for Standards and Technology and NIST Director Walter G. Copan. “We are pleased to apply this infrastructure to the challenge of working with massive amounts of data to help researchers better understand and ultimately to combat this deadly novel coronavirus and related threats.”

The team will first release a series of sample queries for the biomedical research community, developed by team members at the National Library of Medicine, OHSU and UT Health. Registered participants in TREC-COVID will use their information retrieval and search systems to run the queries against the CORD-19 document set and return their results to NIST. Biomedical experts will then review test results, including document relevance rankings, to assess the overall performance of the retrieval systems.
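As a rough illustration of what "running a query against a document set" involves, the sketch below ranks a few made-up documents for one made-up query using a simple TF-IDF score. The document IDs, texts, query, and the `rank` function are all hypothetical. Nothing here reflects the actual CORD-19 format or the retrieval methods participants will use.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def rank(query, docs):
    """Return document IDs ordered by a simple TF-IDF score, best first."""
    n_docs = len(docs)
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    scores = {}
    for doc_id, counts in term_counts.items():
        score = 0.0
        for term in tokenize(query):
            if term in counts:
                score += counts[term] * math.log(1 + n_docs / doc_freq[term])
        scores[doc_id] = score

    # Sort by descending score; break ties by document ID for stable output.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical stand-ins for a document set and a sample query.
docs = {
    "doc1": "coronavirus transmission in hospital settings",
    "doc2": "seasonal influenza vaccine effectiveness",
    "doc3": "coronavirus incubation period and transmission dynamics",
}
ranking = rank("coronavirus incubation period", docs)
print(ranking)
```

A ranked list of this kind, one per query, is the sort of result a participant would return for assessment.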

Using proven TREC protocols, NIST will score the submissions and post the scores, the retrieval results themselves, and the lists of key reference documents to the TREC-COVID website. These “test collections” can then be used by information retrieval researchers to evaluate and enhance the performance of their own search engines. This effort is intended to help researchers understand how search systems could best support medical researchers when available information is developing quickly, as in the current pandemic.
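To make the scoring idea concrete, the sketch below compares two made-up ranked result lists against a made-up set of documents judged relevant, using two measures common in information retrieval research: precision at k and average precision. The release does not say which measures TREC-COVID will report, so the choice of measures, the function names, and the data are assumptions for illustration only.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k ranked documents that were judged relevant."""
    top_k = ranking[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant) / k


def average_precision(ranking, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for position, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant)


# Hypothetical relevance judgments and two hypothetical system outputs.
judged_relevant = {"doc1", "doc3"}
run_a = ["doc3", "doc1", "doc2"]  # both relevant documents ranked first
run_b = ["doc2", "doc3", "doc1"]  # a non-relevant document ranked first

for name, run in (("run_a", run_a), ("run_b", run_b)):
    print(name,
          "P@2 =", precision_at_k(run, judged_relevant, 2),
          "AP =", round(average_precision(run, judged_relevant), 4))
```

Because the judgments are fixed, the same set can be reused to score any later system on the same queries, which is what makes a test collection useful beyond the original submissions.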

The Allen Institute for Artificial Intelligence has been releasing an expanded CORD-19 document set each Friday to capture the most recent articles on COVID-19 and related coronaviruses. Later rounds of TREC-COVID will use the larger releases of CORD-19 and expanded query sets.

Participants will have one week to submit their search results, and NIST will post the results within about a week after that. New rounds, each with its own dataset release, are expected about two weeks apart. The team initially anticipates conducting five consecutive rounds of search system assessments.

The National Library of Medicine (NLM) is a leader in research in biomedical informatics and data science and the world’s largest biomedical library. NLM conducts and supports research in methods for recording, storing, retrieving, preserving, and communicating health information. NLM creates resources and tools that are used billions of times each year by millions of people to access and analyze molecular biology, biotechnology, toxicology, environmental health, and health services information. Additional information is available at https://www.nlm.nih.gov.