Natural Language Processing to explore COVID-19 literature

New coronavirus literature is appearing so quickly that the medical research community struggles to keep up. There is a growing need for innovative approaches, such as recent advances in Natural Language Processing, to understand and analyze the abundance of medical and scientific articles.

At AE, we created a powerful Natural Language Processing engine that lets us apply recent text and data mining techniques (the same engine, by the way, forms the foundation of our AI Powered Customer Care Agent, ADORE). With both the expertise and the motivation to help researchers find answers that will ultimately help stop the virus, we assembled a team of data scientists to work on the Kaggle COVID-19 Open Research Dataset Challenge.

Kaggle COVID-19 Open Research Dataset Challenge

The COVID-19 Open Research Dataset (CORD-19) is a resource of over 51,000 articles, including over 40,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. It's a goldmine of new insights to support the ongoing fight against the virus, provided you have the expertise to search it well and fast.

In a first iteration, we explored the medical and scientific articles by applying Topic Modelling, a Natural Language Processing technique that uncovers the hidden topics in a collection of documents. Which topics does the literature cover? And what does each topic mean, how prevalent is it, and how do the topics relate to each other?
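To make the idea of topic modelling concrete, here is a minimal sketch of Latent Dirichlet Allocation (LDA) fitted with collapsed Gibbs sampling, using only the Python standard library. It is an illustration under assumptions, not the pipeline we ran on CORD-19: the function name fit_lda, the toy documents and the hyperparameter values (alpha, beta, n_iter) are all made up for the example.

```python
import random


def fit_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling. `docs` is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_terms = len(vocab)

    doc_topic_counts = [[0] * n_topics for _ in docs]
    topic_word_counts = [[0] * n_terms for _ in range(n_topics)]
    topic_totals = [0] * n_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic_counts[d][k] += 1
            topic_word_counts[k][word_id[w]] += 1
            topic_totals[k] += 1
        assignments.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic_counts[d][k] -= 1
                topic_word_counts[k][v] -= 1
                topic_totals[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic_counts[d][t] + alpha)
                    * (topic_word_counts[t][v] + beta)
                    / (topic_totals[t] + n_terms * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic_counts[d][k] += 1
                topic_word_counts[k][v] += 1
                topic_totals[k] += 1

    # Smoothed estimates of p(word | topic) and p(topic | document).
    topic_term = [
        [(topic_word_counts[k][v] + beta) / (topic_totals[k] + n_terms * beta)
         for v in range(n_terms)]
        for k in range(n_topics)
    ]
    doc_topic = [
        [(doc_topic_counts[d][k] + alpha) / (len(doc) + n_topics * alpha)
         for k in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    return vocab, topic_term, doc_topic


# Hypothetical toy corpus standing in for tokenized paper abstracts.
toy_docs = [
    ["virus", "transmission", "airborne", "virus"],
    ["transmission", "airborne", "droplet", "virus"],
    ["vaccine", "antibody", "trial", "vaccine"],
    ["antibody", "trial", "immune", "vaccine"],
]
vocab, topic_term, doc_topic = fit_lda(toy_docs, n_topics=2)

for k, row in enumerate(topic_term):
    top = sorted(zip(row, vocab), reverse=True)[:3]
    print(f"topic {k}:", [w for _, w in top])
```

The two outputs answer the questions above: each row of topic_term describes what a topic means (its most probable words), and each row of doc_topic says how strongly a document belongs to each topic. On a real corpus you would normally rely on an established library rather than a hand-written sampler.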

Interactive Visualisation

To tackle these questions, we created an interactive visualization that allows researchers to flexibly explore the topics within the COVID-19 literature. We used LDAvis, an interactive visualization system that enables deep inspection of the relationships between topics and keywords in an LDA (Latent Dirichlet Allocation) model.

The visualization has two main panels:

The left panel displays a global view of the discovered topics. Each circle represents a topic, and the area of each circle indicates the overall prevalence of that topic among the research papers. The closer two topics are located to each other, the more closely they are related.
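The two quantities behind the left panel can be sketched in a few lines. As described by the LDAvis authors, circle area is driven by each topic's share of all tokens in the corpus, and the default inter-topic distance is the Jensen-Shannon divergence between the topics' word distributions, which is then projected onto two dimensions. The sketch below computes the first two and leaves out the projection; the function names and toy numbers are assumptions made for illustration.

```python
import math


def topic_prevalence(doc_topic, doc_lengths):
    """Share of all corpus tokens attributed to each topic (drives circle area)."""
    n_topics = len(doc_topic[0])
    mass = [
        sum(theta[k] * n for theta, n in zip(doc_topic, doc_lengths))
        for k in range(n_topics)
    ]
    total = sum(mass)
    return [m / total for m in mass]


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two word distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical model with 3 topics over a 4-word vocabulary.
topic_term = [
    [0.50, 0.40, 0.05, 0.05],  # topic 0
    [0.45, 0.40, 0.10, 0.05],  # topic 1: nearly the same words as topic 0
    [0.05, 0.05, 0.40, 0.50],  # topic 2: very different words
]
# Hypothetical per-document topic mixtures and document lengths (in tokens).
doc_topic = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8], [0.3, 0.6, 0.1]]
doc_lengths = [100, 300, 100]

prevalence = topic_prevalence(doc_topic, doc_lengths)
d01 = js_divergence(topic_term[0], topic_term[1])
d02 = js_divergence(topic_term[0], topic_term[2])
print("prevalence:", prevalence)
print("distance 0-1:", d01, "distance 0-2:", d02)
```

In this toy setup, topics 0 and 1 share most of their probability mass, so their divergence is small and their circles would sit close together, while topic 2 would be drawn further away.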

The right panel lists the individual keywords that are most useful for interpreting the currently selected topic: selecting a topic in the left panel reveals its most useful keywords in the right panel. You can also click on an individual keyword to see in which topics it occurs.
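What "most useful" means here can be made precise. LDAvis ranks the keywords of a topic by a relevance score that blends how probable a word is within the topic with how specific it is to that topic: relevance = lambda * log p(w|t) + (1 - lambda) * log(p(w|t) / p(w)), where lambda is adjustable between 0 and 1. The sketch below applies that formula to made-up probabilities; the helper names and all numbers are assumptions for illustration.

```python
import math


def relevance(p_w_given_t, p_w, lam):
    """Relevance of a word for a topic, for a weight lam between 0 and 1."""
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)


def top_terms(topic_dist, marginal, vocab, lam, n=3):
    """Return the n words with the highest relevance for one topic."""
    scored = [
        (relevance(topic_dist[i], marginal[i], lam), vocab[i])
        for i in range(len(vocab))
    ]
    scored.sort(reverse=True)
    return [word for _, word in scored[:n]]


# Hypothetical numbers: p(w | topic) for one topic and p(w) over the whole corpus.
vocab = ["patient", "virus", "incubation", "quarantine"]
topic_dist = [0.40, 0.30, 0.20, 0.10]
marginal = [0.50, 0.30, 0.05, 0.15]

by_probability = top_terms(topic_dist, marginal, vocab, lam=1.0, n=2)
by_specificity = top_terms(topic_dist, marginal, vocab, lam=0.0, n=2)
print(by_probability)
print(by_specificity)
```

With lambda at 1 the ranking favours words that are frequent in the topic, even if they are frequent everywhere (here "patient"). With lambda at 0 it favours words that are rare elsewhere (here "incubation"). Values in between trade off the two.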

Click on the figure below to open a new browser tab and see for yourself how it works!

LDAvis - Interactive Topic visualisation for COVID-19 literature

We used pyLDAvis, which is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data, in our case the scientific COVID-19 literature (over 51K papers). The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization. The visualization can be saved to a stand-alone HTML file for easy sharing, as illustrated above.
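As a rough sketch of that export step: pyLDAvis.prepare takes the topic-word distributions, the document-topic distributions, the document lengths, the vocabulary and the overall term frequencies of a fitted model, and pyLDAvis.save_html writes the result to a stand-alone file. The wrapper functions, toy values and file name below are hypothetical and are not the code we ran; pyLDAvis is a third-party package and is only imported when the export function is actually called.

```python
def check_ldavis_inputs(topic_term, doc_topic, doc_lengths, vocab, term_frequency):
    """Sanity-check the shapes of the five inputs before building the visualization."""
    n_topics, n_terms, n_docs = len(topic_term), len(vocab), len(doc_topic)
    if any(len(row) != n_terms for row in topic_term):
        raise ValueError("each topic needs one probability per vocabulary term")
    if any(len(row) != n_topics for row in doc_topic):
        raise ValueError("each document needs one probability per topic")
    if len(doc_lengths) != n_docs or len(term_frequency) != n_terms:
        raise ValueError("doc_lengths or term_frequency has the wrong length")
    for row in list(topic_term) + list(doc_topic):
        if abs(sum(row) - 1.0) > 1e-6:
            raise ValueError("every distribution must sum to 1")
    return True


def export_topic_visualization(topic_term, doc_topic, doc_lengths, vocab,
                               term_frequency, path):
    """Build the LDAvis data for a fitted model and save it as stand-alone HTML."""
    import pyLDAvis  # third-party; imported lazily so the checks run without it

    check_ldavis_inputs(topic_term, doc_topic, doc_lengths, vocab, term_frequency)
    vis = pyLDAvis.prepare(
        topic_term_dists=topic_term,
        doc_topic_dists=doc_topic,
        doc_lengths=doc_lengths,
        vocab=vocab,
        term_frequency=term_frequency,
    )
    pyLDAvis.save_html(vis, path)


# Hypothetical toy inputs; a real run would take these from the fitted LDA model.
vocab = ["virus", "transmission", "vaccine", "antibody"]
topic_term = [
    [0.50, 0.40, 0.05, 0.05],
    [0.45, 0.40, 0.10, 0.05],
    [0.05, 0.05, 0.40, 0.50],
]
doc_topic = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8], [0.3, 0.6, 0.1]]
doc_lengths = [100, 300, 100]
term_frequency = [120, 110, 130, 140]

inputs_ok = check_ldavis_inputs(topic_term, doc_topic, doc_lengths, vocab, term_frequency)
print("inputs valid:", inputs_ok)
```

With pyLDAvis installed, calling export_topic_visualization with these inputs and a file name of your choice (for example "topics.html") should produce a single HTML page that can be shared like the one above.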

Next Steps

Ultimately, we want to help researchers easily find the articles in the existing literature that describe exactly the topics they are interested in. To do so, we need to identify which papers investigate which aspects of the novel coronavirus and, ideally, point readers to the specific paragraphs of interest in a paper. To start with, we'll focus on the incubation period, risk factors and mortality rate of COVID-19.
 
Stay tuned as we will be publishing more results and discoveries in the weeks to come.

In the meantime, stay safe, take care of yourself and each other!