Electronic health records (EHRs) need a new public relations manager. Ten years ago, the United States government passed legislation that strongly encouraged the adoption of electronic health records to improve and streamline care. The enormous amount of information in these now-digital documents could be used to answer very specific questions beyond the scope of clinical trials: What is the right dose of this drug for patients of this height and weight? What about patients with a specific genomic profile?
Unfortunately, most of the data that could answer these questions is trapped in doctors’ notes, full of jargon and abbreviations. These notes are hard for computers to understand with current techniques: extracting the information requires training multiple machine learning models. Models trained for one hospital also do not perform well at others, and training each model requires domain experts to label large amounts of data, a time-consuming and expensive process.
The ideal system would use a single model that can extract many types of information, perform well across multiple hospitals, and learn from a small amount of labeled data. But how? Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), led by Monica Agrawal, a doctoral candidate in electrical engineering and computer science, believed that to untangle the data they had to turn to something bigger: large language models. To extract this critical medical information, they used a very large GPT-3-style model to perform tasks such as expanding overloaded jargon and acronyms and extracting treatment regimens.
For example, the system takes an input, in this case a clinical note, and “prompts” the model with a question about the note, e.g., “expand this abbreviation, CTA.” The system returns an output such as “clear to auscultation,” as opposed to, say, “CT angiography.” The team says the goal of extracting this clean data is to ultimately enable more personalized clinical recommendations.
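The snippet below is a minimal, illustrative version of that prompting step; call_llm is a hypothetical stand-in for whatever GPT-3-style completion API is used, and the prompt wording is our own rather than the paper’s exact template.

```python
# Minimal sketch of prompting a large model to expand a clinical abbreviation.
# `call_llm` is a hypothetical placeholder for a GPT-3-style completion API.

def expand_abbreviation(note: str, abbreviation: str, call_llm) -> str:
    prompt = (
        f"Clinical note:\n{note}\n\n"
        f'Expand the abbreviation "{abbreviation}" as it is used in the note above, '
        "using the surrounding context to resolve ambiguity.\n"
        "Expansion:"
    )
    return call_llm(prompt).strip()

# Given a note like "lungs CTA bilaterally, no wheezes", the context should
# steer the model toward "clear to auscultation" rather than "CT angiography".
```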
Medical data is, understandably, a tricky resource to navigate freely. Using public resources to test the performance of large models involves a lot of bureaucracy because of restrictions on data use, so the team decided to assemble their own. Using a set of short, publicly available clinical snippets, they curated a small dataset to enable evaluation of the extraction performance of large language models.
“The challenge is to develop a single, general-purpose clinical natural language processing system that meets everyone’s needs and is robust to the huge variation that exists across health datasets. As a result, to date, most clinical notes are not used for downstream analysis or for live decision support in electronic health records. These approaches based on large language models have the potential to transform clinical natural language processing,” says David Sontag, professor of electrical engineering and computer science at MIT, principal investigator at CSAIL and the Institute for Medical Engineering and Science, and supervising author of a paper on this work that will be presented at the Conference on Empirical Methods in Natural Language Processing. “The research team’s advances in zero-shot clinical information extraction are scalable. Even if you have hundreds of different use cases, that’s not a problem; you can build each model with just a few minutes of work, instead of labeling a ton of data for that specific task.”
For example, the researchers found that, without any labels, these models could achieve 86 percent accuracy when expanding overloaded acronyms, and the team developed additional methods to boost this accuracy to 90 percent, still without any need for labels.
Trapped in the EHR
Experts have been building large language models (LLMs) for quite some time, but they entered the mainstream with GPT-3’s widely covered ability to complete sentences. These LLMs are trained on huge amounts of text from the internet to complete sentences and predict the next most likely word.
While previous, smaller models such as earlier versions of GPT or BERT have provided good performance for extracting medical data, they still require significant manual effort to label the data.
For example, the note “pt will dc vanco due to n/v” means that this patient (pt) was taking the antibiotic vancomycin (vanco) but developed nausea and vomiting (n/v) severe enough for the care team to discontinue (dc) the medication. The team’s research avoids the status quo of training separate machine learning models for each task (extracting medications, side effects from the record, disambiguating common abbreviations, and so on). In addition to expanding abbreviations, the team examined four other tasks, including whether the models could parse clinical trials and extract detail-rich treatment regimens.
“Previous work has shown that these models are sensitive to the precise wording of the prompt. Part of our technical contribution is a way to format the prompt so that the model produces results in the correct format,” says Hunter Lang, CSAIL PhD student and author on the paper. “For these extraction problems, there are structured output spaces. The output space is not just a string of characters; it could be a list, or a quote from the original input, so there is more structure than just free text. Part of our research contribution is encouraging the model to produce a well-structured result. This significantly reduces post-processing time.”
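As a rough illustration of that idea, the sketch below formats a prompt so the model returns a bulleted list that can be parsed with almost no post-processing. The call_llm function and the prompt template are hypothetical assumptions, not the authors’ actual code.

```python
# Illustrative structured-output prompting: constrain the answer to a bulleted
# list so the result parses with minimal post-processing. `call_llm` is a
# hypothetical placeholder for the underlying language-model API.

def extract_medications(note: str, call_llm) -> list[str]:
    prompt = (
        f"Clinical note:\n{note}\n\n"
        'List every medication mentioned in the note, one per line, each line '
        'starting with "- ". If there are none, write "- none".\n'
    )
    raw = call_llm(prompt)
    items = [line[2:].strip() for line in raw.splitlines() if line.startswith("- ")]
    return [item for item in items if item.lower() != "none"]
```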
This approach cannot be applied out of the box to health data in a hospital: it would require sending private patient information across the open internet to an LLM provider such as OpenAI. The authors showed that this problem can be circumvented by distilling the model into a smaller one that can be used on site.
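A hedged sketch of what such distillation might look like appears below: the large, hosted model labels de-identified notes once, offline, and the resulting pseudo-labels train a compact model that is the only one to touch live patient data on site. All names here (call_llm, SmallExtractor, fine_tune) are hypothetical placeholders rather than the authors’ implementation.

```python
# Sketch of distilling a hosted LLM into a smaller on-site model via
# pseudo-labels. `call_llm`, `SmallExtractor`, and `fine_tune` are hypothetical.

def distill_to_local_model(unlabeled_notes, call_llm, SmallExtractor, fine_tune):
    # 1) Generate pseudo-labels once, offline, with the large hosted model.
    pseudo_labeled = []
    for note in unlabeled_notes:
        prompt = (
            f"Clinical note:\n{note}\n\n"
            'List the medications mentioned, one per line, starting with "- ":\n'
        )
        labels = [line[2:].strip()
                  for line in call_llm(prompt).splitlines()
                  if line.startswith("- ")]
        pseudo_labeled.append((note, labels))

    # 2) Fine-tune a compact model on the pseudo-labels; only this model
    #    needs to run inside the hospital network on live patient data.
    small_model = SmallExtractor()
    fine_tune(small_model, pseudo_labeled)
    return small_model
```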
The model, sometimes just like people, is not always faithful to the truth. Here’s what a potential problem might look like: Let’s say you ask why someone took a medication. Without proper guardrails and checks, the model may simply report the most common reason for taking that medication if it is not explicitly mentioned in the note. This led the team to push the model to extract more quotes from the data and less free text.
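One simple guardrail in that spirit, a sketch of our own rather than the team’s exact check, is to accept an extracted answer only when it appears verbatim in the note:

```python
# Reject model outputs that are not literal quotes from the source note, so the
# model cannot substitute a "most common" answer the note never states.

def is_grounded(answer: str, note: str) -> bool:
    """Return True only if the answer appears verbatim in the note."""
    normalized_note = " ".join(note.lower().split())
    normalized_answer = " ".join(answer.lower().split())
    return bool(normalized_answer) and normalized_answer in normalized_note

# Usage: flag or discard any extraction for which is_grounded(...) is False.
```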
The team’s future work includes expanding the project to languages other than English, creating additional methods to quantify uncertainty in the model, and obtaining similar results using open-source models.
“Clinical information hidden in unstructured clinical notes poses unique challenges compared to general domain text, primarily due to the heavy use of acronyms and inconsistent text patterns across healthcare settings,” says Sadid Hasan, director of artificial intelligence at Microsoft and former executive director of AI at CVS Health, who was not involved in the research. “To this end, this work presents an intriguing paradigm for harnessing the power of general large language models for several critical zero-/few-shot clinical NLP tasks. In particular, the proposed guided prompt design for LLMs to generate more structured outputs may lead to further development of smaller, deployable models through iterative use of model-generated pseudo-labels.”
“Over the last five years, artificial intelligence has accelerated to the point where these large models are able to predict contextual recommendations that benefit a variety of domains, such as suggesting new drug formulations, understanding unstructured text, recommending codes, or creating works of art inspired by any number of artists and styles,” says Parminder Bhatia, who was previously director of machine learning at AWS Health AI and is now director of machine learning for low-code applications using large language models at AWS AI Labs.
Agrawal, Sontag, and Lang, who are all affiliated with the MIT Abdul Latif Jameel Clinic for Machine Learning in Health, wrote the paper with Yoon Kim, an MIT assistant professor and principal investigator at CSAIL, and Stefan Hegselmann, a visiting graduate student from the University of Muenster. First author Agrawal’s research was supported by a Takeda Fellowship, the MIT Deshpande Center for Technological Innovation, and the MLA@CSAIL initiative.