Over the last 10 years, it has become much more common for physicians to keep records electronically. Such documentation can contain a wealth of medically useful data: hidden correlations between symptoms, treatments, and outcomes, for example, or clues that patients are promising candidates for new drug trials.
However, most of this data is locked up in doctors’ free-text notes. One of the difficulties in extracting data from unstructured text is what computer scientists call word sense disambiguation. For example, in a doctor’s notes, the word “discharge” may refer to a bodily discharge, but it may also refer to a discharge from the hospital. The ability to infer the intended meaning of words makes it much easier for computers to find useful patterns in mountains of data.
Next week at the American Medical Informatics Association’s (AMIA) annual symposium, researchers from MIT’s Computer Science and Artificial Intelligence Laboratory will present a new system for disambiguating the meanings of words used in doctors’ clinical notes. The system is on average 75 percent accurate in disambiguating words with two meanings, which is a marked improvement over previous methods. But more importantly, says Anna Rumshisky, an MIT postdoc who helped lead the new research, it represents a fundamentally new approach to word disambiguation that could lead to much more accurate systems while drastically reducing the amount of human effort needed to develop them.
Indeed, Rumshisky says, the paper that was originally accepted at the AMIA symposium described a system that used a more conventional approach to word disambiguation, with an average accuracy of only about 63 percent. “In our opinion, it wasn’t enough to make it actually usable,” says Rumshisky. “So instead, we tried something that had been tried before in the general domain, but never in the biomedical or clinical domain.”
Topical application
Specifically, Rumshisky explains, she and her coauthors adapted algorithms from a research area known as topic modeling. Her coauthors are Rachel Chasin, a graduate student whose master’s thesis is the basis for the new work; Peter Szolovits, a professor of computer science and engineering and of health sciences and technology at MIT; and Özlem Uzuner, a researcher who earned her PhD at MIT and is now an assistant professor at the University at Albany. Topic modeling aims to automatically identify the topics of documents by inferring relationships between their salient words.
“The intuition we’re trying to transfer from the general domain is treating instances of the target word as documents and the senses as latent topics that we’re trying to infer,” Rumshisky says.
While a regular topic-modeling algorithm searches huge corpora of text to identify clusters of words that tend to appear in close proximity to each other, Rumshisky and her colleagues’ algorithm identifies correlations not only between words, but also between words and other text “features,” such as words’ syntactic roles. For example, if the word “discharge” is preceded by an adjective, it is much more likely to refer to a bodily discharge than to an administrative event.
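To make this concrete, here is a minimal, hypothetical sketch (not the authors’ code) of how a single occurrence of an ambiguous word might be turned into a pseudo-document whose tokens mix ordinary context words with an encoded syntactic cue, such as the preceding-adjective feature described above. The function name and feature token are invented for illustration:

```python
def instance_features(tokens, pos_tags, target):
    """Build a pseudo-document for one occurrence of `target`: its
    context words plus syntactic cues encoded as extra pseudo-words."""
    i = tokens.index(target)
    # Context words: everything in the sentence except the target itself.
    doc = [w for j, w in enumerate(tokens) if j != i]
    # Hypothetical feature: is the target preceded by an adjective?
    # ("yellow discharge" suggests the bodily sense; "discharge papers"
    # suggests the administrative one.)
    if i > 0 and pos_tags[i - 1] == "ADJ":
        doc.append("FEAT_prev_adj")
    return doc
```

A topic model run over many such pseudo-documents can then pick up correlations between the feature tokens and ordinary context words.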
Typically, topic-modeling algorithms assign different weights to different topics: for example, a single news article might be 50 percent about politics, 30 percent about the economy, and 20 percent about foreign affairs. Similarly, the MIT researchers’ new algorithm assigns different weights to the different possible meanings of ambiguous words.
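Weighted sense assignments of this kind can be sketched with a mixture-of-unigrams model fit by expectation-maximization, a much simpler stand-in for a full topic model. Everything here (function name, toy data, parameter values) is illustrative, not the published system:

```python
import math
import random

def sense_mixture_em(docs, k=2, iters=50, seed=0, alpha=0.1):
    """Fit a mixture-of-unigrams model with EM and return, for each
    pseudo-document (one occurrence of the ambiguous word), its
    weights over k latent "senses"."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    # Random per-sense word distributions break the initial symmetry.
    word_p = []
    for _ in range(k):
        raw = {w: rng.random() + 0.5 for w in vocab}
        total = sum(raw.values())
        word_p.append({w: v / total for w, v in raw.items()})
    pi = [1.0 / k] * k  # prior weight of each sense
    resp = []
    for _ in range(iters):
        # E-step: how responsible is each sense for each document?
        resp = []
        for d in docs:
            logs = [math.log(pi[s]) + sum(math.log(word_p[s][w]) for w in d)
                    for s in range(k)]
            m = max(logs)
            exps = [math.exp(v - m) for v in logs]
            z = sum(exps)
            resp.append([e / z for e in exps])
        # M-step: re-estimate priors and (smoothed) word distributions.
        for s in range(k):
            pi[s] = sum(r[s] for r in resp) / len(docs)
            counts = {w: alpha for w in vocab}
            for d, r in zip(docs, resp):
                for w in d:
                    counts[w] += r[s]
            z = sum(counts.values())
            word_p[s] = {w: c / z for w, c in counts.items()}
    return resp
```

Run on toy instances of “discharge” whose contexts contain either body-fluid words or administrative words, the returned per-document weights behave like the sense mixtures described above: each row is a probability distribution over the two latent senses.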
One advantage of topic-modeling algorithms is that they are unsupervised: They can be turned loose on enormous collections of text without requiring human-annotated training data. As a result, the researchers can continually refine their algorithm to include more features, then deploy it on unannotated medical records and let it draw its own inferences. The more features it includes, the more accurate it should be, Rumshisky says.
Recommended features
Among the features the researchers plan to incorporate into the algorithm are entries in the massive dictionary of medical terms developed by the National Institutes of Health, called the Unified Medical Language System (UMLS). Indeed, word associations in UMLS were the basis for the researchers’ original algorithm, the one that achieved 63 percent accuracy. The problem there was that the length and structure of the paths from one word to another in UMLS did not always correspond to the semantic differences between the words. But the new system should automatically identify only those correspondences that occur frequently enough to be likely to be useful.
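The path-based idea behind the original system can be illustrated with a toy term graph; the terms and links below are invented for illustration, not real UMLS entries. A breadth-first search gives the distance between two terms, but, as the example shows, two term pairs can sit at the same distance while differing greatly in semantic relatedness, which is the flaw described above:

```python
from collections import deque

# Invented stand-in for a UMLS-style term graph (directed edges).
GRAPH = {
    "discharge": ["fluid", "hospital event"],
    "fluid": ["pus"],
    "hospital event": ["admission"],
    "pus": [],
    "admission": [],
}

def path_length(graph, src, dst):
    """Shortest-path length between two terms via breadth-first search,
    used here as a crude path-based similarity signal."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nb in graph.get(node, []):
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, dist + 1))
    return None  # no path between the terms
```

In this toy graph, “discharge” is exactly two hops from both “pus” and “admission,” even though the two pairs belong to entirely different senses of the word.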
“The parts [of UMLS] important for sense discrimination would basically float to the top,” says Rumshisky. “It kind of gives you that association, if it’s valid, for free. If it’s not important, it just doesn’t matter.”
The researchers are also experimenting with additional syntactic and semantic features that may aid word disambiguation, as well as with word associations established by the NIH’s Medical Subject Headings (MeSH) paper-classification scheme. “It’s still not perfect because we haven’t integrated all the language features we wanted,” says Rumshisky. “But I have a feeling this is the right way.”
“About 80 percent of clinical information is hidden in clinical notes,” says Hongfang Liu, assistant professor of health informatics at Mayo Clinic. “Many words or phrases are ambiguous there. So to get the correct interpretation, you have to go through a word disambiguation phase.”
Liu says that although some computational linguists have applied topic modeling algorithms to the problem of word disambiguation, “I feel like they’re working on toy problems. I think in this case it could actually be applied to production-scale natural language processing systems.”