Tuesday, December 24, 2024

3 Questions: Should we label AI systems like prescription drugs?

Nature Computational Science

MIT News

Q: Why do we need labels for the responsible use of AI systems in healthcare settings?

A: We have a fascinating situation in healthcare, where physicians often rely on technology or treatments that are not fully understood. Sometimes this lack of understanding is fundamental (the mechanism of action of acetaminophen, for example), but other times it is simply a limitation of specialization. We do not expect clinicians to know how to service an MRI machine. Instead, we have certification systems through the FDA and other federal agencies that certify the use of a medical device or drug in a particular setting.

Importantly, medical devices also come with service agreements: a technician from the manufacturer will fix your MRI machine if it is miscalibrated. For approved drugs, there are post-marketing surveillance and reporting systems so that side effects or adverse events can be addressed, for instance if many people taking a drug appear to be developing a condition or allergy.

Models and algorithms, whether they include AI or not, skip many of these validation processes and long-term monitoring, and that is something we need to be wary of. Many previous studies have shown that predictive models need more rigorous evaluation and monitoring. With newer generative AI, we cite work showing that generations are not guaranteed to be correct, robust, or unbiased. Because we do not have the same level of oversight over model predictions or generations, it would be even more difficult to catch problematic model responses. The generative models being used by hospitals today can be biased. Usage labels are one way to ensure that models do not automate biases learned from human clinicians or from miscalibrated clinical decision support scores of the past.
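To make the monitoring point concrete, here is a minimal sketch (not from the article) of the kind of post-deployment audit a hospital could run, comparing a deployed predictive model's discrimination across patient subgroups on recent data. The column names and threshold are hypothetical.

```python
# Minimal sketch of a post-deployment audit: compare a deployed model's
# discrimination (AUC) across patient subgroups on recent data.
# The column names ("label", "score", and the grouping column) are hypothetical.
import pandas as pd
from sklearn.metrics import roc_auc_score

def subgroup_audit(df: pd.DataFrame, group_col: str,
                   label_col: str = "label", score_col: str = "score") -> pd.DataFrame:
    """Return AUC and sample size for each subgroup in `group_col`."""
    rows = []
    for group, sub in df.groupby(group_col):
        if sub[label_col].nunique() < 2:
            continue  # AUC is undefined when only one class is present
        rows.append({
            "group": group,
            "n": len(sub),
            "auc": roc_auc_score(sub[label_col], sub[score_col]),
        })
    return pd.DataFrame(rows)

# Example usage (hypothetical dataframe of recent predictions):
# audit = subgroup_audit(recent_predictions, group_col="sex")
# overall = roc_auc_score(recent_predictions["label"], recent_predictions["score"])
# flagged = audit[audit["auc"] < overall - 0.05]  # subgroups falling well below overall
```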

Q: Your article describes several elements of a responsible-use label for AI, modeled on the FDA's approach to prescription drug labeling, including approved usage, ingredients, potential side effects, and more. What core information should these labels convey?

A: The label should make clear when, where, and how a model is intended to be used. For instance, the user should know that a model was trained at a specific time with data from a specific time period. Does the training data include the Covid-19 pandemic or not? Health practices during Covid were very different, and that could have affected the data. This is why we advocate for disclosing the model's "ingredients" and "completed studies."

As for location, we know from prior research that models trained in one place tend to perform worse when moved to another. Knowing where the data came from and how the model was optimized for that population can help ensure that users are aware of "potential side effects," any "warnings and precautions," and "adverse reactions."
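One way to surface such location-specific "side effects" before deployment is to evaluate the already-trained model on data from the target site and compare against its performance at the training site. The sketch below assumes a scikit-learn-style classifier and hypothetical site datasets; it illustrates the idea rather than any protocol from the article.

```python
# Sketch: check how a model trained at one site performs at another site.
# `model` is any fitted classifier exposing predict_proba; the datasets are hypothetical.
from sklearn.metrics import roc_auc_score

def site_transfer_check(model, X_source, y_source, X_target, y_target,
                        max_auc_drop: float = 0.05) -> dict:
    """Compare AUC at the training (source) site vs. a new (target) site."""
    auc_source = roc_auc_score(y_source, model.predict_proba(X_source)[:, 1])
    auc_target = roc_auc_score(y_target, model.predict_proba(X_target)[:, 1])
    drop = auc_source - auc_target
    return {
        "auc_source": auc_source,
        "auc_target": auc_target,
        "auc_drop": drop,
        # A large drop is the kind of "warning" worth putting on the label.
        "acceptable": drop <= max_auc_drop,
    }
```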

For a model trained to predict a single outcome, knowing when and where the training data came from can help inform decisions about deployment. But many generative models are incredibly flexible and can be used for many different tasks. Here, time and place may be less informative, and more explicit guidance about "conditions of labeling" and "approved usage" versus "unapproved usage" comes into play. If a developer has evaluated a generative model for reading a patient's clinical notes and generating prospective billing codes, that evaluation may reveal that it tends to over-bill for some conditions and under-bill for others. A user would not want to use the same generative model to decide who gets a referral to a specialist, even though they could. This flexibility is why we advocate for additional detail on how models should be used.
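To illustrate how these elements could fit together, the sketch below collects them into a hypothetical, machine-readable usage label whose sections loosely mirror an FDA prescription drug label. The field names and example values are illustrative, not a proposed standard.

```python
# Hypothetical, machine-readable "usage label" for a clinical model,
# loosely mirroring sections of an FDA prescription drug label.
from dataclasses import dataclass
from typing import List

@dataclass
class ModelUsageLabel:
    model_name: str
    approved_uses: List[str]             # tasks the model was actually evaluated for
    unapproved_uses: List[str]           # tasks it should not be repurposed for
    ingredients: List[str]               # data sources, time windows, sites
    completed_studies: List[str]         # validation studies and their settings
    warnings_and_precautions: List[str]  # known failure modes, e.g. site shift
    adverse_reaction_contact: str        # where to report incidents after deployment

# Illustrative example (values are hypothetical):
label = ModelUsageLabel(
    model_name="note-to-billing-code-generator",
    approved_uses=["suggest billing codes from clinical notes for human review"],
    unapproved_uses=["deciding specialist referrals", "triage decisions"],
    ingredients=["EHR notes from one academic hospital, 2018-2022 (spans the Covid period)"],
    completed_studies=["retrospective coding-accuracy study at the training site"],
    warnings_and_precautions=["tends to over-bill some conditions and under-bill others"],
    adverse_reaction_contact="hypothetical-reporting@example.org",
)
```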

In general, we recommend that you train the best model you can with the tools available to you. But even then, there should be a lot of disclosure. No model is going to be perfect. As a society, we now understand that no pill is perfect; there is always some risk. We should have the same understanding of AI models. Any model, AI or not, is limited. It can make realistic, well-trained predictions of potential futures, but take that with a grain of salt.

Q: If AI labels were introduced, who would do the labeling and how would the labels be regulated and enforced?

A: If you do not intend to use your model in practice, then the disclosures you would make for a high-quality research publication are sufficient. But once you intend to deploy a model in a human-facing setting, developers and deployers should do an initial labeling, based on some established framework. These claims should be validated prior to deployment; in a safety-critical setting such as healthcare, multiple agencies of the Department of Health and Human Services could be involved.

For model developers, I think that knowing they will need to flag a system's limitations makes them consider the process more carefully. If I know that at some point I will have to disclose the population a model was trained on, I would not want to disclose that it was trained only on dialogue from male chatbot users, for example.

Thinking about who is collecting the data, over what time period, what the sample size is, and how decisions were made about which data to include and which to exclude can surface potential deployment issues early.
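As a small illustration, those questions could become required disclosure fields that a deployment pipeline checks before a model ships; the field names below are hypothetical.

```python
# Sketch: require basic data-provenance disclosures before a model is deployed.
# The required fields mirror the questions above; all names are hypothetical.
REQUIRED_DISCLOSURES = [
    "data_collectors",     # who collected the data
    "collection_period",   # over what time period
    "sample_size",         # how many records or patients
    "inclusion_criteria",  # what data was included, and why
    "exclusion_criteria",  # what data was excluded, and why
]

def missing_disclosures(disclosures: dict) -> list:
    """Return the disclosure fields that are absent or empty."""
    return [key for key in REQUIRED_DISCLOSURES if not disclosures.get(key)]

# Example: this (hypothetical) submission is incomplete and would be held back.
submission = {
    "data_collectors": "single academic medical center",
    "collection_period": "2018-2022",
    "sample_size": 120_000,
}
gaps = missing_disclosures(submission)
if gaps:
    print(f"Deployment blocked; missing disclosures: {gaps}")
```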
