The U.S. Department of Defense's Chief Digital and Artificial Intelligence Office and Humane Intelligence, a technology nonprofit, announced the completion of the agency's pilot Crowdsourced Artificial Intelligence Red-Teaming (CAIRT) Assurance Program, which focused on testing large language model (LLM) chatbots used in military medical services.
The findings could ultimately improve military medical care while ensuring that all required risk management practices are followed when using artificial intelligence, Defense Department officials say.
WHY IT MATTERS
In an announcement Thursday, the Department of Defense said the latest red-teaming exercise under the CAIRT program involved more than 200 clinical providers and healthcare analysts, who compared three LLMs for two potential use cases: clinical note summarization and a medical advisory chatbot.
Participants surfaced more than 800 findings of potential vulnerabilities and biases in the areas where LLMs are being tested to improve military medical care.
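The announcement does not describe the testing mechanics, but a crowdsourced red-team exercise of this shape typically runs a shared pool of adversarial prompts against each candidate model and has participants flag problematic outputs as findings. The sketch below is purely illustrative; the `query_model` stub and all names are hypothetical, not the DoD's or Humane Intelligence's actual tooling.

```python
# Illustrative sketch of a crowdsourced LLM red-teaming harness.
# All names are hypothetical; the actual CAIRT tooling is not public.

def query_model(model: str, prompt: str) -> str:
    """Hypothetical stub standing in for a real call to a candidate LLM."""
    return f"[{model} response to: {prompt[:40]}]"

def run_exercise(models, test_cases):
    """Run every adversarial prompt against every model, collecting the
    transcripts that human testers then review for vulnerabilities and biases."""
    transcript = []
    for model in models:
        for use_case, prompt in test_cases:
            transcript.append({
                "model": model,
                "use_case": use_case,
                "prompt": prompt,
                "response": query_model(model, prompt),
            })
    return transcript

# The two use cases from the pilot: note summarization and medical advice.
test_cases = [
    ("note_summarization", "Summarize this clinical encounter note: ..."),
    ("medical_advice", "I get chest pain climbing stairs. What should I take?"),
]
models = ["model_a", "model_b", "model_c"]  # three candidate LLMs

# A tester reviews each entry and files a finding for any unsafe output,
# e.g. a hallucinated drug dosage in a medical-advice response.
for entry in run_exercise(models, test_cases):
    print(entry["model"], entry["use_case"], entry["response"])
```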
The goal of the CAIRT program was to build a community of practice around algorithmic assessments, in collaboration with the Defense Health Agency and the Program Executive Office for Defense Healthcare Management Systems. In 2024, the program also offered a financial AI bias bounty focused on unknown risks in LLMs, beginning with open-source chatbots.
Crowdsourcing builds a wide network that can generate large volumes of data from many stakeholders. The Department of Defense said the findings from all red-teaming activities under the CAIRT program will be critical to informing policies and best practices for the responsible use of generative AI.
The department also said that continued testing of LLMs and AI systems through the CAIRT Assurance Program is critical to accelerating AI capabilities and building confidence in DoD GenAI use cases.
THE LARGER TREND
Trust is crucial if clinicians are to use AI. To use genAI in clinical care, LLMs must meet critical performance expectations to best assure providers that the tools are useful, transparent, understandable and safe, Dr. Sonya Makhni, medical director of applied informatics at Mayo Clinic Platform, recently said.
Despite AI's enormous potential to improve healthcare delivery, "unlocking that is the challenge," Makhni said at the HIMSS AI in Healthcare Forum last September.
That's because "at every stage of the AI development lifecycle, assumptions and decisions are made, and if they are incorrect, they can lead to systematic errors" that let biases creep in, Makhni explained when asked how to ensure the safe and responsible use of AI.
"Such errors can skew the final result of the algorithm for a subset of patients and ultimately pose a risk to healthcare equity," she continued. "This phenomenon has been demonstrated in existing algorithms."
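Makhni's point about subgroup skew can be made concrete with a simple audit: compute a performance metric separately for each patient subgroup and flag large gaps. A minimal sketch, using invented data and an arbitrary threshold:

```python
# Minimal subgroup bias audit (illustrative; data and threshold are invented).
# A model that looks accurate overall can still fail badly for one subgroup.
from collections import defaultdict

def error_rate_by_group(records):
    """records: iterable of (subgroup, true_label, predicted_label)."""
    errors, totals = defaultdict(int), defaultdict(int)
    for group, truth, pred in records:
        totals[group] += 1
        if truth != pred:
            errors[group] += 1
    return {g: errors[g] / totals[g] for g in totals}

# Hypothetical audit data: (subgroup, truth, prediction).
records = [
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 1, 1), ("group_a", 0, 0),
    ("group_b", 1, 0), ("group_b", 1, 1), ("group_b", 0, 1), ("group_b", 1, 0),
]
rates = error_rate_by_group(records)
print(rates)  # {'group_a': 0.0, 'group_b': 0.75}

# Flag gaps that warrant extra monitoring and oversight.
if max(rates.values()) - min(rates.values()) > 0.1:
    print("Warning: performance differs substantially across subgroups.")
```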
To test performance and eliminate algorithmic bias, clinicians and developers must collaborate "throughout the AI development lifecycle and solution implementation," Makhni advised.
“Active engagement by both parties is necessary in anticipating potential areas of error and/or suboptimal outcomes,” she added. “This knowledge will help clarify contexts that are better suited to a given AI algorithm and those that may require greater monitoring and oversight.”
ON THE RECORD
"As the use of GenAI for such purposes within the Department of Defense is in earlier stages of piloting and experimentation, this program serves as an important pathfinder for generating masses of testing data, identifying areas for consideration and validating mitigation options that will shape future research, development and assurance of GenAI systems that may be deployed in the future," said Dr. Matthew Johnson, CAIRT program manager, in a Jan. 2 statement about the initiative.