Artificial intelligence is increasingly used to optimize decision-making in high-stakes situations. For example, an autonomous system can determine a power distribution strategy that minimizes costs while maintaining a stable voltage.
But while these AI-powered results may be technically optimal, are they fair? What happens if an economical energy distribution strategy makes disadvantaged neighborhoods more prone to blackouts than higher-income areas?
To help stakeholders quickly identify potential ethical dilemmas before implementation, MIT researchers have developed an automated evaluation method that balances the interplay between measurable outcomes, such as cost or reliability, and qualitative or subjective values, such as fairness.
The system separates objective ratings from user-defined human values, using a large language model (LLM) as a proxy for human evaluators to capture and account for stakeholder preferences.
The adaptive framework selects the most informative scenarios for further evaluation, streamlining a process that typically requires costly and time-consuming manual work. These test cases can reveal situations in which autonomous systems align well with human values, as well as scenarios that unexpectedly fail to meet ethical criteria.
“We can put a lot of rules and guardrails into AI systems, but those safeguards can only prevent events that we can imagine. It’s not enough to say, ‘Let’s use the AI because it has been trained on this information.’ We wanted to develop a more systematic way to discover unknown unknowns and predict them before something bad happens,” says senior author Chuchu Fan, associate professor in MIT’s Department of Aeronautics and Astronautics (AeroAstro) and principal investigator at MIT’s Laboratory for Information and Decision Systems (LIDS).
Fan is joined on the paper by lead author Anjali Parashar, a mechanical engineering graduate student; Yingke Li, an AeroAstro PhD student; and others at MIT and Saab. The research will be presented at the International Conference on Learning Representations.
Ethics assessment
In a large system such as a power grid, assessing the ethical implications of an AI model’s recommendations in a purely objective way is particularly difficult.
Most testing frameworks rely on pre-collected data, but data labeled with subjective ethical criteria is often difficult to obtain. Additionally, because both ethical values and AI systems are constantly evolving, static assessment methods based on written codes or regulatory documents require frequent updates.
Fan and her team approached this problem from a different perspective. Building on their previous work on the evaluation of robotic systems, they developed an experimental design framework to identify the most informative scenarios, which stakeholders will then further evaluate.
Their two-part system, called Scalable Experimental Design for System-level Ethical Testing (SEED-SET), incorporates both quantitative metrics and ethical criteria. It can identify scenarios that meet measurable requirements while aligning well with human values, as well as those that do not.
“We don’t want to spend all our resources on random evaluations, so it’s crucial to focus the framework on the test cases we care about most,” says Li.
Importantly, SEED-SET does not require existing evaluation data and can adapt to many different applications.
For example, a power grid may span several user groups, including a large rural community and a data center. While both groups may want cheap and reliable energy, each group’s priorities from an ethical perspective may differ significantly.
These ethical criteria may not be well defined, so they cannot be measured analytically.
The power grid operator wants to find the most cost-effective strategy that best meets the subjective ethical preferences of all stakeholders.
SEED-SET addresses this challenge by dividing the problem into two parts arranged in a hierarchy. An objective model evaluates system performance based on measurable metrics such as cost. A subjective model, which accounts for stakeholder judgments such as perceived fairness, then builds on top of that objective assessment.
“The objective part of our approach is linked to the artificial intelligence system, while the subjective part is linked to the users who rate it. By distributing preferences in a hierarchical manner, we can generate desired scenarios with fewer ratings,” says Parashar.
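To make that hierarchy concrete, here is a minimal Python sketch of the two-level split. Everything in it, the Scenario fields, the scoring rules, the fairness heuristic, is an illustrative assumption rather than the researchers’ actual models.

```python
from dataclasses import dataclass

# A minimal sketch of the hierarchical split described above. All names
# (Scenario, objective_score, subjective_score) are hypothetical stand-ins.

@dataclass
class Scenario:
    cost: float                       # total operating cost of the strategy
    outage_rates: dict[str, float]    # outage probability per neighborhood

def objective_score(s: Scenario) -> float:
    """Objective layer: measurable system performance (lower cost is better)."""
    return -s.cost

def subjective_score(s: Scenario) -> float:
    """Subjective layer: a stand-in 'perceived fairness' rating computed on
    top of the objective simulation's outputs. A smaller gap between the
    best- and worst-served neighborhoods counts as fairer."""
    rates = list(s.outage_rates.values())
    return -(max(rates) - min(rates))

# Because the subjective rating is conditioned on a full objective outcome,
# one stakeholder judgment covers an entire simulated scenario at once.
a = Scenario(cost=1.0e6, outage_rates={"downtown": 0.01, "eastside": 0.09})
b = Scenario(cost=1.2e6, outage_rates={"downtown": 0.03, "eastside": 0.04})
for s in (a, b):
    print(objective_score(s), subjective_score(s))
```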
Coding subjectivity
To perform the subjective assessment, the system uses an LLM as a proxy for human evaluators. The researchers encode the preferences of each user group into natural-language prompts for the model.
The LLM uses these instructions to compare two scenarios, selecting the preferred design based on ethical criteria.
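In code, that proxy step might look like the sketch below. The prompt wording and the call_llm placeholder are assumptions; the paper’s actual prompts and model interface are not reproduced here.

```python
# Sketch of the LLM-as-proxy step: stakeholder preferences become a
# natural-language prompt, and the model picks between two scenarios.
# `call_llm` is a placeholder for any chat-completion client; the prompt
# text is illustrative, not quoted from the paper.

PREFERENCE_PROMPT = """You are rating power-distribution strategies on behalf
of a rural community that values reliable service over low cost.

Scenario A: {a}
Scenario B: {b}

Which scenario better matches this community's ethical priorities?
Answer with exactly 'A' or 'B'."""

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def preferred(scenario_a: str, scenario_b: str) -> str:
    """Return 'A' or 'B' according to the proxy's pairwise judgment."""
    reply = call_llm(PREFERENCE_PROMPT.format(a=scenario_a, b=scenario_b))
    return "A" if reply.strip().upper().startswith("A") else "B"
```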
“After viewing hundreds or thousands of scenarios, an evaluator may become fatigued and inconsistent in his or her assessments, so we use an LLM-based strategy instead,” explains Parashar.
SEED-SET uses the selected scenario to run a simulation of the entire system, in this case the power distribution strategy. The simulation results then help it identify the next-best scenario to test.
Ultimately, SEED-SET intelligently selects the most representative scenarios that meet or do not meet objective indicators and ethical criteria. This way, users can analyze the performance of the AI system and adjust its strategy.
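A toy version of this select-simulate-repeat loop is sketched below, with a one-knob simulator and a crude “pick the scenario least like anything tested so far” rule standing in for the paper’s experimental-design criterion.

```python
import random

# Toy adaptive loop: simulate the chosen scenario, then use what has been
# tested so far to choose the next one. The selection rule here is a crude
# stand-in for the paper's actual acquisition criterion.

random.seed(0)

def simulate(x: float) -> tuple[float, float]:
    """Hypothetical expensive simulator returning (objective, subjective)
    scores for a scenario controlled by a single knob x in [0, 1]."""
    return -abs(x - 0.3), -abs(x - 0.7)   # cheapest near 0.3, 'fairest' near 0.7

candidates = [i / 100 for i in range(101)]
tested = [random.choice(candidates)]

for _ in range(8):
    obj, subj = simulate(tested[-1])
    print(f"x={tested[-1]:.2f}  objective={obj:+.2f}  subjective={subj:+.2f}")
    # Pick the untested scenario least similar to anything already evaluated.
    nxt = max((c for c in candidates if c not in tested),
              key=lambda c: min(abs(c - t) for t in tested))
    tested.append(nxt)
```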
For example, SEED-SET can highlight cases of power distribution that prioritize higher-income areas during periods of peak demand, making disadvantaged neighborhoods more vulnerable to outages.
To test SEED-SET, researchers evaluated realistic autonomous systems such as an AI-driven power grid and an urban traffic routing system. They measured the extent to which the generated scenarios complied with ethical criteria.
The system generated more than twice as many optimal test cases as baseline strategies in the same time frame, while uncovering many scenarios missed by other approaches.
“As the user’s preferences changed, the set of scenarios generated by SEED-SET changed drastically. This tells us that the evaluation strategy is well suited to the user’s preferences,” says Parashar.
To gauge the usefulness of SEED-SET in practice, the researchers will need to conduct a user study to see whether the scenarios it generates actually help real-world decision-making.
In addition to conducting such a study, the researchers plan to explore the use of more efficient models that can scale to larger problems with more criteria, such as assessing LLM decision-making.
This research was funded in part by the U.S. Defense Advanced Research Projects Agency.
