To train agents to interact well with humans, we need to be able to measure progress. However, human interaction is complex, and measuring progress is difficult. In this work, we developed a method, called the Standardized Test Suite (STS), to evaluate agents in time-extended multimodal interactions. We analyzed interactions in which human participants asked agents to perform tasks and answer questions in a simulated 3D environment.
The STS methodology places agents in a set of behavioral scenarios extracted from real human interaction data. Agents see the recreated scenario context, receive instructions, and are then given control to complete the interaction offline. These agent continuations are recorded and sent to human raters, who annotate each one as a success or failure. Agents are then ranked according to the proportion of scenarios in which they succeed.
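To make the ranking step concrete, here is a minimal sketch of how STS-style scores might be aggregated once human raters have judged each continuation. The ScenarioResult structure and its field names are illustrative placeholders, not the actual STS implementation.

```python
# Minimal sketch of aggregating rater verdicts into an agent ranking.
# ScenarioResult and its fields are assumed placeholder names.
from dataclasses import dataclass


@dataclass
class ScenarioResult:
    scenario_id: str   # which behavioral scenario was replayed
    agent_name: str    # which agent produced the continuation
    success: bool      # human rater's success/failure verdict


def rank_agents(results: list[ScenarioResult]) -> list[tuple[str, float]]:
    """Rank agents by the proportion of scenarios judged successful."""
    totals: dict[str, int] = {}
    successes: dict[str, int] = {}
    for r in results:
        totals[r.agent_name] = totals.get(r.agent_name, 0) + 1
        successes[r.agent_name] = successes.get(r.agent_name, 0) + int(r.success)
    scores = {name: successes[name] / totals[name] for name in totals}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    demo = [
        ScenarioResult("lift-the-red-object", "agent_a", True),
        ScenarioResult("lift-the-red-object", "agent_b", False),
        ScenarioResult("what-colour-is-the-duck", "agent_a", False),
        ScenarioResult("what-colour-is-the-duck", "agent_b", True),
    ]
    for name, score in rank_agents(demo):
        print(f"{name}: {score:.2f}")
```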
Many of the behaviors that come as second nature to humans in our everyday interactions are difficult to put into words and impossible to formalize. This is why the reward mechanisms that reinforcement learning relies on to solve games (such as Atari, Go, DotA, and StarCraft) won’t work when we try to teach agents to interact smoothly and successfully with humans. For example, consider the difference between the two questions, “Who won this game of Go?” and “What are you looking at?” In the first case, we can write a piece of computer code that counts the stones on the board at the end of the game and confidently determines the winner. In the second case, we have no idea how to codify this: the answer might depend on the speakers, the sizes and shapes of the objects involved, whether the speaker is joking, and other aspects of the context in which the utterance is made. Humans intuitively understand the myriad factors involved in answering this seemingly mundane question.
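To illustrate the contrast, the first question really does admit a hand-written success function. The sketch below scores a finished Go position by area counting (stones plus surrounded empty territory); it deliberately ignores komi and dead-stone removal, so it is an illustration of the idea rather than a tournament-grade scorer.

```python
# Simplified illustration of a programmable success function: area scoring
# for a finished Go position. Ignores komi and dead-stone removal.
def score_go(board: list[list[str]]) -> str:
    """board uses 'B', 'W', or '.' per point; returns 'B', 'W', or 'draw'."""
    size = len(board)
    counts = {"B": 0, "W": 0}
    seen = set()

    for y in range(size):
        for x in range(size):
            point = board[y][x]
            if point in counts:
                counts[point] += 1
            elif (y, x) not in seen:
                # Flood-fill this empty region and note which colors border it.
                region, borders, stack = [], set(), [(y, x)]
                while stack:
                    cy, cx = stack.pop()
                    if (cy, cx) in seen:
                        continue
                    seen.add((cy, cx))
                    region.append((cy, cx))
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < size and 0 <= nx < size:
                            if board[ny][nx] == ".":
                                stack.append((ny, nx))
                            else:
                                borders.add(board[ny][nx])
                if len(borders) == 1:  # territory belongs to the only bordering color
                    counts[borders.pop()] += len(region)

    if counts["B"] == counts["W"]:
        return "draw"
    return "B" if counts["B"] > counts["W"] else "W"
```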
Interactive evaluation by human participants can serve as a benchmark for understanding agent performance, but it is noisy and expensive. It is difficult to control the exact instructions that humans give agents when interacting with them for evaluation. This type of evaluation also occurs in real time, so it is too slow to rely on for rapid progress. Previous work has relied on proxies for interactive evaluation. Proxies such as losses and scripted probe tasks (e.g., “pick x,” where x is randomly selected from the environment and the success function is carefully hand-crafted) are useful for quickly gaining insight into agents, but in practice they do not correlate well with interactive evaluation. Our new method offers control and speed while remaining closely aligned with our ultimate goal: creating agents that interact well with humans.
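For comparison, a scripted probe task of the kind mentioned above might look like the following sketch. The environment and agent interfaces (reset, step, act, held_object) are hypothetical stand-ins for whatever simulator an agent is trained in; the point is that the success check is fully hand-crafted.

```python
# Hedged sketch of a scripted "pick x" probe task with a hand-crafted success
# function. The env/agent interface is hypothetical, not a real API.
import random


def run_pick_probe(env, agent, candidate_objects, max_steps=200):
    target = random.choice(candidate_objects)          # "x" is drawn at random
    observation = env.reset(instruction=f"Pick up the {target}.")
    for _ in range(max_steps):
        action = agent.act(observation)
        observation = env.step(action)
        # Hand-crafted success check: the episode succeeds the moment the
        # agent is holding the named object.
        if observation.held_object == target:
            return True
    return False


def probe_success_rate(env, agent, candidate_objects, episodes=100):
    wins = sum(run_pick_probe(env, agent, candidate_objects) for _ in range(episodes))
    return wins / episodes
```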
The development of MNIST, ImageNet, and other human-annotated datasets has been crucial to the advancement of machine learning. These datasets have allowed researchers to train and evaluate classification models for a one-time cost of human input. The STS methodology aims to do the same for human-agent interaction studies. This evaluation method still requires humans to annotate agent continuations; however, early experiments suggest that automating these annotations may be possible, which would enable rapid and effective automated evaluation of interactive agents. In the meantime, we hope that other researchers will be able to apply the methodology and system design to accelerate their own research in this area.
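As a sketch of what automating those annotations could look like (and not the approach used in this work), one could train a simple classifier on the existing human success/failure labels and use it to score new continuations:

```python
# Illustrative sketch of automating rater annotations with a learned
# classifier. featurise() is a placeholder for however a recorded
# continuation would actually be summarized into features.
import numpy as np
from sklearn.linear_model import LogisticRegression


def featurise(continuation) -> np.ndarray:
    """Placeholder: map a recorded continuation to a fixed-length feature vector."""
    raise NotImplementedError


def train_auto_rater(continuations, human_labels):
    X = np.stack([featurise(c) for c in continuations])
    y = np.asarray(human_labels, dtype=int)   # 1 = success, 0 = failure
    return LogisticRegression(max_iter=1000).fit(X, y)


def auto_success_rate(model, new_continuations) -> float:
    X = np.stack([featurise(c) for c in new_continuations])
    return float(model.predict(X).mean())
```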