Sunday, December 22, 2024

New method uses crowdsourced feedback to help train robots

To teach an AI agent a new task, such as opening a kitchen cabinet, researchers often use reinforcement learning, a trial-and-error process in which the agent is rewarded for taking actions that bring it closer to the goal.

In many cases, a human expert must carefully design the reward function, the incentive mechanism that motivates the agent to explore, and then iteratively update it as the agent explores and tries different actions. This can be time-consuming, inefficient, and difficult to scale, especially when the task is complex and involves many steps.
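To make that burden concrete, below is a minimal sketch of the kind of hand-shaped reward function an expert might write for a cabinet-opening task. The terms, weights, and thresholds are purely illustrative assumptions, not the researchers' code; they are exactly the details an expert ends up re-tuning again and again.

```python
import numpy as np

def cabinet_reward(gripper_pos, handle_pos, hinge_angle, target_angle=1.2):
    """Hypothetical shaped reward for an 'open the cabinet' task (illustration only)."""
    reach_term = -np.linalg.norm(gripper_pos - handle_pos)   # pull the gripper toward the handle
    open_term = -abs(hinge_angle - target_angle)             # pull the hinge toward the target angle
    success_bonus = 10.0 if abs(hinge_angle - target_angle) < 0.05 else 0.0
    # The weights below are the kind of detail an expert keeps adjusting
    # as the agent discovers unintended ways to collect reward.
    return 0.5 * reach_term + 1.0 * open_term + success_bonus
```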

Researchers from MIT, Harvard University, and the University of Washington have developed a new reinforcement learning approach that does not rely on an expert-designed reward function. Instead, it uses crowdsourced feedback from many non-expert users to guide the agent as it learns to reach its goal.

While some other methods also try to use non-expert feedback, this new approach enables the AI agent to learn faster, even though the data crowdsourced from users is often full of errors. Such noisy data can cause other methods to fail.

Additionally, this new approach allows feedback to be gathered asynchronously, so non-experts around the world can contribute to teaching the agent.

“One of the most time-consuming and challenging parts of designing a robot agent today is designing the reward function. Currently, reward functions are designed by experienced researchers – a paradigm that does not scale if we want to teach our robots many different tasks. Our work proposes a way to scale robot learning by crowdsourcing the design of reward functions and allowing non-experts to provide actionable feedback,” says Pulkit Agrawal, an assistant professor in MIT’s Department of Electrical Engineering and Computer Science (EECS) who directs the Improbable AI Lab at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL).

In the future, this method could help a robot quickly learn to perform specific tasks in a user’s home, without the owner having to physically demonstrate each task. The robot could explore on its own, using non-expert feedback to guide its exploration.

“In our method, the reward function tells the agent what it should explore, rather than telling it exactly what it must do to complete the task. So even if the human supervision is somewhat imprecise and noisy, the agent is still able to explore, which helps it learn much better,” explains lead author Marcel Torne ’23, a research assistant at the Improbable AI Lab.

Torne is joined on the paper by his MIT advisor, Agrawal; senior author Abhishek Gupta, assistant professor at the University of Washington; and others at the University of Washington and MIT. The research will be presented next month at the Conference on Neural Information Processing Systems.

Noisy feedback

One way to gather user feedback for reinforcement learning is to show the user two images of states reached by the agent and ask which state is closer to the goal. For instance, the robot’s goal might be to open a kitchen cabinet. One image might show that the robot has opened the cabinet, while the other might show that it has opened the microwave instead. The user would pick the photo of the “better” state.

Some previous approaches have tried to use this binary, crowdsourced feedback to optimize a reward function that the agent then uses to learn the task. However, because non-experts often make mistakes, the reward function can become very noisy, causing the agent to get stuck and never reach its goal.
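For context, preference-based approaches of this kind typically fit a reward model to the binary comparisons with a Bradley-Terry-style objective, roughly along the lines below. The network, feature size, and training details are assumptions for illustration, not any specific prior system.

```python
import torch
import torch.nn as nn

# A small reward model over hypothetical 32-dimensional state features.
reward_net = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(reward_net.parameters(), lr=1e-3)

def preference_loss(state_a, state_b, label):
    """Bradley-Terry-style loss: label is 1.0 when the annotator judged
    state_a closer to the goal than state_b, else 0.0 (shape: [batch, 1])."""
    r_a = reward_net(state_a)
    r_b = reward_net(state_b)
    # Mistaken labels push the learned reward in the wrong direction, which is
    # why optimizing it directly can strand the agent far from the goal.
    return nn.functional.binary_cross_entropy_with_logits(r_a - r_b, label)
```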

“Basically, an agent would take the reward function too seriously. It would try to perfectly match the reward function. So instead of directly optimizing the reward function, we simply use it to tell the robot which areas it should explore,” says Torne.

He and his collaborators split the process into two separate parts, each driven by its own algorithm. They call their new reinforcement learning method HuGE (Human Guided Exploration).

On the one hand, a goal selector algorithm is continually updated with crowdsourced human feedback. The feedback is not used as a reward function, but rather to guide the agent’s exploration. In a sense, non-expert users drop breadcrumbs that gradually lead the agent toward its goal.

On the other hand, the agent explores on its own, in a self-supervised manner, guided by the goal selector. It collects images or videos of the actions it attempts, which are later sent to humans and used to update the goal selector.

This narrows the agent’s exploration, steering it toward more promising regions that are closer to the goal. But if there is no feedback, or if feedback takes a while to arrive, the agent keeps learning on its own, albeit more slowly. This makes it possible to gather feedback infrequently and asynchronously.

“The exploration loop can operate autonomously, because it will simply keep exploring and learning new things. And when you get a better signal, it will start exploring in a more specific way. You can just let the two loops spin at their own pace,” adds Torne.
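A simplified sketch of how these two loops can run at different speeds is shown below. Every name in it (the goal selector's methods, the feedback queue, the agent's update calls) is an assumption made for illustration, not the authors' implementation.

```python
# Simplified sketch of the two asynchronous loops described above (illustration only).

def human_feedback_loop(goal_selector, feedback_queue):
    # Runs whenever crowdsourced comparisons arrive, at whatever pace they arrive.
    while True:
        comparison = feedback_queue.get()     # blocks until a new label shows up
        goal_selector.update(comparison)      # feedback steers exploration; it is not a reward

def exploration_loop(agent, env, goal_selector, replay_buffer):
    # Runs continuously, with or without fresh feedback.
    while True:
        goal = goal_selector.pick(replay_buffer.reached_states())  # next "breadcrumb" to chase
        trajectory = agent.explore_towards(goal, env)
        replay_buffer.add(trajectory)
        agent.self_supervised_update(replay_buffer)  # e.g., relabeling reached states as goals
```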

And because the feedback only gently guides the agent’s exploration, the agent will eventually learn to complete the task even if users sometimes answer incorrectly.

Faster learning

The researchers tested this method on a number of simulated and real-world tasks. In simulation, they used HuGE to efficiently learn tasks with long sequences of actions, such as stacking blocks in a particular order or navigating a large maze.

In real-world tests, they used HuGE to train robotic arms to draw the letter “U” and to pick and place objects. For these tests, they crowdsourced data from 109 non-expert users in 13 different countries spanning three continents.

In real and simulated experiments, HuGE helped agents learn to achieve a goal faster than other methods.

The researchers also found that data from non-experts yielded better performance than synthetic data that was curated and labeled by the researchers. For non-expert users, labeling 30 images or videos took less than two minutes.

“This makes it very promising in terms of being able to scale up this method,” adds Torne.

In a related paper, which the researchers presented at the recent Conference on Robot Learning, they enhanced HuGE so that an AI agent can learn to perform a task and then autonomously reset the environment to continue learning. For instance, if the agent learns to open a cabinet, the method also teaches it to close the cabinet.

“Now we can make it learn completely autonomously, without the need for a human to reset it,” he says.

The researchers also emphasize that, in this and other learning approaches, it is critical to ensure AI agents are aligned with human values.

In the future, they want to continue to improve HuGE so that the agent can learn from other forms of communication, such as natural language and physical interactions with the robot. They are also interested in using this method to teach multiple agents at the same time.

This research is funded in part by the MIT-IBM Watson AI Lab.
