Executives at artificial intelligence companies may like to tell us that AGI is almost here, but the latest models still need some additional tutoring to be as smart as possible.
Scale AI, a company that has played a key role in helping frontier AI firms build advanced models, has developed a platform that can automatically test a model against thousands of benchmarks and tasks, pinpoint weaknesses, and flag the additional training data that should help improve its skills. Scale, of course, will supply the data required.
Scale rose to prominence by providing human labor for training and testing advanced AI models. Large language models (LLMs) are trained on reams of text scraped from books, the web, and other sources. Turning these models into helpful, coherent, and well-mannered chatbots requires additional “post-training” in the form of humans who provide feedback on a model’s output.
Scale supplies workers who are expert at probing models for problems and limitations. The new tool, called Scale Evaluation, automates some of that work using Scale’s own machine learning algorithms.
“Within the big labs, there are all these haphazard ways of tracking some of the model weaknesses,” says Daniel Berrios, head of product for Scale Evaluation. The new tool “is a way for [model makers] to go through results and slice and dice them to understand where a model is underperforming,” Berrios says, “then use that to target the data campaigns for improvement.”
Berrios says that several frontier AI companies are already using the tool. He says most use it to improve the reasoning capabilities of their best models. AI reasoning involves a model attempting to break a problem into constituent parts in order to solve it more effectively. The approach relies heavily on post-training, using feedback from users to determine whether the model solved a problem correctly.
In one instance, Berrios says, Scale Evaluation revealed that a model’s reasoning skills declined when it was fed non-English prompts. “While [the model’s] general-purpose reasoning capabilities were pretty good and performed well on benchmarks, they tended to degrade when the prompts were not in English,” he says. Scale Evaluation highlighted the problem and allowed the company to gather additional training data to address it.
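Scale has not published how Evaluation works internally, but the kind of analysis Berrios describes, slicing aggregate benchmark results by an attribute such as prompt language to spot where accuracy drops, can be sketched in a few lines. Everything here (the record format, function names, and the 0.7 threshold) is a hypothetical illustration, not Scale’s actual implementation:

```python
from collections import defaultdict

def accuracy_by_slice(results, key):
    """Group pass/fail benchmark results by a slicing key
    (e.g., prompt language) and return accuracy per slice."""
    totals = defaultdict(lambda: [0, 0])  # slice -> [passed, total]
    for record in results:
        bucket = totals[record[key]]
        bucket[0] += record["passed"]  # True counts as 1
        bucket[1] += 1
    return {k: passed / total for k, (passed, total) in totals.items()}

def flag_weak_slices(results, key, threshold=0.7):
    """Return only the slices whose accuracy falls below the threshold."""
    scores = accuracy_by_slice(results, key)
    return {k: v for k, v in scores.items() if v < threshold}

# Hypothetical benchmark results: each record is one graded test prompt.
results = [
    {"language": "en", "passed": True},
    {"language": "en", "passed": True},
    {"language": "en", "passed": True},
    {"language": "en", "passed": False},
    {"language": "de", "passed": True},
    {"language": "de", "passed": False},
    {"language": "de", "passed": False},
    {"language": "de", "passed": False},
]

print(flag_weak_slices(results, "language"))  # {'de': 0.25}
```

In this toy run, English prompts score 0.75 and pass the threshold, while German prompts score 0.25 and get flagged, mirroring the non-English degradation Berrios describes; the flagged slice then tells the model maker where to target new training data.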
Jonathan Frankle, chief AI scientist at Databricks, a company that builds large AI models, says the ability to test one foundation model against another sounds useful in principle. “Anyone who moves the ball forward on evaluation is helping us to build better AI,” Frankle says.
In recent months, Scale has contributed to the development of several new benchmarks designed to push AI models to become smarter and to scrutinize more carefully how they might misbehave. These include EnigmaEval, MultiChallenge, MASK, and Humanity’s Last Exam.
Scale says that measuring improvements in AI models is becoming harder as they get better at acing existing tests. The company says its new tool offers a more comprehensive picture by combining many different benchmarks, and it can be used to design custom tests of a model’s skills, such as reasoning in different languages. Scale’s own AI can take a given problem and generate more examples, allowing for a more comprehensive test of a model’s abilities.
The company’s new tool may also inform efforts to standardize the testing of AI models for misbehavior. Some researchers say that a lack of standardization means that some model jailbreaks go undisclosed.
In February, the US National Institute of Standards and Technology announced that Scale would help it develop methodologies for testing models to ensure they are safe and trustworthy.
What kinds of errors have you noticed in the outputs of generative AI tools? What do you think are models’ biggest blind spots? Let us know by emailing hello@wired.com or commenting below.