For enterprises, making sure the AI models they deploy follow safety and responsible-use policies means adjusting LLMs so that they do not respond to unwanted queries.
However, most safety and red-teaming work happens pre-deployment, "baking in" policies before users fully test a model's capabilities in production. OpenAI believes it can give businesses a more flexible option, and encourage more of them to adopt safety policies.
As part of the research announcement, the company released two open-weight models that it claims will make enterprise safety work more agile. gpt-oss-safeguard-120b and gpt-oss-safeguard-20b are available under the permissive Apache 2.0 license. The models are fine-tuned versions of OpenAI's open-source gpt-oss, released in August, making this the first release in the gpt-oss family since the summer.
In a blog post, OpenAI said that gpt-oss-safeguard uses reasoning "to directly interpret a developer-provided policy at the time of inference – classifying user messages, completions, and full chats according to developer needs."
The company explained that because the model uses chain-of-thought (CoT) reasoning, developers can review explanations of the model's decisions.
"Additionally, policies are provided during inference rather than trained into the model, so developers can iteratively revise policies to improve performance," OpenAI said in its post. "This approach, which we initially developed for internal use, is much more flexible than the traditional method of training a classifier to indirectly infer a decision boundary from a large number of labeled examples."
Developers can download both models from Hugging Face.
Flexibility vs. baked-in safeguards
Out of the box, an AI model will not know a company's preferred safety triggers. While model providers red-team their models and platforms, those safeguards are designed for broad applicability. Companies like Microsoft and Amazon Web Services also offer platforms that bring guardrails to AI applications and agents.
Enterprises use safety classifiers, which help train a model to recognize patterns of good and bad inputs. This helps models learn which queries they should not answer, and helps ensure that they do not drift into responding to unsafe ones.
"Traditional classifiers can offer high performance, low latency and low operating costs," OpenAI said. "However, gathering a sufficient quantity of training examples can be time-consuming and costly, and updating or changing the policy requires retraining the classifier."
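To make the contrast concrete, here is a minimal sketch of the traditional workflow the article describes: a classifier learns a decision boundary from labeled examples. The tiny bag-of-words scorer and the toy data below are purely illustrative, not anything from OpenAI's pipeline.

```python
# Sketch of the "traditional classifier" approach: learn which inputs are
# bad from labeled examples. Scorer and data are illustrative only.
from collections import Counter

def train(examples: list[tuple[str, int]]) -> tuple[Counter, Counter]:
    """Count word frequencies per class (1 = violating, 0 = benign)."""
    bad, good = Counter(), Counter()
    for text, label in examples:
        (bad if label else good).update(text.lower().split())
    return bad, good

def predict(bad: Counter, good: Counter, text: str) -> int:
    """Label text by whichever class its words resemble more."""
    words = text.lower().split()
    bad_score = sum(bad[w] for w in words)
    good_score = sum(good[w] for w in words)
    return 1 if bad_score > good_score else 0

labeled = [
    ("how to build a weapon", 1),
    ("sell me a weapon", 1),
    ("how to bake bread", 0),
    ("recipe for bread", 0),
]
bad, good = train(labeled)

# If the policy changes, there is no policy text to edit: new labeled
# examples must be collected and the classifier retrained from scratch.
```

This is the cost OpenAI is pointing at: the policy lives implicitly in the labels, so every rule change means a fresh round of data collection and retraining.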
The models take two inputs at once: a policy and the content to classify under that policy's guidelines. They then reason about where the content falls. OpenAI has found that the models perform best in situations where:
- Potential harms are emerging or evolving, and policies need to adapt quickly.
- The domain is highly nuanced and difficult for smaller classifiers to navigate.
- Developers don't have enough samples to train a high-quality classifier for every risk on their platform.
- Latency is less critical than producing high-quality, explainable labels.
The company said gpt-oss-safeguard "is different because its reasoning capabilities allow developers to apply any policy," including policies they write at inference time.
The models are based on OpenAI's internal tool, Safety Reasoner, which lets its teams set guardrails iteratively. Teams often start with very strict safety policies, "using relatively large amounts of compute where needed," and then adjust the policies as the model moves into production and risk assessments change.
Ensuring safety
OpenAI found that the gpt-oss-safeguard models outperform its GPT-5 approach and the original gpt-oss models on multi-policy accuracy in its benchmarks. The company also ran the models on the public ToxicChat benchmark, where they performed well, although the GPT-5 approach and Safety Reasoner slightly outperformed them.
However, there are concerns that this approach may centralize security standards.
"Safety is not a well-defined concept. Any implementation of safety standards will reflect the values and priorities of the organization creating them, as well as the limitations and shortcomings of its models," said John Thickstun, an assistant professor of computer science at Cornell University. "If the industry as a whole adopts the standards developed by OpenAI, we risk institutionalizing one particular view of safety and sidelining broader research into the safety needs of AI deployments across many sectors of society."
It is also worth noting that OpenAI has not released the base model for the gpt-oss family, so developers cannot fully iterate on the models.
However, OpenAI believes the developer community can help improve gpt-oss-safeguard. A hackathon will be held on December 8 in San Francisco.
