On Monday, OpenAI announced a new flagship generative AI model it calls GPT-4o – the “o” stands for “omni,” referring to the model’s ability to handle text, speech and video. GPT-4o will be rolled out “iteratively” across the company’s developer and consumer products over the next few weeks.
OpenAI CTO Mira Murati said that GPT-4o provides “GPT-4-level intelligence” but improves GPT-4’s capabilities across multiple modalities and media.
“GPT-4o reasons across voice, text and vision,” Murati said during a Monday presentation streamed from OpenAI’s offices in San Francisco. “And this is incredibly important, because we’re looking at the future of interaction between ourselves and machines.”
GPT-4 Turbo, OpenAI’s previous most advanced model, was trained on a combination of images and text and could analyze both to perform tasks such as extracting text from images or even describing the contents of those images. But GPT-4o adds speech to the mix.
What does this make possible? A range of things.
GPT-4o significantly improves the experience of OpenAI’s AI-powered chatbot, ChatGPT. The platform has long offered a voice mode that converts the chatbot’s responses to speech using a text-to-speech model, but GPT-4o supercharges this feature, allowing users to interact with ChatGPT more like an assistant.
For example, users can ask the GPT-4o-powered ChatGPT a question and interrupt it while it’s responding. The model delivers real-time responsiveness, OpenAI claims, and can even pick up on nuances in a user’s voice, in response generating voices in “different emotional styles” (including singing).
GPT-4o also improves ChatGPT’s vision capabilities. Given a photo – or a computer screen – ChatGPT can now quickly answer related questions ranging from “What’s going on in this software code?” to “What brand of shirt is this person wearing?”
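For developers, the same vision capability is exposed programmatically. Here is a minimal sketch of asking GPT-4o a question about an image through OpenAI’s chat completions API, assuming the official `openai` Python SDK, an `OPENAI_API_KEY` environment variable, and a hypothetical image URL:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Ask GPT-4o a question about an image by sending the text prompt
# and the image URL together in a single user message.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What brand of shirt is this person wearing?"},
                {
                    "type": "image_url",
                    # Hypothetical URL, standing in for any publicly reachable photo.
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```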
Murati says these features will continue to evolve. While today GPT-4o can look at a picture of a menu in a different language and translate it, in the future the model could allow ChatGPT to, for instance, “watch” a live sports game and explain the rules to you.
“We know that these models are becoming more and more complex, but we want the interaction experience to actually become more natural and easier, so that you don’t focus on the user interface at all, but only on the interaction with ChatGPT,” Murati said. “We’ve been very focused on improving the intelligence of these models over the last few years… but this is the first time we’re really making a huge step forward in terms of ease of use.”
OpenAI claims that GPT-4o is also more multilingual, with improved performance in around 50 languages. And in OpenAI’s API and Microsoft’s Azure OpenAI Service, GPT-4o is twice as fast as, half the price of, and has higher rate limits than GPT-4 Turbo, the company says.
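Since GPT-4o is served through the same chat completions endpoint as its predecessor, trying it in an existing integration is, in principle, a one-line model swap. A minimal sketch, again assuming the official `openai` Python SDK and an `OPENAI_API_KEY` environment variable:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The same chat completions call as before; only the model name changes.
response = client.chat.completions.create(
    model="gpt-4o",  # previously "gpt-4-turbo"
    messages=[
        {"role": "user", "content": "In one sentence, what does the 'o' in GPT-4o refer to?"},
    ],
)
print(response.choices[0].message.content)
```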
Voice isn’t currently part of the GPT-4o API for all customers, however. Citing the risk of misuse, OpenAI says it plans to first launch support for GPT-4o’s new audio capabilities with “a small group of trusted partners” in the coming weeks.
GPT-4o is available starting today in ChatGPT’s free tier and to subscribers of OpenAI’s premium ChatGPT Plus and Team plans, with “5x higher” message limits. (OpenAI notes that ChatGPT will automatically switch to GPT-3.5, an older and less capable model, when users hit the rate limit.) The improved GPT-4o-powered ChatGPT voice experience will arrive in alpha for Plus users in the next month or so, alongside enterprise-focused options.
In related news, OpenAI announced that it’s releasing a refreshed ChatGPT UI on the web, with a new, “more conversational” home screen and messaging layout, as well as a desktop version of ChatGPT for macOS that lets users ask questions via a keyboard shortcut or take and discuss screenshots. ChatGPT Plus users will get first access to the app starting today, with a Windows version coming later this year.
Elsewhere, the GPT Store, OpenAI’s library of and creation tools for third-party chatbots built on its AI models, is now available to users on ChatGPT’s free tier. And free users can take advantage of ChatGPT features that were previously paywalled, such as memory, which lets ChatGPT “remember” preferences for future interactions, along with the ability to upload files and photos and search the web for answers to timely questions.