OpenAI announces multimodal desktop GPT with new voice and vision capabilities


After weeks of speculation, ChatGPT creator OpenAI announced a new desktop version of ChatGPT, a user interface upgrade, and a new flagship model called GPT-4o that allows consumers to interact using text, voice, and visual prompts.

GPT-4o can recognize and respond to screenshots, photos, documents, or charts uploaded to it. The new GPT-4o model can also recognize facial expressions and information written by hand on paper. OpenAI said the improved model and accompanying chatbot can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, “which is similar to human response time in a conversation.”

The previous versions of GPT also had a conversational Voice Mode, but they had latencies of 2.8 seconds (in GPT-3.5) and 5.4 seconds (in GPT-4) on average.

GPT-4o now matches the performance of GPT-4 Turbo (released in November) on text in English and code, with significant improvement on text in non-English languages, while also being faster and 50% cheaper in the API version, according to OpenAI Chief Technology Officer Mira Murati.

“GPT-4o is especially better at vision and audio understanding compared to existing models,” OpenAI said in its announcement.

During an on-stage event, Murati said GPT-4o will also have new memory capabilities, giving it the ability to learn from previous conversations with users and add that to its answers.

Chirag Dekate, a Gartner vice president analyst, said that while he was impressed with OpenAI’s multimodal large language model (LLM), the company was clearly playing catch-up to competitors, in contrast to its earlier status as an industry leader in generative AI tech.

“You’re now starting to see GPT enter into the multimodal era,” Dekate said. “But they’re playing catch-up to where Google was three months ago when it announced Gemini 1.5, which is its native multimodal model with a one-million-token context window.”

Still, the capabilities demonstrated by GPT-4o and its accompanying ChatGPT chatbot are impressive for a natural language processing engine. It displayed improved conversational ability: users can interrupt it and begin new or modified queries, and it supports 50 languages. In one onstage live demonstration, the Voice Mode was able to translate back and forth between Murati speaking Italian and Barret Zoph, OpenAI’s head of post-training, speaking English.

During a live demonstration, Zoph also wrote out an algebraic equation on paper while ChatGPT watched through his phone’s camera lens. Zoph then asked the chatbot to talk him through the solution.

While the voice recognition and conversational interactions were extremely human-like, there were also noticeable glitches: the bot occasionally cut out mid-conversation and picked back up moments later.

The chatbot was then asked to tell a bedtime story. The presenters were able to interrupt it, have it add more emotion to its voice intonation, and even have it switch to a computer-like rendition of the story.

In another demo, Zoph brought up software code on his laptop screen and used GPT-4o’s voice app to have it evaluate the code and determine what it was: a weather charting app. GPT-4o was then able to read the app’s chart and identify data points on it related to high and low temperatures.

From left to right, OpenAI CTO Mira Murati, head of Frontiers Research Mark Chen, and head of post-training Barret Zoph demonstrate GPT-4o’s ability to interpret a graphic’s data during an onstage event. 


Murati said GPT-4o’s text and image capabilities will be rolled out iteratively with extended “red team” access starting today.

Paying ChatGPT Plus users will have up to five times higher message limits. A new version of Voice Mode with GPT-4o will arrive in alpha in the coming weeks, Murati said.

Model developers can also now access GPT-4o in the API as a text and vision model. The new model is two times faster, half the price, and has five times higher rate limits compared to GPT-4 Turbo, Murati said.
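As a rough sketch of what text-and-vision access might look like, the snippet below assembles a request payload mixing a text prompt with an image, in the message format documented for OpenAI’s chat completions API. The helper function, prompt, and image URL are illustrative placeholders, not code from OpenAI.

```python
# Illustrative sketch of a combined text + vision request payload for
# the "gpt-4o" model. Field names follow OpenAI's documented chat API;
# the helper name and image URL are invented for illustration.

def build_gpt4o_request(prompt: str, image_url: str) -> dict:
    """Assemble a chat completion payload mixing text and image input."""
    return {
        "model": "gpt-4o",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }

payload = build_gpt4o_request(
    "What does this chart show?",
    "https://example.com/weather-chart.png",
)
print(payload["model"])  # gpt-4o
```

In practice this payload would be sent through the official client library with an API key; it is shown here only to illustrate that a single request can interleave text and image inputs.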

“We plan to launch support for GPT-4o’s new audio and video capabilities to a small group of trusted partners in the API in the coming weeks,” she said.

Zoph demonstrates using his smartphone’s camera how GPT-4o can read math equations written on paper and assist a user in solving them.


What was not clear in OpenAI’s GPT-4o announcement, Dekate said, was the context size of the input window, which for GPT-4 is 128,000 tokens. “Context size helps define the accuracy of the model. The larger the context size, the more data you can input and the better outputs you get,” he said.

Google’s Gemini 1.5, for example, offers a one-million-token context window, making it the longest of any large-scale foundation model to date. Next in line is Anthropic’s Claude 2.1, which offers a context window of up to 200,000 tokens. Google’s larger context window means an application’s entire code base can fit into the model for updates or upgrades; GPT-4, by contrast, could accept only about 1,200 lines of code, Dekate said.

An OpenAI spokesperson said GPT-4o’s context window size remains at 128k.
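To put those window sizes side by side, the back-of-the-envelope sketch below converts them into a rough lines-of-code capacity. The tokens-per-line figure is a loose assumption for illustration only; practical limits are much lower (closer to the ~1,200-line figure Dekate cites), since the window must also hold the prompt and leave room for output, and tokenizer behavior varies by language.

```python
# Back-of-the-envelope comparison of the context windows mentioned above.
# TOKENS_PER_LINE is an assumed average, not a measured figure; real
# capacity depends on the tokenizer and on space reserved for output.

CONTEXT_WINDOWS = {
    "GPT-4o": 128_000,
    "Claude 2.1": 200_000,
    "Gemini 1.5": 1_000_000,
}

TOKENS_PER_LINE = 10  # rough assumption for source code

for model, window in CONTEXT_WINDOWS.items():
    lines = window // TOKENS_PER_LINE
    print(f"{model}: {window:,} tokens, roughly {lines:,} lines of input")
```

The ratio is the point: Gemini 1.5's window is nearly eight times the size of GPT-4o's, whatever per-line estimate one assumes.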

Mistral also announced its LLaVA-NeXT multimodal model earlier this month. And Google is expected to make further Gemini 1.5 announcements at its Google I/O event tomorrow.

“I would argue in some sense that OpenAI is now playing catch-up to Meta, Google, and Mistral,” Dekate said.

Nathaniel Whittemore, CEO of AI training platform Superintelligent, called OpenAI’s announcement “the most divisive” he’d ever seen.

“Some feel like they’ve glimpsed the future; the vision from Her brought to real life. Others are left saying, ‘that’s it?’” he said in an email reply. “Part of this is about what this wasn’t: it wasn’t an announcement about GPT-4.5 or GPT-5. There is so much attention on the state-of-the-art horse race that, for some, anything less than that was going to be a disappointment no matter what.”

Murati said OpenAI recognizes that GPT-4o’s real-time audio and visual recognition will also present new opportunities for misuse. She said the company will continue to work with various entities, including government, the media, and the entertainment industry, to address the security issues.

The previous version of ChatGPT also had a Voice Mode, but it used three separate models: one transcribes audio to text, a second takes in text and outputs text, and a third converts that text back to audio. That pipeline, Murati explained, means the central model can’t directly observe tone, multiple speakers, or background noise, and it can’t output laughter, singing, or expressed emotion. GPT-4o, however, uses a single end-to-end model across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network for more of a real-time experience.
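The architectural difference can be sketched as two call graphs. The functions below are stand-in stubs with invented names, not real models; the point is only that the cascaded design hands data through three lossy stages, while an end-to-end model maps audio to audio in one step.

```python
# Illustrative contrast between the cascaded Voice Mode pipeline and a
# single end-to-end model. All functions are stubs for illustration.

def transcribe(audio: bytes) -> str:
    """Model 1 (speech-to-text): tone and non-speech cues are lost here."""
    return "hello"

def generate_reply(text: str) -> str:
    """Model 2 (text-to-text): sees only the transcript."""
    return f"you said: {text}"

def synthesize(text: str) -> bytes:
    """Model 3 (text-to-speech): can't re-inject emotion it never saw."""
    return text.encode()

def cascaded_voice_mode(audio: bytes) -> bytes:
    """Three hand-offs, each discarding information."""
    return synthesize(generate_reply(transcribe(audio)))

def end_to_end_voice_mode(audio: bytes) -> bytes:
    """One network maps audio in to audio out, so prosody, laughter,
    and multiple speakers can in principle survive the round trip."""
    return b"audio reply with intonation"
```

Cutting out the two hand-offs is also what removes the transcription and synthesis latency that dominated the older Voice Mode's multi-second response times.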

“Because GPT-4o is our first model combining all of these modalities, we are still just scratching the surface of exploring what the model can do and its limitations,” Murati said. “Over the next few weeks, we will continue iterative deployments to bring them to you.”
