July 31, 2024 – On July 30, OpenAI announced that it has begun rolling out the GPT-4o voice mode, currently in alpha, to a small group of ChatGPT Plus subscribers. The feature is expected to reach all ChatGPT Plus members gradually by autumn this year.
In a May presentation, OpenAI Chief Technology Officer Mira Murati shared details about GPT-4o: the company trained a single end-to-end model spanning text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network.
Because GPT-4o is OpenAI's first model to combine these modalities, Murati acknowledged that the company is still in the early stages of exploring its capabilities and limitations.
OpenAI had initially intended to invite a small group of ChatGPT Plus users to test GPT-4o's voice mode by the end of June. In June, however, the company announced a delay, saying it needed more time to refine the model and improve its ability to detect and refuse certain content.
OpenAI previously disclosed that the existing voice mode had an average response latency of 2.8 seconds with GPT-3.5 and 5.4 seconds with GPT-4, making it poorly suited to natural voice conversation. GPT-4o is expected to cut this latency dramatically, enabling near-seamless dialogue.
The GPT-4o voice mode is touted for its quick responsiveness and lifelike voice quality. OpenAI also claims that GPT-4o can pick up on emotional tones in speech, such as sadness or excitement, and can even recognize singing.
According to OpenAI spokesperson Lindsay McCallum, ChatGPT cannot imitate the voices of specific individuals or public figures and is designed to block outputs that deviate from its preset voices.