§ feed · storyline

Promptable Prosody, SOTA ASR, and Semantic VAD: OpenAI revamps Voice AI

OpenAI launches three audio models in its API, including gpt-4o-transcribe and gpt-4o-mini-tts with promptable prosody, plus a semantic VAD update and audio support in its Agents SDK.

Mar 20 · 23:51:24 · primary fetch · 1 source · updated Mar 20 · 23:51:24

OpenAI has launched three new state-of-the-art audio models in its API, including gpt-4o-transcribe, a speech-to-text model that outperforms Whisper, and gpt-4o-mini-tts, a text-to-speech model with promptable prosody that lets developers control timing and emotion. The Agents SDK now supports audio, which makes it possible to build voice agents. OpenAI also updated turn detection for real-time use: voice activity detection (VAD) can now be based on the content of speech (semantic VAD).
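As a rough illustration of what "promptable prosody" and semantic VAD could look like from a developer's side, the sketch below only builds request payloads and makes no network call. The model name gpt-4o-mini-tts comes from the summary above. Everything else is an assumption drawn from memory of the public API rather than from this storyline: the field names `input`, `voice`, `instructions`, `turn_detection`, and `eagerness`, the `session.update` event type, and the voice name `coral`. The helper function names are hypothetical. Check each of these against the official API reference before relying on them.

```python
import json


def build_tts_request(text: str, style: str, voice: str = "coral") -> dict:
    """Build a JSON body for a text-to-speech request (no network call).

    `style` is a free-text prompt describing delivery (pace, emotion).
    Field names are assumptions; see the lead-in above.
    """
    if not text.strip():
        raise ValueError("text must not be empty")
    return {
        "model": "gpt-4o-mini-tts",  # model name taken from the storyline
        "voice": voice,              # assumed voice name
        "input": text,               # the words to speak
        "instructions": style,       # assumed field carrying the prosody prompt
    }


def build_semantic_vad_update(eagerness: str = "auto") -> dict:
    """Build a session-update event that switches turn detection to
    content-based (semantic) VAD. Event shape is an assumption."""
    return {
        "type": "session.update",
        "session": {
            "turn_detection": {
                "type": "semantic_vad",
                "eagerness": eagerness,
            }
        },
    }


if __name__ == "__main__":
    body = build_tts_request(
        "Your order has shipped.",
        "Speak slowly, in a warm and reassuring tone.",
    )
    print(json.dumps(body, indent=2))
    print(json.dumps(build_semantic_vad_update(), indent=2))
```

The point of the sketch is that the spoken text and the delivery style travel as two separate fields, so the same sentence can be re-rendered with a different tone by changing only the style prompt.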

Additionally, OpenAI's o1-pro model is available to select developers, with advanced features such as vision and function calling, though at higher compute costs. The community has responded to the audio releases with strong enthusiasm, and a radio contest for TTS creations is underway. Meanwhile, Kokoro-82M v1.0 has emerged as a leading open-weights TTS model with competitive pricing on Replicate.

read full article on news.smol.ai ↗

§ sources · 1 publication · timeline below

news.smol.ai · Promptable Prosody, SOTA ASR, and Semantic VAD: OpenAI revamps Voice AI · primary · 23:51:24