OpenAI launched three new real-time voice models in the API: GPT-Realtime-2 (its first voice model with GPT-5-class reasoning), GPT-Realtime-Translate (live translation from 70+ input languages into 13 output languages), and GPT-Realtime-Whisper (streaming speech-to-text). These models separate reasoning, translation, and transcription into specialized components, enabling developers to build more natural voice agents.
What Makes GPT-Realtime-2 Different From Previous Voice Models?
GPT-Realtime-2 is OpenAI’s first voice model with GPT-5-class reasoning capabilities. It can follow multi-step instructions, call tools, handle interruptions, and maintain natural conversational flow without the lag typical of voice-to-text-to-voice pipelines. A 128K-token context window supports longer, more coherent sessions.
How Does the New Voice Architecture Work?
Instead of routing every interaction through one massive model, developers can now route specific tasks to specialized models:
- GPT-Realtime-2 for conversational reasoning and complex task orchestration
- GPT-Realtime-Translate for live multilingual translation (70+ input languages, 13 output languages)
- GPT-Realtime-Whisper for low-latency streaming transcription
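The routing idea above can be sketched as a small lookup from task type to model. This is a minimal illustration, not OpenAI's implementation: the `Task` enum, the `pick_model` helper, and the lowercase model identifier strings are all assumptions made for the example, and the identifiers accepted by the Realtime API may differ.

```python
from enum import Enum


class Task(Enum):
    """Kinds of voice work an application might need (hypothetical names)."""
    CONVERSATION = "conversation"
    TRANSLATION = "translation"
    TRANSCRIPTION = "transcription"


# Assumed identifier strings, derived from the product names in the article.
# Check the API documentation for the real model IDs before using them.
MODEL_FOR_TASK = {
    Task.CONVERSATION: "gpt-realtime-2",
    Task.TRANSLATION: "gpt-realtime-translate",
    Task.TRANSCRIPTION: "gpt-realtime-whisper",
}


def pick_model(task: Task) -> str:
    """Return the specialized model assumed to handle the given task."""
    return MODEL_FOR_TASK[task]


if __name__ == "__main__":
    for task in Task:
        print(f"{task.value} -> {pick_model(task)}")
```

Keeping the mapping in one table means an application can change which model serves a task without touching the code that opens sessions.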
What Are the Pricing and Availability Details?
GPT-Realtime-2 is priced at $32 per 1M audio input tokens and $64 per 1M audio output tokens. GPT-Realtime-Translate costs $0.034 per minute. GPT-Realtime-Whisper costs $0.017 per minute. All models are available in the Realtime API.
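To make the rates above concrete, here is a rough cost estimator. It is a sketch under stated assumptions: the function names are hypothetical, only the audio token and per-minute prices quoted above are modeled, and any other charges that might apply are ignored.

```python
from decimal import Decimal

# Rates quoted in the article, in US dollars.
AUDIO_INPUT_PER_1M_TOKENS = Decimal("32")
AUDIO_OUTPUT_PER_1M_TOKENS = Decimal("64")
TRANSLATE_PER_MINUTE = Decimal("0.034")
WHISPER_PER_MINUTE = Decimal("0.017")

ONE_MILLION = Decimal(1_000_000)


def realtime2_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate GPT-Realtime-2 audio cost from token counts (hypothetical helper)."""
    total = (
        Decimal(input_tokens) * AUDIO_INPUT_PER_1M_TOKENS
        + Decimal(output_tokens) * AUDIO_OUTPUT_PER_1M_TOKENS
    )
    return total / ONE_MILLION


def per_minute_cost(minutes: int, rate: Decimal) -> Decimal:
    """Estimate cost for a model billed per minute of audio (hypothetical helper)."""
    return Decimal(minutes) * rate


if __name__ == "__main__":
    # Illustrative usage figures, not measurements.
    print("GPT-Realtime-2:", realtime2_cost(500_000, 250_000))
    print("Translate, 60 min:", per_minute_cost(60, TRANSLATE_PER_MINUTE))
    print("Whisper, 60 min:", per_minute_cost(60, WHISPER_PER_MINUTE))
```

With these illustrative numbers, 500,000 audio input tokens plus 250,000 audio output tokens comes to $32, an hour of translation to $2.04, and an hour of streaming transcription to $1.02. `Decimal` is used instead of floats so the per-minute rates do not pick up rounding noise.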
Key Takeaways
- GPT-Realtime-2: OpenAI's first voice model with GPT-5-class reasoning
- GPT-Realtime-Translate: live translation across 70+ input and 13 output languages
- GPT-Realtime-Whisper: streaming transcription at $0.017/minute
- 128K-token context window in GPT-Realtime-2 for longer sessions
- Modular architecture lets developers route tasks to specialized models
- Available now in the Realtime API
Frequently Asked Questions
Can GPT-Realtime-2 handle interruptions? Yes. The model is designed to handle corrections, interruptions, and context switches naturally, keeping the conversation moving while reasoning through requests.
Which languages does GPT-Realtime-Translate support? It supports more than 70 input languages and 13 output languages, useful for customer support, cross-border sales, education, events, and global platforms.
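How does pricing compare to previous models? The new pricing reflects the specialized nature of each model. GPT-Realtime-2 is priced for premium conversational reasoning at $32 per 1M audio input tokens and $64 per 1M audio output tokens, GPT-Realtime-Translate costs $0.034 per minute, and GPT-Realtime-Whisper offers low-cost streaming transcription at $0.017 per minute.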