Cerebras Systems, fresh off the largest tech IPO of 2026, demonstrated its wafer-scale chips running Moonshot AI’s trillion-parameter Kimi K2.6 model at 981 output tokens per second: 6.7x faster than the next-fastest GPU cloud provider and 23x faster than the median provider. The result, independently verified by Artificial Analysis, positions Cerebras as a major contender in the rapidly growing AI inference market.
What Did Cerebras Achieve?
Cerebras announced it is now running Kimi K2.6, a trillion-parameter open-weight model developed by Beijing-based Moonshot AI, for enterprise customers at nearly 1,000 tokens per second. In the independent benchmark it clocked 981 output tokens per second, 6.7x faster than the next-fastest GPU-based cloud provider and 23x faster than the median provider. On a standard agentic coding request, Cerebras delivered the full response in 5.6 seconds, compared with 163.7 seconds on the official Kimi endpoint.
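The speed multiples can be cross-checked with simple arithmetic. This sketch uses only the figures reported above. Because the 6.7x and 23x multiples are rounded, the implied rates for the other providers are approximate. The 5.6-second versus 163.7-second comparison is against the official Kimi endpoint and presumably includes time to first token, so it is not the same measurement as the tokens-per-second ratios.

```python
# Figures reported in the article.
CEREBRAS_TPS = 981.0           # output tokens per second on Kimi K2.6
SPEEDUP_VS_NEXT_FASTEST = 6.7  # vs. the next-fastest GPU cloud provider
SPEEDUP_VS_MEDIAN = 23.0       # vs. the median provider
CEREBRAS_LATENCY_S = 5.6       # full response, agentic coding request
OFFICIAL_LATENCY_S = 163.7     # same request on the official Kimi endpoint


def implied_rate(fast_tps: float, speedup: float) -> float:
    """Throughput a slower provider must have for fast_tps to be `speedup` times faster."""
    return fast_tps / speedup


next_fastest_tps = implied_rate(CEREBRAS_TPS, SPEEDUP_VS_NEXT_FASTEST)
median_tps = implied_rate(CEREBRAS_TPS, SPEEDUP_VS_MEDIAN)
end_to_end_speedup = OFFICIAL_LATENCY_S / CEREBRAS_LATENCY_S

print(f"next-fastest GPU provider: ~{next_fastest_tps:.0f} tokens/s")
print(f"median provider:           ~{median_tps:.0f} tokens/s")
print(f"end-to-end speedup vs official endpoint: ~{end_to_end_speedup:.1f}x")
```

With these inputs the implied rates come out to roughly 146 tokens per second for the next-fastest provider and 43 for the median, and the end-to-end request is about 29x faster.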
What Is Kimi K2.6?
Released by Moonshot AI on April 20, K2.6 is a trillion-parameter Mixture-of-Experts model that has rapidly established itself as the most capable open-weight model for coding and agentic tasks. It tops SWE-Bench Pro with a score of 58.6, outperforming Claude Opus 4.6 and matching GPT-5.4. The model activates 32 billion of its 1 trillion total parameters per token, uses 384 experts, and supports a 256,000-token context window.
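The "32 billion activated out of 1 trillion" idea can be made concrete with a toy sketch of generic top-k Mixture-of-Experts routing. This is not Moonshot's implementation. A router scores all 384 experts for each token, only the k highest-scoring experts run, and their outputs are blended with softmax weights. The value k = 8, the toy hidden size and the scalar "experts" are illustrative assumptions; real experts are learned feed-forward networks. Because only the selected experts' weights are used for a given token, per-token compute tracks the activated parameters rather than the full trillion.

```python
import math
import random

NUM_EXPERTS = 384  # expert count reported for Kimi K2.6
TOP_K = 8          # ASSUMPTION: experts activated per token, for illustration only
DIM = 16           # toy hidden size, far smaller than any real model


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def make_layer(seed=0):
    """Build a random router and toy experts (each expert just scales its input)."""
    rng = random.Random(seed)
    router = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
    scales = [rng.uniform(0.5, 1.5) for _ in range(NUM_EXPERTS)]
    return router, scales


def moe_forward(x, router, scales):
    """Route one token through the TOP_K best-scoring experts."""
    # 1. Score every expert for this token (dot product with its router vector).
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router]
    # 2. Keep only the TOP_K highest-scoring experts.
    chosen = sorted(range(NUM_EXPERTS), key=lambda i: logits[i], reverse=True)[:TOP_K]
    # 3. Normalise gate weights over the chosen experts only.
    gates = softmax([logits[i] for i in chosen])
    # 4. Blend the chosen experts' outputs; the other experts are never evaluated.
    out = [0.0] * DIM
    for gate, i in zip(gates, chosen):
        for d in range(DIM):
            out[d] += gate * scales[i] * x[d]
    return out, chosen


router, scales = make_layer()
rng = random.Random(1)
token = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
output, active = moe_forward(token, router, scales)

print(f"experts evaluated for this token: {len(active)} of {NUM_EXPERTS}")
print(f"activated share of parameters (32B of 1T): {32e9 / 1e12:.1%}")
```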
Why Does Inference Speed Matter Now?
As AI agents proliferate in enterprise software, inference speed directly determines how useful those agents are in practice. The inference market is rapidly overtaking training as the most commercially important compute workload. Nvidia’s recent deal, valued at roughly $20 billion, for Groq’s inference technology underscores the strategic importance of fast inference.
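The link between speed and agent usefulness comes from the fact that agents typically call the model repeatedly, with each call waiting on the previous one. A back-of-the-envelope sketch uses the article's 981 tokens per second and the median rate implied by the 23x figure. The workload of 20 sequential calls producing 2,000 output tokens each is hypothetical.

```python
# Figures from the article: Cerebras's rate, and the median rate implied by "23x faster".
CEREBRAS_TPS = 981.0
MEDIAN_TPS = CEREBRAS_TPS / 23.0

# HYPOTHETICAL workload: an agent making sequential model calls.
STEPS = 20              # hypothetical number of sequential model calls
TOKENS_PER_STEP = 2000  # hypothetical output tokens generated per call


def agent_generation_time(steps: int, tokens_per_step: int, tps: float) -> float:
    """Seconds spent generating output across sequential steps.

    Ignores time to first token, tool execution and network latency,
    so it is a lower bound on wall-clock time.
    """
    return steps * tokens_per_step / tps


fast = agent_generation_time(STEPS, TOKENS_PER_STEP, CEREBRAS_TPS)
slow = agent_generation_time(STEPS, TOKENS_PER_STEP, MEDIAN_TPS)
print(f"at {CEREBRAS_TPS:.0f} tok/s: {fast / 60:.1f} min")
print(f"at {MEDIAN_TPS:.0f} tok/s:  {slow / 60:.1f} min")
```

Under these assumptions, the same task takes well under a minute at 981 tokens per second but roughly a quarter of an hour at the median rate. Real runs add tool execution and network time on top.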
How Does Cerebras’s OpenAI Deal Fit In?
Cerebras CEO Andrew Feldman confirmed that Cerebras will serve OpenAI’s forthcoming internal coding models as part of a deal reportedly worth more than $20 billion for computing capacity. Neither party has publicly detailed the technical arrangement, but the relationship highlights Cerebras’s unique position as both a competitor and a supplier to major AI labs.
Key Takeaways
- 981 tokens per second on a trillion-parameter model, 6.7x faster than the next-fastest GPU cloud provider
- Achieved on Kimi K2.6, a Moonshot AI Mixture-of-Experts model
- Independently verified by benchmarking firm Artificial Analysis
- Cerebras completed the largest tech IPO of 2026 days before this announcement
- Nvidia’s roughly $20B Groq deal underscores inference as a key market
- Cerebras reportedly has a $20B+ deal with OpenAI for computing capacity
Frequently Asked Questions
How does Cerebras achieve such speed? Cerebras uses wafer-scale integration, building a single massive chip the size of a wafer rather than stitching together many smaller GPUs, which eliminates much of the communication overhead.
Who can use Kimi K2.6 on Cerebras? Cerebras is offering the model to enterprise customers on its cloud platform.
Does this mean GPUs are obsolete for inference? Not obsolete, but Cerebras’s results show that specialized architectures can significantly outperform general-purpose GPUs for specific inference workloads.