Cerebras compresses 163 seconds into 5 seconds, says GPU era is over

Cerebras Systems unveiled AI acceleration technology approaching 1,000 tokens per second. [Photo: Cerebras]

[DigitalToday intern reporter Kyung-min Hong (홍경민)] AI chip designer Cerebras has deployed Kimi K2.6, a 1 trillion-parameter open-weight model, in an inference service for enterprise customers and reported a speed of 981 tokens per second, which it called the world’s fastest.

On May 19 (local time), Cerebras and blockchain outlet Cryptopolitan said the service set records simultaneously in speed, performance and model scale, signaling a shift in the agentic coding landscape.

Cerebras is pushing ahead with an initial public offering and is rapidly building its presence. Its filing shows 2025 revenue of $510 million, up 76 percent from a year earlier, and net profit of $238 million, marking a return to profitability.

In January, it also signed a long-term computing contract with OpenAI worth $20 billion and running through 2028. In March, it signed a deal with Amazon Web Services to deploy Cerebras systems in AWS data centers.

Those companies chose Cerebras for its standout inference speed. AI benchmarking group Artificial Analysis measured Cerebras’ K2.6 inference speed at 981 tokens per second. That is 6.7 times faster than the second-fastest GPU-based cloud and 23 times faster than the median among inference services.

The gap in time to a complete response is even wider than the gap in raw output speed. For a 10,000-token input, Cerebras took 5.6 seconds to complete a 500-token output, while the official Kimi endpoint took 163.7 seconds. That is a 29-fold difference in the time needed to reach the final answer.
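The arithmetic behind those figures can be checked with a short sketch. The four constants come from the measurements quoted above; the split of Cerebras’ 5.6 seconds into generation time and everything else is only an illustration, and it assumes the 981 tokens-per-second rate applied to this particular request, which the article does not state.

```python
# Figures quoted in the article (Artificial Analysis measurements).
OUTPUT_TOKENS = 500
CEREBRAS_TOTAL_S = 5.6       # 10,000-token input, 500-token output
KIMI_TOTAL_S = 163.7         # same request on the official Kimi endpoint
CEREBRAS_TOKENS_PER_S = 981  # reported output speed


def speedup(slow_s: float, fast_s: float) -> float:
    """How many times faster the fast service finished."""
    return slow_s / fast_s


def generation_time(output_tokens: int, tokens_per_s: float) -> float:
    """Seconds spent emitting tokens at a steady output rate."""
    return output_tokens / tokens_per_s


ratio = speedup(KIMI_TOTAL_S, CEREBRAS_TOTAL_S)

# ASSUMPTION: the 981 tokens/s rate held for this request.
gen_s = generation_time(OUTPUT_TOKENS, CEREBRAS_TOKENS_PER_S)

# Whatever remains is time not spent emitting tokens (input processing,
# queueing, network and so on). The article does not break this down.
other_s = CEREBRAS_TOTAL_S - gen_s

print(f"end-to-end speedup: {ratio:.1f}x")
print(f"generation: {gen_s:.2f} s, everything else: {other_s:.2f} s")
```

Under that assumption, emitting the 500 output tokens accounts for only about half a second of the total, which suggests that most of the measured time for a long-input request lies outside token generation.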

The speed is matched by K2.6’s model quality. K2.6 is rated as the top open-weight model in coding and agentic tasks. It scored 58.6 on SWE-bench Pro, beating Claude Opus 4.6, and shows performance comparable to GPT-5.4. Its capabilities go beyond simple code generation to cover full-stack workflows, from frontend design to authentication, database handling and long-running agent execution.

Cerebras’ proprietary hardware architecture made that performance possible. Cerebras runs the model on a CS-3 cluster based on its Wafer Scale Engine (WSE). The system stores K2.6’s original 4-bit weights and performs computation in 16-bit floating point, with the weights distributed across multiple wafers. Inter-wafer communication runs on an on-wafer network fabric with more than 200 times the bandwidth of NVLink NVL72, and custom kernels combined with speculative decoding raised the final speed further.
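Cerebras has not published the details of its speculative decoding setup, so the following is only a minimal sketch of the general technique in its simplest greedy form: a small draft model guesses several tokens ahead, and the large target model keeps the guesses that match its own choices. The function names and the toy models are hypothetical, and production systems typically verify all guesses in one batched pass and use probabilistic acceptance rather than exact matching.

```python
from typing import Callable, List, Sequence

# A "model" here is any function mapping a token context to the next token.
NextToken = Callable[[Sequence[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: the target model emits one token per step."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: the draft proposes, the target verifies."""
    out = list(prompt)
    limit = len(prompt) + max_new
    while len(out) < limit:
        # 1. The cheap draft model proposes up to k tokens in sequence.
        proposal: List[int] = []
        for _ in range(min(k, limit - len(out))):
            proposal.append(draft(out + proposal))

        # 2. The target checks each proposed position. Real systems do this
        #    in a single batched forward pass, which is where the speedup
        #    comes from; this toy version calls the target once per position.
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if tok != expected:
                # First mismatch: keep the agreed prefix plus the target's token.
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            # Every proposed token matched the target's own choice.
            out.extend(proposal)
    return out


# Hypothetical toy models: the draft agrees with the target most of the time.
def toy_target(ctx: Sequence[int]) -> int:
    return (3 * ctx[-1] + len(ctx)) % 17


def toy_draft(ctx: Sequence[int]) -> int:
    guess = toy_target(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 17


baseline = greedy_decode(toy_target, [1, 2], 20)
speculated = speculative_decode(toy_target, toy_draft, [1, 2], 20)
print(speculated == baseline)
```

The point of the design is that the output is identical to what the target model would produce on its own; the draft model only changes how many tokens can be confirmed per verification step.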

The implications of speed go beyond numbers to changes in development itself. Agentic coding is currently the highest-value use case for LLMs and the most speed-sensitive workload. At speeds approaching 1,000 tokens per second, developers can build in real time instead of repeating cycles of waiting and reviewing, and can reduce inefficiency from switching among multiple agents running in parallel.

Cerebras is currently running K2.6 as a trial service for enterprise customers. With inference speed emerging as a core competitive edge in agentic AI, attention is on whether the existing GPU-centered inference market will be shaken.

Kyung-min Hong hongm@d-today.co.kr
