MLPerf inference v6.0 benchmark released as AI inference datacentre chip race intensifies

Inside NHN Cloud's Gwangju National AI Data Center built with GPUs including Nvidia's H100 [Photo: NHN Cloud]

MLCommons released the MLPerf Inference v6.0 benchmark results on April 1 local time. Twenty-three companies submitted 451 results, allowing a direct comparison of datacentre accelerator performance from major chipmakers including Nvidia, AMD and Intel.

This round added large generative AI models such as DeepSeek-R1 and Llama 3.1 405B as new benchmarks. It also included the vision-language model Qwen3-VL 235B and the video-generation model Wan 2.2 for the first time, reflecting a trend of datacentre inference benchmarks expanding beyond text generation into video and multimodal workloads.

Nvidia posted top results in many benchmarks, led by the GB300 and B300 accelerators based on its Blackwell architecture. In the DeepSeek-R1 server scenario, it used 72 GB300 nodes with 4 GPUs each, 288 in total, to process 1.55 million tokens per second. On a single node with 8 B300 GPUs, it achieved 107,317 tokens per second in the Llama 2 70B server scenario and 42,721 tokens per second on DeepSeek-R1. Partners including Cisco and Asustek also submitted B300-based systems delivering about 100,000 to 110,000 tokens per second on Llama 2 70B.

AMD entered an 11-node cluster built with Instinct MI355X GPUs, with 8 GPUs per node for a total of 88. It recorded 1,016,375 tokens per second in the Llama 2 70B server scenario. On a single node with 8 MI355X GPUs, it reached 100,282 tokens per second, comparable to Nvidia's single-node B300 result. Dell, HPE, Giga Computing, Supermicro and Oracle participated with MI355X-based systems, posting results of about 93,000 to 98,000 tokens per second. Cisco and MiTAC recorded 76,000 to 77,000 tokens per second with MI350X-based systems.
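As a rough illustration of how the single-node and cluster figures above relate, the short Python sketch below derives per-GPU throughput and a naive scaling efficiency from the numbers quoted in this article. The function names are hypothetical, and the calculation assumes the cluster and single-node submissions are otherwise comparable, which the published results do not guarantee.

```python
def per_gpu(tokens_per_second: float, gpus: int) -> float:
    """Average throughput per accelerator."""
    return tokens_per_second / gpus


def scaling_efficiency(cluster_tps: float, cluster_gpus: int,
                       node_tps: float, node_gpus: int) -> float:
    """Per-GPU throughput of the cluster relative to a single node."""
    return per_gpu(cluster_tps, cluster_gpus) / per_gpu(node_tps, node_gpus)


# Figures quoted in the article (Llama 2 70B, server scenario).
amd_cluster_gpus = 11 * 8          # 11 nodes, 8 GPUs per node
amd_cluster_tps = 1_016_375
amd_node_tps = 100_282             # single node, 8 MI355X
nvidia_node_tps = 107_317          # single node, 8 B300

amd_per_gpu = per_gpu(amd_node_tps, 8)
nvidia_per_gpu = per_gpu(nvidia_node_tps, 8)
efficiency = scaling_efficiency(amd_cluster_tps, amd_cluster_gpus,
                                amd_node_tps, 8)

print(f"MI355X per GPU (single node): {amd_per_gpu:,.0f} tokens/s")
print(f"B300 per GPU (single node):   {nvidia_per_gpu:,.0f} tokens/s")
print(f"AMD cluster scaling efficiency: {efficiency:.1%}")
```

On these inputs the single-node MI355X result works out to roughly 12,500 tokens per second per GPU against about 13,400 for the B300, and the 11-node cluster retains a little over 90 percent of single-node per-GPU throughput.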

An AMD official said the AMD Instinct MI355X GPU achieved performance of more than 1 million tokens per second on new generative AI workloads and demonstrated scalable inference performance. The official said the results highlighted a large generational leap in throughput and proved broad competitiveness on a per-GPU basis for key LLMs such as Llama 2 70B. The official added that stable multi-node scalability was also verified based on a strong partner ecosystem including Dell, HPE and Cisco.

Intel submitted results combining Xeon 6 processors with Arc Pro B-series GPUs. A system using 4 Arc Pro B60 GPUs recorded 1,106 tokens per second in the Llama 2 70B server scenario, while a configuration with 4 Arc Pro B70 GPUs posted 1,698 tokens per second. Throughput is far lower than that of dedicated datacentre GPU accelerators, but the entry is seen as part of Intel's effort to expand a portfolio targeting the CPU-based inference market. Intel also submitted a result showing its Xeon 6980P processor alone processed 9.6 tokens per second in the Llama 3.1 8B offline scenario.

Anil Nanduri, vice president of AI products and GTM for Intel's Data Center Group, said the combination of Intel Xeon 6 and Intel Arc Pro B-series GPUs is Intel's investment to expand customer options and value. He said it offers practical solutions and the best performance and value for graphics professionals and AI developers worldwide, covering workloads from large language models to traditional machine learning.

Daegeon Seok d2dg@d-today.co.kr
