Amid the rapid development of artificial intelligence (AI) technology, the DeepSeek team has unveiled its new DeepSeek-V3/R1 inference system. The system aims to advance AGI (Artificial General Intelligence) through higher throughput and lower latency. To achieve this, DeepSeek adopts Expert Parallelism (EP), which significantly improves GPU computing efficiency and enlarges the batch size while reducing latency.
At the core of DeepSeek-V3/R1 is its extremely high sparsity: only 8 of 256 experts are activated per layer, so a very large overall batch size is required to give each expert enough work to process efficiently. Architecturally, the system disaggregates prefill and decode, applying different degrees of parallelism in each stage.
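To make the sparsity concrete, here is a minimal sketch of top-k expert routing in PyTorch: a gating network scores all 256 experts and keeps only the 8 highest-scoring ones per token. The names (`gate`, `top_k`, etc.) are illustrative, not DeepSeek's actual code.

```python
import torch

num_experts, top_k = 256, 8
batch_tokens, hidden = 32, 1024

x = torch.randn(batch_tokens, hidden)
gate = torch.nn.Linear(hidden, num_experts, bias=False)

gate_logits = gate(x)                                  # (tokens, 256)
weights, expert_ids = gate_logits.topk(top_k, dim=-1)  # keep 8 experts per token
weights = torch.softmax(weights, dim=-1)               # normalize routing weights

# With only 8/256 experts active per token, the expected number of tokens
# reaching any single expert is batch_tokens * 8 / 256 -- hence the need
# for very large batches to keep every expert's GPU busy.
print(expert_ids.shape)  # torch.Size([32, 8])
```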
During the prefill phase, the system hides communication costs with a dual-batch overlap strategy: while one batch of requests is being computed, the communication of another batch proceeds in parallel and is masked by that computation, improving overall throughput. In the decode phase, to address the time imbalance across execution stages, DeepSeek uses a five-stage pipeline to achieve seamless overlap of communication and computation.
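The sketch below illustrates the dual-batch overlap idea with two CUDA streams, assuming a GPU is available. Here `compute()` and `all_to_all()` are hypothetical stand-ins for the real attention/MoE kernels and expert-parallel dispatch; this is not DeepSeek's actual implementation.

```python
import torch

def compute(x):      # stand-in for attention / expert FFN computation
    return x @ x.T

def all_to_all(x):   # stand-in for EP dispatch/combine over NVLink/RDMA
    return x.clone()

comm_stream = torch.cuda.Stream()

def overlapped_step(batch_a, batch_b):
    # Launch batch B's communication on a side stream so it runs
    # concurrently with batch A's computation on the default stream.
    with torch.cuda.stream(comm_stream):
        b_dispatched = all_to_all(batch_b)
    a_out = compute(batch_a)
    # Synchronize before the roles of A and B swap in the next step.
    torch.cuda.current_stream().wait_stream(comm_stream)
    return a_out, b_dispatched

if torch.cuda.is_available():
    a = torch.randn(64, 512, device="cuda")
    b = torch.randn(64, 512, device="cuda")
    out_a, b_ready = overlapped_step(a, b)
```

On each step, the two micro-batches swap roles, so the GPU is never idle waiting on the network; the decode-side five-stage pipeline extends the same principle to finer-grained stages.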
To cope with the load imbalance caused by large-scale parallelism, the DeepSeek team deploys multiple load balancers. These balance the computation and communication load across all GPUs, preventing any single overloaded GPU from becoming a performance bottleneck and ensuring efficient utilization of resources.
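As a rough illustration of the balancing problem, here is a minimal greedy sketch that places the heaviest experts onto the currently least-loaded GPU so no single device becomes a hotspot. The per-expert loads are hypothetical request counts; DeepSeek's actual balancers (which also handle replicated hot experts and communication load) are considerably more involved.

```python
import heapq

def balance(expert_loads, num_gpus):
    """Greedy longest-processing-time placement of experts onto GPUs."""
    heap = [(0.0, gpu) for gpu in range(num_gpus)]   # (current load, gpu id)
    heapq.heapify(heap)
    placement = {}
    # Place heaviest experts first, each on the least-loaded GPU so far.
    for expert, load in sorted(expert_loads.items(), key=lambda kv: -kv[1]):
        gpu_load, gpu = heapq.heappop(heap)
        placement[expert] = gpu
        heapq.heappush(heap, (gpu_load + load, gpu))
    return placement

loads = {f"expert_{i}": (i % 7 + 1) * 10.0 for i in range(256)}
print(balance(loads, 8)["expert_0"])
```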
In terms of service performance, the DeepSeek-V3/R1 inference service runs on H800 GPUs, using matrix multiplication and transmission formats consistent with those used in training. According to the latest statistics, the system processed 608 billion input tokens over the past 24 hours, with a peak occupancy of 278 nodes and a daily average of 226.75 nodes, indicating good overall service performance.
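A back-of-the-envelope check of these figures, assuming tokens are spread evenly over the 24 hours and over the average node count (in reality nodes are split between prefill and decode roles, so per-role throughput differs):

```python
input_tokens = 608e9            # input tokens over 24 h
seconds = 24 * 3600
avg_nodes = 226.75

cluster_tps = input_tokens / seconds        # ~7.0M input tokens/s cluster-wide
per_node_tps = cluster_tps / avg_nodes      # ~31k input tokens/s per node
print(f"{cluster_tps:,.0f} tok/s cluster-wide, {per_node_tps:,.0f} tok/s per node")
```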
Through efficient architectural design and intelligent load management, the DeepSeek-V3/R1 inference system not only improves the inference performance of AI models but also provides strong infrastructure support for future AGI research and applications.
Project: https://github.com/deepseek-ai/open-infra-index/blob/main/202502OpenSourceWeek/day_6_one_more_thing_deepseekV3R1_inference_system_overview.md