DeepSeek's "Server Busy" is driving everyone crazy. What's going on?

Stuck in the Token.

Image source: Generated by Boundless AI

The frequent "server is busy, please try again later" responses from DeepSeek are driving users crazy across the country.

DeepSeek, little known to the general public before, shot to prominence on December 26, 2024, when it launched its language model V3, benchmarked against GPT-4o. On January 20 it released the language model R1, benchmarked against OpenAI o1. Since then, the high-quality answers generated in its "deep thinking" mode, together with signals that the upfront cost of model training could fall sharply, have brought the company and its app widespread attention. But DeepSeek R1 has also been plagued by congestion ever since: its web search function fails intermittently, and deep thinking mode frequently returns "server is busy," to the great annoyance of large numbers of users.

The server interruptions began a few weeks ago. At noon on January 27, DeepSeek's website repeatedly showed "DeepSeek web/API unavailable." That same day, DeepSeek became the iPhone app with the most downloads over the weekend, overtaking ChatGPT in the US download rankings.

By February 5, DeepSeek's mobile app had been live for 26 days with more than 40 million daily active users, against 54.95 million for ChatGPT's app, putting DeepSeek at 74.3% of ChatGPT's figure. Almost as soon as DeepSeek entered this steep growth curve, complaints about server congestion began pouring in, with users everywhere finding the app crashing after just a few questions. Workarounds have multiplied: mirror sites, versions hosted by major cloud providers, chip makers, and infrastructure companies, and personal deployment tutorials posted all over the internet. Yet the frustration has not subsided: nearly every major vendor worldwide now claims to support DeepSeek deployment, and users in every region still complain that the service is unstable.

What exactly happened behind the scenes?

1. Users accustomed to ChatGPT cannot tolerate an unreachable DeepSeek

People's frustration with "DeepSeek server is busy" comes from their prior experience with ChatGPT, the reigning AI application, which rarely lags.

Since OpenAI launched the service, ChatGPT has suffered only a few P0-level outages (the most severe incident class). Overall it has been reliable, finding a balance between innovation and stability and gradually becoming a key component comparable to traditional cloud services.

ChatGPT has not experienced many large-scale downtime incidents

ChatGPT's inference process is fairly stable and involves two steps: encoding and decoding. The encoding stage converts the input text into vectors that carry its semantic information; the decoding stage then uses the text generated so far as context and produces the next token through the Transformer model, repeating until a complete response is formed. The large model itself is a decoder-only architecture, and the decoding stage is the process of emitting tokens one after another.

For example, if you ask ChatGPT "How are you feeling today?", it encodes the sentence, computes attention representations for each layer, and, conditioned on the attention representations of all previous tokens, predicts the first output token, "I". It then decodes: "I" is appended to "How are you feeling today?", a new attention representation is computed, the next token "am" is predicted, and so on, until the complete reply "How are you feeling today? I am doing well." is produced.
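To make this loop concrete, here is a minimal Python sketch using the small open gpt2 checkpoint from Hugging Face's transformers library. It only illustrates greedy token-by-token decoding; it is not ChatGPT's or DeepSeek's actual serving code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "How are you feeling today?"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids  # encoding: text -> token ids

with torch.no_grad():
    for _ in range(20):                                # generate at most 20 new tokens
        logits = model(input_ids).logits               # attention over all previous tokens
        next_id = logits[0, -1].argmax()               # greedily pick the most likely next token
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:   # stop when the model ends the sequence
            break

print(tokenizer.decode(input_ids[0]))
```

Production systems add batching, sampling, and KV caching on top of this loop, but the token-after-token structure is the same.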

The container orchestration tool Kubernetes is the "behind-the-scenes conductor" of ChatGPT, responsible for scheduling and allocating server resources. When the influx of users exceeds the capacity of the Kubernetes control plane, it can lead to a complete paralysis of the ChatGPT system.

ChatGPT's outages have been relatively few, but that record rests on enormous resources: stable operation is backed by vast computing power, a fact that is easy to overlook.

Inference generally works on much less data than training, so its compute demand is lower. Industry insiders estimate that in normal large-model inference, the model parameters are the main occupant of GPU memory, accounting for more than 80%. In practice, the default models behind ChatGPT are smaller than DeepSeek-R1's 671B parameters, and ChatGPT has far more GPU compute behind it than DeepSeek does, so it naturally runs more stably than DS-R1.
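As a rough, back-of-envelope illustration of how heavily the parameters alone weigh on inference memory, consider the following Python estimate; the precisions and the 80 GB-per-card figure are assumptions chosen for illustration, not details of DeepSeek's actual deployment.

```python
# Back-of-envelope: memory needed just for the weights of a 671B-parameter model.
# The bytes-per-parameter and 80 GB-per-card figures are illustrative assumptions.
PARAMS = 671e9                          # DeepSeek-R1 total parameter count
BYTES_PER_PARAM = {"FP8": 1, "BF16": 2}

for fmt, nbytes in BYTES_PER_PARAM.items():
    weight_gb = PARAMS * nbytes / 1e9
    cards_needed = weight_gb / 80       # assuming 80 GB cards such as A100/H100/H800
    print(f"{fmt}: ~{weight_gb:,.0f} GB of weights, ~{cards_needed:.0f} x 80 GB cards "
          "before any KV cache or activations")
```

Even at one byte per parameter, the weights alone fill several 80 GB cards per model replica, and every concurrent group of users needs its own replica's worth of throughput.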

DeepSeek-V3 and R1 are both 671B-parameter models, and serving them to users is an inference workload. The compute reserved for inference has to match the user volume: 100 million users need GPUs sized for 100 million users, a reserve that is not only enormous but also entirely separate from the compute reserved for training. From the available information, DeepSeek's GPU and compute reserves on the inference side are clearly insufficient, hence the frequent lag.

This contrast makes users accustomed to the smooth experience of ChatGPT less tolerant, especially as their interest in R1 is growing.

2. Stuck, stuck, and still stuck

Moreover, a closer comparison reveals that the situations encountered by OpenAI and DeepSeek are quite different.

The former has Microsoft behind it: Azure cloud services host ChatGPT, the DALL-E 2 image generator, and the GitHub Copilot coding assistant, a classic cloud+AI pairing that quickly became an industry standard. The latter, though a startup, mostly relies on its own data centers, much like Google, rather than on third-party cloud providers. After reviewing public information, Silicon People found that DeepSeek has not entered into any real cooperation with cloud vendors or chip manufacturers (cloud vendors announced during the Spring Festival that DeepSeek models could run on their platforms, but no substantive cooperation has actually taken place).

Moreover, DeepSeek has seen unprecedented user growth, which means it has had less time than ChatGPT to prepare for traffic surges.

DeepSeek's strong performance comes from holistic optimization at the hardware and system levels. DeepSeek's parent company, the quant fund High-Flyer (幻方量化), invested 200 million yuan in 2019 to build the Firefly One supercomputing cluster, and by 2022 had quietly accumulated tens of thousands of A100 GPUs. To enable more efficient parallel training, DeepSeek developed its own HAI-LLM training framework. Industry insiders believe the Firefly cluster may use thousands to tens of thousands of high-performance GPUs (such as NVIDIA A100/H100 or domestic chips) to provide powerful parallel computing capability. The Firefly cluster currently supports the training of models such as DeepSeek-R1 and DeepSeek-MoE, which perform close to GPT-4 level on tasks such as mathematics and coding.

The Firefly cluster represents DeepSeek's exploration of new architectures and methods, and it has led outsiders to believe that such innovations are how DS cut training costs and trained R1, a model rivaling top-tier AI models, with only a fraction of the compute used by Western peers. SemiAnalysis, however, estimates that DeepSeek actually holds a massive compute reserve: some 60,000 NVIDIA GPUs in total, including 10,000 A100s, 10,000 H100s, 10,000 "special edition" H800s, and 30,000 "special edition" H20s.

That seems to suggest R1 has enough GPUs behind it. But in reality, as a reasoning model benchmarked against OpenAI's o3, R1 has to commit more compute to the response stage, and it is unclear whether the compute saved on the training side can cover the surging demand on the inference side.

It is worth noting that DeepSeek-V3 and DeepSeek-R1 operate differently. DeepSeek-V3 is an instruction model: like ChatGPT, it receives a prompt and generates text in response. DeepSeek-R1, however, is a reasoning model: when a user asks a question, it first works through an extensive reasoning process and only then produces the final answer. The tokens R1 emits therefore begin as a long stream of visible thought, in which the model restates and decomposes the problem; all of this reasoning is generated rapidly as tokens before the answer itself appears.
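As an illustration, here is a minimal Python sketch of how a client might separate that reasoning from the final answer. It assumes the reasoning is wrapped in <think>...</think> tags, as in the open-source DeepSeek-R1 chat format; an actual deployment may format its output stream differently.

```python
import re

def split_reasoning(raw_output: str) -> tuple[str, str]:
    """Split an R1-style completion into (reasoning, answer), assuming <think> tags."""
    match = re.search(r"<think>(.*?)</think>", raw_output, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    answer = re.sub(r"<think>.*?</think>", "", raw_output, flags=re.DOTALL).strip()
    return reasoning, answer

sample = ("<think>The user is greeting me; a short, friendly reply is enough.</think>"
          "I'm doing well, thank you!")
reasoning, answer = split_reasoning(sample)
print("reasoning:", reasoning)
print("answer:", answer)
```

The practical point is that every one of those reasoning tokens costs the same GPU time as an answer token, so a reasoning model burns far more inference compute per question than an instruction model does.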

In the view of Wen Tingcan, Vice President of Yaotu Capital, DeepSeek's large compute reserve described above belongs to the training stage. Training compute can be planned and anticipated, so shortfalls are rare; inference compute is far less predictable, since it depends mainly on user scale and usage and is much more elastic. "Inference compute would normally grow along a foreseeable curve, but DeepSeek became a phenomenon overnight, so user scale and usage exploded in a short time, and demand for inference compute exploded with them. That is what causes the lag."

Gui Cang, a practicing model product designer and independent developer, agrees that the card shortage is the main cause of DeepSeek's lag. In his view, as the most downloaded mobile app across 140 markets worldwide, DeepSeek cannot hold up under its current load no matter what new cards it brings in, because "building out new cards in the cloud takes time."

"The cost of running an NVIDIA A100 or H100 chip for an hour has a fair market price, and the inference cost of DeepSeek's output Token is more than 90% cheaper than OpenAI's similar models, which is not much different from everyone's calculations. Therefore, the MOE (Mixture of Experts) model architecture itself is not the main problem, but the number of GPUs DeepSeek has determines the maximum number of Tokens they can produce per minute. Even if they can use more GPUs for inference services for users instead of pre-training research, the upper limit is still there," said Chen Yunfei, the developer of the AI-native application Xiaomaobuliang.

Some industry insiders have also told Silicon People that the root of DeepSeek's lag is that its private cloud simply was not prepared.

Hacker attacks are another factor behind R1's lag. On January 30, media learned from the cybersecurity firm QiAnXin that attacks on DeepSeek's online services had suddenly intensified, with attack commands up more than a hundredfold from January 28. QiAnXin's XLab observed at least two botnets taking part in the attacks.

For the lag on R1's own service, though, there is an apparently obvious remedy: have third parties serve the model. That was also the liveliest scene of the Spring Festival, as vendor after vendor stood up its own deployment to meet demand for DeepSeek.

On January 31, NVIDIA announced that DeepSeek-R1 was available through NVIDIA NIM; this was the same NVIDIA whose market value had evaporated by nearly $600 billion overnight because of DeepSeek. The same day, Amazon Web Services (AWS) let users deploy the latest DeepSeek-R1 base model on its AI platforms, Amazon Bedrock and Amazon SageMaker AI. AI application newcomers including Perplexity and Cursor also onboarded DeepSeek in quick succession. Microsoft had in fact moved ahead of Amazon and NVIDIA, becoming the first to deploy DeepSeek-R1 on its Azure cloud service and GitHub.

From February 1, the fourth day of the Lunar New Year, Huawei Cloud, Alibaba Cloud, ByteDance's Volcano Engine, and Tencent Cloud joined in, generally offering deployment of the full DeepSeek model series in every size. AI chip makers such as Biren Technology, Horizon Robotics, Ascend, and Sunmoon then claimed to have adapted either the original model or smaller distilled versions of it. On the software side, Yonyou, Kingdee, and others integrated DeepSeek models into some of their products to strengthen them. Finally, device makers such as Lenovo, Huawei, and Honor built DeepSeek into some of their products as personal assistants and into smart car cockpits.

So far, DeepSeek has attracted, on the strength of its own value, a vast circle of partners spanning domestic and foreign cloud vendors, telecom operators, securities firms, and national-level platforms such as the National Supercomputing Internet Platform. Because DeepSeek-R1 is a fully open-source model, every provider that has onboarded it benefits from it. That has greatly amplified DeepSeek's reach, but it has also made the lag more frequent: providers and DeepSeek alike are increasingly strained by the influx of users, and neither has yet found the key to stable service.

"The R1 heat remains high, and service providers need to take into account the access of other models, so the cards they can provide for R1 are very limited. When R1 is hot, whoever provides it at a relatively low price will be overwhelmed," explained Gui Cang, the model product designer and independent developer, to Silicon People.

Model deployment optimization is a broad field, spanning everything from the end of training to the actual hardware deployment, and it involves work on many fronts. For DeepSeek's lag, though, the causes may be simpler: an exceptionally large model, and insufficient optimization before launch.

Before launch, a popular large model faces technical, engineering, and business challenges: consistency between training data and production data, the effect of data latency and freshness on inference quality, online inference efficiency and its heavy resource footprint, limits to the model's generalization ability, and engineering concerns such as service stability and API and system integration.

Teams behind popular large models pay close attention to inference optimization before launch because of two problems: computation time and memory. The first means inference latency is too long, degrading the user experience or even missing latency requirements outright, which is exactly the lag users see. The second means the model's parameters consume so much GPU memory that a single card cannot hold them, which also causes lag.
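As a rough sketch of the latency side, the following Python arithmetic uses assumed figures (none of them DeepSeek's published metrics) to show how a long chain of reasoning tokens stretches the wait before the answer even starts:

```python
# All figures are assumptions for illustration only.
decode_rate = 30          # tokens generated per second for one request (assumed)
thinking_tokens = 1_500   # chain-of-thought emitted before the answer (assumed)
answer_tokens = 300       # tokens in the visible answer (assumed)

wait_before_answer = thinking_tokens / decode_rate
total_latency = (thinking_tokens + answer_tokens) / decode_rate
print(f"~{wait_before_answer:.0f}s of 'thinking' before the answer starts, "
      f"~{total_latency:.0f}s for the full reply")
```

Under load, the per-request decode rate drops further, and that wait is what users experience as stalling or as an outright "server busy."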

Wen Tingcan told Silicon People that the difficulties providers have serving R1 stem essentially from the DS model's unusual structure: a very large model combined with an MoE (Mixture of Experts) architecture. "(The providers) need time to optimize, but the market's attention has a window, so they go live first and optimize afterwards, rather than optimizing fully before launch."

For R1 to run stably, the key now lies in the reserve and optimization capacity on the inference side. DeepSeek needs to find ways to bring down inference costs, in terms of both the cards tied up and the cost of each output token.

At the same time, the lag also suggests DeepSeek's own compute reserve may not be as large as SemiAnalysis claims. High-Flyer (幻方量化) needs cards, and the DeepSeek training team needs cards, so not many are left over for users. Given its current trajectory, DeepSeek may have little incentive in the short term to spend money renting capacity just to give users a better experience for free; it is more likely to wait until its first consumer business model is settled before considering leased services, which means the lag may persist for quite some time yet.

"They probably need two actions: 1) Implement a paid mechanism to limit the model usage of free users; 2) Cooperate with cloud service providers to use their GPU resources," said developer Chen Yunfei, whose temporary solution has gained consensus in the industry.

However, at present, DeepSeek does not seem to be too anxious about this "server busy" problem. As a company pursuing AGI, DeepSeek seems unwilling to focus too much on the influx of user traffic. It is possible that users will have to get used to the "server busy" interface for a considerable period of time in the future.
