Earlier this week, DeepSeek, a well-funded Chinese AI lab, released an “open” AI model that beats many competitors on popular benchmarks. The model, DeepSeek V3, is immense and powerful, handling text tasks such as coding and essay writing with ease.
It also appears to believe it is ChatGPT.
Posts on X, along with TechCrunch’s own tests, show that DeepSeek V3 identifies itself as ChatGPT, OpenAI’s AI-powered chatbot platform. When asked to elaborate, DeepSeek V3 insists that it is a version of OpenAI’s GPT-4 model released in 2023.
This actually reproduces as of today. In 5 out of 8 generations, DeepSeekV3 claims to be ChatGPT (v4), while claiming to be DeepSeekV3 only 3 times.
Gives you a rough idea of some of their training data distribution. https://t.co/Zk1KUppBQM pic.twitter.com/ptIByn0lcv
— Lucas Beyer (bl16) (@giffmana) December 27, 2024
The delusions run deep. If you ask DeepSeek V3 a question about DeepSeek’s API, it will give you instructions on how to use OpenAI’s API. DeepSeek V3 even tells some of the same jokes as GPT-4, down to the punchlines.
So what’s going on?
Models like ChatGPT and DeepSeek V3 are statistical systems. Trained on billions of examples, they learn patterns in those examples to make predictions, such as how “to whom” in an email typically precedes “it may concern.”
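To make the idea of pattern-based prediction concrete, here is a minimal sketch in Python of a word-pair (bigram) counter. It is a toy illustration only, vastly simpler than the neural networks behind ChatGPT or DeepSeek V3; the function names and the three-sentence corpus are invented for this example.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count how often each word is followed by each other word."""
    follows = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows


def predict_next(follows, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    counts = follows.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Toy corpus, invented for illustration only.
corpus = [
    "to whom it may concern",
    "to whom it may concern",
    "to whom do I send this",
]
model = train_bigrams(corpus)

print(predict_next(model, "whom"))  # most common word seen after "whom"
print(predict_next(model, "may"))   # most common word seen after "may"
```

Because the toy model can only echo what it has counted, whatever dominates its training text dominates its predictions. The same principle, at a far larger scale, is why text copied from another chatbot can resurface in a model's answers.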
DeepSeek hasn’t revealed much about the source of DeepSeek V3’s training data. But there is no shortage of public datasets containing text generated by GPT-4 via ChatGPT. If DeepSeek V3 was trained on these, the model might have memorized some of GPT-4’s outputs and now be regurgitating them verbatim.
“Obviously, the model is seeing raw responses from ChatGPT at some point, but it’s not clear where that is,” Mike Cook, a research fellow at King’s College London specializing in artificial intelligence, told TechCrunch. “It could be ‘accidental’… but unfortunately we have seen cases of people directly training their models on the output of other models to try to leverage their knowledge.”
Cook noted that the practice of training models based on the output of competing AI systems can be “very detrimental” to model quality because it can lead to hallucinations and misleading answers like the above. “Just like making a photocopy of a photocopy, we lose more and more information and connection with reality,” Cook said.
This may also conflict with the terms of service of these systems.
OpenAI’s terms prohibit users of its products, including ChatGPT customers, from using outputs to develop models that compete with OpenAI’s own.
OpenAI and DeepSeek did not immediately respond to requests for comment. However, OpenAI CEO Sam Altman posted what appeared to be a dig at DeepSeek and other competitors on X on Friday.
“It’s relatively easy to copy something you know works,” Altman wrote. “It is extremely difficult to do something new, risky and difficult when you don’t know if it will work.”
Granted, DeepSeek V3 is not the first model to misidentify itself. Google’s Gemini and others sometimes claim to be competing models. For example, when prompted in Mandarin, Gemini says that it is Wenxinyiyan, a chatbot from the Chinese company Baidu.
That’s because the internet, where AI companies get most of their training data, is becoming increasingly cluttered with AI slop. Content farms use artificial intelligence to create clickbait. Bots are flooding Reddit and X. By one estimate, 90% of the web could be AI-generated by 2026.
This “contamination,” so to speak, has made it quite difficult to thoroughly filter AI outputs from training datasets.
It is certainly possible that DeepSeek trained DeepSeek V3 directly on text generated by ChatGPT. Google was once accused of doing the same thing, after all.
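A rough sense of why such filtering is hard comes from looking at the simplest possible approach, sketched below in Python. This is a hypothetical keyword filter, not a method attributed to DeepSeek, OpenAI, or any other lab; the marker phrases and sample records are assumptions made up for illustration.

```python
import re

# Hypothetical marker phrases; a real filter would need a far broader list.
AI_MARKERS = [
    r"\bas an ai language model\b",
    r"\bi am chatgpt\b",
    r"\btrained by openai\b",
]
MARKER_RE = re.compile("|".join(AI_MARKERS), re.IGNORECASE)


def looks_ai_generated(text):
    """Flag text that contains an obvious chatbot self-reference."""
    return MARKER_RE.search(text) is not None


def filter_dataset(records):
    """Keep only the records that do not trip any marker phrase."""
    return [r for r in records if not looks_ai_generated(r)]


# Sample records, invented for illustration only.
samples = [
    "As an AI language model, I cannot browse the internet.",
    "The recipe calls for two cups of flour.",
    "I am ChatGPT, a model trained by OpenAI.",
]
kept = filter_dataset(samples)
print(kept)
```

A filter like this catches only text that announces its own origin. AI-written passages without telltale phrases pass straight through, which is part of why contamination is so difficult to remove.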
Heidy Khlaaf, principal artificial intelligence researcher at the nonprofit AI Now Institute, said the cost savings from “distilling” an existing model’s knowledge can be attractive to developers, regardless of the risks.
“Even with web data now full of AI outputs, other models that happened to train on ChatGPT or GPT-4 outputs would not necessarily show outputs that resemble OpenAI’s custom messages,” Khlaaf said. “If it were the case that DeepSeek did the distillation partially using OpenAI models, that wouldn’t be surprising.”
More likely, however, is that a lot of ChatGPT/GPT-4 data ended up in the DeepSeek V3 training set. That means, for one, that the model cannot be trusted to self-identify. More troubling is the possibility that DeepSeek V3, by indiscriminately absorbing and iterating on GPT-4’s outputs, could exacerbate some of that model’s biases and flaws.