OpenAI says it is reviewing evidence that the Chinese start-up DeepSeek broke its terms of service by harvesting large amounts of data from its A.I. technologies.
The San Francisco-based start-up, which is now valued at $157 billion, said that DeepSeek may have used data generated by OpenAI technologies to teach similar skills to its own systems.
This process, called distillation, is common across the A.I. field. But OpenAI’s terms of service say that the company does not allow anyone to use data generated by its systems to build technologies that compete in the same market.
“We know that groups in the P.R.C. are actively working to use methods, including what’s known as distillation, to replicate advanced U.S. A.I. models,” OpenAI spokeswoman Liz Bourgeois said in a statement emailed to The New York Times, referring to the People’s Republic of China.
“We are aware of and reviewing indications that DeepSeek may have inappropriately distilled our models, and will share information as we know more,” she said. “We take aggressive, proactive countermeasures to protect our technology and will continue working closely with the U.S. government to protect the most capable models being built here.”
DeepSeek did not immediately respond to a request for comment.
DeepSeek spooked Silicon Valley tech companies and sent the U.S. financial markets into a tailspin earlier this week after releasing A.I. technologies that matched the performance of anything else on the market.
The prevailing wisdom had been that the most powerful systems could not be built without billions of dollars in specialized computer chips, but DeepSeek said it had created its technologies using far fewer resources.
Like any other A.I. company, DeepSeek built its technologies using computer code and data corralled from across the internet. A.I. companies lean heavily on a practice called open sourcing, freely sharing the code that underpins their technologies and reusing code shared by others. They see this as a way of accelerating technological development.
They also need massive amounts of online data to train their A.I. systems. These systems learn their skills by pinpointing patterns in text, computer programs, images, sounds and videos. The leading systems do so by analyzing just about all of the text on the internet.
Distillation is often used to train new systems. If a company takes data from proprietary technology, the practice may be legally problematic. But the licenses of open source technologies often allow it.
OpenAI is now facing more than a dozen lawsuits accusing it of illegally using copyrighted internet data to train its systems. This includes a lawsuit brought by The New York Times against OpenAI and its partner Microsoft.
The suit contends that millions of articles published by The Times were used to train automated chatbots that now compete with the news outlet as a source of reliable information. Both OpenAI and Microsoft deny the claims.
A Times report also showed that OpenAI has used speech recognition technology to transcribe the audio from YouTube videos, yielding new conversational text that would make an A.I. system smarter. Some OpenAI employees discussed how such a move might go against YouTube’s rules, three people with knowledge of the conversations said.
An OpenAI team, including the company’s president, Greg Brockman, transcribed more than one million hours of YouTube videos, the people said. The texts were then fed into a system called GPT-4, which was widely considered one of the world’s most powerful A.I. models and was the basis of the latest version of the ChatGPT chatbot.