PANews reported on January 9, citing TechCrunch, that Elon Musk said in a live conversation with Stagwell Chairman Mark Penn that AI model training has essentially exhausted the available real-world data. "We have exhausted the cumulative sum of human knowledge. This happened last year." Musk's view echoes that of former OpenAI chief scientist Ilya Sutskever, who argued at the NeurIPS machine learning conference that the AI industry has reached "peak data" and may need to change how models are developed.
Musk believes that synthetic data will be the way to supplement real data, with AI learning on its own by generating data and then evaluating it itself. Technology companies including Microsoft, Meta, OpenAI and Anthropic have already adopted this approach. For example, Microsoft's Phi-4 model and Google's Gemma models were both trained on a combination of real and synthetic data. Gartner has estimated that about 60% of the data used in AI and analytics projects in 2024 would be synthetically generated.
One advantage of synthetic data is lower cost. AI startup Writer spent only about $700,000 to develop its Palmyra X 004 model, which was built almost entirely on synthetic data. By comparison, developing a similar-sized OpenAI model cost about $4.6 million. Synthetic data also carries risks, however, including reduced model creativity, more biased output, and potential model collapse. In particular, if the data used to produce the synthetic data is itself biased, the generated results may be biased as well.