Source: Stone Study Notes
Editor’s Note:
At the end of 2024, domestic large-model companies released new products in quick succession, showing that AI remains hot. In Silicon Valley, after heated discussions, AI practitioners summarized some consensus and many "non-consensus" views on the AI industry in 2025. For example, Silicon Valley investors believe that AI companies are a "new species" and that AI applications will be the investment hotspot of 2025.
From January 11 to 15, Jinqiu Fund held the "Scale with AI" event in Silicon Valley, inviting experts from A16Z, Pear VC, Soma Capital, Leonis Capital, Old Friendship Capital, OpenAI, xAI, Anthropic, Google, Meta, Microsoft, Apple, Tesla, Nvidia, ScaleAI, Perplexity, Character.ai, Midjourney, Augment, Replit, Codeium, Limitless, Luma, and Runway to exchange views.
After these exchanges, we summarized the experts' opinions into the following 60 insights.
01 Model
1. The pre-training phase of LLM has reached a bottleneck
But there are still many opportunities for post-training
In the pre-training stage, scaling is slowing down, and it will still take some time to reach saturation.
Reasons for the slowdown (single-modality models): architecture > compute > data.
But for multimodal models: data = compute > architecture.
For multimodal models, the right combinations across modalities still need to be worked out. Under the existing architecture, pre-training can be considered largely done, but switching to a new architecture could change that.
Less is being invested in pre-training now mainly because resources are limited and the marginal return on post-training is higher.
2. Relationship between Pre-training and RL
Pre-training does not care much about data quality.
Post-training has high requirements on data quality; because of compute limits, the highest-quality data is fed in the final stages of training.
Pre-training is imitation: the model can only reproduce what it has seen. RL is about creation: it lets the model do things differently.
Pre-training comes first, then RL in post-training; the model needs basic capabilities first so that RL can be applied in a targeted way.
RL does not change the model's intelligence so much as its thinking patterns. For example, using RL to optimize engagement at C.AI has been very effective.
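A minimal sketch of the contrast above, in PyTorch-style pseudocode (the model interfaces and the engagement-style reward are assumptions, not anything C.AI has disclosed): pre-training minimizes an imitation (cross-entropy) loss on human text, while RL samples from the model and reinforces outputs that score well under a reward.
```python
import torch
import torch.nn.functional as F

def pretrain_step(model, tokens):
    # Imitation: maximize the likelihood of the next token in human-written text.
    logits = model(tokens[:, :-1])
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tokens[:, 1:].reshape(-1))

def rl_step(model, prompt, reward_fn):
    # Creation: sample a response, score it (e.g. with an engagement-style reward),
    # and reinforce its tokens in proportion to the reward. This is a bare
    # REINFORCE-style update; production systems typically use PPO or similar.
    response, logprobs = model.generate_with_logprobs(prompt)  # hypothetical helper
    reward = reward_fn(prompt, response)
    return -(reward * logprobs.sum())
```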
3. Large model optimization will affect product capabilities
In post-training, much of the work is on safety. For example, to address the risk of minors and suicide, C.AI uses different models to serve users of different ages.
The second is the multi-agent framework: the model thinks about how to solve the problem, assigns sub-tasks to different agents, collects each agent's output once it is done, and finally refines the overall result.
4. Some non-consensus issues may reach consensus next year
Is a large model needed for everything? There were already many good small models, so it may not be necessary to build another large one.
Today's large model will count as a small model a year from now.
The model architecture may change. With the scaling law reaching its limit, the question of decoupling knowledge from the model may come up for discussion sooner than expected.
5. With the end of Scaling law in the LLM field, the gap between closed source and open source is narrowing.
6. Video generation is still at the GPT-1/GPT-2 stage
The current level of video generation is close to SD 1.4; in the future there will be open-source models with performance comparable to commercial ones.
The current difficulty lies in datasets. Image models could build on the LAION dataset, which everyone could clean; for video, there is no large public dataset because of copyright issues. How each company acquires, processes, and cleans data varies greatly, leading to different model capabilities and making an open-source equivalent harder to build.
The next difficult point of the DiT solution is how to improve the compliance with physical laws rather than just statistical probabilities.
The efficiency of video generation is a bottleneck. Currently, it takes a long time to run on high-end graphics cards, which is an obstacle to commercialization and a direction that the academic community is exploring.
Similar to LLMs, although model iteration is slowing down, applications are not. From a product perspective, focusing only on text-to-video is not a good direction; related editing and creative products will keep emerging, and there will be no bottleneck in the short term.
7. Choosing different technology stacks for different scenarios will be a trend
When Sora first came out, everyone thought the field would converge on DiT, but in fact many technical paths are still being pursued: GAN-based approaches, real-time autoregressive generation (such as the recently popular Oasis project), and combining CG with CV to achieve better consistency and control. Each company makes different choices, and choosing different technology stacks for different scenarios will be a trend going forward.
8. The scaling law for video is far from the LLM level
The scaling law for video holds within a certain range, but it is far from the LLM level. The largest current models are around 30B parameters; scaling has been shown to work up to 30B, but there are no successful cases at the 300B level.
The current technical solutions are converging, and the approaches are not very different. The main difference is in the data, including the data ratio.
It will take 1-2 years for the DiT route to reach saturation, and there are many areas within it that can be optimized. A more efficient model architecture is very important: take LLMs as an example, at first everyone built ever-larger models, but it later turned out that with MoE and better data distribution, such large models were not necessary.
More research is needed. Simply scaling up DiT is not efficient. If we include YouTube and TikTok, the amount of video data is very large and it is impossible to use all of it for model training.
At this stage, there is relatively little open source work, especially open source work in data preparation. The cleaning methods of each company are very different, and the data preparation process has a great impact on the final effect, so there are still many points that can be optimized.
9. How to increase the speed of video generation
The simplest method is to generate at lower resolution and lower frame rate. The most common method is step distillation: diffusion inference runs over multiple denoising steps, and image generation currently still needs at least 2 steps, so distilling down to 1-step inference would make it much faster. Recently there has also been a paper generating video in a single step; it is only a proof of concept for now, but worth watching.
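A minimal sketch of step distillation, with hypothetical model interfaces: a one-step student is trained to reproduce the output a multi-step diffusion teacher reaches after its full denoising loop, so inference cost drops from many steps to one.
```python
import torch
import torch.nn.functional as F

def teacher_sample(teacher, noise, num_steps=50):
    # Standard multi-step denoising loop (simplified; no guidance shown).
    x = noise
    for t in reversed(range(num_steps)):
        x = teacher.denoise(x, t)  # hypothetical single-step denoiser
    return x

def distillation_step(student, teacher, optimizer, batch_size=4, shape=(3, 64, 64)):
    noise = torch.randn(batch_size, *shape)
    with torch.no_grad():
        target = teacher_sample(teacher, noise)   # expensive: many teacher steps
    prediction = student.denoise(noise, 0)        # cheap: one student step
    loss = F.mse_loss(prediction, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
In practice this is usually done progressively (halving the number of steps at a time) rather than in a single jump from 50 steps to 1.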
10. Priority of video model iteration
In fact, clarity, consistency, controllability, and so on have not reached saturation, and we are not yet at the point where improving one comes at the expense of another; at the pre-training stage they are all still improving simultaneously.
11. Technical solutions to speed up the generation of long videos
We can see the limits of DiT's capabilities. The larger the model and the better the data, the clearer the generated results, the longer the time and the higher the success rate.
There is no answer yet to how far the DiT model can scale. If a bottleneck appears at a certain size, a new model architecture may emerge. On the algorithmic side, new inference algorithms have been built on top of DiT to make generation fast; the harder part is how to bring these improvements into training.
The current model's understanding of physical laws is statistical, and the phenomena seen in the data set can be simulated to a certain extent, but it does not really understand physics. There are some discussions in the academic community, such as applying some physical rules to video generation.
12. Fusion of video models and other modalities
There will be two aspects of unification: one is the unification of multimodality, and the other is the unification of generation and understanding. For the former, representation must be unified first. For the latter, both text and speech can be unified. The unification of VLM and diffusion is currently believed to have an effect of 1+1<2. This work will be more difficult, not necessarily because the model is not smart enough, but because the two tasks themselves are contradictory. How to achieve a delicate balance is a complex issue.
The simplest idea is to tokenize everything, feed it into a transformer, and unify the inputs and outputs. But my personal experience is that doing a single specific modality well still beats combining everything together.
In industrial practice, people do not yet combine them. A recent MIT paper suggests that unifying all modalities may actually work better.
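A minimal sketch of the "tokenize everything" idea, with assumed vocabulary sizes and tokenizers: each modality is mapped into its own slice of one shared vocabulary, and the segments are concatenated into a single sequence for an ordinary next-token-prediction transformer.
```python
import torch

# Assumed vocabulary layout: text ids, then image ids, then video ids, then markers.
TEXT_VOCAB, IMAGE_VOCAB, VIDEO_VOCAB = 50_000, 8_192, 8_192
IMAGE_OFFSET = TEXT_VOCAB
VIDEO_OFFSET = TEXT_VOCAB + IMAGE_VOCAB
BOS_IMAGE = VIDEO_OFFSET + VIDEO_VOCAB   # marker: "an image segment starts here"
BOS_VIDEO = BOS_IMAGE + 1                # marker: "a video segment starts here"

def build_sequence(text_tokens, image_tokens, video_tokens):
    # Shift each modality into its own slice of the shared vocabulary,
    # then concatenate everything into one stream for next-token prediction.
    return torch.cat([
        text_tokens,
        torch.tensor([BOS_IMAGE]), image_tokens + IMAGE_OFFSET,
        torch.tensor([BOS_VIDEO]), video_tokens + VIDEO_OFFSET,
    ])

# Example: text from a BPE tokenizer, image/video ids from (assumed) VQ-style tokenizers.
seq = build_sequence(
    text_tokens=torch.randint(0, TEXT_VOCAB, (32,)),
    image_tokens=torch.randint(0, IMAGE_VOCAB, (256,)),
    video_tokens=torch.randint(0, VIDEO_VOCAB, (1024,)),
)
```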
13. There is actually a lot of training data for video modality
There is actually a lot of video data, so it is important to efficiently select high-quality data.
The number depends on the understanding of copyright. But computing power is also a bottleneck. Even if there is so much data, there may not be enough computing power to do it, especially high-definition data. Sometimes it is necessary to reverse the required high-quality data set based on the computing power at hand.
High-quality data has always been in short supply, and even when data exists, a big problem is that people don't know what a correct image description looks like or which keywords it should contain.
14. The future of long video generation lies in storytelling
Today's video generation produces footage; the future is about telling stories. Video generation has a purpose: a long video is not about its raw length but about the story it tells, framed as tasks.
For video editing, the demands on speed are even higher, because the current sticking point is that generation is too slow: it takes minutes to produce a few seconds of video, so even a good algorithm is unusable. (Editing here does not mean cutting and splicing, but editing the content of the footage, such as changing a person or an action; the technology exists, but it is too slow to be usable.)
15. The aesthetic improvement of video generation mainly depends on post training
Aesthetics mainly come from the post-training stage; Hailuo, for example, uses a lot of film and television data. Realism, by contrast, is a base-model capability.
16. The two difficulties in video comprehension are long context and latency.
17. The visual modality may not be the best modality to achieve AGI
Text modality: text can also be turned into an image, and then into a video.
Text is a shortcut to intelligence; the efficiency gap between video and text is on the order of hundreds of times.
18. End-to-end speech model is a great improvement
Without manual labeling and judgment of data, precise emotional understanding and expression can be achieved.
19. Multimodal models are still in their early stages
Multimodal models are still in their early stages. It is already difficult to predict the next 5 seconds based on the first 1 second of a video, and adding text later may be even more difficult.
Theoretically, it is best to train with video and text together, but it is difficult to do it as a whole.
Multimodality cannot improve intelligence at present, but it may be able to in the future. Compression algorithms can learn the relationships between data sets. They only need pure text and pure image data, and then they can make videos and text understand each other.
20. The multimodal technology path has not yet fully converged
Diffusion models give good quality, and the model structure is still being modified;
Autoregressive models have good logic.
21. There is no consensus on the alignment of different modalities.
It has not yet been decided whether video tokens are discrete or continuous.
There aren't many high-quality alignments right now.
At present, it is not clear whether this is a science problem or an engineering problem.
22. It is feasible to use a large model to generate data and then train a small model, but the reverse is difficult
The difference between synthetic data and real data is mainly a matter of quality.
You can also piece together and synthesize data from various sources, and the results are quite good; this can be used in the pre-training stage because the data quality requirements there are not high.
23. The era of pre-training for LLM is basically over
Now everyone is talking about Post training, which has high requirements for data quality
24. Post training team building
Theoretical team size: 5 people are enough (not necessarily full-time).
One person builds the pipeline (infrastructure).
One person manages the data (data effect).
One person is responsible for SFT of the model itself (a scientist / paper reader).
One person is responsible for making product judgments on model orchestration and collecting user data.
For product and UI in the AI era, doing your own post-training is an advantage: AI can make up for gaps in product and UI understanding, and a team with rich development experience will not be led astray by AI.
25. Data pipeline construction
Data loop: data enters the pipeline and new data flows back in.
Efficient iteration: data labeling combined with the pipeline and A/B testing, plus a structured data warehouse.
Data input: efficiently label and enrich user feedback to build a moat.
Initial stage: SFT (keep looping back to this stage).
Later stages: RL (RLHF, with more differentiation) and scoring-guided RL; the DPO method is prone to collapse; SFT can be viewed as a simplified version of RL (see the sketch below).
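A minimal sketch of the DPO objective mentioned above (PyTorch-style; the per-sequence log-probability helper is an assumption). DPO trains the policy directly on preference pairs against a frozen reference model; one intuition for why it is prone to collapse is that nothing bounds how far the chosen-vs-rejected margin gets pushed, so the policy can drift into degenerate outputs if the beta or the data are off.
```python
import torch
import torch.nn.functional as F

def dpo_loss(policy, reference, prompt, chosen, rejected, beta=0.1):
    # Log-probabilities of each full completion given the prompt (assumed helper).
    pi_c = policy.sequence_logprob(prompt, chosen)
    pi_r = policy.sequence_logprob(prompt, rejected)
    with torch.no_grad():                      # the reference model stays frozen
        ref_c = reference.sequence_logprob(prompt, chosen)
        ref_r = reference.sequence_logprob(prompt, rejected)
    # Implicit rewards are log-ratios against the reference; maximize their margin.
    margin = (pi_c - ref_c) - (pi_r - ref_r)
    return -F.logsigmoid(beta * margin)
```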
02 Embodiment
1. Embodied robots have not yet reached a “critical moment” similar to ChatGPT
A core reason is that robots need to complete tasks in the physical world, not just generate text through virtual language.
Breakthroughs in robot intelligence require solving the core problem of "embodied intelligence", that is, how to complete tasks in a dynamic and complex physical environment.
The robot's "critical moment" needs to meet the following conditions: versatility (the ability to adapt to different tasks and environments), reliability (a high success rate in the real world), and scalability (the ability to keep iterating and optimizing through data and tasks).
2. The core problem solved by this generation of machine learning is generalization
Generalization is the ability of an AI system to learn patterns from training data and apply them to unseen data.
There are two modes of generalization:
Interpolation: the test data falls within the distribution of the training data.
Extrapolation: the test data falls outside that distribution. The difficulty of extrapolation lies in whether the training data can cover the test data well, as well as the distribution range and the cost of obtaining test data. "Coverage" is the key concept here: whether the training data effectively covers the diversity of the test data.
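A toy illustration of the coverage idea above, assuming a single scalar feature for simplicity: a test case counts as interpolation if it falls inside the range spanned by the training data, and as extrapolation otherwise.
```python
import numpy as np

def coverage_split(train_x: np.ndarray, test_x: np.ndarray):
    lo, hi = train_x.min(), train_x.max()
    interpolation = test_x[(test_x >= lo) & (test_x <= hi)]   # covered by training data
    extrapolation = test_x[(test_x < lo) | (test_x > hi)]     # outside coverage
    return interpolation, extrapolation

train = np.array([0.2, 0.4, 0.5, 0.9])   # e.g. door weights seen during training
test = np.array([0.3, 1.5])              # 0.3 interpolates, 1.5 extrapolates
print(coverage_split(train, test))
```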
3. Visual tasks (such as face recognition and object detection) are mostly interpolation problems
The work of machine vision is mainly to imitate the perception ability of biological organisms to understand and perceive the environment.
Machine vision models are already very mature for certain tasks (such as cat and dog recognition) because there is a large amount of relevant data to support them. However, for more complex or dynamic tasks, the diversity and coverage of data is still a bottleneck.
Visual tasks (such as face recognition and object detection) are mostly interpolation problems, and the model covers most test scenarios through training data.
However, the model’s ability to extrapolate to new angles or lighting conditions is still limited.
4. The difficulty of generalization of this generation of robots: most of the cases belong to extrapolation cases
Environmental complexity: diversity and dynamic changes in home and industrial environments.
Physical interaction issues: For example, physical characteristics of the door such as weight, angle differences, wear and tear, etc.
Uncertainty in human-computer interaction: The unpredictability of human behavior places higher demands on robots.
5. Robots with full human-like generalization capabilities may not be achieved in the current or future generation
It is extremely difficult for robots to cope with the complexity and diversity in the real world. The dynamic changes in the real environment (such as pets, children, furniture placement in the home, etc.) make it difficult for robots to fully generalize.
Humans themselves are not omnipotent individuals, but they complete complex tasks in society through division of labor and cooperation. Robots also do not necessarily pursue "human-level" generalization capabilities, but are more focused on certain specific tasks, and even achieve "superhuman" performance (such as efficiency and precision in industrial production).
Even seemingly simple tasks (such as sweeping the floor or cooking) have very high generalization requirements due to the complexity and dynamics of the environment. For example, sweeping robots need to cope with the different layouts, obstacles, and floor materials of thousands of households, which increases the difficulty of generalization.
So, do robots need to pick their tasks? For example, robots need to focus on specific tasks rather than pursuing comprehensive human capabilities.
6. Stanford Lab’s choice: Focus on family scenarios
Stanford's robotics lab focuses on tasks in the home, especially household robots related to an aging society. For example, robots can help with daily tasks such as folding blankets, picking up objects, and opening bottle caps.
Reasons for concern: Countries such as the United States, Western Europe, and China are facing serious aging problems. The main challenges brought by aging include: Cognitive deterioration: Alzheimer's disease is a widespread problem, with about half of people over 95 suffering from this disease. Motor deterioration: Diseases such as Parkinson's disease and ALS make it difficult for the elderly to complete basic daily operations.
7. Define generalization conditions based on specific scenarios
Identify the environments and scenarios that the robot needs to handle, such as a home, restaurant, or nursing home.
Once the scenarios are clear, you can better define the scope of the task and ensure that possible item state changes and environmental dynamics are covered in these scenarios.
Importance of scenario debugging: Debugging of robot products is not just about solving technical problems, but about covering all possible situations. For example, in a nursing home, robots need to handle a variety of complex situations (such as slow movement of the elderly, unstable placement of items, etc.). By working with domain experts (such as nursing home managers and caregivers), task requirements can be better defined and relevant data can be collected.
The real-world environment is not as fully controllable as an industrial assembly line, but it can be made "known" through debugging. For example, define the types of objects commonly found in a home environment, their placement, dynamic changes, etc., and cover key areas in both simulation and real environments.
8. The contradiction between generalization and specialization
Conflict between general models and task-specific models: The model needs to have strong generalization capabilities and be able to adapt to a variety of tasks and environments; but this usually requires a large amount of data and computing resources.
Task-specific models are easier to commercialize, but their capabilities are limited and hard to extend to other areas.
Future robot intelligence needs to find a balance between versatility and specialization. For example, through modular design, a general model can be used as the basis, and then fast adaptation can be achieved through fine-tuning of specific tasks.
9. The potential of embodied multimodal models
Integration of multimodal data: Multimodal models can simultaneously process multiple inputs such as vision, touch, and language, improving the robot's understanding and decision-making capabilities for complex scenarios. For example, in a grasping task, visual data can help the robot identify the position and shape of an object, while tactile data can provide additional feedback to ensure the stability of the grasp.
The difficulty lies in how to achieve efficient integration of multimodal data in the model and how to improve the robot's adaptability in dynamic environments through multimodal data.
Importance of tactile data: Tactile data can provide robots with additional information to help them complete tasks in complex environments. For example, when grasping a flexible object, tactile data can help the robot perceive the deformation and force of the object.
10. It is difficult to achieve a closed loop of robot data
The field of robotics currently lacks iconic datasets like ImageNet, which makes it difficult to form unified evaluation standards in research.
Data collection is expensive, especially for interactive data involving the real world. For example, collecting multimodal data such as touch, vision, and dynamics requires complex hardware and environment support.
Simulators are considered an important tool to solve the data loop problem, but the "Sim-to-Real Gap" between simulation and the real world is still significant.
11. Challenges of Sim-to-Real Gap
There is a gap between the simulator and the real world in terms of visual rendering, physical modeling (such as friction, material properties, etc.). The robot performs well in the simulation environment, but may fail in the real environment. This gap limits the direct application of simulation data.
12. Advantages and Challenges of Real Data
Real data can more accurately reflect the complexity of the physical world, but its collection cost is high. Data annotation is a bottleneck, especially when it comes to annotation of multimodal data (such as touch, vision, and dynamics).
Industrial environments are more standardized and have clearer mission objectives, which are suitable for the early deployment of robotics. For example, in the construction of solar power plants, robots can complete repetitive tasks such as piling, installing panels, and tightening screws. Industrial robots can gradually improve model capabilities and form a closed loop of data through data collection for specific tasks.
13. In robot operation, tactile and force data can provide critical feedback information
In robotic manipulation, tactile and force data can provide critical feedback information, especially in continuous tasks such as grasping and placing.
Form of tactile data: Tactile data is usually time series data, which can reflect the mechanical changes when the robot contacts the object.
The latest research work adds touch to large models.
14. Advantages of Simulation Data
Simulators can quickly generate large amounts of data, which is suitable for early model training and verification. The generation cost of simulation data is low, and it can cover a variety of scenarios and tasks in a short period of time. In the field of industrial robots, simulators have been widely used to train tasks such as grasping and handling.
Limitations of simulation data: The physical modeling accuracy of the simulator is limited, for example, it cannot accurately simulate the material, friction, flexibility and other characteristics of the object. The visual rendering quality of the simulation environment is usually insufficient, which may cause the model to perform poorly in the real environment.
15. Data simulation: Stanford launches a behavior simulation platform
Behavior is a simulation platform centered on home scenarios, supporting 1,000 tasks and 50 different scenarios, covering a variety of environments from ordinary apartments to five-star hotels.
The platform contains more than 10,000 objects, and reproduces the physical and semantic properties of objects (such as cabinet doors that can be opened, clothes that can be folded, glasses that can be broken, etc.) through high-precision 3D models and interactive annotations.
In order to ensure the authenticity of the simulation environment, the team invested a lot of manpower (such as doctoral students annotating data) to carefully annotate the physical properties (mass, friction, texture, etc.) and interactive properties (such as whether it is detachable and whether it will deform). For example, the flexibility of clothes is annotated to support the task of folding clothes, or the wetness effect of plants after watering is annotated.
The Behavior project not only provides a fixed simulation environment, but also allows users to upload their own scenes and objects, and annotate and configure them through the annotation pipeline.
At present, simulation can serve as 80% pretraining, and the remaining 20% needs to be supplemented by data collection and debugging in the real environment.
16. Application of Hybrid Model
Initial training is done with simulated data, and then fine-tuned and optimized with real data. Attempts have been made to scan real scenes into the simulator, allowing the robot to interact and learn in the simulated environment, thereby narrowing the Sim-to-Real Gap.
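A minimal sketch of this hybrid recipe (hypothetical policy and data-loader interfaces): bulk behaviour-cloning pre-training on simulated rollouts, followed by a shorter, lower-learning-rate fine-tuning pass on scarce real-world trajectories.
```python
import torch
import torch.nn.functional as F

def run_epochs(policy, loader, optimizer, epochs):
    for _ in range(epochs):
        for obs, action in loader:
            loss = F.mse_loss(policy(obs), action)   # behaviour cloning on (obs, action) pairs
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

def train_hybrid(policy, sim_loader, real_loader):
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
    run_epochs(policy, sim_loader, optimizer, epochs=10)   # stage 1: cheap simulated data
    for group in optimizer.param_groups:                   # stage 2: scarce real data,
        group["lr"] = 1e-5                                 # fine-tuned at a lower learning rate
    run_epochs(policy, real_loader, optimizer, epochs=2)
    return policy
```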
17. Challenges of robot data sharing
Data is the core asset of a company, and companies are reluctant to share data easily. There is a lack of unified data sharing mechanisms and incentive mechanisms.
Possible solutions:
Data exchange: Task-specific companies contribute data in exchange for the ability to use common models.
Data brokers: Building third-party platforms to collect, aggregate, and distribute data while protecting privacy.
Model sharing: Reduce dependence on raw data through API or model fine-tuning.
At present, some companies are trying these three methods.
18. Choice between dexterous hands and grippers
Advantages of dexterous hands: high degrees of freedom, capable of completing more complex tasks. Dexterous hands can compensate for the inaccuracy of model predictions through adjustments of multiple degrees of freedom.
Advantages of grippers: Low cost, suitable for specific tasks in industrial scenarios. Performs well in material handling tasks on assembly lines, but lacks generalization ability.
19. Co-evolution of hardware and software in embodied robots
The hardware platform and the software model need to be iterated in sync. For example, improving the accuracy of hardware sensors can provide higher-quality data for the model. Different companies have different strategies for hardware-software collaboration.
03 AI Application Investment
1. Silicon Valley VCs believe that 2025 will be a big year for AI application investment
Silicon Valley VCs tend to think that 2025 will be a big opportunity for application investment. In the United States, there are basically no killer apps for everyone. People are used to using apps with different functions in different scenarios. The key is to make the user experience as barrier-free as possible.
Last year, basically no one paid attention to application companies, and everyone was looking at LLM and Foundation model.
When investing in applications, VCs will ask, what's your moat?
One criterion Silicon Valley investors apply to AI products is to focus on one direction and make it hard for competitors to copy: there needs to be some network effect, or insights that are hard to copy, or cutting-edge technology that is hard to copy, or monopoly resources that others cannot obtain. Otherwise it is hard to call it a startup; it is more like a business.
2. Silicon Valley VCs believe AI product companies are a new species
As a new species, AI companies are very different from previous SaaS companies: once they find product-market fit, their revenue ramps very quickly. The real value creation happens before the hype, at the seed stage.
3. A minority of VCs believe that they can consider investing in Chinese entrepreneurs under certain conditions
The reason is: the new generation of Chinese founders are very energetic and capable of creating good business models.
But the prerequisite is that the company is based in the United States.
China and Chinese entrepreneurs are making many new attempts, but international investors are wary and do not understand them; a minority see this as a value gap.
4. Silicon Valley VCs are trying to establish their own investment strategies
Soma Capital: Connect the best people, let the best people introduce their friends, and create life-long friendships. In the process, inspire, support, and connect these people; build a panoramic map, including market segmentation and project mapping, and want to make data-driven investments. We will invest from Seed to C rounds and observe success/failure samples.
Leonis Capital: Research-driven venture capital fund, primarily First Check.
OldFriendship Capital: Work first, invest later. Work with the founder first, conduct customer interviews, determine some interview guidelines, and figure out product problems together, similar to consulting work. Invest in Chinese projects, and judge whether the Chinese founder has the opportunity to work with US customers during work.
Storm Venture: Likes "unlocking growth" and prefers companies with PMF at the Series A stage. They usually have $1-2M in revenue, and the question is whether there is unlocking growth that can carry them to $20M. B2B SaaS is essentially about replacing wages, so it only applies in scenarios with very high labor costs; the biggest enterprise-level opportunity is still automation work.
Inference venture: A $50 million fund that believes barriers are built on interpersonal relationships and domain knowledge.
5. Silicon Valley VCs believe that the requirements for MVPs in the AI era are increasing
Engineering, fintech, HR, and similar functions are where labor is expensive, making them the AI product directions where the most money is at stake.
White-collar work is very expensive: at $40 an hour, labor costs are very high, and workers are only productive about 25% of the time. In the future there may be no middle managers, as they will be eliminated.
Fields with the highest labor costs are generally the ones most easily penetrated by AI. Hospital receptionists are mostly not based in the US and may earn less than US$2 an hour, so it is hard for AI to be competitive there.
There will be a shift from Software as a Service to AI agents.
6. Leonis Capital, founded by OpenAI researchers, has five predictions for AI in 2025
There will be an AI programming app that becomes popular.
Model providers begin to control costs: entrepreneurs need to choose model/agent to create a unique supply.
Cost per action pricing emerges.
Data centers will cause power shocks, and new architectures may emerge.
New frameworks will reduce model sizes, and multi-agent systems will become more mainstream.
7. Criteria for AI-native startups
Compared with competition from big companies: AI-native startups have little money and few people, and their organizational structure differs from that of traditional SaaS companies. Incumbents like Notion and Canva struggle more when adding AI, and Notion does not want to compromise its core features.
The customer acquisition cost of AI-native companies is relatively low, and the ROI their products deliver is relatively clear. When scaling with AI there is no need to hire many people; a company at the 50-million scale may only need 20 people.
As for moats, they lie in model architecture and customization.
8. Large models focus on pre-training, while application companies focus more on reasoning
Each industry has a fixed way and method of looking at problems, and each industry has its own unique Cognitive Architecture. The newly emerging AI Agent adds Cognitive Architecture to the basis of LLM.
9. How to design rewards for AI applications in daily life
For consumer AI applications, the reward can be based on user intention.
Reward signals are hard to read out of everyday use; math and coding are easy to reward.
The topic's effectiveness and the user's geographic location also need to be considered.
Only dynamic rewards are feasible, applied within similar user groups.
10. AI-generated content is not very realistic and may be a new form of content
For example, a cat walking and cooking.
04 AI Coding
1. Possible ideas for model training of AI Coding companies
One possible approach is to use a model company's stronger API at the start to get better results, even if the cost is higher. After accumulating customer usage data, continuously train your own small models for narrow scenarios, gradually replacing some API calls and achieving better results at lower cost.
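One way to picture this migration, as a sketch with hypothetical names throughout (small_model, large_api, the confidence call): serve a scenario with the in-house small model once it is good enough, fall back to the external large-model API otherwise, and log every interaction so the small model can keep being fine-tuned on real usage data.
```python
def complete(request, small_model, large_api, usage_log, quality_threshold=0.8):
    # Route to the cheap in-house model when it is trusted for this scenario,
    # otherwise fall back to the expensive external API.
    if small_model.confidence(request.scenario) >= quality_threshold:
        response, source = small_model.generate(request.prompt), "small"
    else:
        response, source = large_api.generate(request.prompt), "api"   # higher cost
    usage_log.append({"prompt": request.prompt, "response": response, "source": source})
    return response

# Periodically: fine-tune small_model on accepted responses from usage_log, then
# re-evaluate per-scenario quality so more traffic shifts off the expensive API.
```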
2. Differences between Copilot and Agent modes
The main difference between copilots and agents is asynchrony: how asynchronously the AI assistant works while performing tasks. Copilots typically require immediate interaction and feedback from the user (code completion and code chat tools, for example, require users to watch and respond in real time), while agents can work more independently for longer periods before seeking user input, which lets them complete larger tasks.
Initially the agent was designed to work independently for longer stretches (10-20 minutes) before presenting results. However, user feedback showed a preference for more control and more frequent interaction, so the agent was adjusted to work in shorter bursts (a few minutes) before asking for feedback, striking a balance between autonomy and user engagement (a sketch of the two interaction loops follows this item).
Challenges in developing fully autonomous agents: two major obstacles stand in the way. The technology is not yet good enough to handle complex, long-horizon tasks without failing and frustrating users, and users are still getting used to AI assistants making major changes across multiple files or repositories.
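A sketch of the two interaction patterns, with hypothetical editor and model interfaces: the copilot loop blocks on the user after every suggestion, while the agent runs a bounded work loop on its own and only checks in every few minutes or when it finishes.
```python
import time

def copilot_loop(model, editor):
    while editor.is_open():
        suggestion = model.complete(editor.current_context())   # must be near-instant
        editor.show_inline(suggestion)                          # user accepts or rejects immediately

def agent_run(model, task, ask_user, checkpoint_seconds=180):
    plan = model.plan(task)
    started = time.time()
    while not plan.done():
        plan = model.execute_next_step(plan)                    # edits files, runs tests, etc.
        if time.time() - started > checkpoint_seconds:          # periodic check-in for user control
            plan = plan.revise(ask_user(plan.progress_summary()))
            started = time.time()
    return plan.result()
```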
3. Core Challenges and Improvements of Coding Agent
Key areas that require further development include: 1. Event modeling 2. Memory and world modeling 3. Accurately planning for the future 4. Improving context utilization, especially long contexts (context utilization drops significantly for more than 10,000 tokens), and enhancing reasoning capabilities for extended memory lengths (e.g., 100,000 tokens or more). Ongoing research aims to improve memory and reasoning capabilities for longer contexts.
While world modeling may seem unrelated to coding agents, it plays an important role in solving common problems such as inaccurate planning. Solving world modeling challenges can improve the coding agent’s ability to make more efficient and accurate plans.
4. An important trend in AI Coding is the use of reasoning enhancement technology, similar to O3 or O1 methods
The approach could significantly improve the overall efficiency of code agents. While it currently involves high costs (10-100 times more), it could reduce error rates by half or even a quarter. As language models advance, these costs are expected to drop rapidly, which could make this approach a common technical route.
The O3 performed significantly better than other models in benchmark tests, including the Total Forces test, where the industry scores are generally around 50 points, but the O3 scored in the 70-75 range.
SMV scores have increased rapidly over the past few months. A few months ago, the scores were in the 30s, but now they are in the 50s.
Model performance enhancement techniques: Based on internal testing, applying advanced techniques can further improve the score to approximately 62 points. Leveraging O3 can push the score to 74-75 points. While these enhancements may significantly increase costs, the overall performance improvement is significant.
User experience and latency thresholds: Determining the optimal balance between performance and user experience is challenging. For autocomplete features, response times longer than 215-500 milliseconds may cause users to disable the feature. In chat apps, response times of a few seconds are often acceptable, but waiting 50-75 minutes is impractical. The threshold for acceptable latency varies by application and user expectations.
The two main obstacles to maximizing model quality are computing power requirements and associated costs.
5. GitHub Copilot is seen as a major competitor.
6. Customer success is critical to the adoption of AI coding tools.
Post-sales support, training, onboarding, and adoption are key differentiators. One startup has 60-70 people dedicated to customer success, about half of its workforce. It’s a big investment, but it helps ensure customer satisfaction.