The Evolving Agent Development Toolchain and the Stability of the Application Architecture
Wang Chen | Oct 22, 2025
Almost every month, new commercial products or open-source projects appear in the field of Agent development tools, but the application architecture of Agents remains relatively stable.
The Agent Development Toolchain is Rapidly Evolving
Models bring intelligence and autonomy, but they reduce output determinism and consistency. Whether they are base model vendors or providers of development toolchains and operational support, everyone in the ecosystem is ultimately working to make outputs more reliable, though different team DNA and industry judgments lead to diverse implementation paths. Below, we review the evolution of the Agent development toolchain through four stages, connecting the dots among some well-known development tools.
Stage One: Basic Development Framework
At the end of 2022, the release of ChatGPT let the world feel, for the first time, the general-intelligence potential of large language models. At that time, however, LLMs were still isolated islands of intelligence, unable to tap a broad developer base to accelerate the industry's growth.
The first batch of Agent frameworks then emerged, such as LangChain and LlamaIndex. They reduced development complexity through modular abstractions (model communication, ChatClient, prompts, formatted outputs, embeddings, etc.), enabling developers to quickly build chatbots, wire up context, and call models.
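To make the abstraction concrete, here is a minimal sketch in LangChain's composition style; the model name is illustrative and API details vary across framework versions:

```python
# A minimal sketch of the modular abstractions these frameworks introduced;
# the model name is illustrative and APIs vary across versions.
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(   # prompt abstraction
    "Summarize the following in one sentence: {text}"
)
model = ChatOpenAI(model="gpt-4o-mini")      # model-communication abstraction
parser = StrOutputParser()                   # formatted-output abstraction

chain = prompt | model | parser              # compose the pieces into a chain
print(chain.invoke({"text": "Agents can plan, call tools, and reflect on results."}))
```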
In 2024, Spring AI Alibaba was released, providing high-level AI API abstractions and cloud-native infrastructure integration to help Java developers quickly build AI applications. Going forward, it will become part of the AgentScope ecosystem, positioned as the bridge between Spring and AgentScope, with a Java version of AgentScope planned for release at the end of November this year, aligned with the capabilities of AgentScope Python.
As the industry develops rapidly, these basic development frameworks keep evolving as well: gradually supporting or integrating capabilities such as retrieval, retrieval-augmented generation (RAG), memory, tools, evaluation, observability, and AI gateways; providing single-agent, workflow, and multi-agent development paradigms; and shipping framework-based deep-research agents such as DeepResearch and JManus.
Although what development frameworks do may not look attractive to researchers, they play an irreplaceable role in quickly bringing a wide range of developers into the AI development ecosystem.
Stage Two: Collaboration & Tools
Large models are intelligent, but without tools they cannot reach into the physical world: they can neither read from it nor write to it. Meanwhile, the application development frameworks of the first stage are unfriendly to non-programmers, which hinders cross-team collaboration and the participation of domain experts. From 2023 to 2024, low-code and even no-code development frameworks such as Dify and n8n were pushed into enterprise production environments. They define task-processing flows through workflows and if/else branches, and can even generate simple frontend pages from natural language, improving collaboration between domain experts and programmers.
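For intuition, the workflow-plus-branches pattern that these tools expose visually boils down to something like the following sketch; the node names and routing logic are illustrative:

```python
# A minimal sketch of the workflow-with-if/else-branches pattern that low-code
# tools such as Dify and n8n expose visually; all node logic is illustrative.
def classify(ticket: str) -> str:
    # In a real workflow this node would call an LLM; a keyword check stands in.
    return "refund" if "refund" in ticket.lower() else "general"

def refund_flow(ticket: str) -> str:
    return f"Routed to billing team: {ticket}"

def general_flow(ticket: str) -> str:
    return f"Answered from FAQ: {ticket}"

def run_workflow(ticket: str) -> str:
    # Branch node: route the ticket based on the classifier's output.
    label = classify(ticket)
    return refund_flow(ticket) if label == "refund" else general_flow(ticket)

print(run_workflow("I want a refund for my last order"))
```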
On the tools side, OpenAI officially launched Function Calling in June 2023, and in November 2024 Anthropic released the MCP protocol, enabling cross-model tool interoperability. MCP in particular significantly energized the developer ecosystem.
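Function Calling works by declaring available functions to the model, which replies with a structured call instead of free text; a minimal sketch with the OpenAI Python SDK (the model name and function are illustrative):

```python
# A minimal Function Calling sketch: declare a function schema, let the model
# return a structured call. Model name and function are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the weather in Hangzhou?"}],
    tools=tools,
)
call = resp.choices[0].message.tool_calls[0]
print(call.function.name, call.function.arguments)  # get_weather {"city": "Hangzhou"}
```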
Together, these two forces pushed the Agent development toolchain into its second stage: Collaboration & Tools.
However, merely lowering the barrier to building applications and letting tools call external systems did not adequately solve the consistency and reliability of outputs, so the evolution of the developer toolchain entered deeper waters. In 2025, Andrej Karpathy championed the term context engineering, which resonated across the industry: how to select context, organize it, and dynamically adjust its structure across different tasks became the key to stabilizing outputs, and reinforcement learning (hereafter RL) entered the picture.
Stage Three: Reinforcement Learning
System prompts, knowledge bases, tools, and memory are the critical components of context engineering. The mechanisms have matured, yet outputs remain volatile, and RL is what turns context engineering from static templates into intelligent, dynamic strategies. For example (a toy sketch follows this list):
RAG Retrieval Ranking: RL optimizes document reordering strategies to make contexts closer to task semantics, reducing irrelevant noise.
Multi-turn Dialogue Memory: RL optimizes strategies for memory retention and forgetting, maintaining coherence in dialogues over long-term interactions.
Tool Invocation: RL learns the timing of invocation and parameter construction, improving the effectiveness and accuracy of tool calls.
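As a toy illustration of the first item, the sketch below trains a linear scoring policy with a REINFORCE-style update, using a binary reward (say, whether the user accepted the answer); it is illustrative only, not any vendor's training recipe:

```python
# Toy REINFORCE sketch: learn which document to rank first from reward alone.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

docs = rng.normal(size=(6, 4))             # 6 candidate docs, 4 features each
true_w = np.array([1.0, -0.5, 0.3, 0.8])   # hidden relevance direction
best = int(np.argmax(docs @ true_w))       # the doc the policy should surface

w = np.zeros(4)   # policy parameters: a linear scoring function over features
lr = 0.1

for _ in range(2000):
    probs = softmax(docs @ w)
    pick = rng.choice(len(docs), p=probs)   # sample which doc to rank first
    reward = 1.0 if pick == best else 0.0   # e.g. user accepted the answer
    # REINFORCE update: grad log pi(pick) = phi(pick) - E[phi] under the policy
    w += lr * reward * (docs[pick] - probs @ docs)

print("learned:", int(np.argmax(docs @ w)), "target:", best)
```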
Applying RL is challenging for the industry: it depends on algorithmic expertise, demands sufficient domain know-how, and struggles to generalize. Still, there are noteworthy practices worth examining.
Recently, Jina.AI was acquired by Elastic. Its CEO, Dr. Han Xiao, shared his thinking in the article “The Future of Search is Hidden in a Bunch of Small Models,” describing the foundation models Jina.AI built for search, which mainly comprise embeddings, rerankers, and readers (a usage sketch follows this list):
Embeddings are vectorization models designed for multilingual and multimodal data, converting text or images into fixed-length vectors.
Rerankers are fine-tuned models that take a query and a set of documents as direct input and rank the documents by relevance to the query.
Readers apply small generative language models (SLMs) to single-document intelligence tasks such as data cleaning, filtering, and extraction.
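A sketch of how embeddings and rerankers combine in practice, using the open sentence-transformers library with generic public model ids rather than Jina's own models:

```python
# Embed-then-rerank sketch; model ids are generic public ones, used here
# only to illustrate the pipeline shape.
import numpy as np
from sentence_transformers import CrossEncoder, SentenceTransformer

query = "how do I rotate API keys safely?"
docs = [
    "Rotate keys by issuing a new credential before revoking the old one.",
    "Our cafeteria menu changes weekly.",
    "Store secrets in a vault and schedule periodic rotation.",
]

# Step 1: embeddings give a cheap first-pass retrieval signal (cosine similarity).
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = embedder.encode(docs, normalize_embeddings=True)
q_vec = embedder.encode(query, normalize_embeddings=True)
candidates = np.argsort(doc_vecs @ q_vec)[::-1][:2]   # keep top-2 candidates

# Step 2: the reranker scores (query, document) pairs jointly, which is more
# precise than similarity over independently computed embeddings.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, docs[i]) for i in candidates])
best = candidates[int(np.argmax(scores))]
print(docs[best])
```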
Additionally, Alibaba Cloud's API gateway provides RL-based tool selection and semantic search, improving the quality of batch MCP calls and reducing invocation time. Take tool selection as an example: through reranking and optional query rewriting, the tool list is pre-processed and filtered before the request reaches the large language model, which improves responsiveness and selection accuracy on large toolsets and reduces token cost (a sketch of this pre-filtering step follows the results below). In evaluations on Salesforce's open datasets with toolsets of 50/100/200/300/400/500 tools, the results show:
Accuracy Improvement: After query rewriting and tool reranking, the accuracy of tool and parameter selection can improve by up to 6%.
Response Speed Improvement: When the scale of toolsets exceeds 50, response time (RT) significantly decreases. In a scenario testing 500 tools, responsiveness can improve by up to 7 times.
Cost Reduction: Token consumption (and thus cost) can be cut by a factor of 4 to 6.
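The pre-filtering idea itself is straightforward; below is a minimal sketch (not Alibaba Cloud's implementation, which additionally applies RL-trained reranking and query rewriting):

```python
# A minimal sketch of pre-filtering a large tool list before the LLM request;
# the embedding model id and tool definitions are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model id

tools = [  # in practice: hundreds of MCP tool descriptions
    {"name": "get_weather", "description": "Get the current weather for a city."},
    {"name": "create_invoice", "description": "Create a billing invoice for a customer."},
    {"name": "search_flights", "description": "Search flights between two airports."},
]

def preselect_tools(query: str, tools: list, k: int = 2) -> list:
    """Rank tools by semantic similarity to the query and keep only the top k."""
    tool_vecs = embedder.encode([t["description"] for t in tools], normalize_embeddings=True)
    q_vec = embedder.encode(query, normalize_embeddings=True)
    order = np.argsort(tool_vecs @ q_vec)[::-1]
    return [tools[i] for i in order[:k]]

# Only the shortlist is serialized into the model request, cutting token cost
# and narrowing the model's choice on large toolsets.
shortlist = preselect_tools("book me a flight to Tokyo", tools)
print([t["name"] for t in shortlist])
```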
Enterprises that excel at RL generally treat this capability as a competitive moat and a billable feature of their commercial products, so it does not reach developers as quickly or as freely as frameworks and tools do. The evolution therefore entered a fourth stage, in which the base model vendors themselves took over context engineering.
Stage Four: Model Centralization
In October 2025, OpenAI's AgentKit and Apps SDK and Anthropic's Claude Skills were released, marking the start of a model-centric era for Agent engineering.
OpenAI AgentKit and Apps SDK: provide a first-party Agent development toolchain, hosting memory, the tool registry, and external-application invocation logic directly on the model side, lowering the development threshold.
Claude Skills: let the model itself load and manage skills; the user only provides input, and the model internally constructs the context and the capability-invocation chain.
In particular, Claude Skills can build capabilities, connect to MCP tools, and even execute Python scripts to interface directly with APIs, with the large model generating new Skills. Construction, execution, and operation, once the developer's responsibility, now shift to the framework side.
Judging from the official Claude Skills examples, this directly helps with output consistency and determinism, especially in collaborative scenarios. Experience and requirements, whether in xlsx, pptx, docx, or code form, are ultimately handed to Claude as a SKILL.md file for context input. For instance, to build a new system that reuses the permission model and visual specification of an existing one, you can follow the official documentation and create a SKILL.md that bundles the current system's permission-design code and visual spec, enabling reuse without redesign or repeated debugging.
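Based on the official examples, a skill is essentially a folder whose SKILL.md carries YAML frontmatter (name and description) plus instructions; the content below is an illustrative sketch of the permission-reuse scenario, not official sample code, and the referenced files are hypothetical:

```markdown
---
name: permission-system-reuse
description: Reuse the existing system's permission model and visual spec when building new pages.
---

# Permission System Reuse

When generating a new page:
1. Apply the role model defined in `permissions.md` (roles: admin, editor, viewer).
2. Follow the color tokens and spacing rules in `visual-spec.md`.
3. Never expose admin-only actions to viewer-role users.
```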
Agent Application Architecture is Relatively Stable
In contrast to the rapid changes in the Agent development toolchain, the architecture of Agent applications, which maps onto the underlying infrastructure, remains relatively stable.
As we shared in the "AI Native Application Architecture White Paper", an AI-native application architecture comprises 11 key elements: models, development frameworks, prompts, RAG, memory, tools, AI gateways, runtimes, observability, evaluation, and security.
AI Gateway: responsible for aggregating and intelligently scheduling models and tools, and for access authentication, load balancing, traffic governance, and so on. Whatever tools developers use, they ultimately need a central control hub covering load balancing, rate limiting, identity authentication, and invocation-chain tracing (a minimal sketch follows this list).
Runtime: provides the operating environment and computational support, handling task scheduling, state management, security isolation, timeout management, concurrency tracing, and so on, so that Agents run reliably and economically. Whether an Agent is deployed on-premises or orchestrated across multiple models in a public cloud, everything comes back to GPU resource allocation, concurrent model inference, and the efficiency of container isolation and scheduling. These capabilities are not frequently restructured by updates at the tool layer.
Observability: because Agents are complex systems dynamically composed of multiple elements (models, tools, RAG, memory, etc.), a lack of unified observability directly makes the engineering uncontrollable. The industry has largely converged on this layer: provide end-to-end log collection and trace linking, and expose request throughput, error rates, and resource usage for applications, gateways, and inference engines, so the system runs stably, securely, and efficiently. Structurally, this layer evolves slowly.
Security: Agent security still follows the general logic of cloud-computing architectures, including identity authentication, access control, data masking, and protection against privilege escalation. In multi-model, multi-tenant environments especially, the determinism of security policies matters more than the flexibility of toolchains.
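Returning to the AI Gateway item above, here is a minimal sketch of the control-hub role (authenticate, rate-limit, load-balance) in plain Python; the endpoints, keys, and limits are all illustrative, not any vendor's implementation:

```python
# A minimal sketch of the control-hub role an AI gateway plays:
# authenticate, rate-limit, then route the request to a model backend.
import time

API_KEYS = {"sk-demo": "team-a"}  # identity -> tenant (illustrative)
BACKENDS = ["http://llm-a:8000", "http://llm-b:8000"]  # hypothetical endpoints

class TokenBucket:
    """Per-tenant token-bucket rate limiter."""
    def __init__(self, rate: float, burst: int):
        self.rate, self.capacity = rate, burst
        self.tokens, self.last = float(burst), time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets: dict[str, TokenBucket] = {}
cursor = 0  # round-robin position for load balancing

def route(api_key: str) -> str:
    """Authenticate, rate-limit, then pick a backend for this request."""
    global cursor
    tenant = API_KEYS.get(api_key)
    if tenant is None:
        raise PermissionError("unknown API key")   # identity authentication
    bucket = buckets.setdefault(tenant, TokenBucket(rate=5, burst=10))
    if not bucket.allow():
        raise RuntimeError("rate limit exceeded")  # traffic governance
    cursor += 1
    return BACKENDS[cursor % len(BACKENDS)]        # load balancing

print(route("sk-demo"))
```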
Rapid toolchain iteration and innovation improve output reliability; the runtime layers (gateways, compute, observability, and security) keep applications running stably, economically, and safely. It is precisely this structure of "fast-changing upper layers, stable lower layers" that lets the AI application ecosystem innovate rapidly without descending into systemic chaos.