**Context engineering**

Part of Machine Learning for Engineers

3rd August 2025 - 13 minute read

---

As our use of LLMs has changed from conversational chatbots and into integral decision-making components of complex systems, our inference approach must also evolve. The practice of "prompt engineering", in which precise wording is submitted to the LLM to elicit desired responses, has serious limitations. And so this is giving way to a more general practice of considering every token fed into the LLM in a way that is more dynamic, targeted, and deliberate. This expanded, more structured practice is what we now call "context engineering."

> Throughout, we'll use a toy example of understanding how an LLM might help us answer a subjective question such as "What is the best sci-fi film?"

---

### **Context windows**

An LLM is a machine learning model that understands language by modelling it as a sequence of tokens and learning the meaning of those tokens from the patterns of their co-occurrence in large datasets. The number of tokens that the model can comprehend is a fixed quantity for each model, often in the hundreds of thousands, and is known as the context window.

LLMs are trained through repeated exposure to coherent token sequences — normally large textual databases scraped from the internet. Once trained, we use the LLM by running "inference" (i.e. prediction) of the next token based on all the previous tokens in a sequence. This sequence of previous tokens is what we used to refer to as the prompt.

Inference continues the token sequence by adding high-probability tokens to the sequence one at a time.

> When prompted to complete the sentence "the best sci-fi film ever made is...", the highest probability tokens to be generated might be probably, star, and wars.

Early uses of LLMs focused on this mode of "completion", taking partially written texts and predicting each subsequent token in order to complete the text along the desired lines. While impressive at the time, this was limiting in several ways, including that it was difficult to instruct the LLM exactly how you wished the text to be completed.

---

### **Chat framing**

To address this limitation, model providers started training their models to expect sequences of tokens that framed conversations, with special tokens inserted to indicate the hand-off between two speakers. By learning to replicate this "chat" framing when generating a completion, models were suddenly far more usable in conversational settings, and therefore easier to instruct.
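To make the framing concrete, here is a minimal sketch of what a chat-framed request can look like from the caller's side: a list of role-tagged messages that gets flattened into a single token sequence before next-token prediction. The role names follow the common system/user/assistant convention, and the delimiter tokens in the flattening step are invented purely for illustration; each provider uses its own special tokens.

```python
# A chat-framed request: the caller supplies role-tagged messages, and the
# provider flattens them into one token sequence, with special tokens marking
# the hand-off between speakers, before running next-token prediction.
messages = [
    {"role": "system", "content": "You are a film critic."},
    {"role": "user", "content": "What is the best sci-fi film?"},
]

def flatten(messages: list[dict]) -> str:
    """Illustrative flattening; the <|start|>/<|end|> delimiters are made up."""
    parts = [f"<|start|>{m['role']}\n{m['content']}<|end|>\n" for m in messages]
    parts.append("<|start|>assistant\n")  # the completion continues from here
    return "".join(parts)

print(flatten(messages))
```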
The context window started to be more greedily filled up by different types of messages — system messages (special instructions telling the LLM what to do), and chat history from both the user and the response from the LLM itself.

> With a chat framing, we can instruct the LLM that it is "a film critic" before "asking" it what the best sci-fi film is. Maybe we'll now get the response tokens blade and runner, as the AI plays the role of a speaker likely to reflect critical rather than popular consensus.

The crucial point to understand here is that the LLM architecture did not change — it was still just predicting the next token one at a time. But it was now doing that with a worldview learned from a training dataset that framed everything in terms of delimited back-and-forth conversations, and so would consistently respond in kind.

---

### **Prompt engineering**

In this setting, getting the most out of LLMs involved finding the perfect sequence of prompt tokens to elicit the best completions. This was the birth of so-called "prompt engineering", though in practice there was often far less "engineering" than trial-and-error guesswork. This could often feel closer to uttering mystical incantations and hoping for magic to happen, rather than the deliberate construction and rigorous application of systems thinking that epitomises true engineering.

> We might try imploring the AI to reflect critical consensus with a smarter system prompt, something like You are a knowledgeable and fair film critic who is aware of the history of cinema awards. We might hope that this will "trick" the LLM into generating more accurate answers, but this hope rests on linguistic probability and offers no guarantees.

---

### **In-context learning**

As LLMs got smarter and more reliable, we were able to feed them more complex sequences of tokens, covering different types of structured and unstructured data. This enabled LLMs to produce completions that displayed "knowledge" of probable token sequences based on novel structures in the prompt, rather than just remembered patterns from their training dataset. This mode of feeding examples to the LLM is known as in-context learning because the LLM appears to "learn" how to produce output purely based on example sequences within its context window.

This approach led to an explosion of different token sequences that we might programmatically include within the prompt (a sketch showing how these pieces are assembled follows the list):

* **Hard-coded examples**, taken from our knowledge domain (documentation, past examples of good output from human or generated sources, toy examples) to encourage predictable output.
* **Non-text modalities**, with tokens that represented images, audio, or video, that were either directly part of the context window, or first transcribed to text and then tokenised.
* **Tool and function calls**, defining external functions that the LLM could tell the caller to invoke to access data or computation from the outside world.
* **Documents and summaries**, returned via "RAG" from data sources, or uploaded by users, to feed knowledge into the LLM that lay outside its training dataset.
* **Memory and conversation history**, condensing information from prior chats, that allowed continuity between a single user and the "chatbot" over multiple conversations.
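Seen this way, the prompt stops being something we write by hand and becomes something we assemble from parts. The sketch below shows one way that assembly might look; the helper names, the worked example, and the tool schema are all assumptions made for illustration rather than any particular library's API.

```python
# Illustrative context assembly: each ingredient from the list above becomes a
# block of tokens composed into the final message sequence fed to the model.
FEW_SHOT_EXAMPLE = (
    "Q: What is the best heist film?\n"
    "A: Critics' polls most often cite Rififi (1955) and Heat (1995).\n"
)

TOOLS = [{
    "name": "divide",
    "description": "Divide one number by another.",
    "parameters": {"a": "number", "b": "number"},
}]

def build_context(question: str, retrieved_docs: list[str], memory: dict) -> list[dict]:
    """Compose hard-coded examples, tools, documents, and memory into one context."""
    return [
        {"role": "system", "content": "You are a knowledgeable and fair film critic."},
        {"role": "system", "content": f"Worked example:\n{FEW_SHOT_EXAMPLE}"},
        {"role": "system", "content": f"Available tools: {TOOLS}"},
        {"role": "system", "content": "Relevant documents:\n" + "\n---\n".join(retrieved_docs)},
        {"role": "system", "content": f"Remembered user preferences: {memory}"},
        {"role": "user", "content": question},
    ]

context = build_context(
    "What is the best sci-fi film?",
    retrieved_docs=["Sight & Sound 2022 critics' poll: ...", "Rotten Tomatoes ratings: ..."],
    memory={"ignore_reviews_from": []},
)
```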
> In our sci-fi film example, our prompt could include many things to help the LLM: historic box office receipts, lists of the hundred greatest films from various publications, Rotten Tomatoes ratings, the full history of Oscar winners, etc.

Suddenly, our 100,000+ token context window isn't looking so generous anymore, as we stuff it with tokens from all kinds of places.

This expansion of context not only depletes the available context window for output generation, it also increases the overall footprint and complexity of what the LLM is paying attention to at any one time. This then increases the risk of failure modes such as hallucination. As such, we must start approaching its construction with more nuance — considering brevity, relevance, timeliness, safety, and other factors.

At this point, we aren't simply "prompt engineering" anymore. We are beginning to engineer the entire context in which generation occurs.

---

### **From oracle to analyst**

Language encodes knowledge, but it also encodes meaning, logic, structure, and thought. Training an LLM to encode knowledge of what exists in the world, and to be capable of producing language that would describe it, therefore, also produces a system capable of simulating thought. This is, in fact, the key utility of an LLM, and to take advantage of it requires a mindset shift in how we approach inference.

To adopt context engineering as an approach to LLM usage is to reject using the LLM as a mystical oracle to approach, pray to with muttered incantations, and await the arrival of wisdom. We instead think of briefing a skilled analyst: bringing them all the relevant information to sift through, clearly and precisely defining the task at hand, documenting the tools available to complete it, and avoiding reliance on outdated, imperfectly remembered training data.

In practice, our integration of the LLM shifts from "crafting the perfect prompt" towards the precise construction of exactly the right set of tokens needed to complete the task at hand.
Managing context becomes an engineering problem, and the LLM is reframed as a task solver whose natural output is natural language.

---

### **Engineering context for agentic behaviour**

Let's consider a simple question you might wish an LLM to answer for you:

> What is the average weekly cinema box office revenue in the UK?

In "oracle" mode, our LLM will happily quote a value learned from the data in its training dataset prior to its cutoff:

> As of 2019, the UK box office collects roughly £24 million in revenue per week on average.

This answer from GPT 4.1 is accurate, but imprecise and outdated. Through context engineering, we can do a lot better. Consider what additional context we might feed into the context window before generating the first token of the response:

* The date, so we use updated stats (GPT 4.1 thinks now is June 2024)
* Actual published statistics such as this BBC News article
* Instructions on how to tell the caller to divide two numbers

The above should be enough for the LLM to know how to: look for data for 2024; extract the total figure of £979 million from the document; and call an external function to precisely divide that by 52 weeks. Assuming the caller then runs that calculation and invokes the LLM again, with all the above context, plus its own output, plus the result of the calculation, we will then get our accurate answer:

> Across the full year of 2024, the UK box office collected £18.8 million in revenue per week on average.

Even this trivial example involves multiple ways of engineering the context before generating the answer, as the sketch after this list illustrates:

* Stating the current date and desired outcome;
* Searching for and returning relevant documents;
* Documenting available calculation operations;
* Expanding context with intermediary results.
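A minimal sketch of how the caller-side loop for this example might look is shown below. The model call is replaced by a canned stub so the snippet runs on its own; the shape of the tool-call reply and the article text are simplifying assumptions, not any specific provider's API.

```python
# Simplified caller-side loop for the box office question: assemble context,
# let the model request a tool call, run it, then re-run inference with the
# expanded context. call_model() is a stub standing in for a real LLM call.
from datetime import date

def divide(a: float, b: float) -> float:
    return a / b

TOOLS = {"divide": divide}

context = [
    {"role": "system", "content": f"Today's date is {date.today().isoformat()}. "
                                  "Use the supplied article; call tools for any arithmetic."},
    {"role": "system", "content": "Article: UK cinemas took £979 million at the box office in 2024."},
    {"role": "user", "content": "What is the average weekly cinema box office revenue in the UK?"},
]

def call_model(context: list[dict]) -> dict:
    """Stub LLM: first asks for a division, then answers once the result is in context."""
    if not any(m["role"] == "tool" for m in context):
        return {"tool_call": {"name": "divide", "args": {"a": 979_000_000, "b": 52}}}
    return {"content": "Across 2024, the UK box office collected roughly £18.8 million per week."}

reply = call_model(context)
while reply.get("tool_call"):                     # the model asks the caller to act
    call = reply["tool_call"]
    result = TOOLS[call["name"]](**call["args"])  # run the calculation precisely
    context.append({"role": "tool", "content": str(result)})
    reply = call_model(context)                   # inference again, with more context

print(reply["content"])
```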
Fortunately, we do not need to invent a new approach every single time.

---

### **Is this just RAG?**

Retrieval-augmented generation (RAG) is a fashionable technique for injecting external knowledge into the context window at inference time. Leaving aside implementation details of how to identify the correct documents to include, we can clearly see that this is another specific form of context engineering.

This is a useful and obvious way to use pre-trained LLMs in contexts that need access to knowledge outside the training dataset.

For a correct answer, our application needs to be aware of up-to-date film reviews, ratings, and awards, to track new films and critical opinion after the point the model was trained. By including relevant extracts in the context window, we enable our LLM to generate completions with today's data and avoid hallucination.

To do this, we can search for relevant documents and then include them in the context window. If this sounds conceptually simple, that is because it is — though reliable implementation is not trivial and requires robust engineering.
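As a sketch of the retrieval step under those caveats: the snippet below scores documents against the question with a crude word-overlap measure so that it stays self-contained. A production system would use an embedding model and a vector store, and the document snippets here are placeholders.

```python
# Minimal RAG sketch: score documents against the query, keep the best matches,
# and place them in the context window ahead of the user's question.
DOCUMENTS = [
    "Sight & Sound critics' poll: 2001: A Space Odyssey ranked highest among sci-fi films.",
    "UK box office receipts totalled £979 million in 2024.",
    "Rotten Tomatoes: recent sci-fi releases and their critic scores ...",
]

def score(query: str, doc: str) -> float:
    """Crude word-overlap relevance; a stand-in for embedding similarity."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / (len(q) or 1)

def retrieve(query: str, k: int = 2) -> list[str]:
    return sorted(DOCUMENTS, key=lambda doc: score(query, doc), reverse=True)[:k]

question = "What is the best sci-fi film?"
context = [
    {"role": "system", "content": "Answer using only the documents provided."},
    {"role": "system", "content": "Documents:\n" + "\n".join(retrieve(question))},
    {"role": "user", "content": question},
]
```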
Complex systems can be brittle and opaque to build. We need a way to scale complexity without harming our ability to maintain, debug, and reason about our code. Fortunately, we can apply the same thinking that traditional software design used to solve this same problem.

We can think of RAG as simply the first of many design patterns for context engineering. And just as with other software engineering design patterns, in future we will find that most complex systems will have to employ variations and combinations of such patterns in order to be most effective.

---

### **Composition over inheritance**

In software engineering, design patterns promote reusable software by providing proven, general solutions to common design problems. They encourage composition over inheritance, meaning systems are built from smaller, interchangeable components rather than rigid class hierarchies. They make your codebase more flexible, testable, and easier to maintain or extend. They are a crucial piece of the software design toolkit that enables engineers to build large functioning codebases that can scale over time.

Some examples of software engineering design patterns include:

* **Factory**: standardises object creation to make isolated testing easier
* **Decorator**: extends behaviour without editing the original
* **Command**: passes work around as a value, similar to a lambda function
* **Facade**: hides internals with a simple interface to promote abstraction
* **Dependency injection**: wires modules externally using configuration

These patterns were developed over a long time, though many were first codified in a single book. Context engineering is a nascent field, but already we see some common patterns emerging that adapt LLMs well to certain tasks:

* **RAG**: inject retrieved documents based on relevance to user intent
* **Tool calling**: list available tools and inject results into the context
* **Structured output**: fix a JSON/XML schema for the LLM completions
* **Chain of thought / ReAct**: emit reasoning tokens before answering
* **Context compression**: summarise long history into pertinent facts
* **Memory**: store and recall salient facts across sessions

In our examples above, we have already used some of these patterns:

* **RAG** for getting film reviews, critics' lists, and box office data
* **Tool calling** to calculate weekly revenues accurately

Some of the other techniques, such as ReAct, could help our LLM to frame and verify its responses more carefully, counterbalancing the weight of linguistic probability learnt from its training data.

By seeing each as a context engineering design pattern, we are able to pick the right ones for the task at hand, compose them into an "agent", and avoid compromising our ability to test and reason about our code.

---

### **Extending to multiple agents**

Production systems that rely on LLMs for decision-making and action will naturally evolve towards multiple agents with different specialisations: safety guardrails; information retrieval; knowledge distillation; human interaction; etc. Each of these is a component that interprets a task, then returns a sequence of tokens indicating actions to take, the information retrieved, or both.

For our multi-agent film ranker, we might need several agents:

* **Chatbot Agent**: to maintain a conversation with the user
* **Safety Agent**: to check that the user is not acting maliciously
* **Preference Agent**: recalls if the user wants to ignore some reviews
* **Critic Agent**: to synthesise sources and make a final decision

Each of these is specialised for a given task, but this can be done purely through engineering the context they consume, including outputs from other agents in the system.

Outputs are then passed around the system and into the context windows of other agents. At every step, the crucial aspect to consider is the patterns by which token sequences are generated, and how the output of one agent will be used as context for another agent to complete its own task. The hand-off token sequence is effectively the contract for agent interaction — apply as much rigour to it as you would any other API within your software architecture.
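One way to give that contract teeth is to pin the hand-off down as a typed schema and validate it at the boundary, exactly as you would a request body on any internal API. The schema and field names below are invented for the film-ranker example and are only one possible shape for the contract.

```python
# An invented hand-off contract for the film ranker: the Critic Agent must emit
# JSON matching this schema, and consumers validate it before using it as
# context, with the same rigour applied to any other API in the system.
import json
from dataclasses import dataclass

@dataclass
class CriticVerdict:
    film: str
    year: int
    rationale: str
    sources: list[str]  # identifiers of the documents the verdict drew on

def parse_verdict(raw: str) -> CriticVerdict:
    """Validate the Critic Agent's completion at the agent boundary."""
    data = json.loads(raw)
    return CriticVerdict(
        film=str(data["film"]),
        year=int(data["year"]),
        rationale=str(data["rationale"]),
        sources=[str(s) for s in data["sources"]],
    )

raw_completion = (
    '{"film": "Blade Runner", "year": 1982, '
    '"rationale": "Reflects critical rather than popular consensus.", '
    '"sources": ["sight-and-sound-2022", "rotten-tomatoes"]}'
)
verdict = parse_verdict(raw_completion)  # raises if the contract is broken
```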
---

### **Summary**

Context engineering is the nascent but critical discipline that governs how we are able to effectively guide LLMs into solving the tasks we feed into them. As a subfield of software engineering, it benefits from systems and design thinking, and we can learn lessons from the application of design patterns for producing software that is modular, robust, and comprehensible.

When working with LLMs, we must therefore:

* Treat the LLM as an **analyst**, not an oracle. Give it whatever it needs to solve the task.
* Take responsibility for the **entire context window**, not just the system and user prompts.
* Use composable, reusable **design patterns** that can be engineered and tested in isolation.
* Frame the hand-off between agents as an **API contract** between their context windows.

By doing these, we can control in-context learning with the same rigour as any other engineered software.