# From Words to Worlds: Spatial Intelligence Is AI’s Next Frontier

**By Fei-Fei Li**

*Nov 10, 2025*

In 1950, when computing was little more than automated arithmetic and simple logic, Alan Turing asked a question that still reverberates today: can machines think? It took remarkable imagination to see what he saw: that intelligence might someday be built rather than born. That insight later launched a relentless scientific quest called Artificial Intelligence (AI). Twenty-five years into my own career in AI, I still find myself inspired by Turing’s vision. But how close are we? The answer isn’t simple.

Today, leading AI technologies such as large language models (LLMs) have begun to transform how we access and work with abstract knowledge. Yet they remain wordsmiths in the dark: eloquent but inexperienced, knowledgeable but ungrounded. **Spatial intelligence will transform how we create and interact with real and virtual worlds, revolutionizing storytelling, creativity, robotics, scientific discovery, and beyond. This is AI’s next frontier.**

The pursuit of visual and spatial intelligence has been the North Star guiding me since I entered the field. It’s why I spent years building ImageNet, the first large-scale visual learning and benchmarking dataset and one of three key elements enabling the birth of modern AI, along with neural network algorithms and modern compute like graphics processing units (GPUs). It’s why [my academic lab at Stanford](https://svl.stanford.edu/) has spent the last decade combining computer vision with robotic learning.
And it’s why my cofounders Justin Johnson, Christoph Lassner, Ben Mildenhall, and I created [World Labs](https://www.worldlabs.ai/) more than one year ago: to realize this possibility in full, for the first time.

In this essay, I’ll explain what spatial intelligence is, why it matters, and how we’re building the world models that will unlock it, with impact that will reshape creativity, embodied intelligence, and human progress.

### Spatial Intelligence: The scaffolding of human cognition

AI has never been more exciting. Generative AI models such as LLMs have moved from research labs to everyday life, becoming tools of creativity, productivity, and communication for billions of people. They have demonstrated capabilities once thought impossible, producing coherent text, mountains of code, photorealistic images, and even short video clips with ease. It’s no longer a question of whether AI will change the world. By any reasonable definition, it already has.

Yet so much still lies beyond our reach. The vision of autonomous robots remains intriguing but speculative, far from the fixtures of daily life that futurists have long promised. The dream of massively accelerated research in fields like disease treatment, new material discovery, and particle physics remains largely unfulfilled.
And the promise of AI that truly understands and empowers human creators (whether students learning intricate concepts in molecular chemistry, architects visualizing spaces, filmmakers building worlds, or anyone seeking fully immersive virtual experiences) remains beyond reach.

To learn why these capabilities remain elusive, we need to examine how spatial intelligence evolved, and how it shapes our understanding of the world.

Vision has long been a cornerstone of human intelligence, but its power emerged from something even more fundamental. Long before animals could nest, care for their young, communicate with language, or build civilizations, the simple act of sensing quietly sparked an evolutionary journey toward intelligence.

This seemingly isolated ability to glean information from the external world, whether a glimmer of light or the feeling of texture, created a bridge between perception and survival that only grew stronger and more elaborate as the generations passed. Layer upon layer of neurons grew from that bridge, forming nervous systems that interpret the world and coordinate interactions between an organism and its surroundings. Thus, many scientists have conjectured that **perception and action became the core loop driving the evolution of intelligence**, and the foundation on which nature created our species, the ultimate embodiment of perceiving, learning, thinking, and doing.

Spatial intelligence plays a fundamental role in defining how we interact with the physical world. Every day, we rely on it for the most ordinary acts: parking a car by imagining the narrowing gap between bumper and curb, catching a set of keys tossed across the room, navigating a crowded sidewalk without collision, or sleepily pouring coffee into a mug without looking. In more extreme circumstances, firefighters navigate collapsing buildings through shifting smoke, making split-second judgements about stability and survival, communicating through gestures, body language, and a shared professional instinct for which there’s no linguistic substitute. And children spend the entirety of their pre-verbal months or years learning the world through playful interactions with their environments. All of this happens intuitively and automatically, with a fluency machines have yet to achieve.

Spatial intelligence is also foundational to our imagination and creativity. Storytellers create uniquely rich worlds in their minds and leverage many forms of visual media to bring them to others, from ancient cave painting to modern cinema to immersive video games. Whether it’s children building sandcastles on the beach or playing Minecraft on the computer, spatially grounded imagination forms the basis for interactive experiences in real or virtual worlds.
And in many industry applications, simulations of objects, scenes, and dynamic interactive environments power countless critical business use cases, from industrial design to digital twins to robotic training.

History is full of civilization-defining moments where spatial intelligence played central roles. In ancient Greece, Eratosthenes transformed shadows into geometry (measuring a 7-degree angle in Alexandria at the exact moment the sun cast no shadow in Syene) to calculate the Earth’s circumference. Hargreaves’s “Spinning Jenny” revolutionized textile manufacturing through a spatial insight: arranging multiple spindles side by side in a single frame allowed one worker to spin multiple threads simultaneously, increasing productivity eightfold. Watson and Crick discovered DNA’s structure by physically building 3D molecular models, manipulating metal plates and wire until the spatial arrangement of base pairs clicked into place. In each case, spatial intelligence drove civilization forward when scientists and inventors had to manipulate objects, visualize structures, and reason about physical spaces, none of which can be captured in text alone.

**Spatial Intelligence is the scaffolding upon which our cognition is built.** It’s at work when we passively observe or actively seek to create. It drives our reasoning and planning, even on the most abstract topics. And it’s essential to the way we interact, verbally or physically, with our peers or with the environment itself.
While most of us aren’t revealing new truths on the level of Eratosthenes most days, we *routinely* think in the same way: making sense of a complex world by perceiving it through our senses, then leveraging an intuitive understanding of how it works in physical, spatial terms.

Unfortunately, today’s AI doesn’t think like this yet.

Tremendous progress has indeed been made in the past few years. Multimodal LLMs (MLLMs), trained with voluminous multimedia data in addition to textual data, have introduced some basics of spatial awareness, and today’s AI can analyze pictures, answer questions about them, and generate hyperrealistic images and short videos. And through breakthroughs in sensors and haptics, our most advanced robots can begin to manipulate objects and tools in highly constrained environments.

Yet the candid truth is that AI’s spatial capabilities remain far from human level. And the limits reveal themselves quickly. State-of-the-art MLLMs rarely perform better than chance on estimating distance, orientation, and size, or on “mentally” rotating objects by regenerating them from new angles. They can’t navigate mazes, recognize shortcuts, or predict basic physics. AI-generated videos (nascent and, yes, very cool) often lose coherence after a few seconds.

While current state-of-the-art AI can excel at reading, writing, research, and pattern recognition in data, these same models bear fundamental limitations when representing or interacting with the physical world. Our view of the world is holistic: not just what we’re looking at, but how everything relates spatially, what it means, and why it matters. Understanding this through imagination, reasoning, creation, and interaction, not just descriptions, is the power of spatial intelligence. Without it, AI is disconnected from the physical reality it seeks to understand. It cannot effectively drive our cars, guide robots in our homes and hospitals, enable entirely new kinds of immersive and interactive experiences for learning and recreation, or accelerate discovery in materials science and medicine.

The philosopher Wittgenstein once wrote that “the limits of my language mean the limits of my world.” I’m not a philosopher. But I know at least for AI, there is more than just words. Spatial intelligence represents the frontier beyond language: the capability that links imagination, perception, and action, and opens possibilities for machines to truly enhance human life, from healthcare to creativity, from scientific discovery to everyday assistance.

### The next decade of AI: Building truly spatially intelligent machines

So how do we build spatially intelligent AI?
What’s the path to models capable of reasoning with the vision of Eratosthenes, engineering with the precision of an industrial designer, creating with the imagination of a storyteller, and interacting with their environment with the fluency of a first responder?

Building spatially intelligent AI requires something even more ambitious than LLMs: world models, a new type of generative model whose capabilities for understanding, reasoning about, generating, and interacting with semantically, physically, geometrically, and dynamically complex worlds, virtual or real, are far beyond the reach of today’s LLMs. The field is nascent, with current methods ranging from abstract reasoning models to video generation systems. World Labs was founded in early 2024 on this conviction: that foundational approaches are still being established, making this the defining challenge of the next decade.

In this emerging field, what matters most is establishing the principles that guide development. For spatial intelligence, I define world models through **three essential capabilities:**

#### 1. **Generative:** World models can generate worlds with perceptual, geometrical, and physical consistency

World models that unlock spatial understanding and reasoning must also generate simulated worlds of their own. They must be capable of spawning endlessly varied and diverse simulated worlds that follow semantic or perceptual instructions, *while* remaining geometrically, physically, and dynamically consistent, whether representing real or virtual spaces.
The research community is actively exploring whether these worlds should be represented implicitly or explicitly in terms of their innate geometric structures. Furthermore, in addition to powerful latent representations, I believe the outputs of a universal world model must also allow the generation of an explicit, observable state of the world for many different use cases. In particular, its understanding of the present must be tied coherently to its past: to the previous states of the world that led to the current one.

#### 2. **Multimodal:** World models are multimodal by design

Just as animals and humans do, a world model should be able to process inputs (known as “prompts” in the generative AI realm) in a wide range of forms. Given partial information (whether images, videos, depth maps, text instructions, gestures, or actions), world models should predict or generate world states that are as *complete* as possible. This requires processing visual inputs with the fidelity of real vision while interpreting semantic instructions with equal facility. This enables both agents and humans to communicate with the model about the world through diverse inputs and receive diverse outputs in return.

#### 3. **Interactive:** World models can output the next states based on input actions

Finally, if actions and/or goals are part of the prompt to a world model, its outputs must include the *next* state of the world, represented either implicitly or explicitly.
When given only an action, with or without a goal state, as the input, the world model should produce an output consistent with the world’s previous state, the intended goal state if any, and its semantic meanings, physical laws, and dynamical behaviors. As spatially intelligent world models become more powerful and robust in their reasoning and generation capabilities, it is conceivable that, given a goal, the world models themselves would be able to predict not only the next state of the world, but also the next actions based on the new state.

**The scope of this challenge exceeds anything AI has faced before.**

While language is a purely generative phenomenon of human cognition, worlds play by much more complex rules. Here on Earth, for instance, gravity governs motion, atomic structures determine how light produces colors and brightness, and countless physical laws constrain every interaction. Even the most fanciful, creative worlds are composed of spatial objects and agents that obey the physical laws and dynamical behaviors that define them. Reconciling all of this consistently (the semantic, the geometric, the dynamic, and the physical) demands entirely new approaches. Representing a world is a vastly higher-dimensional problem than representing a one-dimensional, sequential signal like language. Achieving world models that deliver the kind of universal capabilities we enjoy as humans will require overcoming several formidable technical barriers. At World Labs, our research teams are devoted to making fundamental progress toward that goal.
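
To make the third capability (interactive) concrete, here is a minimal Python sketch of the contract it implies: given the previous state and an action, return a next state that stays consistent with that previous state and with physical law. This is only an illustration under stated assumptions, not a description of how World Labs or any real world model works. The names `WorldState`, `WorldModel`, `ToyGravityWorld`, and `rollout` are hypothetical, and a few lines of hand-written physics stand in for what would in practice be a learned model.

```python
from dataclasses import dataclass
from typing import List, Protocol, Sequence


@dataclass(frozen=True)
class WorldState:
    """Hypothetical explicit, observable state: a ball above a floor."""
    height: float    # metres above the floor
    velocity: float  # metres per second, positive is up


class WorldModel(Protocol):
    """Hypothetical interface for the 'interactive' capability."""

    def step(self, state: WorldState, action: float, dt: float) -> WorldState:
        """Return the next state given the previous state and an action."""
        ...


class ToyGravityWorld:
    """Hand-written physics standing in for a learned world model."""

    GRAVITY = -9.81  # m/s^2

    def step(self, state: WorldState, action: float, dt: float) -> WorldState:
        # The action is an upward acceleration applied by some agent.
        velocity = state.velocity + (self.GRAVITY + action) * dt
        height = state.height + velocity * dt
        if height <= 0.0:
            # The floor constrains the interaction: the ball comes to rest.
            return WorldState(0.0, 0.0)
        return WorldState(height, velocity)


def rollout(model: WorldModel, state: WorldState,
            actions: Sequence[float], dt: float = 0.1) -> List[WorldState]:
    """Apply actions one by one, keeping every intermediate state."""
    states = [state]
    for action in actions:
        state = model.step(state, action, dt)
        states.append(state)
    return states
```

For example, `rollout(ToyGravityWorld(), WorldState(1.0, 0.0), [0.0] * 10)` follows a ball dropped from one metre until it rests on the floor, with each state derived from the one before it. A real world model would have to honor the same kind of constraint across semantics, geometry, and dynamics at once, which is what makes the problem so much harder than this toy.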

Here are some examples of our current research topics:

* **A new, universal task function for training:** Defining a universal task function as simple and elegant as next-token prediction in LLMs has long been a central goal of world model research. The complexity of world models’ input and output spaces makes such a function inherently more difficult to formulate. But while much remains to be explored, this objective function and the corresponding representations must reflect the laws of geometry and physics, honoring the fundamental nature of world models as grounded representations of both imagination and reality.

* **Large-scale training data:** The promising news: massive data sources already exist. Internet-scale collections of images and videos represent abundant, accessible training material. The challenge lies in developing algorithms that can extract deeper spatial information from these two-dimensional image or video frame-based signals (i.e. RGB). Research over the past decade has shown the power of scaling laws linking data volume and model size in language models; the key unlock for world models is building architectures that can leverage existing visual data at comparable scale. In addition, I would not underestimate the power of high-quality synthetic data and additional modalities like depth and tactile information. They supplement the internet-scale data in critical steps of the training process.
But the path forward depends on better sensor systems, more robust signal extraction algorithms, and far more powerful neural simulation methods.

* **New model architecture and representational learning:** World model research will inevitably drive advances in model architecture and learning algorithms, particularly beyond the current MLLM and video diffusion paradigms. Both of these typically tokenize data into 1D or 2D sequences, which makes simple spatial tasks unnecessarily difficult, like counting unique chairs in a short video, or remembering what a room looked like an hour ago. Alternative architectures may help, such as 3D or 4D-aware methods for tokenization, context, and memory. For example, at World Labs, our recent work on a real-time generative frame-based model called RTFM has demonstrated this shift: it uses spatially grounded frames as a form of spatial memory to achieve efficient real-time generation while maintaining persistence in the generated world.

Clearly, we are still facing daunting challenges before we can fully unlock spatial intelligence through world modeling. This research isn’t just a theoretical exercise. It is the core engine for a new class of creative and productivity tools. And the progress within World Labs has been encouraging.
We recently shared with a limited number of users a glimpse of Marble, the first ever world model that can be prompted by multimodal inputs to generate and maintain consistent 3D environments for users and storytellers to explore, interact with, and build on in their creative workflows. And we are working hard to make it available to the public soon!

Marble is only our first step in creating a truly spatially intelligent world model. As progress accelerates, researchers, engineers, users, and business leaders alike are beginning to recognize its extraordinary potential. The next generation of world models will enable machines to achieve spatial intelligence on an entirely new level, an achievement that will unlock essential capabilities still largely absent from today’s AI systems.

### Using world models to build a better world for people

**It matters what motivates the development of AI.** As one of the scientists who helped usher in the era of modern AI, my motivation has always been clear: AI must augment human capability, not replace it. For years, I’ve worked to align AI development, deployment, and governance with human needs. Extreme narratives of techno-utopia and apocalypse are abundant these days, but I continue to hold a more pragmatic view: AI is developed by people, used by people, and governed by people. It must always respect the agency and dignity of people. Its magic lies in extending our capabilities: making us more creative, connected, productive, and fulfilled. Spatial intelligence represents this vision: AI that empowers human creators, caregivers, scientists, and dreamers to achieve what was once impossible.
This belief is what drives my commitment to spatial intelligence as AI’s next great frontier.

The applications of spatial intelligence span varying timelines. Creative tools are emerging now: World Labs’ Marble already puts these capabilities in creators’ and storytellers’ hands. Robotics represents an ambitious mid-term horizon as we refine the loop between perception and action. The most transformative scientific applications will take longer but promise a profound impact on human flourishing.

Across all these timelines, several domains stand out for their potential to reshape human capability. It will take significant collective effort, more than a single team or a company can possibly achieve. It will require participation across the entire AI ecosystem (researchers, innovators, entrepreneurs, companies, and even policymakers) working toward a shared vision. But this vision is worth pursuing. Here’s what that future holds:

#### **Creativity:** Superpowering storytelling and immersive experiences

“Creativity is intelligence having fun.” This is one of my favorite quotes by my personal hero Albert Einstein. Long before written language, humans told stories: painted them on cave walls, passed them through generations, built entire cultures on shared narratives.
Stories are how we make sense of the world, connect across distance and time, explore what it means to be human, and most importantly, find meaning in life and love within ourselves. Today, spatial intelligence has the potential to transform how we create and experience narratives in ways that honor their fundamental importance, and extend their impacts from entertainment to education, from design to construction.

World Labs’ Marble platform will be putting unprecedented spatial capabilities and editorial controllability in the hands of filmmakers, game designers, architects, and storytellers of all kinds, allowing them to rapidly create and iterate on fully explorable 3D worlds without the overhead of conventional 3D design software. The creative act remains as vital and human as ever; the AI tools simply amplify and accelerate what creators can achieve. This includes:

* **Narrative experiences in new dimensions:** Filmmakers and game designers are using Marble to conjure entire worlds without the constraints of budget or geography, exploring varieties of scenes and perspectives that would have been intractable within a traditional production pipeline. As the lines between different forms of media and entertainment blur, we’re approaching fundamentally new kinds of interactive experiences that blend art, simulation, and play: personalized worlds where anyone, not just studios, can create and inhabit their own stories.
With the rise of newer, more rapid ways to lift concepts and storyboards into full experiences, narratives will no longer be bound to a single medium, with creators free to build worlds with shared throughlines across myriad surfaces and platforms.

* **Spatial narratives through design:** Essentially every manufactured object or constructed space must be designed in virtual 3D before its physical creation. This process is highly iterative and costly in terms of both time and money. With spatially intelligent models at their disposal, architects can quickly visualize structures before investing months into designs, walking through spaces that don’t yet exist and, in effect, telling stories about how we might live, work, and gather. Industrial and fashion designers can translate imagination into form instantly, exploring how objects interact with human bodies and spaces.

* **New immersive and interactive experiences:** Experience itself is one of the deepest ways that we, as a species, create meaning. For the entirety of human history, there has been one singular 3D world: the physical one we all share. Only in recent decades, through gaming and early virtual reality (VR), have we begun to glimpse what it means to share alternate worlds of our own creation. Now, spatial intelligence combined with new form factors, like VR and extended reality (XR) headsets and immersive displays, elevates these experiences in unprecedented ways. We’re approaching a future where stepping into fully realized multi-dimensional worlds becomes as natural as opening a book.
Spatial intelligence makes world-building accessible not just to studios with professional production teams but to individual creators, educators, and anyone with a vision to share.

#### **Robotics:** Embodied intelligence in action

Animals from insects to humans depend on spatial intelligence to understand, navigate, and interact with their worlds. Robots will be no different. Spatially aware machines have been the dream of the field since its inception, and of my own work with my students and collaborators at my Stanford research lab. This is also why I’m so excited by the possibility of bringing them about using the kinds of models World Labs is building.

* **Scaling robotic learning via world models:** The progress of robotic learning hinges on a scalable source of viable training data. Given the enormous state spaces that robots have to learn to understand, reason about, plan in, and interact with, many have conjectured that a combination of internet data, synthetic simulation, and real-world capture of human demonstration is required to truly create generalizable robots. But unlike for language models, training data is scarce for today’s robotic research. World models will play a defining role in this. As they increase their perceptual fidelity and computational efficiency, outputs of world models can rapidly close the gap between simulation and reality. This will in turn help train robots across simulations of countless states, interactions, and environments.
* **通过世界模型扩展机器人学习:** 机器人学习的进展取决于一个可行的、可扩展的训练数据解决方案。考虑到机器人需要学习理解、推理、规划和互动的巨大可能性状态空间,许多人推测,要真正创造出具有泛化能力的机器人,需要结合互联网数据、合成模拟和真实世界的人类演示捕捉。但与语言模型不同,当今的机器人研究训练数据稀缺。世界模型将在这方面发挥决定性作用。随着它们感知保真度和计算效率的提高,世界模型的输出可以迅速缩小模拟与现实之间的差距。这反过来将有助于在无数状态、互动和环境的模拟中训练机器人。

* **Companions and collaborators:** Robots as human collaborators, whether aiding scientists at the lab bench or assisting seniors living alone, can expand parts of the workforce in dire need of more labor and productivity. But doing so demands spatial intelligence that perceives, reasons, plans, and acts while (and this is most important) staying empathetically aligned with human goals and behaviors. For instance, a lab robot might handle instruments so the scientist can focus on tasks needing dexterity or reasoning, while a home assistant might help an elderly person cook without diminishing their joy or autonomy. Truly spatially intelligent world models that can predict the next state or possibly even actions consistent with this expectation are critical for achieving this goal.

* **伴侣与合作者:** 机器人作为人类的合作者,无论是辅助实验室的科学家,还是帮助独居的老人,都可以扩充急需更多劳动力和生产力的劳动力队伍。但要做到这一点,需要具备感知、推理、规划和行动的空间智能,同时(这是最重要的)保持与人类目标和行为的共情对齐。例如,一个实验室机器人可以处理仪器,让科学家专注于需要灵巧或推理的任务;而一个家庭助手可以帮助老年人做饭,而不会减少他们的快乐或自主性。真正具备空间智能、能够预测下一个状态甚至可能预测与此期望一致的行动的世界模型,对于实现这一目标至关重要。

* **Expanding forms of embodiment:** Humanoid robots play a role in the world we’ve built for ourselves. But the full benefit of innovation will come from a far more diverse range of designs: nanobots that deliver medicine, soft robots that navigate tight spaces, and machines built for the deep sea or outer space. Whatever their form, future spatial intelligence models must integrate both the environments these robots inhabit and their own embodied perception and movement. A key challenge in developing these robots, however, is the lack of training data across this wide variety of embodied form factors.
World models will play a critical role in providing simulation data, training environments, and benchmarking tasks for these efforts.

* **扩展具身形式:** 人形机器人在我们为自己构建的世界中扮演着一个角色。但创新的全部益处将来自更多样化的设计:输送药物的纳米机器人、穿越狭窄空间的软体机器人,以及为深海或外太空建造的机器。无论其形式如何,未来的空间智能模型都必须整合这些机器人所处的环境以及它们自身的具身感知和运动。但开发这些机器人的一个关键挑战是,在这些多种多样的具身形态因素中缺乏训练数据。世界模型将在模拟数据、训练环境和这些工作的基准测试任务中发挥关键作用。

#### **The Longer Horizon:** Science, Healthcare, and Education

#### **更长远的展望:** 科学、医疗保健和教育

In addition to creative and robotics applications, spatial intelligence’s profound impact will also extend to fields where AI can enhance human capability in ways that save lives and accelerate discovery. Below I highlight three application areas that can be deeply transformative, though it goes without saying that the use cases of spatial intelligence extend across many more industries.

除了创意和机器人应用,空间智能的深远影响还将延伸到那些AI能够以拯救生命和加速发现的方式增强人类能力的领域。我下面重点介绍三个可能具有深度变革性的应用领域,尽管不言而喻,空间智能的用例在更多行业中都非常广泛。

In **scientific research,** spatially intelligent systems can simulate experiments, test hypotheses in parallel, and explore environments inaccessible to humans, from deep oceans to distant planets. This technology can transform computational modeling in fields like climate science and materials research. By integrating multi-dimensional simulation with real-world data collection, these tools can lower compute barriers and extend what every laboratory can observe and understand.

在**科学研究**中,具备空间智能的系统可以模拟实验,并行检验假设,并探索人类无法进入的环境,从深海到遥远的行星。这项技术可以改变气候科学和材料研究等领域的计算建模。通过将多维模拟与真实世界数据收集相结合,这些工具可以降低计算门槛,并扩展每个实验室所能观察和理解的范围。

In **healthcare**, spatial intelligence will reshape everything from laboratory to bedside. At Stanford, my students and collaborators have spent many years working with hospitals, elder care facilities, and patients at home. This experience has convinced me of spatial intelligence’s transformative potential here.
AI can accelerate drug discovery by modeling molecular interactions in multiple dimensions, enhance diagnostics by helping radiologists spot patterns in medical imaging, and enable ambient monitoring systems that support patients and caregivers without replacing the human connection that healing requires. Robots, too, have the potential to help our healthcare workers and patients in many different settings.

在**医疗保健**领域,空间智能将重塑从实验室到病床的一切。在斯坦福,我的学生和合作者多年来一直与医院、养老院和居家病人合作。这段经历让我确信空间智能在这里具有变革性的潜力。人工智能可以通过多维建模分子相互作用来加速药物发现,通过帮助放射科医生在医学影像中发现模式来增强诊断,并实现环境监测系统,在不取代康复所需的人际联系的情况下支持患者和护理人员,更不用说机器人在许多不同环境中帮助我们的医护人员和患者的潜力了。

In **education,** spatial intelligence can enable immersive learning that makes abstract or complex concepts tangible, and create the iterative experiences so essential to how our brains and bodies are wired to learn. In the age of AI, the need for faster and more effective learning and reskilling is particularly pressing for both school-aged children and adults. Students can explore cellular machinery or walk through historical events in multiple dimensions. Teachers gain tools to personalize instruction through interactive environments. Professionals, from surgeons to engineers, can safely practice complex skills in realistic simulations.

在**教育**领域,空间智能可以实现沉浸式学习,使抽象或复杂的概念变得具体可感,并创造对我们大脑和身体学习方式至关重要的迭代体验。在人工智能时代,对于学龄儿童和成年人来说,更快、更有效的学习和技能再培训的需求尤为重要。学生可以在多维度中探索细胞机制或穿越历史事件。教师可以通过互动环境获得个性化教学的工具。从外科医生到工程师的专业人士可以在逼真的模拟中安全地练习复杂技能。

Across all these domains, the possibilities are boundless, but the goal remains constant: AI that augments human expertise, accelerates human discovery, and amplifies human care, without replacing the judgment, creativity, and empathy that are central to being human.

在所有这些领域,可能性是无限的,但目标始终如一:增强人类专业知识、加速人类发现、放大人类关怀的人工智能,而不是取代作为人类核心的判断力、创造力和同理心。

### Conclusion

### 结论

The last decade has seen AI become a global phenomenon and an inflection point in technology, the economy, and even geopolitics.
But as a researcher, educator, and now entrepreneur, it’s still the spirit behind Turing’s 75-year-old question that inspires me most. I still share his sense of wonder. It’s why the challenge of spatial intelligence energizes me every day.

过去十年见证了人工智能成为一种全球现象,以及技术、经济甚至地缘政治的转折点。但作为一名研究者、教育家,以及现在的企业家,最能激励我的仍然是图灵75年前那个问题背后的精神。我仍然分享着他的惊奇感。正是这种感觉,每天都因空间智能的挑战而让我充满活力。

For the first time in history, we’re poised to build machines so in tune with the physical world that we can rely on them as true partners in the greatest challenges we face. Whether accelerating how we understand diseases in the lab, revolutionizing how we tell stories, or supporting us in our most vulnerable moments due to sickness, injury, or age, we’re on the cusp of technology that elevates the aspects of life we care about most. This is a vision of deeper, richer, more empowered lives.

历史上第一次,我们准备好建造与物理世界如此协调的机器,以至于在我们面临的最大挑战中,我们可以将它们视为真正的伙伴。无论是加速我们在实验室理解疾病的方式,彻底改变我们讲故事的方式,还是在我们因疾病、受伤或年老而最脆弱的时刻支持我们,我们即将迎来一种能够提升我们最关心的生活方面的技术。这是一个更深刻、更丰富、更有力量的生活愿景。

Almost half a billion years after nature unleashed the first glimmers of spatial intelligence in ancestral animals, we’re lucky enough to find ourselves among the generation of technologists who may soon endow machines with the same capability, and privileged enough to harness those capabilities for the benefit of people everywhere. Our dreams of truly intelligent machines will not be complete without spatial intelligence.

在自然界于远古动物身上释放出空间智能的第一缕微光近五亿年后,我们有幸成为这一代技术专家中的一员,或许很快就能赋予机器同样的能力,并有幸利用这些能力为世界各地的人们谋福利。我们对真正智能机器的梦想,没有空间智能是不会完整的。