[Artificial Intelligence] Embodied AI: Overview of Embodied Artificial Intelligence

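In the shift from the era of “Internet AI” to the era of “Embodied AI,” AI algorithms and agents no longer learn from datasets of images, videos, or text drawn mainly from the internet. Instead, they learn through interaction with their environment, from egocentric perception similar to that of humans. As a result, demand for embodied AI simulators that support a variety of embodied AI research tasks has grown substantially. The growing interest in Embodied AI benefits the broader pursuit of Artificial General Intelligence (AGI), but there has not yet been a contemporary, comprehensive survey of the field.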

Contents

[Artificial Intelligence] Embodied AI: Overview of Embodied Artificial Intelligence

History

Motivation

Agent and Environment

Egocentric Perception

Internet AI

Active Perception

Sparse Rewards

Tasks

Datasets and Simulators

Sim2Real

Applications

Conclusion


Recent research trends in Artificial Intelligence, Machine Learning, and Computer Vision have led to a growing research space called Embodied AI. Facebook AI Research (FAIR) and Intel Labs have been spearheading new projects in this space. “Embodied” is defined as “giving a tangible or visible form to an idea.” Simply put, “Embodied AI” means “AI for virtual robots.” More specifically, Embodied AI is the field that solves AI problems for virtual robots that can move, see, speak, and interact in the virtual world and with other virtual robots; these simulated robot solutions are then transferred to real-world robots.

History

Linda Smith proposed the “embodiment hypothesis” in 2005: the idea that intelligence emerges from the interaction of an agent with an environment, as a result of sensorimotor activity. The argument is that starting as a baby grounded in a physical, social, and linguistic world is crucial to the development of the flexible and inventive intelligence that characterizes humankind. Furthermore, the Embodiment Thesis states that many features of cognition are embodied, in that they depend deeply on characteristics of the agent’s physical body, such that the agent’s beyond-the-brain body plays a significant causal role, or a physically constitutive role, in that agent’s cognitive processing. While the initial hypothesis comes from Psychology and Cognitive Science, the recent research developments in Embodied AI have come largely from Computer Vision researchers.

AI subfields have been largely separated since the 1960s, subject to various limitations. Embodied AI, however, brings together interdisciplinary fields such as Natural Language Processing (NLP), Computer Vision, Reinforcement Learning, Navigation, Physics-Based Simulation, and Robotics. While Embodied AI requires multiple AI subfields to succeed, its growth as a research area has largely been driven by Computer Vision researchers.

Computer Vision researchers define Embodied AI as artificial agents operating in 3D environments that base their decisions on egocentric perceptual inputs, which in turn change with the agent’s actions. Embodied AI enables training of embodied agents (virtual robots and egocentric assistants) in a realistic 3D simulator before the learned skills are transferred to reality. This drives a paradigm shift from “Internet AI,” based on static datasets (e.g., ImageNet, COCO, VQA), to “Embodied AI,” where agents act within realistic simulated environments.

Internet AI. These images from the COCO dataset do not provide a realistic 3D environment.

Motivation

Many of the AI advances of the last decade are due to Machine Learning and Deep Learning (e.g., Semantic Segmentation, Object Detection, Image Captioning). Machine Learning and Deep Learning have been successful because of the increasing amounts of data (e.g., YouTube, Flickr, Facebook) and the increasing amounts of computing power (e.g., CPUs, GPUs, TPUs). However, this type of “Internet data” (images, video, and text curated from the internet) does not come from a real-world, first-person human perspective. The data is shuffled and randomized; it comes from satellites, selfie photos, and Twitter feeds, and none of this is how a human perceives the world. Nevertheless, Machine Learning methods feed this data to NLP, CV, and Navigation problems. While “Internet data” and “Internet AI” have brought much progress in these fields, they are not the most suitable data or the most suitable methods. The methods of Machine Learning do not map to the ways that humans learn. Humans learn by seeing, moving, interacting, and speaking with others. Humans learn from sequential experiences, not from shuffled and randomized ones. The thesis of Embodied AI is to have embodied agents (or virtual robots) learn in the same way that humans learn, which is why insight from Cognitive Science and Psychology experts is essential. In practice, this means that virtual robots should learn by seeing, moving, speaking, and interacting with the world, just like humans.

While “Embodied AI” has a different methodology than “Internet AI”, Embodied AI can benefit from many of the successes of Internet AI. Computer Vision and Natural Language Processing now work well on some tasks, provided there is plenty of labelled data. These advances in CV and NLP greatly increase the potential for success in Embodied AI.

Additionally, there now exist plenty of realistic 3D scenes that can serve as simulated environments for Embodied AI training. These environments include SUNCG, Matterport3D, iGibson, Replica, Habitat, and DART. These scenes are much more realistic than the environments used in previous research simulators. The widespread public availability of these datasets greatly increases the potential for success in Embodied AI.

This is a 3D Map view of the realistic environment. The embodied agent only sees a first person view.

Embodied AI lends itself to applications such as personal robotics, virtual assistants, and even autonomous vehicles. The combination of NLP, Computer Vision, and Robotics actually makes the tasks of Embodied AI easier.

Agent and Environment

An agent is merely an abstraction that can take actions in an environment. We refer to this agent as a virtual robot, simulated agent, virtual agent, or egocentric agent (due to its first-person views, sensors, and interactions). You can think of an agent as a player of a game. An environment, on the other hand, is an abstraction that represents a 3D map with multiple locations, rooms, and objects in the world. It is a simulated environment that represents the physical world. An agent can accomplish many goals in an environment, such as interaction, navigation, and language understanding. For more information on Agents and Environments, feel free to read our Overview of Reinforcement Learning.
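
The agent and environment abstractions can be sketched as a simple interaction loop. The following is a minimal illustration in Python, not the API of any particular simulator: the class names `CorridorEnvironment` and `ForwardAgent`, the one-dimensional corridor, and the `reset`/`step`/`act` method names are all assumptions made for this example.

```python
class CorridorEnvironment:
    """Hypothetical toy environment: a corridor of numbered cells with a goal cell."""

    def __init__(self, length=5, goal=4):
        self.length = length
        self.goal = goal
        self.position = 0

    def reset(self):
        # Place the agent at the start and return its first observation.
        self.position = 0
        return self.position

    def step(self, action):
        # action is -1 (step back) or +1 (step forward); clamp to the corridor.
        self.position = max(0, min(self.length - 1, self.position + action))
        done = self.position == self.goal
        return self.position, done


class ForwardAgent:
    """Hypothetical agent that always walks forward, whatever it observes."""

    def act(self, observation):
        return 1


def run_episode(env, agent, max_steps=100):
    """Run the observe-act loop; return the number of steps taken, or None on timeout."""
    observation = env.reset()
    for step_count in range(1, max_steps + 1):
        action = agent.act(observation)
        observation, done = env.step(action)
        if done:
            return step_count
    return None


steps_taken = run_episode(CorridorEnvironment(length=5, goal=4), ForwardAgent())
print(steps_taken)
```

A real simulator would return images or other sensor readings as observations rather than a cell index, but the loop has the same shape: observe, act, observe again.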

Egocentric Perception

Embodied AI is egocentric as opposed to allocentric. Egocentric perception can be thought of as a first-person view: it encodes objects with respect to the agent. Allocentric perception encodes objects with respect to another object (e.g., the front door or the center of the room). An allocentric perception might be useful if the AI knows the whole map of the environment; an embodied agent, however, only knows what it has seen through egocentric perception. An embodied agent does not have access to a map unless it creates one as it navigates the different rooms and locations in the environment. Recall that the Embodiment Thesis focuses on the self, so it is only appropriate for Embodied AI to focus on the self as well.
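
The difference between the two encodings can be made concrete with a coordinate transform. The sketch below is an illustration under assumed conventions that do not come from the article: positions are 2D, the agent's heading is an angle in radians measured counterclockwise from the world x-axis, and the function name `allocentric_to_egocentric` is hypothetical.

```python
import math


def allocentric_to_egocentric(object_xy, agent_xy, agent_heading):
    """Re-express a world-frame (allocentric) position in the agent's own frame.

    Returns (forward, left): how far the object lies ahead of the agent and
    how far to its left. Negative values mean behind or to the right.
    """
    dx = object_xy[0] - agent_xy[0]
    dy = object_xy[1] - agent_xy[1]
    cos_h = math.cos(agent_heading)
    sin_h = math.sin(agent_heading)
    # Rotate the world-frame offset by minus the heading angle.
    forward = cos_h * dx + sin_h * dy
    left = -sin_h * dx + cos_h * dy
    return forward, left


# Agent stands at (2, 1) facing the world's +y direction.
forward, left = allocentric_to_egocentric((2.0, 4.0), (2.0, 1.0), math.pi / 2)
print(round(forward, 6), round(left, 6))
```

The same object has one fixed allocentric position, but its egocentric position changes every time the agent moves or turns, which is why an embodied agent has to keep track of its own motion.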

Internet AI

Embodied AI faces additional challenges compared to Internet AI. Internet AI learns from static images in Internet datasets (e.g., ImageNet). These static images are high quality and nicely framed; Embodied AI, on the other hand, uses egocentric perception, which produces images or videos that might be shaky and poorly composed. The inherent properties of egocentric perception create additional challenges for Embodied AI as compared to Internet AI. Additionally, the focus of Internet AI is pattern recognition in images, videos, and text on datasets typically curated from the internet, whereas the focus of Embodied AI is to enable action by an embodied agent (e.g., a robot) in an environment. Ultimately, the goal is to take all of the advances that Machine Learning and Computer Vision have made in Internet AI and apply them to Embodied AI.

The goal in this example is to find the car and perceive its color. The agent needs to understand the question, find the car, and then answer.

Active Perception

The agent may be spawned anywhere in the environment and may not immediately “see” the pixels containing the answer to its visual goal (i.e., the car or goal may not be visible). Thus, the agent must move to succeed, controlling the pixels that it will perceive. The agent must learn to map its visual input to the correct action based on its perception of the world, the underlying physical constraints, and its understanding of the question. The observations that the agent collects are a consequence of the actions that it takes in the environment, so the agent controls the incoming data distribution: it controls the pixels it gets to see. This is unlike static datasets curated online, where there is less control over factors such as the viewpoint variation of objects. One of the challenges of active perception is to be generally robust to visual variation.

Sparse Rewards

Unlike object detection or image recognition (supervised learning), where every example comes with a label, these agents do not collect immediate feedback for each action. Agents in an environment often experience sparse rewards (reinforcement learning). The aim of a reinforcement learning (RL) algorithm is to allow an agent to maximize the rewards from the environment. In some environments, the rewards are supplied to the agent continuously. In others, a positive reward is only provided when the agent completes the goal (e.g., “walk to car”), which leads to sparse rewards. Sparse rewards can make learning the intended behavior more challenging. They can also make exploration more challenging. For more information on Reinforcement Learning, feel free to read our Overview of Reinforcement Learning.
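
The contrast between continuously supplied and sparse rewards can be sketched on a toy trajectory. This is an illustration only: the one-dimensional positions, the function names, and the choice of "progress toward the goal" as the continuous reward are assumptions for this example, not something the article specifies.

```python
def sparse_reward(position, goal):
    """Reward of 1.0 only on reaching the goal; 0.0 everywhere else."""
    return 1.0 if position == goal else 0.0


def progress_reward(previous_position, position, goal):
    """Assumed dense reward: how much closer this step brought the agent to the goal."""
    return abs(goal - previous_position) - abs(goal - position)


def rollout_rewards(positions, goal):
    """Compute both reward signals for each step of a visited-position sequence."""
    steps = list(zip(positions, positions[1:]))
    sparse = [sparse_reward(current, goal) for _, current in steps]
    dense = [progress_reward(previous, current, goal) for previous, current in steps]
    return sparse, dense


# The agent wanders 0 -> 1 -> 2 -> 1 -> 2 -> 3 with the goal at position 3.
sparse, dense = rollout_rewards([0, 1, 2, 1, 2, 3], goal=3)
print(sparse)  # [0.0, 0.0, 0.0, 0.0, 1.0]
print(dense)   # [1, 1, -1, 1, 1]
```

With the sparse signal, the first four steps are indistinguishable from one another, including the backward step. The agent only learns anything once it happens to reach the goal, which is what makes exploration hard.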

Tasks

Research in Embodied AI covers several tasks. Here are some of the existing ones.

  1. Visual Odometry. Odometry uses a sensor to determine how much distance has been traversed; “visual” simply specifies that the sensor used is a visual one (e.g., a camera). Traversed distance in odometry is relative to the starting position, so visual odometry assumes the initial position is known. Visual odometry (VO), as one of the most essential techniques for pose estimation, has attracted significant interest in both the computer vision and robotics communities over the past few decades. It has been widely applied to various robots as a complement to GPS, Inertial Navigation Systems (INS), wheel odometry, etc. In the last thirty years, an enormous amount of work has gone into developing accurate and robust VO systems.

  2. Global Localization. Localization is the problem of estimating the position of an autonomous agent given a map of the environment and agent observations. Autonomous agents need the ability to localize under uncertainty in order to perform downstream tasks such as planning, exploration, and navigation. Localization is considered one of the most fundamental problems in robotics, and it is useful in many real-world applications such as autonomous vehicles, factory robots, and delivery drones. The global localization problem assumes the initial position is unknown (whereas VO assumes that the initial position is known). Despite the long history of research, global localization is still an open problem.

  3. Visual Navigation. Navigation in three-dimensional environments is an essential capability of robots that function in the physical world (or virtual robots in a simulated environment). Animals, including humans, can traverse cluttered dynamic environments with grace and skill in pursuit of many goals. Animals can navigate efficiently and deliberately in previously unseen environments, building up internal representations of these environments in the process. Such internal representations are of central importance to Artificial Intelligence. For more information on Visual Navigation, feel free to read our Overview of Embodied Navigation. (Coming Soon)

  4. Grounded Language Learning. We are increasingly surrounded by artificially intelligent technology that takes decisions and executes actions on our behalf. This creates a pressing need for general means to communicate with, instruct and guide artificial agents, with human language the most compelling means for such communication. To achieve this in a scalable fashion, agents must be able to relate language to the world and to actions; that is, their understanding of language must be grounded and embodied. However, learning grounded language is a notoriously challenging problem in artificial intelligence research.

  5. Instruction Guided Visual Navigation. The idea that we might be able to give general, verbal instructions to a robot and have at least a reasonable probability that it will carry out the required task is one of the long-held goals of robotics, and artificial intelligence. Despite significant progress, there are a number of major technical challenges that need to be overcome before robots will be able to perform general tasks in the real world. One of the primary requirements will be new techniques for linking natural language to vision and action in unstructured, previously unseen environments. It is the navigation version of this challenge that is referred to as Vision-and-Language Navigation (VLN).

  6. Embodied Question Answering. EmbodiedQA is where an agent is spawned at a random location in a 3D environment and asked a question (e.g., ‘What color is the car?’). In order to answer (e.g., ‘Orange!’), the agent must first intelligently navigate to explore the environment, gather information through first-person (egocentric) vision, and then answer the question. This challenging task requires a range of AI skills — active perception (e.g., agent must move to perceive the car — controlling the pixels that it will perceive), language understanding (e.g., what is the question asking?), goal-driven navigation, commonsense reasoning (e.g., where are cars generally located in the house?), and grounding of language into actions (e.g., associate entities in text with corresponding image pixels or sequence of actions).
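
To illustrate why odometry depends on a known starting position (task 1 above), here is a minimal dead-reckoning sketch. It is not a visual odometry system: a real VO pipeline estimates each relative motion from camera frames, whereas this example assumes the per-step motions are already given. The function name `integrate_odometry` and the (turn, distance) step format are hypothetical.

```python
import math


def integrate_odometry(steps, start=(0.0, 0.0, 0.0)):
    """Accumulate relative motions into a pose expressed relative to the start.

    steps: iterable of (turn_radians, forward_distance); the agent turns first,
           then moves forward along its new heading.
    start: (x, y, heading) of the known initial pose.
    """
    x, y, heading = start
    for turn, distance in steps:
        heading += turn
        x += distance * math.cos(heading)
        y += distance * math.sin(heading)
    return x, y, heading


# Move 1 unit straight ahead, then turn left 90 degrees and move 1 unit.
x, y, heading = integrate_odometry([(0.0, 1.0), (math.pi / 2, 1.0)])
print(round(x, 6), round(y, 6), round(heading, 6))
```

Every estimate is chained onto the previous one, so the result is only meaningful relative to `start`, and small per-step errors accumulate over time. Global localization (task 2) removes the assumption that `start` is known.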

Egocentric view of embodied agent navigating through a realistic environment.

Datasets and Simulators

Datasets have been a key driver of progress in Internet AI. With Embodied AI, simulators will assume the role previously played by datasets. Embodied AI datasets consist of 3D scans of an environment, representing 3D scenes of a house, a lab, a room, or the outside world. On their own, however, these 3D scans do not let an agent “walk” through them or interact with them. Simulators allow the embodied agent to physically interact with the environment and walk through it. The datasets are imported into a simulator for the embodied agents to live in and interact with. With the simulator, the agent can see, move, and interact with its environment. The agent can even speak to other agents or humans through the simulator.

Environments are realized through simulators. However, there are different types of environment representations.

  1. Unstructured. An environment can keep everything it contains, possibly in compressed form. An example of this is Habitat. Such a representation can handle long-horizon tasks. This is the predominant model in current environments and usually comes with metric representations of the environment.

  2. Topological. An environment can be represented with a graph of key nodes. Topological environments consist of a graph with nodes corresponding to locations in the environment and a system capable of retrieving nodes from the graph based on observations. The graph stores no metric information, only connectivity of locations corresponding to the nodes. Topological environments allow for navigation strategies, such as “landmark navigation” as opposed to navigating with metric representations.

  3. Spatial Memory / Cognitive Map. A mapper can build a spatial map of its environment based on the agent’s egocentric views. The spatial memory captures the layout of the environment. The cognitive map is fused from first-person views as observed by the agent over time to produce a metric/semantic egocentric belief about the world in a top-down view. At each time step, the agent updates the belief of its environment. This allows the agent to progressively improve its model of the environment as it moves around.
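
The topological representation (item 2 above) can be sketched as a plain adjacency mapping that stores only which locations connect to which, with no distances or coordinates. The room names and the function `shortest_landmark_route` are hypothetical, and breadth-first search is just one simple way to plan over such a graph. The retrieval system that matches observations to nodes is not modeled here.

```python
from collections import deque


def shortest_landmark_route(graph, start, goal):
    """Breadth-first search over a connectivity-only graph of locations.

    graph: dict mapping each location to the locations directly reachable from it.
    Returns the list of locations to pass through, or None if no route exists.
    """
    if start == goal:
        return [start]
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor in parents:
                continue
            parents[neighbor] = node
            if neighbor == goal:
                # Walk back through the parents to rebuild the route.
                route = [goal]
                while parents[route[-1]] is not None:
                    route.append(parents[route[-1]])
                return route[::-1]
            queue.append(neighbor)
    return None


house = {
    "kitchen": ["hallway"],
    "hallway": ["kitchen", "bedroom", "garage"],
    "bedroom": ["hallway"],
    "garage": ["hallway"],
}
print(shortest_landmark_route(house, "kitchen", "garage"))
```

The route comes out as a sequence of landmarks to pass through rather than a metric path, which is the kind of "landmark navigation" a topological environment supports.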

Sim2Real

A straightforward way to train embodied agents is to place them directly in the physical world. This is valuable, but training robots in the real world is slow, dangerous (the robot can fall over and break), resource intensive (the robot and environment demand resources and time), and difficult to reproduce (especially rare edge cases). An alternative is to train embodied agents in realistic simulators and then transfer the learned skills to reality.

Simulators can help overcome some of the challenges of the physical world. Simulators can run orders of magnitude faster than real time and can be parallelized over a cluster, and training in simulation is safe and cheap. Once an approach has been developed and tested in simulation, it can be transferred to physical platforms that operate in the real world.

An Embodied AI robot can work with firefighters.

Applications

We want to apply Embodied AI to the real world. We want a physical agent that is capable of taking actions in the real world and can talk to humans in natural language. Consider, for instance, a search and rescue scenario. A firefighter asks: “Is there smoke in any room?” First, the robot has to understand what the question is asking. Then, its language understanding needs to be grounded in the environment, so that it knows what “room” means. If the agent is confused or needs assistance, it needs to be able to ask the firefighter what to do. For example, the agent might ask “What subset of rooms are you interested in?” And the firefighter could respond with “The rooms in the burning house.” Or perhaps this particular agent has learned enough “common sense” to know that the rooms that need to be searched for smoke are the rooms in the burning building. Once the agent understands its objective, it needs to be able to navigate to each room. After it arrives at each room, it needs to be able to determine whether there is smoke in the room, based on its egocentric perception. Finally, it can navigate back to the firefighter and respond: “Yes, in one room.”

This firefighting scenario shows multiple examples of Embodied AI tasks.

Conclusion

Recent advancements in Internet AI have fueled the initial progress of Embodied AI. Once Embodied AI transcends the developments of Internet AI, it will be a major leap forward in enabling robots to effectively learn how to interact with the real world.
