New embodied-intelligence result from Fei-Fei Li's team: a robot connected to a large model understands human speech directly and completes complex instructions with zero pre-training

Source: Qubit

The latest embodied-intelligence result from Fei-Fei Li's team is here:

A large model is connected to the robot and converts complex instructions into concrete action plans, with no additional data or training.

From now on, humans can give robots instructions freely in natural language, for example:

Open the top drawer and watch out for the vase!

The large language model plus a vision-language model analyzes, in 3D space, the target and the obstacles to be avoided, helping the robot plan its actions.

The key point: the robot in the real world can then perform the task directly, without any "training".

The new method achieves zero-shot trajectory synthesis for everyday manipulation tasks: tasks the robot has never seen before can be performed in one go, without even giving it a demonstration.

The set of manipulable objects is also open, with no range delineated in advance: the robot can open bottles, press switches, and unplug charging cables.

The project homepage and paper are already online, the code is due to be released soon, and the work has attracted broad interest in the academic community.

A former Microsoft researcher commented that this research sits at the frontier of the most important and most complex AI systems.

Within the robotics research community, some colleagues said it opens up a new world for motion planning.

Some people who previously saw no danger in AI said this research combining AI with robots changed their view.

**How does the robot understand human speech directly?**

Fei-Fei Li's team named the system VoxPoser. As the figure below shows, its principle is quite simple.

First, the system is given the environment information (RGB-D images collected by a camera) and the natural-language instruction to execute.
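To make the perception input concrete, here is a minimal sketch (not from the paper) of how an RGB-D observation can be back-projected into a point cloud and voxelized into a workspace grid; the camera intrinsics, image size, and grid bounds are assumed values.

```python
import numpy as np

def rgbd_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth image (meters) into camera-frame 3D points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

def voxelize(points, bounds_min, bounds_max, grid_size=(100, 100, 100)):
    """Mark occupied voxels of a workspace grid from a point cloud."""
    grid = np.zeros(grid_size, dtype=bool)
    norm = (points - bounds_min) / (bounds_max - bounds_min)      # scale to 0..1
    idx = (norm * (np.array(grid_size) - 1)).astype(int)
    keep = np.all((idx >= 0) & (idx < np.array(grid_size)), axis=1)
    grid[tuple(idx[keep].T)] = True
    return grid

# Example with an assumed 640x480 depth image and a 1 m^3 workspace.
depth = np.random.uniform(0.5, 1.5, size=(480, 640))              # placeholder depth
points = rgbd_to_points(depth, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
occupancy = voxelize(points, np.array([-0.5, -0.5, 0.0]), np.array([0.5, 0.5, 1.0]))
```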

Next, the LLM (large language model) writes code based on this input, and the generated code interacts with a VLM (vision-language model) to guide the system in producing the corresponding operation-instruction map, the 3D Value Map.

The 3D Value Map, a collective term for the affordance map and the constraint map, marks both "where to act" and "how to act".
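The construction below is a hedged sketch of that idea, not the authors' code: an LLM-written snippet queries a vision-language model for the locations of the "drawer handle" and the "vase", then writes an attracting (high-value) region around the handle and a repelling (penalized) region around the vase into a single voxel value map. The `vlm_locate` helper and the grid resolution are hypothetical stand-ins.

```python
import numpy as np

GRID = (100, 100, 100)   # assumed workspace voxel resolution

def vlm_locate(object_name):
    """Hypothetical VLM call: returns the voxel coordinate of a named object.

    In VoxPoser this role is played by vision-language/open-vocabulary models;
    here it is stubbed out with fixed coordinates for illustration only.
    """
    return {"drawer handle": np.array([70, 50, 40]),
            "vase": np.array([55, 52, 60])}[object_name]

def gaussian_bump(center, sigma):
    """A smooth 3D bump centered on a voxel, used to spread value spatially."""
    axes = [np.arange(s) for s in GRID]
    xx, yy, zz = np.meshgrid(*axes, indexing="ij")
    d2 = (xx - center[0])**2 + (yy - center[1])**2 + (zz - center[2])**2
    return np.exp(-d2 / (2 * sigma**2))

# Affordance map: high value where the task wants the end-effector to go.
affordance = gaussian_bump(vlm_locate("drawer handle"), sigma=5.0)

# Constraint map: region around the vase that should be avoided.
constraint = gaussian_bump(vlm_locate("vase"), sigma=8.0)

# Composite value map: the planner seeks high affordance and low constraint.
value_map = affordance - 2.0 * constraint
```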

A motion planner is then brought in, using the generated 3D map as its objective function to synthesize the final operation trajectory to be executed.
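One minimal way to picture that last step is an illustrative greedy, gradient-style walk over the voxel value map; this is only a sketch of "value map as objective", not the optimizer actually used in the paper.

```python
import numpy as np

def greedy_trajectory(value_map, start, max_steps=200):
    """Greedily walk uphill on the value map to sketch a waypoint trajectory.

    `value_map` is the composed 3D value map; `start` is the voxel of the
    current end-effector position. A real planner (e.g. the model-predictive
    control used by VoxPoser) would also handle dynamics, collisions, and
    smoothness costs.
    """
    pos = np.array(start)
    path = [tuple(pos)]
    for _ in range(max_steps):
        best, best_val = None, value_map[tuple(pos)]
        # Examine the 26-connected neighborhood of the current voxel.
        for d in np.ndindex(3, 3, 3):
            step = np.array(d) - 1
            if not step.any():                       # skip the zero step
                continue
            nxt = pos + step
            if np.any(nxt < 0) or np.any(nxt >= np.array(value_map.shape)):
                continue
            if value_map[tuple(nxt)] > best_val:
                best, best_val = nxt, value_map[tuple(nxt)]
        if best is None:                             # local maximum reached
            break
        pos = best
        path.append(tuple(pos))
    return path
```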

This process shows that, whereas traditional methods require additional pre-training, this method uses a large model to guide how the robot interacts with the environment, directly tackling the scarcity of robot training data.

It is precisely this property that also gives the method its zero-shot ability: once the basic pipeline above is in place, any given task can be handled.

In the concrete implementation, the authors cast the idea behind VoxPoser as an optimization problem, namely the following formula:
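The formula itself appeared as an image in the original post. To the best of my reading of the VoxPoser paper, the per-subtask objective is roughly of the following form; treat this as a hedged reconstruction with paraphrased symbols rather than a verbatim copy.

```latex
% For an instruction decomposed into subtasks \ell_1, \dots, \ell_n,
% each subtask i is solved as a constrained optimization over the
% robot trajectory \tau_i^{r}:
\[
\min_{\tau_i^{r}} \;
  \mathcal{F}_{\mathrm{task}}\!\left(\mathbf{T}_i, \ell_i\right)
  + \mathcal{F}_{\mathrm{control}}\!\left(\tau_i^{r}\right)
\quad \text{subject to} \quad
  \mathcal{C}\!\left(\mathbf{T}_i\right)
\]
% \mathbf{T}_i   : evolution of the environment state during subtask i
% \tau_i^{r}     : the robot trajectory, a subset of \mathbf{T}_i
% \mathcal{F}_{\mathrm{task}}    : task-completion cost, scored via the 3D value maps
% \mathcal{F}_{\mathrm{control}} : control-effort cost (e.g. path length)
% \mathcal{C}    : dynamics and kinematics constraints
```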

Considering that human instructions can be broad and require contextual understanding, the instruction is decomposed into many subtasks. For example, the first example at the beginning consists of "grasp the drawer handle" and "pull open the drawer".
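For illustration, here is a hedged sketch of that decomposition step, with a hypothetical `llm` helper standing in for whatever language-model call and prompt the real system uses.

```python
def decompose_instruction(llm, instruction):
    """Ask a language model to split an instruction into ordered subtasks.

    `llm` is a hypothetical callable that sends a prompt to a large language
    model and returns its text reply; the actual prompting and parsing in
    VoxPoser differ, this only illustrates the idea.
    """
    prompt = (
        "Decompose the robot instruction into an ordered list of short subtasks.\n"
        f"Instruction: {instruction}\n"
        "Subtasks:"
    )
    reply = llm(prompt)
    return [line.strip("- ").strip() for line in reply.splitlines() if line.strip()]

# Expected behavior on the running example (actual output depends on the model):
# decompose_instruction(llm, "Open the top drawer and watch out for the vase")
#   -> ["grasp the drawer handle", "pull open the drawer while avoiding the vase"]
```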

VoxPoser optimizes each subtask to obtain a series of robot trajectories, ultimately minimizing the total work and working time.

When mapping language instructions into 3D maps with the LLM and VLM, the system exploits the rich semantic space that language conveys and uses the "entity of interest" to guide the robot's operation: the values written into the 3D Value Map reflect which objects "attract" the robot and which "repel" it.

Using the example from the beginning, the drawer "attracts" and the vase "repels".

How these values are generated, of course, depends on the large language model's ability to understand.

In the final trajectory-synthesis stage, because the language model's output stays constant throughout the task, its output can be cached and the generated code re-evaluated with closed-loop visual feedback, allowing fast re-planning whenever a disturbance is encountered.

As a result, VoxPoser is highly robust to disturbances.
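A hedged sketch of that closed loop is below; `cached_code`, `perceive`, `plan`, `execute_first_step`, and `done` are hypothetical placeholders for the cached LLM-generated code, the perception stack, the planner, and the robot interface.

```python
def closed_loop_execute(cached_code, perceive, plan, execute_first_step, done,
                        max_iters=500):
    """MPC-style loop: re-evaluate cached LLM output against fresh observations.

    The expensive LLM call happens once per task; only the cheap value-map
    evaluation and replanning run inside the loop, which is what allows quick
    recovery from disturbances.
    """
    for _ in range(max_iters):
        observation = perceive()                       # fresh RGB-D / object poses
        value_map = cached_code.evaluate(observation)  # re-run the generated code
        trajectory = plan(value_map)                   # replan on the updated map
        execute_first_step(trajectory)                 # act, then observe again
        if done(observation):
            break
```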

△ Put the waste paper into the blue tray

Below is VoxPoser's performance in real and simulated environments (measured by average success rate):

It significantly outperforms the primitive-based baseline regardless of environment or condition (with or without distractors, with seen or unseen instructions).

Finally, the authors were pleasantly surprised to find that VoxPoser exhibits four "emergent capabilities":

(1) Estimating physical properties: given two blocks of unknown mass, the robot can use tools to run physical experiments and determine which block is heavier;

(2) Behavioral commonsense reasoning: in a table-setting task, you can tell the robot "I am left-handed" and it understands the implication from context;

(3) Fine-grained correction: for high-precision tasks such as "put the lid on the teapot", you can give the robot precise corrections like "you are off by 1 cm" and it adjusts its operation;

(4) Vision-based multi-step operation: for instance, asking the robot to open a drawer exactly halfway. Without an object model, the available information may be insufficient, but VoxPoser can propose a multi-step strategy from visual feedback: first open the drawer fully while recording the handle's displacement, then push it back to the midpoint.
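A hedged sketch of that last strategy is below; `robot.pull_until_stop`, `robot.push_to`, and `track_handle_position` are hypothetical placeholders, not the paper's API.

```python
def open_drawer_halfway(robot, track_handle_position):
    """Open a drawer to its midpoint without an object model, using vision.

    Strategy described in the article: open fully while visually tracking the
    handle, measure the total travel, then push back to the midpoint.
    """
    start = track_handle_position()         # handle position before opening
    robot.pull_until_stop()                 # hypothetical full-open primitive
    fully_open = track_handle_position()    # handle position when fully open
    travel = fully_open - start             # total displacement observed visually
    robot.push_to(start + 0.5 * travel)     # return to the halfway point
```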

**Fei-Fei Li: the three North Stars of computer vision**

About a year ago, Fei-Fei Li wrote an article in the journal of the American Academy of Arts and Sciences, laying out three directions for the development of computer vision:

  • Embodied AI
  • Visual Reasoning
  • Scene Understanding

Fei-Fei Li believes embodied intelligence is not limited to humanoid robots: any tangible intelligent machine that can move through space is a form of embodied AI.

Just as ImageNet aims to represent a broad variety of real-world images, embodied-intelligence research needs to address complex and diverse human tasks, from folding laundry to exploring new cities.

Following instructions to perform such tasks requires vision, but not vision alone: it also requires visual reasoning to understand the three-dimensional relationships in a scene.

Finally, the machine must understand the people in the scene, including their intentions and social relationships: seeing someone open a refrigerator, it should infer that they are hungry; seeing a child sitting on an adult's lap, it should infer a parent-child relationship.

Robots combined with large models may be just one way to solve these problems.

Besides Fei-Fei Li, Jiajun Wu, an alumnus of Tsinghua's Yao Class who earned his Ph.D. at MIT and is now an assistant professor at Stanford University, also took part in this research.

The paper's first author, Wenlong Huang, is now a doctoral student at Stanford and contributed to the PaLM-E research during an internship at Google.

