New embodied-intelligence results from Fei-Fei Li's team! Connect a robot to a large model and it understands human speech directly, completing complex instructions with zero pre-training
Source: Qubit
The latest embodied-intelligence results from Fei-Fei Li's team are here:
A large model is connected to the robot, converting complex instructions into concrete action plans without any additional data or training.
The set of operable objects is also open-ended; there is no need to specify a range in advance. The robot can open bottles, press switches, and unplug charging cables.
**How can a robot understand human speech directly?**
Fei-Fei Li's team named the system VoxPoser. As the figure below shows, its principle is straightforward.
Given the environment information and a natural-language instruction, the LLM (large language model) writes code; the generated code then interacts with a VLM (visual language model) to guide the system in producing an operation-instruction map, the 3D Value Map.
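The pipeline step above can be sketched as follows. This is an illustrative toy, not VoxPoser's actual code: the `detect` helper, the scene coordinates, and the grid size are all invented, with `detect` standing in for the VLM that the generated code queries.

```python
import numpy as np

# Hypothetical perception helper standing in for the VLM (invented for illustration).
def detect(object_name):
    """Return the (x, y, z) voxel index of a named object, as a VLM detector might."""
    positions = {"drawer handle": (12, 7, 3), "vase": (20, 7, 3)}  # toy scene
    return positions[object_name]

def build_value_map(target, obstacle, shape=(32, 32, 16)):
    """Mark the target voxel attractive (high value) and the obstacle repulsive (low)."""
    value_map = np.zeros(shape)
    value_map[detect(target)] = 1.0     # entity of interest: pull the end-effector here
    value_map[detect(obstacle)] = -1.0  # avoidance region: push trajectories away
    return value_map

# e.g. for an instruction like "open the drawer, and watch out for the vase"
vmap = build_value_map("drawer handle", "vase")
print(vmap.max(), vmap.min())  # 1.0 -1.0
```

A motion planner can then treat this grid as a cost landscape, steering toward high-value voxels and away from negative ones.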
This process shows that, where traditional methods require additional pre-training, this approach uses a large model to guide the robot in how to interact with the environment, directly addressing the scarcity of robot training data.
Moreover, the same property gives it zero-shot capability: once the basic pipeline above is in place, it can handle any given task.
In the concrete implementation, the authors cast the idea of VoxPoser as an optimization problem.
What VoxPoser optimizes is each subtask in turn, obtaining a series of robot trajectories that minimize the total workload and working time.
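The per-subtask objective from the VoxPoser paper can be reconstructed roughly as follows; here \(\tau^{r}_{i}\) is the robot trajectory for subtask \(i\), \(\mathbf{T}_i\) the resulting sequence of environment states, and \(\ell_i\) the subtask instruction (a reconstruction from the surrounding description, not a verbatim quote):

```latex
\min_{\tau^{r}_{i}} \; F_{task}\!\left(\mathbf{T}_i, \ell_i\right) \;+\; F_{control}\!\left(\tau^{r}_{i}\right)
```

Intuitively, \(F_{task}\) scores how well the trajectory completes the subtask (the "workload" above), while \(F_{control}\) penalizes control effort and time.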
When mapping language instructions into 3D maps with the LLM and VLM, the system exploits the rich semantic space that language conveys: it uses an "entity of interest" to guide the robot's operation, with the values marked in the 3D Value Map indicating which objects "attract" the robot and which "repel" it.
Of course, generating these values depends on the large language model's understanding.
In the final trajectory-synthesis stage, the language model's output stays constant throughout the task, so the system caches it; when a disturbance occurs, it re-evaluates the generated code with closed-loop visual feedback and replans quickly.
VoxPoser is therefore highly robust to interference.
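The cache-and-replan idea can be sketched with a toy executor. Everything here is invented for illustration: the value map stands in for the cached language-model output, and a greedy one-voxel step stands in for the real trajectory optimizer.

```python
import numpy as np

class ClosedLoopExecutor:
    def __init__(self, value_map):
        # The language model's output (here, the value map it produced) is computed
        # once and cached; only perception is refreshed on each control step.
        self.cached_value_map = value_map

    def replan(self, current_pos):
        """Re-evaluate the cached map from the current position: greedily step
        toward the highest-value voxel (a stand-in for trajectory optimization)."""
        goal = np.unravel_index(np.argmax(self.cached_value_map),
                                self.cached_value_map.shape)
        return np.sign(np.array(goal) - np.array(current_pos))  # one voxel step

vmap = np.zeros((8, 8, 8))
vmap[6, 2, 4] = 1.0                  # goal voxel
ex = ClosedLoopExecutor(vmap)
pos = np.array([0, 0, 0])
for _ in range(10):                  # MPC-style loop: observe, replan, step
    pos = pos + ex.replan(pos)       # a disturbance could move `pos` here; the
                                     # cached map makes each replan cheap
print(pos)  # reaches the goal voxel [6 2 4]
```

Because the expensive LLM call is outside the loop, a disturbance only costs one cheap re-evaluation of the cached map, which is what makes the closed-loop behavior fast.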
Below is VoxPoser's performance in real and simulated environments (measured by average success rate):
Finally, the authors were pleasantly surprised to find that VoxPoser exhibits four "emergent abilities":
(1) Estimating physical properties: given two blocks of unknown mass, the robot can use tools to run a physical experiment and determine which block is heavier;
(2) Behavioral commonsense reasoning: in a table-setting task, telling the robot "I am left-handed" is enough for it to infer the meaning from context;
(3) Fine-grained correction: in high-precision tasks such as covering the teapot, precise instructions like "you are off by 1 cm" can be issued to correct its operation;
(4) Vision-based multi-step operations: for example, asking the robot to open a drawer exactly halfway. The missing information (there is no object model) could prevent the robot from completing such a task, but VoxPoser proposes a multi-step strategy from visual feedback: first open the drawer fully while recording the handle's displacement, then push it back to the midpoint.
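The half-open-drawer strategy in (4) can be sketched with a toy one-dimensional drawer. The `Drawer` class and its methods are invented for clarity; the point is only the two-step logic of measuring the full travel, then pushing back to half of it.

```python
class Drawer:
    """Toy 1-D drawer; its total travel is unknown to the 'robot' in advance."""
    def __init__(self, travel=0.30):        # 30 cm of travel, hidden from the policy
        self._travel = travel
        self.handle_pos = 0.0

    def pull_until_stop(self):
        self.handle_pos = self._travel      # step 1: open fully...
        return self.handle_pos              # ...while visual feedback reads the handle

    def push_to(self, target):
        self.handle_pos = max(0.0, min(self._travel, target))

drawer = Drawer()
full = drawer.pull_until_stop()   # recorded handle displacement when fully open
drawer.push_to(full / 2)          # step 2: push back to the midpoint
print(drawer.handle_pos)          # 0.15
```

The strategy succeeds without an object model because the unknown quantity (total travel) is measured by acting, not assumed.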
Fei-Fei Li: The 3 North Stars of Computer Vision
About a year ago, Fei-Fei Li wrote an article in the journal of the American Academy of Arts and Sciences, pointing out three directions for the development of computer vision:
Just as ImageNet set out to represent a wide variety of real-world images, embodied-intelligence research needs to address complex and diverse human tasks, from folding laundry to exploring a new city.
Carrying out such tasks from instructions requires vision, but not vision alone: it also demands visual reasoning to understand the three-dimensional relationships in a scene.
Finally, the machine must understand the people in the scene, including their intentions and social relationships: seeing someone open a refrigerator suggests they are hungry, and seeing a child sitting on an adult's lap suggests a parent-child relationship.
Robots combined with large models may be just one way to solve these problems.