"Great shock" of a CTO: GPT-4V autonomous driving five consecutive tests

Original source: Qubits

Image source: Generated by Unbounded AI

After much anticipation, GPT-4 has finally rolled out its vision-related capabilities.

This afternoon I quickly tested GPT-4V's image-understanding ability with some friends, and although I had high expectations, it still shocked us greatly.

Core Ideas:

**I think large models should handle the semantics-related problems in autonomous driving well, but their reliability and spatial awareness are still unsatisfactory.**

They should be more than capable of handling some of the so-called efficiency-related corner cases, but relying on a large model alone to drive independently and guarantee safety is still very far off.

Example 1: Unknown obstacles on the road

**GPT-4V's description:**

Accurate parts: it detected the 3 trucks, got the lead truck's license-plate number mostly right (setting aside the Chinese characters), described the weather and environment correctly, and identified the unknown obstacle ahead without any prompting.

Inaccurate parts: it did not say whether the third truck was on the left or the right, and it blindly guessed the text on top of the second truck's cab (perhaps due to insufficient resolution?).

That's not all. Let's continue with a small hint, asking what this object is and whether it can be driven over.

Impressive! We tested multiple similar scenarios, and its performance on unknown obstacles can only be described as amazing.

Example 2: Understanding water on the pavement

Without any prompt it automatically recognized the signage; this should count as a basic exercise. We continue with some hints.

Shocked again... It spontaneously pointed out the mist behind the truck, and also proactively mentioned the puddle, but once again said the direction was to the left... It feels like some prompt engineering may be needed to get GPT to output positions and directions more reliably.

Example 3: A vehicle crashed into the guardrail

We input the first frame. Since there is no temporal information, it assumed the truck on the right was simply stopped. So we gave it another frame:

Now it can state on its own that this car broke through the guardrail and is hanging on the edge of the road. Fantastic... Yet, oddly, it got the easier road signs wrong... All I can say is that this is a large model: it will always shock you, and you never know when it will leave you dumbfounded... One more frame:

This time it directly mentioned the debris on the road surface, and I was amazed again... It just got the arrow on the road wrong once more... Overall, the information that demands special attention in this scene is all covered, though the road-sign problem persists.

Example 4: Now for something funny

It can only be described as spot on. Compared to this, cases that used to seem extremely difficult, such as "someone waving at you", are child's play; the semantic corner cases can be solved.

Example 5: A famous scene... a delivery vehicle strays onto a newly paved road

At first it was conservative and did not directly guess the cause, offering a variety of hypotheses instead, which is also consistent with its alignment goals.

After applying chain-of-thought (CoT) prompting, the problem we found is that it does not understand that the car is an autonomous vehicle; by supplying this information, it can give more accurate answers.

Finally, through a chain of prompts, it could output the conclusion that newly laid asphalt is not suitable to drive on. The end result is fine, but the process was rather tortuous: it requires more prompt engineering, and the prompts need to be designed carefully.
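The chain of prompts described above can be sketched as a multi-turn conversation payload. This is only an illustrative reconstruction assuming the OpenAI chat-completions message format for vision inputs; the image URL and the exact wording are placeholders, not the original prompts:

```python
# Hypothetical reconstruction of the prompt chain: observe -> inject the
# missing fact -> ask for the safety-relevant conclusion.
IMAGE_URL = "https://example.com/delivery-vehicle-on-fresh-asphalt.jpg"  # placeholder

def make_cot_messages(image_url: str) -> list[dict]:
    """Build the three user turns of the chain as chat messages."""
    image_part = {"type": "image_url", "image_url": {"url": image_url}}
    return [
        # Step 1: open-ended observation, letting the model list hypotheses.
        {"role": "user", "content": [
            {"type": "text",
             "text": "Describe this scene. What might have happened? Think step by step."},
            image_part,
        ]},
        # Step 2: supply the fact the model could not infer on its own.
        {"role": "user", "content": [
            {"type": "text",
             "text": "The vehicle is an autonomous delivery vehicle. "
                     "Reconsider: why is it stopped here?"},
        ]},
        # Step 3: ask for the conclusion that matters for driving.
        {"role": "user", "content": [
            {"type": "text",
             "text": "Is the road surface here suitable to drive on? "
                     "Answer yes/no with a reason."},
        ]},
    ]

messages = make_cot_messages(IMAGE_URL)
print(len(messages))                      # 3
print(messages[0]["content"][1]["type"])  # image_url
```

In a real conversation the assistant's replies would be interleaved between these user turns; this sketch only shows how the chain is staged, which is the "design it well" part of the engineering.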

One reason may be that this is not a first-person view: the model can only speculate from a third-person viewpoint. So this example is not very conclusive.

Summary

These quick attempts have fully demonstrated the power and generalization of GPT-4V, and appropriate prompt engineering should be able to bring out its full strength.

Solving semantic corner cases looks very promising, but hallucination will still plague applications in safety-critical scenarios.

Very exciting. Personally, I think that using such large models well can greatly accelerate the development of L4 and even L5 autonomous driving. But does an LLM necessarily have to drive the car directly? End-to-end driving, in particular, remains a debatable question.
