AI cracks Google's CAPTCHA: the latest multimodal large model outdoes GPT-4V at spatial understanding

Original source: Qubits

Image source: Generated by Unbounded AI

Google CAPTCHA can't stop AI!

The latest multimodal large model easily finds all the traffic lights in a picture and accurately circles their exact locations.

Its performance directly surpasses GPT-4V.

This is "Ferret", a multimodal large model from a joint research team at Apple and Columbia University.

It has a stronger grasp of how image regions and text relate, which makes large models more accurate on "look, point, and answer" tasks.

For example, even the very small part (region 1) in the figure below can be identified as a shock absorber.

GPT-4V failed to answer this correctly, and it generally performs poorly on small regions.

So, how does Ferret do it?

** "Point a little" image big model understand **

The core problem Ferret tackles is bringing referring and grounding closer together in spatial understanding.

Referring means the model understands the semantics of a specified region exactly, that is, given a location, it can tell what is there.

Grounding means that, given a semantic description, the model can find the corresponding target in the image.

For humans, these two abilities form a natural combination, but many existing multimodal models learn referring and grounding separately.

Therefore, Ferret proposes a new hybrid region representation that combines discrete coordinates and continuous features to represent regions in an image.

This allows the model to distinguish objects whose bounding boxes are almost identical.

For example, with the two objects in the figure below, using only discrete bounding boxes leaves the model "confused"; combining them with continuous free-form representations resolves the problem.
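To make the intuition concrete, here is a tiny numpy sketch (the masks and the mean pooling are invented for illustration, not the paper's actual code): two objects can share an identical bounding box, yet features pooled over their free-form masks remain distinct.

```python
import numpy as np

np.random.seed(0)
feat = np.random.rand(8, 8, 16)            # toy feature map: H x W x C

# Two different free-form objects that happen to share the same bounding box,
# e.g. a solid plate vs. a ring of the same extent.
mask_a = np.zeros((8, 8), dtype=bool)
mask_a[1:7, 1:7] = True                     # solid square
mask_b = mask_a.copy()
mask_b[2:6, 2:6] = False                    # same extent, but hollow inside

def bounding_box(mask):
    ys, xs = np.nonzero(mask)
    return (xs.min(), ys.min(), xs.max(), ys.max())

# Discrete part: both objects collapse to the identical box (1, 1, 6, 6).
print(bounding_box(mask_a) == bounding_box(mask_b))   # True

# Continuous part: features pooled inside each mask still tell them apart.
feat_a = feat[mask_a].mean(axis=0)
feat_b = feat[mask_b].mean(axis=0)
print(np.allclose(feat_a, feat_b))                     # False
```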

To extract continuous features from diverse regions, the paper proposes a spatial-aware visual sampler capable of handling differences in sparsity across shapes.

As a result, Ferret can accept a variety of region inputs, such as points, bounding boxes, and free-form shapes, and understand their semantics.

In its output, it can automatically generate the coordinates of each grounded object alongside the text.
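As a purely hypothetical illustration (the prompt wording and coordinate format below are invented, not Ferret's exact interface), the two directions look roughly like this:

```python
# Referring: the user marks a region (a point, box, or free-form sketch) and asks about it.
referring_prompt = "What is the object in <region>?"   # <region> carries the hybrid region representation
referring_answer = "It is a shock absorber."

# Grounding: the user names a target and the model emits box coordinates inline with its answer.
grounding_prompt = "Find all the traffic lights in the image."
grounding_answer = "There are two traffic lights [120, 45, 180, 110] and [560, 30, 615, 95]."
```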

To achieve this, the Ferret architecture includes an image encoder, the spatial-aware visual sampler, and a large language model (LLM).

Ferret combines discrete coordinates and continuous features to form a hybrid region representation.

This representation is designed to solve the challenge of representing areas of various shapes and formats, including points, bounding boxes, and free-form shapes.

For the discrete part, each coordinate is quantized into a bin of a fixed target grid, and this quantization keeps the model robust across different image sizes.
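A minimal sketch of that quantization step (the bin count of 1000 is an assumption for illustration, not the paper's exact setting): normalizing by the image size before binning means the same relative position maps to the same discrete token at any resolution.

```python
NUM_BINS = 1000          # assumed grid resolution, for illustration only

def quantize(x, y, width, height, num_bins=NUM_BINS):
    """Map a pixel coordinate to a bin index on a fixed grid, independent of image size."""
    bx = min(int(x / width * num_bins), num_bins - 1)
    by = min(int(y / height * num_bins), num_bins - 1)
    return bx, by

# The same relative position yields the same discrete token regardless of resolution.
print(quantize(320, 240, 640, 480))     # (500, 500)
print(quantize(960, 720, 1920, 1440))   # (500, 500)
```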

The continuous features are extracted by the spatial-aware visual sampler, which uses a binary mask and the feature map to randomly sample points within the region of interest and obtain their features through bilinear interpolation.

These features are then processed by a spatial-awareness module inspired by 3D point cloud models, condensed into a single vector, and mapped into the large language model (LLM) for further processing.
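Here is a simplified PyTorch sketch of that sampling pipeline (the point count, mean pooling, and single linear projection are simplifications and assumptions; the actual module performs point-cloud-style sampling and grouping over several stages):

```python
import torch
import torch.nn.functional as F

def sample_region_feature(feature_map, mask, num_points=64, out_dim=4096):
    """Pool one feature vector from the region given by a binary mask.

    feature_map: (C, H, W) tensor from the image encoder
    mask:        (H, W) boolean tensor marking the region of interest
    Simplified stand-in for the spatial-aware visual sampler: random points
    inside the mask, bilinear interpolation, mean pooling, projection.
    """
    C, H, W = feature_map.shape
    ys, xs = torch.nonzero(mask, as_tuple=True)            # pixel coordinates inside the region
    idx = torch.randint(len(xs), (num_points,))            # randomly sample points in the ROI
    # grid_sample expects normalized (x, y) coordinates in [-1, 1]
    gx = xs[idx].float() / (W - 1) * 2 - 1
    gy = ys[idx].float() / (H - 1) * 2 - 1
    grid = torch.stack([gx, gy], dim=-1).view(1, 1, num_points, 2)
    sampled = F.grid_sample(feature_map[None], grid, mode="bilinear",
                            align_corners=True)            # (1, C, 1, num_points)
    pooled = sampled.squeeze().mean(dim=-1)                # (C,) one vector for the region
    proj = torch.nn.Linear(C, out_dim)                     # learned projection into the LLM
    return proj(pooled)                                    # (randomly initialized in this sketch)

# Example: a 1024-channel feature map and a small rectangular region mask.
fmap = torch.randn(1024, 24, 24)
m = torch.zeros(24, 24, dtype=torch.bool)
m[5:12, 8:20] = True
print(sample_region_feature(fmap, m).shape)                # torch.Size([4096])
```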

To strengthen Ferret's capabilities, the team also built a dataset called GRIT.

This dataset contains 1.1M samples and covers four main categories: individual objects, relationships between objects, region-specific descriptions, and region-based complex reasoning.

GRIT includes data converted from public datasets and instruction-tuning data generated with ChatGPT and GPT-4, plus an additional 95K hard negative samples to improve the model's robustness.
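Purely for illustration, one instruction-tuning entry in such a dataset might look something like this (field names and values are invented; the released schema may differ):

```python
# Hypothetical sample, invented for illustration only -- not GRIT's actual schema.
sample = {
    "category": "region-based complex reasoning",   # one of the four categories above
    "image": "example.jpg",
    "question": "Why might the object in <region> be placed here?",
    "answer": "It is a fire extinguisher [412, 88, 520, 305], kept near the exit for safety.",
}
```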

Experimental results show that the model not only delivers superior performance on classical referring and grounding tasks, but also far exceeds existing MLLMs on region-based multimodal dialogue that requires localization.

In addition, the study introduces Ferret-Bench, which evaluates referring/grounding, semantics, knowledge, and reasoning over local regions of an image.

Evaluated on LLaVA-Bench and Ferret-Bench, the Ferret model excelled across all tasks, especially on the three new tasks that require referring and visual grounding.

It also describes image details noticeably better and hallucinates significantly less.

**An all-Chinese team**

The Ferret model comes jointly from Apple's AI/ML team and Columbia University, with an all-Chinese author lineup.

You Haoxuan and Zhang Haotian are co-first authors.

You Haoxuan is currently a computer science Ph.D. student at Columbia University and will join Apple's AI/ML team after graduation. He graduated from Xidian University in 2018.

His research interests include vision-language understanding, text-to-image generation, and related vision-language topics.

Zhang Haotian is now a visual intelligence researcher on Apple's AI/ML team.

Before joining Apple, Haotian received his Ph.D. from the University of Washington and his bachelor's degree from Shanghai Jiao Tong University.

He is one of the lead authors of GLIP/GLIPv2, which was nominated for the CVPR 2022 Best Paper Award.

The team also includes Gan Zhe, Wang Zirui, Cao Liangliang, Yang Yinfei, and other accomplished multimodal large model researchers formerly at Google and Microsoft.

Paper Address:
