Google CAPTCHAs can't stop AI: the latest multimodal large model has more accurate spatial understanding than GPT-4V
Original source: Qubits
Google's CAPTCHAs can't stop AI!
The latest multimodal large model easily finds all the traffic lights in an image and accurately circles each one's location.
It can even tell that the very small part in the figure below (region 1) is a shock absorber.
** "Point a little" image big model understand **
The core problem Ferret solves is bringing two kinds of spatial understanding, referring and grounding, closer together.
Referring means that, given a region, the model understands its semantics exactly, i.e., it knows what the object at that location is.
Grounding means that, given a semantic description, the model can locate the corresponding target in the image.
For humans these two abilities are a natural combination, but many existing multimodal models handle referring or grounding only in isolation.
Ferret combines the two, which also lets the model distinguish objects whose bounding boxes are nearly identical.
For example, with the two objects in the figure below, a model that relies only on discrete bounding boxes gets "confused"; combining the boxes with continuous free-form features into a hybrid representation solves the problem nicely.
As a result, Ferret can accept a variety of region inputs, such as points, bounding boxes, and free-form shapes, and understand their semantics.
In its output, it automatically generates the coordinates of each grounded object alongside the text.
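To make the input/output idea concrete, here is a purely illustrative sketch of how a referring query and a grounded answer could look when regions travel through text; the placeholder token, field names, and coordinate serialization are assumptions for exposition, not Ferret's actual interface.

```python
# Illustrative only: the region placeholder and coordinate format below are
# assumptions, not Ferret's published prompt format.

# Referring: the user points at a region (a point, box, or free-form mask)
# and asks what it is.
referring_query = {
    "text": "What is the object in <region>?",
    "region": {"type": "box", "coords": [120, 48, 310, 290]},  # x1, y1, x2, y2
}

# Grounding: the model's answer carries box coordinates inline, so every
# object it mentions is localized in the image.
grounded_answer = (
    "A traffic light [120, 48, 310, 290] mounted on a pole [300, 10, 360, 470]."
)
```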
Ferret combines discrete coordinates and continuous features to form a hybrid region representation.
This representation addresses the challenge of representing regions of varied shapes and formats, including points, bounding boxes, and free-form shapes.
For the discrete part, each coordinate of a target box is quantized into one of a fixed set of bins, which keeps the model robust to different image sizes.
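As a rough illustration, the quantization step can be sketched as below; the bin count and exact rounding rule are assumptions, not Ferret's published settings.

```python
import numpy as np

def quantize_box(box, image_w, image_h, num_bins=1000):
    """Map a pixel-space box (x1, y1, x2, y2) to discrete bin indices.

    Sketch only: num_bins and the truncation rule are illustrative assumptions.
    """
    x1, y1, x2, y2 = box
    # Normalize to [0, 1] so the representation is independent of image size.
    normalized = np.array([x1 / image_w, y1 / image_h, x2 / image_w, y2 / image_h])
    # Quantize each normalized coordinate into one of `num_bins` bins.
    return np.clip((normalized * num_bins).astype(int), 0, num_bins - 1)

print(quantize_box((120, 48, 310, 290), image_w=640, image_h=480))
# [187 100 484 604]
```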
The continuous features are extracted by a spatial-aware visual sampler, which uses a binary mask to randomly sample points inside the region and obtains their features from the feature map through bilinear interpolation.
These features are then processed by a spatial-awareness module inspired by 3D point-cloud models, condensed into a single vector, and fed into the large language model (LLM) for further processing.
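A minimal PyTorch sketch of this continuous-feature path is shown below; the real module uses a point-cloud-style sample-group-pool pipeline, which is simplified here to random sampling plus bilinear interpolation and mean pooling, so treat it as an approximation rather than Ferret's actual implementation.

```python
import torch
import torch.nn.functional as F

def sample_region_feature(feature_map, mask, num_points=512):
    """feature_map: (C, H, W) image features; mask: (H_img, W_img) binary region mask."""
    ys, xs = torch.nonzero(mask, as_tuple=True)        # pixel positions inside the region
    idx = torch.randint(0, ys.numel(), (num_points,))  # randomly sample points in the region
    h, w = mask.shape
    # Normalize sampled coordinates to [-1, 1] so they can index the
    # (typically lower-resolution) feature map via grid_sample.
    grid_x = xs[idx].float() / (w - 1) * 2 - 1
    grid_y = ys[idx].float() / (h - 1) * 2 - 1
    grid = torch.stack([grid_x, grid_y], dim=-1).view(1, 1, num_points, 2)
    # Bilinearly interpolate features at the sampled points.
    feats = F.grid_sample(feature_map.unsqueeze(0), grid, align_corners=True)
    feats = feats.view(feature_map.shape[0], num_points)  # (C, num_points)
    # Condense into a single region vector (stand-in for the sample-group-pool stages).
    return feats.mean(dim=-1)                              # (C,)

region_vec = sample_region_feature(torch.randn(256, 24, 24), torch.ones(336, 336))
print(region_vec.shape)  # torch.Size([256])
```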
To train Ferret, the team built the GRIT dataset, which contains 1.1M samples and covers four main categories: individual objects, relationships between objects, region-specific descriptions, and region-based complex reasoning.
GRIT includes data converted from public datasets and instruction-tuning data generated with ChatGPT and GPT-4, plus an additional 95K hard negative samples to improve the model's robustness.
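For intuition only, a region-grounded instruction-tuning sample might be organized roughly as below; the field names and format are hypothetical illustrations, not GRIT's actual schema.

```python
# Hypothetical sample layout for illustration; not GRIT's real schema.
sample = {
    "image": "example.jpg",                        # hypothetical file name
    "category": "region-based complex reasoning",  # one of the four categories above
    "conversation": [
        {
            "from": "human",
            "value": "What could the object in <region> be used for?",
            "region": {"type": "box", "coords": [120, 48, 310, 290]},
        },
        {
            "from": "model",
            "value": "It is a traffic light [120, 48, 310, 290], used to control traffic at the intersection.",
        },
    ],
}
```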
Evaluated on LLaVA-Bench and Ferret-Bench, Ferret excelled across all tasks, especially the three new tasks that require both referring and grounding.
**All-Chinese Team**
The Ferret model comes jointly from Apple's AI/ML division and a Columbia University research group, with an all-Chinese author lineup.
You Haoxuan and Zhang Haotian are co-first authors.
You Haoxuan is currently a computer science Ph.D. student at Columbia University and will join Apple's AI/ML team after graduation. He graduated from Xidian University in 2018.
His research interests include vision-language understanding, text-to-image generation, and related vision-language topics.
Before joining Apple, Zhang Haotian received his Ph.D. from the University of Washington and his bachelor's degree from Shanghai Jiao Tong University.
He is one of the lead authors of GLIP/GLIPv2; GLIP was nominated for the CVPR 2022 Best Paper Award.
Paper Address: