Can AI understand what it generates? After experiments on GPT-4 and Midjourney, researchers have an answer

Article source: Heart of the Machine

Edit: Large plate of chicken, egg sauce

Without "understanding", there can be no "creation".

Image source: Generated by Unbounded AI

From ChatGPT to GPT-4, and from DALL・E 2/3 to Midjourney, generative AI has attracted unprecedented global attention. Its potential is enormous, but such capability also provokes fear and concern, and the issue has recently sparked fierce debate: first the Turing Award winners clashed publicly, and then Andrew Ng joined in.

In both language and vision, today's generative models can produce output in a matter of seconds that challenges even experts with years of skill and knowledge. This seems to provide compelling motivation for the claim that the models have surpassed human intelligence. At the same time, however, the models' output often contains basic errors of comprehension.

A paradox thus seems to emerge: how do we reconcile the seemingly superhuman abilities of these models with the persistent basic errors that most humans could correct?

Recently, the University of Washington and the Allen Institute for AI jointly released a paper to study this paradox.

Address:

This paper argues that the phenomenon arises because the configuration of capabilities in today's generative models deviates from the configuration of human intelligence. It proposes and tests the Generative AI Paradox hypothesis: generative models are trained to directly output expert-like results, a process that bypasses the understanding that would otherwise be needed to produce output of that quality. For humans this is very different: basic understanding is usually a prerequisite for expert-level output.

In this paper, the researchers test this hypothesis through controlled experiments, analyzing generative models' ability to generate and to understand in both the text and vision modalities. They first conceptualize the "understanding" of generative models from two perspectives:

    1. Given a generation task, to what extent can the model select the correct response in a discriminative version of the same task?
    2. Given a correctly generated response, to what extent can the model answer questions about the content and appropriateness of that response?

This yields two experimental setups: selective and interrogative.

The researchers found that in selective evaluation, models often perform as well as or better than humans in the generation setting, but fall short of humans in the discriminative (understanding) setting. Further analysis shows that, compared with GPT-4, human discrimination ability is more closely tied to generation ability and is more robust to adversarial inputs, and that the gap between model and human discrimination ability widens as task difficulty increases.

Similarly, in interrogative evaluation, although models can produce high-quality outputs across different tasks, the researchers observed that they often make mistakes when answering questions about those outputs, and their comprehension again falls below that of humans. The paper discusses a range of potential reasons for this divergence in capability configuration between generative models and humans, including the models' training objectives and the size and nature of their inputs.

The significance of this research is twofold. First, it implies that existing concepts of intelligence, derived from human experience, may not generalize to AI: even though AI's capabilities appear to mimic or surpass human intelligence in many ways, they may be fundamentally different from the patterns we expect of humans. Second, the findings suggest caution when studying generative models to gain insight into human intelligence and cognition, since seemingly expert-level, human-like outputs may conceal non-human mechanisms.

In conclusion, the generative AI paradox encourages us to study these models as an intriguing counterpoint to human intelligence, rather than as a parallel to it.

"The generative AI paradox highlights the interesting notion that AI models can create content that they themselves may not fully understand. This raises the potential problems behind the limitations of AI's understanding and its powerful generative capabilities." Netizens said.

What is the Generative AI Paradox

Let's start by looking at the generative AI paradox and the experimental design to test it.

Figure 1: Generative AI in language and vision can produce high-quality results. Paradoxically, however, the models struggle to demonstrate a selective (A, C) or interrogative (B, D) understanding of these outputs.

Generative models appear to acquire generative capabilities more efficiently than comprehension, in contrast to human intelligence, for which generation is usually the harder ability to acquire.

To test this hypothesis, an operational definition of each aspect of the paradox is required: first, for a given model and task t, with human intelligence as the baseline, what it means for generation ability to be "more effective" than understanding ability. Using g and u as performance measures for generation and comprehension respectively, the researchers formalize the Generative AI Paradox hypothesis (Hypothesis 1) as follows.

To put it simply: for a task t, if human generative performance g is comparable to the model's, then human comprehension performance u will be significantly higher than the model's (by more than some reasonable ε). In other words, the model understands worse than one would expect of a human with similarly strong generative ability.
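Based on the verbal statement above, one plausible rendering of this formalization (the paper's exact notation may differ) is:

```latex
% Hypothesis 1, reconstructed from the verbal description above (illustrative notation):
% if human and model generation performance on task t are comparable,
% then human understanding performance exceeds the model's by more than some epsilon > 0.
\[
  g_t(\mathrm{human}) \approx g_t(\mathrm{model})
  \;\Longrightarrow\;
  u_t(\mathrm{human}) - u_t(\mathrm{model}) > \epsilon .
\]
```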

The operational definition of generation is simple: given a task input (a question or prompt), generation means producing observable content that satisfies that input, so generation performance g (e.g., style, correctness, human preference) can be evaluated automatically or by humans. Comprehension, by contrast, is not defined by any single observable output, but it can be tested through clearly defined, measurable effects:

  1. Selective evaluation. For a task where the model can generate an answer, to what extent can it also select an accurate answer from a provided set of candidates in a discriminative version of the same task? A common example is multiple choice, one of the most common ways of testing both human understanding and natural-language understanding in language models (Figure 1, columns A and C).
  2. Interrogative evaluation. To what extent can the model accurately answer questions about the content and appropriateness of its own generated output? This is similar to an oral exam in education (Figure 1, columns B and D). A minimal sketch of both setups follows this list.
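As a concrete illustration, here is a minimal sketch of what the generative, selective, and interrogative settings might look like for a single language task. It assumes the OpenAI Python SDK with an API key configured; the `ask` helper, the model name, and the prompts are illustrative placeholders, not the paper's evaluation code.

```python
# Minimal sketch of the two "understanding" probes described above.
# Assumes `pip install openai` and an API key in the environment;
# the helper, model name, and prompts are illustrative, not the paper's code.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    """Send a single-turn prompt to a chat model and return its reply."""
    resp = client.chat.completions.create(
        model="gpt-4",  # placeholder; any chat model could be used here
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

task = "Write a one-sentence apology email for missing a meeting."

# 1) Generative setting: the model produces the answer itself.
generated = ask(task)

# 2) Selective (discriminative) setting: the model must pick the best
#    candidate for the same task from a fixed set of options.
candidates = [generated, "Buy milk and eggs.", "The meeting went great, thanks!"]
choice_prompt = (
    f"Task: {task}\n"
    + "\n".join(f"({i}) {c}" for i, c in enumerate(candidates))
    + "\nWhich option best completes the task? Answer with the number only."
)
selected = ask(choice_prompt)

# 3) Interrogative setting: the model answers a question about its own output.
question = (
    f'Here is an email: "{generated}"\n'
    "Does this email contain an apology? Answer yes or no."
)
answer = ask(question)

print(generated, selected, answer, sep="\n")
```

Comparing the model's accuracy in settings 2 and 3 against human accuracy on the same items is, in spirit, what the selective and interrogative evaluations measure.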

These definitions of understanding provide a blueprint for evaluating the "generative AI paradox," allowing the researchers to test whether Hypothesis 1 holds across different modalities, tasks, and models.

If models can generate, can they discriminate?

First, in the selective evaluation, the researchers performed a side-by-side performance analysis of the generative and discriminative variants of each task, assessing the models' generation and comprehension abilities in the language and vision modalities, and compared this generation and discrimination performance with that of humans.

Figure 2 below compares the generation and discrimination performance of GPT-3.5, GPT-4, and humans. In 10 of the 13 datasets, at least one model supports sub-hypothesis 1, performing better than humans at generation but worse than humans at discrimination; 7 of the 13 datasets support sub-hypothesis 1 for both models.

Expecting humans to generate detailed images the way vision models do is unrealistic: the average person cannot match the stylistic quality of a model like Midjourney, so human generative performance is assumed to be lower, and only discrimination accuracy is compared between models and humans. Similar to the language domain, Figure 3 shows that CLIP and OpenCLIP are also less accurate than humans at discrimination. Combined with the assumption that humans are weaker at generation, this is consistent with sub-hypothesis 1: vision AI is above the human average at generation but lags behind humans in understanding.
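For context, a CLIP-style discrimination test scores how well each candidate image matches a text prompt and picks the highest-scoring one. The sketch below shows one way such scoring might look, assuming the open_clip package and hypothetical local image files; it is not the paper's evaluation code.

```python
# Sketch of CLIP-style discrimination: given a prompt, pick which candidate
# image matches it best. Assumes `pip install open_clip_torch pillow torch`;
# the image paths are hypothetical and this is not the paper's evaluation code.
import torch
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

prompt = "a watercolor painting of a lighthouse at sunset"
candidate_paths = ["candidate_a.png", "candidate_b.png", "candidate_c.png"]

with torch.no_grad():
    # Embed the prompt and each candidate image, then normalize.
    text_feat = model.encode_text(tokenizer([prompt]))
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    images = torch.stack([preprocess(Image.open(p)) for p in candidate_paths])
    image_feat = model.encode_image(images)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)

    # Cosine similarity between the prompt and each candidate image.
    scores = (image_feat @ text_feat.T).squeeze(-1)

best = candidate_paths[scores.argmax().item()]
print(f"Best match for the prompt: {best}")
```

This kind of forced-choice accuracy is what Figure 3 compares against human performance on the same items.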

Figure 4 (left) compares GPT-4 with humans. It shows that when answers are lengthy and challenging, such as summarizing a long document, the model tends to make the most mistakes on the discriminative task. Humans, by contrast, maintain a consistently high accuracy rate across tasks of varying difficulty.

Figure 4 (right) shows OpenCLIP's discrimination performance compared with humans at different levels of difficulty. Taken together, these results highlight that humans can still discern the correct answer even when faced with challenging or adversarial samples, whereas this ability is weaker in the models. The discrepancy raises questions about how much these models truly understand.

Figure 5 illustrates a notable trend: raters tend to favor GPT-4 responses over human-generated responses.

Does the model understand the results it generates?

The previous section showed that models are generally good at generating accurate answers but lag behind humans on the discrimination task. Now, in the interrogative evaluation, the researchers ask the model questions directly about the content it has generated, to investigate the extent to which the model can demonstrate a meaningful understanding of that content, an area where humans are strong.

Figure 6 (left) shows the results for the language modality. Although the models excel at generation, they often make mistakes when answering questions about what they generated, indicating errors of understanding. Even though the questions concern the model's own output, human question-answering accuracy is consistently higher than the model's, and this holds despite the assumption that a human could not generate such text at the same speed or scale. As stated in sub-hypothesis 2, the researchers expect humans to achieve even higher accuracy on text they generated themselves. Note also that the humans in this study were not experts, and producing text as complex as the model's output would be a considerable challenge for them.

As a result, the researchers expect that if the models were compared against human experts, the gap in understanding one's own generated content would widen further, since a human expert would likely answer such questions with near-perfect accuracy.

Figure 6 (right) shows the results for questions in the vision modality. Image-understanding models still cannot match human accuracy when answering simple questions about elements of generated images. At the same time, state-of-the-art image generation models surpass most ordinary people in the quality and speed with which they generate images (ordinary people would find it difficult to produce similarly realistic images), suggesting that visual AI is stronger than humans at generation but weaker at understanding. Surprisingly, the performance gap relative to humans is smaller for the simple models than for advanced multimodal LLMs (i.e., Bard and BingChat), which show some fascinating visual understanding but still struggle to answer simple questions about generated images.
