Apple's text-to-image large model unveiled: Matryoshka Diffusion, supporting 1024×1024 resolution

Having grown used to Stable Diffusion, we now finally have a Matryoshka Diffusion model, and this one comes from Apple.

Original source: Heart of the Machine

Image source: Generated by Unbounded AI

In the era of generative AI, diffusion models have become a popular tool for applications such as image, video, 3D, audio, and text generation. However, extending diffusion models to the high-resolution domain remains a major challenge, because the model must re-encode the entire high-resolution input at every step. Solving this requires deep architectures with attention blocks, which makes optimization harder and consumes more computing power and memory.

What to do? Some recent work has focused on efficient network architectures for high-resolution images. However, none of the existing methods exhibit results beyond 512×512 resolution, and the generation quality lags behind that of mainstream cascade or latent methods.

Take OpenAI's DALL-E 2, Google's Imagen, and NVIDIA's eDiff-I as examples: they save computing power by learning a single low-resolution model plus multiple super-resolution diffusion models, with each component trained separately. Latent diffusion models (LDMs), on the other hand, learn only a low-resolution diffusion model and rely on a separately trained high-resolution autoencoder. In both cases, the multi-stage pipeline complicates training and inference and often requires careful tuning of hyperparameters.

In this paper, the researchers propose Matryoshka Diffusion Models (MDM), a novel diffusion model for end-to-end high-resolution image generation. The code will be released soon.

Paper address:

The main idea of the study is to use a nested UNet architecture to perform a joint diffusion process over multiple resolutions as part of high-resolution generation.

The study found that MDM, together with the nested UNet architecture, achieves: 1) a multi-resolution loss that greatly improves the convergence speed of high-resolution input denoising; and 2) an efficient progressive training schedule that starts by training a low-resolution diffusion model and gradually adds higher-resolution inputs and outputs according to a schedule. Experimental results show that combining the multi-resolution loss with progressive training strikes a better balance between training cost and model quality.

The study evaluated MDM in terms of class-conditional image generation as well as text-conditional image and video generation. MDM enables training high-resolution models without the need for cascades or latent diffusion. Ablation studies have shown that both multi-resolution loss and progressive training greatly improve training efficiency and quality.

Let's take a look at the following MDM-generated images and videos.

Methodology Overview

According to the researchers, MDM is a diffusion model trained end-to-end at high resolution while exploiting the hierarchical structure of the data. MDM first generalizes the standard diffusion model to an extended space, and then introduces a dedicated nested architecture and training procedure.

First, let's look at how the standard diffusion model is generalized to the extended space.

Unlike cascade or latent approaches, MDM learns a single diffusion process with a hierarchical structure by introducing a multi-resolution diffusion process in an extended space. This is shown in Figure 2 below.

Specifically, given a data point x ∈ R^N, the researchers define time-dependent latent variables z_t = {z_t^1, ..., z_t^R} ∈ R^(N_1 + ... + N_R).

According to the researchers, diffusion modeling in an extended space has two advantages. First, since we generally care only about the full-resolution output z_t^R during inference, all other intermediate-resolution outputs can be treated as additional latent variables z_t^r, which enriches the distribution being modeled. Second, the multi-resolution dependencies offer an opportunity to share weights and computation across the z_t^r, redistributing computation in a more efficient way and enabling efficient training and inference.
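
As a rough illustration of this extended space (the simple noise schedule and the helper name make_extended_latents below are assumptions for the sketch, not the paper's exact formulation), each component z_t^r can be viewed as a noisy version of the image downsampled to resolution r, with all resolutions sharing the same diffusion time step:

```python
import torch
import torch.nn.functional as F

def make_extended_latents(x, resolutions, t):
    """x: clean images (B, C, H, W); resolutions: sizes from lowest to highest;
    t: diffusion time in (0, 1). Returns the list [z_t^1, ..., z_t^R]."""
    zs = []
    for r in resolutions:
        x_r = F.interpolate(x, size=(r, r), mode="area")   # downsample to resolution r
        eps = torch.randn_like(x_r)                        # independent Gaussian noise
        zs.append((1 - t) ** 0.5 * x_r + t ** 0.5 * eps)   # illustrative noise schedule
    return zs

x = torch.rand(4, 3, 64, 64)
z1, z2, z3 = make_extended_latents(x, [16, 32, 64], t=0.5)  # z_t^1, z_t^2, z_t^3
```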

Let's take a look at how NestedUNet works.

Similar to typical diffusion models, the researchers implement MDM with a UNet structure in which skip connections run in parallel with computation blocks to preserve fine-grained input information; the computation blocks here contain multi-layer convolutions and self-attention layers. The paper gives pseudocode for both NestedUNet and the standard UNet.
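
As a rough illustration of that structure (a minimal PyTorch-style sketch under simplifying assumptions, not the paper's actual code; the class names, layer choices, and channel handling here are all illustrative), a NestedUNet is essentially a UNet whose middle blocks are replaced by another, lower-resolution UNet, so that a single network denoises every resolution of z_t at once:

```python
import torch
import torch.nn as nn

class UNet(nn.Module):
    """Minimal standard UNet block: downsample -> middle computation -> upsample,
    with a skip connection running in parallel to preserve fine-grained detail."""
    def __init__(self, channels=64):
        super().__init__()
        self.down = nn.Conv2d(channels, channels, 3, stride=2, padding=1)
        self.mid = nn.Conv2d(channels, channels, 3, padding=1)  # stand-in for conv + self-attention blocks
        self.up = nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1)
        self.skip = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        return self.up(self.mid(self.down(x))) + self.skip(x)

class NestedUNet(nn.Module):
    """NestedUNet sketch: the middle blocks are another (possibly itself nested)
    lower-resolution UNet, so predictions come out at every resolution."""
    def __init__(self, channels=64, depth=2):
        super().__init__()
        self.down = nn.Conv2d(channels, channels, 3, stride=2, padding=1)
        self.up = nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1)
        self.skip = nn.Conv2d(channels, channels, 1)
        self.to_inner = nn.Conv2d(channels, channels, 1)    # hands features down to the inner level
        self.from_inner = nn.Conv2d(channels, channels, 1)  # reads the inner level's prediction back
        self.inner = NestedUNet(channels, depth - 1) if depth > 1 else UNet(channels)

    def forward(self, xs):
        """xs: noisy latents [z_t^1 (lowest res), ..., z_t^R (highest res)];
        each latent is half the spatial size of the next, and len(xs) == depth."""
        x = xs[-1]
        h = self.down(x)
        if isinstance(self.inner, NestedUNet):
            # fuse this level's downsampled features into the next-lower noisy latent,
            # then let the inner nested UNet handle all remaining resolutions
            inner_outs = self.inner([*xs[:-2], xs[-2] + self.to_inner(h)])
            h = h + self.from_inner(inner_outs[-1])
        else:
            inner_outs = []
            h = self.inner(h)              # lowest level: an ordinary UNet
        return inner_outs + [self.up(h) + self.skip(x)]  # one prediction per resolution

# Usage: two nesting levels, the lower latent at half the size of the higher one.
z1, z2 = torch.randn(2, 64, 32, 32), torch.randn(2, 64, 64, 64)
preds = NestedUNet(channels=64, depth=2)([z1, z2])
print([tuple(p.shape) for p in preds])   # [(2, 64, 32, 32), (2, 64, 64, 64)]
```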

Besides being simpler than other hierarchical methods, NestedUNet also allows computation to be allocated in the most efficient way. As shown in Figure 3 below, the researchers found that MDM scales significantly better when most of the parameters and computation are allocated to the lowest resolution.

Finally, there is the learning procedure.

The researchers trained MDM at multiple resolutions using a conventional denoising objective, as shown in equation (3) below.
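
Equation (3) itself is not reproduced in this article; a rough reconstruction from the description above (the per-resolution weights ω_t^r and the x-prediction form are assumptions of this sketch) would read:

$$
\mathcal{L}_\theta \;=\; \mathbb{E}_{t,\,x,\,\epsilon}\left[\sum_{r=1}^{R} \omega_t^r \,\big\|\, x_\theta^r(z_t, t) - D^r(x) \,\big\|_2^2\right],
$$

where D^r(x) denotes x downsampled to resolution r and x_θ^r is the model's prediction at that resolution.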

Progressive training is used here. The researchers trained MDM end-to-end directly with equation (3) above and demonstrated better convergence than the baseline method. They also found that training of high-resolution models is greatly accelerated by a simple progressive training method similar to the one proposed in the GAN literature.

This training method avoids costly high-resolution training from the outset and accelerates overall convergence. In addition, they incorporated mixed-resolution training, which trains samples with different final resolutions simultaneously in a single batch.
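
A rough sketch of the progressive training loop described above, reusing the make_extended_latents and NestedUNet sketches from earlier (the resolutions, phase lengths, noise handling, and optimizer settings here are illustrative assumptions, not the paper's configuration):

```python
import torch
import torch.nn.functional as F

resolutions = [16, 32, 64]             # nesting levels, lowest to highest (toy values)
phase_steps = [300, 200, 200]          # hypothetical number of optimizer steps per phase
model = NestedUNet(channels=3, depth=len(resolutions))   # from the earlier sketch
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

for phase, steps in enumerate(phase_steps):
    active = resolutions[: phase + 1]                    # progressively add higher resolutions
    net = model
    for _ in range(len(resolutions) - len(active)):      # descend to the sub-network that
        net = net.inner                                  # covers only the active resolutions
    for _ in range(steps):
        x = torch.rand(8, 3, resolutions[-1], resolutions[-1])            # dummy image batch
        t = float(torch.rand(()))                                          # shared diffusion time
        xs = [F.interpolate(x, size=(r, r), mode="area") for r in active]  # clean targets per resolution
        zs = make_extended_latents(x, active, t)                           # noisy latents (earlier sketch)
        preds = net(zs)                                                    # predictions at every active resolution
        loss = sum(F.mse_loss(p, x_r) for p, x_r in zip(preds, xs))        # multi-resolution denoising loss
        opt.zero_grad(); loss.backward(); opt.step()
# Mixed-resolution training would additionally mix, within one batch, samples whose
# highest active resolution differs; that detail is omitted from this sketch.
```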

Experiments & Results

MDM is a general-purpose technique applicable to any problem whose input dimensions can be progressively compressed. A comparison between MDM and baseline approaches is shown in Figure 4 below.

Table 1 shows a comparison on ImageNet (FID-50K) and COCO (FID-30K).

Figures 5, 6, and 7 below illustrate the results of MDM in image generation (Figure 5), text-to-image (Figure 6), and text-to-video (Figure 7). Despite being trained on a relatively small dataset, MDM has demonstrated a strong zero-shot ability to produce high-resolution images and videos.
