Apple's text-to-image model unveiled: Matryoshka Diffusion, supporting 1024×1024 resolution
Original source: Heart of the Machine
In the era of generative AI, diffusion models have become a popular tool for applications such as image, video, 3D, audio, and text generation. However, extending diffusion models to the high-resolution domain remains a major challenge, because the model must re-encode the entire high-resolution input at every step. Addressing this requires deep architectures with attention blocks, which makes optimization harder and consumes more compute and memory.
Some recent work has therefore focused on efficient network architectures for high-resolution images, but none of the existing methods demonstrate results beyond 512×512 resolution, and their generation quality lags behind that of mainstream cascade or latent methods.
Take OpenAI's DALL-E 2, Google's Imagen, and NVIDIA's eDiff-I as examples: they save compute by learning a single low-resolution model plus several super-resolution diffusion models, with each component trained separately. Latent diffusion models (LDMs), on the other hand, learn only a low-resolution diffusion model and rely on a separately trained high-resolution autoencoder. In both cases, the multi-stage pipeline complicates training and inference and often requires careful fine-tuning or hyperparameter tuning.
In this paper, the researchers propose Matryoshka Diffusion Models (MDM), a novel diffusion model for end-to-end high-resolution image generation. The code will be released soon.
The main idea of the study is to perform a joint diffusion process over multiple resolutions, using a nested UNet architecture for high-resolution generation.
The study found that MDM, together with the nested UNet architecture, enables: 1) a multi-resolution loss, which greatly improves the convergence speed of denoising high-resolution inputs; and 2) an efficient progressive training schedule, which starts by training a low-resolution diffusion model and gradually adds higher-resolution inputs and outputs according to a schedule. Experimental results show that combining the multi-resolution loss with progressive training achieves a better trade-off between training cost and model quality.
The study evaluated MDM in terms of class-conditional image generation as well as text-conditional image and video generation. MDM enables training high-resolution models without the need for cascades or latent diffusion. Ablation studies have shown that both multi-resolution loss and progressive training greatly improve training efficiency and quality.
Let's take a look at the following MDM-generated images and videos.
According to the researchers, the MDM diffusion model is trained end-to-end at high resolution while exploiting the hierarchical structure of the data. MDM first generalizes the standard diffusion model to an extended space, and then introduces a dedicated nested architecture and training procedure.
First, let's look at how the standard diffusion model is generalized to the extended space.
Unlike cascade or latent approaches, MDM learns a single diffusion process with a hierarchical structure by introducing a multi-resolution diffusion process in an extended space. This is shown in Figure 2 below.
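To make the extended-space view more concrete, here is a minimal sketch of a joint forward process that noises downsampled copies of the same image at several resolutions, all sharing the diffusion step t. This is an illustration rather than the paper's code; the function name, the resolution list, and the precomputed schedule tensor `alphas_cumprod` are assumptions.

```python
import torch
import torch.nn.functional as F

def forward_diffuse_multires(x, t, alphas_cumprod, resolutions=(1024, 256, 64)):
    # Noise downsampled copies of the same image at every resolution,
    # all sharing the diffusion step t (the "extended space" view).
    z_t, noises = [], []
    a_bar = alphas_cumprod[t]                                    # cumulative alpha at step t
    for r in resolutions:
        x_r = F.interpolate(x, size=(r, r), mode="area")         # image at resolution r x r
        noise = torch.randn_like(x_r)
        z_t.append(a_bar.sqrt() * x_r + (1 - a_bar).sqrt() * noise)
        noises.append(noise)
    return z_t, noises
```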
Let's take a look at how the NestedUNet works.
Similar to typical diffusion models, the researchers implement MDM with a UNet structure, in which residual connections and computational blocks run in parallel to preserve fine-grained input information; the computational blocks contain multi-layer convolutions and self-attention layers. A sketch of how the NestedUNet nests a lower-resolution network inside a standard UNet is given below.
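The following is a minimal sketch of the nesting idea only, not the paper's implementation: the block definitions, channel handling, and the way lower-resolution outputs are fused are all simplified assumptions (in particular, the real computational blocks also include self-attention).

```python
import torch
import torch.nn as nn

class NestedUNet(nn.Module):
    # Sketch of the nesting idea: the middle of the UNet at one resolution is
    # itself a (Nested)UNet that operates on the next-lower-resolution input.
    def __init__(self, channels, inner=None):
        super().__init__()
        self.down = nn.Conv2d(channels, channels, 4, stride=4)         # 4x downsampling "encoder"
        self.inner = inner                                             # nested lower-resolution network, or None
        self.up = nn.ConvTranspose2d(channels, channels, 4, stride=4)  # 4x upsampling "decoder"
        self.merge = nn.Conv2d(2 * channels, channels, 1)              # fuse with the skip connection

    def forward(self, z_list):
        # z_list: noisy inputs ordered from highest to lowest resolution
        h = self.down(z_list[0])
        lower_preds = []
        if self.inner is not None:
            lower_preds = self.inner(z_list[1:])       # recurse on the lower resolutions
            h = h + lower_preds[0]                     # fuse the next-lower resolution's output
        out = self.merge(torch.cat([self.up(h), z_list[0]], dim=1))
        return [out] + lower_preds                     # one prediction per resolution, high -> low

# Example: a three-level model for 1024/256/64 inputs (channel count is illustrative).
model = NestedUNet(3, inner=NestedUNet(3, inner=NestedUNet(3)))
```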
The researchers trained MDM at multiple resolutions using the conventional denoising objective, as shown in equation (3) below.
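As a rough illustration under the same assumptions as the sketches above, such a multi-resolution denoising objective can be written as a weighted sum of per-resolution reconstruction losses; the x-prediction target and uniform weights here are illustrative choices, not necessarily those of equation (3).

```python
import torch
import torch.nn.functional as F

def multires_denoising_loss(model, x, t, alphas_cumprod,
                            resolutions=(1024, 256, 64), weights=(1.0, 1.0, 1.0)):
    # Every resolution contributes its own denoising loss at the shared step t.
    z_t, targets = [], []
    a_bar = alphas_cumprod[t]
    for r in resolutions:
        x_r = F.interpolate(x, size=(r, r), mode="area")             # clean image at resolution r
        noise = torch.randn_like(x_r)
        z_t.append(a_bar.sqrt() * x_r + (1 - a_bar).sqrt() * noise)  # noisy input at resolution r
        targets.append(x_r)
    preds = model(z_t)                                               # e.g. the NestedUNet sketch above
    return sum(w * F.mse_loss(p, tgt) for w, p, tgt in zip(weights, preds, targets))
```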
This training scheme avoids costly high-resolution training from the outset and accelerates overall convergence. In addition, the researchers incorporate mixed-resolution training, in which samples with different final resolutions are trained simultaneously within a single batch.
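A minimal sketch of what such a progressive schedule could look like; the phase boundaries and resolution sets below are chosen purely for illustration.

```python
# Progressive training: start with the low-resolution model only, then
# extend the nested stack to higher resolutions at predefined steps.
# The step boundaries and resolution sets are illustrative assumptions.
PHASES = [
    (0,       (64,)),             # phase 1: low resolution only
    (100_000, (256, 64)),         # phase 2: add 256x256 inputs/outputs
    (300_000, (1024, 256, 64)),   # phase 3: train the full nested stack
]

def active_resolutions(step):
    # Return the set of resolutions trained at a given optimization step.
    current = PHASES[0][1]
    for boundary, resolutions in PHASES:
        if step >= boundary:
            current = resolutions
    return current
```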
Experiments & Results
MDM is a general-purpose technique for any problem whose input dimensions can be progressively compressed. The comparison of MDM with the baseline approaches is shown in Figure 4 below.