Text-to-3D breakthrough! MVDream generates ultra-realistic 3D models from a single sentence

Original source: Xinzhiyuan

Image source: Generated by Unbounded AI

Awesome!

Can you really create beautiful, high-quality 3D models with just a few words?

Well, a blog post that recently took off online has put something called MVDream in front of us.

Users can create a lifelike 3D model with just a few words.

And unlike earlier attempts, MVDream seems to really "understand" physics.

Let’s take a look at how amazing this MVDream is~

MVDream

The blogger notes that in the era of large models, we have seen plenty of text-generation and image-generation models, and their performance keeps getting stronger.

Later we even witnessed the birth of text-to-video models, and of course the text-to-3D models we are discussing today.

Just imagine: you type a single sentence and get an object model that looks like it exists in the real world, complete with all the necessary details. How cool would that be?

This is by no means an easy task, especially since the generated models need to be convincingly realistic in their details.

Let’s take a look at the effect first~

For the same prompt, the result on the far right is MVDream's finished product.

The difference between the five models is visible to the naked eye: the first few completely violate physical reality and only look correct from certain angles.

For example, in the first four results the generated character actually has more than two ears. The fourth one looks more detailed, but turn it to a certain angle and you find that the character's face is concave, with an ear stuck onto it.

This immediately reminded the editor of the once-viral front view of Peppa Pig.

The kind of model that only works from the angles it shows you; look at it from any other angle and the illusion falls apart.

But the MVDream result on the far right is clearly different: no matter how you rotate the 3D model, nothing looks off.

This is what was mentioned at the beginning: MVDream genuinely understands physical common sense and doesn't resort to strange tricks just to ensure there are two ears in every view.

The blogger points out that what matters most for a successful 3D model is whether its different viewpoints are realistic enough and of high enough quality.

The model also needs to be spatially coherent, unlike the multi-eared examples above.

One of the main ways to generate 3D models is to simulate a camera's viewpoint and generate what would be visible from that viewpoint.

This approach is called 2D lifting: different viewpoints are stitched together to form the final 3D model.

The multi-ear problem above occurs because the generative model has not fully grasped the shape of the whole object in three-dimensional space. MVDream is precisely a big step forward in this regard.

The new model solves the 3D view-consistency problem that has plagued earlier methods.

Score distillation sampling

The method used is called score distillation sampling (SDS), introduced by DreamFusion.

Before getting into score distillation sampling, we first need to understand the architecture this method uses.

In short, it is at heart just another diffusion model for two-dimensional images, similar to DALL·E, Midjourney, and Stable Diffusion.

More specifically, everything starts from a pre-trained DreamBooth model, an open-source text-to-image model built on Stable Diffusion.

Then comes the change.

What the research team did next was render a set of multi-view images directly, instead of just one image. This step requires a 3D dataset of various objects.

Here, the researchers took multiple views of each 3D object in the dataset, used them to train the model, and then ran the model in reverse to generate such views.

The specific change is to turn the blue self-attention block in the figure below into a three-dimensional self-attention block. In other words, the researchers only need to add one dimension so that the model reconstructs multiple images instead of a single one.
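
As a rough illustration of what "adding one dimension" means here: you can fold the view axis into the token axis so that a pretrained 2D self-attention layer ends up attending across all views at once. The sketch below is my own simplification, not MVDream's actual code; the shapes and module names are assumptions.

```python
import torch
import torch.nn as nn

class InflatedSelfAttention(nn.Module):
    """Hypothetical sketch: reuse a pretrained 2D self-attention layer
    across multiple views by folding the view axis into the token axis."""

    def __init__(self, attn_2d: nn.MultiheadAttention):
        super().__init__()
        self.attn = attn_2d  # pretrained 2D self-attention, weights untouched

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, views, tokens, channels) -- one token sequence per view
        b, v, n, c = x.shape
        # "Adding one dimension": flatten views into the sequence so every
        # token can attend to every token of every view
        x = x.reshape(b, v * n, c)
        out, _ = self.attn(x, x, x, need_weights=False)
        return out.reshape(b, v, n, c)

# Usage sketch: 2 samples, 4 views, 32x32 latent tokens, 320 channels
attn_2d = nn.MultiheadAttention(embed_dim=320, num_heads=8, batch_first=True)
layer = InflatedSelfAttention(attn_2d)
x = torch.randn(2, 4, 32 * 32, 320)
print(layer(x).shape)  # torch.Size([2, 4, 1024, 320])
```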

In the figure below, we can also see that the camera parameters and the timestep are fed into the model for each view, helping the model understand which image is used where and which view needs to be generated.

Now all the images are generated together and connected to one another, so they can share information and better understand the big picture.
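
As a loose sketch of what per-view conditioning might look like (the module below is hypothetical, not the authors' implementation): each view gets a camera embedding, here produced by a small MLP over the flattened camera matrix, which is added to the shared diffusion timestep embedding.

```python
import torch
import torch.nn as nn

class ViewConditioning(nn.Module):
    """Hypothetical sketch: give each view its own camera embedding and add it
    to the shared diffusion timestep embedding."""

    def __init__(self, embed_dim: int = 1280):
        super().__init__()
        # Camera extrinsics (a 4x4 matrix, flattened to 16 values) -> embedding
        self.camera_mlp = nn.Sequential(
            nn.Linear(16, embed_dim), nn.SiLU(), nn.Linear(embed_dim, embed_dim)
        )

    def forward(self, t_emb: torch.Tensor, extrinsics: torch.Tensor) -> torch.Tensor:
        # t_emb: (batch, embed_dim)            shared timestep embedding
        # extrinsics: (batch, views, 4, 4)     camera-to-world matrices
        b, v = extrinsics.shape[:2]
        cam_emb = self.camera_mlp(extrinsics.reshape(b, v, 16))  # (b, v, embed_dim)
        # Broadcast the timestep embedding over views, then add the camera embedding
        return t_emb[:, None, :] + cam_emb                        # (b, v, embed_dim)

cond = ViewConditioning()
t_emb = torch.randn(2, 1280)
cams = torch.eye(4).expand(2, 4, 4, 4)
print(cond(t_emb, cams).shape)  # torch.Size([2, 4, 1280])
```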

The text prompt is then fed into the model, which is trained to accurately reconstruct the objects in the dataset.

And this is where the research team applies the multi-view score distillation sampling process.

Now, with a multi-view diffusion model, the team can generate multiple views of an object.

The next step is to use these views to reconstruct a 3D model that is consistent with the real world, not just a set of views.

Achieving this requires NeRF (neural radiance fields), just as in the DreamFusion work mentioned earlier.

Basically, this step freezes the previously trained multi-view diffusion model. In other words, the views above are only "used" here; the diffusion model itself is not trained any further.

Guided by an initial rendering, the researchers use the multi-view diffusion model to generate noise-added versions of the initial images.

The researchers added noise to let the model know that it needed to generate different versions of the image while still receiving context.

The model is then used to generate higher-quality images.

The noise that was added manually is then removed, by comparing it against the model's prediction, and the residual is used to guide and improve the NeRF model in the next step.
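
For reference, the score distillation sampling gradient as given in the DreamFusion paper takes roughly this form, where theta denotes the NeRF parameters, x = g(theta) is a rendered view, epsilon is the noise that was added, and epsilon-hat is the frozen diffusion model's noise prediction:

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}\bigl(\phi,\, x = g(\theta)\bigr)
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right]
```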

All of these steps help determine which parts of the image the NeRF model should focus on, so that it produces better results in the next iteration.

Repeat this until a satisfactory 3D model is generated.
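
Putting the whole loop together, a heavily simplified sketch might look like the following. Note that `nerf`, `frozen_mv_diffusion`, and the noise schedule are placeholders of my own, not MVDream's or threestudio's actual API.

```python
import torch

def optimize_nerf_with_mv_sds(nerf, frozen_mv_diffusion, prompt_emb, cameras,
                              steps=10_000, lr=0.01, num_timesteps=1000):
    """Sketch of multi-view score distillation sampling: the diffusion model
    stays frozen and only the NeRF parameters receive gradient updates."""
    opt = torch.optim.AdamW(nerf.parameters(), lr=lr)
    # Simple DDPM-style noise schedule (an assumption; the real one may differ)
    betas = torch.linspace(1e-4, 2e-2, num_timesteps)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)

    for step in range(steps):
        # 1. Render the current NeRF from several camera viewpoints
        views = nerf.render(cameras)                     # (V, C, H, W)

        # 2. Add noise at a randomly sampled diffusion timestep
        t = torch.randint(20, num_timesteps - 20, (1,))
        noise = torch.randn_like(views)
        a = alpha_bar[t].sqrt()
        s = (1.0 - alpha_bar[t]).sqrt()
        noisy = a * views + s * noise

        # 3. Let the frozen multi-view model predict the noise,
        #    conditioned on the text prompt and the camera parameters
        with torch.no_grad():
            noise_pred = frozen_mv_diffusion(noisy, t, prompt_emb, cameras)

        # 4. SDS: the (prediction - added noise) residual acts as a gradient
        #    on the rendered pixels and backpropagates into the NeRF only
        grad = (noise_pred - noise).detach()
        loss = (grad * views).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return nerf
```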

The team also evaluated the image-generation quality of the multi-view diffusion model and how different design choices affect its performance.

First, they compared different choices of attention module for building cross-view consistency into the model.

These options include:

(1) the one-dimensional temporal self-attention widely used in video diffusion models;

(2) adding a new three-dimensional self-attention module to the existing model;

(3) reusing the existing 2D self-attention module for 3D attention.

In this experiment, to highlight the differences between these modules, the researchers trained the model on 8 frames spanning a 90-degree change in viewing angle, a setting closer to video.

The research team also kept a relatively high image resolution of 512×512, the same as the original SD model. The results are shown in the figure below: even with such limited viewpoint changes in static scenes, temporal self-attention still suffers from content drift and cannot maintain cross-view consistency.

The team hypothesizes that this is because temporal attention can only exchange information between the same pixel locations across frames, while corresponding pixels may be far apart when the viewpoint changes.
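
The difference comes down to which axes get folded into the attention sequence. A shape-only sketch (the tensor layout here is my own assumption):

```python
import torch

# A batch of latent feature maps: (batch, views, height, width, channels)
x = torch.randn(1, 8, 32, 32, 320)
b, v, h, w, c = x.shape

# (1) Temporal self-attention: each spatial location attends only across
#     frames at that same pixel, so each attention sequence has v tokens.
temporal_tokens = x.permute(0, 2, 3, 1, 4).reshape(b * h * w, v, c)
print(temporal_tokens.shape)  # torch.Size([1024, 8, 320])

# (2) would use the same layout as (3) below, but with a new, randomly
#     initialized attention module instead of the pretrained 2D weights.

# (3) Reusing 2D self-attention as 3D attention: every pixel of every view
#     attends to every pixel of every other view in one long sequence.
full_3d_tokens = x.reshape(b, v * h * w, c)
print(full_3d_tokens.shape)   # torch.Size([1, 8192, 320])
```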

On the other hand, adding new 3D attention without learning consistency can lead to severe quality degradation.

The researchers attribute this to the fact that learning new parameters from scratch requires more training data and time, which does not suit this setting where 3D data is limited. The strategy they propose, reusing the 2D self-attention, achieves the best consistency without degrading generation quality.

The team also noticed that if the image size is reduced to 256 and the number of views to 4, the differences between these modules become much smaller. Still, to achieve the best consistency, the researchers based their choices in the following experiments on these preliminary observations.

In addition, for multi-view score distillation sampling, the researchers implemented multi-view diffusion guidance in the threestudio library, which brings state-of-the-art text-to-3D generation methods under a unified framework.

The researchers used threestudio's implicit-volume implementation as the 3D representation, which includes a multi-resolution hash grid.
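
For readers unfamiliar with the term, a multi-resolution hash grid maps a 3D point to learned features at several grid resolutions via spatial hashing. The toy encoder below is a generic, heavily reduced illustration in the spirit of Instant-NGP, not threestudio's implementation (it looks up the nearest grid vertex and skips interpolation entirely):

```python
import torch
import torch.nn as nn

class TinyHashGrid(nn.Module):
    """Generic, heavily reduced multi-resolution hash-grid encoder
    (illustration only, not threestudio's implementation)."""

    def __init__(self, levels: int = 4, table_size: int = 2**14,
                 feat_dim: int = 2, base_res: int = 16):
        super().__init__()
        self.base_res = base_res
        # One learnable feature table per resolution level
        self.tables = nn.ParameterList(
            [nn.Parameter(torch.randn(table_size, feat_dim) * 1e-4)
             for _ in range(levels)]
        )
        # Large primes for spatial hashing of integer grid coordinates
        self.primes = torch.tensor([1, 2654435761, 805459861])

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        # xyz: (N, 3) points in [0, 1]^3 -> concatenated per-level features
        feats = []
        for lvl, table in enumerate(self.tables):
            res = self.base_res * (2 ** lvl)
            idx = (xyz * res).long()                     # nearest grid vertex
            h = (idx * self.primes).sum(-1) % table.shape[0]
            feats.append(table[h])
        return torch.cat(feats, dim=-1)                  # (N, levels * feat_dim)

enc = TinyHashGrid()
print(enc(torch.rand(5, 3)).shape)  # torch.Size([5, 8])
```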

For the camera view, the researchers sampled the camera in exactly the same way as when rendering the 3D dataset.

In addition, the researchers also used the AdamW optimizer to optimize the 3D model for 10,000 steps with a learning rate of 0.01.

For score distillation sampling, the maximum and minimum timesteps were annealed over the first 8,000 steps from 0.98 down to 0.5 and 0.02 respectively.

The rendering resolution starts at 64×64 and gradually increases to 256×256 after 5000 steps.
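
The training-schedule details above can be collected into one small helper. This is purely my own illustration of the reported numbers, not threestudio's configuration format, and it approximates the gradual resolution increase with a single jump at step 5,000.

```python
def sds_schedule(step: int) -> dict:
    """Hypothetical helper summarizing the reported settings: AdamW at lr 0.01
    for 10,000 steps; max/min SDS timesteps annealed over the first 8,000 steps
    from 0.98 down to 0.5 and 0.02; rendering resolution 64 -> 256."""
    frac = min(step / 8000.0, 1.0)
    t_max = 0.98 - frac * (0.98 - 0.50)   # 0.98 -> 0.50
    t_min = 0.98 - frac * (0.98 - 0.02)   # 0.98 -> 0.02
    resolution = 64 if step < 5000 else 256
    return {"lr": 0.01, "t_max": t_max, "t_min": t_min, "resolution": resolution}

print(sds_schedule(0))     # t_max=0.98, t_min=0.98, resolution=64
print(sds_schedule(9000))  # t_max=0.5,  t_min~=0.02, resolution=256
```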

More examples are shown below:

That is how the research team took a 2D text-to-image model, adapted it for multi-view synthesis, and finally used it iteratively to create a text-to-3D model.

Of course, the new method still has limitations. The main one is that the generated images are only 256×256 pixels, which is a very low resolution.

In addition, the researchers point out that the size of the dataset used for this task inevitably limits the generality of the method: if the dataset is too small, it cannot reflect our complex world realistically enough.
