"TripoSR: Fast 3D object reconstruction from a single image".

This is an impressive system: you put in a single image, and it generates a 3D model. They show videos orbiting the 3D model a full 360 degrees.

I was, however, surprised and a bit disappointed to discover that the output of this model is a neural radiance field (NeRF), not a traditional polygon-based 3D model that you could plug into your existing video games, or even a model using the newer Gaussian splatting technique. A NeRF is a neural network that takes light rays as input and outputs the pixel values for what you should see along each ray. It's like neural ray tracing.
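
To make the "neural ray tracing" idea concrete, here is a minimal PyTorch sketch of how a NeRF turns one camera ray into one pixel: sample points along the ray, ask the network for a density and color at each point, and alpha-composite the results. This is just an illustration of the general technique, not TripoSR's actual code; the toy MLP and all the numbers in it are placeholders.

```python
# Minimal sketch of NeRF-style volume rendering (illustrative only, not TripoSR's code).
# A "radiance field" is just a function f(x) -> (density, rgb); a pixel's color is the
# alpha-composite of that function sampled along the camera ray for that pixel.
import torch

def render_ray(field, origin, direction, near=0.5, far=2.5, n_samples=64):
    """Estimate the color seen along one camera ray by numerical integration."""
    t = torch.linspace(near, far, n_samples)                 # depths along the ray
    points = origin + t[:, None] * direction                 # (n_samples, 3) sample positions
    density, rgb = field(points)                             # query the radiance field
    delta = (far - near) / n_samples                         # spacing between samples
    alpha = 1.0 - torch.exp(-density * delta)                # opacity of each segment
    transmittance = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha[:-1] + 1e-10]), dim=0
    )                                                        # light surviving up to each sample
    weights = transmittance * alpha
    return (weights[:, None] * rgb).sum(dim=0)               # composited pixel color

# Toy stand-in for a trained radiance field: a random MLP mapping xyz -> (density, rgb).
mlp = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.ReLU(), torch.nn.Linear(64, 4))

def toy_field(points):
    out = mlp(points)
    return torch.relu(out[:, 0]), torch.sigmoid(out[:, 1:])  # density >= 0, rgb in [0, 1]

pixel = render_ray(toy_field, torch.zeros(3), torch.tensor([0.0, 0.0, 1.0]))
print(pixel)  # RGB values for this one ray; a full image repeats this per pixel
```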

First, a description of how the TripoSR system works. It builds on an earlier system called LRM, which stands for "large reconstruction model", and LRM in turn is built on DINO plus a generative adversarial network (GAN) that makes NeRFs from DINO's output. DINO, short for self-DIstillation with NO labels, is a vision transformer (ViT) that was made unusually large and trained in a "self-supervised" manner, analogous to how GPT is trained on language.
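
As a rough illustration of what "DINO features" means in practice, here is how you might pull per-patch features out of a published DINO ViT with PyTorch. The torch.hub entry point and the get_intermediate_layers call are as I recall them from the DINO repo, so treat the exact names as assumptions; this is not the code LRM or TripoSR actually uses.

```python
# Sketch: extracting DINO ViT features for an image (names assumed from the DINO repo).
import torch

# Self-supervised DINO ViT-S/16 weights published by Meta AI research.
dino = torch.hub.load('facebookresearch/dino:main', 'dino_vits16')
dino.eval()

image = torch.rand(1, 3, 224, 224)  # stand-in for a resized + normalized input image

with torch.no_grad():
    # Per-patch tokens from the last transformer block; LRM-style models condition
    # their image-to-triplane step on features like these rather than on raw pixels.
    tokens = dino.get_intermediate_layers(image, n=1)[0]

print(tokens.shape)  # roughly (1, 1 + 14*14 patches, 384 feature channels)
```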

What the large reconstruction model (LRM) did is change DINO so that it outputs an encoding known as a "3D triplane". That is, instead of outputting a 3D model in the form of voxels, it outputs three 2D planes. I hadn't heard of this technique before. The idea is that three 2D planes give you a lot of information you can use to reconstruct the 3D object, or at least its visible surface, without storing the massive amount of data that true 3D voxels would require. The three planes are oriented orthogonally to each other, aligned with pairs of the x, y, and z axes and intersecting at the origin (0, 0, 0). Think of them as the xy, xz, and yz planes. The way this works is you take the DINO vision transformer (ViT), which outputs image "features", and combine it with a new "image-to-triplane" encoder that takes the original image plus the "feature" encodings from DINO and produces a triplane.
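
Here is a minimal sketch of the triplane idea, assuming the common setup of three axis-aligned feature planes: to get a feature for any 3D point, project the point onto the xy, xz, and yz planes, bilinearly sample each plane, and combine the three samples. The channel count, resolution, and the choice of summing the samples are assumptions for illustration, not the LRM/TripoSR implementation.

```python
# Sketch of the triplane representation: three axis-aligned 2D feature planes
# stand in for a full 3D feature volume.
import torch
import torch.nn.functional as F

C, R = 32, 128                          # feature channels and plane resolution (assumed)
planes = torch.randn(3, C, R, R)        # stand-in for planes predicted from an image

def triplane_features(points):
    """points: (N, 3) coordinates in [-1, 1]^3 -> (N, C) features."""
    xy = points[:, [0, 1]]
    xz = points[:, [0, 2]]
    yz = points[:, [1, 2]]
    feats = 0
    for plane, coords in zip(planes, (xy, xz, yz)):
        # grid_sample expects a (1, N, 1, 2) sampling grid with coordinates in [-1, 1]
        grid = coords.view(1, -1, 1, 2)
        sampled = F.grid_sample(plane[None], grid, align_corners=True)  # (1, C, N, 1)
        feats = feats + sampled[0, :, :, 0].T                           # (N, C)
    return feats

pts = torch.rand(5, 3) * 2 - 1           # five random points inside the unit cube
print(triplane_features(pts).shape)      # torch.Size([5, 32])
```

The storage savings are the point: three planes cost 3 × C × R² values, versus C × R³ for a full voxel grid of features at the same resolution.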

The generative adversarial network (GAN) comes into the picture because it was trained to turn the 3D triplanes into neural radiance fields.
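
However that component is trained, the operation it ends up performing at runtime is roughly this: a small decoder maps the feature sampled from the triplanes at a 3D point to the density and color that the volume renderer above needs. The layer sizes here are placeholder assumptions, not the actual architecture.

```python
# Sketch of the final step: triplane feature at a 3D point -> (density, rgb) for a NeRF.
import torch

decoder = torch.nn.Sequential(
    torch.nn.Linear(32, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 4),                     # 1 density channel + 3 color channels
)

features = torch.randn(5, 32)                   # stand-in for features sampled from the triplanes
out = decoder(features)
density, rgb = torch.relu(out[:, 0]), torch.sigmoid(out[:, 1:])
print(density.shape, rgb.shape)                 # torch.Size([5]) torch.Size([5, 3])
```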

The great selling point of this system is that the whole process works without treating the image you provide as a "training" image and running a training process on it (with backpropagation, gradient descent, and all the rest). In other words, when you put in your image, everything happens at "inference" time, with the network operating in feedforward-only mode. As such, it can generate 3D models quickly and on a single GPU.
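
In rough pseudo-PyTorch terms (all names and numbers here are placeholders, not TripoSR's API), the difference looks like this:

```python
import torch

model = torch.nn.Linear(8, 8)    # stand-in for an image-to-3D network
image = torch.rand(1, 8)         # stand-in for a preprocessed input image

# Classic per-scene NeRF fitting: run an optimization loop against the input
# view(s), with backpropagation and gradient descent, for every new object.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(3):               # thousands of iterations in practice
    loss = ((model(image) - image) ** 2).mean()   # placeholder reconstruction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# TripoSR-style feedforward reconstruction: a single forward pass at inference
# time, no gradients and no per-object training loop.
model.eval()
with torch.no_grad():
    output = model(image)        # in the real system: image in, triplane/NeRF out
```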

My guess is that a big part of what makes this possible is the vast amount of "world knowledge" about what objects are likely to look like in 3D that comes from the massive "self-supervised" DINO vision transformer (ViT) model.

Introducing TripoSR: Fast 3D object generation from single images

#solidstatelife #computervision
