August 27th, 2024

Splatt3R: Zero-Shot Gaussian Splatting from Uncalibrated Image Pairs

Splatt3R is a feed-forward model for 3D reconstruction and novel view synthesis from uncalibrated stereo image pairs, achieving real-time rendering and strong generalization without requiring camera parameters or depth information.

Splatt3R is a feed-forward model for 3D reconstruction and novel view synthesis from stereo image pairs that requires neither camera parameters nor depth information. It builds on the MASt3R framework, extending it to predict 3D Gaussian Splats by estimating the additional attributes needed to construct a Gaussian primitive for each point. The model is trained in two steps: the geometry of the 3D point cloud is optimized first, followed by a novel view synthesis objective, which helps avoid local minima during training. A loss masking strategy is used to improve performance on extrapolated viewpoints.

Splatt3R processes images with a vision transformer encoder and a transformer decoder that performs cross-attention between the two input images. A third prediction head estimates covariances, spherical harmonics, opacities, and mean offsets, enabling the construction of complete Gaussian primitives. The model is trained on the ScanNet++ dataset and generalizes well to uncalibrated images, reconstructing scenes at 4 FPS at 512 x 512 resolution and rendering the resulting splats in real time. The results show effective scene reconstruction and novel view synthesis, even for images captured in uncontrolled environments.

- Splatt3R enables 3D reconstruction from uncalibrated stereo image pairs.

- It builds on the MASt3R framework, enhancing it to predict Gaussian attributes.

- The model employs a unique loss masking strategy for improved performance.

- It reconstructs scenes at 4 FPS at 512 x 512 resolution, with real-time rendering of the resulting splats.

- Splatt3R shows strong generalization to in-the-wild images.
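
As a rough illustration of what the third prediction head outputs, here is a minimal sketch (not the authors' code; the field names, shapes, and helper below are illustrative assumptions) of assembling per-pixel outputs into Gaussian primitives:

    # Minimal sketch of per-pixel Gaussian parameters a Splatt3R-style head
    # would predict in addition to MASt3R's 3D point and confidence.
    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class GaussianPrimitive:
        mean: np.ndarray        # (3,)  3D point plus predicted mean offset
        covariance: np.ndarray  # (3, 3) built from a rotation and per-axis scales
        sh_coeffs: np.ndarray   # (K, 3) spherical-harmonic color coefficients
        opacity: float          # scalar in [0, 1]

    def build_primitive(point, offset, rotation, scales, sh_coeffs, opacity):
        # Assemble one Gaussian from raw head outputs: Sigma = R diag(s^2) R^T
        covariance = rotation @ np.diag(scales ** 2) @ rotation.T
        return GaussianPrimitive(mean=point + offset,
                                 covariance=covariance,
                                 sh_coeffs=sh_coeffs,
                                 opacity=float(opacity))

Each input pixel thus yields one Gaussian, with the covariance kept positive semi-definite by composing a rotation with per-axis scales, as in the original 3DGS parameterization.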

5 comments
By @refibrillator - 8 months
Novel view synthesis via 3DGS requires knowledge of the camera pose for every input image, i.e. the camera's position and orientation in 3D space.

Historically camera poses have been estimated via 2D image matching techniques like SIFT [1], through software packages like COLMAP.

These algorithms work well when you have many images that methodically cover a scene. However, they often struggle to produce accurate estimates in the few-image regime, or "in the wild" where photos are taken casually with less rigorous scene coverage.

To address this, a major trend in the field is to move away from classical 2D algorithms, instead leveraging methods that incorporate 3D “priors” or knowledge of the world.

To that end, this paper builds heavily on MASt3R [2], which is a vision transformer model that has been trained to reconstruct a 3D scene from 2D image pairs. The authors added another projection head to output the initial parameters for each gaussian primitive. They further optimize the gaussians through some clever use of the original image pair and randomly selected and rendered novel views, which is basically the original 3DGS algorithm but using synthesized target images instead (hence “zero-shot” in the title).
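
A rough sketch of that supervision loop (not the paper's actual training code; `model`, `render`, and the view/camera interface here are hypothetical stand-ins) might look like:

    import torch
    import torch.nn.functional as F

    def training_step(model, render, img_a, img_b, target_views, optimizer):
        # Feed-forward prediction: one network pass, no per-scene optimization.
        gaussians = model(img_a, img_b)
        loss = 0.0
        for view in target_views:                    # other views of the same scene
            rendered = render(gaussians, view.camera)    # differentiable splatting
            loss = loss + F.mse_loss(rendered, view.image)
        optimizer.zero_grad()
        loss.backward()      # gradients flow into the network weights,
        optimizer.step()     # not into per-scene Gaussian parameters
        return float(loss)

The key difference from per-scene 3DGS fitting is that the gradients update the network weights rather than the Gaussians of one particular scene, which is what makes a single feed-forward pass sufficient at test time.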

I do think this general approach will dominate the field in the coming years, but it brings its own unique challenges.

In particular, the quadratic time complexity of transformers is the main computational bottleneck preventing this technique from being scaled up to more than two images at a time, and to resolutions beyond 512 x 512.
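
For a rough sense of scale (assuming a ViT patch size of 16, a common choice rather than a figure from the paper):

    def attention_pairs(resolution, patch=16, n_images=2):
        tokens_per_image = (resolution // patch) ** 2
        total_tokens = n_images * tokens_per_image
        return total_tokens ** 2       # pairwise interactions per attention layer

    print(attention_pairs(512))              # 2048 tokens   -> ~4.2M pairs
    print(attention_pairs(1024))             # 4x the tokens -> 16x the pairs (~67M)
    print(attention_pairs(512, n_images=8))  # 8 images      -> ~67M pairs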

Also, naive image matching itself has quadratic time complexity, which is really painful with large dense latent vectors and can't be accelerated with kd-trees due to the curse of dimensionality. That's why the authors use a hierarchical coarse-to-fine algorithm that approximates the exact computation and achieves linear time complexity with respect to image resolution.
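
For intuition, a generic coarse-to-fine matcher (a sketch of the idea, not MASt3R's actual algorithm; the stride and window sizes are arbitrary) could look like:

    import numpy as np

    def best_match(desc, candidates):
        # Index of the candidate descriptor closest to `desc` (brute force).
        return int(np.argmin(np.linalg.norm(candidates - desc, axis=-1)))

    def coarse_to_fine_match(feats_a, feats_b, stride=8, window=8):
        # feats_*: (H, W, C) dense per-pixel features. Returns [(pt_a, pt_b), ...].
        H, W, C = feats_a.shape
        coarse_b = feats_b[::stride, ::stride].reshape(-1, C)
        coarse_coords = [(y, x) for y in range(0, H, stride)
                                for x in range(0, W, stride)]
        matches = []
        for ya in range(0, H, stride):
            for xa in range(0, W, stride):
                # Coarse step: exhaustive search, but only over the subsampled grid.
                yb, xb = coarse_coords[best_match(feats_a[ya, xa], coarse_b)]
                # Fine step: refine inside a small full-resolution window around the hit.
                y0, y1 = max(0, yb - window), min(H, yb + window)
                x0, x1 = max(0, xb - window), min(W, xb + window)
                k = best_match(feats_a[ya, xa], feats_b[y0:y1, x0:x1].reshape(-1, C))
                dy, dx = divmod(k, x1 - x0)
                matches.append(((ya, xa), (y0 + dy, x0 + dx)))
        return matches

The exhaustive all-pairs search only ever touches the subsampled grids; full-resolution descriptors are compared only inside small windows around each coarse hit.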

[1] https://en.m.wikipedia.org/wiki/Scale-invariant_feature_tran...

[2] https://github.com/naver/mast3r

By @jonhohle - 8 months
The mirror in the example with the washing machine is amazing. Obviously the model doesn’t understand that it’s a mirror so renders it as if it were a window with volume behind the wall. But it does it so realistically that it produces the same effect as a mirror when viewed from different angles. This feels like something out of a sci-fi detective movie.
By @S0y - 8 months
This is really awesome. A question for someone who knows more about this: How much harder would it be to make this work using any number of photos? I'm assuming this is the end goal for a model like this.

Imagine being able to create an accurate enough 3D rendering of any interior with just a bunch of snapshots anyone can take with their phone.

By @teqsun - 8 months
Just to check my understanding, the novel part of this is the fact that it generates it from two pictures from any camera without custom hand-calibration for that particular camera, and everything else involved is existing technology?
By @rkagerer - 8 months
What is a splat?