MVSGaussian: Fast Generalizable Gaussian Splatting Reconstruction from Multi-View Stereo
Abstract
We present MVSGaussian, a new generalizable 3D Gaussian representation approach derived from Multi-View Stereo (MVS) that can efficiently reconstruct unseen scenes.
- We leverage MVS to encode geometry-aware Gaussian representations and decode them into Gaussian parameters.
- To further enhance performance, we propose a hybrid Gaussian rendering that integrates an efficient volume rendering design for novel view synthesis.
- To support fast fine-tuning for specific scenes, we introduce a multi-view geometric consistent aggregation strategy to effectively aggregate the point clouds generated by the generalizable model, serving as the initialization for per-scene optimization.
Compared with previous generalizable NeRF-based methods, which typically require minutes of fine-tuning and seconds of rendering per image, MVSGaussian achieves real-time rendering with better synthesis quality for each scene.
Compared with the vanilla 3DGS, MVSGaussian achieves better view synthesis with less training computational cost. Extensive experiments on DTU, Real Forward-facing, NeRF Synthetic, and Tanks and Temples datasets validate that MVSGaussian attains state-of-the-art performance with convincing generalizability, real-time rendering speed, and fast per-scene optimization.
Figure 1

Comparison with existing methods.
(a) We present generalization results on the Real Forward-facing dataset. Compared with other competitors, our method achieves better performance at a faster inference speed.
(b) The results after per-scene optimization, where circle size represents optimization time. Our method achieves optimal performance in just 45 seconds.
(c) We illustrate a scene (“room”), showcasing the (PSNR/optimization time) of synthesized views, with “-” indicating results from direct inference using the generalizable model.
Figure 2

Overview of MVSGaussian. We first extract features from the input source views using an FPN. These features are aggregated into a cost volume, which is regularized by 3D CNNs to produce a depth map. Then, for each 3D point at the estimated depth, a pooling network aggregates the warped source features into an aggregated feature, which is further enhanced by a 2D UNet. The enhanced feature is decoded into Gaussian parameters for splatting, and also into volume density and radiance for depth-aware volume rendering. Finally, the two rendered images are averaged to produce the final rendered result.
Figure 3

Consistent aggregation. With the depth maps and point clouds produced by the generalizable model, we first conduct geometric consistency checks on the depths to derive masks that filter out unreliable points. The filtered point clouds are then concatenated into a single point cloud, which serves as the initialization for per-scene optimization.
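As a concrete illustration of this step, here is a hedged NumPy sketch of a per-view consistency check followed by concatenation of the surviving points. The function names, thresholds, and the simplified one-pass depth comparison are assumptions for exposition; the paper's exact check (e.g., a full forward-backward reprojection test) may differ.

```python
import numpy as np


def consistency_mask(depth_ref, K_ref, c2w_ref, depth_src, K_src, c2w_src,
                     rel_depth_thresh=0.01):
    """Mark reference pixels whose depth agrees with a source view's depth.

    Simplified check: lift each reference pixel to 3D, project it into the
    source view, and compare the projected depth with the source view's
    predicted depth at that location. The threshold is illustrative.
    """
    h, w = depth_ref.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)

    # Back-project reference pixels to world space (c2w_* are 4x4 camera-to-world).
    cam_ref = (pix * depth_ref.reshape(-1, 1)) @ np.linalg.inv(K_ref).T
    cam_ref_h = np.concatenate([cam_ref, np.ones((cam_ref.shape[0], 1))], axis=-1)
    world = (cam_ref_h @ c2w_ref.T)[:, :3]

    # Project world points into the source view.
    world_h = np.concatenate([world, np.ones((world.shape[0], 1))], axis=-1)
    cam_src = (world_h @ np.linalg.inv(c2w_src).T)[:, :3]
    proj = cam_src @ K_src.T
    z = proj[:, 2]
    u_src = np.round(proj[:, 0] / np.clip(z, 1e-6, None)).astype(int)
    v_src = np.round(proj[:, 1] / np.clip(z, 1e-6, None)).astype(int)

    in_bounds = (z > 0) & (u_src >= 0) & (u_src < w) & (v_src >= 0) & (v_src < h)
    d_src = np.zeros_like(z)
    d_src[in_bounds] = depth_src[v_src[in_bounds], u_src[in_bounds]]

    rel_err = np.abs(z - d_src) / np.clip(d_src, 1e-6, None)
    mask = in_bounds & (rel_err < rel_depth_thresh)
    return world.reshape(h, w, 3), mask.reshape(h, w)


def aggregate_point_clouds(per_view_points, per_view_masks):
    """Concatenate the reliable points from every view into one cloud that
    initializes per-scene optimization."""
    kept = [pts.reshape(-1, 3)[m.reshape(-1)]
            for pts, m in zip(per_view_points, per_view_masks)]
    return np.concatenate(kept, axis=0)
```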
Figure 4

Qualitative comparison of rendering quality under generalization and 3-view settings with state-of-the-art methods.
Figure 5

Qualitative comparison of rendering quality with state-of-the-art methods after per-scene optimization.
Figure 6

Analysis of the optimization process.
(a) The evolution of view quality (PSNR) on the Real Forward-facing dataset during the first 2000 iterations of our method and 3DGS.
(b) Qualitative comparison of our method and 3DGS on the “trex” scene, where (PSNR/iteration number) is shown.
Figure 7

Visualization of camera calibration and point cloud reconstruction by COLMAP.
Figure 8

Depth map visualization. We visualize the depth maps predicted by our method on different datasets.
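These depth maps come from the cost-volume pipeline shown in the overview (Figure 2). The sketch below shows the standard soft-argmin depth regression commonly used in MVS networks, assuming a regularized cost volume and a set of depth hypotheses; the exact regression used by MVSGaussian may differ.

```python
import torch


def soft_argmin_depth(cost_volume: torch.Tensor, depth_hypotheses: torch.Tensor) -> torch.Tensor:
    """Regress a depth map as the probability-weighted average of depth hypotheses.

    cost_volume:      (B, D, H, W) matching costs after 3D CNN regularization
    depth_hypotheses: (D,) depth values sampled within the scene's depth range
    """
    prob = torch.softmax(-cost_volume, dim=1)  # lower cost -> higher probability
    depth = (prob * depth_hypotheses.view(1, -1, 1, 1)).sum(dim=1)  # (B, H, W)
    return depth


# Example usage with 64 hypotheses uniformly sampled between near and far planes.
cost = torch.randn(1, 64, 128, 160)
hyps = torch.linspace(0.5, 10.0, 64)
depth_map = soft_argmin_depth(cost, hyps)  # (1, 128, 160)
```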
Figure 9

Qualitative comparison of rendering quality with state-of-the-art methods under the generalization and 3-view settings.
Figure 10

Point cloud visualization under different aggregation strategies.
Figure 11

Qualitative comparison of rendering quality with state-of-the-art methods after per-scene optimization.
Conclusion and Limitations
We present MVSGaussian, an efficient generalizable Gaussian Splatting approach. Specifically, we leverage MVS to infer depth, establishing a pixel-aligned Gaussian representation. To enhance generalization, we propose a hybrid rendering approach that integrates depth-aware volume rendering. Besides, thanks to high-quality initialization, our models can be fine-tuned quickly for specific scenes.
Compared with generalizable NeRFs, which typically require minutes of fine-tuning and seconds of rendering per image, MVSGaussian achieves real-time rendering with superior synthesis quality.
Compared with 3DGS, MVSGaussian achieves better view synthesis with reduced training time.
As our method relies on MVS for depth estimation, it inherits limitations from MVS, such as decreased depth accuracy in areas with weak textures or specular reflections, resulting in degraded view quality.
