3D Gaussian Splatting for Real-Time Radiance Field Rendering (RELATED WORK)
Traditional Scene Reconstruction and Rendering
The first novel-view synthesis approaches were based on light fields, first densely sampled, then later allowing unstructured capture.
The advent of Structure-from-Motion (SfM) enabled an entirely new domain where a collection of photos could be used to synthesize novel views. SfM estimates a sparse point cloud during camera calibration, which was initially used for simple visualization of the 3D space.
Subsequent multi-view stereo (MVS) work produced impressive full 3D reconstruction algorithms over the years, enabling the development of several view synthesis algorithms.
All these methods re-project and blend the input images into the novel view camera, and use the geometry to guide this re-projection.
These methods produced excellent results in many cases, but typically cannot completely recover from unreconstructed regions, or from “over-reconstruction”, when MVS generates nonexistent geometry.
Recent neural rendering algorithms vastly reduce such artifacts and avoid the overwhelming cost of storing all input images on the GPU, outperforming these methods on most fronts.
Neural Rendering and Radiance Fields
Deep learning techniques were adopted early for novel-view synthesis; CNNs were used to estimate blending weights, or for texture-space solutions. The use of MVS-based geometry is a major drawback of most of these methods; in addition, the use of CNNs for final rendering frequently results in temporal flickering.
Volumetric representations for novel-view synthesis were initiated by Soft3D; deep-learning techniques coupled with volumetric ray-marching were subsequently proposed, building on a continuous differentiable density field to represent geometry. Rendering using volumetric ray-marching has a significant cost due to the large number of samples required to query the volume.
NeRFs introduced importance sampling and positional encoding to improve quality, but used a large Multi-Layer Perceptron that negatively affects speed. The success of NeRF has resulted in an explosion of follow-up methods that address quality and speed, often by introducing regularization strategies; the current SOTA in image quality for novel-view synthesis is Mip-NeRF 360.
While its rendering quality is outstanding, training and rendering times remain extremely high; 3DGS is able to equal, or in some cases surpass, this quality while providing fast training and real-time rendering.
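For reference, the frequency positional encoding that NeRF applies to its input coordinates can be sketched in a few lines. The NumPy snippet below is an illustrative reconstruction, not code from any cited system; the band count `num_bands` and the assumption that inputs are roughly normalized to [-1, 1] are choices made here for the example.

```python
import numpy as np

def positional_encoding(x, num_bands=10):
    """NeRF-style frequency encoding gamma(x) = (sin(2^k * pi * x), cos(2^k * pi * x)).

    x: (..., D) array of coordinates (assumed roughly normalized to [-1, 1]).
    Returns an array of shape (..., D * 2 * num_bands).
    """
    freqs = 2.0 ** np.arange(num_bands) * np.pi   # 2^k * pi, k = 0 .. num_bands-1
    angles = x[..., None] * freqs                  # (..., D, num_bands)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*x.shape[:-1], -1)

# Example: encode a single 3D point.
p = np.array([0.1, -0.4, 0.7])
print(positional_encoding(p).shape)  # (60,) for D=3 and 10 bands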
The most recent methods have focused on faster training and/or rendering, mostly by exploiting three design choices:
① the use of spatial data structures to store (neural) features that are subsequently interpolated during volumetric ray-marching,
② different encodings,
③ MLP capacity.
Such methods include different variants of space discretization, codebooks, and encodings such as hash tables, allowing the use of a smaller MLP or foregoing neural networks completely.
InstantNGP uses a hash grid and an occupancy grid to accelerate computation and a smaller MLP to represent density and appearance.
Plenoxels uses a sparse voxel grid to interpolate a continuous density field and is able to forgo neural networks altogether.
Both rely on Spherical Harmonics: Plenoxels to represent directional effects directly, and InstantNGP to encode the viewing direction as input to its color network. While both provide outstanding results, these methods can still struggle to represent empty space effectively, depending in part on the scene/capture type.
In addition, image quality is limited in large part by the choice of the structured grids used for acceleration, and rendering speed is hindered by the need to query many samples for a given ray-marching step.
The unstructured, explicit GPU-friendly 3D Gaussians achieve faster rendering speed and better quality without neural components.
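Since both methods above (and 3DGS itself) rely on low-order Spherical Harmonics for view-dependent color, a minimal sketch of how SH coefficients are evaluated for a view direction may help. The degree-1 basis constants below are standard, but sign conventions and the mid-gray offset vary between implementations; this is an illustrative assumption, not code from any of the cited systems.

```python
import numpy as np

# Real SH basis constants for degrees 0 and 1 (sign conventions differ across codebases).
SH_C0 = 0.28209479177387814   # 1 / (2 * sqrt(pi))
SH_C1 = 0.4886025119029199    # sqrt(3 / (4 * pi))

def sh_to_rgb(coeffs, view_dir):
    """Evaluate degree-1 SH for one primitive.

    coeffs: (4, 3) array of SH coefficients (DC term + 3 linear terms) per RGB channel.
    view_dir: (3,) vector from the camera toward the primitive.
    Returns an RGB color clamped to [0, 1].
    """
    x, y, z = view_dir / np.linalg.norm(view_dir)
    basis = np.array([SH_C0, -SH_C1 * y, SH_C1 * z, -SH_C1 * x])
    rgb = basis @ coeffs                  # (3,)
    return np.clip(rgb + 0.5, 0.0, 1.0)   # assumed offset so a zero DC term maps to mid-gray

# Example: a primitive whose color varies slightly with viewing direction.
coeffs = np.zeros((4, 3))
coeffs[0] = [0.2, 0.1, -0.1]   # DC color
coeffs[3] = [0.05, 0.0, 0.0]   # small directional term
print(sh_to_rgb(coeffs, np.array([0.0, 0.0, 1.0])))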
Point-Based Rendering and Radiance Fields
Point-based methods efficiently render disconnected and unstructured geometry samples (i.e., point clouds).
In its simplest form, point sample rendering rasterizes an unstructured set of points with a fixed size, for which it may exploit natively supported point types of graphics APIs or parallel software rasterization on the GPU. While true to the underlying data, point sample rendering suffers from holes, causes aliasing, and is strictly discontinuous.
Seminal work on high-quality point-based rendering addresses these issues by “splatting” point primitives with an extent larger than a pixel, e.g., circular or elliptic discs, ellipsoids, or surfels.
There has been recent interest in differentiable point-based rendering techniques. Points have been augmented with neural features and rendered using a CNN, resulting in fast or even real-time view synthesis; however, these methods still depend on MVS for the initial geometry and, as such, inherit its artifacts, most notably over- or under-reconstruction in hard cases such as featureless/shiny areas or thin structures.
Point-based $\alpha$-blending and NeRF-style volumetric rendering share essentially the same image formation model. Specifically, the color $C$ is given by volumetric rendering along a ray:

$$C = \sum_{i=1}^{N} T_i \left(1 - \exp(-\sigma_i \delta_i)\right) c_i, \quad \text{with} \quad T_i = \exp\!\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right), \qquad (1)$$

where samples of density $\sigma_i$, transmittance $T_i$, and color $c_i$ are taken along the ray with intervals $\delta_i$. This can be re-written as

$$C = \sum_{i=1}^{N} T_i \alpha_i c_i, \quad \text{with} \quad \alpha_i = 1 - \exp(-\sigma_i \delta_i) \quad \text{and} \quad T_i = \prod_{j=1}^{i-1} (1 - \alpha_j). \qquad (2)$$

A typical neural point-based approach computes the color $C$ of a pixel by blending $\mathcal{N}$ ordered points overlapping the pixel:

$$C = \sum_{i \in \mathcal{N}} c_i \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j), \qquad (3)$$

where $c_i$ is the color of each point and $\alpha_i$ is given by evaluating a 2D Gaussian with covariance $\Sigma$ multiplied with a learned per-point opacity.

From Eq. (2) and Eq. (3), we can clearly see that the image formation model is the same. However, the rendering algorithm is very different. NeRFs are a continuous representation implicitly representing empty/occupied space; expensive random sampling is required to find the samples in Eq. (2), with consequent noise and computational expense.
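To make the equivalence of Eqs. (1)–(3) concrete, the sketch below evaluates the volumetric form and the $\alpha$-blended form on hypothetical densities, intervals, and colors (all values are made up purely for illustration) and checks that they yield the same pixel color.

```python
import numpy as np

# Hypothetical per-sample densities, interval lengths, and colors along one ray.
sigma = np.array([0.8, 2.0, 0.5])            # densities sigma_i
delta = np.array([0.1, 0.1, 0.1])            # interval lengths delta_i
color = np.array([[1.0, 0.0, 0.0],           # colors c_i
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])

# Eq. (1): C = sum_i T_i (1 - exp(-sigma_i delta_i)) c_i, T_i = exp(-sum_{j<i} sigma_j delta_j)
T = np.exp(-np.concatenate([[0.0], np.cumsum(sigma * delta)[:-1]]))
C_volumetric = ((T * (1.0 - np.exp(-sigma * delta)))[:, None] * color).sum(axis=0)

# Eqs. (2)-(3): alpha_i = 1 - exp(-sigma_i delta_i), T_i = prod_{j<i} (1 - alpha_j)
alpha = 1.0 - np.exp(-sigma * delta)
T_alpha = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
C_blended = ((T_alpha * alpha)[:, None] * color).sum(axis=0)

print(np.allclose(C_volumetric, C_blended))  # True: same image formation model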
In contrast, points are an unstructured, discrete representation that is flexible enough to allow creation, destruction, and displacement of geometry similar to NeRF. This is achieved by optimizing opacity and positions, as shown by previous work, while avoiding the shortcomings of a full volumetric representation.
Pulsar achieves fast sphere rasterization, which inspired the 3DGS tile-based and sorting renderer. However, given the analysis above, we want to maintain (approximate) conventional $\alpha$-blending on sorted splats to retain the advantages of volumetric representations: our rasterization respects visibility order, in contrast to their order-independent method.
Moreover, we back-propagate gradients on all splats in a pixel and rasterize anisotropic splats. These elements all contribute to the high visual quality of our results.
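A minimal sketch of the per-pixel compositing this paragraph refers to: splats overlapping a pixel are sorted by depth and blended front to back, with each $\alpha$ obtained from an anisotropic 2D Gaussian times a learned opacity. The data layout and numeric values below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def splat_alpha(pixel, mean, cov2d, opacity):
    """alpha = opacity * exp(-0.5 * d^T Sigma^{-1} d) for an anisotropic 2D Gaussian."""
    d = pixel - mean
    return float(opacity * np.exp(-0.5 * d @ np.linalg.inv(cov2d) @ d))

def composite_pixel(pixel, splats):
    """Front-to-back alpha blending over splats sorted by view-space depth."""
    color, transmittance = np.zeros(3), 1.0
    for s in sorted(splats, key=lambda s: s["depth"]):   # respect visibility order
        a = splat_alpha(pixel, s["mean"], s["cov2d"], s["opacity"])
        color += transmittance * a * s["rgb"]
        transmittance *= (1.0 - a)
        if transmittance < 1e-4:                          # early termination
            break
    return color

# Illustrative splats overlapping one pixel (arbitrary values).
splats = [
    {"depth": 2.0, "mean": np.array([5.0, 5.0]), "cov2d": np.array([[4.0, 1.0], [1.0, 2.0]]),
     "opacity": 0.7, "rgb": np.array([0.9, 0.2, 0.1])},
    {"depth": 1.0, "mean": np.array([4.5, 5.5]), "cov2d": np.array([[1.5, 0.0], [0.0, 3.0]]),
     "opacity": 0.5, "rgb": np.array([0.1, 0.3, 0.8])},
]
print(composite_pixel(np.array([5.0, 5.0]), splats))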
In addition, previous methods mentioned above also use CNNs for rendering, which results in temporal instability.
Nonetheless, the rendering speed of Pulsar and ADOP served as motivation to develop our fast rendering solution.
While focusing on specular effects, the diffuse point-based rendering track of Neural Point Catacaustics overcomes this temporal instability by using an MLP, but still requires MVS geometry as input. The most recent method in this category does not require MVS and also uses SH for directions; however, it can only handle scenes of a single object and needs masks for initialization. While fast for small resolutions and low point counts, it is unclear how it can scale to scenes of typical datasets.
3DGS uses 3D Gaussians for a more flexible scene representation, avoiding the need for MVS geometry and achieving real-time rendering thanks to a tile-based rendering algorithm for the projected Gaussians.
A recent approach uses points to represent a radiance field with radial basis functions. It employs point pruning and densification techniques during optimization, but uses volumetric ray-marching and cannot achieve real-time display rates.
In the domain of human performance capture, 3D Gaussians have been used to represent captured human bodies; more recently, they have been used with volumetric ray-marching for vision tasks. Neural volumetric primitives have been proposed in a similar context.
While these methods inspired the choice of 3D Gaussians as scene representation, they focus on the specific case of reconstructing and rendering a single isolated object (a human body or face), resulting in scenes with small depth complexity.
In contrast, 3DGS's optimization of anisotropic covariance, interleaved optimization/density control, and efficient depth sorting for rendering allow it to handle complete, complex scenes, including background, both indoors and outdoors, and with large depth complexity.

