SA-GS: Semantic-Aware Gaussian Splatting for Large Scene Reconstruction with Geometry Constraint
Abstract
With the emergence of Gaussian Splatting, recent efforts have focused on large-scale scene geometric reconstruction. However, most of these efforts either concentrate on memory reduction or spatial partitioning, neglecting information in the semantic space.
In this paper, we propose a novel method, named SA-GS, for fine-grained 3D geometry reconstruction using semantic-aware 3D Gaussian Splats.
Specifically, we leverage prior information stored in large vision models (such as SAM and DINO) to generate semantic masks.
We then introduce a geometric complexity measurement function to serve as soft regularization, guiding the shape of each Gaussian Splat within specific semantic areas.
Additionally, we present a method that estimates the expected number of Gaussian Splats in different semantic areas, effectively providing a lower bound for Gaussian Splats in these areas.
Subsequently, we extract the point cloud using a novel probability density-based extraction method, transforming Gaussian Splats into a point cloud crucial for downstream tasks.
Our method also offers the potential for detailed semantic queries while maintaining high-quality image-based reconstruction results.
We provide extensive experiments on publicly available large-scale scene reconstruction datasets with highly accurate point clouds as ground truth, as well as on our novel dataset. Our results demonstrate that our method outperforms current state-of-the-art Gaussian Splatting reconstruction methods by a significant margin in terms of geometry-based metrics.
Figure 1

Qualitative comparison between our method and other 3DGS-based methods.
We propose shape constraints, alpha constraints, and point cloud extraction in the current study.
Figure 2

The blue section illustrates common methods for reconstructing geometrically aligned Gaussian Splats.
The input for all Gaussian Splatting methods includes a COLMAP initialization consisting of images, camera positions, and SfM sparse point clouds.
The output will be a traditional representation such as a mesh or point cloud.
During training, in addition to the common image rendering loss, most methods encourage all 3D Gaussians to form a disk-like shape.
After several training iterations, or at the end of the training process, other methods select a hard threshold for the alpha value and use the remaining Gaussians for geometric reconstruction.
However, these hard constraints often result in poorer reconstruction.
Instead of encouraging all Gaussians to adopt the same shape, our method uses semantic information to control the shape in detail.
① produce semantic masks for each input image,
② extract shape information for each semantic group, and use this information to locally control the shape of each Gaussian.
③ provide an opacity field sampling method that can dynamically allocate the desired number of points and ignore defective reconstruction parts.
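The local shape control in ② can be pictured as a soft, one-sided penalty on each Gaussian's scale anisotropy. The sketch below is an assumed form of such a loss, not the paper's exact implementation; `target_flatness` is a hypothetical per-Gaussian value looked up from its semantic group.

```python
import numpy as np

def shape_regularization(scales, target_flatness):
    """Soft shape regularization (illustrative sketch, not the paper's code).

    scales          : (N, 3) array of per-Gaussian scale parameters.
    target_flatness : (N,) desired ratio of smallest to middle scale,
                      looked up from each Gaussian's semantic group
                      (0 -> perfectly flat disk, 1 -> unconstrained).
    Returns a scalar penalty that is zero when each Gaussian is at
    least as flat as its semantic group requires.
    """
    s = np.sort(scales, axis=1)            # s[:, 0] smallest, s[:, 2] largest
    flatness = s[:, 0] / (s[:, 1] + 1e-8)  # 0 = disk-like, 1 = isotropic
    # One-sided penalty: only punish Gaussians that are *less* flat than
    # their semantic target, so this acts as a soft, not hard, constraint.
    excess = np.maximum(flatness - target_flatness, 0.0)
    return float(np.mean(excess ** 2))
```

Flat semantic groups (e.g. roads) would receive a small `target_flatness`, while geometrically complex groups (e.g. vegetation) would receive a value near 1, effectively leaving those Gaussians unconstrained.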
Figure 3

The results of reconstructing the Campus and College scenes from GauUsceneV2.
Using SuGaR , many surfaces incorrectly model the lighting conditions due to complex effects, such as how glass reflects sunlight at different angles and how clouds block sunlight.
These imaginary surfaces, which do not represent the true surface, are regarded as fantasy surfaces.
SA-GS largely alleviates this problem.
Another major source of geometric error occurs at the edges of unbounded scenes.
However, this issue is common to all methods due to the sparsity of images at the edges and is not the focus of our current work.
Figure 4

Inconsistency problem.
The semantic segmentation results are sometimes inconsistent with previous judgments.
(a-b) Two tunnels are regarded as ground using GroundingSAM.
(c-d) However, in the images captured from a camera position immediately adjacent to them, the left tunnel is not regarded as ground.
This inconsistency between consecutive images is the primary cause of failure in naive reconstruction methods.
Figure 5

Our method pipeline consists of three main stages.
Firstly, we utilize the same input as 3DGS, but enhance it with semantic information extracted via Grounding SAM.
Next, we assess the geometric complexity of each semantic group by calculating high-frequency power.
Our geometric constraint is implemented through a soft regularization, facilitated by a semantic loss function. This guides the Gaussian shapes to match the expected shapes determined earlier.
The rendering loss further refines the shape and attributes of the 3DGS, while the shape constraint, indicated by a negative sign, ensures alignment between rendered and real images.
Controlling the shapes of different 3DGS is achieved by mapping their projected pixels onto the semantic map obtained earlier.
Additionally, by reducing the number of low-opacity Gaussian splats to the expected count, we minimize GPU memory consumption during training.
Finally, we offer a user-friendly point cloud extraction method via hierarchical probability density sampling.
① we create a multinomial distribution using the opacity values stored in each 3DGS.
② based on user inputs and the multinomial distribution, we determine the number of points to sample from each Gaussian distribution.
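Steps ① and ② can be sketched as a two-stage sampler. This is a minimal illustration under assumed data layouts (arrays of means, covariances, and opacities); `alpha_min` is a hypothetical parameter standing in for the dynamic alpha threshold that discards low-confidence Gaussians.

```python
import numpy as np

def sample_point_cloud(means, covs, opacities, n_points, alpha_min=0.1, seed=0):
    """Hierarchical probability-density sampling (illustrative sketch).

    means     : (N, 3) Gaussian centers.
    covs      : (N, 3, 3) Gaussian covariance matrices.
    opacities : (N,) alpha values in [0, 1].
    n_points  : total number of points requested by the user.
    alpha_min : Gaussians below this opacity are ignored, dropping
                low-confidence surfaces (hypothetical parameter name).
    """
    rng = np.random.default_rng(seed)
    keep = opacities >= alpha_min
    w = opacities * keep
    probs = w / w.sum()                      # multinomial over opacities (step 1)
    # Stage 1: allocate the user's point budget across Gaussians.
    counts = rng.multinomial(n_points, probs)
    # Stage 2: draw that many samples from each 3D Gaussian (step 2).
    pts = [rng.multivariate_normal(means[i], covs[i], size=c)
           for i, c in enumerate(counts) if c > 0]
    return np.concatenate(pts, axis=0)
```

Because the budget is allocated by a single multinomial draw, the user can request any total point count and high-opacity Gaussians automatically receive proportionally more samples.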
Figure 6

The comparison between ours and 3DGS. Ours largely sharpens the edges of the image.
(a) The tower merges together and is sharpened in ours.
(b) The noise around the tall building is eliminated in ours.
(c) shows that our steady alpha-decreasing strategy is successful.
Conclusion
We propose a semantic-aware geometric constraint algorithm that dynamically assigns expected shapes to Gaussian splats projected into different semantic groups.
We present an algorithm capable of computing the geometric complexity of Gaussian splats based on spectrum analysis.
Furthermore, we utilize geometric complexity measurement to determine the number of Gaussian splats.
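One plausible realization of the spectrum-based complexity measure is the fraction of a semantic region's image power above a frequency cutoff. The sketch below assumes grayscale input and a hypothetical `cutoff` value; the paper's exact formulation may differ.

```python
import numpy as np

def geometric_complexity(image, mask, cutoff=0.25):
    """High-frequency power of one semantic region (illustrative sketch).

    image  : (H, W) grayscale image as a float array.
    mask   : (H, W) boolean mask of the semantic group.
    cutoff : normalized frequency radius separating "low" from "high";
             the 0.25 default is an assumption, not from the paper.
    Returns the fraction of spectral power above the cutoff, so flat
    regions (e.g. ground) score lower than intricate ones (e.g. trees).
    """
    region = np.where(mask, image, 0.0)
    region = region - region[mask].mean() * mask  # remove DC inside the mask
    spec = np.abs(np.fft.fftshift(np.fft.fft2(region))) ** 2
    h, w = spec.shape
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot((yy - h / 2) / h, (xx - w / 2) / w)  # normalized radius
    total = spec.sum() + 1e-12
    return float(spec[r > cutoff].sum() / total)
```

Such a score could then set both the expected flatness of a semantic group and the lower bound on its Gaussian count, as described above.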
Subsequently, we introduce a hierarchical probability density sampling method that can extract as many points as desired by users while maintaining a dynamic alpha value to mitigate the fantasy surface problem.
Limitations
① during training, we constrain the shape of all Gaussians that project onto the same pixel without explicitly ignoring Gaussians blocked by those with high opacity values in front of them. This may result in all Gaussians conforming to the shape of the semantic group that occupies the largest region in the scene.
② our algorithm relies on key semantics provided by users, which may sometimes be absent.
③ while the inconsistency between consecutive images can be addressed by our robust loss, the direct resolution of inconsistency in the 3D world itself has not been achieved.
