FlashGS: Efficient 3D Gaussian Splatting for Large-scale and High-resolution Rendering
Abstract
This work introduces FlashGS, an open-source CUDA Python library designed to facilitate the efficient differentiable rasterization of 3D Gaussian Splatting through algorithmic and kernel-level optimizations. FlashGS is built on observations from a comprehensive analysis of the rendering process, with the goal of improving computational efficiency and bringing the technique to wide adoption.
The paper includes a suite of optimization strategies, encompassing redundancy elimination, efficient pipelining, refined control and scheduling mechanisms, and memory access optimizations, all of which are integrated to improve the performance of the rasterization process.
An extensive evaluation of FlashGS's performance has been conducted across a diverse spectrum of synthetic and real-world large-scale scenes, encompassing a variety of image resolutions. The empirical findings demonstrate that FlashGS consistently achieves an average 4x acceleration over mobile consumer GPUs, coupled with reduced memory consumption. These results underscore the superior performance and resource optimization capabilities of FlashGS, positioning it as a formidable tool in the domain of 3D rendering.
Figure
Figure 1

Two representative rendering output images with 3DGS and our FlashGS.
Figure 2

3DGS Overview
Figure 3

Runtime breakdown of 3DGS rasterization on the MatrixCity dataset.
Figure 4

We evaluate the key-value pair binning step while rendering 6 frames of a scene trained on the MatrixCity [16] dataset. The number of assigned k-v pairs is far larger than the number of tiles actually covered by the AABB or the projected ellipse.
Figure 5

Geometry Redundancies. There are 3 kinds of redundancy in the original 3DGS intersection algorithm: I. The definition of the ellipse ignores the opacity. II. The AABB is overestimated. III. Tiles outside the ellipse are binned with the Gaussian.
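Redundancy I can be made concrete with a small sketch. It assumes the blending rule alpha = opacity * exp(-0.5 * d^2), where d is the Mahalanobis distance from the Gaussian center, and the 1/255 contribution cutoff used by the reference 3DGS rasterizer. The function names are hypothetical and this is not necessarily the exact formula FlashGS implements.

```python
import math

# Assumption: pixels with alpha below 1/255 are skipped, as in the
# reference 3DGS rasterizer.
ALPHA_MIN = 1.0 / 255.0


def three_sigma_radius(lambda_max):
    """Opacity-agnostic extent: 3 standard deviations along the major axis.

    lambda_max is the largest eigenvalue of the projected 2D covariance.
    """
    return 3.0 * math.sqrt(lambda_max)


def opacity_aware_radius(lambda_max, opacity):
    """Extent beyond which alpha = opacity * exp(-0.5 * d^2) drops below ALPHA_MIN.

    Solving opacity * exp(-0.5 * d^2) = ALPHA_MIN for the squared
    Mahalanobis distance gives d^2 = 2 * ln(opacity / ALPHA_MIN).
    """
    if opacity <= ALPHA_MIN:
        return 0.0  # the Gaussian can never pass the cutoff
    d2 = 2.0 * math.log(opacity / ALPHA_MIN)
    return math.sqrt(d2 * lambda_max)
```

Under these assumptions a Gaussian with opacity 0.1 has an effective extent of roughly 2.5 standard deviations instead of 3, so fewer tiles need to be binned, while a fully opaque Gaussian reaches about 3.33 standard deviations.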
Figure 6

Opacity distribution in the smallcity scene of the MatrixCity dataset. The horizontal axis below the bar shows percentages ranging from 0% to 100%. These percentages correspond to the opacity values from 0 to 1 shown above the bar.
Figure 7

FlashGS Overview.
Figure 8

Intersection tiles with the ellipse.
(a) 3DGS uses AABB and gets 16 tiles.
(b) GScore applies OBB and gets 8 tiles.
(c) Precise intersection shows only 4 tiles.
(Purple represents the real intersection tiles, green shows the tiles each method treats as intersected, and white means not intersected.)
Figure 9

Geometric simplification for precise per-tile intersection.
A tile is considered intersected if the segment of the ellipse intersecting the line of the tile’s edge coincides with the edge.
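The per-tile test described in this caption can be sketched as follows, reading "coincides" as "overlaps". The sketch assumes the ellipse is given by its center, the conic coefficients (a, b, c) of the inverse 2D covariance, and a squared-distance threshold t, so that a point is inside when a*dx^2 + 2*b*dx*dy + c*dy^2 <= t. All function names are hypothetical, and the code is an illustration of the geometric idea rather than the FlashGS kernel.

```python
import math


def _chord_overlaps(p, q, r, d, lo, hi, t):
    """Fix one coordinate at offset d and solve q*s^2 + 2*r*d*s + p*d^2 <= t for s.

    Returns True if the resulting chord [s_min, s_max] overlaps [lo, hi].
    """
    disc = (r * d) ** 2 - q * (p * d * d - t)
    if disc < 0.0:
        return False  # the edge's line misses the ellipse entirely
    root = math.sqrt(disc)
    s_min = (-r * d - root) / q
    s_max = (-r * d + root) / q
    return s_min <= hi and s_max >= lo


def ellipse_hits_tile(center, conic, t, tile):
    """Exact test between a filled ellipse and an axis-aligned tile (x0, y0, x1, y1)."""
    cx, cy = center
    a, b, c = conic
    x0, y0, x1, y1 = tile
    if x0 <= cx <= x1 and y0 <= cy <= y1:
        return True  # center inside the tile
    for xe in (x0, x1):  # vertical edges: x fixed, solve for dy
        if _chord_overlaps(a, c, b, xe - cx, y0 - cy, y1 - cy, t):
            return True
    for ye in (y0, y1):  # horizontal edges: y fixed, solve for dx
        if _chord_overlaps(c, a, b, ye - cy, x0 - cx, x1 - cx, t):
            return True
    return False


def tile_counts(center, conic, t, tile=16):
    """Return (tiles in the tight AABB, tiles that truly intersect the ellipse)."""
    cx, cy = center
    a, b, c = conic
    det = a * c - b * b
    hx = math.sqrt(t * c / det)  # half-extent of the ellipse along x
    hy = math.sqrt(t * a / det)  # half-extent of the ellipse along y
    tx0, tx1 = math.floor((cx - hx) / tile), math.floor((cx + hx) / tile)
    ty0, ty1 = math.floor((cy - hy) / tile), math.floor((cy + hy) / tile)
    aabb = (tx1 - tx0 + 1) * (ty1 - ty0 + 1)
    exact = sum(
        ellipse_hits_tile(center, conic, t,
                          (tx * tile, ty * tile, (tx + 1) * tile, (ty + 1) * tile))
        for ty in range(ty0, ty1 + 1)
        for tx in range(tx0, tx1 + 1)
    )
    return aabb, exact
```

Because the ellipse is convex, a tile that does not contain the center can only intersect it if one of the tile's edges does, which is why checking the center and the four edges is sufficient. For a thin, diagonally oriented ellipse the exact count is noticeably smaller than the AABB count, which matches the redundancy the figures illustrate.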
Figure 10

Schematic for the workflow of original 3DGS and the improved FlashGS. We balance the computation and memory access across various stages and reduce redundant operations (M_c and M_b are indicative of the amount of computation and memory access, respectively).
Figure 11

Adaptive task partitioning for Gaussian intersections with varying sizes.
If a large ellipse requires processing multiple tiles, other threads within the warp are utilized to collaborate on the intersection.
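The partitioning idea can be mimicked on the CPU with a small sketch. It assumes a warp of 32 threads and a hypothetical threshold below which a Gaussian is handled by a single thread; the strided assignment is one plausible scheme for sharing the candidate tiles of a large ellipse and is not taken from the FlashGS source.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def partition_tiles(num_tiles, threshold=1, warp_size=WARP_SIZE):
    """Return, for each lane of a warp, the candidate tile indices it tests.

    threshold is a hypothetical cutoff: Gaussians with at most this many
    candidate tiles stay on their owning lane (lane 0 here); larger ones
    are spread over the whole warp in a strided fashion.
    """
    lanes = [[] for _ in range(warp_size)]
    if num_tiles <= threshold:
        lanes[0] = list(range(num_tiles))
    else:
        for tile in range(num_tiles):
            lanes[tile % warp_size].append(tile)
    return lanes
```

With this scheme an ellipse covering 70 candidate tiles takes at most 3 rounds of intersection tests per lane instead of 70 sequential tests on one thread.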
Figure 12

Software pipelining to achieve better overlap between computation and memory access.
Figure 13

Rasterization runtime breakdown on 6 representative frames from different datasets, normalized to 3DGS.
Figure 14

Profiling results of FlashGS: The number of instructions issued in rendering and the memory transactions in preprocessing. All results are normalized to 3DGS.
Figure 15

Number of rendered Gaussian-tile (kv) pairs and instructions issued per pair for FlashGS (normalized to 3DGS).
Figure 16

The rendering quality and rasterization time on the MatrixCity (1080p)-800 frame.
Conclusions
We propose FlashGS, enabling real-time rendering of large-scale and high-resolution scenes.
In this paper, we achieve a fast rendering pipeline through a refined algorithm design and several highly optimized implementations, addressing the redundancy and the improper compute-to-memory ratio present in the original 3DGS.
FlashGS significantly surpasses the rendering performance of existing methods on GPUs and achieves efficient memory management while maintaining high image quality.
