[Paper Reading] AutoSplat: Constrained Gaussian Splatting for Autonomous Driving Scene Reconstruction
Table of Contents

- 1. What
- 2. Why
- 3. How
  - 3.1 Input
  - 3.2 Background Reconstruction
  - 3.3 Foreground Reconstruction
    - 3.3.1 Constructing Template Gaussians
    - 3.3.2 Reflected Gaussian Consistency
    - 3.3.3 Dynamic Appearance Modeling
- 4. Experiment
  - 4.1 Experimental Setup
  - 4.2 Main Results
  - 4.3 Ablation Studies
1. What
For scene reconstruction and novel view synthesis, this paper imposes geometric constraints on the Gaussians representing the road and sky regions, leverages 3D templates to initialize foreground points, and introduces a reflected Gaussian consistency constraint to supervise the unseen sides of foreground objects. Moreover, it uses residual spherical harmonics to model the dynamic appearance of foreground objects. It achieves state-of-the-art results on the PandaSet and KITTI datasets, both for reconstruction and for novel view synthesis with lateral ego-vehicle trajectory adjustments.
2. Why
Limitation of PVG: that method does not tackle the simulation of novel scenarios, such as ego-vehicle lane changes or adjusted object trajectories (i.e., it cannot edit scenes).
3. How

3.1 Input
A series of N images I_i taken by a camera, with corresponding intrinsic (K_i) and extrinsic (E_i) matrices, along with 3D LiDAR point clouds L_i and the trajectories T_i of the dynamic objects.
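To make this input concrete, here is a minimal sketch of a per-frame container (all names, such as `Frame`, are illustrative and not from the paper's code):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Frame:
    """One time step of AutoSplat's input (illustrative container, not the authors' API)."""
    image: np.ndarray   # I_i, (H, W, 3) RGB image
    K: np.ndarray       # K_i, (3, 3) camera intrinsic matrix
    E: np.ndarray       # E_i, (4, 4) world-to-camera extrinsic matrix
    lidar: np.ndarray   # L_i, (N, 3) LiDAR points in world coordinates
    trajectories: dict  # T_i, object id -> (4, 4) SE(3) pose of each dynamic object
```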
3.2 Background Reconstruction
The road and sky regions are decomposed from the rest of the background using semantic masks.
By projecting the LiDAR points onto the image plane at each time step i, each Gaussian is assigned to one of the road, sky, or other classes (see the sketch after this paragraph).
When splatting the road and sky Gaussians, they are constrained to be flat by minimizing their roll and pitch angles as well as their vertical scale.
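A minimal sketch of this class assignment, assuming world-frame LiDAR points, a world-to-camera extrinsic, a pinhole intrinsic, and an integer-labeled semantic mask (names and label ids are illustrative):

```python
import numpy as np

ROAD, SKY, OTHER = 0, 1, 2  # assumed label ids in the semantic mask

def assign_classes(points, K, E, sem_mask):
    """Project world-frame LiDAR points into the image and read their semantic class."""
    N = points.shape[0]
    pts_h = np.concatenate([points, np.ones((N, 1))], axis=1)  # (N, 4) homogeneous
    cam = (E @ pts_h.T).T[:, :3]                               # world -> camera frame
    in_front = cam[:, 2] > 0                                   # keep points in front of camera
    uv = (K @ cam.T).T                                         # pinhole projection
    uv = uv[:, :2] / uv[:, 2:3]
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    H, W = sem_mask.shape
    valid = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    labels = np.full(N, OTHER)                 # points without a valid projection default to "other"
    labels[valid] = sem_mask[v[valid], u[valid]]
    return labels
```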
Finally, the loss is defined as:
\mathcal{L}_{BG}=(1-\lambda)\mathcal{L}_{1}(I_{g},\hat{I}_{g})+\lambda\mathcal{L}_{DSSIM}(I_{g},\hat{I}_{g})+\beta\,\mathcal{C}_{g},\quad g\in\{road,sky,other\}\\ \mathcal{C}_{g}=\begin{cases}\frac{1}{N_g}\sum_{i=1}^{N_g}\left(|\phi_i|+|\theta_i|+|s_{z_i}|\right)&\text{if } g\in\{road,sky\}\\ 0&\text{otherwise}\end{cases}
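A sketch of the flatness term \mathcal{C}_g, assuming Gaussian rotations are stored as unit quaternions (w, x, y, z), with φ and θ the roll and pitch obtained from the standard quaternion-to-Euler conversion, and scales stored per axis:

```python
import torch

def flatness_constraint(quats, scales):
    """C_g: mean of |roll| + |pitch| + |s_z| over the road/sky Gaussians.

    quats:  (N, 4) unit quaternions (w, x, y, z)
    scales: (N, 3) per-axis scales (s_x, s_y, s_z)
    """
    w, x, y, z = quats.unbind(dim=-1)
    roll = torch.atan2(2 * (w * x + y * z), 1 - 2 * (x * x + y * y))      # phi
    pitch = torch.asin(torch.clamp(2 * (w * y - x * z), -1.0, 1.0))      # theta
    return (roll.abs() + pitch.abs() + scales[:, 2].abs()).mean()
```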
In the second phase of background reconstruction, all Gaussians are splatted together and supervised on the whole image using \mathcal{L}_{BG} with g\in\{road\cup sky\cup other\}.
3.3 Foreground Reconstruction
3.3.1 Constructing Template Gaussians
Notably, the authors employ [23], which generates the 3D shape of an object such as a vehicle from a single image. Then, given a sequence of frames with K foreground objects, the template is copied K times and placed into the scene according to the object trajectories.
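A sketch of the placement step, assuming each trajectory provides an SE(3) pose per frame and treating the shape-prior model [23] as a black box that outputs template Gaussian centers:

```python
import numpy as np

def place_template(template_xyz, pose):
    """Transform template Gaussian centers (M, 3) into the scene by an SE(3) pose (4, 4)."""
    M = template_xyz.shape[0]
    pts_h = np.concatenate([template_xyz, np.ones((M, 1))], axis=1)
    return (pose @ pts_h.T).T[:, :3]

# One copy per object, posed by its trajectory at frame t (T[k][t] is a hypothetical lookup):
# objects_xyz = [place_template(template_xyz, T[k][t]) for k in range(K)]
```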
3.3.2 Reflected Gaussian Consistency
Foreground objects exhibit symmetry in their structure. Leveraging this assumption helps to improve the reconstruction quality, especially in scenarios with limited views [39].
The reflection matrix M for the Gaussians can be defined as:
M=I-2\frac{aa^T}{\|a\|^2}
where a represents the reflection axis (the normal of the symmetry plane) and I is the identity matrix. The attributes of a Gaussian are then reflected by:
\begin{aligned}\tilde{x}&=Mx\\ \tilde{R}&=MR\\ \tilde{f}_{SH}&=D_{M}f_{SH}\end{aligned}
where D_M is the Wigner D-matrix, which applies the corresponding transform to the spherical harmonics coefficients.
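A sketch of the reflection in code, following the three equations above; the SH transform by the Wigner D-matrix is omitted for brevity, and the axis a is assumed to be the lateral axis of a left-right symmetric vehicle:

```python
import torch

def reflect_gaussians(xyz, rotmats, a):
    """Reflect Gaussian centers and orientations across the plane with normal a.

    xyz:     (N, 3) centers in the object's canonical frame
    rotmats: (N, 3, 3) rotation matrices
    a:       (3,) reflection axis (plane normal)
    """
    a = a / a.norm()                             # normalize so aa^T / ||a||^2 = outer(a, a)
    M = torch.eye(3) - 2.0 * torch.outer(a, a)   # Householder reflection matrix
    xyz_ref = xyz @ M.T                          # x~ = M x
    rot_ref = M @ rotmats                        # R~ = M R, as in the paper
    return xyz_ref, rot_ref
```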
This reflected consistency constraint enforces that the renderings of the Gaussians on the two symmetric sides of an object are similar, as illustrated in the training-phase pipeline figure of the paper.

3.3.3 Dynamic Appearance Modeling
Similar to StreetGaussian, to simulate light signals such as indicators, headlights, and taillights, this paper models the dynamic appearance of foreground objects by:
\begin{aligned}\Delta f_{SH,t}&=\mathrm{MLP}(E_{t},x,f_{SH})\\ f_{SH,t}&=f_{SH}+\Delta f_{SH,t}\end{aligned}
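A minimal sketch of this appearance model, assuming E_t is a learned per-frame embedding and f_SH is a flattened vector of SH coefficients (all dimensions are illustrative):

```python
import torch
import torch.nn as nn

class AppearanceMLP(nn.Module):
    """Predicts a residual Delta f_SH,t from the frame embedding, position, and base SH."""
    def __init__(self, embed_dim=32, sh_dim=48, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim + 3 + sh_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, sh_dim),
        )

    def forward(self, e_t, x, f_sh):
        delta = self.net(torch.cat([e_t, x, f_sh], dim=-1))  # Delta f_SH,t
        return f_sh + delta, delta                           # f_SH,t and the residual
```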
Finally, the overall loss for foreground objects is:
\begin{aligned}\mathcal{L}_{FG}=&(1-\lambda)\mathcal{L}_{1}(I_{g},\hat{I}_{g})+\lambda\mathcal{L}_{DSSIM}(I_{g},\hat{I}_{g})+(1-\lambda)\mathcal{L}_{1}(I_{g},\tilde{I}_{g})\\ &+\lambda\mathcal{L}_{DSSIM}(I_{g},\tilde{I}_{g})+\gamma\mathcal{L}_{1}(\Delta f_{SH,t}),\quad g\in\{fg_{1},fg_{2},\ldots,fg_{K}\}\end{aligned}
where K is the number of foreground objects and fg_k is the set of Gaussians representing the k-th object. I_g, \hat{I}_g, and \tilde{I}_g denote the masked ground-truth image, the rendering of the foreground Gaussians, and the rendering of the reflected foreground Gaussians, respectively.
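A sketch of assembling \mathcal{L}_{FG} for a single object, with a simplified global-statistics DSSIM standing in for the windowed SSIM usually used; the λ and γ values are assumed, not taken from the paper:

```python
import torch
import torch.nn.functional as F

def dssim(a, b):
    """Simplified structural dissimilarity (1 - SSIM) / 2 using global image statistics."""
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    ssim = ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))
    return (1 - ssim) / 2

def fg_loss(I, I_hat, I_tilde, delta_sh, lam=0.2, gamma=0.01):
    """L_FG for one object: photometric + reflected consistency + SH-residual sparsity."""
    photo = (1 - lam) * F.l1_loss(I_hat, I) + lam * dssim(I_hat, I)
    refl  = (1 - lam) * F.l1_loss(I_tilde, I) + lam * dssim(I_tilde, I)
    sparsity = gamma * delta_sh.abs().mean()   # gamma * L1(Delta f_SH,t)
    return photo + refl + sparsity
```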
4. Experiment
4.1 Experimental Setup
Dataset: KITTI and PandaSet. PandaSet includes 103 urban driving scenes recorded in San Francisco, each with 80 image frames and corresponding LiDAR point clouds.
Evaluation Metric: FID (Fréchet Inception Distance) measures the quality of generated images by comparing the distributions of generated and real images in the feature space of the Inception network.
Implementation Details:
- Background: 15K + 15K iterations (two phases), with the positions of the road and sky Gaussians kept fixed.
- Foreground: 5K + 10K iterations, where the final 10K are for scene fusion, in which foreground and background Gaussians are fine-tuned together.
4.2 Main Results
- PandaSet experiments: state-of-the-art quantitative results plus qualitative comparisons.
- Lateral shift of the ego-vehicle trajectory: state-of-the-art FID.
- KITTI: state-of-the-art overall, though LPIPS is not the best; the flat fitting of the sky and road may leave high-frequency textures, such as clouds, under-modeled.
4.3 Ablation Studies
- Background Geometry Constraints → FID
- Foreground Initialization → FID + qualitative results
- Reflected Gaussian Consistency Constraint → qualitative results
- Effect of Dynamic Appearance Modeling → PSNR + high-frequency details
- Novel Scenario Simulation
- Different types of templates for initializing foreground objects
