SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes
Abstract
Novel view synthesis for dynamic scenes remains a significant challenge in computer vision and graphics. Recently, Gaussian splatting has emerged as an effective method for representing static scenes while enabling high-fidelity, real-time novel view synthesis.
Building on this approach, we introduce a novel framework that decomposes the motion and appearance of dynamic scenes into sparse control points and dense Gaussians, respectively.
Our key idea is to use significantly fewer control points than Gaussians to learn a compact set of 6-degree-of-freedom (6-DoF) transformation bases, which can be locally interpolated through learned interpolation weights to yield the motion field of the 3D Gaussians.
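To make this interpolation concrete, the sketch below blends per-control-point rigid transforms into a per-Gaussian motion field using K-nearest-neighbor weights. The tensor layout, the helper name blend_gaussian_motion, and the exact blending scheme are illustrative assumptions, not the paper's implementation.

```python
import torch

def blend_gaussian_motion(gauss_xyz, ctrl_xyz, ctrl_R, ctrl_t, knn_idx, knn_w):
    """Minimal sketch: drive dense Gaussians with sparse control-point
    transforms via weighted blending of the K nearest neighbors.

    gauss_xyz: (N, 3) canonical Gaussian centers
    ctrl_xyz:  (M, 3) canonical control-point positions
    ctrl_R:    (M, 3, 3) per-control-point rotations
    ctrl_t:    (M, 3) per-control-point translations
    knn_idx:   (N, K) indices of the K nearest control points per Gaussian
    knn_w:     (N, K) interpolation weights (assumed normalized to sum to 1)
    """
    # Gather the neighboring control points and their transforms: (N, K, ...)
    nb_xyz = ctrl_xyz[knn_idx]          # (N, K, 3)
    nb_R = ctrl_R[knn_idx]              # (N, K, 3, 3)
    nb_t = ctrl_t[knn_idx]              # (N, K, 3)

    # Apply each neighbor's rigid transform to the Gaussian center,
    # rotating about the control point before adding its translation.
    local = gauss_xyz[:, None, :] - nb_xyz                    # (N, K, 3)
    warped = torch.einsum('nkij,nkj->nki', nb_R, local)       # rotate
    warped = warped + nb_xyz + nb_t                           # translate

    # Blend the K candidate positions with the learned weights.
    return (knn_w[..., None] * warped).sum(dim=1)             # (N, 3)
```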
We develop a deformation MLP that predicts time-varying 6-DoF transformations for the control points, which eases learning complexity, enhances learning capability, and yields temporally and spatially coherent motion patterns.
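A minimal sketch of such a deformation MLP is given below, assuming a sinusoidal positional encoding of the control-point position and time, and a quaternion-plus-translation output; the layer sizes, encoding, and class name are illustrative choices rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ControlPointDeformMLP(nn.Module):
    """Sketch of an MLP mapping (canonical control-point position, time)
    to a 6-DoF transform: a unit quaternion plus a translation."""

    def __init__(self, pe_freqs: int = 6, hidden: int = 256):
        super().__init__()
        self.pe_freqs = pe_freqs
        in_dim = (3 + 1) * (2 * pe_freqs + 1)    # encoded (x, y, z, t)
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 7),                 # 4 quaternion + 3 translation
        )

    def positional_encoding(self, x):
        # Standard sinusoidal encoding: [x, sin(2^k x), cos(2^k x)].
        out = [x]
        for k in range(self.pe_freqs):
            out += [torch.sin((2 ** k) * x), torch.cos((2 ** k) * x)]
        return torch.cat(out, dim=-1)

    def forward(self, ctrl_xyz, t):
        # ctrl_xyz: (M, 3) canonical control points, t: (M, 1) time stamps.
        feat = self.positional_encoding(torch.cat([ctrl_xyz, t], dim=-1))
        out = self.net(feat)
        quat = nn.functional.normalize(out[:, :4], dim=-1)   # rotation
        trans = out[:, 4:]                                    # translation
        return quat, trans
```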
We jointly learn the 3D Gaussians, the canonical-space locations of the control points, and the deformation MLP to reconstruct the appearance, geometry, and dynamics of 3D scenes.
During learning, the positions and number of control points are adaptively adjusted to accommodate varying motion complexities in different regions, and an as-rigid-as-possible (ARAP) loss is designed to enforce the spatial continuity and local rigidity of the learned motions.
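The snippet below sketches one plausible form of such an ARAP regularizer, penalizing changes in distance between neighboring control points after deformation; the neighbor weighting and exact formulation are assumptions for illustration, not the paper's loss.

```python
import torch

def arap_loss(ctrl_xyz, warped_xyz, knn_idx, knn_w):
    """Sketch of an as-rigid-as-possible regularizer on control points:
    distances between neighboring control points should be preserved
    after deformation.

    ctrl_xyz:   (M, 3) canonical control-point positions
    warped_xyz: (M, 3) control-point positions after applying the predicted
                6-DoF transforms at some time step
    knn_idx:    (M, K) indices of each control point's K nearest neighbors
    knn_w:      (M, K) neighbor weights (e.g. decaying with canonical distance)
    """
    # Pairwise offsets to neighbors before and after deformation.
    canon_edges = ctrl_xyz[knn_idx] - ctrl_xyz[:, None, :]      # (M, K, 3)
    warp_edges = warped_xyz[knn_idx] - warped_xyz[:, None, :]   # (M, K, 3)

    # Penalize changes in edge length, i.e. non-rigid local stretching.
    diff = canon_edges.norm(dim=-1) - warp_edges.norm(dim=-1)   # (M, K)
    return (knn_w * diff.pow(2)).mean()
```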
Finally, thanks to the sparse motion representation and its decomposition from appearance, our method enables user-controlled motion editing while preserving high-fidelity appearance.
Extensive experiments show that our approach outperforms existing methods on novel view synthesis while maintaining a high rendering speed, and enables novel appearance-preserving motion editing applications.
Figure 1

Given (a) an image sequence from a monocularly captured dynamic video, we introduce a method that models motion with a set of sparse control points, which are used to drive 3D Gaussians for high-fidelity rendering. Our method enables both (b) dynamic view synthesis and (c) motion editing, thanks to the motion representation based on sparse control points.
Figure 2

Our method uses sparse control points and a deformation MLP to drive the dynamics of 3D Gaussians.
The MLP takes canonical control-point coordinates and time as input and predicts per-control-point 6-DoF transformations, which drive the deformation of each 3D Gaussian through its K closest control points.
The deformed Gaussians are then rendered into images, and the rendering loss is backpropagated to jointly optimize the Gaussians, the control points, and the MLP. The densities of the Gaussians and control points are adaptively managed during training.
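For concreteness, the sketch below outlines one training step following this pipeline, reusing the hypothetical blend_gaussian_motion and arap_loss helpers from the earlier sketches; the gaussians object, rasterizer call, quaternion_to_matrix helper, and data layout are placeholders, and the adaptive density control of Gaussians and control points is omitted.

```python
import torch

def train_step(gaussians, ctrl_xyz, deform_mlp, rasterizer, optimizer, frame,
               g2c_idx, g2c_w, c2c_idx, c2c_w, lambda_arap=0.1):
    """Sketch of one joint-optimization step over the Gaussians, control
    points, and deformation MLP for a single training frame."""
    t = frame['time'].expand(ctrl_xyz.shape[0], 1)

    # 1. Time-conditioned 6-DoF transform per control point.
    quat, trans = deform_mlp(ctrl_xyz, t)
    R = quaternion_to_matrix(quat)                          # assumed helper

    # 2. Drive Gaussian centers via their K closest control points.
    xyz_t = blend_gaussian_motion(gaussians.xyz, ctrl_xyz, R, trans,
                                  g2c_idx, g2c_w)

    # 3. Rendering loss against the ground-truth frame.
    image = rasterizer(xyz_t, gaussians, frame['camera'])   # assumed API
    loss = torch.nn.functional.l1_loss(image, frame['gt_image'])

    # 4. ARAP regularization on the warped control points themselves
    #    (a control point moves by its own translation under its rigid transform).
    warped_ctrl = ctrl_xyz + trans
    loss = loss + lambda_arap * arap_loss(ctrl_xyz, warped_ctrl, c2c_idx, c2c_w)

    # 5. Backpropagate into the Gaussians, control points, and MLP jointly.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```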
Figure 3

Qualitative comparison of dynamic view synthesis on the D-NeRF dataset.
We benchmark against state-of-the-art approaches, including D-NeRF, TiNeuVox-B, K-Planes, and 4DGS. Our approach achieves superior visual quality while retaining more details of dynamic scenes. Notably, in the Lego scene (bottom row), the training motion does not align with the test motion.
Figure 4

Qualitative comparison of dynamic view synthesis on scenes from the NeRF-DS dataset.
Our method achieves high-quality results even without a specialized design for specular surfaces.
Figure 5

We show the motion sequences of a dynamic scene, both reconstructed and edited. The top row displays the reconstructed motion sequence, while the bottom row shows the edited motion sequence.
With our approach, motions beyond the training sequence benefit from the locally rigid motion space modeled by the control points.
Figure 6

We show the rendering results and the learned Gaussians of (a) the baseline without control points, (b) our full method, and (c) our method without the ARAP loss.
Conclusion and Future Works
We introduced a novel method that drives 3D Gaussians with sparse control points and a deformation MLP, all learned from dynamic scenes.
With a compact motion representation, our method combines adaptive learning mechanisms with rigidity constraints to enable high-fidelity dynamic scene reconstruction and precise motion editing.
Experiments demonstrate that our method surpasses existing approaches in the visual fidelity of synthesized dynamic views. However, limitations persist. The method is sensitive to poorly estimated camera poses and can fail to reconstruct datasets with inaccurate pose information, such as HyperNeRF.
The current approach also struggles with specular effects, which substantially hinder performance on the NeRF-DS dataset. Future work could address this by integrating a specialized specular design, such as Spec-Gaussian, to model highlights and reflective surfaces more accurately.
Furthermore, motion blur in videos of dynamic objects should be taken into account; incorporating deblurring methods into the proposed approach could improve its robustness to this problem.
