GaussCtrl: Multi-View Consistent Text-Driven 3D Gaussian Splatting Editing
Abstract
We introduce GaussCtrl, a text-driven method for editing a 3D scene reconstructed by 3D Gaussian Splatting (3DGS).
Our method first renders a set of images using the 3DGS and edits them with a pre-trained 2D diffusion model (ControlNet) according to the input prompt; the edited images are then used to optimise the 3D model.
Our key contribution is multi-view consistent editing, which enables all images to be edited together rather than iteratively editing one image while updating the 3D model as in previous works, leading to faster editing as well as better visual quality.
This is achieved by two components:
(a) Depth-conditioned editing, which enforces geometric consistency across the edited multi-view images by conditioning the editing on depth.
(b) Attention-based latent code alignment, which unifies the appearance of the edited images by conditioning their editing on several reference views through self- and cross-view attention between the images' latent representations.
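As a concrete illustration of (b), the snippet below gives a minimal sketch of the alignment idea, not the authors' implementation: it assumes per-view latent feature maps and a fixed set of reference views, and it omits the learned query/key/value projections and multi-head structure of a real diffusion U-Net attention layer.

```python
import torch
import torch.nn.functional as F

def aligned_latent_attention(latents: torch.Tensor, ref_ids: list) -> torch.Tensor:
    """Align per-view latents by letting each view attend over its own tokens
    plus the tokens of a few reference views (hypothetical shapes and names).

    latents: [V, C, H, W] latent feature maps, one per rendered view.
    ref_ids: indices of the reference views used for conditioning.
    """
    v, c, h, w = latents.shape
    tokens = latents.flatten(2).transpose(1, 2)          # [V, H*W, C]
    ref_tokens = tokens[ref_ids].reshape(1, -1, c)       # [1, R*H*W, C]
    aligned = []
    for i in range(v):
        q = tokens[i:i + 1]                              # queries: current view
        kv = torch.cat([q, ref_tokens], dim=1)           # keys/values: self + references
        aligned.append(F.scaled_dot_product_attention(q, kv, kv))
    aligned = torch.cat(aligned, dim=0)                  # [V, H*W, C]
    return aligned.transpose(1, 2).reshape(v, c, h, w)
```

Because every view attends to the same reference tokens, appearance information is shared across views and the edited images converge to a consistent look.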
Experiments show that our approach outperforms existing methods in both editing speed and visual quality.
Figure 1

(a) GaussCtrl edits a 3DGS scene by modifying its descriptive prompt.
(b) This is achieved by editing images rendered from the 3DGS and re-training the 3D model.
(c) We propose a novel depth-conditioned multi-view consistent editing framework, which significantly improves over prior works, whose inconsistent editing leads to blurry or implausible 3D results.
Figure 2

GaussCtrl pipeline.
Our system takes a 3DGS scene and a text prompt as input. It renders images from the 3DGS, edits them according to the text prompt, and uses the edited images to optimise the original 3DGS.
Our key contribution is multi-view consistent editing.
(1) Depth-conditioned editing based on ControlNet, which aims at geometric consistency.
(2) Attention-based latent code alignment, which enhances appearance consistency during editing.
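To make the data flow of the pipeline concrete, here is a minimal sketch under stated assumptions: render, edit_views, and optimise are hypothetical callables standing in for the 3DGS renderer, the depth-conditioned ControlNet editor, and the 3DGS optimiser; they are not the authors' code.

```python
from typing import Callable, Sequence

def edit_scene(gaussians,                      # pre-captured 3DGS scene (opaque object here)
               cameras: Sequence,              # camera poses of the training views
               prompt: str,                    # target text prompt
               render: Callable,               # (gaussians, cam) -> (rgb, depth)
               edit_views: Callable,           # (rgbs, depths, prompt) -> edited images
               optimise: Callable):            # (gaussians, cameras, targets) -> gaussians
    # 1. Render an RGB image and a depth map for every view of the original scene.
    rgbs, depths = zip(*(render(gaussians, cam) for cam in cameras))

    # 2. Edit all views together, conditioning on the rendered depths
    #    (e.g. via a depth ControlNet) so the edits stay geometrically consistent.
    edited = edit_views(list(rgbs), list(depths), prompt)

    # 3. Optimise the original 3DGS scene towards the edited images.
    return optimise(gaussians, cameras, edited)
```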
Figure 3

Qualitative results.
We demonstrate a range of text-driven edits across different scenes, including modifying objects and environments, e.g. changing the appearance and demographic attributes of the edited subject and altering its surroundings.
Figure 4

Qualitative comparison on 360-degree scenes.
Our method produces more consistent and higher-quality results than state-of-the-art methods.
Figure 5

Qualitative results on forward-facing scenes.
Our approach produces more realistic results with higher quality, better consistency, and fewer artifacts.
Figure 6

Failure cases of the evaluation metric.
The metric relies on the correspondence between text prompts and editing results but does not account for editing quality.
(a) Our red panda has a lighter colour due to its fur texture, resulting in a lower score, while other methods produce incorrect facial structures yet achieve higher scores.
(b) Results from previous methods look more like a newborn baby and therefore score higher, even though the first two results are of poor quality and ViCA's result shows unnatural eyes.
Figure 7

Editing consistency comparison on the bear scene (Polar bear).
Both IN2N (GS) and ViCA suffer from inconsistent editing (see Views #1, 2, 4, 8, 10), which produces spots and a blurry bear face; both also exhibit the face-on-the-back problem. Our approach resolves these issues.
Figure 8

Ablation studies on the consistent editing.
(a) Sampled images from the original dataset.
(b) Editing results using Instruct Pix2Pix.
(c) Our proposed depth-conditioned editing, which uses ControlNet with randomly initialised latent codes.
(d) Consistent initial latent codes are applied using DDIM inversion (a sketch of this inversion follows below).
(e) Attention-based latent code alignment is added on top of (d).
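For reference, the snippet below is a minimal sketch of the DDIM inversion used in (d) to obtain consistent initial latent codes; eps_model and the cumulative noise schedule alpha_bar are assumed inputs, and the code is illustrative rather than the authors' implementation.

```python
import torch

@torch.no_grad()
def ddim_invert(x0: torch.Tensor, eps_model, alpha_bar: torch.Tensor) -> torch.Tensor:
    """Deterministically map a clean latent x0 (an encoded rendered view) to a
    noise latent x_T by running the DDIM update in reverse. Starting each view's
    edit from its own inverted latent, instead of random noise, keeps the
    initial latent codes consistent across views.

    eps_model: callable eps_model(x_t, t) -> predicted noise
    alpha_bar: cumulative product of alphas over the T timesteps, shape [T]
    """
    x = x0
    for t in range(alpha_bar.shape[0] - 1):
        a_t, a_next = alpha_bar[t], alpha_bar[t + 1]
        eps = eps_model(x, t)
        # Estimate x0 from the current latent, then step one noise level higher.
        x0_pred = (x - (1.0 - a_t).sqrt() * eps) / a_t.sqrt()
        x = a_next.sqrt() * x0_pred + (1.0 - a_next).sqrt() * eps
    return x
```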
Figure 9

Failure cases.
Left: Due to its reliance on depth guidance, our approach struggles when a substantial geometric change is required; however, existing methods also fail in this case despite not relying on depth.
Right: Even when the pre-trained 2D diffusion model fails to edit correctly, our method still maintains consistency.
Limitations
Conditioning the editing on depth maps may raise concerns about the ability to modify the scene's original geometry. In practice, however, changes to the original geometry are rarely required: editing tasks typically involve altering object appearance, changing the background, or adding localised details, such as adding a moustache to a person.
For completeness, we include an example that requires geometric modification: we are unable to turn the bear statue into a giraffe, as shown in Figure 9. However, other approaches, including IPix2Pix and the baselines (IN2N and ViCA-NeRF), encounter the same problem.
Another limitation concerns faithfulness to the user's intention. For example, our method is unable to turn a man into a comic character such as Hulk (Fig. 9). We believe this is because ControlNet cannot interpret such prompts and edit accordingly. As a result, although our approach produces consistent and sharp results, they do not always match the user's expectations.
Conclusion
In this paper, we propose an efficient 3D-aware editing framework, GaussCtrl, which significantly reduces the visual artifacts and blurry results caused by inconsistent 2D editing, particularly in 360-degree scenes.
Given a pre-captured Gaussian Splatting scene, our method enforces consistency throughout the editing process via depth-conditioned image editing and attention-based latent code alignment.
We evaluate GaussCtrl on a variety of scenes, text prompts, and objects; extensive experiments show that our method achieves superior results.
