
[Paper Reading] Neural Scene Graphs for Dynamic Scenes


Neural Scene Graphs for Dynamic Scenes

Table of Contents

  • Neural Scene Graphs for Dynamic Scenes
    • 1. What

    • 2. Why

    • 3. How

      • 3.1 Neural Scene Graphs
      • 3.2 Representation Models
      • 3.3 Neural Scene Graph Rendering
      • 3.4 3D Object Detection as Inverse Rendering
    • 4. Self-thoughts

1. What

What does this paper set out to do? (From the abstract and conclusion, summarized in one sentence.)

Using video and annotated tracking data, this paper composes dynamic, multi-object scenes into a learned scene graph, which can also be used for 3D object detection via inverse rendering.

2. Why

Under what conditions or needs was this research proposed (Intro)? What core problems or deficiencies does it address, what have others done, and what are the innovations? (From the Introduction and Related Work.)

This may cover Background, Question, Prior Work, and Innovation:

Traditional pipelines built on point clouds allow learning hierarchical scene representations but cannot handle highly view-dependent effects.

NeRF resolves these view-dependent effects but does not allow for hierarchical representations or dynamic scenes.

NeRF-W makes some attempts: it incorporates an appearance embedding and a decomposition into transient and static elements via uncertainty fields, but it still relies on the consistency of the static scene.

Related works mentioned:

  1. Implicit Scene Representations: Existing methods have been proposed to learn features on discrete geometric primitives, such as points, meshes, and multi-planes.

  2. Neural Rendering: Differentiable rendering functions have made it possible to learn scene representations. NeRF stands out because it outputs a color value conditioned on the ray’s direction.

  3. Scene Graph Representations: Model a scene as a directed graph which represents objects as leaf nodes.

  4. Latent Class Encoding: By adding a latent vector z to the input 3D query point, similar objects can be modeled using the same network.

Parts 3 and 4 will be introduced in more detail later.

3. How

3.1 Neural Scene Graphs

The first step of scene reconstruction is to model the scene in a specific way, which is what we introduce first.
[Figure: isometric view and graph structure of a neural scene graph]

On the left side (a), there’s an “isometric view” of a “Neural scene graph.” This graph represents the different elements of a scene as nodes and their spatial relationships as edges.

Each node is associated with a transformation (rotation and translation) and a scaling, denoted T^w_i and S_i, indicating how each node (object) is oriented and scaled within the world coordinate system W. The nodes are visualized as colored boxes, with edges indicating the relationships between them, such as the positions of objects relative to each other or to the world frame W. The objects carry latent object codes such as l_1, l_2, suggesting they represent specific objects such as cars and trucks. There is also a background node F_{\theta_{bckg}} and class nodes such as F_{\theta_{car}} and F_{\theta_{truck}}.

To sum up, we can define it as a directed acyclic graph:

{\mathcal S}=\langle{\mathcal W},C,F,L,E\rangle.

where, to supplement the figure, C is a leaf node representing the camera, F is the set of representation models (the background model and the shared per-class models), L is the set of latent object codes, and E is the set of edges, each representing either an affine transformation from node u to node v (a relationship) or a property assignment.
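As a rough illustration of how such a graph could be stored in code, here is a minimal Python sketch; the class and attribute names are hypothetical and chosen for illustration, not taken from the authors' implementation.

```python
# A minimal sketch of one way to store the scene graph S = <W, C, F, L, E>.
# All class and field names are illustrative, not the paper's code.
from dataclasses import dataclass, field

import numpy as np


@dataclass
class ObjectNode:
    node_id: int
    class_name: str          # e.g. "car" or "truck", selects the shared model F_theta_c
    latent_code: np.ndarray  # l_o, e.g. a 256-dim vector
    scale: np.ndarray        # S_o, per-axis bounding-box scale, shape (3,)
    pose_world: np.ndarray   # T^w_o, 4x4 transform relating the object to the world frame W


@dataclass
class SceneGraph:
    camera_pose: np.ndarray                            # C: camera extrinsics, 4x4
    background_model: object                           # F_theta_bckg (a NeRF-style MLP)
    class_models: dict = field(default_factory=dict)   # F: class name -> shared MLP
    objects: list = field(default_factory=list)        # leaf nodes carrying the latent codes L
    # Edges E are implicit here: each ObjectNode stores the transform that connects
    # it to the world frame, and its class_name links it to a representation model.
```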

3.2 Representation Models

[Figure: representation models for the static background and dynamic objects]

Static and dynamic scene representations are different.
For the static scene, the model is the same as the original NeRF: it takes the position (x, y, z) and direction (d_x,d_y,d_z) as input and outputs the color (c) and density (\sigma). We can summarize the process as:

\begin{aligned}[\sigma(\boldsymbol{x}),\boldsymbol{y}(\boldsymbol{x})]& =F_{\theta_{bckg,1}}(\gamma_{x}(\boldsymbol{x})) \\ \mathbf{c}(\boldsymbol{x})& =F_{\theta_{bckg,2}}(\gamma_{d}(\boldsymbol{d}),\boldsymbol{y}(\boldsymbol{x})). \end{aligned}
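To make the two-stage structure concrete, here is a minimal PyTorch-style sketch; the layer sizes, frequency counts, and names are assumptions for illustration, not the paper's exact architecture.

```python
import math

import torch
import torch.nn as nn


def positional_encoding(x, num_freqs):
    """gamma(x): append sin/cos features at increasing frequencies to each coordinate."""
    feats = [x]
    for k in range(num_freqs):
        feats.append(torch.sin((2.0 ** k) * math.pi * x))
        feats.append(torch.cos((2.0 ** k) * math.pi * x))
    return torch.cat(feats, dim=-1)


class BackgroundNeRF(nn.Module):
    """F_theta_bckg: stage 1 maps gamma_x(x) to (sigma, y); stage 2 maps (gamma_d(d), y) to c."""

    def __init__(self, x_freqs=10, d_freqs=4, hidden=256):
        super().__init__()
        self.x_freqs, self.d_freqs = x_freqs, d_freqs
        x_dim, d_dim = 3 * (1 + 2 * x_freqs), 3 * (1 + 2 * d_freqs)
        self.stage1 = nn.Sequential(
            nn.Linear(x_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden + 1))                     # outputs [sigma, y]
        self.stage2 = nn.Sequential(
            nn.Linear(hidden + d_dim, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid())           # outputs c in [0, 1]

    def forward(self, x, d):
        h = self.stage1(positional_encoding(x, self.x_freqs))
        sigma, y = h[..., :1], h[..., 1:]
        c = self.stage2(torch.cat([positional_encoding(d, self.d_freqs), y], dim=-1))
        return sigma, c
```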

For the dynamic scene, each object is represented by a neural radiance field.

Meanwhile, considering the computational cost, a latent vector l_o encoding an object’s representation is introduced. Conditioning on the latent code allows shared weights \theta_c between all objects of class c. Adding l_o to the input of a volumetric scene function F_{\theta_c} can be thought of as a mapping from the representation function of class c to the radiance field of object o.

In the network architecture, this 256-dimensional latent vector l_o is added as an input, resulting in the following new first stage:

[y(x),\sigma(x)]=F_{\theta_{c,1}}(\gamma_{\boldsymbol{x}}(\boldsymbol{x}),\boldsymbol{l}_{o}).

Because a dynamic object moves through the scene over time in the video, location-dependent effects are taken into account by adding the object’s global position p_o in the global frame as another input, that is:

c(x,l_o,p_o)=F_{\theta_{c,2}}(\gamma_d(d),y(x,l_o),p_o).

Notice that the x in this formulation is expressed in the local object coordinate system, after the following transformation and normalization:

x_o=S_oT_o^wx\text{ with }x_o\in[-1,1].

We query color and volume density in the local object coordinate system: when a ray traced in the global coordinate system hits an object, it has to be converted into that object’s local coordinate system, which will come up again in the rendering part.
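A small sketch of this world-to-object mapping, assuming T^w_o is available as a 4x4 homogeneous matrix that takes world points into the object frame and that the scaling S_o is realized by dividing by the bounding-box half-extents (both assumptions for illustration):

```python
import numpy as np


def world_to_object(x_world, T_o_w, box_half_extents):
    """x_o = S_o T^w_o x: rigidly transform a world point into the object frame,
    then normalize by the bounding-box half-extents so points inside map to [-1, 1].
    T_o_w: 4x4 homogeneous matrix; box_half_extents: length-3 array."""
    x_h = np.append(x_world, 1.0)          # homogeneous coordinates
    x_obj = (T_o_w @ x_h)[:3]              # rigid transform into the object frame
    return x_obj / box_half_extents        # per-axis scaling S_o
```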

So, all in all, the representation model of a dynamic object is:

F_{\theta_c}:(\boldsymbol{x}_o,\boldsymbol{d}_o,\boldsymbol{p}_o,\boldsymbol{l}_o)\rightarrow(\boldsymbol{c},\boldsymbol{\sigma});\forall\boldsymbol{x}_o\in[-1,1].
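For illustration, here is a hedged PyTorch sketch of the class-conditional object model F_{\theta_c}; the feature dimensions and layer counts are assumptions, and the inputs are assumed to be already positionally encoded in the object's local frame.

```python
import torch
import torch.nn as nn


class ObjectNeRF(nn.Module):
    """F_theta_c: per-class shared weights, conditioned on a latent code l_o and the
    object's global position p_o. A sketch only; not the paper's exact layers."""

    def __init__(self, x_feat_dim, d_feat_dim, latent_dim=256, hidden=256):
        super().__init__()
        self.stage1 = nn.Sequential(
            nn.Linear(x_feat_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden + 1))                         # -> [sigma, y]
        self.stage2 = nn.Sequential(
            nn.Linear(hidden + d_feat_dim + 3, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid())               # -> c

    def forward(self, gamma_x, gamma_d, l_o, p_o):
        h = self.stage1(torch.cat([gamma_x, l_o], dim=-1))         # first stage with latent code
        sigma, y = h[..., :1], h[..., 1:]
        c = self.stage2(torch.cat([gamma_d, y, p_o], dim=-1))      # second stage with global position
        return c, sigma
```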

3.3 Neural Scene Graph Rendering

[Figure: ray sampling through the background and dynamic object nodes]

Each ray r_j traced through the scene is discretized with N_d sampling points at each of its m_j dynamic-node intersections and N_s sampling points in the background, as in the original NeRF, resulting in a set of quadrature points \{\{t_{i}\}_{i=1}^{N_{s}+m_{j}N_{d}}\}_{j}.

To test whether a ray passes through a dynamic object, a ray-box intersection test is used, and the ray is then transformed into the object’s local coordinates to query its properties. The compositing is the same as in NeRF:

\begin{aligned}\hat{C}(\boldsymbol{r})&=\sum_{i=1}^{N_s+m_jN_d}T_i\alpha_ic_i,\quad\text{where}\\ T_i&=\exp\left(-\sum_{k=1}^{i-1}\sigma_k\delta_k\right)\quad\text{and}\quad\alpha_i=1-\exp(-\sigma_i\delta_i)\end{aligned}
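A minimal sketch of this compositing step for a single ray, assuming the samples from the background and all intersected objects have already been merged and sorted by depth:

```python
import torch


def composite_ray(sigmas, colors, deltas):
    """NeRF-style quadrature along one ray.
    sigmas: (N,) densities, colors: (N, 3), deltas: (N,) distances between samples."""
    alphas = 1.0 - torch.exp(-sigmas * deltas)                       # alpha_i
    # T_i = exp(-sum_{k<i} sigma_k delta_k): transmittance before sample i (T_1 = 1)
    trans = torch.exp(-torch.cumsum(
        torch.cat([torch.zeros(1), sigmas * deltas])[:-1], dim=0))
    return (trans * alphas).unsqueeze(-1).mul(colors).sum(dim=0)     # C_hat(r)
```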

Finally, the loss function is:

\mathcal{L}=\sum_{\boldsymbol{r}\in\mathcal{R}}\|\hat{C}(\boldsymbol{r})-C(\boldsymbol{r})\|_{2}^{2}+\frac{1}{\sigma^{2}}\|\boldsymbol{z}\|_{2}^{2},

which also places a zero-mean Gaussian prior p(\boldsymbol{z}_o) on the latent codes, as in DeepSDF.
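A small sketch of this objective for one batch of rays, where sigma_prior is the (assumed) standard deviation of the Gaussian prior on the latent codes:

```python
import torch


def nsg_loss(pred_rgb, gt_rgb, latent_codes, sigma_prior=1.0):
    """Photometric L2 loss over a batch of rays plus the latent-code regularizer.
    pred_rgb, gt_rgb: (R, 3); latent_codes: stacked l_o vectors of the visible objects."""
    photo = ((pred_rgb - gt_rgb) ** 2).sum(dim=-1).sum()             # sum of squared errors over rays
    latent_reg = (latent_codes ** 2).sum() / (sigma_prior ** 2)      # Gaussian prior, DeepSDF-style
    return photo + latent_reg
```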

3.4 3D Object Detection as Inverse Rendering

Just like anchor-based detectors such as R-CNN (【计算机视觉】24-Object Detection-博客), the method samples anchor positions in a bird’s-eye-view plane and optimizes over anchor box positions and latent object codes that minimize the \ell_{1} image loss between the synthesized image and an observed image.
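The procedure can be sketched roughly as follows. Here render_scene stands in for the scene-graph renderer described above, and all names, anchor parameterizations, and hyperparameters are hypothetical placeholders rather than the paper's settings.

```python
# A high-level sketch of detection-by-inverse-rendering over bird's-eye-view anchors.
import torch


def detect_objects(observed_image, anchors_bev, latent_init, render_scene,
                   steps=100, lr=1e-2):
    """For each BEV anchor, optimize the box pose and latent code so the rendered
    image matches the observation under an L1 loss; return fits sorted by loss."""
    results = []
    for anchor in anchors_bev:                        # candidate (x, z, yaw) positions
        pose = anchor.clone().requires_grad_(True)
        latent = latent_init.clone().requires_grad_(True)
        optim = torch.optim.Adam([pose, latent], lr=lr)
        for _ in range(steps):
            optim.zero_grad()
            rendered = render_scene(pose, latent)     # differentiable scene-graph rendering
            loss = (rendered - observed_image).abs().mean()
            loss.backward()
            optim.step()
        results.append((loss.item(), pose.detach(), latent.detach()))
    return sorted(results, key=lambda r: r[0])        # lowest L1 loss first
```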

4. Self-thoughts

  1. How to handle shadows when merging scene graphs?
  2. I have no concrete idea yet how to improve it.
