[paper] Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture
Paper:
Authors:
David Eigen: Department of Computer Science, Courant Institute, New York University
Rob Fergus: Facebook AI Research
Abstract:
A single multi-scale convolutional network architecture handles depth estimation, surface normal estimation, and semantic segmentation.
The method refines predictions through a sequence of scales, capturing fine detail directly from the image without resorting to superpixel segmentation.
1. Introduction
In this paper, three tasks (depth estimation, surface normal estimation, and semantic segmentation) are addressed by a single unified framework.
Several advantages of a general pixel-map prediction model:
First, new applications can be developed quickly, with most of the work lying in defining an appropriate training set and loss function.
Second, a single architecture simplifies the implementation of systems that need multiple output modalities.
Third, a significant portion of the computation can be shared among modalities, which makes the system more efficient.
2. Related Work
Typically, prior systems use ConvNets either to detect localized features or to produce descriptor representations for discrete proposal regions; by contrast, this network uses both local and global views of the image to directly generate several kinds of pixel-level outputs.
3. Model Architecture


Scale 1: Full-Image View
The first scale predicts a coarse, spatially varying set of features over the whole image using a full-image field of view; this global view is obtained through two fully connected layers at the top of the stack.
Scale 2: Predictions
The second scale produces predictions at a mid-level resolution, combining a more detailed but narrower view of the image with the full-image information from the coarse scale, whose output feature map is fed in as additional channels.
Scale 3: Higher Resolution
The third scale refines the predictions to a higher resolution, again aligning finer-stride image features with the previous scale's output.
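A minimal sketch of the three-scale structure in PyTorch. The channel counts, kernel sizes, and the use of bilinear upsampling are illustrative assumptions rather than the paper's exact configuration, and the class name MultiScaleNet and all layer names are hypothetical. The structural point it shows is that each later scale convolves the image again at a finer stride and receives the previous scale's output as extra input channels.

    # Sketch of a three-scale prediction network (sizes are illustrative, not the paper's).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiScaleNet(nn.Module):
        def __init__(self, out_channels=1):
            super().__init__()
            # Scale 1: coarse, full-image view ending in two fully connected layers.
            self.s1_conv = nn.Sequential(
                nn.Conv2d(3, 64, 5, stride=4, padding=2), nn.ReLU(),
                nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d((8, 10)),
            )
            self.s1_fc = nn.Sequential(
                nn.Flatten(),
                nn.Linear(128 * 8 * 10, 1024), nn.ReLU(),
                nn.Linear(1024, 64 * 15 * 20),      # reshaped to a 64-channel coarse map
            )
            # Scale 2: mid-resolution prediction from image features + Scale 1 map.
            self.s2_img = nn.Conv2d(3, 96, 9, stride=2, padding=4)
            self.s2_pred = nn.Sequential(
                nn.Conv2d(96 + 64, 64, 5, padding=2), nn.ReLU(),
                nn.Conv2d(64, out_channels, 5, padding=2),
            )
            # Scale 3: higher-resolution refinement from image features + Scale 2 output.
            self.s3_img = nn.Conv2d(3, 64, 9, stride=1, padding=4)
            self.s3_pred = nn.Sequential(
                nn.Conv2d(64 + out_channels, 64, 5, padding=2), nn.ReLU(),
                nn.Conv2d(64, out_channels, 5, padding=2),
            )

        def forward(self, x):
            coarse = self.s1_fc(self.s1_conv(x)).view(-1, 64, 15, 20)
            f2 = F.relu(self.s2_img(x))
            coarse_up = F.interpolate(coarse, size=f2.shape[2:], mode="bilinear",
                                      align_corners=False)
            mid = self.s2_pred(torch.cat([f2, coarse_up], dim=1))
            f3 = F.relu(self.s3_img(x))
            mid_up = F.interpolate(mid, size=f3.shape[2:], mode="bilinear",
                                   align_corners=False)
            fine = self.s3_pred(torch.cat([f3, mid_up], dim=1))
            return coarse, mid, fine

In the paper, the Scale 1 convolutional stack is initialized from an ImageNet-pretrained network; that detail is omitted here for brevity.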
4. Tasks
We apply the same architecture to each of the three tasks examined: depth estimation, surface normal estimation, and semantic segmentation. Each task is defined only by its target data and loss function.
4.1 Depth
LOSS:
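The depth loss from the paper is a scale-invariant error on log depths plus a first-order gradient matching term. Writing d_i = log y_i - log y_i^* for the difference between predicted and ground-truth log depth at pixel i, and n for the number of valid pixels:

    L_{depth}(y, y^*) = \frac{1}{n}\sum_i d_i^2 \;-\; \frac{1}{2n^2}\Big(\sum_i d_i\Big)^2 \;+\; \frac{1}{n}\sum_i \big[(\nabla_x d_i)^2 + (\nabla_y d_i)^2\big]

The first two terms make the error partly invariant to the overall scale of the scene; the gradient term encourages the local structure of the prediction to match the ground truth.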

4.2 Surface Normals
To predict surface normal vectors, the output is changed from a single channel to three channels, giving the x, y, and z components of the normal at every pixel.
LOSS:
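The normals loss compares predicted and ground-truth normals with a dot product, i.e. the negated cosine of the angle between them, averaged over valid pixels:

    L_{normals}(N, N^*) = -\frac{1}{n}\sum_i n_i \cdot n_i^*

where n_i and n_i^* are the predicted and ground-truth normal vectors at pixel i, each normalized to unit length.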

4.3 Semantic Labels
In the context of semantic segmentation, we employ a per-pixel softmax classifier to estimate the class label for each individual pixel.
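The loss for this task is the per-pixel cross-entropy over the softmax output. With C_{i,c} = e^{z_{i,c}} / \sum_{c'} e^{z_{i,c'}} the softmax probability of class c at pixel i and C^* the one-hot ground-truth labels:

    L_{semantic}(C, C^*) = -\frac{1}{n}\sum_i C_i^* \cdot \log C_i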


5. Training
5.1 Training Procedure
We train the model in two phases using stochastic gradient descent (SGD): in the first phase, Scales 1 and 2 are trained jointly; in the second phase, their parameters are fixed and Scale 3 is trained.
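A sketch of the two-phase schedule, reusing the hypothetical MultiScaleNet from the architecture sketch above; the learning rate and momentum values are placeholders:

    # Two-phase training: Scales 1 and 2 jointly, then Scale 3 with the rest frozen.
    import torch

    model = MultiScaleNet(out_channels=1)  # hypothetical model from the sketch above

    # Phase 1: optimize only the Scale 1 and Scale 2 parameters.
    phase1_params = (list(model.s1_conv.parameters()) + list(model.s1_fc.parameters())
                     + list(model.s2_img.parameters()) + list(model.s2_pred.parameters()))
    opt1 = torch.optim.SGD(phase1_params, lr=0.001, momentum=0.9)

    # Phase 2: freeze Scales 1 and 2, then train Scale 3 alone.
    for p in phase1_params:
        p.requires_grad_(False)
    phase2_params = list(model.s3_img.parameters()) + list(model.s3_pred.parameters())
    opt2 = torch.optim.SGD(phase2_params, lr=0.001, momentum=0.9)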
5.2 Data Augmentation
The system applies random scaling, in-plane rotations, translations, color adjustments, flipping, and contrast enhancements.
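A minimal augmentation sketch for an (image, depth) training pair. The transform set and ranges are illustrative and the function name augment is hypothetical; the one paper-specific detail reproduced here is that when the image is scaled by a factor s, the depth values are divided by s so the implied geometry stays consistent.

    # Illustrative augmentation for an (image, depth) pair; ranges are not the paper's.
    import random
    import torch
    import torch.nn.functional as F

    def augment(img, depth, scale_range=(1.0, 1.5)):
        # img: (3, H, W), depth: (1, H, W)
        s = random.uniform(*scale_range)
        H, W = img.shape[1:]
        img = F.interpolate(img[None], scale_factor=s, mode="bilinear",
                            align_corners=False)[0]
        depth = F.interpolate(depth[None], scale_factor=s, mode="nearest")[0] / s
        # Random horizontal flip.
        if random.random() < 0.5:
            img, depth = img.flip(-1), depth.flip(-1)
        # Random crop back to the original size.
        top = random.randint(0, img.shape[1] - H)
        left = random.randint(0, img.shape[2] - W)
        return img[:, top:top+H, left:left+W], depth[:, top:top+H, left:left+W]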
5.3 Combining Depth and Normals
Depth and normals are predicted jointly by a single network that shares much of its computation between the two tasks, with task-specific outputs; the two losses are simply combined during training. This sharing is what yields the efficiency gain mentioned in the introduction.
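A minimal sketch of such a combined objective, assuming the depth and normals losses defined above and an (assumed) equal weighting of the two terms; the function names are hypothetical:

    # Joint depth + normals objective; equal weighting of the two terms is an assumption.
    import torch
    import torch.nn.functional as F

    def depth_loss(pred_log_d, gt_log_d):
        # Scale-invariant log-depth error plus first-order gradient matching.
        d = pred_log_d - gt_log_d                      # (B, 1, H, W) log-depth difference
        n = d.numel()
        gx = d[..., :, 1:] - d[..., :, :-1]            # horizontal finite differences
        gy = d[..., 1:, :] - d[..., :-1, :]            # vertical finite differences
        return (d.pow(2).mean()
                - 0.5 * d.sum().pow(2) / (n * n)
                + gx.pow(2).mean() + gy.pow(2).mean())

    def normals_loss(pred_n, gt_n):
        # Negative mean dot product between unit normals (gt_n assumed unit length).
        pred_n = F.normalize(pred_n, dim=1)            # (B, 3, H, W) -> unit vectors
        return -(pred_n * gt_n).sum(dim=1).mean()

    def joint_loss(pred_log_d, gt_log_d, pred_n, gt_n):
        return depth_loss(pred_log_d, gt_log_d) + normals_loss(pred_n, gt_n)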
6. Performance Experiments
6.1 Depth


6.2 Surface Normals


6.3 Semantic Labels
6.3.1 NYU Depth


6.3.2 SIFT Flow
SIFT Flow: scene parsing on a dataset built from dense correspondences across different scenes.
Because the class distribution is heavily skewed, we weight each pixel by α_c = median_freq / freq(c), where freq(c) is the pixel frequency of class c (counted over the images in which c appears) and median_freq is the median of these frequencies; results are reported both with and without this reweighting.
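A sketch of how these per-class weights could be computed; the function name median_freq_weights and the data layout (a list of integer label maps, with negative values treated as unlabeled) are assumptions:

    # Median-frequency class weights: median_freq / freq(c) for each class c.
    import numpy as np

    def median_freq_weights(labels, num_classes):
        pixel_count = np.zeros(num_classes)    # pixels of class c over all images
        image_pixels = np.zeros(num_classes)   # total pixels of images containing c
        for lab in labels:
            for c in np.unique(lab):
                if c < 0 or c >= num_classes:
                    continue                   # skip unlabeled / out-of-range entries
                pixel_count[c] += np.sum(lab == c)
                image_pixels[c] += lab.size
        freq = pixel_count / np.maximum(image_pixels, 1)
        median = np.median(freq[freq > 0])
        return np.where(freq > 0, median / np.maximum(freq, 1e-12), 0.0)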


6.3.3 Pascal VOC


7. Probe Experiments
7.1 Contributions of Scales

7.2 Effect of Depth and Normals Inputs
How important are the depth and normals channels, relative to RGB, for the semantic labeling task?
What happens if the network's own predicted depth and normals are substituted for the ground-truth inputs?

