Computer Vision L9 -- Vision 3D

Why not just use a 3D LiDAR sensor?
Because it costs significantly more than an ordinary camera.


Machine-learning-based approaches for estimating depth (3D vision):
Idea 1: Binocular stereo. It operates on the principle of binocular disparity, the same cue humans get from having two eyes.
Imagine taking a photo with each of two cameras placed a small distance apart. Comparing the two images, objects close to the cameras appear to shift a lot between the views, while objects farther away barely move at all.
The effect is like looking out of a car window on a highway: the grass just beside the road rushes past, while the mountains on the horizon creep by slowly.
By matching corresponding pixels across two images of the same scene and measuring how far each one shifts (its disparity), we can determine how far away that point is.
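Once the camera geometry is known, disparity converts directly to depth via Z = f * B / d. A minimal NumPy sketch; the focal length and baseline below are made-up illustrative values:

```python
import numpy as np

# Depth from disparity: Z = f * B / d, where f is the focal length in pixels,
# B is the baseline (distance between the two cameras), and d is the disparity.
# All numbers here are assumed for illustration.
focal_px = 700.0     # focal length in pixels (assumed)
baseline_m = 0.1     # 10 cm between the two cameras (assumed)

disparity_px = np.array([70.0, 35.0, 7.0])  # large disparity = close object
depth_m = focal_px * baseline_m / disparity_px
print(depth_m)  # nearby points have large disparity, far points small
```

Note the inverse relationship: halving the disparity doubles the estimated depth, which is why stereo depth gets noisy for far-away objects.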
Idea 2: Photometric stereo. This approach needs only one camera: instead of moving the camera, we move the light source. Very effective depth estimation systems have been built that rely solely on varying the lighting.
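A rough sketch of the underlying math (a textbook Lambertian model, not any particular system): if a surface point is lit from at least three known directions, its observed intensities form a linear system I = L @ (albedo * n), and least squares recovers the normal n and albedo. The light directions and values below are made up:

```python
import numpy as np

# Lambertian photometric stereo: camera fixed, light moves.
# Three assumed light directions (rows of L):
L = np.array([[0.0, 0.0, 1.0],
              [0.7, 0.0, 0.714],
              [0.0, 0.7, 0.714]])

true_n = np.array([0.0, 0.0, 1.0])  # a pixel whose surface faces the camera
albedo = 0.8
I = L @ (albedo * true_n)           # simulated intensities under the 3 lights

g, *_ = np.linalg.lstsq(L, I, rcond=None)  # solve for g = albedo * n
est_albedo = np.linalg.norm(g)
est_n = g / est_albedo              # unit normal; its length gave the albedo
```

Repeating this per pixel yields a normal map, from which a depth map can then be integrated.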
Image representations are famously pixel-based. Video extends this to a sequence of frames over time, and sound is modeled as a waveform. The question becomes: how do we represent three-dimensional information? One answer is the voxel.


Before talking about voxels: do you know where the word "pixel" comes from?
It is a contraction of "picture element". A digital image stores visual data across its width and height dimensions.
A voxel is a three-dimensional volume element: just as pixels are the two-dimensional squares of an image grid, voxels are the cubes of a 3D grid. The grid has a resolution, and its many small cubes serve as the fundamental units that encode various properties.
In practice, however, voxels see limited use because of their memory consumption: most of the allocated memory merely represents empty space. Higher resolution requires more small cubes to form an object, and every increase raises the memory cost; making the cubes larger saves memory but lowers the resolution.
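The cost is easy to quantify: a dense grid at resolution n stores n^3 cells, so doubling the resolution multiplies the memory by eight. A quick illustration:

```python
import numpy as np

# A dense voxel grid at resolution n stores n^3 cells, so memory grows
# cubically: doubling the resolution multiplies the cost by eight.
for res in (32, 64, 128, 256):
    grid = np.zeros((res, res, res), dtype=np.uint8)  # 1 byte per voxel
    print(f"{res}^3 grid: {grid.nbytes / 1e6:.1f} MB")
```

And this is with a single byte per cell; storing color or features per voxel multiplies the cost further, which is why sparse structures like octrees are often used instead.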
Other representations:





A point cloud is a collection of points in three-dimensional space with no inherent ordering: the points have no natural correspondence, so the order in which they are stored carries no structure. Likewise, translating every point by, say, 10 units along the X axis should not alter what the cloud represents, since it merely shifts its position within the coordinate system. A network that processes point clouds therefore has to be designed so that neither a reordering of the points nor such a translation changes its output.
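A common way to get order-invariance (the idea behind PointNet-style encoders, sketched here with hypothetical random weights standing in for learned ones) is to apply a shared per-point function and then pool with a symmetric operation such as max:

```python
import numpy as np

# PointNet-style sketch: a shared per-point transform followed by a
# symmetric pooling (max), so the result ignores point ordering.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 8))  # hypothetical shared weights (not learned)

def encode(points):
    feats = np.maximum(points @ W, 0.0)  # shared per-point layer with ReLU
    return feats.max(axis=0)             # max-pool over points: order-free

cloud = rng.standard_normal((100, 3))
shuffled = rng.permutation(cloud)        # same points, different order
print(np.allclose(encode(cloud), encode(shuffled)))  # permutation-invariant
```

Translation invariance is not automatic from this construction; it is usually handled separately, e.g. by centering the cloud with `points - points.mean(axis=0)` before encoding.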



The critical points are the subset of points that must remain visible for the feature vector to stay unchanged. These critical points:
- make the method highly robust: it still works under occlusion, noise, or when some points were simply never recorded, so the representation tolerates missing points.
- typically capture the rough outline or shape of the object, which is exactly what matters for classification tasks.
 



We will instead aim to compress the scene, specifically the value at every coordinate (x1, x2), into the parameters of a neural network. This builds on our earlier exploration, where such scenes were stored explicitly. We discussed how images can be represented as functions, and here we capture such functions in the parameters of a neural network.
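A minimal sketch of the idea, assuming nothing beyond NumPy: a tiny MLP f(x, y) -> intensity is trained by gradient descent to reproduce a hand-made 2D "image" (a sinusoid standing in for real pixel data), after which the scene exists only in the weights W1, b1, W2, b2:

```python
import numpy as np

# Fit an implicit representation: the "image" becomes network parameters.
rng = np.random.default_rng(0)
xy = np.stack(np.meshgrid(np.linspace(0, 1, 16), np.linspace(0, 1, 16)),
              -1).reshape(-1, 2)                      # 256 (x, y) coordinates
target = np.sin(6 * xy[:, 0]) * np.cos(6 * xy[:, 1])  # stand-in "image"

W1 = rng.standard_normal((2, 32)) * 0.5; b1 = np.zeros(32)
W2 = rng.standard_normal((32, 1)) * 0.5; b2 = np.zeros(1)

lr, losses = 0.05, []
for step in range(1000):              # plain full-batch gradient descent
    h = np.tanh(xy @ W1 + b1)         # hidden layer
    pred = (h @ W2 + b2).ravel()      # predicted intensity at each (x, y)
    err = pred - target
    losses.append((err ** 2).mean())
    g_out = 2 * err[:, None] / len(err)         # dLoss/dOutput
    g_h = (g_out @ W2.T) * (1 - h ** 2)         # backprop through tanh
    W2 -= lr * h.T @ g_out;  b2 -= lr * g_out.sum(0)
    W1 -= lr * xy.T @ g_h;   b1 -= lr * g_h.sum(0)

print(losses[0], losses[-1])  # reconstruction error falls as weights absorb the scene
```

Querying the trained network at any continuous (x, y), not just the 16x16 training grid, is what makes this a function-based rather than pixel-based representation.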



