
[Paper Review] Unsupervised Learning of Depth and Ego-Motion from Video

by 수제햄버거 2021. 5. 1.

Paper : Unsupervised Learning of Depth and Ego-Motion from Video (SfMLearner)

URL : arxiv.org/abs/1704.07813

 


Authors : Tinghui Zhou, Matthew Brown, Noah Snavely, David G. Lowe (Google)

Published : CVPR 2017

 

[Summary] Main Points of this paper


·      $p_t$ is a pixel in the target view (in homogeneous coordinates), and $K$ is the camera intrinsic matrix, so $K^{-1} p_t$ is the inverse projection of the pixel to normalized camera coordinates. $D_t(p_t)$ is the predicted depth at $p_t$, so $D_t(p_t)\,K^{-1} p_t$ is the corresponding 3D point in the target view. Multiplying by $T_{t \to s}$ transforms this point from the target view to the source view, and finally multiplying by the intrinsic matrix $K$ projects it into the source image, yielding $p_s$:

$$p_s \sim K \, T_{t \to s} \, D_t(p_t) \, K^{-1} p_t$$
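To make the equation concrete, here is a minimal NumPy sketch of this inverse warp for a single pixel. The function name, the example intrinsics, and the identity pose below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def project_to_source(p_t, depth_t, K, T_t2s):
    """Map a target-view pixel to its source-view location:
    p_s ~ K @ T_{t->s} @ (D_t(p_t) * K^{-1} @ p_t)."""
    ray = np.linalg.inv(K) @ p_t        # inverse projection to normalized coords
    point_t = depth_t * ray             # 3D point in the target camera frame
    point_s = (T_t2s @ np.append(point_t, 1.0))[:3]  # move into the source frame
    p_s = K @ point_s                   # project with the intrinsics
    return p_s / p_s[2]                 # normalize the homogeneous coordinate

# Hypothetical KITTI-like intrinsics, purely for illustration.
K = np.array([[718.0,   0.0, 607.0],
              [  0.0, 718.0, 185.0],
              [  0.0,   0.0,   1.0]])
p_s = project_to_source(np.array([400.0, 200.0, 1.0]), 12.5, K, np.eye(4))
print(p_s)  # identity pose maps the pixel back to itself: [400. 200. 1.]
```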

 

·      The depth network is an encoder-decoder architecture. The encoder reduces the spatial resolution while increasing the number of feature channels; the decoder does the opposite, increasing the spatial resolution while reducing the number of channels.

·      Skip connections between the encoder and decoder combine high-level features that have passed through many convolutions with low-level encoder features that retain more accurate spatial information, so depth is inferred from richer information (see the sketch after this list).

·      The pose and explainability networks share the first few feature-encoding layers. For pose, a few more convolutions are applied to the last shared feature map, and the output of global average pooling gives the relative poses. For explainability, deconvolution layers upsample the features to produce an explainability mask at the same resolution as the input image.

·      The explainability mask acts as a per-pixel weight on the photometric error. It is designed to assign low weight to regions where depth is hard to estimate from a single image, e.g., due to occlusion or moving objects.
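To illustrate the architecture bullets above, below is a minimal PyTorch sketch of an encoder-decoder with skip connections, plus a pose head that reads out poses via global average pooling. All layer counts, channel sizes, and class names are toy assumptions; the paper's actual networks are a DispNet-style depth network and a deeper pose/explainability network.

```python
import torch
import torch.nn as nn

def down(in_ch, out_ch):
    # Stride-2 convolution: halves the spatial resolution, grows the channels.
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                         nn.ReLU(inplace=True))

def up(in_ch, out_ch):
    # Transposed convolution: doubles the resolution, shrinks the channels.
    return nn.Sequential(nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1),
                         nn.ReLU(inplace=True))

class TinyDepthNet(nn.Module):
    """Toy encoder-decoder with one skip connection per scale."""
    def __init__(self):
        super().__init__()
        self.enc1, self.enc2, self.enc3 = down(3, 32), down(32, 64), down(64, 128)
        self.dec3 = up(128, 64)
        self.dec2 = up(64 + 64, 32)   # +64 input channels: skip from enc2
        self.dec1 = up(32 + 32, 16)   # +32 input channels: skip from enc1
        self.pred = nn.Conv2d(16, 1, 3, padding=1)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        e3 = self.enc3(e2)
        # Skip connections concatenate low-level encoder features (precise
        # spatial detail) with decoded high-level features (global context).
        d2 = self.dec2(torch.cat([self.dec3(e3), e2], dim=1))
        d1 = self.dec1(torch.cat([d2, e1], dim=1))
        return torch.sigmoid(self.pred(d1))  # per-pixel depth/disparity map

class TinyPoseHead(nn.Module):
    """Pose readout: 1x1 conv to 6*(N-1) channels, then global average pooling."""
    def __init__(self, in_ch=128, num_sources=2):
        super().__init__()
        self.pred = nn.Conv2d(in_ch, 6 * num_sources, 1)

    def forward(self, feat):
        out = self.pred(feat).mean(dim=[2, 3])   # global average pooling
        return out.view(out.size(0), -1, 6)      # (B, num_sources, 6-DoF pose)

depth = TinyDepthNet()(torch.randn(1, 3, 128, 416))
print(depth.shape)  # torch.Size([1, 1, 128, 416])
```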

 

·      Loss: photometric loss + smoothness loss + explainability regularization loss (a combined sketch follows this list).

·      Photometric loss: using the depth map predicted for a specific target image and the predicted poses, multiple nearby source views are warped into the target view, and the per-pixel error against the target image is summed over all pixels (multiplied by the explainability mask).

·      To estimate depth from a global perspective, the depth network is designed as an encoder-decoder with a narrow bottleneck, and the loss is computed at multiple scales.
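Here is a minimal single-scale sketch of this combined objective, assuming L1 photometric error, a second-order gradient smoothness term, and the paper's cross-entropy explainability regularizer. The loss weights and the function signature are illustrative; the full method sums this over several scales:

```python
import torch
import torch.nn.functional as F

def total_loss(target, warped_sources, exp_logits, disp,
               lambda_s=0.5, lambda_e=0.2):
    """Single-scale SfMLearner-style objective.

    target:         (B, 3, H, W) target frame
    warped_sources: list of (B, 3, H, W) source frames warped into the target view
    exp_logits:     list of (B, 1, H, W) explainability logits, one per source
    disp:           (B, 1, H, W) predicted disparity/depth at this scale
    """
    photo, expl = 0.0, 0.0
    for warped, logits in zip(warped_sources, exp_logits):
        mask = torch.sigmoid(logits)                      # soft per-pixel weight
        photo += (mask * (target - warped).abs()).mean()  # masked L1 photometric error
        # Cross-entropy against a constant all-ones "explainable" label; this
        # regularizer prevents the trivial solution of predicting mask -> 0.
        expl += F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    # Smoothness: L1 norm of second-order spatial gradients of the prediction.
    dxx = disp[..., :, 2:] - 2 * disp[..., :, 1:-1] + disp[..., :, :-2]
    dyy = disp[..., 2:, :] - 2 * disp[..., 1:-1, :] + disp[..., :-2, :]
    smooth = dxx.abs().mean() + dyy.abs().mean()
    return photo + lambda_s * smooth + lambda_e * expl

# Toy usage with random tensors, just to show the expected shapes.
B, H, W = 2, 128, 416
tgt, src = torch.rand(B, 3, H, W), torch.rand(B, 3, H, W)
print(total_loss(tgt, [src], [torch.zeros(B, 1, H, W)], torch.rand(B, 1, H, W)))
```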

 

[Strengths] Clearly explain why these aspects of the paper are valuable.

·      Data collection is relatively easy, since training requires only unlabeled monocular video rather than ground-truth depth or pose.

·      Its performance is comparable to supervised approaches.

 

[Weaknesses] Clearly explain why these aspects of the paper are weak.

·      It has problems with textureless regions and distant objects.

·      It cannot be applied to unknown camera types or calibrations (the intrinsics $K$ must be known).

·      It does not explicitly model scene dynamics or occlusions; these are only handled implicitly through the explainability mask.

 

[Main Contribution] What is the contribution or novelty of the paper?

·      It was one of the first works to introduce unsupervised learning for this task.

·      It proposes a single model that learns depth and pose jointly, end to end.
