Paper : Unsupervised Learning of Depth and Ego-Motion from Video (SfMLearner)
URL : arxiv.org/abs/1704.07813
Authors : Tinghui Zhou, Matthew Brown, Noah Snavely, David G. Lowe (Google)
Published : CVPR, 2017
[Summary] Main Points of this paper
· The core of the view-synthesis objective is the projection $p_s \sim K \hat{T}_{t\to s} \hat{D}_t(p_t) K^{-1} p_t$, which can be read term by term:
· $p_t$ is a pixel in the target view (homogeneous coordinates).
· $K$ is the camera intrinsic matrix, so $K^{-1} p_t$ is the inverse projection of the pixel to normalized camera coordinates.
· $\hat{D}_t(p_t)$ is the predicted depth at pixel $p_t$, so $\hat{D}_t(p_t) K^{-1} p_t$ is the corresponding 3D point in the target view.
· Multiplying by the predicted relative pose $\hat{T}_{t\to s}$ transforms this 3D point from the target view to the source view.
· Finally, multiplying by the camera intrinsic matrix $K$ projects the point onto the source image, giving $p_s$ (a minimal code sketch follows below).
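As a concrete illustration, here is a minimal NumPy sketch of this projection for a single pixel. Function and variable names are hypothetical, not from the paper's code; it assumes a 3x3 intrinsic matrix and a 4x4 rigid-body transform:

```python
import numpy as np

def project_to_source(p_t, depth_t, K, T_t2s):
    """Sketch of p_s ~ K @ T_{t->s} @ (D_t(p_t) * K^{-1} @ p_t).
    p_t: homogeneous pixel (3,), depth_t: predicted depth (scalar),
    K: 3x3 intrinsics, T_t2s: 4x4 target-to-source camera pose."""
    # Inverse projection: pixel -> normalized camera ray, scaled by depth.
    point_t = depth_t * (np.linalg.inv(K) @ p_t)   # 3D point in target frame
    point_t_h = np.append(point_t, 1.0)            # homogeneous 3D point
    point_s = (T_t2s @ point_t_h)[:3]              # 3D point in source frame
    p_s = K @ point_s                              # project with intrinsics
    return p_s[:2] / p_s[2]                        # dehomogenize to pixel coords
```

In the paper this projection gives continuous source coordinates, which are then sampled with differentiable bilinear interpolation to synthesize the target view.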
· The depth network is an encoder-decoder architecture. The encoder reduces spatial resolution while increasing the number of feature channels; conversely, the decoder increases spatial resolution while reducing the number of channels.
· Skip connections between encoder and decoder combine high-level decoder features that have undergone multiple convolutions with low-level encoder features that retain more accurate spatial information, so depth is inferred from richer information (see the sketch after this item).
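The following is a minimal PyTorch sketch of an encoder-decoder with one skip connection. It is illustrative only (the paper's DispNet-style network has many more levels) and assumes input height and width divisible by 4:

```python
import torch
import torch.nn as nn

class TinyDepthNet(nn.Module):
    """Minimal encoder-decoder sketch with one skip connection."""
    def __init__(self):
        super().__init__()
        # Encoder: halve resolution, grow channels.
        self.enc1 = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        # Decoder: double resolution, shrink channels.
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU())
        # After concatenating the skip feature, predict a disparity map.
        self.dec1 = nn.ConvTranspose2d(32 + 32, 1, 4, stride=2, padding=1)

    def forward(self, x):
        e1 = self.enc1(x)                    # low-level, spatially accurate
        e2 = self.enc2(e1)                   # high-level, low resolution
        d2 = self.dec2(e2)
        d2 = torch.cat([d2, e1], dim=1)      # skip connection: fuse both
        disp = torch.sigmoid(self.dec1(d2))  # disparity in (0, 1); depth ~ 1/disp
        return disp
```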
· The pose and explainability networks share a feature encoder. For pose, a few more convolutions are applied to the last shared feature map, and the output of global average pooling is the 6-DoF pose for each source view. For explainability, deconvolution layers upsample the shared features to produce an explainability map of the same size as the image.
· The explainability map acts as a per-pixel weight on the photometric error. It is designed to give low weight to areas where depth is difficult to estimate from a single image, e.g. due to occlusion or moving objects (a sketch of this two-headed network follows below).
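A minimal PyTorch sketch of this shared pose/explainability structure might look as follows. Layer sizes and names are illustrative, not the paper's exact architecture; like the paper, it takes the target frame concatenated with the source frames as input:

```python
import torch
import torch.nn as nn

class TinyPoseExpNet(nn.Module):
    """Sketch of a shared encoder with pose and explainability heads."""
    def __init__(self, num_source=2):
        super().__init__()
        self.num_source = num_source
        # Shared encoder over the channel-concatenated frames.
        self.encoder = nn.Sequential(
            nn.Conv2d(3 * (1 + num_source), 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        # Pose head: one more conv, then global average pooling to 6-DoF per source.
        self.pose_conv = nn.Conv2d(32, 6 * num_source, 1)
        # Explainability head: deconvolutions back up to input resolution.
        self.exp_dec = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, num_source, 4, stride=2, padding=1))

    def forward(self, frames):
        feat = self.encoder(frames)
        pose = self.pose_conv(feat).mean(dim=(2, 3))  # global average pooling
        pose = pose.view(-1, self.num_source, 6)      # (tx,ty,tz, rx,ry,rz)
        exp_mask = torch.sigmoid(self.exp_dec(feat))  # per-pixel weights in (0, 1)
        return pose, exp_mask
```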
· Loss: photometric loss + smoothness loss + explainability regularization loss.
· Photometric loss: each nearby source view is warped to the target view using the predicted depth map and relative pose, and the per-pixel reconstruction error (multiplied by the explainability weight) is summed over all pixels.
· To estimate depth from a global perspective, the depth network is designed as an encoder-decoder with a narrow bottleneck, and the loss is computed at multiple scales (a sketch of the combined loss follows below).
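Assuming the warped source views, explainability masks, and multi-scale disparities are already computed, a sketch of the combined objective could look like this. The tensor layout and the weights lambda_s, lambda_e are hypothetical; the explainability term mirrors the paper's cross-entropy prior that pushes the masks toward 1 to avoid the trivial all-zero solution:

```python
import torch

def total_loss(warped_srcs, target, exp_masks, multi_scale_disps,
               lambda_s=0.5, lambda_e=0.2):
    """warped_srcs / exp_masks: lists of (B,3,H,W) / (B,1,H,W) tensors,
    target: (B,3,H,W), multi_scale_disps: list of (B,1,h,w) tensors."""
    # Photometric loss: explainability-weighted L1 between each warped
    # source view and the target view.
    photo = sum((m * (w - target).abs()).mean()
                for w, m in zip(warped_srcs, exp_masks))
    # Smoothness loss: L1 norm of disparity gradients at every scale.
    smooth = sum((d[..., :, 1:] - d[..., :, :-1]).abs().mean() +
                 (d[..., 1:, :] - d[..., :-1, :]).abs().mean()
                 for d in multi_scale_disps)
    # Explainability regularizer: -log(m) penalizes masks near zero,
    # preventing the network from simply masking out every pixel.
    exp_reg = sum(-(m.clamp(min=1e-6)).log().mean() for m in exp_masks)
    return photo + lambda_s * smooth + lambda_e * exp_reg
```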
[Strengths] Clearly explain why these aspects of the paper are valuable.
· Data collection is relatively easy, since only unlabeled monocular video is required.
· It achieves performance comparable to supervised methods.
[Weaknesses] Clearly explain why these aspects of the paper are weak.
· It struggles with textureless regions and distant objects.
· It cannot be applied to unknown camera types or uncalibrated cameras (the intrinsics K must be known).
· It does not explicitly model scene dynamics or occlusions.
[Main Contribution] What is the contribution of the paper? Or novelty
· It was among the first works to introduce fully unsupervised learning for monocular depth and ego-motion estimation.
· It proposes a single model that jointly learns depth and pose end-to-end.