HFNet

Robotics/SLAM & Deep SLAM 2020. 1. 14. 17:47

From Coarse to Fine: Robust Hierarchical Localization at Large Scale

Paul-Edouard Sarlin, Cesar Cadena, Roland Siegwart, Marcin Dymczyk

Autonomous Systems Lab, ETH Zürich; Sevensense Robotics AG

Paper interpretation and summary

1. Introduction

- HF-Net is a monolithic CNN with a hierarchical structure that is used for 6-DoF localization.

It combines global retrieval with local feature matching. The hierarchical approach yields significant runtime savings and makes real-time operation possible.

- Recent leading approaches estimate correspondences between 2D keypoints in the query and 3D points in a sparse model, using local descriptors. This direct matching is either robust but intractable on mobile devices, or optimized for efficiency but fragile.

- The robustness of classical localization methods tends to be limited by the poor invariance of hand-crafted local features. Features from CNNs have recently emerged that are highly robust at a low compute cost.

- Image retrieval gives better results in terms of robustness and efficiency, but lags behind in accuracy. Scalability also remains an issue for city-scale localization.

- This paper tries to bridge the gap between robustness and efficiency within the hierarchical localization paradigm.

To summarize: 1) State of the art in large-scale localization, with outstanding robustness.

2) HF-Net is a monolithic neural network that efficiently predicts hierarchical features and is both fast and robust.

3) The runtime goal is achieved with a practical and efficient multi-task distillation.

2. Related Work

6 DoF visual localization

1) structure-based : 3D points in a 3D SfM model

2) image-based : 2D keypoints of query images

-> exhaustive matching and compute intensive.

As the model grows, perceptual aliasing arises, matching becomes ambiguous, and robustness is lost.

3) Directly regressing the pose from a single image is not competitive in terms of accuracy.

4) Image retrieval is similar to the image-based approach and can only give an approximate pose, up to the discretization of the database.

It is therefore not accurate enough. Even so, image retrieval is more robust than local matching and provides global, image-wide information.

(State-of-the-art image retrieval methods use large deep learning models.)

Scalable localization

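Features should be chosen so that they are inexpensive to extract, store, and match (presumably binary descriptors and the like).

Although this improves runtime, it can impair robustness and restrict operation to stable conditions.

A method has been proposed that searches at the map level using image retrieval and then localizes by matching hand-crafted local features against the retrieved 3D points.

Its efficiency and robustness are limited by the local descriptors and by the heterogeneous structure.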

Learned local features

CNNs provide dense pixel-wise features and therefore deliver strong performance in image matching.

CNN keypoints are hard to learn but outperform classical methods. SuperPoint is trained with self-supervision, while DELF learns its keypoints through an attention mechanism optimized for a landmark recognition task.

Deep learning on mobile

Deploying a developed model on mobile devices is not a trivial task.

Multi-task learning shares computation across tasks efficiently, without manual tuning.

Distillation can be used to reduce the required network size.

3. Hierarchical Localization

The goal is to increase localization robustness (within the computational requirements), and a hierarchical localization framework is used to do so.

Prior retrieval.

A coarse search at the map level is performed by matching the query against the database images using global descriptors. The k-nearest neighbors (NN), called prior frames, are candidate locations in the map. This search is efficient because there are far fewer database images than points in the SfM model.
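As a minimal sketch of this retrieval step, the snippet below ranks database images by descriptor distance. The function name, the tiny 2-D descriptors, and the brute-force search are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def retrieve_prior_frames(query_desc, db_descs, k=10):
    """Return indices of the k database images closest to the query.

    query_desc: (D,) global descriptor of the query image.
    db_descs:   (M, D) global descriptors of the database images.
    """
    # L2-normalise so that Euclidean ranking matches cosine ranking.
    q = query_desc / np.linalg.norm(query_desc)
    db = db_descs / np.linalg.norm(db_descs, axis=1, keepdims=True)
    dists = np.linalg.norm(db - q, axis=1)
    k = min(k, len(db))
    # Brute-force search: the cost is linear in the number of images,
    # not in the number of 3D points of the model.
    return np.argsort(dists)[:k]

# Toy database of four images with 2-D descriptors (hypothetical values).
db = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]])
prior = retrieve_prior_frames(np.array([1.0, 0.05]), db, k=2)
```

For a large database, an approximate nearest-neighbour index could replace the brute-force ranking without changing the interface.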

Covisibility clustering.

The prior frames are clustered according to the 3D structure that they co-observe. This amounts to finding the connected components, called places, in the covisibility graph that links database images to 3D points in the model.
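The clustering can be sketched as a connected-components search. The version below uses a small union-find structure over the prior frames; the data layout (a dict from frame id to the set of observed 3D point ids) is an assumption made for illustration.

```python
from collections import defaultdict

def cluster_by_covisibility(prior_frames, frame_to_points):
    """Group prior frames into places.

    Two frames are linked when they observe a common 3D point; a place is
    a connected component of the resulting graph.
    """
    parent = {f: f for f in prior_frames}

    def find(f):
        while parent[f] != f:
            parent[f] = parent[parent[f]]  # path halving
            f = parent[f]
        return f

    first_seen = {}  # 3D point id -> first prior frame that observed it
    for f in prior_frames:
        for p in frame_to_points[f]:
            if p in first_seen:
                parent[find(f)] = find(first_seen[p])
            else:
                first_seen[p] = f

    places = defaultdict(list)
    for f in prior_frames:
        places[find(f)].append(f)
    return sorted(places.values())

# Hypothetical observations: frames 0 and 1 share point 11, frames 2 and 3 share point 50.
observations = {0: {10, 11}, 1: {11, 12}, 2: {50}, 3: {50, 51}}
places = cluster_by_covisibility([0, 1, 2, 3], observations)
```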

Local feature matching.

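For each place in turn, the 2D keypoints detected in the query image are matched to the 3D points of that place. A 6-DoF pose is then estimated with PnP, with geometric consistency checked inside a RANSAC scheme. This local search is efficient because a place contains far fewer 3D points than the whole model. The algorithm stops as soon as a valid pose is found.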
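The control flow of this step, matching place by place and stopping at the first valid pose, might look like the sketch below. The ratio test, the `min_inliers` threshold, the place dictionary layout, and the `solve_pnp_ransac` callback are all assumptions for illustration; the pose solver itself is left to the caller (for example, a PnP + RANSAC routine from a vision library).

```python
import numpy as np

def match_descriptors(query_descs, point_descs, ratio=0.9):
    """Nearest-neighbour 2D-3D matching with a ratio test (assumed here).

    Returns (query keypoint index, 3D point index) pairs.
    Requires at least two 3D point descriptors.
    """
    dists = np.linalg.norm(query_descs[:, None] - point_descs[None], axis=2)
    order = np.argsort(dists, axis=1)
    rows = np.arange(len(query_descs))
    best, second = order[:, 0], order[:, 1]
    keep = dists[rows, best] < ratio * dists[rows, second]
    return [(int(i), int(best[i])) for i in rows[keep]]

def localize(query_descs, places, solve_pnp_ransac, min_inliers=12):
    """Try each place in order and stop at the first valid pose."""
    for place in places:
        matches = match_descriptors(query_descs, place["descriptors"])
        # Hypothetical callback: returns (pose or None, number of inliers).
        pose, num_inliers = solve_pnp_ransac(matches, place)
        if pose is not None and num_inliers >= min_inliers:
            return pose  # early exit: remaining places are never matched
    return None
```

Because each call only sees the descriptors of one place, the matching cost scales with the size of the place rather than with the size of the whole model.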

Discussion.

The previous hierarchical method uses MobileNetVLAD (MNV), a smaller network distilled from NetVLAD. This meets the runtime constraints while retaining only part of the accuracy of the original model. The local matching step relies on SIFT, which is rather expensive to compute. The method performs well in small-scale environments but does not scale well to larger, denser models. In addition, SIFT does not perform well under large illumination changes.

Lastly, a significant part of the computation of local and global descriptors is redundant, as they are both based on the image low-level clues. The heterogeneity of hand-crafted features and CNN image retrieval is thus computationally suboptimal and could be critical on resource-constrained platforms.

(Important: the global and the local descriptors are both computed from the same low-level image cues, so computing them separately is redundant.)

4. Proposed Approach

This section introduces ways to improve robustness, scalability, and efficiency. Learned features with a homogeneous network structure are used. Section 4.1 describes the architecture and Section 4.2 the novel training procedure.

SuperPoint has recently been shown to outperform SIFT in keypoint repeatability and descriptor matching. Some learned features are also sparser than SIFT, and fewer keypoints means faster matching.

In addition, on a GPU such network-based methods run inference faster than SIFT. The proposed localization system nevertheless still has a large computational bottleneck. To improve performance on mobile devices, the authors introduce a novel network for hierarchical features, HF-Net. It enables efficient coarse-to-fine localization: it detects keypoints and computes local and global descriptors in a single shot, which maximizes the sharing of computation.

4.1. HF-Net Architecture

hierarchical structure

- local and global features , low additional runtime costs.

single encoder (MobileNet) and three heads predicting:

i) keypoint detection scores. (SuperPoint)

ii) dense local descriptors. (SuperPoint)

iii) a global image-wide descriptor. (NetVLAD)

For the local features, the SuperPoint architecture is appealing for its efficiency, as it decodes the keypoints and local descriptors in a fixed non-learned manner. This is much faster than applying transposed convolutions to upsample the features.

Decoding uses a fixed, non-learned scheme rather than a learned one, which makes it more efficient (faster).
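To make "fixed, non-learned decoding" concrete, here is a sketch that assumes the SuperPoint detector layout: each coarse cell predicts scores for its 8x8 pixels plus one extra "no keypoint" channel, and the full-resolution score map is obtained by a softmax and a reshape. The function name and array layout are illustrative assumptions.

```python
import numpy as np

def decode_keypoint_scores(logits, cell=8):
    """Turn coarse detector logits into a full-resolution score map.

    logits: (Hc, Wc, cell*cell + 1); the last channel is the
            "no keypoint" bin (assumed SuperPoint-style layout).
    returns: (Hc*cell, Wc*cell) keypoint scores.
    """
    hc, wc, _ = logits.shape
    # Numerically stable softmax over the channel axis.
    e = np.exp(logits - logits.max(axis=2, keepdims=True))
    prob = e / e.sum(axis=2, keepdims=True)
    prob = prob[:, :, :-1]                  # drop the "no keypoint" bin
    # Depth-to-space: spread each cell's 64 channels over its 8x8 pixels.
    prob = prob.reshape(hc, wc, cell, cell)
    prob = prob.transpose(0, 2, 1, 3)       # (Hc, cell, Wc, cell)
    return prob.reshape(hc * cell, wc * cell)

# Toy input: one confident detection in cell (1, 2), at row 3, column 5 of the cell.
logits = np.zeros((2, 3, 65))
logits[1, 2, 3 * 8 + 5] = 20.0
scores = decode_keypoint_scores(logits)
```

No weights are involved in this upsampling, which is why it is cheaper than a learned decoder built from transposed convolutions.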

It predicts dense descriptors which are fast to sample bilinearly, resulting in a runtime independent from the number of detected keypoints.

Descriptors are sampled bilinearly, so the runtime does not depend on the number of keypoints.
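A minimal sketch of bilinear descriptor sampling follows. It assumes the keypoint coordinates are already expressed in the coordinate frame of the dense descriptor map (a real pipeline would first rescale from image to map resolution) and that the map is at least 2x2; names are illustrative.

```python
import numpy as np

def sample_descriptors(dense, keypoints, normalize=True):
    """Bilinearly interpolate a dense descriptor map at keypoint locations.

    dense:     (H, W, D) descriptor map, with H, W >= 2.
    keypoints: (N, 2) float array of (x, y) positions in map coordinates.
    returns:   (N, D) descriptors, L2-normalised if requested.
    """
    h, w, _ = dense.shape
    x = np.clip(keypoints[:, 0], 0, w - 1)
    y = np.clip(keypoints[:, 1], 0, h - 1)
    # Top-left corner of the 2x2 neighbourhood, kept inside the map.
    x0 = np.clip(np.floor(x).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, h - 2)
    ax = (x - x0)[:, None]
    ay = (y - y0)[:, None]
    top = (1 - ax) * dense[y0, x0] + ax * dense[y0, x0 + 1]
    bottom = (1 - ax) * dense[y0 + 1, x0] + ax * dense[y0 + 1, x0 + 1]
    out = (1 - ay) * top + ay * bottom
    if normalize:
        out = out / np.linalg.norm(out, axis=1, keepdims=True)
    return out

# Toy map whose descriptor at (x, y) is [x, y, 1], so interpolation is easy to check.
ys, xs = np.mgrid[0:4, 0:5].astype(float)
dense_map = np.stack([xs, ys, np.ones_like(xs)], axis=2)
kpts = np.array([[0.5, 1.25], [4.0, 3.0]])
raw = sample_descriptors(dense_map, kpts, normalize=False)
```

The dense map is computed once per image; sampling adds only a few array lookups per keypoint, whereas a patch-based network runs once per keypoint.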

On the other hand, patch-based architectures like LF-Net [38] apply a Siamese network to image patches centered at all keypoint locations, resulting in a computational cost proportional to the number of detections.

In contrast, a patch-based method such as LF-Net applies a Siamese network to an image patch at every keypoint, so its cost grows with the number of detections.

The local feature heads branch out from the MobileNet encoder at an earlier stage than the global head, as a higher spatial resolution is required to retain spatially discriminative features, and local features are on a lower semantic level than image-wide descriptors.

Local features need a high spatial resolution, so their heads leave the encoder at an earlier stage.

The global head produces an image-wide descriptor, whereas the local features describe the image at a lower semantic level.

4.2. Training Process

Data scarcity: there is currently no dataset that

i) exhibits a sufficient perceptual diversity at the global image level

ii) contains ground truth local correspondences between matching images

These correspondences are often recovered from the dense depth [38] computed from an SfM model [47, 49], which is intractable to build at the scale required by image retrieval.

Data augmentation

Self-supervised methods that do not rely on correspondences, such as SuperPoint, require heavy data augmentation, which is key to the invariance of the local descriptor. While data augmentation often captures well the variations in the real world at the local level, it can break the global consistency of the image and make the learning of the global descriptor very challenging.

Multi-task distillation

Multi-task distillation is our solution to this data problem. We employ distillation to learn the representation directly from an off-the-shelf trained teacher model. This alleviates the above issues, with a simpler and more flexible training setup that allows the use of arbitrary datasets, as an infinite amount of labeled data can be obtained from the inference of the teacher network. Directly learning to predict the output of the teacher network additionally eases the learning task, allowing to directly train a smaller student network. We note an interesting similarity with SuperPoint, whose detector is trained by bootstrapping, supervised by itself through the different training runs. This process could also be referred to as self-distillation, and shows the effectiveness of distillation as a practical training scheme. The supervision of local and global features can originate from different teacher networks, resulting in a multi-task distillation training that allows to leverage state-of-the-art teachers. Recent advances [23] in multi-task learning enable a student s to optimally copy all teachers t_{1,2,3} without any manual tuning of the weights that balance the loss:
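The loss itself is not reproduced in these notes. A reconstruction that is consistent with the symbols defined in the next paragraph, assuming the learned-weighting form of [23] with subscript s for the student and t_i for the teachers, would be:

```latex
L = e^{-w_1} \left\lVert d^{g}_{s} - d^{g}_{t_1} \right\rVert_2^2
  + e^{-w_2} \left\lVert d^{l}_{s} - d^{l}_{t_2} \right\rVert_2^2
  + 2\, e^{-w_3}\, \mathrm{CrossEntropy}\!\left(p_{s},\, p_{t_3}\right)
  + \sum_{i} w_i
```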

where d^g and d^l are global and local descriptors, p are keypoint scores, and w_{1,2,3} are optimized variables. More generally, our formulation of the multi-task distillation can be applied to any application that requires multiple predictions while remaining computationally efficient, particularly in settings where ground truth data for all tasks is expensive to collect. It could also be applied to some hand-crafted descriptors deemed too compute-intensive.
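The following NumPy sketch evaluates a loss of this kind for a single image. It assumes squared L2 distances for both descriptor terms, a cross-entropy for the keypoint scores, and an exp(-w_i) weighting with the sum of w_i as a regularizer; the averaging over locations and the dictionary layout are illustrative choices, not taken from the paper.

```python
import numpy as np

def distillation_loss(student, teacher, w, eps=1e-12):
    """Weighted multi-task distillation loss (assumed form, see above).

    student / teacher: dicts with
      "global": (Dg,)   global descriptor,
      "local":  (N, Dl) local descriptors at N locations,
      "scores": (N, C)  keypoint score distributions (rows sum to 1).
    w: three task weights w_1, w_2, w_3 (trainable variables in practice).
    """
    w = np.asarray(w, dtype=float)
    global_term = np.sum((student["global"] - teacher["global"]) ** 2)
    local_term = np.mean(
        np.sum((student["local"] - teacher["local"]) ** 2, axis=-1))
    p_s = np.clip(student["scores"], eps, 1.0)  # avoid log(0)
    cross_entropy = -np.mean(
        np.sum(teacher["scores"] * np.log(p_s), axis=-1))
    # exp(-w_i) down-weights a task, while the +w_i term stops the
    # weights from growing without bound.
    return (np.exp(-w[0]) * global_term
            + np.exp(-w[1]) * local_term
            + 2.0 * np.exp(-w[2]) * cross_entropy
            + w.sum())

# Hypothetical teacher outputs for a tiny example.
teacher_out = {
    "global": np.array([1.0, 0.0]),
    "local": np.array([[0.0, 1.0], [1.0, 0.0]]),
    "scores": np.array([[1.0, 0.0], [0.0, 1.0]]),
}
perfect = distillation_loss(teacher_out, teacher_out, w=[0.1, 0.2, 0.3])
```

When the student reproduces the teachers exactly, only the regularizer remains, so `perfect` equals the sum of the weights.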

(Think more about how to interpret these two.)

Experiments

We start our evaluation by investigating the performance of local matching methods under different settings on two datasets, HPatches [4] and SfM [38], that provide dense ground truth correspondences between image pairs for both 2D and 3D scenes.

Datasets

HPatches [4] contains 116 planar scenes containing illumination and viewpoint changes with 5 image pairs per scene and ground truth homographies. SfM is a dataset built by [38] composed of photo-tourism collections collected by [19, 53].

Ground truth correspondences are obtained from dense per-image depth maps and relative 6-DoF poses, computed using COLMAP [47].
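How depth and relative pose yield correspondences can be sketched with a pinhole camera model: back-project a pixel using its depth, move the 3D point into the second camera, and project it again. The function name, the intrinsics, and the convention that (R, t) maps camera-1 coordinates to camera-2 coordinates are assumptions for this example.

```python
import numpy as np

def warp_pixels(pixels, depth, K1, K2, R, t):
    """Map pixels of image 1 to image 2 using depth and relative pose.

    pixels: (N, 2) pixel coordinates (x, y) in image 1.
    depth:  (N,) depth of each pixel along the camera-1 z axis.
    K1, K2: (3, 3) pinhole intrinsics of the two cameras.
    R, t:   rotation (3, 3) and translation (3,) from camera 1 to camera 2.
    returns: (N, 2) pixel coordinates in image 2.
    """
    ones = np.ones((len(pixels), 1))
    homogeneous = np.hstack([pixels, ones]).T   # (3, N)
    rays = np.linalg.inv(K1) @ homogeneous      # rays with z = 1
    points_cam1 = rays * depth                  # scale each ray by its depth
    points_cam2 = R @ points_cam1 + t.reshape(3, 1)
    projected = K2 @ points_cam2
    return (projected[:2] / projected[2]).T

# Hypothetical setup: identical cameras, second camera shifted along x.
K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
warped = warp_pixels(np.array([[50.0, 50.0]]), np.array([2.0]),
                     K, K, np.eye(3), np.array([1.0, 0.0, 0.0]))
```

A full pipeline would also discard points that fall outside the second image or that are occluded according to the second depth map; those checks are omitted here.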

We select 10 sequences for our evaluation and for each randomly sample 50 image pairs with a given minimum overlap.

A metric scale cannot be recovered with SfM reconstruction but is important to compute localization metrics. We therefore manually label each SfM model using metric distances measured in Google Maps.

Metrics.
