SLAM History Book: Tracing the Lineage of Spatial AI

# Ch.0 — SLAM Solved?

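In 2026, you hold up a phone and an AR layer sticks to the wall. Indoor delivery robots tell the kitchen from the meeting room without being handed a map. Throw a few photos at a [DUSt3R](https://arxiv.org/abs/2312.14132)-family model and 3D structure comes out within seconds. These are products now rather than demos, and for the most part they have faded into the background. Hence the mood that SLAM is, by and large, a solved problem.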

---

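Go back to 2003 and the landscape is different. In a lab at the University of Oxford, Andrew Davison demonstrated real-time 3D tracking with one laptop and one webcam. The system, called [MonoSLAM](https://www.doc.ic.ac.uk/~ajd/Publications/davison_iccv2003.pdf), ran at 30 Hz on desktop-class hardware, attending to about ten features per frame and maintaining a sparse map of a few dozen landmarks. One desk in one room. When the camera left the desk, the map diverged. That was the state of the art.

The best of that era was on the order of one several-hundredth of the number of feature points that phone AR tracks at any instant today, and closing that gap took 23 years. *Which path* closed it is what this book cares about.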

---

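The history of SLAM is the trace of four distinct traditions that advanced independently, then collided and absorbed one another. Photogrammetrists were solving bundle adjustment by hand 100 years ago. Roboticists began treating maps in the language of probability with the stochastic spatial-relationship framework of [Smith-Cheeseman](https://arxiv.org/abs/1304.3111) in 1986, and the name "SLAM" was attached to this problem setting nine years later, in [Durrant-Whyte & Leonard's 1995 survey](https://ieeexplore.ieee.org/document/476131). Computer vision researchers were obsessed with real-time feature tracking. And the deep learning community of the 2020s is trying to absorb all of it into a single network.

The question this book asks is not "how" but "why this way." Was the replacement of EKF-based SLAM by graph-based SLAM a natural evolution of the technology, or a contingency decided by the choices of a few people? Was the split between feature-based and direct methods foreseeable from the start? Why has deep learning been so slow to replace the geometry pipeline? A counterfactual is a meaningful question only when the alternatives actually existed. This book shows that they did.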

---

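Tracing that path takes tools. List only the years and you get a chronicle; explain only the techniques and you get a textbook. This book reads the history through two lenses, lineage and prediction. Where did an idea come from? How did the future that researchers saw from where they stood diverge from the future that actually unfolded?

The book uses four recurring devices. You can use them as guides while reading each chapter.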

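**Lineage opening** occupies the first paragraph or two of each chapter. It shows, through people and years, what intellectual inheritance the chapter's protagonist received. No idea in SLAM was born in a vacuum. Look at the lineage and the terrain of borrowing comes into view.

**🔗 Borrowing box** is a margin note that states in a sentence or two where a particular technique came from, as in "This structure in ORB-SLAM comes from Strasdat 2011." Researchers cite, but often do not spell out the lineage. This box does.

**📜 Prophecy vs. reality box** contrasts what the original paper's Conclusion, Future Work, or Summary section pointed to with what actually happened. Where the BA synthesis paper of [Triggs 1999](https://dblp.org/rec/conf/dagstuhl/TriggsMHF99.html) left the exploitation of large-scale sparse structure as a key guideline in §12 "Summary and Recommendations," [COLMAP](https://openaccess.thecvf.com/content_cvpr_2016/papers/Schonberger_Structure-From-Motion_Revisited_CVPR_2016_paper.pdf) filled that slot from a different angle in the 2010s by turning SfM over tens of thousands of images into a practical open-source tool. The direction of the prediction was right, but the path was different. The gap between the future a researcher saw at the time and the actual future is what this device is about.

**🧭 Still open** sits at the end of each chapter. It lists the items within the chapter's topic that remain unsolved as of 2026, pulling out the open problems hidden inside the perception that SLAM is solved. Ch.19 harvests these items across all chapters and reorganizes them.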

---

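The book has six parts.

**Part 1: Prehistory** traces the tools that photogrammetry and classical computer vision built up before SLAM was born in robotics. Why is bundle adjustment still the skeleton of every optimization backend?

**Part 2: The Birth of SLAM** follows the period in which robots first began building maps on their own, from Smith-Cheeseman's stochastic framework of 1986 to Davison's MonoSLAM. The problem was set up in 1986; the acronym "SLAM" and its standard terminology took hold in the community starting from Durrant-Whyte and Leonard's 1995 survey. How did the EKF become the dominant paradigm, and why were its limits structural?

**Part 3: The Parallel Revolution** covers the years from 2007, when PTAM separated mapping from camera tracking, through graph-based SLAM and loop closure to ORB-SLAM. This is the decade in which "real-time SLAM" became possible on a desktop.

**Part 4: Methodological Divergence** covers the fork between feature-based and direct methods, the arrival of RGB-D, and the branching of place recognition into its own subfield. How did different assumptions produce different ecosystems?

**Part 5: The Influx of Learning** runs from monocular depth estimation to end-to-end SLAM, Neural Radiance Fields, and 3D Gaussian Splatting. Its subject is the pace at which deep learning absorbs the geometry pipeline, and the sources of friction.

**Part 6: Dead Ends and Open Problems** brings out the failed paths in SLAM's history and the structural unsolved problems that remain behind today's perception that it is "solved."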

---

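Setting the scope is what makes this a map. Questions such as whether foundation models will replace SLAM are not this book's concern. Its material is what happened in the past and why. The goal is closer to showing what each choice meant under the constraints of its time. The reader is assumed to already know homogeneous coordinates, epipolar geometry, and the EKF equations. Tracing lineage is this book's job; which camera or LiDAR to pick is a topic for another book.

If you want a systematic treatment down to equations, theorems, and proofs, there is the [SLAM Handbook](https://github.com/SLAM-Handbook-contributors/slam-handbook-public-release). Edited by Carlone, Kim, Barfoot, Cremers, and Dellaert and published by Cambridge University Press in 2026, it surveys the current theory and systems of SLAM across 18 chapters. This book records the path that led to that state.

One of the maxims the five editors jointly left in that Handbook's Epilogue is *"If someone tells you 'SLAM is solved,' don't listen to them."* The "mood that SLAM is a solved problem" mentioned at the opening of this chapter is something to be observed within the field, not the field's consensus.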

---

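When Davison stood in front of that webcam in 2003, he did not know exactly what he was starting. The demo video is still on the internet: a shaky image, blinking landmark points, a sparse map of a few dozen features. This book records what happened between there and here.

That record begins long before MonoSLAM. Before the acronym "SLAM" settled in with the 1995 survey, before Smith and Cheeseman even wrote probabilistic maps down as equations, photogrammetrists were already recovering 3D structure with cameras. The next chapter traces that prehistory.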

---

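# Ch.1 – Photogrammetry and Bundle Adjustment: The 100 Years Before Triggs

The skeleton of today's SLAM optimization backends was born in German surveying. The methodology by which Carl Pulfrich, in the early 20th century, worked out two-view triangulation by hand on glass plates combined with Albrecht Meydenbauer's system of photogrammetry to form a single surveying tradition. That tradition passed through Duane C. Brown's numerical formulation in 1958 and, in 1999, was translated into the language of computer vision by Bill Triggs, Philip McLauchlan, Richard Hartley, and Andrew Fitzgibbon. The 1999 synthesis of bundle adjustment carried a century-old surveying heritage into a language the computer vision community could use. Triggs et al. (1999) inherited the parallax principle from Pulfrich's geometry and the reprojection formulation from Brown's (1958) military surveying work. Levenberg-Marquardt supplied the solver skeleton.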

---

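## 1. Glass Plates and Stereophotogrammetry in the Early 20th Century

In 1901, at a meeting of natural scientists in Hamburg, [Carl Pulfrich](https://en.wikipedia.org/wiki/Carl_Pulfrich) presented the **stereocomparator** built at the Zeiss optical laboratory (the formal unveiling, following a prototype stereo rangefinder shown in Munich in 1899). The instrument imaged the same point from two camera viewpoints and computed distance by reading the coordinate difference on glass plates. The principle was simple: the parallax between two viewpoints is inversely proportional to depth. The mathematics was the trigonometry of the ancient Greeks; what was new was the precision of the optical instrument.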

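A generation earlier, [Albrecht Meydenbauer](https://de.wikipedia.org/wiki/Albrecht_Meydenbauer) had systematized **architectural photogrammetry** for the preservation of buildings. In 1858, after nearly falling while surveying the facade of Wetzlar Cathedral, he conceived the idea that photographs could do the job instead. In 1885 he founded the Royal Prussian Photogrammetric Institute (Königlich Preussische Messbild-Anstalt).

The tradition formed by merging these two streams led into 20th-century aerial surveying: aerotriangulation, in which terrain is photographed from an aircraft and a three-dimensional map is built from two-view photographs. It was the age of manual calculators.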

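> 🔗 **Borrowing.** Stereo depth estimation in modern SLAM rests on the same principle as Pulfrich's stereocomparator: depth follows from the baseline between two cameras and the disparity. The glass plate of 125 years ago has simply become a pixel array.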
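
The inverse relation between parallax and depth can be made concrete with a minimal sketch. It assumes an idealized rectified stereo pair; the function name and the numeric values (focal length in pixels, baseline in meters) are illustrative choices, not taken from any system discussed here.

```python
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth of a point seen by a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of the rig")
    return focal_px * baseline_m / disparity_px


# Hypothetical rig: 700 px focal length, 12 cm baseline.
near = depth_from_disparity(700.0, 0.12, 42.0)
# Halving the disparity doubles the depth.
far = depth_from_disparity(700.0, 0.12, 21.0)
```

Because depth varies as the reciprocal of disparity, a fixed one-pixel reading error costs far more accuracy on distant points than on near ones.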

---

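## 2. Brown, 1958, and Numerical Bundle Adjustment

The problem Pulfrich and Meydenbauer solved with optical instruments, Brown moved into equations.

[Duane C. Brown](https://digital.hagley.org/08206139_solution) was a surveying engineer in the U.S. Air Force ballistic missile development program. He worked on jointly estimating satellite orbits and ground coordinates, that is, on simultaneously optimizing many camera stations and many ground control points.

In the 1958 report "A Solution to the General Problem of Multiple Station Analytical Stereotriangulation" (RCA-MTP Data Reduction Technical Report No. 43, AFMTC-TR-58-8), Brown left one of the earliest numerical formulations of **bundle adjustment** (Helmut Schmid, working in the same period, is also credited as a co-inventor).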

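At its core is the **reprojection error**. One minimizes the difference between the 2D image coordinates $x_{ij} \in \mathbb{R}^2$ observed in camera $i$ and the predicted coordinates $\pi(K_i, R_i, t_i, X_j)$ obtained by projecting the 3D point $X_j \in \mathbb{R}^3$ with intrinsic matrix $K_i$ and extrinsic matrix $[R_i | t_i]$:

$$E = \sum_{i,j} \| x_{ij} - \pi(K_i, R_i, t_i, X_j) \|^2$$

The name "bundle" comes from the bundle of rays that fan out from each camera center toward the observed 3D points. Camera poses and point positions are adjusted together so that those rays intersect at the 3D points. It took 40 years for a technique that began in military and intelligence applications to be absorbed into academia.

> 🔗 **Borrowing.** Bundle techniques from satellite geolocation flowed into the computer vision community from the 1990s onward. While the work was held under military classification, academia independently rediscovered the same problem. Triggs 1999 is where the two streams meet.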
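
To make the reprojection error $E$ of this section concrete, here is a minimal NumPy sketch. It assumes a plain pinhole model without lens distortion and, for brevity, a single shared intrinsic matrix `K` instead of per-camera $K_i$; the function names and the toy scene are illustrative.

```python
import numpy as np


def project(K, R, t, X):
    """Pinhole projection pi(K, R, t, X) of one 3D point to pixel coordinates."""
    x_cam = R @ X + t            # world frame -> camera frame
    x_img = K @ x_cam            # camera frame -> homogeneous pixel coordinates
    return x_img[:2] / x_img[2]  # perspective division


def reprojection_error(K, poses, points, observations):
    """Sum of squared residuals E over observations {(i, j): x_ij}."""
    total = 0.0
    for (i, j), x_ij in observations.items():
        R, t = poses[i]
        r = x_ij - project(K, R, t, points[j])
        total += float(r @ r)
    return total


K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
poses = [(np.eye(3), np.zeros(3)), (np.eye(3), np.array([-0.5, 0.0, 0.0]))]
points = [np.array([0.0, 0.0, 5.0]), np.array([1.0, -1.0, 4.0])]

# Noise-free observations generated from the model itself, so E is zero.
obs = {(i, j): project(K, *poses[i], points[j]) for i in range(2) for j in range(2)}
E_exact = reprojection_error(K, poses, points, obs)

# Shifting one observation by (3, 4) pixels adds 3^2 + 4^2 = 25 to E.
obs[(0, 0)] = obs[(0, 0)] + np.array([3.0, 4.0])
E_shifted = reprojection_error(K, poses, points, obs)
```

Bundle adjustment treats `poses` and `points` as the unknowns and searches for the values that make this sum as small as possible.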

---

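## 3. Levenberg and Marquardt – Pioneers of Nonlinear Optimization

If Brown now had in hand the objective function to minimize, the tool for actually solving it came from somewhere else entirely.

Minimizing reprojection error is a nonlinear least-squares problem. It has no closed-form solution, so iterative numerical optimization is required.

In 1944, [Kenneth Levenberg](https://cs.uwaterloo.ca/~y328yu/classics/levenberg.pdf) published a method that interpolates between Gauss-Newton and steepest descent through a damping parameter $\lambda$. The larger $\lambda$ is, the closer the step is to steepest descent and the more safely it converges; the smaller $\lambda$ is, the more the method exploits the fast convergence of Gauss-Newton. In equations, the strategy adds $\lambda \mathbf{I}$ to the Gauss-Newton normal matrix $\mathbf{J}^\top \mathbf{J}$, which also improves numerical stability. This was some 20 years before computer vision took shape as a field. In 1963, [Donald Marquardt](https://epubs.siam.org/doi/10.1137/0111030) independently rediscovered the same idea and formulated it more explicitly. The name **Levenberg-Marquardt (LM) algorithm** stuck.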
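
The damping idea can be sketched in a few lines. The following is a minimal illustration, not a production solver: it uses the plain $\lambda \mathbf{I}$ damping described above, a simple accept/reject rule with a factor of 10 (an assumed schedule), and a toy curve-fitting problem chosen only because it is small.

```python
import numpy as np


def levenberg_marquardt(residual, jacobian, p0, lam=1e-2, iters=50):
    """Minimize ||residual(p)||^2 with lambda * I damping on J^T J."""
    p = np.asarray(p0, dtype=float)
    cost = float(residual(p) @ residual(p))
    for _ in range(iters):
        r, J = residual(p), jacobian(p)
        H = J.T @ J                                   # Gauss-Newton normal matrix
        step = np.linalg.solve(H + lam * np.eye(len(p)), -J.T @ r)
        r_new = residual(p + step)
        new_cost = float(r_new @ r_new)
        if new_cost < cost:   # accept the step and move toward Gauss-Newton
            p, cost, lam = p + step, new_cost, lam / 10.0
        else:                 # reject the step and move toward steepest descent
            lam *= 10.0
    return p


# Toy problem: recover (a, b) in y = a * exp(b * x) from noise-free samples.
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(-1.5 * x)
residual = lambda p: p[0] * np.exp(p[1] * x) - y
jacobian = lambda p: np.stack([np.exp(p[1] * x), p[0] * x * np.exp(p[1] * x)], axis=1)
p_hat = levenberg_marquardt(residual, jacobian, [1.0, 0.0])
```

In bundle adjustment the same loop runs with the reprojection residuals in place of the toy residual, and the linear solve is where most of the engineering effort goes.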

It took about 35 more years for the LM algorithm to become the standard BA solver in computer vision. The walls between fields account for that time.

---

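## 4. Triggs et al., 1999 – Unifying a 100-Year Heritage

Thirty-five years after Levenberg-Marquardt had the numerical tool ready, computer vision finally picked it up.

At the 1999 Vision Algorithms Workshop, Bill Triggs, Philip McLauchlan, Richard Hartley, and Andrew Fitzgibbon presented ["Bundle Adjustment — A Modern Synthesis"](https://link.springer.com/chapter/10.1007/3-540-44480-7_21).

What the paper did was translate and synthesize, in the language of the computer vision community, the BA theory that had been scattered across 20th-century surveying and aerial photogrammetry. Triggs et al. made two contributions. First, they made the structural properties of sparse BA explicit: exploiting the sparse block structure of the Hessian (the Schur complement trick) makes joint camera-point optimization far more efficient. Second, they dealt explicitly with gauge freedom, the arbitrariness of the reference frame.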
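
The Schur complement trick can be illustrated on a toy system. The sketch below assumes made-up sizes (2 cameras and 4 points with 3 parameters each, whereas real cameras have 6 or more) and random Jacobian entries; it only demonstrates that eliminating the block-diagonal point block gives the same update as solving the full damped system.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cams, n_pts, cam_dim, pt_dim = 2, 4, 3, 3   # toy sizes, chosen for illustration
nc = n_cams * cam_dim
n_all = nc + n_pts * pt_dim

# Each observation (camera i, point j) contributes 2 residual rows that touch
# only camera i's columns and point j's columns.
rows = []
for i in range(n_cams):
    for j in range(n_pts):
        block = np.zeros((2, n_all))
        block[:, i * cam_dim:(i + 1) * cam_dim] = rng.standard_normal((2, cam_dim))
        block[:, nc + j * pt_dim:nc + (j + 1) * pt_dim] = rng.standard_normal((2, pt_dim))
        rows.append(block)
J = np.vstack(rows)
H = J.T @ J + 1e-2 * np.eye(n_all)   # damped normal matrix, as in LM
g = rng.standard_normal(n_all)       # stand-in for the right-hand side -J^T r

B, E, C = H[:nc, :nc], H[:nc, nc:], H[nc:, nc:]

# C is block diagonal (one 3x3 block per point), so it is inverted block by block.
C_inv = np.zeros_like(C)
for j in range(n_pts):
    s = slice(j * pt_dim, (j + 1) * pt_dim)
    C_inv[s, s] = np.linalg.inv(C[s, s])

S = B - E @ C_inv @ E.T                                   # reduced camera system
d_cam = np.linalg.solve(S, g[:nc] - E @ C_inv @ g[nc:])   # small dense solve
d_pts = C_inv @ (g[nc:] - E.T @ d_cam)                    # back-substitution
delta = np.concatenate([d_cam, d_pts])
```

The payoff comes from scale: with thousands of cameras and millions of points, the reduced system has only camera-sized dimensions, while the point block never needs more than small per-point inversions.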

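Seven years after the paper, Noah Snavely's [Photo Tourism (2006)](https://phototour.cs.washington.edu/Photo_Tourism.pdf) automatically reconstructed famous landmarks such as Notre Dame and the Trevi Fountain from hundreds of photos scattered across the internet. Ten years after that, Johannes Schönberger's [COLMAP (2016)](https://openaccess.thecvf.com/content_cvpr_2016/papers/Schonberger_Structure-From-Motion_Revisited_CVPR_2016_paper.pdf) released robust incremental SfM at the scale of tens to hundreds of thousands of images as open source, bringing a research line that had already reached the million-image range into a tool anyone could reproduce. Without Triggs's language, that path would have been much slower.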

---

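## 5. Reprojection Error – How the Concept Took Shape

Triggs et al. described which error function is minimized; how that function itself settled into its present form is worth tracing separately.

It took two transitions for this error function to reach its current form.

Aerial triangulators of the early 20th century measured error as "distance differences in the ground coordinate system." Because the comparison was made directly in 3D space, a misaligned lens or poor calibration was absorbed into the ground-coordinate residuals and stayed invisible.

In his 1958 report, Brown moved the comparison to the image plane: match "where the 3D point projects in the image" against "the actual image observation," in image coordinates (pixels, in today's terms). Calibration error, lens distortion, and extrinsic parameter error then all show up together in a single residual. It is also statistically cleaner. Camera image noise can be modeled as an isotropic Gaussian in pixel units, and under that model minimizing reprojection error is equivalent to maximum-likelihood estimation.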
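
As a short worked step behind the maximum-likelihood claim, assume (the noise symbol $\epsilon_{ij}$ and standard deviation $\sigma$ are introduced here for illustration) that each observation equals the model prediction plus independent isotropic Gaussian noise:

$$x_{ij} = \pi(K_i, R_i, t_i, X_j) + \epsilon_{ij}, \qquad \epsilon_{ij} \sim \mathcal{N}(0, \sigma^2 \mathbf{I}_2)$$

The negative log-likelihood of all observations is then

$$-\log p(\{x_{ij}\}) = \frac{1}{2\sigma^2} \sum_{i,j} \| x_{ij} - \pi(K_i, R_i, t_i, X_j) \|^2 + \text{const} = \frac{E}{2\sigma^2} + \text{const}$$

so the parameters that maximize the likelihood are exactly those that minimize the reprojection error $E$.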

Triggs et al. (1999) refined that formula into the language of computer vision textbooks and standardized it. As of 2026, this reprojection error minimization is the core measurement function of factor-graph-based SLAM backends.

> 🔗 **Borrowing.** The observation model for a visual landmark in SLAM, $z = \pi(K, T, p) + \epsilon$, directly inherits Brown's (1958) reprojection formula. A SLAM backend that minimizes it with Gauss-Newton has the same mathematical structure as the aerial triangulation solver of 1958.

---

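## 6. The Skeleton of the SLAM Backend – Through 2026

The acronym "SLAM" itself was established as standard terminology in [Durrant-Whyte and Leonard's survey](https://ieeexplore.ieee.org/document/476131) of 1995, but the mathematics of its backend is inherited almost unchanged from Brown's 1958 reprojection formula, which this chapter has been tracing. Look at today's SLAM optimization backends. ORB-SLAM3 jointly optimizes SE(3) poses and 3D landmark positions through g2o. LIO-SAM runs incremental nonlinear least-squares smoothing (iSAM2) on a GTSAM factor graph. DROID-SLAM obtains its update direction from a GRU-based optical flow module, yet its final bundle adjustment layer still uses the Schur complement trick.

Lie groups and factor graphs have replaced the matrix notation of 1999, and neural networks have taken over descriptor computation, but the essence of the computation is unchanged: minimize the reprojection error of points observed from many viewpoints to estimate camera poses and the map together. Pulfrich's glass plates became pixel arrays and hand calculation became GPUs; that is all.

This continuity is both the field's strength and its weakness. Strength: 100 years of convergence proofs and practical validation come for free. Weakness: when BA's premises (static world, point features, Gaussian noise) clash with the real environment, there is no alternative.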

---

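> 📜 **Prophecy vs. reality.** Triggs et al. (1999) are widely read as having named the extension to large-scale BA, with thousands of cameras and millions of points, as a major challenge. That direction was achieved over the following 20 years. In 2006 Snavely's Photo Tourism reconstructed landmarks from hundreds of internet photos, and in 2016 COLMAP standardized a robust incremental SfM implementation of that line of work. It was not, however, the "direct scaling" Triggs had imagined. It was the product of an engineering layer: incremental BA and visibility graph pruning, topped with vocabulary-tree loop closure. `[Hit]`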

---

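## 🧭 Still Open

**Global optimum guarantees for nonlinear BA.** The LM algorithm converges to a local minimum. With a bad initial value it converges to the wrong structure. Initialization methods appeared one after another (the 5-point algorithm, PnP, epipolar geometry estimation), but they too depend internally on RANSAC and iterative optimization. Approaches based on convex relaxation that guarantee global optimality in large-scale environments are being studied, but they have not yet become practical at the speed and scale of real-time SLAM.

**The gap between photogrammetric precision and Visual SLAM.** Aerial photogrammetry demands sub-pixel accuracy (0.1 pixel or better) as standard. It has calibrated cameras and high-quality GCPs (ground control points), and optimization runs offline. Real-time Visual SLAM uses the same mathematical structure but operates under the constraints of GPS-denied environments, low-resolution cameras, and immediate estimation. The environments in which Visual SLAM systematically reaches surveying-grade accuracy standards (RMSE below 5 cm at a range of 500 m) are limited, and attempts to unify the two fields' accuracy standards in a single framework are ongoing.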

---

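BA's premises (static world, point features, Gaussian noise) begin to break down when the camera meets moving objects. Surveyors measured bridges, not robot soccer fields. That crack begins in Ch.2: the period when computer vision was moving beyond simple feature matching, and the first attempts by Harris corners and optical flow to carry this heritage into real time.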

---

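# Ch.2 – The Classical CV Toolbox: From Harris to SIFT, and on to ORB

The bundle adjustment of Ch.1 addressed the backend problem of jointly optimizing camera poses and 3D points. But for that optimization to work, "corresponding points" must first be found in the images. Surveyors planted targets in the field themselves; computer vision had to hand that role to algorithms. Feature detection and description began as that problem.

In the late 1970s, Hans Moravec tried to find salient points in the environment with a camera in the Stanford Cart project. The work was written up in his 1980 Stanford doctoral thesis, ["Obstacle Avoidance and Navigation in the Real World by a Seeing Robot Rover"](https://frc.ri.cmu.edu/~hpm/project.archive/robot.papers/1975.cart/1980.html.thesis/index.html). The intuition that richly textured corners are good to track was there, but a mathematical definition was not. In 1988, eleven years after Moravec's 1977 operator, Chris Harris and Mike Stephens formalized that intuition through the eigenvalues of the autocorrelation matrix. Lucas and Kanade had set up the framework for pixel tracking seven years before that. Lowe absorbed both ideas and built a descriptor invariant to scale and rotation. Rublee did the same job faster and without a patent. The SLAM front-end runs on this lineage.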

---

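## 2.1 The Concept of a Corner: From Moravec to Harris

A point where the image patch changes a lot when the camera moves slightly is called a "corner." Moravec's (1977) criterion was simple: if the Sum of Squared Differences (SSD) between a patch and its shifted neighbors is large in every direction (up, down, left, right), treat the point as a corner.

At the 1988 Alvey Vision Conference, Harris and Stephens presented ["A Combined Corner and Edge Detector"](https://www.bmva.org/bmvc/1988/avc-88-023.html), replacing this with continuous derivatives. Approximating the intensity change when a window $W$ around point $(x,y)$ in image $I$ is shifted by $(\Delta x, \Delta y)$ gives the quadratic form $(\Delta x\ \ \Delta y)\, M\, (\Delta x\ \ \Delta y)^\top$, with: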

$$M = \sum_{(x,y) \in W} \begin{pmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{pmatrix}$$

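The two eigenvalues $\lambda_1, \lambda_2$ of $M$ classify the point. Both large: a corner. Only one large: an edge. Both small: a flat region. Rather than computing the eigenvalues explicitly, Harris used the score $R = \det(M) - k \cdot \text{tr}(M)^2$, which needs only the determinant and trace and so avoids an eigenvalue decomposition. $k$ is usually 0.04–0.06.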
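
A minimal sketch of the score on three synthetic patches shows the three cases. It assumes a uniform (unweighted) window covering the whole patch and central-difference gradients from `np.gradient`; practical detectors typically use a Gaussian-weighted window, and the patch contents here are made up.

```python
import numpy as np


def harris_score(patch, k=0.05):
    """R = det(M) - k * tr(M)^2, with M summed over the whole patch."""
    Iy, Ix = np.gradient(patch.astype(float))   # derivatives along rows (y) and columns (x)
    a, b, c = (Ix * Ix).sum(), (Ix * Iy).sum(), (Iy * Iy).sum()
    return (a * c - b * b) - k * (a + c) ** 2


flat = np.zeros((8, 8))          # no gradient anywhere

edge = np.zeros((8, 8))
edge[:, 4:] = 1.0                # vertical step: gradient in x only

corner = np.zeros((8, 8))
corner[4:, 4:] = 1.0             # bright quadrant: gradients in both x and y

scores = {name: harris_score(p) for name, p in
          [("flat", flat), ("edge", edge), ("corner", corner)]}
```

The sign pattern is the point: the score is zero on the flat patch, negative on the edge, and positive on the corner, so a single threshold on $R$ separates corners from the rest.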

> 🔗 **Borrowing.** The autocorrelation matrix of Harris (1988) refines Moravec's (1977) SSD-based corner search with continuous derivatives. The prototype of the concept was in the Stanford Cart reports.

In 1994, Jianbo Shi and Carlo Tomasi showed in ["Good Features to Track"](https://cecas.clemson.edu/~stb/klt/shi-tomasi-good-features-cvpr1994.pdf) (CVPR 1994) that using $\min(\lambda_1, \lambda_2)$ directly, instead of the Harris score, is more stable for optical flow tracking. This criterion is the Shi-Tomasi corner detector. OpenCV implements it as the `goodFeaturesToTrack` function, and more than 30 years later that function is still there.

---

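## 2.2 The Prototype of Tracking: Lucas-Kanade and KLT

Harris's matrix $M$ finds points. Finding those points again in the next frame is a separate problem. In 1981, in ["An Iterative Image Registration Technique"](https://www.ijcai.org/Proceedings/81-2/Papers/017.pdf), Bruce Lucas and Takeo Kanade formulated inter-frame pixel motion as a minimization problem under the brightness constancy assumption.

Brightness constancy assumption: the intensity of pixel $(x,y)$ is the same before and after the motion.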

$$I(x, y, t) = I(x + u, y + v, t + 1)$$

After a first-order Taylor expansion:

$$I_x u + I_y v + I_t = 0$$

This single equation has two unknowns. Lucas-Kanade adds the assumption that the pixels within a $3\times3$ or $5\times5$ window move with the same $(u,v)$, builds an overdetermined system, and solves it by least squares.

$$\begin{pmatrix} \sum I_x^2 & \sum I_x I_y \\ \sum I_x I_y & \sum I_y^2 \end{pmatrix} \begin{pmatrix} u \\ v \end{pmatrix} = -\begin{pmatrix} \sum I_x I_t \\ \sum I_y I_t \end{pmatrix}$$

The matrix on the left is identical to Harris's structure matrix $M$. Corner detection and optical flow stand on the same mathematics.
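
The 2×2 system can be solved directly for one window. The sketch below assumes a synthetic, smooth image defined by a made-up analytic function, a small sub-pixel shift, and a single 16×16 window; it performs one linearized solve and omits the iteration and image pyramids used by practical trackers.

```python
import numpy as np


def lucas_kanade_window(I0, I1, window):
    """Estimate one (u, v) shared by all pixels in `window` (a pair of slices)."""
    Iy, Ix = np.gradient(I0)      # spatial derivatives (rows = y, columns = x)
    It = I1 - I0                  # temporal derivative
    ix, iy, it = Ix[window].ravel(), Iy[window].ravel(), It[window].ravel()
    M = np.array([[ix @ ix, ix @ iy],
                  [ix @ iy, iy @ iy]])   # same structure matrix as in Harris
    b = -np.array([ix @ it, iy @ it])
    return np.linalg.solve(M, b)


# Synthetic smooth image and a second frame whose content moved by (0.3, -0.2) pixels.
ys, xs = np.mgrid[0:32, 0:32].astype(float)
scene = lambda x, y: np.sin(0.3 * x) + np.cos(0.25 * y)
u_true, v_true = 0.3, -0.2
I0 = scene(xs, ys)
I1 = scene(xs - u_true, ys - v_true)   # so that I1(x + u, y + v) equals I0(x, y)

u_est, v_est = lucas_kanade_window(I0, I1, (slice(8, 24), slice(8, 24)))
```

The estimate is close to the true shift but not exact, because the linearization and the finite-difference gradients are both approximations; this is why practical trackers iterate the solve.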

In the 1991 tech report ["Detection and Tracking of Point Features"](https://cecas.clemson.edu/~stb/klt/tomasi-kanade-techreport-1991.pdf), Tomasi and Kanade presented a concrete implementation that selects tracking windows by an eigenvalue criterion and refines the displacement with Newton-Raphson iterations. Later, Bouguet (Intel, 2000) added an image-pyramid coarse-to-fine strategy so that tracking converges under large motion, and this combination settled in as the KLT (Kanade-Lucas-Tomasi) tracker. Real-time VIO systems such as [VINS-Mono](https://arxiv.org/abs/1708.03852) (2018) still run a front-end from this lineage. The least-squares tracker of 1981 is, some 40 years on, running in the VIO of smartphones and drones.

> 🔗 **Borrowing.** Lucas-Kanade (1981) → KLT tracker → Qin et al., VINS-Mono (2018): optical flow from 37 years earlier lives on unchanged as the feature tracking backbone of real-time VIO.

---

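## 2.3 SIFT – Invariance and a Patent

KLT suits the case where one camera moves a little at a time. Linking the same point across images taken by different cameras on different days is a problem of another order. When the viewpoint changes, the same point's patch changes in shape, size, and even orientation, and plain pixel comparison fails. That is why a **descriptor** is needed.

David Lowe (UBC) presented the idea at ICCV 1999. The paper was titled "Object Recognition from Local Scale-Invariant Features," and it demonstrated finding the same object in different photographs by comparing local feature vectors. Five years later, in 2004, the finished version, ["Distinctive Image Features from Scale-Invariant Keypoints"](https://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf), appeared in IJCV; this is the paper cited as SIFT today. SIFT (Scale-Invariant Feature Transform) consists of two stages.

**Detection stage**: compute the DoG (Difference of Gaussians) at multiple scales and select local extrema as keypoints. The DoG approximates the Laplacian of Gaussian. Writing the Gaussian-smoothed image as $L(x,y,\sigma) = G(x,y,\sigma) * I(x,y)$: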

$$D(x, y, \sigma) = L(x, y, k\sigma) - L(x, y, \sigma)$$

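Here $k$ is the ratio between adjacent scales (usually $2^{1/s}$, where $s$ is the number of scales per octave). Searching for extrema across several octaves lets the same point be detected despite changes in scale.

**Descriptor stage**: divide a $16\times16$ window around the keypoint into sixteen $4\times4$ blocks and concatenate each block's gradient orientation histogram (8 bins), giving a $16 \times 8 = 128$-dimensional vector. Because the window is rotated relative to the keypoint's dominant gradient orientation, rotation invariance is obtained as well.

The result was a 128-dimensional descriptor robust to scale, rotation, and partial affine deformation. This is why researchers had little choice but to use SIFT even in the pre-KITTI era, when there were no SLAM benchmarks.

Lowe filed a patent application for SIFT in March 2000, and it was granted in March 2004 (US6711293B1, priority date March 1999). The patent imposed fees on commercial use and, until it expired in March 2020, was one of the motivations behind attempts to replace SIFT.

> 📜 **Prophecy vs. reality.** In "9 Conclusions" of the 2004 SIFT paper, Lowe listed possible extensions of the descriptor as "view matching for 3D reconstruction, motion tracking and segmentation, robot localization, image panorama assembly, epipolar calibration." The direction was mostly right: SfM, SLAM, panoramas, and early image-based robot localization all leaned on SIFT in the late 2000s. SIFT's position in long-term correspondence problems, however, was shaken after CNNs. After AlexNet in 2012, demand on the object recognition side moved to CNNs, and the role of local descriptor for SLAM was gradually taken over by learned descriptors such as SuperPoint and R2D2. The application areas were predicted correctly; the form of the descriptor changed with the technology. `[Partial hit]`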

---

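## 2.4 SURF – Trading Off Speed and Accuracy

SIFT's 128-dimensional descriptor was accurate but slow: hundreds of milliseconds per image on desktop CPUs of the day. It could not be used for real-time SLAM. Herbert Bay (ETH Zurich) presented ["SURF: Speeded-Up Robust Features"](https://people.ee.ethz.ch/~surf/eccv06.pdf) at ECCV 2006. It has two key ideas.

Keypoints are detected with the *determinant of the Hessian matrix* instead of the DoG. Box filters evaluated on an integral image approximate the second-order Gaussian derivatives and speed up the computation. The descriptor has 64 dimensions, half of SIFT's. The neighborhood of the keypoint is divided into $4\times4$ subregions, and in each subregion the four sums of Haar wavelet responses $d_x, d_y$, namely $(\sum d_x,\, \sum d_y,\, \sum|d_x|,\, \sum|d_y|)$, are concatenated to form $4\times4\times4=64$ dimensions. A 128-dimensional extension (SURF-128) exists, but the default is 64.

SURF was 3–7 times faster than SIFT. But the accuracy gap between 128 and 64 dimensions remained, and Bay did not escape patents either (an ETH Zurich patent). SIFT lost ground because of speed; SURF lost ground because of both accuracy and patents. ORB was what addressed speed and patents at once.

> 🔗 **Borrowing.** Lowe's (1999/2004) DoG scale space → Bay's (2006) Hessian on integral images: two answers to the question of how to obtain scale invariance. The DoG is theoretically elegant; the Hessian approximation is fast in engineering terms.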

---

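## 2.5 ORB – Binary Descriptors and Freedom from Patents

In 2011, Ethan Rublee (Willow Garage), Vincent Rabaud, Kurt Konolige, and Gary Bradski presented ["ORB: An Efficient Alternative to SIFT or SURF"](https://www.gwylab.com/download/ORB_2012.pdf) at ICCV. The title is blunt. Willow Garage was also the birthplace of ROS, and the motivation of building a feature that robotics researchers could actually use is written right into the title.

ORB combined and improved two existing techniques.

**Detection**: [FAST](https://www.edwardrosten.com/work/rosten_2006_machine.pdf) (Features from Accelerated Segment Test, Rosten & Drummond 2006). It walks around 16 pixels on a circle surrounding the candidate pixel and declares a corner if there is a long enough contiguous arc that is sufficiently brighter or darker than the center. It is more than 10 times faster than SIFT's DoG. ORB adds a Harris score to FAST and keeps only the strongest responses.

**Descriptor**: [BRIEF](https://www.cs.ubc.ca/~lowe/525/papers/calonder_eccv10.pdf) (Binary Robust Independent Elementary Features, Calonder et al. 2010). It compares the brightness of randomly chosen point pairs in the patch around the keypoint to produce a bit string; 256 bits is the default. Matching uses Hamming distance instead of Euclidean distance, so a comparison reduces to an XOR followed by a bit count.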
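
Hamming matching of binary descriptors is short enough to sketch in pure Python. The descriptors below are random 256-bit integers standing in for BRIEF outputs (an assumption made for illustration; no real image is involved), and `match` is a brute-force nearest-neighbor search.

```python
import random


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two descriptors stored as integers."""
    return bin(a ^ b).count("1")   # XOR marks the differing bits, then count them


def match(query: int, database: list[int]) -> int:
    """Index of the database descriptor closest to `query` in Hamming distance."""
    return min(range(len(database)), key=lambda i: hamming(query, database[i]))


random.seed(0)
database = [random.getrandbits(256) for _ in range(100)]

# Flip 10 distinct bits of descriptor 42 to imitate noise in a second view.
noisy = database[42]
for bit in random.sample(range(256), 10):
    noisy ^= 1 << bit

best = match(noisy, database)
```

Unrelated random 256-bit strings differ in about half their bits, so a descriptor with only a handful of flipped bits remains far closer to its source than to anything else in the database.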

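BRIEF's weakness was its lack of rotation invariance. Rublee rotated the patch to compensate for the orientation given by the intensity centroid of the FAST corner, producing **rBRIEF (rotation-aware BRIEF)**. Only with orientation estimation did BRIEF become a descriptor usable in practice. The orientation $\theta$ comes from the patch moments: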

$$\theta = \text{atan2}(m_{01},\, m_{10}), \quad m_{pq} = \sum_{x,y} x^p y^q I(x,y)$$
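
A minimal sketch of the intensity-centroid orientation, under simplifying assumptions: it uses a small square patch given as a list of rows (ORB itself works on a circular patch), with coordinates measured from the patch center and $y$ pointing down as in image coordinates. The function name and test patches are illustrative.

```python
import math


def intensity_centroid_angle(patch):
    """Orientation theta = atan2(m01, m10) of a square patch given as a list of rows."""
    half = len(patch) // 2
    m10 = m01 = 0.0
    for row_idx, row in enumerate(patch):
        for col_idx, value in enumerate(row):
            x, y = col_idx - half, row_idx - half   # coordinates relative to the center
            m10 += x * value
            m01 += y * value
    return math.atan2(m01, m10)


# A patch that is bright on its right side points along +x.
bright_right = [[1.0 if c > 2 else 0.0 for c in range(5)] for _ in range(5)]
# A patch that is bright at the bottom points along +y (downward in the image).
bright_bottom = [[1.0 if r > 2 else 0.0 for _ in range(5)] for r in range(5)]

angle_right = intensity_centroid_angle(bright_right)
angle_bottom = intensity_centroid_angle(bright_bottom)
```

Rotating the BRIEF sampling pattern by this angle before comparing pixel pairs is what makes the resulting bit string approximately independent of in-plane rotation.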

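It ran about 100 times faster than SIFT, carried no patent, and was integrated into OpenCV right away. [ORB-SLAM](https://arxiv.org/abs/1502.00956) (Mur-Artal et al. 2015) was, as the name says, built on ORB, and the line continued through a trilogy. ORB-SLAM3, in 2021, still had not changed its front-end.

> 🔗 **Borrowing.** BRIEF of Calonder et al. (2010) → ORB of Rublee et al. (2011): orientation estimation based on the intensity centroid was added to a binary descriptor to obtain rotation invariance.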

---

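## 2.6 Learned Descriptors

If ORB is the practical peak, the next question follows naturally: are learned rules better than hand-designed ones? In 2016, [LIFT](https://arxiv.org/abs/1603.09114) (Learned Invariant Feature Transform, ECCV 2016) by Yi et al. tried to replace all three stages (detection, orientation estimation, description) with CNNs. It chained three networks, each trained separately for its stage, into a pipeline.

In 2018, [SuperPoint](https://arxiv.org/abs/1712.07629) (CVPRW 2018) by DeTone et al. learned keypoint detection and a 256-dimensional descriptor jointly, using a self-supervised scheme called homographic adaptation: pretrain on synthetic data, then adapt to real images. It was the first learned descriptor to draw serious attention in the SLAM community.

Yet even as of 2026, traditional descriptors have not disappeared. ORB is faster than SuperPoint on embedded devices and behaves more predictably than learned descriptors, whose generalization to out-of-domain images is unstable. DINOv2-based features have been brought into place recognition, as in [AnyLoc](https://arxiv.org/abs/2308.00688) (Keetha et al. 2023), but ORB-SLAM3 has kept using ORB since its 2021 release. Moravec's intuition of 1977 is running on the robots of the 2020s.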

---

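## 2.7 🧭 Still Open

**Generalization limits of learned descriptors.** Learned descriptors such as SuperPoint, R2D2, and DISK outperform traditional methods within their training domain but are inconsistent in new environments (underwater, thermal, low-light). There is no consensus on which side is better, and the question remains open in 2026.

**Failure modes of wide-baseline matching.** Matching based on Harris or ORB degrades sharply once the change in camera viewpoint exceeds roughly 45 degrees. Affine-covariant methods (ASIFT, MSER) helped in part, but there is no complete solution. [DUSt3R](https://arxiv.org/abs/2312.14132) (Wang et al. 2023) opened a breakthrough by sidestepping matching altogether, but it is too early to say whether this is the end of the descriptor problem or a detour around it.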

---

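Harris's intuition and Lowe's invariance laid the foundation, and Rublee's speed optimization brought it into the field. The toolbox was complete. But each of these techniques was designed to work within one image or between two. Linking dozens or hundreds of images at once in a geometrically consistent way required another layer.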

---

*References*

- Harris, C. & Stephens, M. (1988). A Combined Corner and Edge Detector. *Proc. Alvey Vision Conference*.
- Lucas, B. D. & Kanade, T. (1981). An Iterative Image Registration Technique with an Application to Stereo Vision. *IJCAI*.
- Shi, J. & Tomasi, C. (1994). Good Features to Track. *CVPR*.
- [Lowe, D. G. (2004). Distinctive Image Features from Scale-Invariant Keypoints.](https://doi.org/10.1023/B:VISI.0000029664.99615.94) *IJCV 60(2)*.
- Bay, H., Tuytelaars, T. & Van Gool, L. (2006). SURF: Speeded-Up Robust Features. *ECCV*.
- Calonder, M. et al. (2010). BRIEF: Binary Robust Independent Elementary Features. *ECCV*.
- [Rublee, E. et al. (2011). ORB: An Efficient Alternative to SIFT or SURF.](https://doi.org/10.1109/ICCV.2011.6126544) *ICCV*.
- DeTone, D., Malisiewicz, T. & Rabinovich, A. (2018). SuperPoint: Self-Supervised Interest Point Detection and Description. *CVPRW*. [arXiv:1712.07629](https://arxiv.org/abs/1712.07629)

---

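# Ch.3 – Structure from Motion: From Longuet-Higgins to COLMAP

While Harris and Lowe were refining ways to pick out "points worth looking at" within an image, another lineage asked what can be known when those points appear in two photographs at once. The problem of *detecting* features and the problem of *reconstructing space* from features developed separately over the same period, and only in the mid-2000s were they joined into a single pipeline.

In 1981, H.C. Longuet-Higgins, a theoretical psychologist then at the University of Sussex, published a three-page paper in *Nature* titled "[A Computer Algorithm for Reconstructing a Scene from Two Projections](https://cseweb.ucsd.edu/classes/fa01/cse291/hclh/SceneReconstruction.pdf)". He showed that from just eight pairs of coordinates of the same points in two photographs, one can solve simultaneously for how the camera moved and for the three-dimensional shape of the scene. He was neither a roboticist nor a computer vision researcher. Structure from Motion (SfM) began with those three pages, and only with Johannes Schönberger's COLMAP in 2016 was that mathematics fully realized as engineering.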

---

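## 3.1 The Essential Matrix and the 8-point Algorithm

Longuet-Higgins's starting point was simple. When two cameras image the same point, an algebraic constraint holds between the pair of image coordinates. With normalized coordinates, this constraint condenses into a single matrix, now called the **essential matrix** $\mathbf{E}$.

Let the two camera centers be $\mathbf{O}_1$ and $\mathbf{O}_2$, and let the corresponding points in normalized coordinates be $\mathbf{x}_1$ and $\mathbf{x}_2$. The constraint is: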

$$\mathbf{x}_2^\top \mathbf{E} \mathbf{x}_1 = 0$$

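$\mathbf{E}$ factors as $\mathbf{E} = [\mathbf{t}]_\times \mathbf{R}$ in terms of the rotation $\mathbf{R}$ and translation $\mathbf{t}$ between the cameras, where $[\mathbf{t}]_\times$ is the skew-symmetric matrix of $\mathbf{t}$.

The essential matrix has 5 degrees of freedom once the scale ambiguity is removed. But until the nonlinear 5-point algorithm ([Nistér 2004](http://www.cad.zju.edu.cn/home/gfzhang/training/SFM/2004-PAMI-David%20Nister-An%20Efficient%20Solution%20to%20the%20Five-Point%20Relative%20Pose%20Problem.pdf)) appeared, the standard approach was linear: set aside the rank-2 and unit-scale constraints at first, treat the matrix as 9 entries with the overall scale fixed, which leaves 8 unknowns, and solve a linear system built from 8 correspondences. This is the **8-point algorithm**. Longuet-Higgins himself gave a procedure that obtains a unique solution from exactly 8 points. It was simple to implement and computationally light.

The problem was numerical stability. When image coordinates are in the hundreds or thousands of pixels, the entries of the coefficient matrix differ greatly in magnitude, and the least-squares solution obtained through the SVD becomes ill-conditioned.

> 🔗 **Borrowing.** In 1997, with the normalized 8-point algorithm ([In Defense of the Eight-Point Algorithm](https://www.cse.unr.edu/~bebis/CS485/Handouts/hartley.pdf)), Hartley proposed linearly transforming the image coordinates to zero mean and an average distance of $\sqrt{2}$ from the origin before estimating the essential matrix. Longuet-Higgins's geometry was left untouched; only the numerical conditioning was fixed. Textbooks since then have taken this normalized version as the standard.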
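
The normalization step itself is small. The sketch below assumes an `(N, 2)` array of pixel coordinates and isotropic scaling; the function name and the random test points are illustrative.

```python
import numpy as np


def hartley_normalize(pts):
    """Return (normalized points, 3x3 transform T) for an (N, 2) array of pixel coordinates."""
    centroid = pts.mean(axis=0)
    mean_dist = np.linalg.norm(pts - centroid, axis=1).mean()
    s = np.sqrt(2.0) / mean_dist                 # isotropic scale factor
    T = np.array([[s, 0.0, -s * centroid[0]],
                  [0.0, s, -s * centroid[1]],
                  [0.0, 0.0, 1.0]])
    homog = np.column_stack([pts, np.ones(len(pts))])
    return (homog @ T.T)[:, :2], T


rng = np.random.default_rng(0)
pixels = rng.uniform([0, 0], [1600, 1200], size=(50, 2))   # made-up image points
normed, T = hartley_normalize(pixels)
```

Each image gets its own transform. A matrix estimated from the normalized points is mapped back to the original coordinates with the two transforms before it is used.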

The fundamental matrix $\mathbf{F}$ generalizes the essential matrix. Even when the camera intrinsics $\mathbf{K}$ are unknown, $\mathbf{x}_2^\top \mathbf{F} \mathbf{x}_1 = 0$ holds, with $\mathbf{x}_1$ and $\mathbf{x}_2$ now taken in pixel coordinates. If the intrinsics of the two cameras are $\mathbf{K}_1$ and $\mathbf{K}_2$, the relation is $\mathbf{F} = \mathbf{K}_2^{-\top} \mathbf{E} \mathbf{K}_1^{-1}$. For images taken with the same camera ($\mathbf{K}_1 = \mathbf{K}_2 = \mathbf{K}$), it simplifies to $\mathbf{F} = \mathbf{K}^{-\top} \mathbf{E} \mathbf{K}^{-1}$. In an SfM pipeline, $\mathbf{F}$ is estimated first when $\mathbf{K}$ is unknown, and $\mathbf{E}$ is solved for directly when $\mathbf{K}$ is known.
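
Both constraints can be checked numerically on synthetic data. The sketch assumes the convention that a point $\mathbf{X}_1$ in camera-1 coordinates maps to $\mathbf{X}_2 = \mathbf{R}\mathbf{X}_1 + \mathbf{t}$ in camera 2, a single shared intrinsic matrix, and made-up values for the pose, the intrinsics, and the 3D points.

```python
import numpy as np


def skew(t):
    """Skew-symmetric matrix [t]_x such that skew(t) @ v equals the cross product t x v."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])


# Assumed relative pose: X2 = R @ X1 + t (10 degree rotation about the y axis).
angle = np.deg2rad(10.0)
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t = np.array([0.5, 0.0, 0.1])
E = skew(t) @ R

K = np.array([[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]])
K_inv = np.linalg.inv(K)
F = K_inv.T @ E @ K_inv            # same camera for both views

rng = np.random.default_rng(1)
X1 = rng.uniform([-1, -1, 4], [1, 1, 8], size=(20, 3))   # points in front of camera 1
X2 = X1 @ R.T + t
x1 = X1 / X1[:, 2:3]               # normalized homogeneous coordinates (x, y, 1)
x2 = X2 / X2[:, 2:3]
p1, p2 = x1 @ K.T, x2 @ K.T        # homogeneous pixel coordinates

res_E = np.einsum("ni,ij,nj->n", x2, E, x1)   # x2^T E x1 for every correspondence
res_F = np.einsum("ni,ij,nj->n", p2, F, p1)   # the same constraint in pixel coordinates
```

With noise-free correspondences both residual vectors vanish up to rounding, and `E` has rank 2, which is the constraint the linear 8-point solution has to be projected onto afterwards.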

---

## 3.2 Tomasi-Kanade Factorization

For a decade after 1981, SfM was studied mainly as the geometry between two photographs. Handling many photographs at once was a separate problem. Its outline came into focus in 1992, when Carlo Tomasi and Takeo Kanade at CMU published the **[factorization method](https://people.eecs.berkeley.edu/~yang/courses/cs294-6/papers/TomasiC_Shape%20and%20motion%20from%20image%20streams%20under%20orthography.pdf)**.

The idea is as follows. If $P$ points are observed over $F$ frames, the image coordinates can be stacked into a $2F \times P$ matrix $\mathbf{W}$, whose entries $w_{fp}$ hold the coordinates of point $p$ in frame $f$. Under an orthographic camera model, and once each row's mean has been subtracted, $\mathbf{W}$ has rank at most 3. The original paper (Tomasi & Kanade 1992) started from exactly this assumption. Then:

$$\mathbf{W} = \mathbf{M} \mathbf{S}$$

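Here $\mathbf{M}$ is the $2F \times 3$ motion matrix and $\mathbf{S}$ is the $3 \times P$ structure matrix. Keeping only the top 3 singular values in the SVD of $\mathbf{W}$ yields $\mathbf{M}$ and $\mathbf{S}$ together, up to an invertible $3\times3$ ambiguity that the orthonormality (metric) constraints on the rows of $\mathbf{M}$ then resolve.

The key point was that a single SVD estimates the motion of every frame and the 3D position of every point at once. The cost is dominated by that one SVD of a $2F \times P$ matrix, and the method was easy to implement.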
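
The rank-3 property and the SVD split can be checked on synthetic data. This sketch assumes noise-free orthographic projections of random points with random camera axes, stacks the two rows of each frame together (the original paper orders the rows differently), and stops before the metric upgrade, so `M_hat` and `S_hat` are only defined up to the $3\times3$ ambiguity.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_points = 6, 15
S_true = rng.standard_normal((3, n_points))              # made-up 3D points

rows = []
for _ in range(n_frames):
    Q, _r = np.linalg.qr(rng.standard_normal((3, 3)))    # random orthonormal camera axes
    offset = rng.standard_normal((2, 1))                 # per-frame image translation
    rows.append(Q[:2] @ S_true + offset)                 # orthographic projection
W = np.vstack(rows)                                      # 2F x P measurement matrix

W_reg = W - W.mean(axis=1, keepdims=True)                # subtract each row's centroid
U, s, Vt = np.linalg.svd(W_reg, full_matrices=False)

M_hat = U[:, :3] * np.sqrt(s[:3])                        # motion, up to a 3x3 ambiguity
S_hat = np.sqrt(s[:3])[:, None] * Vt[:3]                 # structure, same ambiguity
```

With real, noisy tracks the fourth singular value is no longer zero, and truncating to rank 3 acts as the least-squares fit of the orthographic model to the data.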

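> 🔗 **Borrowing.** The 2004 CVPR paper "Visual Odometry" by Nistér, Naroditsky, and Bergen is widely cited in later literature as having recast real-time ego-motion estimation as an application problem of this lineage. Rather than using Tomasi-Kanade's batch factorization as is, the direction shifted toward solving inter-frame relative poses within a short window, and this remains an early marker of the trend that trades batch accuracy for latency.

The limitation lay in the orthographic/affine assumption. An affine camera ignores perspective distortion. The model is valid only when the depth variation within the scene is small compared with the distance to the camera (that is, for small objects photographed from far away). For scenes close to the camera, wide-angle lenses, or environments with large foreground-background depth differences, the error was large. From the late 1990s, extensions to perspective cameras were attempted from several directions, and this led to the rediscovery of bundle adjustment.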

---

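## 3.3 Hartley & Zisserman and Canonization

If Tomasi-Kanade factorization framed the multiple-view problem, the remaining tasks were to extend it to perspective cameras and to bind the scattered mathematics into one language.

In 2000, Richard Hartley and Andrew Zisserman's textbook *[Multiple View Geometry in Computer Vision](https://www.robots.ox.ac.uk/~vgg/hzbook/)* came out. 680 pages. It unified the SfM mathematics scattered here and there from 1981 through the 1990s in the language of projective geometry.

What Hartley & Zisserman did was more than tidy up. They derived the essential matrix, the fundamental matrix, homographies, camera calibration, and bundle adjustment all from a single framework of projective geometry. For the first time it became clear that concepts which had each gone their own way came from the same root.

The book gave bundle adjustment particular weight. Hartley & Zisserman placed the reprojection error minimization problem of Triggs et al. (1999), introduced in Ch.1, inside the projective-geometry framework and explicitly added a *robust cost function* $\rho$ on top. Errors are passed through Huber or Cauchy functions so that the optimization does not collapse on real data mixed with outliers. The problem is solved with Levenberg-Marquardt, exploiting the sparse structure of the Jacobian to cut the computation.

Most SLAM and VO papers of the early 2000s cited this textbook as their standard reference. With concept definitions unified by this one book, large-scale applications such as Photo Tourism could concentrate on implementation without redefining concepts.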

---

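## 3.4 Photo Tourism and Bundler – Internet-Scale SfM

In 2006, Noah Snavely, Steven Seitz, and Richard Szeliski published the SIGGRAPH paper "[Photo Tourism](https://doi.org/10.1145/1179352.1141964)". They gathered photos of tourist sites uploaded to the internet (Notre Dame in Paris, the Trevi Fountain in Rome) and attempted 3D reconstruction.

The setting itself was the challenge. Cameras, weather, and composition all varied, and unrelated indoor shots were mixed in. This was not a systematically captured dataset but images uploaded by thousands of people in no particular order.

Snavely's pipeline worked in the following order. Detect and match SIFT features to find correspondences between image pairs. Remove geometrically inconsistent matches with RANSAC on the fundamental matrix. Run incremental SfM, starting from a well-connected image pair and adding cameras one at a time. After each camera is added, re-optimize all poses and points with bundle adjustment.

The datasets reported in the paper included Notre Dame Cathedral (597 of 2,635 candidate images registered), the Trevi Fountain (Rome, 360 of 466), Yosemite's Half Dome (325 of 1,882), the Great Wall (82 of 120), and Trafalgar Square (278 of 1,893), with a mean reprojection error of about 1.5 pixels on 1,611×1,128-pixel images. No one had previously attempted reconstruction at this scale from uncontrolled internet images.

The implementation of this pipeline was Bundler. Snavely released it as open source, and it became the default starting point for SfM researchers.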

---

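## 3.5 COLMAP – Engineering Maturity

> 📜 **Prophecy vs. reality.** The "Discussion and future work" section of Snavely et al. 2006 states, "Ultimately, we wish to scale up our reconstruction algorithm to handle millions of photographs," and names better image registration order, lens distortion modeling, handling of repeated structures, and reconstruction of disconnected structures as remaining tasks. Scaling was taken up by COLMAP (Schönberger 2016) and OpenSfM at the level of tens to hundreds of thousands of images, while real-time, online processing was answered separately by the SLAM lineage rather than by SfM: with fixed-lag smoothers and loop closure instead of incremental refinement. Of the items Snavely listed, scaling is the one most clearly fulfilled. `[Partial hit]`

In 2016, Johannes Schönberger and Jan-Michael Frahm published the CVPR paper "[Structure-from-Motion Revisited](https://openaccess.thecvf.com/content_cvpr_2016/papers/Schonberger_Structure-From-Motion_Revisited_CVPR_2016_paper.pdf)". The "Revisited" in the title was a modest choice of word; in substance it was a redesign that systematically bundled the improvements accumulated in the decade since Bundler.

COLMAP differs most from Bundler in three places.

First, the order in which cameras are added. Bundler started from well-connected pairs but had no systematic criterion for which pair to extend next. COLMAP automated the choice of the initial image pair and the camera registration order based on triangulation angle, feature track length, and a visibility score. Reconstruction became much more stable.

Second, the bundle adjustment schedule. Full bundle adjustment after every added camera is expensive. COLMAP introduced an alternation between local bundle adjustment (optimizing only the most recently added camera together with the cameras that share many points with it) and periodic global bundle adjustment.

Third, geometric verification. For matched feature pairs, RANSAC is run separately with two models, the fundamental matrix and the homography. The fundamental matrix models general non-planar scenes; the homography models planar scenes or pure rotation. COLMAP compares the inlier counts of the two models to determine the scene type and filters out matches that fit neither. It held up better than Bundler under bad matches and planar degeneracy.

> 🔗 **Borrowing.** COLMAP's incremental bundle adjustment strategy modularizes Snavely's Bundler pipeline and adds quality control at each stage. The core mathematics of the algorithm (essential matrix estimation, triangulation, Levenberg-Marquardt) is that of the Hartley & Zisserman textbook. COLMAP's contribution lay in systematizing engineering judgment.

COLMAP became the de facto standard not only because of its performance. The codebase was well organized, the documentation was sufficient, and CUDA acceleration let it process thousands of images in a reasonable time. After NeRF appeared in 2020, most NeRF training code took COLMAP output (camera poses + sparse point cloud) as input. The same went for 3D Gaussian Splatting. Before being an SfM tool, COLMAP became the entrance to 3D reconstruction research.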

---

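## 3.6 The Divergence of SfM and SLAM

SfM and SLAM use the same mathematics but solve fundamentally different problems. The distinction became clear in the early 2000s.

SfM is *offline*. All images are collected before processing, so there is no time constraint, and global bundle adjustment can be run many times while revisiting the entire dataset. If a camera pose is wrong, you go back and recompute.

SLAM is *online*. Sensor data streams in continuously, and the robot's current position must be produced on the spot. Past data cannot be revisited without limit, computation grows as the map grows, and when the robot completes a loop and returns to a place it first visited, the accumulated drift has to be corrected.

The two fields part ways most sharply at loop closure. In SfM, global bundle adjustment cleans up every inconsistency. In SLAM, the moment a loop closes must be detected and the drift at that point corrected locally. The techniques for this (visual place recognition, pose graph optimization, covisibility-based local optimization) addressed problems specific to SLAM that do not arise in SfM.

Uncertainty propagation differed too. SLAM tracks the uncertainty of the current pose in real time and updates it with every new observation, which calls for a probabilistic representation in the form of an EKF or a factor graph. In SfM, covariance can be computed after the fact once optimization has finished, and real-time tracking is not essential.

Davison's [MonoSLAM (2003)](https://www.doc.ic.ac.uk/~ajd/Publications/davison_iccv2003.pdf) described itself as "real-time SfM." But its structure, which keeps camera pose and landmarks together in an EKF state vector, differed from the global batch of SfM. Over the 2000s the two fields split into independent lineages, each with its own problem setting.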

---

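## 3.7 🧭 Still Open

**SfM with dynamic objects.** Current SfM systems, COLMAP included, assume a static world. Because bundle adjustment is solved on the premise that no point in the scene moves, contaminated matches distort the optimization in scenes with many cars or pedestrians. RANSAC filters out some of them, but that is not a fundamental solution. Research on dynamic SfM is under way, including segmentation integration and independent per-object motion estimation, but as of 2026 there is no general-purpose implementation at the level of COLMAP.

**The blurring boundary between SfM and SLAM.** In 2023, [DUSt3R](https://arxiv.org/abs/2312.14132) (Wang et al.) took two images and produced a dense point map and camera poses together with a single pretrained network. It went through no feature matching, no RANSAC, and no bundle adjustment initialization. With the extension to [MASt3R](https://arxiv.org/abs/2406.09756) (2024), reconstruction from dozens of images became possible as well. The modules of the traditional SfM pipeline are being replaced one by one. If COLMAP was the entrance to NeRF and 3DGS, the DUSt3R family is trying to change even that entrance. Whether this paradigm will actually displace COLMAP or win only in specific domains is not yet known.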

---

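While SfM was refining precise offline reconstruction, an entirely different question was building up elsewhere. Not photographs but a moving robot. The images have not been collected yet. A pose estimate must be delivered right now. The question Randall Smith and Peter Cheeseman [posed in 1986](https://people.csail.mit.edu/brooks/idocs/Smith_Cheeseman.pdf), how to propagate uncertain spatial relationships, grew under that pressure into the separate field called SLAM.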

---

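# Ch.4 – Smith-Cheeseman and the Rise and Fall of EKF-SLAM

The photogrammetry, SfM, and bundle adjustment of Part 1 shared one premise: the camera is stationary, or there is time to process all images together offline after capture. Hartley-Zisserman's geometry, RANSAC's robust estimation, Levenberg-Marquardt's iterative optimization: these tools knew how to measure the world, but they did not ask where a moving robot is *at this very moment*. Part 2 begins with that question: building a map while knowing one's own position, and not giving up on estimation even as uncertainty piles up. The problem of probabilistic maps was opened by a small memo at SRI International.

In 1986, Randall Smith and Peter Cheeseman set out to treat mathematically how uncertain a measurement is when a robot measures something in space. Their idea, which came out of SRI International, inherited the filter mathematics of Kalman (1960) but extended it from estimating a single state to propagating uncertainty across an entire *network of spatial relationships*. Some years later, Hugh Durrant-Whyte in Sydney and John Leonard at MIT joined this mathematics to the problem formulation "a robot builds a map while simultaneously estimating its own position." The acronym "SLAM" is the product of that junction.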

---

## 4.1 불확실 공간관계의 수학 — Smith, Self, Cheeseman (1988)

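In 1986, Randall Smith and Peter Cheeseman at SRI International set out to capture in equations how error propagates when a robot accumulates measurements while passing through several places. Those working notes appeared in 1988, with Matthew Self as a co-author, as the paper ["Estimating Uncertain Spatial Relationships in Robotics"](https://arxiv.org/abs/1304.3111). The question itself was clear. If a robot measures B from A and then measures C from B, how is the uncertainty from A to C computed?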

[Kalman 필터](https://www.cs.unc.edu/~welch/kalman/kalmanPaper.html)는 이미 있었다. 레이더 추적, 탄도 계산, 위성 궤도 보정에 1960년부터 쓰였다. Smith와 Cheeseman이 한 일은 Kalman의 공분산 전파 방정식을 공간 변환의 합성(composition)에 맞게 재공식화한 것이다. 로봇 pose $\mathbf{x}_r$과 landmark 위치 $\mathbf{m}_i$를 하나의 state vector에 담고, 그 전체의 joint covariance $\mathbf{P}$를 유지한다.

$$\mathbf{x} = [\mathbf{x}_r^\top,\ \mathbf{m}_1^\top,\ \ldots,\ \mathbf{m}_N^\top]^\top$$

$$\mathbf{P} = \begin{bmatrix} \mathbf{P}_{rr} & \mathbf{P}_{rm} \\ \mathbf{P}_{mr} & \mathbf{P}_{mm} \end{bmatrix}$$

off-diagonal 블록 $\mathbf{P}_{rm}$이 핵심이었다. 로봇 위치 불확실성과 landmark 위치 불확실성이 *상관되어* 있다는 것, 그 상관관계를 추적해야 일관된 추정이 가능하다는 것. 논문은 이것을 명시적으로 증명했고, SLAM 분야 전체가 이 출발점에 섰다.
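The A-to-B-to-C question can be made concrete with a first-order sketch of pose compounding in 2D. The function names (`compound`, `matmul`, `transpose`) and the numbers are illustrative assumptions rather than notation from the paper, and the two relative poses are assumed to have independent errors.

```python
import math

def matmul(A, B):
    """Multiply two small dense matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def compound(pose_ab, cov_ab, pose_bc, cov_bc):
    """Compose A->B with B->C and propagate covariance to first order."""
    x1, y1, t1 = pose_ab
    x2, y2, t2 = pose_bc
    c, s = math.cos(t1), math.sin(t1)
    pose_ac = (x1 + c * x2 - s * y2, y1 + s * x2 + c * y2, t1 + t2)
    # Jacobians of the composition with respect to each input pose.
    J1 = [[1.0, 0.0, -s * x2 - c * y2],
          [0.0, 1.0,  c * x2 - s * y2],
          [0.0, 0.0, 1.0]]
    J2 = [[c, -s, 0.0],
          [s,  c, 0.0],
          [0.0, 0.0, 1.0]]
    term1 = matmul(matmul(J1, cov_ab), transpose(J1))
    term2 = matmul(matmul(J2, cov_bc), transpose(J2))
    cov_ac = [[term1[i][j] + term2[i][j] for j in range(3)] for i in range(3)]
    return pose_ac, cov_ac

step = (1.0, 0.0, 0.0)  # move 1 m forward, no turn
cov = [[0.01, 0.0, 0.0], [0.0, 0.01, 0.0], [0.0, 0.0, 0.001]]
pose_ac, cov_ac = compound(step, cov, step, cov)
print(pose_ac)
print(cov_ac[1][1], cov_ac[1][2])
```

In this example the lateral variance of the composed pose is larger than the sum of the two lateral variances, because heading uncertainty at B leaks into position uncertainty at C. The non-zero `cov_ac[1][2]` entry is the same kind of cross-correlation that the off-diagonal block tracks.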

> 🔗 **차용.** Smith-Cheeseman(1988)의 공간관계 수학은 Kalman(1960)의 공분산 전파를 직접 계승한다. 단일 이동 물체를 추적하던 기법이 로봇과 지도 요소 전체를 동시에 추적하는 틀로 바뀌었다.

---

## 4.2 "SLAM"이라는 이름의 정착

Smith-Cheeseman의 1988년 논문에는 "SLAM"이라는 단어가 없다. Oxford에서 Sydney로 옮긴 Hugh Durrant-Whyte와 MIT의 John Leonard가 1990년대 초 각자의 연구실에서 같은 문제를 다른 이름으로 부르고 있었다. 두 그룹이 서로를 인용하기 시작하면서 공통 용어가 필요해졌고, "SLAM"은 그렇게 수렴해 굳었다. 정확히 어느 문서에서 처음 쓰였는지는 연구자마다 기억이 다르다. 공식 선점 논문은 없다.

Leonard와 Durrant-Whyte의 1991년 논문 ["Simultaneous Map Building and Localization for an Autonomous Mobile Robot"](https://doi.org/10.1109/IROS.1991.174711)이 이 문제를 로봇공학 메인스트림에서 제목으로 명시한 초기 사례로 자주 인용된다. "Mapping"과 "Localization"이 분리 불가능하게 얽혀 있다는 것, 그것을 동시에(simultaneously) 해야 한다는 것, 이 직관이 약어 이전에 있었다.

"Simultaneous Localization and Mapping", 줄여서 SLAM. 이후 10년간 이 이름이 분야 전체를 수렴시키는 구심이 된다.

> 🔗 **차용.** [Bar-Shalom의 다중 표적 추적](https://archive.org/details/trackingdataasso0000bars)(multi-target tracking, 1988년 단행본으로 정리됨)은 여러 물체의 state를 동시에 추정하는 프레임워크를 제공했다. Leonard와 Durrant-Whyte는 이 프레임워크에서 "표적 위치"를 "landmark 위치"로, "추적기 위치"를 "로봇 pose"로 대응시켰다고 볼 수 있다. 레이더 기술이 로봇 실내 매핑으로 번역된 사례다.

---

## 4.3 EKF-SLAM의 공식

Extended Kalman Filter(EKF)가 SLAM에 적용된 것은 자연스러운 수렴이었다. 1988년 이전부터 비선형 시스템 추정에 사용되던 EKF는 predict-update 두 단계로 작동한다.

predict 단계: 로봇이 움직이면 모션 모델 $f(\cdot)$로 state를 예측하고, Jacobian $\mathbf{F}$로 공분산을 전파.

$$\hat{\mathbf{x}}^- = f(\hat{\mathbf{x}}, \mathbf{u})$$
$$\mathbf{P}^- = \mathbf{F}\mathbf{P}\mathbf{F}^\top + \mathbf{Q}$$

update 단계: 센서 측정값 $\mathbf{z}$가 오면 관측 모델 $h(\cdot)$의 Jacobian $\mathbf{H}$로 Kalman gain $\mathbf{K}$를 계산해 state와 공분산을 갱신.

$$\mathbf{K} = \mathbf{P}^-\mathbf{H}^\top(\mathbf{H}\mathbf{P}^-\mathbf{H}^\top + \mathbf{R})^{-1}$$
$$\hat{\mathbf{x}} = \hat{\mathbf{x}}^- + \mathbf{K}(\mathbf{z} - h(\hat{\mathbf{x}}^-))$$
$$\mathbf{P} = (\mathbf{I} - \mathbf{K}\mathbf{H})\mathbf{P}^-$$

이 두 단계의 반복이 EKF-SLAM의 전부다. 구조는 단순하지만 그 단순함이 처음부터 확장성의 천장을 안고 있었다.
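A minimal sketch of the predict-update cycle, assuming a one-dimensional world with one robot coordinate and one landmark. The model here is linear (the measurement is `z = m - x_r`), so the Jacobians are constant and this reduces to a plain Kalman filter; an EKF would re-evaluate `F` and `H` at the current estimate. All names and numbers are illustrative.

```python
def predict(x, P, u, q):
    """Motion step: the robot moves by u; only the robot block gains noise q."""
    x_pred = [x[0] + u, x[1]]
    # F is the identity for this linear model, so P^- = P + Q.
    P_pred = [[P[0][0] + q, P[0][1]],
              [P[1][0],     P[1][1]]]
    return x_pred, P_pred

def update(x, P, z, r):
    """Measurement step for z = m - x_r, i.e. H = [-1, 1]."""
    H = [-1.0, 1.0]
    # P H^T (2x1) and innovation covariance S = H P H^T + R (a scalar here).
    PHt = [P[0][0] * H[0] + P[0][1] * H[1],
           P[1][0] * H[0] + P[1][1] * H[1]]
    S = H[0] * PHt[0] + H[1] * PHt[1] + r
    K = [PHt[0] / S, PHt[1] / S]
    innovation = z - (x[1] - x[0])
    x_new = [x[0] + K[0] * innovation, x[1] + K[1] * innovation]
    # P = (I - K H) P^-, which equals P^- - K (P^- H^T)^T for symmetric P^-.
    P_new = [[P[i][j] - K[i] * PHt[j] for j in range(2)] for i in range(2)]
    return x_new, P_new

x, P = [0.0, 5.0], [[0.01, 0.0], [0.0, 100.0]]  # landmark position almost unknown
x, P = predict(x, P, u=1.0, q=0.1)
x, P = update(x, P, z=4.2, r=0.01)
print(x)
print(P)
```

After a single update the off-diagonal entries of `P` are no longer zero: the landmark estimate has inherited the robot's position uncertainty, which is the correlation structure described in §4.1.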

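The problem is the state dimension. With a 6DOF pose and $N$ 3D landmarks, the state vector has dimension $6 + 3N$, and the covariance matrix has $(6+3N)^2$ entries, an $O(N^2)$ structure. The innovation covariance $\mathbf{S} = \mathbf{H}\mathbf{P}^-\mathbf{H}^\top + \mathbf{R}$ that has to be inverted is only as large as the measurement, but forming the Kalman gain and updating the covariance touch every entry of $\mathbf{P}$, so each update costs $O(N^2)$. With 100 landmarks that is $306 \times 306 \approx 94{,}000$ entries; with 1,000 it is $3006 \times 3006 \approx 9$ million. On an ordinary PC of the early 2000s, the number of landmarks that could be kept in real time topped out somewhere between a few tens and about a hundred.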
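The arithmetic behind these figures is easy to reproduce. The helper name `ekf_slam_sizes` is an illustrative assumption; the defaults follow the 6DOF pose and 3D landmarks used in the text.

```python
def ekf_slam_sizes(num_landmarks, pose_dim=6, landmark_dim=3):
    """Return (state dimension, number of covariance entries) for EKF-SLAM."""
    state_dim = pose_dim + landmark_dim * num_landmarks
    return state_dim, state_dim * state_dim

for n in (10, 100, 1000):
    dim, entries = ekf_slam_sizes(n)
    print(f"N={n:5d}  state dim={dim:5d}  covariance entries={entries:,}")
```

Multiplying the landmark count by ten multiplies the covariance size by roughly a hundred, which is the quadratic growth the section describes.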

[Andrew Davison의 MonoSLAM(2003)](https://www.doc.ic.ac.uk/~ajd/Publications/davison_iccv2003.pdf)이 실시간 시연에서 landmark 수십 개 수준에 갇힌 것은 우연이 아니었다. EKF-SLAM의 $O(N^2)$ 벽이 그 숫자를 결정했다.

---

## 4.4 확장성의 벽

2003년 ICCV에서 Davison이 웹캠 하나로 실시간 3D 추적을 시연했을 때, 수십 개 수준의 feature로 책상 하나 크기의 공간을 매핑했다. 당시 상업용 SLAM 시스템이 없던 환경에서 실시간 단안 추적은 드문 시연이었다. 그 한계는 공분산 행렬의 크기에서 왔다.

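With 100 landmarks the covariance matrix is $306 \times 306$ (6DOF pose plus 100 3D landmarks, state dimension $6 + 3 \times 100 = 306$). With 1,000 it is $3006 \times 3006$. Every time step has to update this whole matrix. On top of that, the EKF keeps the entire joint distribution as a single block, so as soon as a new landmark is added, cross-correlations with every existing landmark are created. The update cost therefore grows quadratically with the number of landmarks, not merely linearly.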

2000년대 중반까지 시도된 해법은 submap이었다. 전체 지도를 겹쳐지는 소영역으로 나누고, 각 submap에서만 EKF를 돌린 뒤 submap 사이를 별도 연결 구조로 잇는다. [Chong과 Kleeman(1999)](http://www.cs.cmu.edu/afs/cs/Web/People/motionplanning/papers/sbp_papers/integrated1/chong_feature_map.pdf)이 초기 형태를 제안했다. 그러나 submap 경계에서의 정보 손실과 루프 클로저의 어려움, 그리고 구현 복잡도가 submap 접근을 실용화하는 데 마찰을 일으켰다.

> 🔗 **차용.** Chong-Kleeman(1999)의 submap 분할 아이디어는 이후 현대 SLAM의 local window 최적화로 계승된다. ORB-SLAM의 local map, VINS-Mono의 sliding window가 conceptually 같은 원리 위에 있다. 단지 구현 도구가 EKF에서 bundle adjustment로 바뀌었을 뿐이다.

---

## 4.5 Consistency 문제: Julier-Uhlmann의 반례

EKF-SLAM의 더 깊은 결함은 2001년 ICRA에서 터졌다. Simon Julier와 Jeffrey Uhlmann이 EKF 기반 SLAM의 거동을 수치 실험으로 분석하며 필터가 자기 자신을 너무 믿는다는 것을 보였다. 그들이 IEEE ICRA에 낸 논문 제목은 ["A Counter Example to the Theory of Simultaneous Localization and Map Building"](https://doi.org/10.1109/ROBOT.2001.933257)이었다. 도발적이었고, 내용도 그랬다.

2차 문헌들이 이 논문을 인용하며 요약하는 핵심은, EKF-SLAM이 asymptotically *overconfident*하다는 것이다. 즉, 실제 추정 오류는 커지는데 필터가 계산하는 공분산(불확실성)은 실제보다 작게 수렴한다. 이것이 inconsistency다.

원인은 linearization error에 있다. EKF는 비선형 모션 모델과 관측 모델을 일차 Taylor 전개로 근사한다. 이 근사 오류가 매 단계 누적되면 공분산이 실제 오류를 과소 평가하기 시작한다. 로봇이 "나는 여기 있다"고 과도하게 확신하면, 이후 measurements를 필터가 덜 신뢰하게 되어 오류가 교정되지 않고 쌓인다.

2007년 [Shoudong Huang과 Gamini Dissanayake](https://doi.org/10.1109/TRO.2007.903811)는 이 inconsistency의 원인을 더 정밀하게 해부했다. 논문의 핵심 진단은 두 가지였다. 현재 상태 추정치에서 평가된 Jacobian들 사이의 기본 제약(constraint)이 무너지는 것이 EKF-SLAM 비일관성의 주된 원인이고, 그 결과 로봇 방향각(yaw)의 분산이 실제로는 유지되어야 하는데도 잘못 0으로 수렴할 수 있다는 것이었다. 선형화 시점에 따라 시스템의 관측 가능한 자유도가 달라지고, 관측 불가능한 방향에 필터가 임의의 정보를 주입하게 된다는 이후 observability 기반 계열의 해석은 이 논문에서 출발한다.

> 📜 **예언 vs 실제.** Julier와 Uhlmann의 2001년 반례 이후, consistent estimation을 달성하려는 필터 설계 시도가 이어졌다. Unscented Kalman Filter(UKF), Invariant EKF, robust covariance 등 필터 계열의 변형들이 10년 가까이 제안됐다. 그러나 2026년 시점에서 되돌아보면 이 문제의 실용적 해법은 *필터가 아닌 최적화*였다. [iSAM](https://www.cs.cmu.edu/~kaess/pub/Kaess08tro.pdf)(Kaess et al., 2008), [g2o](http://ais.informatik.uni-freiburg.de/publications/papers/kuemmerle11icra.pdf)(Kümmerle et al., 2011), GTSAM이 filter를 사실상 대체했다. Jacobian linearization을 current estimate에 고정하지 않고 반복 최적화로 갱신하는 방식은 inconsistency를 구조적으로 회피한다. 반례가 요구한 "새 필터"의 자리를 결국 필터가 아닌 구조가 채웠다. `[무산]`

---

## 4.6 FastSLAM — 분할통치

[FastSLAM](https://cdn.aaai.org/AAAI/2002/AAAI02-089.pdf) attacked the $O(N^2)$ wall of EKF-SLAM from a different angle. Michael Montemerlo and Sebastian Thrun (then at Carnegie Mellon), together with Daphne Koller and Ben Wegbreit (Stanford), presented it at AAAI 2002.

핵심 관찰은 Rao-Blackwellization이다. 로봇 경로 $x_{0:t}$가 주어지면 각 landmark의 위치 추정이 *서로 독립*이 된다. 따라서 경로를 particle filter로 표현하고(각 particle이 하나의 가능한 경로를 대표), 각 particle마다 별도의 landmark EKF를 독립적으로 운용할 수 있다.
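Written out, the factorization behind this observation takes the following standard form. The symbols $z_{1:t}$ (measurements) and $u_{1:t}$ (controls) are introduced here for illustration, and known data association is assumed.

$$p(x_{0:t}, \mathbf{m}_{1:N} \mid z_{1:t}, u_{1:t}) = p(x_{0:t} \mid z_{1:t}, u_{1:t}) \prod_{i=1}^{N} p(\mathbf{m}_i \mid x_{0:t}, z_{1:t})$$

The first factor is what the particle filter samples. Each term of the product is a small landmark EKF attached to one particle, so no cross-covariance between landmarks has to be stored.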

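With $K$ particles and $N$ landmarks, the per-step complexity is $O(K \log N)$ when each particle's landmark estimates are stored in a tree structure, as the FastSLAM paper proposes. Unlike the $O(N^2)$ of EKF-SLAM, this grows only logarithmically in $N$. Because the per-particle landmark EKFs are independent of one another, there is no need to maintain the full joint covariance over all $N$ landmarks as their number grows. $K$ is typically fixed in the tens to hundreds, so the practical gain was large.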

FastSLAM은 작동했다. 실내 환경에서 수백 개 landmark까지 실시간을 유지했고, 기술 이전도 빨랐다. 그러나 문제들이 쌓였다. particle depletion: 지도가 커지면 대부분의 particle이 불량 경로를 대표하게 되고, effective sample 수가 급감한다. 루프 클로저 상황에서 경로 가중치 재조정이 어렵다. 무엇보다 particle 수를 늘려도 large-scale 환경에서 드리프트가 축적되는 문제는 해결되지 않았다.

[FastSLAM 2.0](https://www.ijcai.org/Proceedings/03/Papers/165.pdf)(Montemerlo et al. 2003)이 proposal distribution을 개선했지만, 방법론이 filter 패러다임 안에 갇혀 있는 한 확장성의 천장이 있었다. 그 천장을 결국 피해 간 방법은 filter 계열이 아니었다.

---

## 4.7 EKF의 퇴장

그래프 기반 접근이 2005년 이후 빠르게 현실화되면서 EKF-SLAM은 주력에서 물러났다. [Feng Lu와 Evangelos Milios의 1997년 그래프 아이디어](https://doi.org/10.1023/A:1008854305733)가 [Olson-Leonard-Teller(2006)](https://april.eecs.umich.edu/pdfs/olson2006icra.pdf)의 efficient solver와, 이후 g2o·GTSAM·iSAM2의 실시간 인수분해 기법과 결합하자, EKF의 장점이었던 "incremental update"는 더 이상 차별점이 아니었다.

루프 클로저에서 차이가 드러났다. 로봇이 출발점으로 돌아왔을 때 지도 오류를 수정하는 것. EKF는 이 순간 전체 공분산 행렬을 업데이트해야 한다. 비용이 $O(N^2)$. 그래프 최적화는 pose 그래프에 새 엣지 하나를 추가하고 sparse 행렬을 재분해한다. Sparse 구조를 쓰면 비용이 훨씬 낮다.

2010년경을 기점으로 새로운 SLAM 시스템에서 backend로 EKF를 선택하는 경우는 드물어졌다. 특수 제약(매우 제한된 연산 자원, real-time filter 요구)이 있는 경우에만 잔존했다.

> 📜 **예언 vs 실제.** Durrant-Whyte와 Bailey의 [2006년 IEEE Robotics & Automation Magazine 튜토리얼](https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/Durrant-Whyte_Bailey_SLAM-tutorial-I.pdf)은 SLAM의 확장성 문제를 논하며 submap 분해와 information filter가 대규모 환경에서의 해법이 될 것으로 전망했다. Information filter(EKF의 역공분산 형태)는 sparse information matrix를 이용해 landmark가 늘어도 연산이 느려지지 않을 것으로 기대됐다. 실제 전개는 달랐다. Information filter 계열(SEIF 등)은 sparsity를 강제로 유지하는 과정에서 marginalization error가 생겼다. Submap은 일부 시스템에 흡수되었으나 주류 해법이 되지 못했다. 2010년대를 지배한 것은 factor graph + iterative 최적화였다. `[기술변화]`

---

## 4.8 🧭 아직 열린 것

Filter vs Optimization의 공존. EKF가 backend 주력에서 물러났다고 해서 사라진 것은 아니다. 2026년 기준으로 자율주행 일부 구현은 여전히 필터 기반을 선호한다. 최적화 기반 SLAM은 반복 수렴이 필요하고, 실시간 보장이 어려운 경우가 있다. 저비용 임베디드 시스템에서 sparse EKF나 UKF가 재등장하는 사례가 있다. "필터는 죽었다"는 선언은 정확하지 않다. 용도와 제약에 따라 공존한다.

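Non-Gaussian uncertainty. The most fundamental assumption of the EKF is that uncertainty follows a Gaussian distribution. Real sensor errors are often multimodal or heavy-tailed. Under perceptual aliasing in particular (different places that look the same, as in symmetric or repetitive environments), a single Gaussian drastically oversimplifies the actual uncertainty. Particle filters can represent non-Gaussian beliefs in principle but are impractical for high-dimensional states. Stein particles, normalizing flows, and learning-based uncertainty estimation are being tried, but as of 2026 their validation in real-time SLAM remains limited.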

---

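While EKF-SLAM was hitting its real-time ceiling at around 100 landmarks, Andrew Davison, then at Oxford and later at Imperial College, was using that limited number to prove something else: real-time operation with nothing but a single camera and no additional sensors. The numerical limit stayed where it was, but the way of working within it changed.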

---

# Ch.5 — MonoSLAM → PTAM: 실시간의 몽상과 분리 혁명

앞 챕터는 EKF-SLAM이 어떻게 확률론적으로 일관된 지도 구축 방법을 완성했는지, 그리고 그 공분산 행렬이 landmark 수 $N$에 대해 $O(N^2)$로 커지는 구조적 벽에 부딪혔는지를 보였다. 설계가 그렇게 생긴 결과였다. Davison과 Klein은 여기서부터 각자 다른 방향으로 걸었다.

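In 2003, Davison plugged a single webcam into a laptop in a lab at Oxford, shortly before his move to Imperial College. He took over, unchanged, the probabilistic spatial-relationship mathematics laid down by Smith and Cheeseman in 1988 and the EKF-SLAM framework that Leonard and Durrant-Whyte built on top of it, but his only sensor was one camera. With no IMU, no stereo, and no laser, he attached a Shi-Tomasi (1994) corner detector to a Kalman predict-update loop and ran it in real time. By the standards of the day it was a reckless combination.

Four years later, in 2007, the same year the journal version of MonoSLAM appeared, Klein and Murray at Oxford gave a different answer. They split tracking and mapping into two threads. That separation became the backbone of Visual SLAM for the following decade.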

---

## 1. 2003년의 데모

2003년 ICCV에서 Davison이 공개한 [Real-Time Simultaneous Localisation and Mapping with a Single Camera](https://doi.org/10.1109/ICCV.2003.1238654)는 장내를 술렁이게 했다. *그것이 가능하다는 것* 자체가 충격이었다.

당시 SLAM 분야의 주류는 레이저 센서였다. LiDAR는 2D 거리를 직접 제공했고, 스테레오 카메라는 픽셀 수준에서 깊이를 복원했다. 단안 카메라는 깊이 정보 자체가 없었다. 단안으로 3D 구조를 추정하려면 최소 두 프레임이 필요했고, 초기 깊이 추정의 불확실성이 EKF 상태 벡터 전체로 전파되었다. 이론적으로 가능했지만 실시간으로 돌린다는 것은 별개의 문제였다.

Davison이 단안을 고른 것은 실용적 제약 때문이었다. IMU는 추가 하드웨어였고, 스테레오는 캘리브레이션 부담이 있었다. 그가 원한 것은 "카메라 하나로 증명하는 것"이었다. 증명에 성공하면 나머지는 얹을 수 있었다. 그 논리는 맞았다. 틀린 것은 EKF가 그 "나머지"를 실제로 수용할 수 있는 구조인지였다.

---

## 2. EKF의 아름다움과 벽

[MonoSLAM](https://doi.org/10.1109/TPAMI.2007.1049), published in IEEE PAMI in 2007 and co-authored by Davison, Ian Reid, Nicholas Molton, and Olivier Stasse, was the full journal write-up of the ICCV 2003 demo.

MonoSLAM의 상태 벡터는 [Smith-Cheeseman(1988)](https://arxiv.org/abs/1304.3111)과 [Leonard-Durrant-Whyte(1991)](https://ieeexplore.ieee.org/document/174711/)의 정식(Ch.4)을 단안 카메라에 직접 이식했다. 카메라 상태 $\mathbf{x}_v \in \mathbb{R}^{13}$ — 위치 3, 사원수 방향 4, 속도 3, 각속도 3 — 과 landmark 집합 $\mathbf{y}_i \in \mathbb{R}^3$를 하나의 벡터 $\mathbf{x} = (\mathbf{x}_v^\top, \mathbf{y}_1^\top, \ldots, \mathbf{y}_N^\top)^\top \in \mathbb{R}^{13+3N}$에 담고, 그 전체 공분산 $(13+3N)\times(13+3N)$ 행렬 $\mathbf{P}$를 매 프레임 predict-update 루프로 유지했다. predict 단계에서는 카메라 운동 모델 $f$의 자코비안 $\mathbf{F}$로 공분산을 전파했고($\mathbf{P}^- = \mathbf{F}\mathbf{P}\mathbf{F}^\top + \mathbf{Q}$), update 단계에서는 투영 함수의 자코비안 $\mathbf{H}_i$로 칼만 이득을 계산해 상태와 공분산을 갱신했다. EKF predict-update 수식 자체는 Ch.4 §4.3의 것과 동일하다. 달라진 것은 상태 벡터 안에 카메라 속도·각속도가 함께 들어간 점이었다(이동 물체인 카메라의 동역학 모델이 필요했기 때문이다).

공분산 갱신 $(\mathbf{I} - \mathbf{K}_i\mathbf{H}_i)\mathbf{P}^-$의 지배 비용은 $(13+3N)^2$ 행렬 곱셈으로, landmark 수 $N$에 대해 $O(N^2)$였다. 논문 §III은 30 Hz 실시간 처리에서 유지 가능한 feature 수의 상한이 "약 100개" 수준이라고 명시한다.

> 🔗 **차용.** MonoSLAM의 EKF 상태 벡터 구조는 Smith-Cheeseman-Durrant-Whyte(1988-1991)의 확률적 공간관계 표현을 단안 카메라에 직접 이식한 것이다. Kalman 필터 자체는 1960년부터 있었지만, 로봇 pose와 landmark를 같은 벡터에 넣는 "augmented state vector" 관행이 확립된 것은 Leonard-Durrant-Whyte 1991의 스타일이었다.

이 숫자는 시스템 한계를 드러냈다. Davison은 이를 알고 있었다. Davison은 논문에서 sub-mapping 전략으로의 확장을 향후 방향으로 제시했다. 그러나 EKF 내부에서 계층적 구조를 만드는 것은 근본적으로 어려웠다. 공분산 행렬이 모든 landmark 간 상관관계를 빠짐없이 담고 있었기 때문이다.

[Shi-Tomasi(1994)](https://doi.org/10.1109/CVPR.1994.323794) 코너가 MonoSLAM의 시각 특징으로 선택된 것도 이 맥락에서 읽힌다. "Good Features to Track"의 선택 기준은 추적하기 좋은 점을 고르는 것이었다. 애초에 추적이 실패할 가능성이 낮은 코너만 상태 벡터에 넣으면 EKF의 갱신이 더 안정적이었다. PAMI 논문은 광각 렌즈에서 매 프레임 약 12개의 특징이 안정적으로 보이도록 map management를 구성한다고 명시한다. 이 한정된 수의 특징이 모두 잘 추적되는 한, EKF는 돌아갔다.

> 🔗 **차용.** PTAM이 아니라 MonoSLAM에서 이미 Shi-Tomasi 1994의 코너 검출기가 쓰였다. "좋은 특징을 선택해서 추적한다"는 설계 철학은 Shi-Tomasi → MonoSLAM → PTAM의 직접적인 계보다.

---

## 3. 2007년, 같은 해

그 한정된 숫자가 EKF의 천장을 드러냈다. Oxford에서 그 천장을 올려다보던 사람이 Klein이었다.

2007년 ISMAR에 Klein과 Murray가 [Parallel Tracking and Mapping for Small AR Workspaces](https://doi.org/10.1109/ISMAR.2007.4538852)를 올렸다. 같은 해 PAMI에는 Davison의 MonoSLAM 정식판이 실렸다. 두 논문이 한 해에 나온 건 우연이 아니었다.

Klein은 당시 Murray 그룹 박사과정이었다. Murray 그룹은 Oxford Active Vision Laboratory의 직계였고, 몇 년 전까지 Davison이 박사과정 학생으로 있던 바로 그 방이었다. Murray는 Davison의 지도교수였다. Klein이 MonoSLAM을 보지 않았을 수 없다. 그가 본 건 EKF가 아니라, 단안 카메라가 실시간으로 돈다는 사실 그 자체였다.

가능성은 확인됐다. 남은 건 "어떻게 확장할 것인가"였다. Klein은 EKF를 버리기로 했다.

---

## 4. 분리

PTAM의 핵심 아이디어는 하나였다. Tracking(카메라 pose 추적)과 Mapping(3D 지도 구축)을 분리해서 두 개의 병렬 스레드로 실행한다.

EKF에서 이 둘은 같은 루프 안에 섞여 있었다. 매 프레임마다 예측-갱신 한 사이클을 돌리면서, 카메라가 움직이면 상태를 예측하고 이미지에서 landmark를 찾으면 다시 갱신했다.

PTAM은 이것을 풀었다. Tracking 스레드는 매 프레임 카메라 pose를 추정하는 일만 한다. 현재 keyframe 집합에서 보이는 3D 점들의 2D 투영과 실제 관측을 매칭해서 pose를 실시간으로 계산한다. Mapping 스레드는 새 keyframe이 추가될 때마다 bundle adjustment를 실행한다. Tracking 스레드가 독립적으로 돌아가기 때문에 Mapping이 느려져도 무방했다.
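The division of labor can be sketched with two threads and a queue. This is a structural illustration only: the per-frame pose estimation and the bundle adjustment are replaced by placeholders, and the names `run_pipeline` and `keyframe_every` are assumptions, not anything from the PTAM code.

```python
import queue
import threading

def run_pipeline(num_frames, keyframe_every=5):
    keyframes = queue.Queue()
    tracked_poses = []
    refined_keyframes = []

    def tracking():
        for frame_id in range(num_frames):
            tracked_poses.append(frame_id)   # stand-in for per-frame pose estimation
            if frame_id % keyframe_every == 0:
                keyframes.put(frame_id)      # hand over a keyframe without waiting for mapping
        keyframes.put(None)                  # sentinel: no more frames

    def mapping():
        while True:
            kf = keyframes.get()
            if kf is None:
                break
            refined_keyframes.append(kf)     # stand-in for bundle adjustment

    threads = [threading.Thread(target=tracking), threading.Thread(target=mapping)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(tracked_poses), refined_keyframes

print(run_pipeline(20))
```

The tracking loop never blocks on the mapping work; it only enqueues keyframes. However slow the mapping placeholder is made, every frame still gets a pose, which is the property the text attributes to the split.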

The bundle adjustment in the Mapping thread minimized the sum of reprojection errors over the keyframe set $\mathcal{K}$ and the 3D point set $\mathcal{P}$:

$$\min_{\{\mathbf{T}_k\}, \{\mathbf{p}_j\}} \sum_{k \in \mathcal{K}} \sum_{j \in \mathcal{P}_k} \rho\!\left(\left\|\mathbf{z}_{kj} - \pi(\mathbf{T}_k,\, \mathbf{p}_j)\right\|^2_{\mathbf{\Sigma}_{kj}}\right)$$

Here $\mathbf{T}_k \in SE(3)$ is the pose of keyframe $k$, $\mathbf{p}_j \in \mathbb{R}^3$ is a 3D point, $\mathcal{P}_k$ is the subset of points observed in keyframe $k$, $\pi$ is the camera projection function, $\mathbf{z}_{kj}$ is the observed pixel coordinate of point $j$ in keyframe $k$, $\mathbf{\Sigma}_{kj}$ is the measurement covariance, and $\rho$ is a robust kernel such as the Huber function. The Mapping thread solved this optimization iteratively with Levenberg-Marquardt. Because it ran asynchronously, it did not affect the real-time behavior of the Tracking thread.
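To make the cost concrete, the sketch below evaluates it for a handful of observations. It assumes a pinhole camera with made-up intrinsics, an identity measurement covariance, and a Huber kernel applied to the residual norm rather than to its square; the function names are hypothetical, and only the cost evaluation is shown, not the Levenberg-Marquardt solve.

```python
import math

def project(R, t, p, fx=500.0, fy=500.0, cx=320.0, cy=240.0):
    """Pinhole projection of world point p into a camera with pose (R, t)."""
    X, Y, Z = (sum(R[i][k] * p[k] for k in range(3)) + t[i] for i in range(3))
    return (fx * X / Z + cx, fy * Y / Z + cy)

def huber(r, delta=1.0):
    """Huber cost on a non-negative residual norm r."""
    return 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)

def ba_cost(poses, points, observations, delta=1.0):
    """Sum of robustified reprojection errors over all (keyframe, point) observations."""
    total = 0.0
    for (k, j), z in observations.items():
        R, t = poses[k]
        u, v = project(R, t, points[j])
        total += huber(math.hypot(z[0] - u, z[1] - v), delta)
    return total

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
poses = {0: (identity, [0.0, 0.0, 0.0])}
points = {0: [0.0, 0.0, 2.0], 1: [1.0, 0.0, 2.0]}
observations = {(0, 0): (320.0, 240.0), (0, 1): (573.0, 244.0)}
cost = ba_cost(poses, points, observations)
print(cost)
```

The second observation is five pixels off, and with `delta=1.0` the Huber kernel charges it linearly instead of quadratically. A bundle adjuster would minimize this cost over the poses and points.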

> 🔗 **차용.** PTAM의 Mapping 스레드에서 실행되는 bundle adjustment는 [Triggs et al. 1999 "Bundle Adjustment — A Modern Synthesis"](https://doi.org/10.1007/3-540-44480-7_21)의 직접 적용이다. 1부에서 다룬 사진측량의 100년 전통이 SLAM backend에 처음으로 제대로 자리를 잡은 지점이 여기다. EKF에서는 공분산 행렬의 크기 제약 때문에 전체 BA가 불가능했다. 스레드 분리로 그 제약이 사라졌다.

이 분리는 단순해 보이지만 결과는 달랐다. Mapping 스레드가 비동기로 bundle adjustment를 실행하기 때문에, 지도에 들어갈 수 있는 landmark 수가 EKF의 $O(N^2)$ 제약을 벗어났다. PTAM이 사용한 keyframe의 수는 수백 개였다. 각 keyframe에는 수백 개의 patch feature가 있었다. MonoSLAM의 수십 landmark 규모와는 다른 세계였다.

초기 맵 구축 방법도 달랐다. PTAM은 사용자가 카메라를 천천히 움직이는 초기화 단계에서 [Nistér 2004](https://doi.org/10.1109/TPAMI.2004.17)의 5-point 알고리즘 계열(PTAM 논문은 그 후속인 Stewénius·Engels·Nistér 2006을 인용)로 essential matrix를 추정하고, 첫 keyframe 쌍에서 초기 3D 구조를 복원했다. 이것 역시 차용이었다.

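The essential matrix $\mathbf{E}$ is a $3\times 3$ matrix that captures the pure geometric relationship between two camera frames; for a pair of corresponding points $(\mathbf{p}, \mathbf{p}')$ in normalized image coordinates it satisfies ${\mathbf{p}'}^\top \mathbf{E}\, \mathbf{p} = 0$. It factors as $\mathbf{E} = \mathbf{t}_\times \mathbf{R}$ (where $\mathbf{t}_\times$ is the skew-symmetric matrix of the translation and $\mathbf{R}$ is the rotation). Rotation contributes three degrees of freedom and translation three, and the overall scale is unobservable, so $\mathbf{E}$ has five degrees of freedom. Five correspondences are therefore the minimum, and they yield not a unique solution but a finite set of candidates (up to 10 real solutions), which are disambiguated with additional points. Nistér's contribution was to reduce the five-point equations to a tenth-degree polynomial that can be solved efficiently enough to run inside a RANSAC loop in real time; the Gröbner-basis formulation came with the follow-up by Stewénius, Engels, and Nistér (2006). PTAM used this family of solvers together with RANSAC in its initialization stage to estimate the relative pose between the first two keyframes and to recover the initial 3D point cloud by triangulation.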
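The epipolar constraint can be checked numerically. The sketch below builds $\mathbf{E} = \mathbf{t}_\times \mathbf{R}$ for an arbitrary relative pose and evaluates the constraint on a synthetic correspondence; it does not implement a five-point solver, and the helper names and numbers are illustrative assumptions.

```python
import math

def skew(t):
    """Skew-symmetric matrix such that skew(t) @ p equals the cross product t x p."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def rot_y(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def essential(R, t):
    return matmul(skew(t), R)

def epipolar_residual(E, p, p_prime):
    """Evaluate p'^T E p for homogeneous normalized image points."""
    Ep = [sum(E[i][k] * p[k] for k in range(3)) for i in range(3)]
    return sum(p_prime[i] * Ep[i] for i in range(3))

def observe(R, t, X):
    """Normalized image coordinates of 3D point X in camera 1 and in camera 2."""
    X2 = [sum(R[i][k] * X[k] for k in range(3)) + t[i] for i in range(3)]
    return ([X[0] / X[2], X[1] / X[2], 1.0],
            [X2[0] / X2[2], X2[1] / X2[2], 1.0])

R, t = rot_y(0.1), [1.0, 0.0, 0.2]
E = essential(R, t)
p, p_prime = observe(R, t, [0.3, -0.2, 4.0])
residual = epipolar_residual(E, p, p_prime)
print(residual)
```

For a true correspondence the residual vanishes up to floating-point error, while mismatched points generally give a clearly non-zero value. RANSAC uses exactly this kind of residual to score candidate solutions.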

> 🔗 **차용.** PTAM의 5-point essential matrix 초기화는 David Nistér 2004 "An Efficient Solution to the Five-Point Relative Pose Problem"이 열어 놓은 minimal-solver 계보를 따른다(PTAM 논문은 그 후속 Stewénius·Engels·Nistér 2006 ISPRS를 직접 인용). 5-point solver는 단안 카메라의 초기 맵 구축에 필요한 최소 대응쌍을 사용하는 minimal solver였고, PTAM은 이 솔버를 RANSAC 루프에 태워 초기 두 keyframe의 상대 pose를 실시간에 가깝게 추정했다.

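> 🔗 **Borrowing.** PTAM's keyframe structure connects to the submap idea discussed in Ch.4 §4.4 (Chong-Kleeman 1999). The notion that "if the whole map is too hard to optimize at once, split it into local units" is expressed in PTAM as a set of keyframes. The covisibility graph of the later ORB-SLAM is a more refined version of this keyframe management.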

---

## 5. 새 아키텍처의 확산

PTAM은 AR(증강현실) 워크스페이스를 대상으로 설계되었다. 논문 제목에도 "Small AR Workspaces"가 명시되어 있다. Tracking 스레드의 재현성이 좋았고, 실시간성이 확실했기 때문에 AR 응용에 바로 쓸 수 있었다.

상업적 흡수는 빠르게 일어났다. 2010년대 초 Metaio(독일 AR 스타트업, 2015년 Apple에 인수)와 Qualcomm의 Vuforia SDK는 PTAM과 유사한 tracking/mapping 분리 구조를 채용했다. 소비자 스마트폰에서 처음으로 안정적인 planar AR이 돌아갔다.

학계에서의 영향은 더 직접적이었다. 2015년 Raul Mur-Artal, J.M.M. Montiel, Juan D. Tardós가 발표한 [ORB-SLAM](https://arxiv.org/abs/1502.00956)은 PTAM의 구조를 계승했다. 특징점은 patch에서 ORB 디스크립터로 바꾸고, keyframe 관리는 covisibility graph로 정교화했으며, loop closure를 새로 얹었다. PTAM이 없었으면 ORB-SLAM의 설계도가 달랐을 것이다.

2018년 Qin, Li, Shen의 [VINS-Mono](https://arxiv.org/abs/1708.03852) 역시 sliding window 최적화 + loop closure의 이중 스레드 구조를 갖는다. tracking/mapping 분리의 계보가 VIO로 확장된 사례다.

---

## 6. Davison vs Klein & Murray — 관점 비교

2007년에 두 논문이 나왔다. MonoSLAM PAMI는 2003년 데모의 완성판이었다. PTAM은 같은 해에 MonoSLAM의 한계를 돌파하는 새 구조로 나왔다.

MonoSLAM이 EKF를 붙들고 있었던 이유는 확률론적 일관성(consistency)에 있었다. EKF는 상태의 불확실성을 공분산 행렬로 명시적으로 관리했다. 지도의 각 landmark가 얼마나 불확실한지, landmark 간 공분산이 어떻게 연결되는지를 수학이 추적했다. 이 관점에서 bundle adjustment는 최소자승 최적화였고, 불확실성 표현을 줄이는 대신 확장성을 얻는 거래로 읽혔다.

Klein & Murray는 그 대가를 기꺼이 치렀다. AR 응용에서 중요한 것은 카메라 pose의 실시간 추적이었다. 지도의 불확실성을 센티미터 단위로 추적할 필요는 없었다. Bundle adjustment로 지도를 주기적으로 refine하면 충분했다.

이후 SLAM 분야의 방향은 이 거래 쪽으로 기울었다. 2010년대 이후 graph-based 최적화와 bundle adjustment가 주류가 되었고, EKF-SLAM은 계산 자원이 극도로 제한된 응용 외에서는 대부분 전면에서 물러났다. 다만 MonoSLAM이 붙들었던 확률론적 관심사가 사라진 것은 아니었다. Davison의 lab은 PTAM 계보로 건너뛰는 대신 이후 몇 단계에 걸쳐 factor graph 기반 추정, 그리고 Gaussian Belief Propagation(GBP)·Robot Web 쪽으로 옮겨갔다. 23년 뒤 SLAM Handbook Ch.18에서 Davison은 같은 흐름을 EKF→BA→factor graph→GBP로 이어지는 representation 변경의 연속으로 해석한다. 본인이 MonoSLAM을 직접 호명해 평가하는 대목은 없고, 각 표현 교체가 시스템 전체의 재설계를 유발한다는 일반 원리로 치환해 서술한다.

---

## 📜 예언 vs 실제

> **Davison 2007 PAMI MonoSLAM**: Davison은 Conclusion에서 더 큰 실내·실외 환경, 더 빠른 움직임, 가림·조명 변화가 있는 복잡한 장면을 다음 과제로 꼽았다. 구체 수단으로 sub-map 전략과 100 Hz 이상의 고프레임률 CMOS 카메라를 거론했고, sparse map을 "higher-order entities"(표면 등)의 dense 표현으로 확장할 여지도 함께 언급했다.
>
> 이 예측들의 운명은 각기 달랐다. Sub-map 아이디어는 PTAM의 keyframe 구조와 ORB-SLAM의 covisibility graph를 거쳐 부분적으로 흡수되었다. 그러나 EKF를 유지하면서 계층적 확장을 달성한 시스템은 나오지 않았다 — 계층화는 BA 기반 아키텍처 전환과 함께 왔다. 고프레임률 카메라는 2010년대 이벤트 카메라 연구에서 다른 경로로 구체화되었다. 동적 장면 강건성은 2026년 기준 여전히 열려 있다. DynaSLAM, FlowSLAM 등 여러 시도가 있었지만 "기본 파이프라인에 포함된 해법"은 아직 없다. IMU 통합은 Davison이 Future Work에서 직접 지목하진 않았지만(관련 연구는 논문 본문에서 참조) 2010년대 Visual-Inertial Odometry(VIO) 연구 붐이 맡은 방향이다. 확률론적 일관성이라는 관심사 자체는 폐기되지 않고 factor graph·GBP 쪽으로 옮겨갔다 — Davison 본인은 23년 뒤 Handbook Ch.18에서 이 이동을 "representation 변경의 연속"으로 묘사하며 MonoSLAM을 계보의 한 단계로 재배치한다. `[진행형]`

> **Klein & Murray 2007 PTAM**: Klein과 Murray는 §8(Failure modes / Mapping inadequacies)에서 시스템의 한계로 corner-기반 추적의 모션 블러 취약성, point cloud 중심의 지도가 가진 기하 이해 부족, 그리고 "not designed to close large loops in the SLAM sense"를 열거했다. 즉, 큰 루프의 전역 일관성 확보가 PTAM의 설계 범위 밖임을 분명히 했다.
>
> 2015년 ORB-SLAM은 이 한계들을 정면으로 겨냥했다. [DBoW2](http://doriangalvez.com/papers/GalvezTRO12.pdf) 기반 appearance loop closure와 covisibility graph 기반 keyframe 관리가 얹혔고, 특징은 patch 대신 ORB descriptor로 교체됐다. PTAM이 "우리 문제가 아니다"라고 선을 그은 곳에서 ORB-SLAM이 지도 확장을 이어 받은 구도다. Klein & Murray 자신이 명시적으로 "appearance-based loop closure가 답"이라고 적은 것은 아니지만, 한계 지점의 지적이 후속 계보의 출발점으로 정확히 맞았다. `[한계 지점 적중]`

---

## 🧭 아직 열린 것

**Monocular scale 복원.** MonoSLAM부터 PTAM까지, 단안 카메라 시스템은 모두 scale ambiguity를 안고 있다. 이미지 한 장에서 절대 거리를 알 수 없다는 것은 기하학적 사실이다. IMU를 추가하면 중력 방향과 가속도계 판독값으로 scale이 observability를 갖는다. 그러나 IMU 없는 순수 monocular 시스템에서 scale 복원은 2026년에도 근본적으로 해결되지 않았다. 학습 기반 monocular depth estimation([MiDaS](https://arxiv.org/abs/1907.01341), [Depth Anything](https://arxiv.org/abs/2401.10891))이 단일 이미지에서 상대적 깊이를 추정하지만, 이것을 metric scale로 변환하려면 여전히 외부 참조(지면 가정, 사전 알려진 물체 크기 등)가 필요하다.

**단일 VO 시스템의 환경 범용성.** MonoSLAM은 실내 데스크탑 환경만 다루었다. PTAM은 "Small AR Workspaces"라고 스스로 범위를 제한했다. 이후 ORB-SLAM2가 실내·실외·RGB-D를 아우르려 했지만, 조명 변화가 극단적인 환경이나 low-texture 공간에서는 여전히 tracking failure가 발생한다. 단일 파이프라인이 실내 복도, 야외 도심, 야간 환경, 텍스처 없는 흰 벽 전부를 견고하게 처리하는 시스템은 2026년 기준 아직 없다. Multi-modal fusion(카메라 + LiDAR + IMU)이 일부 커버하지만, 카메라 단독 시스템의 범용성은 여전히 미결이다.

**저조도·동적 환경에서의 특징점 추적.** MonoSLAM이 요구했던 것은 충분한 조명과 정적인 장면이었다. 2007년의 PTAM도 마찬가지였다. 2026년 현재 이 두 가정은 여전히 대부분의 feature-based SLAM 시스템에서 암묵적으로 유지된다. 저조도에서 ORB feature는 검출 자체가 실패하고, 움직이는 사람이 많은 장면에서는 dynamic point가 static point로 잘못 분류된다. 이 문제를 학습 기반 optical flow나 semantic segmentation으로 우회하는 시도가 있지만, 실시간 범용 해법으로 자리잡은 시스템은 아직 없다.

---

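The tracking/mapping separation that PTAM established left one thing unsolved. As keyframes piled up, accumulated error blew up at loops. The answer did not arrive in the same year as PTAM. It had already taken shape in 1997, when Feng Lu and Evangelos Milios at York University in Toronto were wrestling with a laser-scan problem.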

---

# Ch.6 — Graph SLAM 혁명

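In 1997, at York University in Toronto, Feng Lu and Evangelos Milios were wrestling with the problem of aligning many laser scans consistently with one another. The EKF was the standard choice, but the two took a different path: model the relative measurements between poses directly as a graph and run least-squares optimization on that graph. The result was a global consistency that Kalman-family methods had not reached.

Lu-Milios were not the sole originators of this direction, however. More than a decade earlier, [Chatila and Laumond (1985)](https://www.semanticscholar.org/paper/Position-referencing-and-consistent-world-modeling-Chatila-Laumond/c34a678e40a7d80cb3683f07fc837179fd9bf3ee) at LAAS had already discussed a mobile robot's reference frames and a consistent world model in the language of smoothing. In 1999, [Gutmann and Konolige](https://www.semanticscholar.org/paper/Incremental-mapping-of-large-cyclic-environments-Gutmann-Konolige/3c1bda51b8ca59f1836ed1b96c485d905804989a) applied pose-graph alignment to incremental mapping of large cyclic environments, and in the early 2000s the Thrun group formalized the approach as the *full SLAM* problem and put it on a path to commercialization. [Folkesson and Christensen (2004)](http://www.hichristensen.net/hic-papers/folkesson-icra2004.pdf), Konolige, and Dellaert followed with formulations of their own. Lu-Milios 1997 is the most cited today because it presented a concrete pipeline, "laser scan alignment + batch least squares", in complete form, not because it opened the direction alone.

If Smith-Cheeseman laid the mathematical foundation of probabilistic maps and Davison proved that real-time monocular SLAM was possible, these parallel contributors were, almost simultaneously, making the several moves that redefined SLAM as a graph inference problem. Klein and Murray's PTAM (2007) obtained real-time performance by separating tracking from mapping, but it was confined to small workspaces: the cost of its bundle adjustment grew as keyframes accumulated, and it was not designed to close large loops. The answer to that problem had been in preparation, in different forms, for a decade already, in labs at York, LAAS, Stanford, and KTH.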

---

## 6.1 레이저 스캔에서 포즈 그래프로: Lu-Milios 1997

[Lu & Milios 1997. "Globally Consistent Range Scan Alignment"](https://doi.org/10.1023/A:1008854305733)이 등장하기 전까지, 연속 레이저 스캔의 정합(alignment)은 ICP(Iterative Closest Point) 계열의 국소 정합으로 이어 붙이는 경우가 많았다. ICP는 두 스캔을 국소적으로 잘 맞추지만, 드리프트가 누적되면 수십 미터 이후 지도가 뒤틀린다. 루프를 다시 돌아왔을 때 출발점과 지도가 맞지 않는다.

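The idea of Lu and Milios was simple. If the robot's pose sequence $x_1, x_2, \ldots, x_n$ forms the nodes and the relative measurement between each pair of poses forms an edge, then map building becomes an energy minimization problem on a graph. Each edge carries the measured relative transform $z_{ij}$ between two poses and its information matrix $\Omega_{ij}$. The total cost function is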

$$F = \sum_{(i,j) \in \mathcal{E}} e_{ij}^T \Omega_{ij} e_{ij}, \quad e_{ij} = z_{ij} - h(x_i, x_j)$$

여기서 $h(x_i, x_j)$는 두 포즈로부터 기대 상대변환을 계산하는 함수이며, $z_{ij}$는 실제 측정된 상대변환, $\Omega_{ij} = \Sigma_{ij}^{-1}$는 측정 불확실성의 역행렬인 정보 행렬이다.

이 공식화의 핵심은 루프 클로저의 자연스러운 포함이다. 나중에 같은 장소를 다시 방문했을 때 얻은 상대 측정값을 그래프에 엣지로 추가하면, 전체 최적화가 그 제약을 반영하여 모든 포즈를 조정한다. EKF에서 루프 클로저는 covariance를 $O(N^2)$ 단위로 갱신하는 무거운 작업이었다. 포즈 그래프에서는 엣지 하나를 추가하는 것으로 충분하다.
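A one-dimensional toy problem shows how a single loop-closure edge redistributes error over the whole trajectory. The sketch assumes scalar poses with $h(x_i, x_j) = x_j - x_i$, fixes $x_0 = 0$ to remove the gauge freedom, and solves the normal equations directly (the problem is linear, so one step is exact). Function names and measurements are invented for illustration.

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def optimize_pose_graph(num_poses, edges):
    """edges: (i, j, z_ij, omega_ij) with e_ij = z_ij - (x_j - x_i); x_0 is fixed at 0."""
    n = num_poses - 1                       # unknowns are x_1 .. x_{num_poses-1}
    H = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for i, j, z, omega in edges:
        jac = {i: 1.0, j: -1.0}             # derivative of e_ij w.r.t. x_i and x_j
        for a, ja in jac.items():
            if a == 0:
                continue
            b[a - 1] -= ja * omega * z
            for c, jc in jac.items():
                if c != 0:
                    H[a - 1][c - 1] += ja * omega * jc
    return [0.0] + solve(H, b)

odometry = [(0, 1, 1.0, 1.0), (1, 2, 1.0, 1.0), (2, 3, 1.0, 1.0)]
loop_closure = [(0, 3, 2.7, 1.0)]           # a later measurement says x_3 is 2.7 from x_0
poses = optimize_pose_graph(4, odometry + loop_closure)
print(poses)
```

Odometry alone would put the last pose at 3.0. Adding the one loop-closure edge pulls every pose back a little, not just the last one, which is the "all poses adjust" behavior described above.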

> 🔗 **차용.** Lu-Milios의 포즈 그래프 최적화 정식화는 [Levenberg(1944)](https://www.ams.org/qam/1944-02-02/S0033-569X-1944-10666-0/)와 [Marquardt(1963)](https://www.stat.cmu.edu/technometrics/70-79/VOL-14-03/v1403757.pdf)의 비선형 최소자승 알고리즘을 기반으로 한다. 수십 년 앞서 비선형 파라미터 추정을 위해 개발된 수치 최적화 기법이 실내 레이저 맵핑의 백엔드에 도착했다.

당시 Lu-Milios의 해법은 모든 포즈를 동시에 푸는 배치(batch) 선형 시스템이었다. 스캔 수가 늘어나면 선형 시스템의 크기도 함께 커진다. 그래서 개념 증명의 성격이 강했다. 그러나 두 가지를 확실히 보여주었다. 전역 일관성은 달성 가능하다. 그리고 그 도구는 필터가 아닌 최적화다. 같은 시기 Gutmann-Konolige는 증분성에, Folkesson-Christensen은 데이터 연관 강건성에, Thrun 그룹은 대규모 실환경 적용에 각자 방점을 찍으며 같은 결론의 각도를 다르게 깎고 있었다.

---

## 6.2 희소성의 발견: 정보 행렬과 포즈 그래프의 확장

Lu-Milios의 아이디어가 발표된 후 5년간, 여러 그룹이 같은 방향에서 확장을 시도했다. 공통된 발견은 정보 행렬(information matrix, $\Omega = \Sigma^{-1}$)의 **희소성(sparsity)**이었다.

EKF-SLAM의 covariance 행렬 $\Sigma$는 조밀(dense)하다. 로봇이 새 landmark를 관측할 때마다 기존 모든 landmark와의 상관관계가 갱신된다. 로봇 포즈를 marginalize한 상태에서 $n$개의 2D landmark가 있으면 $\Sigma$는 $2n \times 2n$ 행렬이고, 갱신 비용은 $O(n^2)$다. 100개 landmark 정도에서 실시간성이 무너지는 이유다.

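The information matrix of a pose graph is different. A non-zero entry appears in block $(i,j)$ of $\Omega$ only when poses $x_i$ and $x_j$ are linked by a direct measurement. During continuous motion only neighboring poses are joined by edges, and distant poses are not directly connected. $\Omega$ therefore has a banded sparse structure that mirrors the graph topology. In a pure odometry run with no loop closures, that structure is exactly block-tridiagonal; each loop closure adds a block away from the band.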
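The sparsity pattern can be printed directly for a small chain. The sketch assumes scalar poses and unit-weight edges, so each "block" is a single number; the helper names are illustrative. Without a prior on one pose this matrix is singular, which does not matter for looking at the pattern.

```python
def chain_edges(num_poses, weight=1.0):
    """Odometry edges linking each pose to its successor."""
    return [(k, k + 1, weight) for k in range(num_poses - 1)]

def information_matrix(num_poses, edges):
    """Accumulate each edge's contribution into the pose-graph information matrix."""
    omega = [[0.0] * num_poses for _ in range(num_poses)]
    for i, j, w in edges:
        omega[i][i] += w
        omega[j][j] += w
        omega[i][j] -= w
        omega[j][i] -= w
    return omega

def off_band_nonzeros(omega):
    """Count non-zero entries outside the tridiagonal band."""
    n = len(omega)
    return sum(1 for i in range(n) for j in range(n)
               if abs(i - j) > 1 and omega[i][j] != 0.0)

odometry_only = information_matrix(6, chain_edges(6))
with_loop = information_matrix(6, chain_edges(6) + [(0, 5, 1.0)])
for row in with_loop:
    print("".join("x" if v != 0.0 else "." for v in row))
```

The printed pattern is a tridiagonal band plus two corner entries contributed by the loop closure between the first and last pose. The number of non-zeros grows with the number of edges, not with the square of the number of poses.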

Sebastian Thrun 그룹의 [Sparse Extended Information Filter(SEIF)](http://www.cs.cmu.edu/~thrun/papers/thrun.tr-seif02.pdf), Edwin Olson의 연구는 이 희소성을 명시적으로 활용하기 시작했다. 희소 선형 대수 풀이기(sparse solver)를 쓰면 계산 비용이 $O(n^2)$에서 크게 줄어들 수 있었다. 실제 복잡도는 그래프 구조에 의존하지만, 로봇이 제한된 지역 내에서 움직이는 현실 시나리오에서는 $O(n \log n)$ 수준이 가능했다.

> 🔗 **차용.** Thrun 그룹의 sparse information filter(SEIF)와 [Eustice의 exactly sparse delayed-state filter](https://web.mit.edu/2.166/www/handouts/eustice_et_al_ieeetro_2006.pdf)는 정보 행렬의 희소성이 필터 기반에서도 활용 가능하다는 것을 보였다. 이 희소성 통찰은 Dellaert의 factor graph 공식화와 Bayes tree 자료구조로 이어지는 맥락을 형성한다.

2006년 ICRA에서 [Olson, Leonard, Teller](https://april.eecs.umich.edu/pdfs/olson2006icra.pdf)는 stochastic gradient descent로 포즈 그래프를 최적화하는 방법을 발표했다. 수렴 보장은 없었다. 그래도 수백 노드 규모에서 충분히 빠르게 돌았고, Olson의 구현 코드는 이후 커뮤니티 전반에 퍼졌다.

---

## 6.3 Factor Graph와 Square Root SAM

2006년 Dellaert와 당시 박사과정이던 Kaess가 발표한 [Square Root SAM](https://doi.org/10.1177/0278364906072768)은 SLAM 백엔드를 이해하는 방식을 다시 썼다. Dellaert는 Georgia Tech에서 확률론적 그래픽 모델(probabilistic graphical model)을 연구해왔다. 그는 SLAM을 베이지안 추론 문제로 보았고, factor graph 위에서 그 추론을 수행하는 것이 가장 자연스럽다고 판단했다.

**Factor graph**(변수와 제약을 노드와 엣지로 표현한 이분 그래프)에서 변수 노드는 로봇 포즈와 landmark의 위치, factor 노드는 관측값 또는 사전 확률(prior)이다. Factor $f_k(x_{i_1}, x_{i_2}, \ldots)$는 연결된 변수들 사이의 확률적 제약을 나타낸다. 전체 결합 확률은

$$p(X) \propto \prod_k f_k(X_k)$$

이며, MAP 추정은 이 확률을 최대화하는 $X^*$를 찾는 것이다. Gaussian factor 하에서 이것은 비선형 최소자승 문제가 된다.

Dellaert의 통찰은 이 최소자승 문제의 구조에서 왔다. Jacobian 행렬 $J$에 QR 분해를 적용하면 상삼각(upper triangular) 행렬 $R$이 남는다. $R^T R = J^T J = \Omega$이며, $R$이 바로 "square root information matrix"다. 이 $R$의 희소 구조는 Jacobian 자체가 아니라 변수 제거(variable elimination) 순서와 factor graph 토폴로지가 결정한다. 적절한 ordering(예: AMD, COLAMD)을 선택하면 fill-in을 최소화하여 희소한 $R$을 얻을 수 있다.
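The relationship between $J$, $R$, and the linearized solve can be spelled out in two lines. The symbols $\delta$ (state update), $b$ (stacked residual), and $d$, $e$ (the rotated right-hand side) are introduced here for illustration.

$$J = Q \begin{bmatrix} R \\ 0 \end{bmatrix}, \qquad J^T J = R^T Q^T Q R = R^T R$$

$$\|J\delta - b\|^2 = \|R\,\delta - d\|^2 + \|e\|^2, \qquad \begin{bmatrix} d \\ e \end{bmatrix} = Q^T b$$

Since $R$ is upper triangular, the minimizer follows from $R\,\delta = d$ by back-substitution, without ever forming $J^T J$ explicitly. Avoiding the squared matrix is where the numerical advantage of the square-root form comes from.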

이 공식화는 EKF의 covariance 갱신보다 수치적으로 안정하다. 지도 전체의 랜드마크와 포즈를 일관된 방식으로 함께 최적화할 수 있으며, 루프 클로저는 새 factor를 추가하는 것으로 표현된다.

---

## 6.4 iSAM과 iSAM2: 온라인 증분 추론

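Square Root SAM was a batch method. Re-factorizing the whole of $J^T J$ every time a new observation arrives costs up to $O(n^3)$ in the dense worst case, and even with sparsity the repeated work grows with the length of the trajectory. For an online robot system this was not practical.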

2008년 [Kaess, Ranganathan, Dellaert가 발표한 **iSAM**(incremental Smoothing and Mapping)](https://www.cs.cmu.edu/~kaess/pub/Kaess08tro.pdf)은 이 문제를 Givens rotation으로 접근했다. 새 변수와 factor가 추가될 때, 기존 QR 분해를 처음부터 다시 수행하는 대신 새 행만 추가하여 Givens rotation으로 $R$을 갱신한다.
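The core operation can be sketched on a tiny dense example: fold one new measurement row into an existing upper-triangular $R$ with Givens rotations instead of refactorizing. This is a toy illustration of the update step under that assumption, not the iSAM implementation; `givens_update` and the numbers are hypothetical.

```python
import math

def givens_update(R, d, new_row, new_rhs):
    """Fold one linearized measurement row into upper-triangular R (modified in place)."""
    n = len(R)
    row, rhs = list(new_row), new_rhs
    for k in range(n):
        if row[k] == 0.0:
            continue                       # sparsity: nothing to eliminate in this column
        r = math.hypot(R[k][k], row[k])
        c, s = R[k][k] / r, row[k] / r
        for j in range(k, n):
            # Rotate row k of R against the new row so that row[k] becomes zero.
            R[k][j], row[j] = c * R[k][j] + s * row[j], -s * R[k][j] + c * row[j]
        d[k], rhs = c * d[k] + s * rhs, -s * d[k] + c * rhs
    return R, d

R = [[2.0, 1.0], [0.0, 3.0]]
d = [1.0, 2.0]
R, d = givens_update(R, d, [1.0, 1.0], 0.5)
# The updated factor must satisfy R^T R = (old R^T R) + a a^T for the new row a.
gram = [[sum(R[k][i] * R[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
print(gram)
```

Columns where the new row is already zero are skipped, so a measurement that touches only a few variables rotates only a few rows of $R$. That locality is what makes the incremental update cheap when the factor is sparse.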

iSAM1의 본질적 한계는 재선형화 스케줄이었다. 비선형 factor를 선형화한 결과로 만든 $R$은 현재 추정값 근처의 1차 근사일 뿐이다. 로봇이 이동하면서 추정값이 선형화 지점에서 멀어지면 근사 오차가 누적된다. iSAM1의 대응은 **주기적 전면 재선형화(periodic full relinearization)**였다. 몇십 스텝마다 전체 factor graph를 처음부터 다시 선형화하고 QR 분해를 처음부터 다시 수행했다. 루프 클로저로 $R$에 채움(fill-in)이 발생해 희소 구조가 손상되는 것은 이 스케줄이 촉발되는 가시적 증상이었지만, 비용의 근본은 "전체를 주기적으로 다시 푼다"는 스케줄 자체에 있었다. 증분적으로 보이던 알고리즘이 주기마다 배치 알고리즘으로 되돌아가는 구조였다.

2012년 [iSAM2](https://doi.org/10.1177/0278364911430419)는 Bayes tree라는 자료구조로 이 문제를 해결했다. Bayes tree는 factor graph에 variable elimination을 적용하여 얻는 chordal Bayes net으로부터 구성되는 트리 구조다. Bayes net의 클리크(clique)를 노드로, 클리크 간 공유 변수(separator)를 엣지로 가진다. 새 factor가 추가될 때 Bayes tree에서 영향받는 클리크를 특정하고, 해당 서브트리만 factor graph로 되돌려 재선형화·재최적화한다. 핵심은 **fluid relinearization**이다. 선형화 오차가 임계값을 넘는 factor만 선택적으로 골라 다시 선형화하고, 그 영향이 Bayes tree의 separator를 타고 필요한 만큼만 전파된다. iSAM1의 "주기마다 전체" 스케줄이 "필요한 factor만, 영향받는 clique만"으로 대체된 셈이다. 루프 클로저가 발생해도 연결 clique 집합이 국소적으로 한정되는 경우가 많아 전체 재계산을 피할 수 있었다.

> 🔗 **차용.** Bayes tree의 자료구조적 아이디어는 확률론적 그래픽 모델 문헌의 junction tree(join tree) 알고리즘 계보를 잇는다—Koller-Friedman의 [*Probabilistic Graphical Models*](https://mitpress.mit.edu/9780262013192/probabilistic-graphical-models/) 같은 표준 교과서가 다루는 제거 순서·chordal 그래프 기반 추론 기법이 대표적이다. 인공지능 추론 커뮤니티의 기법 계열이 실시간 로봇 SLAM에 이식된 것이다.

iSAM2는 [GTSAM(Georgia Tech Smoothing and Mapping)](https://gtsam.org) 라이브러리로 패키징됐다. C++ 코어에 Python 바인딩을 얹은 형태다. Dellaert가 나중에 Google로 옮긴 후에도 GTSAM 개발은 끊기지 않았다. 2026년 기준으로 자율주행, 드론, 로봇팔 보정 등 여러 분야에서 사실상 표준 SLAM 백엔드로 쓰인다.

---

## 6.5 g2o: ROS 생태계의 표준

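While the Georgia Tech group concentrated on refining the theory, Rainer Kümmerle, Giorgio Grisetti, and Wolfram Burgard at Freiburg, together with Hauke Strasdat (Imperial College) and Kurt Konolige (Willow Garage), concentrated on a practical open-source implementation. [g2o](https://doi.org/10.1109/ICRA.2011.5979949) (general graph optimization), which they presented at ICRA 2011, was designed on the principle that "any kind of graph optimization is handled through plug-ins". The author list was itself already a hybrid. The Freiburg robotics tradition of Burgard and Grisetti, Strasdat's experience with monocular SLAM, and the industrial engineering sensibility Konolige brought with him were merged into a single framework.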

g2o의 설계는 세 개념을 분리한다. vertex(변수 노드)와 edge(factor/제약)가 그래프를 구성하고, solver가 희소 선형 시스템을 푼다. 사용자는 vertex 타입과 edge의 오차 함수·Jacobian을 정의하면, g2o가 Gauss-Newton 또는 Levenberg-Marquardt로 전체 최적화를 수행한다. 희소 풀이기는 Cholmod, CSparse, Eigen 중 선택하거나 외부 라이브러리로 교체할 수 있다.

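As ROS (Robot Operating System) established itself as the standard platform for mobile robot research in the early 2010s, g2o became one of the de facto standard SLAM backends. ORB-SLAM and LSD-SLAM adopted g2o directly. Not every well-known system did: gmapping is a Rao-Blackwellized particle filter with no graph-optimization backend, and Cartographer builds its optimization on Ceres Solver. Even so, a researcher setting out to implement pose-graph SLAM from scratch came to consider g2o as a first choice.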

---

## 6.6 왜 분야가 여기로 수렴했나

Chatila-Laumond(1985), Lu-Milios(1997), Gutmann-Konolige(1999), Folkesson-Christensen(2004), Thrun 그룹, Dellaert(2006), Kaess(2012)까지—여러 그룹이 각자의 도구로, 서로 다른 시기에, EKF 백엔드만으로는 충분하지 않다는 같은 결론에 도달했다.

핵심은 문제 모델링 방식의 전환이었다. EKF-SLAM은 현재 상태의 최적 추정값과 불확실성을 유지하면서 과거를 marginalize한다. 이 필터 패러다임에서 과거 포즈는 사라지고, 누적 오차는 현재 추정값 속에 잠복한다. 루프 클로저를 닫으려면 현재 covariance에 무거운 갱신이 필요하다.

그래프 SLAM은 과거 포즈를 버리지 않는다. 포즈·landmark·관측값 모두 그래프에 살아 있고, 루프 클로저는 새 엣지를 추가하는 것으로 표현된다. 재최적화가 전체 궤적을 일관성 있게 조정한다 (이산 keyframe 대신 시간 연속적인 궤적으로 그래프를 재정식화하는 계열은 Ch.7c Continuous-Time SLAM 참조). 이미 지나간 포즈도 수정 대상이 된다는 점이 필터와의 본질적 차이다.

계산 비용도 달랐다. EKF의 갱신 비용은 $O(N^2)$ (landmark 수 $N$에 대해), 정보 저장은 $O(N^2)$다. 그래프 방법은 희소 Cholesky(또는 QR) 분해를 활용하면 복잡도가 크게 줄어든다. 로봇이 제한된 지역 내에서 움직이는 현실 시나리오—희소 연결 그래프—에서 일반적으로 $O(N \log N)$ 수준의 갱신이 가능하다. 대규모 장기 SLAM에서 이 간격은 좁히기 어렵다.

> 📜 **예언 vs 실제.** Dellaert의 Square Root SAM(2006)이 제시한 배치 방식의 한계는 같은 그룹에서 곧바로 증분화 방향으로 이어졌다. 2008년 iSAM이 Givens rotation 기반 증분 갱신으로 이를 다뤘고, 2012년 iSAM2는 Bayes tree로 루프 클로저 상황의 효율성까지 끌어올렸다. GTSAM·Ceres·g2o 모두 같은 구조 위에서 경쟁한다. 세 논문이 동일한 문제 의식을 단계적으로 해소하는 계보를 이뤘다는 점에서, 이 라인은 거의 예고된 궤적대로 실현된 편에 가깝다. `[적중]`

마지널리제이션(marginalization)의 유연성도 한몫했다. 그래프에서 오래된 포즈를 marginalize할 때 그 정보가 남은 변수들에 연결 factor로 보존된다. 필터는 정보를 버렸지만, 그래프는 압축하면서도 정보를 지킬 수 있다. 슬라이딩 윈도우 최적화나 keyframe 선택 같은 공학적 트레이드오프가 여기서 등장한다.

---

## 6.7 비선형성과 강건성: 실무 엔지니어링의 층위

그래프 최적화의 이론적 우아함과 실제 구현 사이에는 간격이 있다. 그 간격을 메우는 작업이 2010년대 SLAM 엔지니어링의 상당 부분을 차지했다.

첫 번째 문제는 초기값 의존성이다. 가우스-뉴턴이나 LM 최적화는 초기 포즈 추정이 참값에서 크게 벗어나 있으면 지역 최솟값(local minimum)에 수렴한다. 루프 클로저에서 잘못된 대응 관계가 섞이면 초기값이 훼손된다. 그래서 루프 클로저 검증과 아웃라이어 rejection이 백엔드 이전 단계의 핵심 작업이 됐다. 이 지역 최솟값 문제 자체를 볼록 완화(SDP)로 우회하여 전역 최적성을 증명 가능한 형태로 푸는 계열은 Ch.6b(Certifiable SLAM)에서 별도로 다룬다.

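The second problem surfaced quickly in practice: standard least squares is vulnerable to outliers. Robust kernels such as the Huber or Cauchy cost reduce the influence of wrong matches. Both g2o and GTSAM let the user choose a robust kernel. Which kernel to use depends on the environment and the sensor characteristics, and in 2026 the choice still rests on the engineer's experience.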
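One way to see what a robust kernel does is to look at the weight it assigns to each residual in iteratively reweighted least squares. The sketch below uses the textbook Huber and Cauchy weight functions; the parameter names `delta` and `c` and their defaults are assumptions for illustration and are not taken from g2o or GTSAM.

```python
def huber_weight(r, delta=1.0):
    """IRLS weight for the Huber kernel: full weight inside delta, 1/|r| decay outside."""
    a = abs(r)
    return 1.0 if a <= delta else delta / a

def cauchy_weight(r, c=1.0):
    """IRLS weight for the Cauchy kernel: smooth decay governed by the scale c."""
    return 1.0 / (1.0 + (r / c) ** 2)

for residual in (0.1, 1.0, 4.0, 20.0):
    print(f"r={residual:5.1f}  huber={huber_weight(residual):.3f}  "
          f"cauchy={cauchy_weight(residual):.3f}")
```

Plain least squares corresponds to a weight of 1 everywhere. Huber leaves small residuals untouched and down-weights large ones gently, while Cauchy suppresses large residuals much more aggressively, which is part of why the choice depends on how heavy-tailed the outliers in a given setup are.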

세 번째 문제는 marginalization 근사다. iSAM2의 Bayes tree는 정확한 증분 추론을 제공하지만, 변수 수가 계속 증가하면 트리가 커진다. 실제 시스템에서는 오래된 포즈를 marginalize하여 트리 크기를 관리한다. 이 marginalization 과정에서 발생하는 fill-in이 information matrix를 조밀하게 만들 수 있다. 어떻게 truncate할지, Prior factor로 어떻게 근사할지가 구현 품질을 가른다.

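> 📜 **Prophecy vs. reality.** The generality that g2o advertised, "any graph optimization problem is handled as a plug-in", was only partly realized inside g2o itself. Systems that exploit richer geometric constraints such as lines and planes, or that fuse IMU data, often built on other backends: VINS-Fusion runs its sliding-window optimization on Ceres Solver, and OpenVINS is a filter-based (MSCKF) estimator rather than a graph optimizer (preintegration, the standard way of placing IMU factors on a graph, is covered in Ch.7b Preintegration). As of 2026 the g2o library itself puts more weight on interface stability and compatibility for existing users than on a broad expansion of built-in factors, and new factor types are typically added on the user side through inheritance, forks, or wrappers. `[진행형]`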

---

## 🧭 아직 열린 것

어느 robust kernel을 선택해야 하는가. Huber, Cauchy, Geman-McClure, DCS 등 여러 선택지가 있지만, 주어진 환경과 센서에 어느 kernel이 최적인지를 사전에 결정하는 원칙적인 방법이 없다. 이 선택은 여전히 엔지니어의 직관과 경험에 의존한다. 학습 기반으로 cost function 자체를 최적화하는 연구가 있으나, 온라인 증분 시스템에 통합하는 것은 풀리지 않은 문제다.

비가우시안 상황을 factor graph 안에서 표현하는 것은 아직 열려 있다. 현재 GTSAM·g2o의 factor는 거의 모두 가우시안 노이즈를 가정한다. 루프 클로저의 오매칭 확률, 다중 가설 포즈 같은 상황을 정확하게 표현하는 것은 이론적으로도 계산적으로도 어렵다. 맥스 믹스처(max-mixture) 모델 등의 시도가 있지만 범용 솔루션은 없다.

Bayes tree는 루프 클로저 수가 적을 때 효율적이다. 수십 킬로미터를 수시간 주행하며 수천 번의 루프 클로저가 발생하는 시나리오에서는 트리 구조가 복잡해지고 메모리 효율이 떨어진다. GTSAM의 실제 자율주행 데이터 적용에서 이 병목이 보고되어 있으며, 계층적 트리 관리나 서브맵 분할과의 결합이 현재 연구 방향 중 하나다.

---

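In the 2010s the backend debate quieted down. With g2o and GTSAM established as de facto standards, researchers' attention moved to what to put on top of the backend. Which features to use, and from how far away to recognize a loop, became the new questions. The front end was the new arena.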

---

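# Ch.6b — Certifiable SLAM: Beyond Local Minima

The lineage Ch.6 recorded, from Lu-Milios to g2o and GTSAM, left one thing behind. Pose graph optimization is a nonconvex problem. The solution Gauss-Newton or LM returns may be a local minimum. Practitioners got by on the folk observation that "with an odometry initial guess it mostly works," yet in some deployments the backend converged to the wrong place and no alarm went off. In 2015 MIT's Luca Carlone began replacing that folklore with mathematics. From Carlone's Lagrangian-duality attempts, through Briales and Gonzalez-Jimenez's Cartan-Sync (2017) and Rosen's SE-Sync (IJRR 2019), to Yang-Carlone's TEASER and Papalia's CORA, this lineage rewrites the SLAM backend from "nonconvex optimization that empirically works" into "a convex surrogate whose global optimality can be certified." Every tool came from outside SLAM: Shor relaxation from operations research, Burer-Monteiro factorization from mathematical optimization, Riemannian optimization from differential geometry, Kirchhoff's Matrix-Tree theorem from graph theory. The names of the people who brought these to one table over ten years are the body of this chapter.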

---

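## 6b.1 The old anxiety of local minima

Ch.6 §6.7 named dependence on the initial guess as the first problem of the graph SLAM backend. Because the cost function is nonconvex over the rotation variables $\boldsymbol{R}_i \in \mathrm{SO}(3)$, Gauss-Newton is pulled into the wrong basin when the initial estimate is far from the truth. The parking-garage example in Handbook §6.1 shows the symptom starkly. From the same input, only one of four random initializations reaches the global minimum that SE-Sync attains; the other three settle in local minima where the floors are visibly folded.

Until the late 2000s the community responded in two ways: trust odometry to secure a good initial guess, or be thorough about loop-closure verification and outlier removal up front. Both worked, but neither was a tool for deciding whether the converged value was the true minimum. The problem Huang and Dissanayake pointed out around 2010 was simple. However good the initial guess, if the data themselves are ambiguous the optimizer can stop at a wrong answer. That PGO is NP-hard was also formalized around then. And still, in the field, g2o mostly worked. This gap, where theory speaks of the worst case and practice sees the average, is where backend theorists dug in during the mid-2010s. Every point where Gauss-Newton stops is *locally* optimal. The gradient is zero and the Hessian is positive definite. Yet the answers are entirely different. The moment the backend signals "converged" is also the moment failure is least visible.

> 🔗 **Borrowing.** The robust kernels of Ch.6 (Huber, Cauchy) and the GNC of this chapter branch from a shared root in the robust statistics and duality results of [Black & Rangarajan (1996)](https://cs.brown.edu/people/mjblack/Papers/ijcv1996.pdf). One side reduced outlier influence by reweighting the cost; the other turned the same principle toward avoiding nonconvexity.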

---

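## 6b.2 Shor relaxation — a weapon from outside

The nonconvexity of PGO comes from the rotation constraint $\boldsymbol{R}_i \in \mathrm{SO}(d)$. The orthogonality condition $\boldsymbol{R}^\top \boldsymbol{R} = \boldsymbol{I}$ is a set of quadratic equations. The determinant condition $\det(\boldsymbol{R}) = +1$ is cubic as written for $d = 3$, but it can be replaced by quadratic handedness constraints on the columns, and SE-Sync simply drops it by relaxing to $\mathrm{O}(d)$. The objective is quadratic too. So PGO falls exactly into the class of **QCQP**s (Quadratically Constrained Quadratic Programs). And for QCQPs a convex relaxation tool had been proven in operations research since 1987: [Naum Shor's 1987 relaxation](https://link.springer.com/article/10.1007/BF01582220).

Shor's idea uses the identity $\boldsymbol{x}^\top \boldsymbol{M}\boldsymbol{x} = \mathrm{tr}(\boldsymbol{M}\boldsymbol{x}\boldsymbol{x}^\top)$ to introduce the lifted variable $\boldsymbol{X} \triangleq \boldsymbol{x}\boldsymbol{x}^\top$, turning the original QCQP into a linear-objective problem over "$\boldsymbol{X} \succeq 0$ and rank-1," and then drops the rank-1 constraint to obtain a convex **SDP**. The search space grows from $n$ to $n(n+1)/2$ dimensions, but convexity is gained.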

$$d^* = \min_{\boldsymbol{X}\in\mathbb{S}^n} \mathrm{tr}(\boldsymbol{C}\boldsymbol{X}) \;\; \text{s.t.} \;\; \mathrm{tr}(\boldsymbol{A}_i\boldsymbol{X})=b_i,\; \boldsymbol{X}\succeq 0.$$

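The usefulness lies in the duality inequality $d^* \le p^*$: the SDP minimum is a lower bound on the minimum of the original QCQP. Given a candidate solution $\hat{\boldsymbol{x}}$, $f(\hat{\boldsymbol{x}}) - d^*$ is an upper bound on that candidate's optimality gap. This is where the name "certifiable" comes from. Even when the problem cannot be solved globally, an upper bound on how bad the solution in hand can be is computable. If the SDP solution $\boldsymbol{X}^*$ turns out to be rank-1, then $\boldsymbol{x}^*$ in $\boldsymbol{X}^* = \boldsymbol{x}^*\boldsymbol{x}^{*\top}$ is the global minimizer of the original QCQP. How often this "favorable situation" occurs in SLAM becomes the subject of the papers that follow.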
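The lower bound and the certificate can be seen on the smallest possible QCQP, one where the relaxation happens to be solvable by an eigendecomposition. The sketch below is a toy illustration and not a PGO solver: it takes $\min \boldsymbol{x}^\top \boldsymbol{C}\boldsymbol{x}$ subject to $\boldsymbol{x}^\top\boldsymbol{x} = 1$, whose Shor relaxation $\min \mathrm{tr}(\boldsymbol{C}\boldsymbol{X})$ subject to $\mathrm{tr}(\boldsymbol{X}) = 1$, $\boldsymbol{X} \succeq 0$ has optimal value $\lambda_{\min}(\boldsymbol{C})$. The random matrix `C` and the helper `suboptimality_gap` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
C = (A + A.T) / 2                      # symmetric cost matrix (toy data)

# Shor relaxation of  min x^T C x  s.t.  x^T x = 1:
#   min tr(C X)  s.t.  tr(X) = 1,  X PSD.
# Its optimal value is the smallest eigenvalue of C.
eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues in ascending order
d_star = eigvals[0]                    # SDP optimum = lower bound on the QCQP
v = eigvecs[:, 0]
X_star = np.outer(v, v)                # an SDP solution that is rank-1

def suboptimality_gap(x_hat):
    """Upper bound f(x_hat) - d* on how far a feasible candidate is from optimal."""
    x_hat = x_hat / np.linalg.norm(x_hat)   # project onto the constraint set
    return x_hat @ C @ x_hat - d_star

gap_random = suboptimality_gap(rng.standard_normal(5))   # some candidate
gap_optimal = suboptimality_gap(v)                        # the recovered solution
```

Here the relaxation is tight, so the gap of the recovered solution is zero and the certificate reads "globally optimal"; any other candidate gets a nonnegative gap. In PGO the SDP has to be solved numerically, but the certificate is read off in the same way.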

The starting point of this lineage is the two papers Carlone presented in 2015 at IROS and ICRA, [Carlone et al. 2015 "Lagrangian duality in 3D SLAM"](https://arxiv.org/abs/1506.00746) and [Carlone & Dellaert 2015 "Planar pose graph optimization"](https://doi.org/10.1109/ICRA.2015.7139264). They showed empirically that the duality gap in 2D PGO is usually zero, and suggested the result could extend to 3D. Carlone had just surveyed g2o and GTSAM initialization techniques in a 2014 TRO survey, and had seen optimization stop at wrong points whenever odometry and loop closures disagreed. The 2015 papers report that "the duality gap is usually zero" and no more; they give no closed-form condition for when it holds.

Around the same time, Cartan-Sync by [Briales & Gonzalez-Jimenez (2017)](https://arxiv.org/abs/1702.03235) pushed the same program for SE(d) synchronization. On the mathematics side, Boumal, Absil and Sepulchre had established Riemannian optimization, and on the optimization side Burer-Monteiro's low-rank SDP factorization had been in place since 2003. The scattered ingredients are assembled in a single paper in 2019.

---

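## 6b.3 SE-Sync — what Rosen 2019 assembled

[SE-Sync by Rosen, Carlone, Bandeira and Leonard (IJRR 2019)](https://arxiv.org/abs/1612.07386) is the canon of certifiable SLAM. Rosen completed his PhD at MIT under John Leonard, and Leonard was the MIT researcher who, with Durrant-Whyte of Ch.4, helped the name "SLAM" take hold in the early 1990s. Co-author Afonso Bandeira, a specialist in the mathematics of SDPs and synchronization, handled the proof that rank-deficient second-order critical points are global. The four backgrounds (robotics, SLAM, mathematical optimization, applied mathematics) say what the paper assembled. And assembly is what the paper did. Shor relaxation, translation elimination, Burer-Monteiro low-rank parameterization, Boumal's Riemannian staircase: ingredients that had each matured for a decade or more in separate lineages were made to mesh on the single problem of PGO.

The assembly has three steps. First, from the observation that translation becomes linear least squares once rotations are fixed, $\boldsymbol{t}$ is eliminated in closed form (Problem 6.2). The graph SLAM lineage of Ch.6 had long known this fact; Carlone stated it explicitly in the 2014 TRO survey, and Rosen made it the first step of the convex relaxation. Second, Shor relaxation is applied to the remaining rotation-only problem $\min_{\boldsymbol{R}\in\mathrm{SO}(d)^n} \mathrm{tr}(\tilde{\boldsymbol{Q}}\boldsymbol{R}^\top\boldsymbol{R})$, lifting it to an SDP (Problem 6.3). Third, because interior-point methods collapse at a few thousand poses if the $dn \times dn$ SDP is solved directly, the Burer-Monteiro reparameterization $\boldsymbol{Z} = \boldsymbol{Y}^\top \boldsymbol{Y}$ turns it into a low-dimensional problem on a Stiefel manifold with no further constraints (Problem 6.4).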

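Two theorems justify the assembly. Theorem 6.1, **exact recovery**: if the measurement noise is below some constant $\beta$, the SDP relaxation has a unique solution $\boldsymbol{Z}^* = \boldsymbol{R}^{*\top}\boldsymbol{R}^*$, a rank-$d$ matrix built from the global minimizer $\boldsymbol{R}^*$ of the original MLE (the block counterpart of the rank-1 case in §6b.2). It was the first quantitative answer to "how much noise can it withstand?" The catch is that $\beta$ depends on the ground truth and is unknown in advance. Theorem 6.2, a result of Boumal et al., guarantees that a second-order critical point found on the Stiefel manifold is a global minimizer whenever it is rank-deficient. Together the two theorems make the Riemannian Staircase possible: start with a small rank, find a second-order critical point, check for rank deficiency, and raise the rank by one if the check fails. Once the rank reaches $dn + 1$, every $\boldsymbol{Y}$ is row rank-deficient, so the procedure must stop in finitely many steps. On practical datasets one step of the staircase is usually enough.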

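On the sphere, torus and garage benchmarks SE-Sync converged at speeds comparable to g2o and GTSAM while also producing an a posteriori certificate. g2o and GTSAM were fast but silent about when to trust the answer; Rosen's algorithm emits one more thing at the end, a suboptimality bound. If that bound is zero, the solution is certifiably globally optimal. Some twenty years after Lu-Milios, the backend could answer "is this solution the true minimum?" with a yes or a no.

> 📜 **Prophecy vs. reality.** In §8.2 of the IJRR 2019 paper, Rosen wrote that "the algebraic simplification we have shown should extend to anisotropic noise, outliers, and a variety of sensor modalities." The prophecy was partly borne out. Holmes-Barfoot's landmark-SLAM extension in 2023, Papalia's CORA extension to range measurements in 2024, and the TEASER family of Yang-Carlone did follow. But the most ambitious extension, "SE-Sync covering the perspective projection of visual SLAM," had not arrived by 2026. A structural barrier surfaced: projection is a rational function, which makes it hard to fold into polynomial optimization. `[technology shift]`

> 🔗 **Borrowing.** The Burer-Monteiro factorization at the heart of SE-Sync is the low-rank SDP method of [Burer & Monteiro (2003)](https://link.springer.com/article/10.1007/s10107-002-0352-8). On top of it, [Boumal-Voroninski-Bandeira (2016)](https://arxiv.org/abs/1605.08101) proved in Riemannian language that second-order critical points are global, and Rosen brought the result into SLAM. From pure mathematics to the robot backend took sixteen years.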

---

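## 6b.4 The unexpected equivalence of the graph Laplacian and Fisher information

Handbook §6.2 asks a different question. Granted a global minimum, how close is that estimate to the truth? The answer runs through the Cramér-Rao Lower Bound and the Fisher Information Matrix. In a simple PGO model with rotations held fixed, the result of Rosen, Khosoussi and Barfoot is, surprisingly, that the FIM reduces exactly to a Kronecker product with the graph's weighted reduced Laplacian.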

$$\mathcal{I} = \boldsymbol{J}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{J} = \boldsymbol{L}_w \otimes \boldsymbol{I}_3.$$

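This means that the graph structure alone, without actual measurements, yields an approximation of estimation accuracy. By Kirchhoff's Matrix-Tree Theorem the determinant of the reduced Laplacian equals the weighted number of spanning trees, which corresponds to D-optimality (the determinant of the information matrix). The algebraic connectivity (Fiedler value) corresponds to E-optimality (worst-case variance). A theorem Kirchhoff proved in 1847 for electrical networks became, some 180 years on, the theoretical basis of measurement selection and active SLAM. In active SLAM, "maximize the FIM" translates into manipulating the Laplacian spectrum.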
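A four-pose toy graph makes the two criteria concrete. The sketch below assumes unit edge weights and anchors pose 0 to form the reduced Laplacian; the helper names are made up for the example. It compares a pure odometry chain with the same chain plus one loop closure.

```python
import numpy as np

def laplacian(n, edges):
    """Weighted graph Laplacian from (i, j, weight) edges."""
    L = np.zeros((n, n))
    for i, j, w in edges:
        L[i, i] += w
        L[j, j] += w
        L[i, j] -= w
        L[j, i] -= w
    return L

def spanning_tree_weight(L):
    """Matrix-Tree Theorem: det of the Laplacian with one row/column removed."""
    return np.linalg.det(L[1:, 1:])

def fiedler_value(L):
    """Algebraic connectivity: second-smallest Laplacian eigenvalue."""
    return np.sort(np.linalg.eigvalsh(L))[1]

n = 4
chain = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0)]   # odometry only
loop = chain + [(3, 0, 1.0)]                       # plus one loop closure

L_chain, L_loop = laplacian(n, chain), laplacian(n, loop)
trees_chain = spanning_tree_weight(L_chain)   # a path has exactly one spanning tree
trees_loop = spanning_tree_weight(L_loop)     # a 4-cycle has four
fiedler_chain = fiedler_value(L_chain)
fiedler_loop = fiedler_value(L_loop)
```

A single loop closure quadruples the spanning-tree count (the D-criterion) and raises the algebraic connectivity (the E-criterion). That is the sense in which candidate loop closures can be ranked from topology alone.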

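[The work of Kasra Khosoussi and Timothy Barfoot from 2014 onward](https://arxiv.org/abs/1709.08601) established this connection. Khosoussi did his PhD in Sydney under Dissanayake and Huang, and later passed through MIT and Toronto. In the form generalized to 3D PGO, a Kronecker combination of the Laplacian with the SE(3) adjoint representation appears, which lets topological and geometric information be handled separately. That the "measurement selection criterion" can be approximated by a Laplacian six times smaller than the full FIM becomes the mathematical basis for the "loop closure selection" that Ch.6 left as a placeholder.

The EKF-SLAM consistency problem noted in Ch.4 §4.8 touches the same ground. The EKF over-confidence that Julier and Uhlmann pointed out in 2001, reinterpreted through the CRLB, says that approximate linearization overestimates the Fisher information. That is why Handbook §6.2 places the FIM material next to convex relaxation. The global minimum and its accuracy have to be treated as a pair.

> 🔗 **Borrowing.** [Kirchhoff's Matrix-Tree Theorem (1847)](https://en.wikipedia.org/wiki/Kirchhoff%27s_theorem) was born as a tool for analyzing electrical networks, passed through combinatorics, was transplanted into the measurement-design literature, and in the 2010s, through Khosoussi, became the language of active perception in SLAM. A 180-year migration route for one theorem.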

---

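## 6b.5 Extensions and limits — TEASER, CORA, and the Lasserre wall

After SE-Sync the front widened in two directions: first, certifiable estimators robust to outliers; second, extended measurement models such as range, landmarks, and anisotropic noise.

The outlier side pressed first. As Ch.6 §6.7 noted, imperfect loop-closure verification lets false matches in, and even with Huber or Cauchy kernels optimization breaks down beyond a certain outlier fraction. Around 2017 the pressure on the certifiable lineage to answer this became clear.

The representative work is [TEASER by Yang, Shi and Carlone (TRO 2020)](https://arxiv.org/abs/2001.07715). In 3D point cloud registration it finds the globally optimal solution even at 99% outliers. It solves a truncated least squares cost inside a GNC wrapper and attaches an SDP relaxation to the rotation subproblem so that a certificate comes with the answer. The key was to split scale, translation and rotation into separate certifiable subproblems, each passing its result on with a global-optimality guarantee. The follow-up [Yang & Carlone (2022)](https://arxiv.org/abs/2109.03349) generalized this with Lasserre moment relaxations and named it "certifiably robust estimation."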

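Range-aided SLAM is the territory of [Papalia et al., CORA (2024)](https://arxiv.org/abs/2403.09295). The range term $(\|\boldsymbol{t}_j - \boldsymbol{t}_i\| - \tilde r_{ij})^2$ is not quadratic as written (the norm inside the square is not even polynomial), so it falls outside QCQP; Papalia brought it back in by lifting with an auxiliary unit bearing vector $\boldsymbol{b}_{ij} \in S^{d-1}$. CORA also showed that a relaxation that is tight for a single robot is in general not exact for multiple robots, narrowing the range of "when does Shor work."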

On the landmark side, [Holmes & Barfoot (2023)](https://arxiv.org/abs/2308.05631) eliminated landmarks in advance with the Schur complement, producing a form SE-Sync can consume as is. That Holmes, Khosoussi and Rosen co-authored Handbook Ch.6 is evidence that this lineage had gathered at one table by 2025.

But a wall appeared as well. Generalizing anisotropic noise and truncated-quadratic outliers as a POP (Polynomial Optimization Problem) calls for Lasserre moment relaxations, and the resulting SDP is **degenerate**: constraint qualification fails and the convergence conditions of the Riemannian Staircase break. Workarounds exist, such as Yang's 2022 sparse monomial basis, but the dedicated solvers are slower than ordinary local solvers. No algorithm yet holds speed and certifiability at once. Visual SLAM and VIO face a deeper barrier, the structural incompatibility of perspective projection and IMU preintegration, which is taken up under 🧭.

> 📜 **Prophecy vs. reality.** At ICRA 2015 Carlone wrote that "a theoretical explanation is needed for why the Lagrangian dual is tight on most instances." Ten years have passed, and the answer has arrived only in part. The exact recovery theorem of Rosen-Carlone-Bandeira-Leonard gave the sufficient condition "noise below $\beta$," but there is no way to compute $\beta$ in advance for a real SLAM instance. As of 2026 an **a priori** condition for when tightness breaks is still replaced by per-instance certificates. `[ongoing]`

---

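## 🧭 Still open

**The boundary where tightness breaks.** SE-Sync's exact recovery theorem gives only the sufficient condition "noise below $\beta$," and there is no method for computing $\beta$ on a real instance. Algorithm design will advance only when tightness can be decided in advance. Systematic study of how the relaxation fails under heavy outliers or extremely sparse graphs is still at an early stage.

**Integration with visual SLAM and VIO.** The perspective projection $\pi(\boldsymbol{X}) = [X/Z, Y/Z]$ is a rational function and does not enter polynomial optimization as is. Multiplying through by the denominator makes it polynomial but adds new variables and auxiliary constraints for every feature, so with the thousands of map points in ORB-SLAM3 the SDP size is outside the real-time regime. Forster's 2015 IMU preintegration is also hard to fold into a POP, because the exponential map and bias drift are entangled. As of 2026 the visual/VIO mainstream of Ch.7, Ch.8 and Ch.13 sits outside certifiable guarantees. It is the largest piece the lineage has not solved.

**Online certification and scale.** SE-Sync is a batch method. Incremental certifiable SLAM, which would re-solve the SDP and refresh the certificate at every new measurement, is not yet mature. The incrementalization that iSAM2 achieved for batch SAM has to be repeated on the certifiable side. Warm starts, rank increments and the composition of partial certificates are all open research, and the speed of moment-relaxation solvers on city-scale graphs remains a problem.

**Outlier-majority.** Current certifiable robust estimators stand on a "minority of outliers" assumption. When the majority is contaminated, multi-hypothesis certification such as list-decodable regression is needed, but even in statistics this is at a starting stage. There was follow-up work by Cheng, Shi and Carlone in 2024, but no standard tool comparable to TEASER exists.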

---

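This whole chapter is a footnote to the "convergence to a local minimum" that Ch.6 passed over in one line (§6.7). The folk observation has been replaced by a ten-year theoretical program. That Ch.6 of *The SLAM Handbook*, co-authored by Carlone, Khosoussi, Rosen, Holmes, Barfoot and Dissanayake, devotes 34 pages to the subject is itself the current weight of the lineage. Over the same ten years the learning-based SLAM of Ch.12, Ch.13 and Ch.16 took another road. One side proves the globality of the solution; the other has a neural network predict the solution directly. Whether the two lineages will meet, or split the field between them, has no answer in 2026. In Ch.19 the 🧭 items of this chapter are harvested as "the blanks in backend theory."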

---

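# Ch.7 — The Feature-based Lineage: The ORB-SLAM Trilogy

The graph SLAM revolution of Ch.6 fixed pose graph optimization as the standard language of SLAM. Kümmerle's g²o (2011) and Kaess's iSAM2 (2012) made iterative optimization feasible on large maps and brought the cost of loop closure down to a practical level. Part 3 begins in that current, with optimization theory running toward completion. Where Part 2 asked "how do we reduce error?", Part 3 starts from the premise that the question already has an answer. The remaining task was the front end: which features to extract, and how to track them.

When Klein and Murray separated tracking and mapping into two threads with PTAM in 2007, it was a lab demo. The idea was sound, but it broke down beyond small indoor scenes. When Raúl Mur-Artal took up that structure at the University of Zaragoza in 2015, he brought three things with it: Rublee's ORB descriptor (2011), Gálvez-López's DBoW2 visual vocabulary (2012), and Strasdat's idea behind the Essential graph (2011). If PTAM was a fast prototype, ORB-SLAM was a ten-year standard.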

---

## 7.1 ORB-SLAM (2015): the tripod of the design

[Mur-Artal, Montiel & Tardós 2015. ORB-SLAM](https://doi.org/10.1109/TRO.2015.2463671) appeared in IEEE Transactions on Robotics. The title is plain: SLAM that uses ORB features. Open the paper, though, and every choice is a design judgment.

The skeleton of the system is three threads: Tracking, Local Mapping, and Loop Closing. PTAM had two (Tracking and Mapping). Mur-Artal added Loop Closing as a third. Loop Closing recognizes places with DBoW2 and optimizes the pose graph over the Essential graph; a full Bundle Adjustment after loop closure came later, with ORB-SLAM2. Thanks to this separation, Tracking stays real-time without waiting for map corrections.

> 🔗 **Borrowing.** PTAM's (Klein & Murray, 2007) Tracking–Mapping split carried straight over into ORB-SLAM's Tracking–LocalMapping. Mur-Artal acknowledges the debt in §3 of the paper. ORB-SLAM extended the two-thread structure to three and isolated loop closure as an independent module.

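Mur-Artal had reasons for picking the ORB (Oriented FAST and Rotated BRIEF) descriptor for the front end. SIFT and SURF were encumbered by patents, and BRIEF was fast but fragile under rotation. ORB adds rotation invariance on top of FAST keypoints and was presented by Rublee et al. at ICCV 2011. It is two orders of magnitude cheaper to compute than SIFT, and being binary it is matched by Hamming distance. It runs in real time on a CPU.

ORB obtains scale invariance through an image pyramid. The original image is downscaled over 8 levels by a scale factor s (1.2 in ORB-SLAM), and FAST keypoints are detected independently at each level. The orientation of a keypoint is defined by the intensity centroid: the first-order moments of pixel intensity within the patch give the centroid, and the resulting angle θ is applied to the BRIEF bit-comparison pairs to produce a rotation-invariant descriptor. The result is a 256-bit binary vector. The similarity between two descriptors is computed as XOR followed by popcount, that is, the Hamming distance.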
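The three mechanical pieces of that paragraph (pyramid scales, intensity-centroid orientation, Hamming matching) fit in a few lines. The sketch below is an illustration under simplifying assumptions and is not the ORB-SLAM code: the patch is square rather than circular, descriptors are random bytes rather than BRIEF tests, and all function names are made up for the example.

```python
import numpy as np

SCALE_FACTOR, N_LEVELS = 1.2, 8
# Downscaling factor of each pyramid level relative to the original image.
level_scales = [SCALE_FACTOR ** i for i in range(N_LEVELS)]

def patch_orientation(patch):
    """Intensity-centroid angle: atan2 of the first-order moments m01, m10."""
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xs = xs - (w - 1) / 2          # coordinates relative to the patch center
    ys = ys - (h - 1) / 2
    m10 = (xs * patch).sum()
    m01 = (ys * patch).sum()
    return np.arctan2(m01, m10)

def hamming_distance(d1, d2):
    """XOR then popcount over two 256-bit descriptors stored as 32 uint8 bytes."""
    return int(np.unpackbits(np.bitwise_xor(d1, d2)).sum())

# A patch that is bright only along its right edge points along +x (angle 0).
patch_right = np.zeros((5, 5))
patch_right[:, -1] = 1.0
theta_right = patch_orientation(patch_right)

# Two descriptors that differ in exactly three bits.
rng = np.random.default_rng(0)
desc_a = rng.integers(0, 256, size=32, dtype=np.uint8)
desc_b = desc_a.copy()
desc_b[0] ^= 0b00000111
dist_ab = hamming_distance(desc_a, desc_b)
```

Because the distance is an integer bit count, matching reduces to cheap bitwise operations, which is the practical reason a binary descriptor keeps the front end on a CPU.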

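> 🔗 **Borrowing.** The descriptor of [Rublee et al. 2011. ORB](https://doi.org/10.1109/ICCV.2011.6126544) became the system's very name. ORB was not designed by the Zaragoza team. Mur-Artal took an existing tool and assembled a pipeline around it. It is a case where a front-end choice served as the system's name for ten years.

The keyframe selection policy differs from PTAM's. PTAM added keyframes aggressively. ORB-SLAM removes redundancy based on the covisibility graph. A **covisibility graph** is a graph whose edge weights are the number of landmarks shared between keyframes. Keyframe pairs sharing 15 or more landmarks are connected. Local Mapping uses this graph to select a local window and runs Bundle Adjustment only inside it.

On KITTI sequence 00 (a 4.5 km loop overall) ORB-SLAM recorded 1.2% translation drift. PTAM, the comparison at the time, cannot close the loop; it has no scale to begin with. That ORB-SLAM closed the loop on the same sequence and absorbed the drift is thanks to the Essential graph and DBoW2.

The **Essential graph** is a subgraph of the covisibility graph. It keeps only edges with 100 or more shared landmarks, the spanning tree, and loop-closure edges. When a loop is detected, this whole graph is optimized as a pose graph. Even with thousands of keyframes, the edges of the Essential graph are sparse, and the optimization finishes within seconds.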
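The two thresholds (15 shared landmarks for a covisibility edge, 100 for a strong edge kept in the Essential graph) can be traced on a toy map. The sketch below assumes each keyframe is given as a set of landmark ids and that the spanning tree and loop edges are supplied from elsewhere; the data and function names are hypothetical and are not taken from the ORB-SLAM source.

```python
from itertools import combinations

def covisibility_edges(observations, min_shared=15):
    """observations: dict keyframe_id -> set of landmark ids it observes."""
    edges = {}
    for a, b in combinations(sorted(observations), 2):
        shared = len(observations[a] & observations[b])
        if shared >= min_shared:
            edges[(a, b)] = shared     # edge weight = number of shared landmarks
    return edges

def essential_edges(cov_edges, spanning_tree, loop_edges, min_strong=100):
    """Strong covisibility edges + spanning tree + loop-closure edges."""
    strong = {e for e, w in cov_edges.items() if w >= min_strong}
    return strong | set(spanning_tree) | set(loop_edges)

# Toy map: four keyframes with overlapping landmark ranges.
observations = {
    0: set(range(0, 200)),
    1: set(range(50, 250)),      # shares 150 landmarks with keyframe 0
    2: set(range(230, 400)),     # shares 20 with keyframe 1
    3: set(range(390, 500)),     # shares only 10 with keyframe 2
}
cov = covisibility_edges(observations)
ess = essential_edges(cov,
                      spanning_tree=[(0, 1), (1, 2), (2, 3)],
                      loop_edges=[(0, 3)])
```

In this toy case the covisibility graph has two edges, of which only one is strong; the Essential graph stays connected through the spanning tree and gains the loop edge. The same pruning, applied to thousands of keyframes, is what keeps the loop-closing pose graph sparse.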

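> 🔗 **Borrowing.** The idea of the Essential graph came from the hierarchical optimization structure of [Strasdat et al. 2011. Double Window Optimisation](https://doi.org/10.1109/ICCV.2011.6126517). Strasdat separated a local window from a global window to cut optimization cost. Mur-Artal generalized this into a sparse pose graph, the Essential graph.

Place recognition for loop closure is handled by DBoW2. [Gálvez-López & Tardós 2012. DBoW2](https://doi.org/10.1109/TRO.2012.2197158) is a vocabulary tree for binary descriptors. ORB descriptors are clustered hierarchically with k-medians (k-means++ seeding) to build a tree-structured vocabulary. With branching factor $k_w$ and depth $L_w$ fixed, the number of leaf nodes (words) is $k_w^{L_w}$. The DBoW2 paper reports training a vocabulary of one million words with $k_w=10$, $L_w=6$, and the public ORB-SLAM implementation uses a vocabulary of similar scale. Each word carries a TF-IDF (Term Frequency–Inverse Document Frequency) weight: the more often a word appears across the keyframe database, the lower its IDF weight, so discriminative words carry more influence. A keyframe is represented by this weighted BoW vector and stored in an inverted index. When a new frame arrives, each descriptor descends the vocabulary tree to its word in $L_w$ levels, comparing against $k_w$ children per level, so the cost is $O(k_w L_w)$ Hamming comparisons and logarithmic in the vocabulary size $k_w^{L_w}$. The inverted index then returns candidate keyframes directly. The whole map is never traversed.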
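The arithmetic behind the vocabulary and its weights is short enough to write out. The sketch below is a hedged illustration, not the DBoW2 API: word ids are plain integers, the three-keyframe database is invented, and the L1 similarity score is written in the normalized form commonly used with bag-of-words vectors.

```python
import math
from collections import Counter

k_w, L_w = 10, 6
n_words = k_w ** L_w            # number of leaves in the vocabulary tree

def idf_weights(database):
    """database: list of word-id lists, one per keyframe. IDF = log(N / n_i)."""
    n = len(database)
    doc_freq = Counter(w for words in database for w in set(words))
    return {w: math.log(n / df) for w, df in doc_freq.items()}

def bow_vector(words, idf):
    """TF-IDF weighted, L1-normalized bag-of-words vector as a dict."""
    tf = Counter(words)
    v = {w: (c / len(words)) * idf.get(w, 0.0) for w, c in tf.items()}
    norm = sum(abs(x) for x in v.values())
    return {w: x / norm for w, x in v.items()} if norm > 0 else v

def l1_score(v1, v2):
    """Similarity in [0, 1]: 1 for identical vectors, 0 for disjoint ones."""
    keys = set(v1) | set(v2)
    return 1.0 - 0.5 * sum(abs(v1.get(k, 0.0) - v2.get(k, 0.0)) for k in keys)

# Hypothetical database: word 1 appears in every keyframe, so it carries no weight.
database = [[1, 2, 3], [1, 2, 4], [1, 5, 6]]
idf = idf_weights(database)
v0, v2 = bow_vector(database[0], idf), bow_vector(database[2], idf)
score_same = l1_score(v0, v0)
score_diff = l1_score(v0, v2)
```

Keyframes 0 and 2 share only the ubiquitous word 1, whose IDF is zero, so their score drops to zero even though they have a word in common. This is the sense in which discriminative words dominate place recognition.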

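The Tracking thread estimates the current pose at every frame. It predicts the pose from the previous frame, matches features around the prediction, and then refines the pose $\mathbf{T}_{cw} \in SE(3)$ by minimizing reprojection error over the 3D–2D correspondences $\{(\mathbf{X}_i, \mathbf{u}_i)\}$. **EPnP** (Efficient Perspective-n-Point), a closed-form PnP solver, comes in inside a RANSAC loop when tracking is lost and the system has to relocalize. In both cases the quantity being driven down is the same: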

$$\mathbf{T}^* = \arg\min_{\mathbf{T}} \sum_i \left\| \mathbf{u}_i - \pi(\mathbf{T}\mathbf{X}_i) \right\|^2$$

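Here $\pi$ is the camera projection function, $\mathbf{X}_i$ the world coordinates of a map point, and $\mathbf{u}_i$ its image coordinates. Tracking optimizes the camera pose only and discards outlier matches. When a new keyframe is inserted, the Local Mapping thread runs g²o-based local bundle adjustment, jointly optimizing the poses of the current keyframe and its covisibility-graph neighbors together with the map points they observe.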
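The cost in the equation above can be evaluated directly for a pinhole camera. The sketch below assumes an undistorted pinhole model with a made-up intrinsic matrix `K` and two made-up map points; it only evaluates the cost at two poses and does not perform the optimization that g²o carries out.

```python
import numpy as np

def project(K, T_cw, X_w):
    """Pinhole projection pi(T X). K: 3x3, T_cw: 4x4, X_w: (N, 3) -> (N, 2) pixels."""
    X_c = (T_cw[:3, :3] @ X_w.T).T + T_cw[:3, 3]   # world -> camera frame
    uv = (K @ X_c.T).T
    return uv[:, :2] / uv[:, 2:3]                   # divide by depth

def reprojection_cost(K, T_cw, X_w, u_obs):
    """Sum of squared pixel residuals ||u_i - pi(T X_i)||^2."""
    r = u_obs - project(K, T_cw, X_w)
    return float((r ** 2).sum())

# Hypothetical intrinsics and map points.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
X_w = np.array([[0.0, 0.0, 2.0],
                [0.2, -0.1, 4.0]])

T_true = np.eye(4)
u_obs = project(K, T_true, X_w)        # noise-free observations

T_off = np.eye(4)
T_off[0, 3] = 0.1                      # pose guess shifted 10 cm along x

cost_true = reprojection_cost(K, T_true, X_w, u_obs)
cost_off = reprojection_cost(K, T_off, X_w, u_obs)
```

The nearer point (2 m) moves 25 pixels under the 10 cm shift and the farther one (4 m) only 12.5, which is why close points constrain translation more strongly than distant ones.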

---

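## 7.2 ORB-SLAM2 (2017) — stereo/RGB-D

ORB-SLAM (2015) was mono-only. With a single camera, scale is unknowable. There is no way to read from image pixels whether "this corridor is 10 m or 100 m." This is the problem that set Mur-Artal and Tardós to work in 2016.

[Mur-Artal & Tardós 2017. ORB-SLAM2](https://doi.org/10.1109/TRO.2017.2705103) solves it by adding stereo and RGB-D. Stereo knows the baseline and so triangulates depth directly. RGB-D gets measurements from a depth sensor. Either way, scale is available.

The structure is the same three threads as mono. Only the front end changes with the sensor type. Stereo extracts ORB from a rectified image pair and obtains depth by left-right matching. Features close to the camera are classified as **stereo landmarks**, and those too far away for a reliable depth estimate are treated as **monocular landmarks** (the paper's own terms are close and far stereo keypoints). This mix uses the strengths of stereo and mono together.

**Stereo initialization**, unlike mono, happens immediately from the first frame. Mono initialization builds the map through the Essential Matrix or a Homography between two frames and leaves scale ambiguous. Stereo computes depth at the first keyframe from the horizontal disparity $d$ between the left and right images, the baseline $b$, and the focal length $f$: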

$$Z = \frac{b \cdot f}{d}$$

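Features whose depth $Z$ is at most the threshold $Z_{\max}=40b$ are registered as 3D map points at once. RGB-D initialization follows the same principle: read the depth value $Z$ of pixel $(u, v)$ from the depth image and obtain 3D coordinates by back-projection. In both cases scale is fixed, so Local BA can run right after the first frame.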
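As a worked instance of $Z = bf/d$ and the $40b$ threshold, the sketch below uses a hypothetical rig with a 10 cm baseline and a 500-pixel focal length. The numbers are illustrative and do not describe any particular dataset.

```python
def stereo_depth(disparity_px, baseline_m, focal_px):
    """Depth from a rectified stereo pair: Z = b * f / d."""
    if disparity_px <= 0:
        return float("inf")            # no measurable parallax
    return baseline_m * focal_px / disparity_px

def is_close_point(depth_m, baseline_m, factor=40.0):
    """Depth within factor * baseline: registered as a 3D point from one frame."""
    return depth_m <= factor * baseline_m

baseline, focal = 0.10, 500.0          # hypothetical rig: 10 cm baseline, f = 500 px
z_near = stereo_depth(25.0, baseline, focal)   # large disparity, near point
z_far = stereo_depth(5.0, baseline, focal)     # small disparity, distant point
near_is_close = is_close_point(z_near, baseline)
far_is_close = is_close_point(z_far, baseline)
```

With this rig the threshold sits at 4 m: a 25-pixel disparity gives 2 m and the point is registered immediately, while a 5-pixel disparity gives 10 m and the point is left for multi-view triangulation. Depth error grows quickly as disparity shrinks, which is the rationale for a cutoff proportional to the baseline.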

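On the Machine Hall 01 sequence of the EuRoC MAV (Micro Aerial Vehicle) dataset, ORB-SLAM2 (stereo) recorded an absolute translation error of 0.035 m in Table II. The same table uses Stereo LSD-SLAM as the comparison, so the precision advantage of the feature-based family at the time is on record in numbers. On KITTI odometry too, ORB-SLAM2 ranked among the top published methods of the day.

On the day in May 2017 that the paper appeared in IEEE TRO, Mur-Artal and Tardós posted the source on GitHub alongside it. Two people from the Zaragoza team had released mono, stereo and RGB-D modes in a single codebase. The GitHub stars later passed several thousand, and the community built ROS wrappers.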

---

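## 7.3 ORB-SLAM3 (2021): Atlas and Visual-Inertial

[Campos et al. 2021. ORB-SLAM3](https://doi.org/10.1109/TRO.2021.3075644), published in IEEE Transactions on Robotics in 2021, has a different author list. The first author is Carlos Campos, and Mur-Artal no longer appears; the co-authors are Richard Elvira, Juan J. Gómez Rodríguez, J. M. M. Montiel and Juan D. Tardós. Campos did his PhD at the University of Zaragoza under Tardós. The lineage had moved down a generation.

ORB-SLAM3 has two core extensions: **Atlas** (multi-map) and the **Visual-Inertial** mode.

Atlas is a structure that maintains several disconnected maps at once. When tracking fails, it closes the current map and starts a new one, and merges the two when the same place is revisited later. In ORB-SLAM and ORB-SLAM2 a tracking failure was fatal. Lose tracking once and you started over. Campos identified this as the limitation he ran into most often throughout his PhD, and Atlas was the answer. ORB-SLAM3 reinitializes after failure and remembers the earlier maps.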

The Visual-Inertial (VI) mode integrates IMU data. Campos adopted, as is, the method Forster et al. proposed at RSS 2015 under the title "IMU Preintegration on Manifold" and extended in IEEE TRO as [On-Manifold Preintegration for Real-Time Visual-Inertial Odometry](https://doi.org/10.1109/TRO.2016.2597321) (online in 2016, in the 2017 volume). The IMU compensates where visual SLAM tends to lose track under fast motion. VI-SLAM also resolves the scale ambiguity of a monocular camera: accelerometer measurements supply absolute scale along with the direction of gravity.

The core of preintegration is to integrate the IMU measurements between keyframes $i$ and $j$ just once. Modeling the accelerometer and gyroscope measurements as $\tilde{\mathbf{a}}_t = \mathbf{a}_t + \mathbf{b}_a + \mathbf{n}_a$ and $\tilde{\boldsymbol{\omega}}_t = \boldsymbol{\omega}_t + \mathbf{b}_g + \mathbf{n}_g$ (bias $\mathbf{b}$, noise $\mathbf{n}$), the relative changes in rotation, velocity and position between the two keyframes are preintegrated as follows:

$$\Delta\mathbf{R}_{ij} = \prod_{k=i}^{j-1} \mathrm{Exp}\bigl((\tilde{\boldsymbol{\omega}}_k - \mathbf{b}_g)\Delta t\bigr)$$
$$\Delta\mathbf{v}_{ij} = \sum_{k=i}^{j-1} \Delta\mathbf{R}_{ik}\,(\tilde{\mathbf{a}}_k - \mathbf{b}_a)\Delta t$$
$$\Delta\mathbf{p}_{ij} = \sum_{k=i}^{j-1}\!\left[\Delta\mathbf{v}_{ik}\Delta t + \tfrac{1}{2}\Delta\mathbf{R}_{ik}\,(\tilde{\mathbf{a}}_k - \mathbf{b}_a)\Delta t^2\right]$$

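Here $\mathrm{Exp}(\cdot)$ is the exponential map of $\mathfrak{so}(3)$. When the bias is updated during BA, the terms are corrected by a first-order linear approximation instead of a full re-integration. ORB-SLAM3 adds these preintegrated terms ($\Delta\mathbf{R}$, $\Delta\mathbf{v}$, $\Delta\mathbf{p}$) to the factor graph as inertial edges and optimizes them together with the visual reprojection residuals.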
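The three sums above translate almost line for line into a loop. The sketch below is a minimal illustration that omits noise propagation and bias Jacobians, and it is not the ORB-SLAM3 or GTSAM implementation; the constant test signals and all function names are assumptions made for the example.

```python
import numpy as np

def hat(w):
    """Skew-symmetric matrix of a 3-vector."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def so3_exp(phi):
    """Exponential map from a rotation vector to SO(3) (Rodrigues formula)."""
    theta = np.linalg.norm(phi)
    K = hat(phi)
    if theta < 1e-8:
        return np.eye(3) + K               # first-order expansion near zero
    return (np.eye(3) + np.sin(theta) / theta * K
            + (1.0 - np.cos(theta)) / theta ** 2 * (K @ K))

def preintegrate(gyro, accel, dt, b_g, b_a):
    """Accumulate Delta R, Delta v, Delta p between two keyframes."""
    dR, dv, dp = np.eye(3), np.zeros(3), np.zeros(3)
    for w, a in zip(gyro, accel):
        a_rot = dR @ (a - b_a)             # Delta R_ik (a_k - b_a)
        dp = dp + dv * dt + 0.5 * a_rot * dt ** 2
        dv = dv + a_rot * dt
        dR = dR @ so3_exp((w - b_g) * dt)
    return dR, dv, dp

# Hypothetical 1 s interval at 200 Hz: constant yaw rate, constant body-x acceleration.
dt, n = 0.005, 200
gyro = np.tile([0.0, 0.0, 0.5], (n, 1))
accel = np.tile([1.0, 0.0, 0.0], (n, 1))
zero = np.zeros(3)
dR, dv, dp = preintegrate(gyro, accel, dt, zero, zero)

# Same acceleration without rotation: plain kinematics, v = 1 m/s and p = 0.5 m.
_, dv_lin, dp_lin = preintegrate(np.zeros((n, 3)), accel, dt, zero, zero)
```

The update order matters: position uses the velocity and rotation accumulated up to step $k$, before they are advanced, exactly as in the summations. Nothing in the loop refers to the pose or velocity at keyframe $i$, which is what lets the result be stored as a single relative factor.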

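> 🔗 **Borrowing.** Campos took the On-Manifold Preintegration formulas of Forster et al. (TRO, online 2016; original form RSS 2015) as the core of ORB-SLAM3's inertial integration. Forster's equations provide a way to integrate consecutive IMU measurements on the SO(3) manifold together with bias correction. ORB-SLAM3 connected those formulas to factor graph optimization.

In average RMSE ATE (absolute trajectory error) over all 11 EuRoC MAV sequences, ORB-SLAM3 (mono-inertial) is reported at 0.043 m in Table II. In the same table VINS-Mono is listed at 0.110 m and Kimera (stereo-inertial) at 0.119 m.

With VI mode and Atlas combined, a drone or handheld device can return to an earlier map even when the lighting changes or tracking is lost. The character of the system had changed.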

---

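## 7.4 Why it is still the baseline in the 2020s

In 2023, conference papers still put ORB-SLAM3 in their comparison tables. When a new method is announced, "how much better than ORB-SLAM3?" is the reference line. There are reasons it remains the reference even though the algorithm stopped in 2021.

Robustness comes first. ORB features tolerate illumination change to a degree, the binary descriptor is fast to compute, and extracting many of them in real time reduces tracking failures. Learned features are more accurate on particular datasets but sometimes collapse in new environments. ORB's behavior is predictable.

There is reproducibility as well. The code is public, ROS integration is good, and thousands of real-world uses are documented. Running ORB-SLAM3 has long been the first step when a lab evaluates a new system. Because a single codebase supports mono, stereo, RGB-D and IMU, "our method vs ORB-SLAM3 (stereo)" and "our method vs ORB-SLAM3 (mono-inertial)" can be compared side by side. One baseline covers several configurations.

Finally, learned alternatives do not beat it consistently. DROID-SLAM (Teed & Deng, 2021) wins over ORB-SLAM3 on many sequences. But as the paper itself reports, large sequences such as EuRoC and TartanAir need a 24 GB-class GPU, and at 8 fps on TartanAir it is not real-time. ORB-SLAM3, by contrast, runs CPU-only, and community reports confirm basic operation on ARM and embedded platforms.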

---

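## 📜 Prophecy vs. reality

> 📜 **Prophecy vs. reality.** In Section IX-C of the 2015 ORB-SLAM paper, Mur-Artal proposed two items of future work. One was "Points at Infinity": using distant points, which lack the parallax to become regular map points, for rotation estimation. The other was "Dense Map Reconstruction": the suggestion that compact keyframe selection provides a good skeleton for dense reconstruction. Seen ten years on, the first direction was partly absorbed by VI-SLAM and later work, and the second ended with the NeRF-SLAM and Gaussian Splatting families of the 2020s realizing the "sparse skeleton + dense overlay" arrangement with different materials. The follow-on modality extensions such as RGB-D, stereo and IMU did not come from that Section; they were added as separate problems in ORB-SLAM2 (2017) and ORB-SLAM3 (2021). `[partial hit]`

> 📜 **Prophecy vs. reality.** In the Conclusions of the 2021 ORB-SLAM3 paper, Campos et al. acknowledged that the main failure mode of ORB-SLAM3 is low-texture environments and proposed, as the next direction, developing photometric techniques suited to the four data association problems (citing endoscopy imagery as an application). Between 2023 and 2025 the trend that stood out ahead of that direction was research grafting learned front ends such as SuperPoint and LightGlue onto ORB-SLAM3, while the integration of photometric methods continued separately in the DSO and LDSO context. The main branch of the official ORB-SLAM3 repository still keeps the traditional ORB descriptor in 2026. The axis of the authors' prophecy (photometric) and the actual interest of the field (learned features) rolled on out of alignment. `[ongoing · diverged]`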

---

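## 🧭 Still open

Long-term map reuse. Atlas made multi-map maintenance possible, but map merging still fails when the lighting has changed a lot. The goal is to recognize a map built in the morning and a place revisited in the evening as the same, yet when appearance changes are large DBoW2's place recognition misses. Research groups that need long-term autonomy in outdoor environments with seasonal change are working on this. As of 2024 there is no complete answer.

The place of a pure-vision baseline. Systems based on learned features have begun to beat ORB-SLAM3 on standard benchmarks. The SuperPoint + SuperGlue combination, LightGlue, and DINOv2-based features show lower error on particular sequences. Generalization is another matter. Cases are reported in which learned features do worse than traditional ORB outside the training distribution. A claim of "consistently better" still needs broader experiments.

Drift in large-scale outdoor settings. ORB-SLAM3 still trails LiDAR SLAM in urban driving and on routes of several kilometers or more. Achieving urban-scale localization with cameras alone in GPS-denied environments is unsolved as of 2026. When changing visual conditions, dynamic objects and textureless stretches combine, drift accumulates. The gap to LiDAR survey precision is narrowing but has not closed.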

---

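In the same years that the ORB-SLAM trilogy set the standard for the feature-based lineage, Newcombe and Engel were making the opposite choice: extract no features and use the brightness of the whole image directly. The two lineages developed side by side throughout the 2010s, each using the other as a comparison and exposing its own limits. While ORB-SLAM3 led the EuRoC benchmark in 2021, DSO had beaten ORB-SLAM2 in the TUM corridors. Same timetable, different starting points.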

---

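# Ch.7b — From a Shaking Sensor to a Constraint: The Invention of IMU Preintegration

Sydney, 2009. Todd Lupton, a PhD student at the ACFR (Australian Centre for Field Robotics), was working on a problem under his advisor Salah Sukkarieh. During aggressive drone maneuvers the IMU spits out measurements at 200 Hz, and the factor graph had no room for all of them. Keyframes arrive a few times a second; how do you bundle the tens or hundreds of IMU measurements in between? The answer Lupton sent to IROS was the seed of preintegration. Six years later, at RSS 2015, when Christian Forster with Davide Scaramuzza, Luca Carlone and Frank Dellaert moved that seed onto the SO(3) manifold, the IMU became a first-class citizen of the factor graph. The inside of the equations that ORB-SLAM3 in Ch.7, VI-DSO in Ch.8, and LIO-SAM and FAST-LIO in Ch.17 dispatch with "we used Forster 2016" is the stage of this chapter.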

---

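## 7b.1 MEMS and the "democratization of sensors"

Preintegration became necessary because IMUs became cheap.

The roots of strapdown inertial navigation lie in 1950s aerospace. The ring laser gyros of submarines and missiles were instruments costing tens of thousands of dollars, out of reach for the robotics community. What changed the current was MEMS (Micro-Electro-Mechanical Systems). Analog Devices' ADXL and InvenSense's MPU series pulled the 6-axis IMU down to a few dollars. The iPhone shipped with an accelerometer in 2007 and gained a gyroscope with the iPhone 4 in 2010; by the early 2010s research drones and handheld rigs carried MEMS IMUs as a matter of course. The moment when billions of smartphones drove the unit price down coincided with the moment when visual SLAM began to take monocular scale ambiguity (Ch.5 §🧭) seriously.

The measurement model is simple. The accelerometer gives the specific force, with gravity mixed in, $\tilde{\mathbf{a}} = \mathbf{R}_w^b(\mathbf{a}^w - \mathbf{g}^w) + \mathbf{b}^a + \boldsymbol{\eta}^a$, and the gyroscope gives the angular velocity $\tilde{\boldsymbol{\omega}} = \boldsymbol{\omega}_b^b + \mathbf{b}^g + \boldsymbol{\eta}^g$. Here $\mathbf{b}$ is the bias and $\boldsymbol{\eta}$ is white noise. What the equations force on us weighed more. Gravity is always mixed in, the bias drifts slowly over time (a random walk), and MEMS noise is high-frequency. The IMU was a demanding companion: it imposes a gravity-aligned world frame, and its bias shifts a little with every change in temperature and power state.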

---

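## 7b.2 The first attempt — Lupton & Sukkarieh (2009 / 2012)

The problem was the time axis of the factor graph. Kaess's iSAM2, covered in Ch.6, takes keyframe-level poses as nodes. But the IMU throws dozens of measurements between keyframes. Turn every measurement into a node and the graph explodes; drop them and the information is gone.

The answer offered by Lupton and Sukkarieh's [Visual-Inertial-Aided Navigation for High-Dynamic Motion (IROS 2009, TRO 2012)](https://doi.org/10.1109/TRO.2011.2170332) was a detour. Numerically integrate the IMU measurements between keyframes $i$ and $j$ *just once* into a relative increment. Make that increment a single factor, and the raw IMU measurements never need to enter the graph. This is where the name "pre-integration" comes from.

The idea was right, but the implementation had two obstacles. The rotation representation was Euler angles, which suffer gimbal lock and do not respect the manifold. Bias was the harder one. Each BA iteration changes the bias estimate, and a changed bias changes the increments. Re-integrating hundreds of measurements per keyframe at every iteration eats into real time, and the first-order bias correction Lupton and Sukkarieh proposed to avoid it lived in the same Euclidean, Euler-angle coordinates. This limit delayed the spread of the idea by six years.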

---

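## 7b.3 The decisive turning point — Forster-Carlone (2015 / 2017)

At RSS 2015 Christian Forster, a PhD student in Scaramuzza's group at the University of Zurich (UZH), published [IMU Preintegration on Manifold for Efficient Visual-Inertial Maximum-a-Posteriori Estimation](https://www.roboticsproceedings.org/rss11/p06.pdf) with Scaramuzza, Carlone (Georgia Tech, later MIT) and Dellaert (Georgia Tech, creator of GTSAM). The extended version, [On-Manifold Preintegration for Real-Time Visual-Inertial Odometry](https://doi.org/10.1109/TRO.2016.2597321), appeared in IEEE TRO in 2017. The author list is itself a lineage. UZH's agile drone experiments, Georgia Tech's GTSAM factor-graph language, and Carlone's optimization theory met in one paper.

There were three redefinitions. The first was to define $\Delta\mathbf{R}_{ij}$ rigorously as a relative rotation on the SO(3) manifold, and to redefine $\Delta\mathbf{v}_{ij}, \Delta\mathbf{p}_{ij}$ so that they are *independent of gravity and the initial state*. These quantities are not physical increments; they are constructed to be mathematically state-independent. As a result the IMU factor could be evaluated knowing only the poses and velocities at the two ends. The second was the right-Jacobian trick that pushes the noise to the end of the exponential map, which let the covariance $\boldsymbol{\Sigma}_{ij}$ be propagated analytically.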

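The real revolution was the third: **first-order Jacobian correction for the bias**. When the bias shifts a little during BA iterations, do not re-integrate the whole increment; correct it to first order with precomputed partial derivatives. It is the same idea as Lupton's Euclidean linearization, but it works on SO(3). Computed once when the interval between keyframes is first integrated, the Jacobians need not be touched again even if graph optimization iterates hundreds of times. A few ms of re-integration shrank to a few μs of Jacobian-vector products. This is where the IMU factor entered real-time BA.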
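The mechanics of the correction can be checked numerically on the simplest term. The sketch below handles only the velocity increment and only the accelerometer bias, where $\Delta\mathbf{v}$ is linear in $\mathbf{b}_a$ and the "first-order" update is in fact exact; the gyroscope-bias case goes through the SO(3) right Jacobian and is genuinely approximate, and it is not shown. The signals, bias values and function names are assumptions made for the example, not Forster's or GTSAM's code.

```python
import numpy as np

def hat(w):
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def so3_exp(phi):
    """Rodrigues formula for the SO(3) exponential map."""
    theta = np.linalg.norm(phi)
    K = hat(phi)
    if theta < 1e-8:
        return np.eye(3) + K
    return (np.eye(3) + np.sin(theta) / theta * K
            + (1.0 - np.cos(theta)) / theta ** 2 * (K @ K))

def preintegrate_velocity(gyro, accel, dt, b_g, b_a):
    """Return Delta v and its Jacobian with respect to the accelerometer bias."""
    dR, dv, J_ba = np.eye(3), np.zeros(3), np.zeros((3, 3))
    for w, a in zip(gyro, accel):
        dv = dv + dR @ (a - b_a) * dt
        J_ba = J_ba - dR * dt              # d(Delta v)/d(b_a) = -sum Delta R_ik dt
        dR = dR @ so3_exp((w - b_g) * dt)
    return dv, J_ba

# Hypothetical IMU segment: 200 samples at 200 Hz.
dt, n = 0.005, 200
gyro = np.tile([0.1, -0.2, 0.3], (n, 1))
accel = np.tile([0.5, 0.0, 9.8], (n, 1))
b_g = np.zeros(3)
b_a0 = np.array([0.01, -0.02, 0.03])       # bias used when first integrating
delta = np.array([0.05, 0.02, -0.04])      # bias update proposed by the optimizer

dv0, J_ba = preintegrate_velocity(gyro, accel, dt, b_g, b_a0)
dv_corrected = dv0 + J_ba @ delta                                   # one mat-vec product
dv_reintegrated, _ = preintegrate_velocity(gyro, accel, dt, b_g, b_a0 + delta)
```

The corrected increment matches the re-integrated one to machine precision, at the cost of a single 3×3 product instead of another pass over 200 samples. The Jacobian is accumulated in the same loop as the increment, so it comes almost for free.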

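The final wedge was Forster's implementation landing in GTSAM as the reference. Later systems did not rewrite the equations. They `#include`d `ImuFactor`.

> 🔗 **Borrowing.** Forster's manifold preintegration uses the SO(3) right-Jacobian machinery that [Barfoot 2017. *State Estimation for Robotics*](https://doi.org/10.1017/9781316671528) later laid out in textbook form. The Lie-group calculus that replaces small rotation perturbations with the exponential map and Jacobians was the common tongue of robotic state estimation, and Forster rewrote IMU preintegration in that language. Where Lupton had been stuck with Euler angles, the same problem came loose once it was moved into the SO(3) dialect.

> 🔗 **Borrowing.** The idea of a first-order bias Jacobian was itself first put forward by [Lupton & Sukkarieh 2012](https://doi.org/10.1109/TRO.2011.2170332). §VIII-B of the TRO version of Forster et al. acknowledges the debt, writing "we follow [Lupton-Sukkarieh] but operate directly on SO(3)." Moving the Euclidean approximation onto the manifold made the same mathematics real-time.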

---

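## 7b.4 The three schools of practical VIO take shape

Once Forster's formulation settled in, VIO systems branched three ways between 2017 and 2022.

The first branch is the filter family, and its roots predate Forster. The starting point was the [MSCKF (Multi-State Constraint Kalman Filter)](https://doi.org/10.1109/ROBOT.2007.364024) that Anastasios Mourikis and Stergios Roumeliotis of the University of Minnesota presented at ICRA 2007. It keeps several past camera poses in the filter state by stochastic cloning and marginalizes out the observed 3D points. It was the first case of running visual-inertial estimation in real time on an EKF skeleton without preintegration. The estimator that NASA JPL's Mars helicopter Ingenuity flew on Mars in 2021 belonged to the same EKF-based visual-inertial tradition. Guoquan Huang's group at the University of Delaware open-sourced the approach in 2020 as [OpenVINS](https://doi.org/10.1109/ICRA40945.2020.9196524).

The second branch is the optimization family. Its flagship is [VINS-Mono](https://doi.org/10.1109/TRO.2018.2853729), published in TRO in 2018 by Shaojie Shen's group at HKUST with PhD student Tong Qin. It takes Forster's formulation as is, plants it as an IMU factor inside a sliding-window tightly-coupled BA, and lays out a procedure that estimates scale and gravity direction separately during initialization. With public code it became the VIO baseline at conferences from 2019 to 2022. When ORB-SLAM3 in Ch.7 is reported at 0.043 m average ATE over the 11 EuRoC sequences, the entry compared at 0.110 m in the same table is VINS-Mono.

The third branch is the direct family. VI-DSO (2018), Basalt (2019) and [DM-VIO (2022)](https://doi.org/10.1109/LRA.2021.3140129), covered in Ch.8, belong here. All came from the Cremers group at TUM; VI-DSO and DM-VIO are layered designs that put Forster's inertial factor on top of DSO's photometric BA. DM-VIO added *delayed marginalization*. Marginalizing too early, before IMU initialization has converged, fixes a wrong prior and causes long-term drift, so it maintains two marginalization priors in parallel and merges them into the final prior after gravity and scale have become observable.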

---

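## 7b.5 Observability — what cannot be seen

A visual-inertial system does not see everything.

The analysis Huang's group has worked out since the early 2010s converges on one conclusion. The null space of a visual-inertial system without external cues is **four-dimensional**: three dimensions of global position and one of yaw around gravity. Absolute coordinates and rotation about the gravity axis can never be known from an IMU and a camera alone. Adding GPS restores position; adding a magnetometer or an external anchor restores yaw. Pure VIO is structurally blind to this four-dimensional subspace.

The interesting part is that *roll and pitch are visible*, because the accelerometer reads the horizontal through gravity. This is also where the monocular scale ambiguity noted in Ch.5 is resolved by attaching an IMU.

The trickier side is degenerate motion. Under pure straight-line translation the global orientation is unobservable; under pure rotation, feature depth; under constant acceleration, the entire monocular scale. These three degenerate motions are why VIO scale wobbles when a drone hovers or a car drives straight at constant speed. Practitioners know this from experience. Scale "locks in" at the moment of takeoff, of braking, of cornering.

> 📜 **Prophecy vs. reality.** In §IX of the 2017 TRO paper, Forster et al. singled out three directions: integrating time synchronization and online extrinsic calibration; validating the bias random-walk assumption in long-term operation; and extending to asynchronous sensors such as event cameras and rolling shutters. As of 2026, the first has become standard, with VINS-Mono, Kalibr and OpenVINS lifting the time offset into the state; the second holds for navigation-grade IMUs, but on consumer MEMS temperature and power fluctuations remain; and for the third, Le Gentil's GP continuous-time preintegration became one branch of the answer. The prophecy largely came true, though not as the single extension the authors drew; it split into three branches. `[partial hit · branched]`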

---

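## 7b.6 The branch into continuous time

At RSS 2021, Cédric Le Gentil and his advisor Teresa Vidal-Calleja at UTS (University of Technology Sydney) published [Continuous Integration over SO(3) for IMU Preintegration](https://roboticsproceedings.org/rss17/p075.pdf). Sydney again. A few kilometers from Lupton's ACFR, the same problem was being looked at from another angle.

Forster's preintegration is discrete. It assumes the signal is piecewise constant between IMU measurements and applies Euler integration. That assumption breaks when asynchronous sensors such as LiDAR or event cameras enter the mix. It is ambiguous which discrete bin a LiDAR point arriving mid-scan belongs to, and interpolation error accumulates. Le Gentil's answer was to model the IMU signal as a **Gaussian Process**, treating angular velocity as a continuous function. The state can be evaluated at any time $\tau$, so asynchronous measurements come in naturally. This direction, which meets the B-spline, STEAM and GPMP lineages, deserves treatment as a lineage of its own.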

---

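## 7b.7 The terrain of borrowing

> 🔗 **Borrowing.** The framework for evaluating and optimizing an IMU factor on a factor graph is the [GTSAM](https://gtsam.org/) tradition of Dellaert, as laid out in Ch.6, unchanged. Forster's `ImuFactor` plugged into GTSAM's `NoiseModelFactor` interface and was optimized side by side with visual reprojection factors over a single `Values` object. It was an inheritance of software structure.

> 🔗 **Borrowing.** Treating the bias as a random walk came from the state-propagation conventions of the Kalman filter recorded in Ch.4. Well before Lupton the navigation community used the model of "include the bias in the state and give it small process noise," and in the preintegration era this was reinterpreted as the bias random-walk factor.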

---

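## 🧭 Still open

**Real-time detection of visual-inertial observability.** The four-dimensional null space and the table of degenerate motions are settled in theory, but a mechanism by which a running system decides "I am in a degenerate segment right now" is unfinished. The FEJ (First-Estimate Jacobian) of the Hesch, Li and Huang line preserves the null space at the linearization point, but as of 2026 there is no standard method for catching the onset and end of degenerate conditions at runtime and feeding them back to the control loop. In systems where drone control and VIO estimation run on the same CPU, this gap leads to real accidents.

**Unifying preintegration and continuous time.** Forster's discrete increments and Le Gentil's GP continuous representation solve the same problem in different mathematical languages. Which representation to lay at the bottom when mixing LiDAR, event and frame cameras is still a matter of engineering choice. B-spline continuous-time BA has given a partial answer, but most deployed systems still use Forster's discrete factor.

**Learned IMU bias models.** The bias random-walk assumption holds for navigation-grade IMUs but breaks on consumer MEMS because of temperature hysteresis and power transients. The TLIO and RoNIN line learned IMU-only odometry with LSTM and Transformer networks, and more recently there have been attempts to model the bias distribution itself with conditional diffusion. How this approach will fit inside the Forster factor, and how far the mathematical elegance of preintegration survives once learning props it up, is the next question.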

---

Lupton이 시드니에서 시작한 idea가 6년 동안 Euler angle의 벽에 갇혀 있었고, Forster가 SO(3)로 옮겨 bias Jacobian의 자물쇠를 풀었고, Le Gentil이 다시 시드니에서 연속시간으로 가지를 쳤다. 세 세대의 작업이 ORB-SLAM3의 한 줄, VI-DSO의 한 줄, LIO-SAM의 한 줄 뒤에 쌓여 있다.

---

# Ch.7c — 시간이 매끈하게 흘러야 할 때: Continuous-Time Trajectory

Ch.7b가 정리한 preintegration은 IMU 측정을 이산 키프레임 사이의 relative factor로 압축하는 공학이었다. 그 압축은 "키프레임"이라는 단위를 전제로 성립한다. 두 키프레임 사이에 100번 들어온 관성 샘플이 하나의 factor로 접히려면, factor의 끝점이 둘 다 명확한 시각을 가져야 한다. 카메라 셔터가 글로벌하게 한 번 열리고 닫히는 시스템에서는 그 가정이 무해하다. 문제가 생기는 자리는 따로 있었다.

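In Toronto in 2012, [Paul Furgale, Timothy Barfoot, and Gabe Sibley](https://asrl.utias.utoronto.ca/~tdb/bib/furgale_iros12.pdf) formalized the question in an ICRA paper. In a single rolling-shutter image, each row is projected from the pose at a different instant. While a spinning LiDAR completes one revolution, the vehicle travels on the order of a meter. The IMU streams at 1 kHz while the camera runs at 30 Hz. The most natural way to bind these sensors into one optimization was to treat pose not as a per-"frame" quantity but as a function of time t. Furgale, Barfoot, and Sibley chose B-splines, and that choice became the official starting point of the branch called continuous-time trajectory estimation.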

10년 뒤, Handbook은 이 갈래를 manifold와 나란히 SLAM의 "기본 도구 2개" 중 하나로 배치한다. 우리 책의 이전 장들이 한 번도 이 도구를 꺼내지 않은 것은 Visual-Inertial의 문법이 대부분 discrete keyframe에서 완성됐기 때문이다. 이 장은 그 공백을 메운다.

---

## 7c.1 Discrete-time의 한계

Ch.7b의 preintegration이 해결한 것은 "IMU가 카메라보다 빠르다"는 단일 축이었다. 해결하지 못하는 축은 네 개가 더 있다.

첫째, rolling shutter. consumer CMOS 카메라는 한 프레임을 위에서 아래로 수십 ms에 걸쳐 읽는다. 빠르게 움직이는 카메라에서 첫 행과 마지막 행은 서로 다른 자세에서 찍힌다. Ch.8의 DSO·LSD-SLAM이 photometric consistency를 가정할 때 이 왜곡은 모델 밖에 있었다. Cremers 그룹이 2019년 [Basalt](https://arxiv.org/abs/1904.06504)에 B-spline 궤적을 올린 이유가 여기에 있다.

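Second, spinning-LiDAR motion distortion. As Ch.17 describes, the Velodyne HDL-64E completes one revolution at 10 Hz. If the vehicle travels at 10 m/s during those 100 ms, the first and last points of a single scan are captured from poses up to 1 m apart. LOAM corrected this distortion indirectly inside its odometry loop, but the principled solution was a trajectory representation that can be queried for the pose at the instant each point was captured.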

셋째, event camera. Ch.18이 기록한 DVS는 픽셀마다 μs 단위로 비동기 이벤트를 쏟는다. 이벤트에는 "프레임"이 없다. [Mueggler et al. 2015](https://arxiv.org/abs/1502.00796)가 SE(3) B-spline 궤적 위에서 event SLAM을 정식화한 것은 다른 선택지가 없어서였다.

넷째, high-rate IMU를 이질 주파수 센서 여럿과 동시 결합하는 일. 한 시스템에 200 Hz IMU, 20 Hz 카메라, 10 Hz LiDAR가 들어오면, 이산 상태 노드를 모든 측정 시각마다 두는 것은 현실적이지 않다. 상태의 수가 측정의 수를 따라가는 순간 factor graph는 부풀어 오른다.

네 문제의 공통 구조는 같다. 측정 시각 $t_i$가 제어되지 않는다. 관측은 아무 때나 들어오고, 추정기는 그 시각의 자세를 알아야 한다. "측정 시각 / 추정 시각 / 질의 시각"의 분리가 continuous-time representation의 본질적 이점이다.

---

## 7c.2 Parametric spline: Furgale 계보

Furgale·Barfoot·Sibley가 2012년에 고른 도구는 B-spline이었다. 궤적을 basis function의 합 $\mathbf{p}(t) = \sum_k \Psi_k(t)\,\mathbf{c}_k$로 쓰고, 계수 $\mathbf{c}_k$를 최적화 변수로 둔다. B-spline의 핵심은 local support다. 한 시점 t에서 0이 아닌 basis는 소수(보통 4개)뿐이고, 나머지는 정확히 0이다. 임의 시각 $t_i$의 자세를 질의하는 비용이 상수고, factor graph sparsity가 그대로 유지된다.
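The local-support property can be made concrete with a small sketch. It assumes a uniform cubic B-spline over scalar coefficients (the paper works on full poses); the function names and the sample coefficients are hypothetical and chosen only for illustration.

```python
def cubic_bspline_weights(u):
    """Weights of the 4 active control points for normalized time u in [0, 1)."""
    return (
        (1 - u) ** 3 / 6.0,
        (3 * u**3 - 6 * u**2 + 4) / 6.0,
        (-3 * u**3 + 3 * u**2 + 3 * u + 1) / 6.0,
        u**3 / 6.0,
    )


def query(control_points, knot_dt, t):
    """Evaluate p(t) = sum_k Psi_k(t) c_k for a uniform cubic B-spline.

    Only control_points[i:i + 4] contribute; every other basis is exactly 0,
    so the cost of one query does not depend on the trajectory length.
    """
    i = int(t // knot_dt)   # index of the active segment
    u = t / knot_dt - i     # position inside that segment
    active = control_points[i:i + 4]
    if len(active) < 4:
        raise ValueError("t is outside the range covered by the control points")
    return sum(w * c for w, c in zip(cubic_bspline_weights(u), active))


coeffs = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0]   # hypothetical 1-D coefficients
value = query(coeffs, knot_dt=0.1, t=0.13)  # touches coeffs[1:5] only
```

Because each measurement at time $t_i$ touches only four coefficients, the Jacobian of its residual has four nonzero blocks, which is why the factor graph stays sparse.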

> 🔗 **차용.** B-spline의 수학적 뼈대는 [de Boor (1978) *A Practical Guide to Splines*](https://link.springer.com/book/10.1007/978-1-4612-6333-3)의 고전이다. Furgale이 한 일은 그 뼈대를 SE(3) 위로 끌어올리고, 계수를 factor graph의 변수 노드로 배치한 것이다. 수치해석 교과서의 도구가 SLAM 최적화로 이식된 경로다.

형식의 약점은 명확했다. 계수 간격을 좁게 잡으면 과적합하고, 넓게 잡으면 빠른 움직임을 놓친다. 간격 선택이 경험 의존이었다. 그리고 linear B-spline을 SE(3)에 그대로 얹으면 보간 결과가 매니폴드를 벗어난다.

In 2013, [Steven Lovegrove et al.](https://www.roboticsproceedings.org/rss09/p11.html), working with Sibley at George Washington University, proposed the cumulative B-spline. Rearranging the basis from a sum into a cumulative product, $T(t) = \Bigl[\prod_k \exp\bigl(\tilde\Psi_k(t) \log(T_k T_{k-1}^{-1})\bigr)\Bigr] \cdot T_0$ with later increments multiplied on the left, keeps every factor inside the Lie group. This form became the base vocabulary of later rolling-shutter, event-camera, and VIO papers. Basalt's spline tooling, [Mueggler event SLAM](https://arxiv.org/abs/1502.00796), and [Kerl et al. 2015 dense rolling shutter VO](https://doi.org/10.1109/ICCV.2015.172) all stand on the cumulative B-spline.
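A minimal sketch of why the cumulative form matters, under a strong simplifying assumption: the group here is SO(2), represented by unit complex numbers, as a stand-in for SE(3). The control angles and function names are hypothetical. A naive weighted sum of rotations falls off the unit circle, while the cumulative product stays on it.

```python
import cmath


def cubic_weights(u):
    """Standard uniform cubic B-spline weights for u in [0, 1)."""
    return ((1 - u) ** 3 / 6.0,
            (3 * u**3 - 6 * u**2 + 4) / 6.0,
            (-3 * u**3 + 3 * u**2 + 3 * u + 1) / 6.0,
            u**3 / 6.0)


def cumulative_weights(u):
    """Suffix sums of the standard weights: B~_j = sum_{k >= j} B_k."""
    w = cubic_weights(u)
    return [sum(w[j:]) for j in range(4)]   # B~_0 is always 1


def naive_blend(rotations, u):
    """Weighted sum of group elements; generally leaves the manifold."""
    return sum(w * r for w, r in zip(cubic_weights(u), rotations))


def cumulative_blend(rotations, u):
    """T(u) = [prod_j exp(B~_j(u) log(T_j T_{j-1}^{-1}))] T_0 on SO(2)."""
    cw = cumulative_weights(u)
    result = rotations[0]
    for j in range(1, 4):
        # log of the increment between neighbouring control rotations
        delta = cmath.phase(rotations[j] * rotations[j - 1].conjugate())
        result = cmath.exp(1j * cw[j] * delta) * result
    return result


controls = [cmath.exp(1j * a) for a in (0.0, 0.5, 1.0, 1.5)]
off_manifold = abs(naive_blend(controls, 0.5))      # strictly below 1
on_manifold = abs(cumulative_blend(controls, 0.5))  # 1 up to rounding
```

SO(2) is commutative, so this sketch hides the ordering issue that the SE(3) version has to respect; it only illustrates the closure property.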

Parametric spline은 계산이 가볍고 코드가 단순하다는 이점 때문에 실시간 VIO와 event 시스템에서 꾸준히 쓰였다. 대신 궤적에 대한 사전 분포(motion prior)를 자연스럽게 얹는 방법이 없었다. 관측이 드문 구간에서 spline은 매끈하긴 하지만 근거 없이 매끈했다. 그 공백을 다른 갈래가 메운다.

---

## 7c.3 SDE 기반 GP: Barfoot 계보와 STEAM

In 2014, Barfoot's group in Toronto opened the second branch. [Barfoot, Tong, Särkkä 2014 "Batch Continuous-Time Trajectory Estimation as Exactly Sparse Gaussian Process Regression"](https://www.roboticsproceedings.org/rss10/p01.pdf) made its claim in the title: treat the trajectory not as a sum of basis functions but as a Gaussian process. The prior over trajectories is given by a kernel $\mathcal{K}(t, t')$, and once observations arrive the posterior closes as a conditional Gaussian.

GP의 순수 형태는 문제가 하나 있다. 관측 수 $N$이 크면 kernel matrix $K$의 역행렬 비용이 $O(N^3)$이다. Barfoot·Tong·Särkkä가 보인 것은 이 비용을 회피할 수 있는 kernel의 가족이 있다는 것이었다. 궤적이 linear time-invariant stochastic differential equation $\dot{\mathbf{x}}(t) = A\mathbf{x}(t) + L\mathbf{w}(t)$의 해로 정의될 때, 그 kernel $K$의 역행렬 $K^{-1}$이 block-tridiagonal 구조를 가진다. factor graph로 읽으면 연속한 상태 노드 사이에만 binary factor가 있고, 멀리 떨어진 노드 사이에는 factor가 없다.
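The sparsity claim can be checked numerically on the simplest member of that kernel family. The sketch below assumes a scalar Wiener process ($A = 0$, $L = 1$), whose kernel is $K(t, t') = q \min(t, t')$; the sample times, the value of `q`, and the helper `invert` are illustrative assumptions, not taken from the paper.

```python
def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting (fine for tiny matrices)."""
    n = len(matrix)
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


times = [1.0, 2.0, 3.0, 4.0, 5.0]
q = 1.0  # hypothetical power spectral density of the driving noise

# Dense kernel matrix: every pair of times is correlated.
kernel = [[q * min(ti, tj) for tj in times] for ti in times]

# Its inverse is tridiagonal: only neighbouring states are coupled.
precision = invert(kernel)
```

With vector-valued states the scalar entries become blocks, which gives the block-tridiagonal structure described above: each nonzero off-diagonal block corresponds to one binary factor between consecutive state nodes.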

> 🔗 **차용.** "GP posterior를 factor graph의 prior로 재해석한다"는 틀은 [Särkkä 2013 *Bayesian Filtering and Smoothing*](https://users.aalto.fi/~ssarkka/pub/cup_book_online_20131111.pdf)이 정리한 SDE-GP 연결을 Barfoot 그룹이 SLAM으로 끌어온 것이다. Rasmussen-Williams의 GP 교과서는 kernel을 닫힌 형식으로 쓰지만, 실시간 SLAM은 sparse inverse를 원한다. Särkkä의 SDE 표현이 그 다리였다.

The practical outcome is **STEAM** (Simultaneous Trajectory Estimation and Mapping). At IROS 2015, [Sean Anderson and Barfoot's "Full STEAM Ahead"](https://www.roboticsproceedings.org/rss11/p45.pdf) formalized STEAM with a constant-velocity prior. The state is augmented to pose $\mathbf{p}(t)$ and velocity $\mathbf{v}(t)$; white noise enters at the acceleration level, so velocity is its integral and pose follows by integrating velocity. In the same year Anderson completed the sparsity proof in a tightened form, and that proof became the backbone of every later continuous-time paper from Barfoot's group.

STEAM의 두 번째 이점이 GP interpolation이었다. 제어점(control pose)을 소수만 두고, 제어점 사이 임의 시각의 자세를 posterior mean으로 질의할 수 있다. spinning LiDAR의 한 scan 안에서 10,000개의 점이 각자 다른 시각에 찍혀도, 제어점은 scan 당 하나만 둔다. 계산량이 관측 수가 아니라 제어점 수에 비례한다.

2019년 Tang·Barfoot의 [STEAM 오픈소스](https://github.com/utiasASRL/steam)가 공개되면서 학계·산업계에서 직접 쓸 수 있는 라이브러리가 됐다. 같은 해 Dellaert 그룹의 GTSAM에도 GP continuous-time factor가 contrib로 들어간다. 두 경로의 수렴이었다.

---

## 7c.4 Lie group 위의 continuous-time

Parametric이든 nonparametric이든 SLAM은 SE(3) 위의 궤적을 원한다. Euclidean 상의 spline·GP를 SE(3)로 끌어올리는 일은 기술적으로 간단하지 않다. tangent space에 기대서 선형 보간을 한 뒤 exponential map으로 매니폴드에 얹는 방식이 통용된다.

B-spline 쪽에서는 [Sommer, Demmel et al. 2020 "Efficient Derivative Computation for Cumulative B-Splines on Lie Groups"](https://arxiv.org/abs/1911.08860)가 SE(3) cumulative spline의 Jacobian을 닫힌 형식으로 정리했다. CVPR에 실린 이 논문은 rolling-shutter VIO·event camera·visual-inertial 시스템에서 실시간 미분이 가능한 B-spline 궤적의 표준 공식을 제공했다. Basalt와 Cremers 그룹 후속 작업이 이 정리 위에 섰다.

GP 쪽에서는 Anderson·Barfoot이 "local variable" 구도를 제안했다. 각 제어 자세 $T_k$ 근처에서 local perturbation $\xi_k(t) = \log(T(t)\,T_k^{-1})$를 정의하고, 그 위에서 GP를 운용한다. 전역 매니폴드 위에서 직접 GP를 정의하는 것은 어렵지만, 각 제어점 근방의 tangent space에서는 euclidean GP가 성립한다. 제어점 사이를 건너뛸 때 adjoint가 등장하는데, 그 수학적 근거는 Ch.7b preintegration의 on-manifold 논의와 같다. 두 도구가 같은 Lie group 문법을 공유한다는 사실이 2015년 이후 분명해졌다.

> 🔗 **차용.** GP를 Lie group local variable로 이식한 경로는 [Anderson-Barfoot 2015 ICRA](https://doi.org/10.1109/ICRA.2015.7138984)가 처음 체계화했다. 이들이 쓴 트릭 — "연속한 두 제어점 사이에서만 GP를 돌리고, 제어점 사이를 건너뛸 때 adjoint로 보정" — 은 이후 continuous-time LiDAR·VIO 논문이 모두 물려받는다.

spline과 GP의 실질적 차이는 motion prior의 유무다. spline은 계수를 직접 추정하고 사전 분포가 없다. GP는 SDE에서 유도된 사전 분포가 constant-velocity 혹은 white-jerk 등으로 내장돼 있다. 관측이 드문 구간에서 GP는 prior가 채우고, spline은 인접 관측이 채운다. 둘을 결합하려는 시도(Johnson et al. 2020)도 있었지만, 실무에선 application에 따라 한쪽을 고른다.

---

## 7c.5 응용으로 내려온 계보: LiDAR와 VIO

이론이 응용으로 내려오는 데 10년이 걸렸다. 2022년을 기점으로 continuous-time이 세 현장에서 사실상 표준이 된다.

첫째, LiDAR motion distortion. Paris의 [Pierre Dellenbach et al. 2022 "CT-ICP"](https://arxiv.org/abs/2109.12979)는 각 scan을 "시작 자세"와 "끝 자세" 두 개로 파라미터화하고 그 사이를 선형 보간했다. 간단한 continuous-time 모델이지만, KITTI·NCLT·Newer College 벤치마크에서 기존 LOAM·FAST-LIO의 정확도를 상회했다. 같은 해 Toronto의 [Keenan Burnett et al. 2022 "Are We Ready for Radar to Replace Lidar?"](https://arxiv.org/abs/2206.05432)와 [STEAM-ICP](https://github.com/utiasASRL/steam_icp)가 GP 기반 continuous-time을 Aeva FMCW LiDAR에 적용했다. Aeva 센서가 각 점마다 도플러 속도를 함께 출력하는데, 이 속도는 STEAM의 속도 상태와 직접 대응한다. continuous-time 표현이 아니었다면 쓸 방법이 없는 정보였다.
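The begin-pose and end-pose parameterization can be sketched as follows. This is a planar simplification, assuming poses of the form (x, y, yaw) rather than full SE(3), and a per-point timestamp fraction `alpha` in [0, 1]; all names and values are hypothetical and this is not the CT-ICP implementation.

```python
import math


def interpolate_pose(begin, end, alpha):
    """Linear interpolation between two planar poses (x, y, yaw)."""
    # Wrap the yaw difference so interpolation takes the short way round.
    dyaw = math.atan2(math.sin(end[2] - begin[2]), math.cos(end[2] - begin[2]))
    return (begin[0] + alpha * (end[0] - begin[0]),
            begin[1] + alpha * (end[1] - begin[1]),
            begin[2] + alpha * dyaw)


def deskew(points, begin, end):
    """Map each sensor-frame point (x, y, alpha) into the world frame
    using the pose interpolated at that point's own timestamp."""
    out = []
    for x, y, alpha in points:
        px, py, yaw = interpolate_pose(begin, end, alpha)
        c, s = math.cos(yaw), math.sin(yaw)
        out.append((px + c * x - s * y, py + s * x + c * y))
    return out


# 10 m/s for one 100 ms sweep: the sensor moves 1 m during the scan.
begin_pose, end_pose = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
# One static wall point at world x = 5, seen at the start and at the end.
scan = [(5.0, 0.0, 0.0), (4.0, 0.0, 1.0)]
world = deskew(scan, begin_pose, end_pose)
```

Without the per-point pose query, the two observations of the same wall point would land 1 m apart in the map. In an estimator the two poses are optimization variables and the interpolation sits inside the ICP residual.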

둘째, rolling-shutter VIO. Basalt·[Cremers 그룹 rolling-shutter VO](https://doi.org/10.1109/CVPR.2016.71)·[OKVIS](https://doi.org/10.1177/0278364914554813) 후속작들이 이미지 각 행의 찍힌 시각을 B-spline 궤적에 질의한다. 글로벌 셔터를 가정하고 우회하는 기존 VIO와 달리 rolling shutter 자체를 모델 안에서 처리한다.

셋째, event camera. Ch.18이 기록한 2010년대의 좌절 이후, 2020년대의 event SLAM은 거의 모두 continuous-time 궤적 위에 섰다. 각 이벤트의 μs 타임스탬프를 B-spline 혹은 GP에 질의해 그 순간의 자세를 얻고, event-image consistency로 residual을 계산한다. event가 "프레임이 없는 센서"라는 사실과 continuous-time이 "프레임 가정이 필요 없는 표현"이라는 사실이 자연스럽게 맞물렸다.

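> 🔗 **Borrowing.** CT-ICP combines the point-to-plane objective of the ICP lineage ([Besl & McKay 1992 ICP](https://graphics.stanford.edu/courses/cs164-09-spring/Handouts/paper_icp.pdf) is the point-to-point original; the point-to-plane variant goes back to Chen & Medioni) with linear continuous-time interpolation inside each scan. Classical registration and Furgale's continuous-time idea met in one system across a gap of 30 years.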

---

## 📜 예언 vs 실제

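> Furgale, Barfoot, and Sibley's 2012 ICRA paper set out two expectations in its Future Work. One was the outlook that "continuous-time representations will become the natural language for integrating rolling shutter and high-rate IMU sampling"; the other was "follow-up work proving compatibility with sparse factor graphs." Both were borne out within ten years. Barfoot, Tong, and Särkkä 2014 closed the sparse GP proof, and rolling-shutter VIO and event SLAM in the 2020s use the cumulative B-spline as their base vocabulary. There is, however, one development the authors did not predict. In 2012 the implicit division of labor was that discrete-keyframe systems in the PTAM mold (the line that later produced ORB-SLAM) would be the mainstream, and continuous-time would serve special sensors. In practice a push in the opposite direction also appeared. When Burnett and colleagues released STEAM-ICP, which uses the Doppler velocity of FMCW LiDAR, continuous-time stopped being an appendix for handling special sensors and became an active representation that draws out what a sensor can do. `[hit + extension]`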

---

## 🔗 차용 (요약)

This section gathers once more the lineages this chapter leans on, including those in the four boxes scattered above.

Särkkä의 SDE-GP 교과서가 없었다면 Barfoot·Tong·Särkkä 2014는 수식의 앵커가 없었을 것이다. de Boor의 1978년 spline 고전이 없었다면 Furgale 2012는 basis function을 처음부터 쌓아야 했다. Anderson·Barfoot 2015의 local variable 기법이 없었다면 GP를 Lie group으로 이식하는 작업은 더 오래 걸렸을 것이다. continuous-time trajectory estimation은 수치해석·확률 이론·Lie group 미분기하의 세 줄기가 SLAM이라는 좁은 지점으로 모여든 자리다.

---

## 🧭 아직 열린 것

**Learning-based continuous-time prior.** SDE가 주는 motion prior는 constant-velocity나 white-jerk 같은 물리 가정을 내장한다. 실제 주행·보행·UAV 궤적은 이 가정을 어기는 경우가 많다. 2023-2024년 neural SDE나 neural ODE로 데이터 기반 prior를 학습해 continuous-time factor graph에 꽂으려는 시도들이 나왔다. 아직 실시간 sparse 구조를 유지한 채 learned prior를 얹은 시스템은 검증 단계다.

**VIO와 continuous-time의 통합.** Ch.7b의 preintegration은 keyframe 기반 VIO의 사실상 표준으로 남아 있다. continuous-time 궤적이 preintegration을 대체할 수 있는지, 혹은 두 도구가 공존하는 하이브리드가 더 나은지는 2026년 기준 결론이 없다. Le Gentil의 [GP-augmented preintegration 계보](https://arxiv.org/abs/2007.04144)가 한 다리를 놓으려 시도 중이지만, ORB-SLAM3·VINS-Fusion 수준의 배포 시스템에서는 여전히 discrete-time preintegration이 주력이다.

**Edge deployment를 위한 online sliding window.** STEAM과 B-spline 기반 시스템은 제어점 수가 누적되면 최적화가 느려진다. marginalization으로 과거 제어점을 제거하면서 continuous-time posterior의 일관성을 유지하는 문제는 기술적으로 까다롭다. 자동차·드론 같은 임베디드 플랫폼에서 continuous-time SLAM을 10년짜리 표준으로 밀어올리려면 이 공백이 먼저 메워져야 한다.

---

Ch.7b가 이산 시간의 효율을 끝까지 짜낸 공학이었다면, 이 장은 그 바깥 — "시간이 매끈하게 흘러야 할 때" — 에서 자란 갈래의 계보였다. 두 도구는 경쟁하지 않는다. 하나의 SLAM 시스템에 IMU preintegration factor와 continuous-time LiDAR factor가 나란히 들어가는 구성이 2024년 이후 꾸준히 보고되고 있다. 다음 장은 이 모든 문법이 어떤 시각 계보의 꼭짓점에서 맞닥뜨리는지를 다룬다.

---

# Ch.8 — Direct 계보: DTAM에서 DSO까지

Richard Newcombe는 Andrew Davison의 박사과정 학생이었다. Imperial College에서 MonoSLAM의 30-landmark 한계를 직접 목격한 그는 2011년 정반대의 선택을 했다—모든 픽셀을 쓰기로. Davison이 "몇 개의 점만 추적하면 충분하다"는 EKF의 논리에 기대어 실시간을 증명했다면, Newcombe는 GPU 한 장을 얹고 화면 전체를 써도 실시간이 가능하다는 것을 보여줬다. DTAM은 MonoSLAM의 직계지만, 그 방법론적 DNA는 완전히 뒤집혀 있다.

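The ORB-SLAM lineage covered in Ch.7 extracts features first and tracks only those features. A few hundred points survive the FAST corner detector and the ORB descriptor, and the remaining pixels are discarded. The direct lineage rejected this choice. There are no pixels to throw away: the image itself is the measurement.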

같은 해 뮌헨에서는 Daniel Cremers가 다른 경로를 걷고 있었다. Computer vision의 variational 방법론(Gauss-Newton image alignment, 광학 흐름의 수식 언어)을 SLAM 전체에 이식하는 작업이었다. Cremers의 제자 Jakob Engel은 2014년 LSD-SLAM을, 2016년 DSO를 내놓았다. 두 논문은 서로 다른 밀도에서 같은 질문을 던졌다. feature를 추출하는 대신 픽셀의 밝기를 직접 비교하면 어떤 일이 생기는가.

---

## 8.1 Every pixel: DTAM

[Newcombe, Lovegrove & Davison 2011. DTAM](https://doi.org/10.1109/ICCV.2011.6126513), presented at ICCV 2011, stands for "Dense Tracking and Mapping in Real-Time". As the name says, it performs both tracking and mapping, using every pixel, in real time.

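The system has two core parts. Tracking performs whole-image photometric alignment of the live frame against a view rendered from the current dense model. With no feature extraction and no descriptor matching, it minimizes only differences in pixel intensity. Mapping accumulates a photometric cost volume over inverse depth from many small-baseline frames (multi-baseline stereo) and extracts a smooth dense depth map under a total-variation-style regularizer.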

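$$E(\boldsymbol{\xi}) = \int_{\Omega} \Bigl\{ g(\mathbf{u})\,\bigl\|\nabla \boldsymbol{\xi}(\mathbf{u})\bigr\|_{\epsilon} + \lambda\, C\bigl(\mathbf{u}, \boldsymbol{\xi}(\mathbf{u})\bigr) \Bigr\}\, d\mathbf{u}, \qquad C(\mathbf{u}, d) = \frac{1}{|\mathcal{I}(r)|} \sum_{m \in \mathcal{I}(r)} \Bigl| I_r(\mathbf{u}) - I_m\bigl(\pi(K\,T_{mr}\,\pi^{-1}(\mathbf{u}, d))\bigr) \Bigr|$$

Here $\mathbf{u}$ is a pixel of the reference image $I_r$, $\boldsymbol{\xi}(\mathbf{u})$ is the inverse depth map, $\pi^{-1}(\mathbf{u}, d)$ back-projects pixel $\mathbf{u}$ at inverse depth $d$ to a 3D point, $K$ is the camera intrinsic matrix, $T_{mr}$ is the rigid-body transform from the reference frame to frame $m$, $\pi$ is perspective projection, and $\mathcal{I}(r)$ is the set of frames that overlap the reference. The data term $C$ is the averaged absolute photometric error. The regularizer $\|\cdot\|_{\epsilon}$ is a Huber norm on the inverse-depth gradient, which behaves like total variation for large gradients, weighted by an image-gradient term $g(\mathbf{u})$ so that depth discontinuities are allowed at image edges. Running this optimization in real time requires a GPU, and DTAM did not hide that premise. It ran on a single Nvidia GTX 480 of the time (the commodity system setup in §3 of the paper).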

> 🔗 **차용.** DTAM의 dense volumetric 접근은 depth camera 기반 연구, 특히 [Curless & Levoy 1996](https://doi.org/10.1145/237170.237269)의 TSDF 아이디어에서 부분 영감을 받았으나, 단안(monocular) 카메라에 적용했다는 점이 핵심 차이다. 이후 Newcombe 자신이 주도한 [KinectFusion](https://doi.org/10.1109/ISMAR.2011.6092378)(2011, ISMAR)이 오히려 depth sensor 버전으로 이 아이디어를 완성시키는 역방향 흐름이 나타난다.

결과는 충격적이었다. 실내 scene 전체가 실시간으로 복원되는 영상은 2011년 ICCV 발표 직후 YouTube에 공개되어 수만 회 조회를 기록했다. 그러나 약점도 명확했다. GPU 없이는 돌아가지 않았고, 조명 변화에 취약했으며, 실외 대규모 환경으로는 확장되지 않았다.

---

## 8.2 Tracking the edges: LSD-SLAM

[Engel, Schöps & Cremers 2014. LSD-SLAM](https://doi.org/10.1007/978-3-319-10605-2_54)은 DTAM의 dense를 포기하는 대신 GPU 의존성도 함께 버렸다. "Large-Scale Direct Monocular SLAM"은 semi-dense 방식으로, 이미지에서 gradient magnitude가 임계값 이상인 픽셀만 추적한다. 벽의 평탄한 영역은 무시하고, gradient가 충분한 엣지 근방 픽셀만 살린다. 코너 detector는 쓰지 않으며, 오직 intensity gradient의 세기가 픽셀 선택 기준이다.

추적 단계는 SE(3)에서의 direct image alignment다. 현재 프레임을 키프레임에 direct로 warping하여 photometric residual을 Gauss-Newton으로 최소화한다. 지도는 키프레임 기반이며 각 키프레임마다 semi-dense depth map을 유지한다. 키프레임 간 연결은 pose graph로 관리하고, loop closure는 appearance-based relocalization으로 후보를 찾은 뒤 depth consistency check로 검증한다.

> 🔗 **차용.** Gauss-Newton photometric registration은 이미지 정렬 분야의 고전이다. [Lucas & Kanade 1981](https://www.ijcai.org/Proceedings/81-2/Papers/017.pdf) tracker와 그 역방향 합성([Baker & Matthews 2004](https://doi.org/10.1023/B:VISI.0000011205.11775.fd))이 LSD-SLAM frontend의 직접 조상이다. Cremers 그룹은 variational image processing 커뮤니티의 언어를 SLAM 파이프라인 전체로 이식했다.

CPU에서 실시간으로 동작한다는 점이 LSD-SLAM의 실용적 의미였다. 키프레임만 들고 pose graph를 최적화하는 구조는 PTAM의 tracking/mapping 분리와 표면적으로 닮았지만, 내부는 달랐다. ORB나 BRIEF 같은 binary descriptor가 없고, 픽셀 강도가 유일한 측정값이다.

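LSD-SLAM also released footage of outdoor, large-scale operation. A demo in which a semi-dense map is built while the camera travels tens of meters by bicycle showed that the direct approach could scale. The comparison on the KITTI benchmark against top-tier feature-based methods came with the stereo extension (Stereo LSD-SLAM, Engel, Stückler & Cremers 2015) rather than with the monocular original.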

그러나 조명 변화가 문제였다. 터널 진입, 창문 역광, 갑작스러운 플래시—photometric consistency를 가정하는 순간, 이런 상황은 시스템을 즉시 destabilize했다.

---

## 8.3 Sparse direct, completed: DSO

[Engel, Koltun & Cremers 2018. DSO (PAMI)](https://doi.org/10.1109/TPAMI.2017.2658577) was first released on arXiv in 2016. The name "Direct Sparse Odometry" already states its positioning: sparser than LSD-SLAM, with far fewer pixels than DTAM, and in exchange a thorough photometric calibration.

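The system keeps roughly 2,000 high-gradient active points spread across the keyframes in its window. That is more than ORB-SLAM2's default setting (nFeatures=1000) and far fewer than LSD-SLAM's semi-dense set (every pixel with gradient). Over these points it runs sliding-window bundle adjustment, where the optimization variables include not only camera pose but also inverse depth and the affine brightness parameters $(a_i, b_i)$. Frames that leave the window are removed by marginalization with the Schur complement, which keeps the window size, and with it the per-frame cost, bounded.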

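DSO separates the camera's photometric model into three layers. First, vignetting (the fall-off in brightness toward the edge of the lens) is corrected by prior calibration. Second, the camera response function (the gamma curve, that is, the nonlinear way the sensor records light) is also estimated in advance and inverted to move into a linear intensity domain. Third, the remaining per-frame brightness change is handled online: the exposure time $t_i$ is read from the camera when available (and set to 1 otherwise), while the affine brightness parameters $(a_i, b_i)$ are estimated as optimization variables: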

$$E_{pj} = \sum_{\mathbf{p} \in \mathcal{N}_p} w_{\mathbf{p}} \left\| \left( I_j\!\left[\mathbf{p}'\right] - \frac{t_j e^{a_j}}{t_i e^{a_i}} I_i[\mathbf{p}] - \left(b_j - \frac{t_j e^{a_j}}{t_i e^{a_i}} b_i\right) \right) \right\|_\gamma$$

여기서 $t_i, t_j$는 노출 시간, $(a_i, b_i)$와 $(a_j, b_j)$는 각 프레임의 affine brightness 파라미터(gain과 bias), $\|\cdot\|_\gamma$는 Huber loss이다. Vignetting은 전처리 단계에서 photometric calibration으로 보정되며, 위 잔차는 보정된 intensity에 적용된다. 카메라의 노출 변화·vignetting·response curve를 별도 캘리브레이션 단계와 실시간 최적화 변수로 분리해 처리한 것은 direct SLAM에서 DSO가 처음이었다.
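A scalar sketch of the residual above, written in the equivalent grouped form $(I_j - b_j) - \frac{t_j e^{a_j}}{t_i e^{a_i}} (I_i - b_i)$. The Huber threshold `k`, the function names, and the sample intensities are assumptions for illustration; real DSO evaluates this over a small pixel pattern with gradient-dependent weights.

```python
import math


def huber(r, k=9.0):
    """Huber norm of a scalar residual; k is a hypothetical threshold."""
    a = abs(r)
    return 0.5 * r * r if a <= k else k * (a - 0.5 * k)


def photometric_residual(I_i, I_j, t_i, t_j, a_i, a_j, b_i, b_j):
    """r = (I_j - b_j) - (t_j e^{a_j}) / (t_i e^{a_i}) * (I_i - b_i)."""
    ratio = (t_j * math.exp(a_j)) / (t_i * math.exp(a_i))
    return (I_j - b_j) - ratio * (I_i - b_i)


def patch_energy(patch_i, patch_j, weights, **photo):
    """Weighted Huber sum over corresponding pixels of a residual pattern."""
    return sum(w * huber(photometric_residual(pi, pj, **photo))
               for w, pi, pj in zip(weights, patch_i, patch_j))


# The exposure doubles between frames: intensities double, residual is zero.
modeled = photometric_residual(50.0, 100.0, t_i=0.01, t_j=0.02,
                               a_i=0.0, a_j=0.0, b_i=0.0, b_j=0.0)
# Ignoring the exposure change leaves a large spurious residual.
unmodeled = photometric_residual(50.0, 100.0, t_i=1.0, t_j=1.0,
                                 a_i=0.0, a_j=0.0, b_i=0.0, b_j=0.0)
```

The contrast between `modeled` and `unmodeled` is the practical content of the photometric model: a brightness change that is explained by exposure does not get misread as geometric error.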

> 🔗 **차용.** Photometric camera calibration의 형식적 기반은 [Debevec & Malik 1997](https://doi.org/10.1145/258734.258884)의 HDR 복원 작업에서 비롯된다. 그들이 여러 장의 사진에서 camera response function을 복원하기 위해 세운 photometric 모델을 DSO는 실시간 SLAM의 최적화 변수로 가져왔다.

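The results were impressive. On the TUM monocular dataset, DSO reported outperforming monocular ORB-SLAM on many sequences. In feature-poor environments in particular (indoor corridors dominated by flat walls), DSO recorded lower trajectory error. This was the empirical basis for the claim that using photometric information directly exploits, in principle, more of the available information.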

> 📜 **예언 vs 실제.** DSO는 사전 photometric calibration을 요구했고, 그 의존성은 곧 후속 연구의 표적이 되었다. 2018년 Bergmann, Wang, Cremers의 [online photometric calibration](https://doi.org/10.1109/LRA.2017.2777002)이 한 방향이었다—캘리브레이션을 사전에 하지 않고 SLAM 실행 중 노출·response·vignetting을 동시에 추정한다. 그럼에도 end-user 관점의 배포 장벽은 2026년 기준 여전히 남아 있다. consumer 카메라에서 photometric 파라미터를 안정적으로 추출하는 과정이 완전히 자동화되지 못한 채 카메라별 사전 세팅을 요구한다. `[진행형]`

> 📜 **예언 vs 실제.** DTAM은 GPU 한 장에 의존한 실시간 dense SLAM이었고, dense 재구성의 접근성 확대는 자연스러운 다음 과제로 놓였다. 그 실현 경로는 직진이 아니었다. 순수 mono dense는 NeRF와 3DGS가 등장하는 2020년대까지 실시간 배포 가능한 형태로 나오지 않았다. 대신 Newcombe 자신이 주도한 KinectFusion이 RGB-D depth sensor를 사용해 GPU dense 재구성을 2011년에 바로 완성했다—sensor 교체로 문제를 우회한 것이다. `[기술변화]`

---

## 8.4 VI-DSO and the widening lineage

2018년 von Stumberg, Usenko, Cremers는 DSO에 IMU를 결합한 [VI-DSO](https://doi.org/10.1109/ICRA.2018.8462905)를 ICRA 2018에서 발표했다. 동기는 단순했다. photometric direct method의 가장 큰 실패 모드인 조명 급변 상황에서 IMU의 관성 측정이 pose 추적을 보조할 수 있다. 또한 mono 카메라의 scale ambiguity를 IMU로 해소할 수 있다.

VI-DSO는 DSO의 windowed photometric bundle adjustment에 IMU preintegration factor를 추가한다. IMU preintegration 방식은 [Forster et al.의 2017년 논문](https://doi.org/10.1109/TRO.2016.2597321)에서 차용했다. 결과적으로 scale이 복원되고 극단적 조명 조건에서 robustness가 향상되었다.

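Later work from the Cremers group continued in the same direction. [DM-VIO](https://doi.org/10.1109/LRA.2021.3140129) (2022) attaches a tightly coupled inertial backend to a direct photometric frontend, and [Basalt](https://arxiv.org/abs/1904.06504) (2019) pairs a patch-based optical-flow frontend with tightly coupled inertial optimization. This lineage ran in parallel with feature-based VIO (VINS-Mono, OpenVINS), each forming its own ecosystem.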

> 🔗 **차용.** VI-DSO의 IMU preintegration은 [Forster et al. 2017. On-Manifold Preintegration (IEEE TRO)](https://doi.org/10.1109/TRO.2016.2597321)의 manifold preintegration 공식을 그대로 사용한다. DSO의 photometric layer 위에 Forster의 inertial layer가 올라간 적층 구조다.

---

## 8.5 Limits of the direct method

Direct method는 이론적으로 더 많은 정보를 쓴다. feature detector가 버리는 픽셀들, 즉 gradient가 낮아도 consistent한 영역을 추적에 활용한다. photometric residual은 feature descriptor의 discretization 없이 연속적인 최적화 landscape를 제공한다.

그럼에도 2026년 기준 대다수 배포 시스템은 feature-based다. 이유는 여러 층에 걸쳐 있다.

첫째, photometric calibration 의존성이다. DSO가 가정하는 vignetting 보정, response curve 보정, 노출 제어는 consumer camera에서 그냥 얻어지지 않는다. 스마트폰 카메라는 HDR 합성, auto-exposure, 실시간 화이트 밸런스를 자체적으로 적용하며, 그 파이프라인은 사용자에게 공개되지 않는다. DSO의 photometric 모델은 이런 카메라에서 기본 가정이 깨진다.

둘째, 조명 변화다. 자동 노출, 역광, 플리커처럼 프레임 간 밝기가 급변하는 상황에서는 direct 방식의 핵심 가정인 photometric consistency가 바로 깨진다. DSO의 affine brightness 모델은 완만한 밝기 변동만 흡수할 수 있어, 실외에서 구름이 지나가거나 실내에서 형광등이 깜빡이는 장면은 여전히 추적 실패의 주 원인으로 남았다.

셋째, DSO가 controlled dataset에서 ORB-SLAM2를 이기는 시퀀스가 있어도, 실제 로봇에 얹는 엔지니어는 ORB-SLAM 쪽을 골랐다. ORB-SLAM은 여러 카메라 모델에서 별도 photometric calibration 없이 동작한다. 카메라를 교체해도 바로 돌아간다. DSO는 카메라마다 vignetting·response curve를 따로 캘리브레이션해야 했다.

Fourth, learned features and matchers such as [SuperPoint](https://arxiv.org/abs/1712.07629) (2018) and [LightGlue](https://arxiv.org/abs/2306.13643) (2023) weakened the direct method's central criticism that "features throw information away." They preserve far more information than handcrafted descriptors while keeping the practical advantages of descriptor matching. The very spot from which the direct method used to attack feature-based systems has been filled by learned features.

---

## 🧭 아직 열린 것

**조명 급변 환경에서의 direct tracking.** Direct method의 근본 전제, 장면의 밝기 분포가 프레임 간 보존된다는 가정은 자동 노출 카메라, 강한 역광, 터널-야외 전환 상황에서 즉각 붕괴한다. VI-DSO의 IMU 보조가 부분적으로 완화하지만, 조명 모델 자체를 동적으로 추정하는 완전한 해법은 아직 없다. 학습 기반 photometric 보정이 대안으로 탐색 중이지만, 실시간 배포 가능한 형태로 나오지 않았다.

**Textureless + direct의 이중 약점.** Feature-based는 코너가 없는 벽 앞에서 실패한다. Direct는 gradient가 없는 면에서 residual이 사라진다. 두 방식 모두 실내 복도, 대규모 창고, 균질한 실외 지형 같은 환경에서 약하다. Semi-dense LSD-SLAM은 gradient 있는 픽셀을 선택적으로 쓰는 방식으로 절충했지만, 그 픽셀이 충분히 분포하지 않는 상황의 degeneracy는 해결하지 못했다.

**Learned photometric model로의 이행 가능성.** 현재 direct SLAM의 photometric 모델은 단순 affine brightness 보정이나 고정 camera response function으로 표현된다. Neural radiance field 계열의 연구들은 장면의 appearance를 neural network로 모델링하는 방식을 탐색하고 있다. 이것이 실시간 direct SLAM의 photometric layer로 들어올 수 있는지, 들어온다면 direct와 learned의 경계가 어디에 그어지는지는 2026년 현재 열린 질문이다.

한편 direct 계보와 나란히, 다른 방향의 탈출구가 이미 2011년에 열려 있었다. Newcombe 자신이 KinectFusion을 통해 보여줬다. 단안 카메라의 photometric 가정을 지키는 대신, 센서 자체를 바꾸면 된다. depth 정보를 직접 측정하는 RGB-D 카메라는 밝기 변화에 무관하게 dense 재구성을 가능하게 했다. direct method가 photometric consistency를 수식으로 지키려 했다면, RGB-D는 그 가정 자체를 질문 목록에서 지웠다.

---

# Ch.9 — Dense/RGB-D: KinectFusion부터 BundleFusion까지

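In October 2011, when Richard Newcombe (Imperial College London) presented KinectFusion at ISMAR, the audience's reaction centered on the demo video more than the paper. A single handheld Kinect sensor was filling an entire room with a 3D mesh in real time. The scene achieved with an RGB-D sensor what DTAM, which Newcombe himself presented the same year, had dreamed of doing with a monocular camera. The lineage is clear: the TSDF representation that Curless and Levoy devised for the graphics community in 1996, the ICP tracking that Besl and McKay gave robotics in 1992, and the Kinect sensor that Microsoft released for $150 in 2010. Where these three strands crossed, the short and intense era of dense SLAM began. The framework in which Davison's MonoSLAM (Ch.5) tracked sparse landmarks with a monocular camera (real-time tracking, CPU only, no GPU) now reached a different conclusion in front of Kinect's depth stream. Newcombe's DTAM (Ch.8) opened up the potential of the GPU by attempting dense reconstruction through direct photometric optimization, and KinectFusion realized that potential with an RGB-D sensor.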

---

## 9.1 Kinect 이전의 dense 재구성

2011년 이전에도 dense 3D 재구성은 가능했다. *실시간*만 빠져 있었다.

오프라인 파이프라인들은 스테레오 혹은 structured light 스캐너로 취득한 포인트 클라우드를 시간을 들여 병합했다. 실내 스캔 장비는 수십만 달러였다. 연구실 바깥에서 이 기술을 쓰는 사람은 없었다. SLAM 커뮤니티는 이미 sparse landmark로 충분히 실용적인 결과를 얻고 있었고, dense 재구성은 그래픽스 쪽 문제로 분류해 두고 있었다.

[Curless와 Levoy의 1996년 SIGGRAPH 논문 "A Volumetric Method for Building Complex Models from Range Images"](https://graphics.stanford.edu/papers/volrange/volrange.pdf)는 이 시기의 그래픽스 쪽 접근을 대표한다. 핵심 아이디어는 **TSDF(Truncated Signed Distance Function)**였다. 3D 공간을 균일한 복셀 그리드로 나누고, 각 복셀에 가장 가까운 표면까지의 부호 있는 거리를 누적한다. 부호 관행은 센서에서 표면 방향으로 진행할 때 표면 앞(free space)이 양수, 표면 뒤(solid 내부)가 음수다. Truncated란 이 값을 절댓값 기준 일정 한계 $t$ 이내로 잘라낸다는 뜻으로, $\text{TSDF}(x) = \text{clip}(d(x), -t, +t)$ 형태가 된다. 새 깊이 프레임이 들어올 때마다 이 값을 가중 평균으로 갱신하면, 노이즈가 점진적으로 평균화되면서 표면이 점점 선명해진다. 표면 추출은 TSDF의 zero-crossing에 marching cubes를 적용하면 된다.
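The per-voxel update described above is a clipped running weighted average. A minimal sketch for a single voxel follows; the weight cap `w_max` and the sample distances are hypothetical values added for illustration.

```python
def tsdf_update(tsdf, weight, sdf_measurement, truncation, w_new=1.0, w_max=64.0):
    """Fuse one signed-distance measurement into a voxel.

    Returns the new (tsdf, weight). w_max caps the weight so that old
    observations cannot dominate forever (an assumed design choice).
    """
    d = max(-truncation, min(truncation, sdf_measurement))  # clip(d, -t, +t)
    fused = (weight * tsdf + w_new * d) / (weight + w_new)
    return fused, min(weight + w_new, w_max)


# Four noisy depth frames observe a surface about 1 cm behind this voxel.
value, weight = 0.0, 0.0
for d in (0.012, 0.008, 0.011, 0.009):
    value, weight = tsdf_update(value, weight, d, truncation=0.05)
```

After the loop `value` is the mean of the four measurements, which is the noise-averaging effect the text refers to. A measurement far from the surface (for example 0.3 m with a 0.05 m truncation) is stored as the truncation value.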

이 방법은 정확했다. 그러나 복셀 그리드는 메모리를 많이 먹었고, 실시간 갱신은 당시 하드웨어로는 불가능했다. Curless-Levoy의 논문은 이후 15년 동안 그래픽스 교과서에 머물렀다.

그 15년 사이에 두 가지가 바뀌었다. GPU가 GPGPU 시대로 진입했고, Kinect가 등장했다.

---

## 9.2 KinectFusion과 TSDF

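In 2010 Microsoft released Kinect for the Xbox 360 at $150. The sensor measured depth with structured light and streamed VGA-resolution depth maps at 30 Hz. Its precision was lower than that of research-grade ToF cameras, but its price was about one hundredth. Hackers reacted first: open-source drivers appeared within weeks of the release, and researchers followed.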

Newcombe는 그 무렵 Microsoft Research Cambridge로 자리를 옮겼고, Shahram Izadi 팀과 함께 GPU 기반 dense SLAM을 준비하고 있었다. Kinect가 출시됐을 때 그들에게는 이미 파이프라인의 윤곽이 있었다. Kinect가 공급한 깊이 스트림이 나머지를 채웠다. 결과가 2011년 ISMAR에서 발표된 [Newcombe et al. 2011. KinectFusion](https://doi.org/10.1109/ISMAR.2011.6092378)이다.

> 🔗 **차용.** KinectFusion의 핵심 표현인 TSDF는 Curless & Levoy(1996)가 오프라인 3D 스캐닝을 위해 고안한 것이다. Newcombe 팀은 이를 GPU의 병렬 복셀 갱신으로 실시간화했다.

파이프라인은 네 단계로 구성된다.

깊이 전처리 단계에서는 원시 깊이 맵에서 bilateral filter로 노이즈를 줄이고 표면 법선을 계산한다.

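In the ICP tracking stage, the point cloud of the current frame is aligned to the virtual surface ray-cast from the previous TSDF. The **ICP (Iterative Closest Point)** of [Besl & McKay (1992)](https://graphics.stanford.edu/courses/cs164-09-spring/Handouts/paper_icp.pdf) is run in its point-to-plane variant on the GPU, evaluating hundreds of thousands of per-pixel correspondences in parallel and iterating a handful of times per frame over a coarse-to-fine pyramid. The result is the camera's 6-DoF pose.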

point-to-plane ICP의 목적함수는 다음과 같다. 현재 프레임의 포인트 $\mathbf{p}_i$를 변환 $T = (R, \mathbf{t})$로 움직인 뒤 대응 점 $\hat{\mathbf{p}}_i$(ray-cast 표면)과 법선 $\hat{\mathbf{n}}_i$에 대해

$$E(R, \mathbf{t}) = \sum_i \bigl(\hat{\mathbf{n}}_i^\top (R\,\mathbf{p}_i + \mathbf{t} - \hat{\mathbf{p}}_i)\bigr)^2$$

을 최소화한다. 원래 Besl-McKay의 point-to-point($\|R\mathbf{p}_i + \mathbf{t} - \hat{\mathbf{p}}_i\|^2$)와 달리 법선 방향 오차만 측정하므로, 표면에 접하는 방향의 미끄러짐에 덜 민감하다. 소회전 근사 $R \approx I + [\boldsymbol{\omega}]_\times$ 를 적용하면 $E$는 6-DoF 벡터 $(\boldsymbol{\omega}, \mathbf{t})$에 대한 선형 최소제곱 문제로 바뀌어 GPU에서 병렬 감소(parallel reduction)로 한 번에 풀린다.
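One linearized point-to-plane step can be sketched in pure Python. This is a CPU illustration, not the GPU reduction: the synthetic data, the helper names, and the "true" motion are assumptions, and the target points are generated with the same small-rotation model so that a single step recovers the motion exactly.

```python
import math
import random


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def point_to_plane_step(src, dst, normals):
    """Return (omega, t) minimizing sum_i (n_i . (p_i + omega x p_i + t - q_i))^2."""
    H = [[0.0] * 6 for _ in range(6)]
    g = [0.0] * 6
    for p, q, n in zip(src, dst, normals):
        J = list(cross(p, n)) + list(n)  # n.(omega x p) = omega.(p x n)
        r = dot(n, [qi - pi for pi, qi in zip(p, q)])
        for a in range(6):
            g[a] += J[a] * r
            for c in range(6):
                H[a][c] += J[a] * J[c]  # accumulate the 6x6 normal equations
    x = solve(H, g)
    return x[:3], x[3:]


rng = random.Random(0)
true_omega, true_t = (0.002, -0.003, 0.001), (0.05, -0.03, 0.02)
src = [tuple(rng.uniform(-1, 1) for _ in range(3)) for _ in range(30)]
normals = []
for _ in src:
    v = [rng.gauss(0, 1) for _ in range(3)]
    norm = math.sqrt(dot(v, v))
    normals.append(tuple(c / norm for c in v))
dst = [tuple(pi + wi + ti for pi, wi, ti in zip(p, cross(true_omega, p), true_t))
       for p in src]
est_omega, est_t = point_to_plane_step(src, dst, normals)
```

The accumulation of `H` and `g` is a sum over independent correspondences, which is the part a GPU implementation turns into a parallel reduction before solving the 6x6 system.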

> 🔗 **차용.** KinectFusion의 추적 단계는 Besl & McKay(1992) ICP를 직접 계승한다. 고전 로봇공학 문헌의 기법을 GPU 밀도로 다시 꺼낸 것이다.

TSDF 통합 단계에서는 추정된 포즈로 깊이 맵을 복셀 그리드에 투영해 TSDF 값을 갱신한다. 논문의 대표 실험 설정은 512³ 복셀로 약 3m 한 변 크기의 방 규모 볼륨을 덮는다(§4.2, Fig. 13).

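In the surface rendering stage, the zero-crossing of the TSDF is found by ray marching, producing a predicted surface (vertex and normal maps) that is also rendered for display in real time. This prediction becomes the reference surface for the next ICP tracking step.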

Newcombe는 같은 해 DTAM을 단안 카메라 dense SLAM으로 발표했다. KinectFusion은 그 자매 연구다. DTAM이 GPU를 써서 단안의 광도 일관성을 최적화했다면, KinectFusion은 같은 GPU를 깊이 통합에 투입했다. 두 논문의 저자 목록이 겹치는 이유다.

> 🔗 **차용.** KinectFusion과 DTAM은 같은 해 같은 연구자가 발표한 두 dense 시스템이다. DTAM의 GPU dense 파이프라인 철학이 KinectFusion으로 자연스럽게 이식됐고, 센서만 달랐다.

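The 512³ TSDF was updated at 30 Hz, and a single indoor room was reconstructed as a dense mesh within minutes. Tracking drift was far smaller than with frame-to-frame ICP, because each frame is aligned against the accumulated model surface rather than against the previous noisy frame.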

한계도 명확했다. 512³ 복셀 그리드는 고정된 공간 범위만 다룰 수 있었다. 방을 벗어나면 복셀이 포화되거나 기존 복셀을 덮어써야 했다. loop closure가 없었다. 그리고 Kinect의 IR 구조광은 햇빛 아래에서 작동하지 않았다. 실외는 처음부터 범위 밖이었다.

---

## 9.3 Kintinuous — rolling volume

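Right after KinectFusion was published, Thomas Whelan, then at NUI Maynooth and collaborating with MIT, went after this limitation. If the fixed-size TSDF volume is the problem, move it along with the camera.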

2012년 7월 RSS 워크숍(RGB-D: Advanced Reasoning with Depth Cameras, Sydney)에서 [Whelan 등이 발표한 Kintinuous](https://www.cs.cmu.edu/~kaess/pub/Whelan12rssw.pdf)는 "rolling TSDF volume"을 도입했다. 카메라가 볼륨 경계에 가까워지면 반대쪽 슬라이스를 메시로 출력하고 해제한 뒤, 새 슬라이스를 앞에 붙인다. 메모리는 일정하게 유지되면서 카메라는 무한히 이동할 수 있다.

실내 복도 전체를 걷는 데모는 KinectFusion이 보여주지 못한 것이었다. 그러나 loop closure는 여전히 없었다. 긴 복도를 걸어서 원점으로 돌아왔을 때 두 끝이 맞지 않는 문제는 해결되지 않았다. 재구성 품질도 sparse SLAM이 쌓아온 submap 정합 방법에 비해 열위였다.

---

## 9.4 ElasticFusion: Surfel과 비강체 변형

Whelan은 Kintinuous 이후 방향을 바꿨다. TSDF 복셀 대신 surfel을 선택했다.

**surfel(surface element)**은 위치, 법선, 반경, 색상을 가진 점이다. 컴퓨터 그래픽스에서 [Pfister 등(2000)](https://www.merl.com/publications/docs/TR2000-10.pdf)이 렌더링 표현으로 제안한 개념이었다. 복셀 그리드에 비해 불규칙하고 표면에 밀착하는 구조다.

> 🔗 **차용.** ElasticFusion의 surfel 표현은 Pfister 등(2000)의 그래픽스 렌더링 기법을 SLAM의 맵 표현으로 이식한 것이다.

[Whelan et al. 2016. ElasticFusion](https://doi.org/10.1177/0278364916669237)의 핵심 기여는 두 가지다. 첫째, surfel 기반 dense map을 채용했다. 둘째, *non-rigid deformation*을 이용한 loop closure를 구현했다.

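Loop closure in earlier dense SLAM was hard. Modifying a global mesh or voxel grid to fit loop-closure information was expensive. ElasticFusion attaches the set of surfels to a deformation graph and, when a loop closure is detected, deforms the graph to distribute the error over the whole map. It was a non-rigid deformation at the level of the surfels themselves.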

구체적으로, deformation graph의 각 노드 $g_k$는 위치 $\mathbf{v}_k$와 회전 $R_k$, 이동 $\mathbf{t}_k$를 가진다. surfel $s$는 가장 가까운 $K$개 노드의 영향권 안에 놓이고, surfel의 변형 후 위치는

$$\tilde{\mathbf{p}}_s = \sum_{k \in \mathcal{N}(s)} w_k \bigl(R_k (\mathbf{p}_s - \mathbf{v}_k) + \mathbf{v}_k + \mathbf{t}_k\bigr)$$

로 계산된다(가중치 $w_k$는 거리 기반 감쇠). loop closure 제약이 추가되면 그래프 노드들의 $(R_k, \mathbf{t}_k)$를 Gauss-Newton으로 최적화해 오차를 전역에 분산한다. TSDF를 통째로 다시 쌓지 않고도 dense 맵 전체를 일관되게 수정할 수 있었던 이유다.
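The deformation formula above translates almost directly into code. The sketch below assumes normalized weights and uses hypothetical node values; it only applies a given graph to one surfel position and does not perform the Gauss-Newton optimization of $(R_k, \mathbf{t}_k)$.

```python
def matvec(R, v):
    """Multiply a 3x3 matrix (tuple of rows) by a 3-vector."""
    return tuple(sum(R[r][c] * v[c] for c in range(3)) for r in range(3))


def deform(p, nodes, weights):
    """Apply p~ = sum_k w_k (R_k (p - v_k) + v_k + t_k).

    nodes: list of (v_k, R_k, t_k) for the nodes influencing this surfel.
    weights: w_k, assumed to sum to 1.
    """
    out = [0.0, 0.0, 0.0]
    for (v, R, t), w in zip(nodes, weights):
        local = matvec(R, tuple(pi - vi for pi, vi in zip(p, v)))
        for a in range(3):
            out[a] += w * (local[a] + v[a] + t[a])
    return tuple(out)


identity = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
nodes = [((0.0, 0.0, 0.0), identity, (0.1, 0.0, 0.0)),
         ((1.0, 0.0, 0.0), identity, (0.3, 0.0, 0.0))]
moved = deform((0.5, 0.2, 0.0), nodes, weights=(0.5, 0.5))
```

With identity rotations the surfel simply moves by the weighted mean of the node translations (0.2 along x here), which shows how a correction applied at a few graph nodes spreads smoothly over every surfel they influence.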

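Judged on indoor reconstruction quality, ElasticFusion was at the top level of its time. On the synthetic ICL-NUIM dataset, the kt0, kt1, and kt2 sequences came in at or below 1.4 cm ATE RMSE, with kt0 and kt1 at 0.9 cm (kt3, where the global loop closure fires, is the exception with a much larger value). No earlier system had reached this level while remaining real-time.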

---

## 9.5 BundleFusion: 오프라인 SfM 품질을 온라인으로

2017년 Dai, Nießner, Zollhöfer, Izadi, Theobalt가 ACM Transactions on Graphics에 발표한 [Dai et al. 2017. BundleFusion](https://doi.org/10.1145/3072959.3054739)은 다른 방향에서 문제에 접근했다. KinectFusion 계열이 실시간성을 타협하지 않으면서 품질을 높이려 했다면, Dai 팀은 GPU 연산을 최대한 투입해 온라인 시스템에서도 SfM 수준의 번들 조정을 실행하는 것을 목표로 삼았다.

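The core idea is hierarchical optimization combined with sparse-then-dense alignment. Sparse SIFT correspondences between frames provide a coarse global alignment, which is then refined by dense photometric and geometric alignment. To stay real-time, the optimization is split into two levels: consecutive frames are grouped into short chunks that are aligned locally, and a keyframe from each chunk enters a global optimization over the whole sequence. Because bundle adjustment keeps running as frames accumulate, past poses are re-estimated as well. This "retroactive pose correction" aimed to achieve online an effect similar to an offline SfM pipeline that registers everything after all the data is in. Because frames are de-integrated from the TSDF and re-integrated with the updated pose sequence, tracking errors do not simply pile up in the map.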

Dai 팀이 TUM RGB-D 벤치마크에서 보고한 수치는 ElasticFusion을 능가했다. 시각적 재구성 품질도 당시 기준으로 오프라인 COLMAP 파이프라인에 근접했다.

> 📜 **예언 vs 실제.** BundleFusion은 real-time online global bundle adjustment를 "unprecedented speed"로 달성했다고 주장하며, 오프라인 SfM 품질을 온라인으로 끌어올리는 경로를 제시했다. GPU 연산력은 이후로도 빠르게 증가했지만, 관심은 dense SLAM이 아니라 NeRF(Neural Radiance Field)로 이동했다. 고품질 실내 재구성의 사실상 표준은 2021년 이후 COLMAP + NeRF 파이프라인이 됐다. BundleFusion이 열려고 했던 경로 자체가 다른 기술로 우회됐다. `[기술변화]`

---

## 9.6 하드웨어와 알고리즘의 공진화

KinectFusion에서 BundleFusion까지의 6년은 하드웨어와 알고리즘이 서로를 밀어붙인 과정이다.

Kinect 1세대는 구조광 방식이었다. 깊이 정밀도는 미터 범위에서 수 밀리미터였지만 햇빛 아래에서는 IR 패턴이 잡히지 않았다. 2013년 출시된 Kinect 2는 ToF(Time-of-Flight) 방식으로 바꿨다. 정밀도가 올라갔고 동적 범위도 나아졌다. Intel의 RealSense 시리즈가 뒤를 이었다. 센서 선택지가 늘어날수록 알고리즘이 가정할 수 있는 깊이 품질이 달라졌고, 연구자들은 더 작은 노이즈를 활용하거나 더 큰 노이즈를 견디는 방식을 실험했다.

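On the GPU side the CUDA ecosystem matured. Between the Fermi-generation GPUs of 2011, when KinectFusion appeared, and the Pascal architecture of 2017, when BundleFusion appeared, floating-point performance grew by roughly an order of magnitude. That Whelan in ElasticFusion and Dai in BundleFusion could run ever heavier optimization in real time was not an achievement of algorithms alone.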

Kinect가 $150이 아니라 $15,000이었다면, 이 흐름은 5년 이상 늦게 시작됐을 것이다. 소비자 시장용 센서가 연구의 속도를 끌었다.

> 📜 **예언 vs 실제.** KinectFusion이 2011년에 보여준 512³ 고정 볼륨의 한계—공간 범위, 드리프트, 실외 부적합—는 이후 연구의 로드맵이 됐다. 볼륨 확장은 Kintinuous, ElasticFusion, BundleFusion이 차례로 공략했다. 반면 실외는 다른 결론에 도달했다. IR 구조광은 햇빛 아래에서 패턴이 잡히지 않는다. RGB-D 기반 dense SLAM은 그렇게 실내에 묶였고, outdoor는 LiDAR가 맡았다. `[기술변화]`

---

## 9.7 dense-only의 퇴장

2011년부터 2017년 사이 dense RGB-D SLAM은 Visual SLAM의 주된 방향이 될 것처럼 보였다. 실제 전개는 그렇지 않았다.

sparse backend는 계속 지배했다. [ORB-SLAM2](https://arxiv.org/abs/1610.06475)와 [VINS-Mono](https://arxiv.org/abs/1708.03852)로 대표되는 2015년 이후의 실용 SLAM 시스템들은 dense 맵을 기본으로 삼지 않았다. 이유는 복합적이었다. 512³ TSDF는 512MB 이상이 필요해 모바일 플랫폼이나 임베디드 시스템에서는 감당하기 어려웠다. Octree나 해시맵 기반 변형([Voxblox](https://arxiv.org/abs/1611.03631), [OctoMap](https://www.hrl.uni-bonn.de/papers/wurm10octomap.pdf))이 이를 완화하려 했지만 sparse 방식의 효율성과는 격차가 있었다. 실시간 dense 처리는 GPU를 전제했는데, 자율주행 차량의 임베디드 프로세서나 드론의 경량 플랫폼에서는 KinectFusion 수준의 파이프라인을 돌리기 어려웠다. Kinect의 IR depth가 실외에서 작동하지 않는다는 점도 발목을 잡았다. 자율주행과 드론처럼 상용화 요구가 큰 분야 대부분이 실외 환경이었다.
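The memory figure quoted above follows from simple arithmetic. The sketch assumes 4 bytes per voxel (for example a 16-bit distance plus a 16-bit weight); that layout is an assumption, and real systems differ.

```python
def dense_grid_bytes(side, bytes_per_voxel):
    """Memory of a dense cubic voxel grid with `side` voxels per edge."""
    return side ** 3 * bytes_per_voxel


mib = dense_grid_bytes(512, 4) / 2**20            # 512^3 voxels, 4 B each
mib_double_res = dense_grid_bytes(1024, 4) / 2**20  # halving the voxel size
```

Here `mib` is 512 MiB, and doubling the resolution along each axis multiplies the cost by eight. That cubic growth is what the hash-based and octree-based structures try to avoid by storing only voxels near a surface.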

그 사이 dense map data structure 자체의 계보는 KinectFusion의 512³ 고정 볼륨을 다양한 방향으로 흩어 놓았다. [Museth(2013)의 VDB](https://doi.org/10.1145/2487228.2487235)는 block-hashing과 내부 트리를 결합해 sparse 영역은 비워 두고 표면 근방만 계층적으로 정제하는 구조를 제안했고, OpenVDB로 공개되어 오늘날 자율주행용 dense map의 backbone이 됐다(Ch.17 LiDAR의 nvblox 계보와 이어진다). [Reijgwart et al.(2023)의 wavemap](https://arxiv.org/abs/2306.08125)은 wavelet 변환으로 occupancy를 압축해 해상도-메모리 트레이드오프를 재조정했다. Ramos와 Ott가 이끈 다른 계보는 아예 표현을 연속 함수로 넘겼다. [O'Callaghan과 Ramos(2012)의 GPOM(Gaussian Process Occupancy Map)](https://doi.org/10.1177/0278364911435991)은 깊이 측정을 Gaussian Process 회귀로 연결해 측정되지 않은 복셀까지 확률적으로 채웠고, [Ramos와 Ott(2016)의 Hilbert Map](https://doi.org/10.1177/0278364916684382)은 Hilbert space 특징을 logistic regression으로 학습시켜 스트리밍 가능한 확률적 occupancy를 제공했다. [Behley와 Stachniss(2018)의 SuMa](https://www.ipb.uni-bonn.de/wp-content/papercite-data/pdf/behley2018rss.pdf)는 ElasticFusion이 실내 RGB-D에서 쓴 surfel 표현을 outdoor LiDAR로 옮겨 KITTI에서 작동하는 surfel-based SLAM을 만들었다(→ Ch.17). KinectFusion이 방 한 칸에서 멈췄던 자리에서, 이 계보들이 outdoor·도시 규모·확률적 불확실성 쪽으로 각자의 방향을 열었다.

2020년을 전후해 NeRF가 등장하면서 고품질 dense 재구성을 원하는 수요는 NeRF와 3D Gaussian Splatting으로 이동했다. RGB-D SLAM은 localization과 mapping을 분리하는 구조 속에서 depth를 추적 보조로 쓰는 수준으로 좁아졌다.

dense 시대는 짧았지만 흔적은 남았다. TSDF 표현은 자율주행용 occupancy map으로 이어졌고, ICP는 LiDAR SLAM의 표준 추적 수단이 됐다. 접근 방식은 퇴각했지만 부품들은 다른 시스템 안으로 흩어졌다.

---

## 🧭 아직 열린 것

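**Large-scale outdoor dense reconstruction.** The sunlight vulnerability of IR structured light is a problem for active depth sensors in general. LiDAR covers longer ranges but is poor in color and fine surface detail. As of 2026 there is still no established way to process large outdoor environments densely with an RGB-D approach. Learned stereo depth estimation is advancing quickly and some studies are exploring it as an alternative, but its limits in dark regions, on reflective surfaces, and at long range remain unresolved.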

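**Dense reconstruction of dynamic scenes.** Every system from KinectFusion to BundleFusion was designed on the premise of a static scene. Densely reconstructing a space where people walk around requires separating out dynamic objects, which calls for combining real-time semantic segmentation with dense SLAM. [DynaSLAM](https://arxiv.org/abs/1806.05620), [MaskFusion](https://arxiv.org/abs/1804.09194), and others have tried, but both compute cost and robustness fall short of practical deployment.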

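**Memory efficiency of the TSDF family.** The memory cost of voxel grids has been reduced by Voxblox's hashing and OctoMap's octree compression. Even so, dense representations at the scale of a building floor or a city block still run to tens of gigabytes. An adaptive-resolution dense map that decides automatically which resolution to keep in which region has no general-purpose solution yet. Implicit neural representations such as [Instant-NGP](https://arxiv.org/abs/2201.05989) are approaching this problem, but a trade-off between real-time updates and query speed remains.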

dense SLAM이 실내 방 한 칸을 메시로 채우는 동안, 그 방으로 돌아오는 문제는 별도의 계보가 맡고 있었다. KinectFusion은 loop closure가 없었다. 있었다면 어떻게 됐을까. 이미 그 답을 가지고 있는 연구자들이 옥스퍼드에 있었다. 그들이 붙잡고 있던 것은 장소를 기억하는 문제였다.

---

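# Ch.10 — The Parallel Line of Place Recognition: From FAB-MAP to NetVLAD, and on to AnyLoc

Around 2003, while Davison was demonstrating real-time 3D tracking with a single webcam, Mark Cummins and Paul Newman of the Oxford Mobile Robotics Group were working on a different question: "How does a robot recognize a place it has passed through before?" As long as VO (visual odometry) suffered from accumulated drift, no SLAM system could close a loop without an answer to it. Place recognition developed throughout the 2000s in parallel with the other components of Visual SLAM, but as a lineage of its own. FAB-MAP transplanted Josef Sivic's BoW idea into robot space, DBoW2 made it practical, and NetVLAD broke away from it through learning. In 2023 AnyLoc took foundation-model features exactly as they were.

While Ch.7 completed the feature-based lineage with the ORB-SLAM trilogy, Ch.8 traced the direct lineage up to DSO, and Ch.9 followed dense mapping from KinectFusion to BundleFusion, place recognition drew a line unlike any of them. It was neither tracking nor mapping, and it derived from neither. Even so, all three lineages were incomplete without loop closure, and place recognition supplied the "where have I seen this" judgment that loop closure needs.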

---

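## 10.1 Place Recognition Before BoW

To close a loop indoors, in tunnels, or in urban canyons where GPS is unavailable, a robot must quickly find, among thousands of candidate images, the past observation most similar to the current one. Pixel-wise comparison is a linear scan, O(N) in the number of stored images, and once that number passes tens of thousands, real-time operation is out of reach.

In computer vision in the early 2000s, Sivic and Zisserman were the first to take this on. ["Video Google"](https://www.robots.ox.ac.uk/~vgg/publications/2003/Sivic03/sivic03.pdf), presented at ICCV 2003, applied TF-IDF from document retrieval to images. It clustered SIFT descriptors with k-means to form "visual words" and represented an image as a frequency vector over those words. With an inverted index, each word lookup is close to O(1), so retrieval no longer has to scan every stored image. Place recognition researchers adopted the idea right away.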
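The retrieval idea above can be sketched in a few lines. This is a minimal illustration, not the "Video Google" implementation: the function names `build_index` and `query`, the toy word ids, and the product-of-weights score are all assumptions made for the example, and real systems quantize descriptors into words first.

```python
import math
from collections import Counter, defaultdict


def build_index(db_words):
    """db_words: {image_id: [visual word ids]} -> (inverted index, idf table)."""
    inverted = defaultdict(dict)  # word -> {image_id: term frequency}
    for image_id, words in db_words.items():
        for word, count in Counter(words).items():
            inverted[word][image_id] = count / len(words)
    n_images = len(db_words)
    # Words that appear in every image get idf = log(1) = 0 and stop mattering.
    idf = {w: math.log(n_images / len(postings)) for w, postings in inverted.items()}
    return inverted, idf


def query(words, inverted, idf):
    """Score only the images that share at least one word with the query."""
    scores = defaultdict(float)
    for word, count in Counter(words).items():
        if word not in inverted:
            continue  # unseen word: no posting list to visit
        q_weight = (count / len(words)) * idf[word]
        for image_id, tf in inverted[word].items():
            scores[image_id] += q_weight * tf * idf[word]
    return sorted(scores.items(), key=lambda kv: -kv[1])


# Hypothetical toy database: word 7 occurs everywhere, so it carries no weight.
db = {"corridor": [1, 2, 2, 7], "kitchen": [3, 4, 4, 7], "lobby": [5, 6, 7, 7]}
inverted, idf = build_index(db)
ranking = query([4, 4, 3, 9], inverted, idf)
print(ranking[0][0])
```

Only the posting lists of the query's own words are visited, which is why the cost tracks the query rather than the size of the database.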

---

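## 10.2 FAB-MAP — Probabilistic BoW and the Chow-Liu Tree (2008)

In 2008, Mark Cummins and Paul Newman of the Oxford Mobile Robotics Group published [Cummins & Newman. FAB-MAP: Probabilistic Localization and Mapping in the Space of Appearance](https://doi.org/10.1177/0278364908090961).

The core question of FAB-MAP (**Fast Appearance-Based Mapping**) is: "Is this scene a place already in the database, or somewhere entirely new?" A plain similarity score cannot settle this. If there are dozens of similar-looking corridors, the highest similarity does not guarantee the right answer.

Cummins and Newman framed it as a Bayesian inference problem. Given an observation $z_t$ (the set of visual word occurrences), compute the probability that the current location is each place $\ell_i$ in the database: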

$$P(\ell_i \mid z_t) \propto P(z_t \mid \ell_i) P(\ell_i)$$

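The difficulty is $P(z_t \mid \ell_i)$. Assuming the visual words are independent gives naive Bayes, but in practice visual words are correlated. When the word for "door" appears, the word for "door handle" tends to appear with it. The independence assumption distorts the probabilities.

FAB-MAP modeled this correlation with a **Chow-Liu tree**. A Chow-Liu tree is the tree-structured graphical model that maximizes the total pairwise mutual information along its edges. The mutual information between two words $e_i, e_j$ is defined as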

$$I(e_i; e_j) = \sum_{e_i, e_j} P(e_i, e_j) \log \frac{P(e_i, e_j)}{P(e_i)P(e_j)}$$

The Chow-Liu algorithm takes this quantity as the edge weight and builds a maximum spanning tree. Factoring the joint likelihood over that tree gives

$$P(z_t \mid \ell_i) = \prod_k P(z_t^k \mid z_t^{\text{pa}(k)}, \ell_i)$$

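Here $z_t^k \in \{0,1\}$ indicates whether the $k$-th word occurred, and $\text{pa}(k)$ is the parent of $k$ in the tree (the root has no parent, so its factor conditions on $\ell_i$ alone). Because this factorization reflects co-occurrence patterns between words, unlike naive Bayes with its independence assumption, it can lower false positives among visually similar places such as corridors. The vocabulary and the tree are trained together on a large image set in the learning phase.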
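The tree construction can be sketched as follows. This is a toy illustration under stated assumptions, not FAB-MAP's code: the helper names `mutual_information` and `chow_liu_parents` are hypothetical, probabilities are plain empirical frequencies from a handful of binary observations, and Prim's algorithm is just one convenient way to get a maximum spanning tree.

```python
import math
from itertools import combinations


def mutual_information(data, i, j):
    """Empirical I(e_i; e_j) for binary columns i and j of the observations."""
    n = len(data)
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = sum(1 for row in data if row[i] == a and row[j] == b) / n
            p_a = sum(1 for row in data if row[i] == a) / n
            p_b = sum(1 for row in data if row[j] == b) / n
            if p_ab > 0:
                mi += p_ab * math.log(p_ab / (p_a * p_b))
    return mi


def chow_liu_parents(data, root=0):
    """Maximum spanning tree over MI edge weights (Prim); returns {child: parent}."""
    n_words = len(data[0])
    weight = {frozenset(pair): mutual_information(data, *pair)
              for pair in combinations(range(n_words), 2)}
    parents, in_tree = {}, {root}
    while len(in_tree) < n_words:
        # Attach the outside word that has the heaviest edge into the tree.
        child, parent = max(
            ((c, p) for p in in_tree for c in range(n_words) if c not in in_tree),
            key=lambda cp: weight[frozenset(cp)],
        )
        parents[child] = parent
        in_tree.add(child)
    return parents


# Toy observations: words 0 and 1 always co-occur; word 2 is unrelated to both.
observations = [
    (1, 1, 0), (1, 1, 1), (0, 0, 0), (0, 0, 1),
    (1, 1, 0), (0, 0, 1), (1, 1, 1), (0, 0, 0),
]
parents = chow_liu_parents(observations)
print(parents)
```

Word 1 attaches to word 0 because that edge carries all the mutual information; word 2 attaches wherever the tie falls, since it is independent of both.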

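FAB-MAP also explicitly handles the possibility that the current location is a new place not in the database. Adding the "new place" hypothesis reduced false positives. In loop closure a false positive leads to catastrophic failure, so in practice this was essential.

> 🔗 **Borrowing.** FAB-MAP's visual word scheme was transplanted directly from Sivic & Zisserman's "Video Google" (2003). It applies the inverted-index logic of document retrieval to a robot's memory of places.

In 2011 Cummins and Newman published [FAB-MAP 2.0](https://www.robots.ox.ac.uk/~mjc/Papers/cummins_newman_ijrr_fabmap2_2010_preprint.pdf). The goal was to extend the map scale the method could handle to the 1,000 km level. Experiments showed it working on city-scale datasets.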

---

## 10.3 DBoW2 — Binary Descriptors and the Vocabulary Tree (2012)

FAB-MAP was built on floating-point descriptors in the mold of SIFT. Around 2012 the SLAM community was moving toward faster binary descriptors, BRIEF, ORB, and BRISK in particular. Using a SIFT vocabulary as-is had a computational cost problem.

In 2012, Dorian Gálvez-López and Juan D. Tardós (Universidad de Zaragoza) published [Gálvez-López & Tardós. Bags of Binary Words for Fast Place Recognition in Image Sequences](https://doi.org/10.1109/TRO.2012.2197158). **DBoW2** is a vocabulary tree over binary descriptors; comparison by Hamming distance made word assignment tens of times faster than with SIFT.

DBoW2's structure is a vocabulary tree built by hierarchical k-means. The BoW vector that represents an image is a TF-IDF-weighted frequency vector of binary words. Each leaf node $w_i$ of a tree with branching factor $k$ and depth $d$ is assigned the TF-IDF weight

$$\eta_i = \frac{n_i}{n} \cdot \log \frac{N}{N_i}$$

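Here $n_i$ is the number of times the word occurs in the image, $n$ is the total number of words in the image, $N$ is the number of database images, and $N_i$ is the number of images containing $w_i$. The similarity of two images $a$, $b$ is computed as the L1-norm score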

$$s(\mathbf{v}_a, \mathbf{v}_b) = 1 - \frac{1}{2} \left\| \frac{\mathbf{v}_a}{|\mathbf{v}_a|} - \frac{\mathbf{v}_b}{|\mathbf{v}_b|} \right\|_1$$

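Assigning a descriptor to a word means descending the tree, about $k \cdot d$ Hamming comparisons, which is logarithmic in the vocabulary size. The inverted index then limits scoring to the database images that share words with the query.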
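The weighting and the score can be checked on a toy example. The sketch below is an illustration, not DBoW2's C++ API: `bow_vector` and `l1_score` are hypothetical names, the document-frequency table is made up, and $|\mathbf{v}|$ in the score is read as the L1 norm.

```python
import math
from collections import Counter


def bow_vector(words, n_db_images, images_with_word):
    """TF-IDF weights eta_i = (n_i / n) * log(N / N_i) for one image."""
    counts = Counter(words)
    total = len(words)
    return {w: (c / total) * math.log(n_db_images / images_with_word[w])
            for w, c in counts.items()}


def l1_score(va, vb):
    """s = 1 - 0.5 * || va/|va| - vb/|vb| ||_1, with |.| taken as the L1 norm."""
    norm_a = sum(abs(x) for x in va.values())
    norm_b = sum(abs(x) for x in vb.values())
    diff = sum(abs(va.get(w, 0.0) / norm_a - vb.get(w, 0.0) / norm_b)
               for w in set(va) | set(vb))
    return 1.0 - 0.5 * diff


# Hypothetical statistics: N = 10 database images, N_i per word id.
N = 10
images_with_word = {1: 2, 2: 5, 3: 1, 4: 5}
v_a = bow_vector([1, 1, 2], N, images_with_word)
v_b = bow_vector([1, 1, 2], N, images_with_word)   # same words as a
v_c = bow_vector([3, 4], N, images_with_word)      # no word in common with a
same = l1_score(v_a, v_b)
disjoint = l1_score(v_a, v_c)
print(same, disjoint)
```

The score lands in [0, 1]: identical word distributions give 1, and images with no shared word give 0.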

> 🔗 **Borrowing.** DBoW2's vocabulary tree concept descends from Nistér & Stewénius's 2006 ["Scalable Recognition with a Vocabulary Tree"](https://people.eecs.berkeley.edu/~yang/courses/cs294-6/papers/nister_stewenius_cvpr2006.pdf) (CVPR). DBoW2 transplanted that structure into the world of binary descriptors and adjusted the weighting scheme for SLAM.

What makes DBoW2 important is deployment more than algorithm. Released as open source, the library was adopted as the loop closure module of ORB-SLAM (2015), and ORB-SLAM2 and ORB-SLAM3 kept the same DBoW2. From 2015 into the mid-2020s, place recognition in the SLAM community was in effect handled by DBoW2.

The partnership of Gálvez-López and Tardós is also worth noting. Tardós went on to lead the ORB-SLAM trilogy with Mur-Artal and Campos. DBoW2 in effect prepared that project's place recognition layer in advance.

---

## 10.4 NetVLAD — CNN-Based VPR (2016)

The BoW family had one fundamental limit. A vocabulary was trained for a particular descriptor in a particular environment. When lighting changed, seasons turned, or the viewpoint shifted substantially, the distribution of visual words changed and the pre-trained vocabulary no longer fit.

Relja Arandjelović, Petr Gronat, Akihiko Torii, Tomáš Pajdla, and Josef Sivic presented [NetVLAD: CNN Architecture for Weakly Supervised Place Recognition](https://doi.org/10.1109/CVPR.2016.572) at CVPR 2016. The Sivic among the authors is the Sivic of 2003's "Video Google." The person who brought BoW into image retrieval at ICCV 2003 put his name, 13 years later, on a paper that went beyond that method's limits.

NetVLAD's idea was to make **VLAD (Vector of Locally Aggregated Descriptors)** aggregation differentiable.

VLAD is an aggregation scheme proposed by [Jégou et al.](https://inria.hal.science/inria-00548637/file/jegou_compactimagerepresentation.pdf) in 2010. It represents the whole image by accumulating, for each local descriptor, its residual to the nearest cluster center (visual word). The VLAD sub-vector for cluster center $k$ is

$$\mathbf{V}(k) = \sum_{\mathbf{x}_i : \text{NN}(\mathbf{x}_i)=k} (\mathbf{x}_i - \boldsymbol{\mu}_k)$$

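The full VLAD vector $\mathbf{V} = [\mathbf{V}(1)^\top, \ldots, \mathbf{V}(K)^\top]^\top$ concatenates these sub-vectors over all clusters and is then L2-normalized. With $K$ clusters and $D$-dimensional descriptors, the final vector has $KD$ dimensions. Because it keeps residuals rather than only a count per word, a VLAD vector carries far richer information than BoW's hard word assignment.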
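A small sketch makes the aggregation concrete. It is written with plain Python lists for readability and is not taken from any VLAD library; the function name `vlad` and the 2-D toy descriptors are assumptions for the example.

```python
import math


def vlad(descriptors, centers):
    """Hard-assignment VLAD: sum residuals per nearest center, concat, L2-normalize."""
    dim = len(centers[0])
    blocks = [[0.0] * dim for _ in centers]
    for x in descriptors:
        # NN(x): index of the closest cluster center.
        k = min(range(len(centers)), key=lambda c: math.dist(x, centers[c]))
        for d in range(dim):
            blocks[k][d] += x[d] - centers[k][d]
    flat = [v for block in blocks for v in block]
    norm = math.sqrt(sum(v * v for v in flat)) or 1.0
    return [v / norm for v in flat]


# K = 2 centers, D = 2 dimensions, so the output has K * D = 4 entries.
centers = [(0.0, 0.0), (10.0, 10.0)]
descriptors = [(1.0, 0.0), (0.0, 1.0), (10.0, 12.0)]
v = vlad(descriptors, centers)
print(v)
```

Before normalization the two blocks are (1, 1) and (0, 2): the first two descriptors pull the first block along both axes, and the third contributes only a vertical residual to the second.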

> 🔗 **Borrowing.** NetVLAD's aggregation design directly inherits VLAD from Jégou et al.'s "Aggregating Local Descriptors into a Compact Image Representation" (CVPR 2010). What NetVLAD did was replace VLAD's hard assignment with soft assignment and make the whole pipeline trainable end-to-end.

The NetVLAD layer relaxes the nearest-neighbor assignment of the original VLAD into a softmax:

$$\bar{a}_k(\mathbf{x}_i) = \frac{e^{\mathbf{w}_k^\top \mathbf{x}_i + b_k}}{\sum_{k'} e^{\mathbf{w}_{k'}^\top \mathbf{x}_i + b_{k'}}}$$

Here $\mathbf{x}_i$ is a local feature extracted by the CNN, and $\mathbf{w}_k$ and $b_k$ are learnable parameters. Accumulating residuals with this soft assignment gives the NetVLAD sub-vector

$$\mathbf{V}(k) = \sum_i \bar{a}_k(\mathbf{x}_i)\,(\mathbf{x}_i - \boldsymbol{\mu}_k)$$

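Applying intra-normalization (L2-normalizing each sub-vector) to the full vector $\mathbf{V} = [\mathbf{V}(1)^\top, \ldots, \mathbf{V}(K)^\top]^\top$ and then L2-normalizing the whole again yields the final VPR descriptor. Unlike hard-assignment VLAD, gradients propagate back through the assignment, so the layer can be trained end-to-end together with the CNN backbone.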
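The forward pass of such a layer can be sketched without any deep learning framework. The code below is an illustration of the two formulas and the two normalization steps only; it does not train anything, the names `soft_assign` and `netvlad` are hypothetical, and the weights, biases, and centers are made-up numbers rather than learned values.

```python
import math


def soft_assign(x, weights, biases):
    """Softmax over clusters of w_k^T x + b_k."""
    logits = [sum(wi * xi for wi, xi in zip(w, x)) + b
              for w, b in zip(weights, biases)]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def l2_normalize(v):
    norm = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / norm for a in v]


def netvlad(features, weights, biases, centers):
    """Soft-assigned residual sums, intra-normalized, then globally L2-normalized."""
    dim = len(centers[0])
    blocks = [[0.0] * dim for _ in centers]
    for x in features:
        a = soft_assign(x, weights, biases)
        for k, mu in enumerate(centers):
            for d in range(dim):
                blocks[k][d] += a[k] * (x[d] - mu[d])
    blocks = [l2_normalize(b) for b in blocks]          # intra-normalization
    return l2_normalize([v for b in blocks for v in b])  # final L2 normalization


# Made-up parameters for K = 2 clusters and D = 2 feature dimensions.
centers = [(0.0, 0.0), (4.0, 4.0)]
weights = [(-1.0, -1.0), (1.0, 1.0)]
biases = [0.0, -4.0]
features = [(0.5, 0.0), (4.0, 3.5), (3.5, 4.5)]
descriptor = netvlad(features, weights, biases, centers)
print(descriptor)
```

Every feature now contributes to every cluster with a weight between 0 and 1, which is what lets the gradient reach $\mathbf{w}_k$ and $b_k$.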

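The training scheme differed too. The authors used Google Street View Time Machine data with a weakly supervised triplet loss, treating images of the same place taken at different times as positives and images of other places as negatives. With GPS positions alone, training needed no manual labels.

On the Pittsburgh 250k and Tokyo 24/7 benchmarks, NetVLAD beat the DBoW family and earlier VLAD-based methods by a wide margin. It was far more robust across lighting and seasonal changes and tolerated viewpoint differences as well. Still, NetVLAD was not integrated into practical SLAM pipelines right away. Its inference time and memory footprint were heavier than DBoW2's, and the ORB-SLAM ecosystem had already been built around DBoW2.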

---

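## 10.5 Patch-NetVLAD, MixVPR, AnyLoc (2021-2023)

After NetVLAD, VPR (Visual Place Recognition) research spread out in pursuit of better generalization.

In 2021 Hausler et al. released [Patch-NetVLAD](https://arxiv.org/abs/2103.01486). Instead of judging a place from one global descriptor as NetVLAD does, it splits the image into patches and combines the NetVLAD representations of those patches spatially. On Tokyo 24/7 it raised Recall@1 by about 10 percentage points over NetVLAD. Patch-level processing raised inference cost along with it.

In 2023, [MixVPR](https://arxiv.org/abs/2303.02190) by Ali-bey et al. produced global features with MLP-based feature mixing on top of CNN feature maps, an all-MLP design rather than an attention-based one. The goal was a balance between a lightweight model and performance. VPR papers of this period commonly used Mapillary Street Level Sequences (MSLS) and seasonal-change datasets such as Nordland as benchmarks. Extreme lighting and seasonal conditions emerged as the shared barrier.

In 2023, [AnyLoc: Towards Universal Visual Place Recognition](https://arxiv.org/abs/2308.00688) by Keetha et al. took a different direction: use DINOv2's self-supervised features for place recognition as they are, with no fine-tuning.

> 🔗 **Borrowing.** AnyLoc's feature extraction takes the pre-trained ViT representation from [DINOv2](https://arxiv.org/abs/2304.07193) by Oquab et al. (Meta AI, 2023). AnyLoc placed VLAD aggregation on top of it. The BoW-VLAD lineage that began with FAB-MAP rejoins the main stream in the foundation model era.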

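DINOv2 is a Vision Transformer (ViT) trained on large-scale internet imagery. It produces general-purpose features that are not biased toward a particular city, season, or camera. What Keetha et al. focused on in AnyLoc was DINOv2's **facets**. Each attention layer of a ViT exposes query (Q), key (K), and value (V) projections as well as the output tokens (patch features). Keetha et al. confirmed experimentally that, of these four facets, the value (V) facet gives the most semantically stable representation for place recognition. The Q and K facets tend to concentrate on structure and geometry, while the V facet concentrates more on semantics, which favors a consistent place representation across seasons and lighting. Keetha et al. showed that feeding this V-facet representation into VLAD aggregation lets a single model work in very diverse environments: different parts of the world, indoors and outdoors, underground, and aerial views. Across more than seven environments, including Pittsburgh, Tokyo, an indoor factory, an underground parking garage, and a library, the single model matched or beat earlier specialized methods.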

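Once generality was resolved along one axis, the next branch moved toward crossing modality boundaries. [(LC)²](https://arxiv.org/abs/2304.08660) by Lee et al. (RA-L 2023) projects camera images and LiDAR point clouds into a common 2.5D depth image and attempts cross-modal retrieval, looking up a place in a LiDAR map from a 2D query. Its successor, LC²++, follows a LoRA-adapted global retrieval stage with MINIMA-based local matching and PnP, connecting place decision to 6-DoF pose recovery in one pipeline. Cross-modal evaluation of this kind became possible only with datasets such as [ViViD++](https://arxiv.org/abs/2204.06183) by Lee et al. (RA-L 2022), which synchronizes visible, thermal, event, LiDAR, inertial, and depth data indoors, outdoors, and underground.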

---

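## 10.6 Attempts to Unify Place Recognition and Metric Localization (2024-2025)

Place recognition research had run in parallel with the rest of SLAM since the early 2000s. ORB-SLAM embedded DBoW2, but the place recognition module was a black box isolated from mapping and tracking. Input: an image. Output: a loop candidate ID.

In 2024-2025 this boundary began to blur. [EigenPlaces](https://arxiv.org/abs/2308.10832) by Berton et al. (2023) and [SALAD](https://arxiv.org/abs/2311.15937) by Izquierdo & Civera (2023 arXiv / CVPR 2024) explored pulling place recognition descriptors directly into metric localization. The aim went one step beyond deciding "where was this seen": extract a 6-DoF pose straight from the place recognition representation itself.

Around 2024, attempts to combine Gaussian map representations with place recognition also appeared. This direction went hand in hand with the rise of 3DGS (3D Gaussian Splatting) as a map representation.

> 📜 **Prophecy vs. Reality.** In the 2011 FAB-MAP 2.0 paper, Cummins and Newman demonstrated appearance-only loop closure on a 1,000 km trajectory, pushing up the scale limit of place recognition. Measured against the early FAB-MAP experiments, which drove around the Oxford campus and part of the city, this was a leap of two orders of magnitude. Later city-scale experiments using DBoW2 and large vocabularies reproduced the same scale in practical SLAM. The scale problem was solved this way, but the failure mode Cummins and Newman left behind (vocabulary-based representations that are fragile under seasonal and lighting change) was overcome with a different tool, one that deep learning brought. `[technology shift]`

> 📜 **Prophecy vs. Reality.** In the introduction of the 2016 NetVLAD paper, Arandjelović et al. named three challenges for solving place recognition (the CNN architecture, sufficient training data, and an end-to-end training procedure) and presented their contribution to each. NetVLAD answered the architecture and training procedure directly, but over the following seven years a stream of VPR papers kept targeting generalization across appearance conditions (season, lighting, viewpoint). In 2023 AnyLoc showed that a single model can serve many environments using foundation-model features without fine-tuning. The axis has shifted from specialized models toward general-purpose ones. `[ongoing]`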

---

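## 10.7 🧭 Still Open

**Extreme seasonal and lighting change.** On the Nordland (Norwegian railway, summer to winter) and Oxford RobotCar (a year of seasonal change) datasets, the same barrier has been reported for more than ten years. DINOv2-based methods narrowed the gap, but no single model yet recognizes the same place with 99% accuracy between snow-covered winter and leafy summer. Place recognition under severe appearance change remains an open problem as of 2026.

**Unifying place recognition and metric localization.** In most current SLAM pipelines, place recognition answers only "where was this seen," and the actual pose estimate is handled by a separate PnP or descriptor matching stage. Attempts to merge the two into one representation appeared in 2023-2025, but no method has yet reached deployment-grade precision and speed at the same time.

**Privacy of recognizable place representations.** The place representations a VPR system stores can be used, through reconstruction attacks, to recover the original images or 3D structure. When commercial robots map the interiors of homes, hospitals, and offices, this becomes a real concern. There is not yet a place representation that guarantees privacy without a loss in performance.

---

The three lineages of Part 3 (maturity) close here. While ORB-SLAM standardized the feature-based pipeline, DSO completed the photometric theory, and the KinectFusion family exposed both the promise and the limits of dense mapping, place recognition stood somewhere different from all of them. It grew out of the image retrieval problem in computer vision and took the supplier's seat when SLAM came to need loop closure. That distance turned out to be an advantage. When the deep learning wave arrived, place recognition absorbed the new tools faster than the existing SLAM pipelines did.

When AnyLoc appeared in 2023, Sivic's name was in the references: the person who plugged BoW into image retrieval in 2003, and the co-author who moved past its limits with NetVLAD in 2016. At the end of that lineage, AnyLoc pushed the door Sivic had opened on toward foundation models.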

---

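# Ch.11 — The Revival of Depth Estimation: From Eigen to Depth Anything

In Part 3 (Ch.7-10), the feature-based, direct, RGB-D, and place recognition lines each reached maturity in their own way. Geometry was everything. ORB-SLAM reconstructed the world with epipolar geometry, DSO did so with photometric consistency, and KinectFusion stacked up surfaces with ICP. There was no room for learning to cut in. Or so it was thought. Part 4 is the story of that boundary breaking down, and the crack came from an unexpected place: not from a SLAM researcher but from computer vision, and more specifically from a single paper by a graduate student at NYU.

Monocular depth estimation was one of the oldest ill-posed problems in computer vision. Recovering depth from a single image is impossible in principle, because the camera discards depth when it projects the 3D world onto 2D. Yet humans judge depth with one eye: perspective, occlusion, texture gradients, shading on surfaces. What if these cues could be learned statistically? In 2014, David Eigen at NYU put a CNN to that question. That one experiment was the starting point of a lineage that would rewrite SLAM pipelines ten years later.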

---

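## 1. Eigen 2014 — The First CNN Depth

Monocular depth estimation research existed before 2014. Ashutosh Saxena (Make3D, Stanford) [published in 2005 a system that predicts a depth map from a single image by combining hand-crafted image features with a Markov Random Field](https://papers.nips.cc/paper/2921-learning-depth-from-single-monocular-images). The results were coarse and held up only in a narrow range of scenes.

[Eigen et al. 2014](https://arxiv.org/abs/1406.2283), by Eigen, Puhrsch, and Fergus, changed the approach itself: a two-stage CNN in which a coarse network predicts global structure and a fine network refines local detail. The training data was NYU Depth v2, 120,000 indoor frames captured with a Kinect RGB-D camera. The numbers improved on Make3D by the standards of the time, but what mattered more was the proof of concept. Depth can be learned.

One decisive weakness remained: **scale ambiguity**. The network learns relative depth structure, but absolute scale is tied to the distribution of the training data. Take a model trained on NYU interiors outdoors and the scale is wrong. This limit stays on the field's to-do list until 2024.

> 🔗 **Borrowing.** Eigen 2014 inherited the depth estimation task itself from Make3D (Saxena 2005). Swapping the hand-crafted features and MRF for a CNN was the key replacement; the task definition and evaluation metrics (RMSE, threshold accuracy) were carried over.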

---

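## 2. Garg → Godard — Self-Supervised Depth

The bottleneck of supervised depth learning was data. Kinect works well indoors, but outdoors, and in sunlight especially, its infrared pattern washes out. Building a large outdoor RGB-D dataset is expensive.

In 2016, [Ravi Garg (University of Adelaide) opened another path](https://arxiv.org/abs/1603.04992): use stereo image pairs as the training signal. Predict depth from the left image, then use that depth and the camera baseline to reconstruct the right image. The right image already exists, so a photometric loss provides the supervision. No labels are needed.

In 2017 Clément Godard (UCL) systematized this idea as **MonoDepth** in [Godard et al. 2017](https://doi.org/10.1109/CVPR.2017.699). Its left-right consistency is a bidirectional constraint: the depth predicted from the left image and the depth predicted from the right must agree. It also added structural similarity (SSIM) to the photometric loss to improve stability in textureless regions. The key point is that stereo pairs are needed only at training time. Inference runs on a single image. On the KITTI benchmark it was the best self-supervised method of its day.

> 🔗 **Borrowing.** The photometric loss of Garg and Godard comes from the stereo matching literature. It repurposes the intensity consistency constraint of disparity estimation, [as surveyed by Scharstein and Szeliski (2002)](https://vision.middlebury.edu/stereo/taxonomy-IJCV.pdf), as the training signal for a depth network.

In 2019 Godard's *MonoDepth2* ([Godard et al. 2019, ICCV](https://arxiv.org/abs/1806.01260)) moved on to self-supervision from monocular video instead of stereo pairs. A depth network and a pose network are trained together. The pose network predicts the camera motion between consecutive frames, and the depth network's output is used to warp the previous frame into the current one. Both networks are optimized jointly to reduce the warping error. Two key devices were added. First, the **minimum reprojection loss**: for each pixel, take the lowest photometric error among the source frames, which reduces errors in occluded regions. Second, **auto-masking**: automatically exclude pixels whose appearance does not change between frames, such as objects moving at the same speed as the camera or frames in which the camera is stationary.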
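The two devices combine into a few lines of per-pixel logic. The sketch below assumes the per-pixel photometric errors have already been computed elsewhere (the warping itself is omitted); the function name `min_reprojection_loss` and the toy error values are hypothetical, and the code is a reading of the description above rather than MonoDepth2's released implementation.

```python
def min_reprojection_loss(warped_errors, identity_errors):
    """
    warped_errors:   one list per source frame, per-pixel error after warping
                     the source into the target view.
    identity_errors: one list per source frame, per-pixel error of the
                     unwarped source against the target.
    Returns (mean loss over the kept pixels, keep mask).
    """
    n_pixels = len(warped_errors[0])
    # Minimum reprojection: per pixel, trust the source frame that explains it best.
    best_warped = [min(src[p] for src in warped_errors) for p in range(n_pixels)]
    best_identity = [min(src[p] for src in identity_errors) for p in range(n_pixels)]
    # Auto-mask: keep a pixel only if warping beats simply copying the source.
    mask = [w < i for w, i in zip(best_warped, best_identity)]
    kept = [w for w, keep in zip(best_warped, mask) if keep]
    loss = sum(kept) / len(kept) if kept else 0.0
    return loss, mask


# Three pixels, two source frames (previous and next). Pixel 2 looks identical
# in the unwarped sources, as it would for an object moving with the camera.
warped = [[0.1, 0.9, 0.3], [0.8, 0.2, 0.3]]
identity = [[0.5, 0.6, 0.0], [0.4, 0.7, 0.0]]
loss, mask = min_reprojection_loss(warped, identity)
print(loss, mask)
```

Pixel 1 is occluded in one source (error 0.9) but visible in the other (0.2), so the minimum keeps the occlusion from polluting the loss; pixel 2 is dropped by the mask.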

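It was a clean design. Problems remained all the same. Moving objects and reflective surfaces caused trouble, and textureless regions such as the sky were worse. In those regions the photometric consistency assumption breaks down. And scale is still ambiguous: video supervision resolves scale only relatively, from frame to frame.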

---

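## 3. MiDaS — Mixing Datasets

**MiDaS** (Mixing Datasets for Zero-shot Cross-dataset Transfer), published in 2020 as [Ranftl et al. 2020](https://doi.org/10.1109/TPAMI.2020.3019967) by a team led by René Ranftl at Intel, asked a different question. What happens if you train on many datasets at once instead of one?

The problem is that depth units and scale differ from dataset to dataset. NYU is metric indoor depth, KITTI is outdoor LiDAR points, ReDWeb is stereo film, MegaDepth is SfM reconstruction. Mix them as they are and the network gets confused.

Ranftl's solution was an **affine-invariant loss**. The depth prediction and the ground truth for each image are normalized by an affine transformation (scale + shift) before they are compared. Concretely, the median is subtracted from each to remove the shift, and each is divided by its absolute deviation from that median (MAD) to remove the scale. This scale-and-shift invariant normalization makes the unit mismatch between datasets disappear. The network then learns "which is relatively farther," not absolute distance.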
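The normalization is easy to verify on a toy example. The sketch below is an illustration of the idea, not MiDaS training code: `normalize` and `ssi_loss` are hypothetical names, the scale is taken as the mean absolute deviation around the median (switching to the median of absolute deviations is a one-line change), and the comparison is a plain mean absolute difference.

```python
from statistics import median


def normalize(depth):
    """Remove shift (median) and scale (mean absolute deviation around it)."""
    t = median(depth)
    s = sum(abs(d - t) for d in depth) / len(depth)
    return [(d - t) / s for d in depth]


def ssi_loss(pred, target):
    """Mean absolute difference after normalizing both sides."""
    p, g = normalize(pred), normalize(target)
    return sum(abs(a - b) for a, b in zip(p, g)) / len(p)


target = [1.0, 2.0, 3.0, 4.0, 10.0]
affine_pred = [2.0 * d + 5.0 for d in target]   # same structure, other units
wrong_order = [10.0, 4.0, 3.0, 2.0, 1.0]        # depth ordering reversed
loss_affine = ssi_loss(affine_pred, target)
loss_wrong = ssi_loss(wrong_order, target)
print(loss_affine, loss_wrong)
```

A prediction that differs from the ground truth only by a scale and a shift costs nothing, while one that gets the depth ordering wrong is penalized, which is the behavior the mixed-dataset training relies on.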

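Trained on 12 datasets and more than 1.9 million images, MiDaS was the first to show practical cross-dataset generalization. It produced plausible relative depth on outdoor scenes, interiors, historical photographs, and film frames alike. There was no absolute scale, but depth ordering and structure were right.

The Ranftl team then published [**DPT** (Dense Prediction Transformer)](https://arxiv.org/abs/2103.13413) separately in 2021, replacing the MiDaS backbone with a ViT-based one. From MiDaS v3 on, DPT became the default backbone, and v3.1 (2022) was its refinement. Performance rose sharply.

> 🔗 **Borrowing.** MiDaS v3 and, later, Depth Anything reused CLIP, DINOv2, and ViT-family backbones as they were. Performance jumping from a backbone swap alone is a general pattern of the foundation model era, but in depth estimation the effect was first confirmed at scale with DPT (Ranftl 2021).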

---

## 4. Depth Anything — Foundation Scale

In January 2024, [Yang et al. 2024](https://arxiv.org/abs/2401.10891), **Depth Anything**, from Lihe Yang's team at TikTok Research, attacked the problem with scale. It used 1.5M labeled images (merged from existing datasets) and 62M unlabeled images, generating pseudo-labels for the unlabeled images and folding them into training. To improve the pseudo-labels it used semantic segmentation features as auxiliary supervision.

The result overtook earlier methods, MiDaS included, on all major benchmarks: KITTI, NYU, ScanNet, DIODE, and others. The model was ViT-L based, with 335M parameters. Inference was some distance from real time, but quality came first.

[**Depth Anything v2**](https://arxiv.org/abs/2406.09414), released the same year, added a large amount of synthetic data (Virtual KITTI, Hypersim, and others). Synthetic data covers regions that are hard to annotate in real data, such as reflective and transparent surfaces. v2 improved visibly over v1 in edge detail and in rendering thin structures.

Even so, Depth Anything is still relative depth, with no scale.

[**ZoeDepth** (Shariq Farooq Bhat et al. 2023)](https://arxiv.org/abs/2302.12288) and [**Metric3D v2** (2024)](https://arxiv.org/abs/2404.15506) attacked this last problem from another direction. They inject the camera intrinsics (focal length, sensor size) explicitly as network input, so that the network learns how the depth distribution of the same scene changes with focal length. Metric depth results on in-the-wild data were qualitatively different from before. Not perfect, but good enough to use in many practical scenarios.

---

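## 5. Re-importing into SLAM

From around 2021, SLAM researchers began pulling monocular depth models into the pipeline. The entry point was initialization. Monocular SLAM is structurally hard to initialize: triangulating from two frames needs a sufficient baseline, and scale is ambiguous from the first step.

Injecting a depth prior into the first frame speeds up initialization and roughly fixes the scale. [DROID-SLAM, published by Teed and Deng in 2021](https://arxiv.org/abs/2108.10869), couples recurrent optical flow with BA, and follow-up work in this family experimented with attaching monocular depth priors to geometric initialization.

Scale recovery was more direct. Monocular visual odometry (VO) accumulates scale drift as it runs. Using depth network predictions as a periodic scale anchor suppresses that drift. It was a practical patch, and it held up over far longer distances than pure VO.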
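One simple form of such a scale anchor is a robust ratio between the two depth sources. The sketch below is a generic illustration, not the method of any particular system: it assumes the network outputs metric depth for the same points the VO has triangulated, and the names `scale_from_depth_prior` and `rescale_translation` are hypothetical.

```python
from statistics import median


def scale_from_depth_prior(vo_depths, predicted_depths):
    """Median ratio of network-predicted to VO-triangulated depth for shared points."""
    ratios = [p / v for v, p in zip(vo_depths, predicted_depths) if v > 0 and p > 0]
    return median(ratios)


def rescale_translation(translation, scale):
    """Apply the recovered scale to an up-to-scale VO translation."""
    return [scale * t for t in translation]


# VO depths are in arbitrary units; the last network prediction is an outlier.
vo_depths = [1.0, 2.0, 4.0, 0.5]
predicted_depths = [2.1, 3.9, 8.0, 5.0]
scale = scale_from_depth_prior(vo_depths, predicted_depths)
step = rescale_translation([0.1, 0.0, 0.5], scale)
print(scale, step)
```

The median keeps a single bad prediction (here the ratio of 10) from dragging the scale, which matters because depth networks fail on exactly the surfaces VO also struggles with.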

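> 📜 **Prophecy vs. Reality.** In the 2014 paper Eigen mentioned combining depth with 3D geometry information such as surface normals as a natural extension. Joint multi-task learning was later partly realized in PAD-Net, VPD, and others. As of 2024, though, there is a strong case that the real impact came from sharing a ViT backbone rather than from merging tasks. The predicted direction and the actual path diverged. `[technology shift]`

> 📜 **Prophecy vs. Reality.** MiDaS (2020) took a detour, giving up absolute scale with a scale-and-shift invariant loss and focusing on relative depth alone, which fits the view that metric recovery is inherently hard without camera parameters. In 2024 Depth Anything v2 and Metric3D v2 attacked this direction head-on by taking camera intrinsics as input, and in-the-wild metric depth came close to practical, but full camera independence is not there yet. `[ongoing]`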

---

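## 🧭 Still Open

**Depth of reflective and transparent surfaces.** With glass, water, and metallic reflections, what the camera captures is not the actual surface. This is a problem at the level of physical optics. Even with more synthetic training data, generalization to real-world reflective scenes remains unstable. Specialized approaches such as [ClearGrasp (Sajjan et al. 2020)](https://arxiv.org/abs/1910.02550) exist, but there is no general solution. Even at foundation scale, the error in these regions is structurally large.

**Separating ego-depth from object-depth in dynamic scenes.** In scenes where cars, people, and bicycles move, photometric consistency is violated at the root. Self-supervised methods worked around this by masking moving objects. It was a workaround, not a solution. The problem of solving for the depth of moving objects jointly with, and separately from, ego-motion has been attempted by [Ranjan et al. (2019)](https://arxiv.org/abs/1805.09806) and several follow-ups, but remains hard at a practical level.

**Generalizing metric scale.** Metric3D v2 and Depth Anything v2 have begun to output metric depth conditioned on camera intrinsics. But not knowing the intrinsics is common. There are hundreds of smartphone models, and CCTV footage and historical archive photographs lack even EXIF. Camera-independent metric depth is not easy even at foundation model scale. This is the central remaining question for monocular depth as of 2025.

---

In 2024, while Depth Anything was overturning the benchmarks, a paper from Cambridge had already spent nine years as unfinished homework for the SLAM community. Pull the absolute pose straight out of a single image. No feature extraction, no optimization. No map exists to begin with. [PoseNet](https://arxiv.org/abs/1505.07427) was the name of that dream.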

---

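# Ch.12 — The End-to-End Setback

In Chapter 11, recovering depth from a single monocular camera became possible. Eigen's network pulled metric depth out of pixels, and SfMLearner produced geometric supervision without labels. It was the moment that proved learning can see shape. Could it go further, then? Pose estimation, loop closure: could all of SLAM be finished with one network? Between 2015 and 2018, that question found no answer.

In 2015, Alex Kendall, a PhD student at the University of Cambridge under Roberto Cipolla, completed work on a neural network, trained on images of Cambridge landmarks, that takes a single photograph and outputs a 6-DoF pose. The paper, named [Kendall et al. 2015. PoseNet](https://doi.org/10.1109/ICCV.2015.336), drew an immediate response at ICCV in Santiago de Chile. What if SLAM's thirty-year equation (feature extraction, matching, optimization, map management) could be compressed into a single CNN? The question produced dozens of papers from 2015 to 2018. Almost without exception, they reached the same conclusion.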

---

## 12.1 PoseNet

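What PoseNet inherited was [AlexNet (Krizhevsky et al. 2012)](https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks). Kendall took the observation that a deep CNN trained for classification on ImageNet forms high-level visual representations, and repurposed that feature hierarchy for pose estimation.

> 🔗 **Borrowing.** PoseNet's backbone is the [GoogLeNet (Inception, Szegedy et al. 2014)](https://arxiv.org/abs/1409.4842) architecture. It removes the classification head and attaches a 7-dimensional regression head (x, y, z, and the four quaternion components), and that is all. A direct borrowing that transplants the feature hierarchy learned on ImageNet into localization.

On the Cambridge Landmarks dataset that Kendall collected himself (several outdoor scenes including King's College Chapel, a street, and the Old Hospital), PoseNet recorded position errors of around 2 m and orientation errors of 5-8° depending on the scene (per §5 of the original paper). By 2015 standards these were impressive numbers. An answer came back within 5 ms on a single GPU. No feature extraction, no RANSAC, no map lookup.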

The paper triggered immediate follow-ups. [Bayesian PoseNet (Kendall & Cipolla 2016)](https://arxiv.org/abs/1509.05909) tried to estimate pose uncertainty with Monte Carlo Dropout. LSTM PoseNet integrated sequence information. Variants that added geometric losses appeared, among them Kendall's own 2017 version trained with a reprojection-error loss.

But as the baselines for comparison rose, the gap showed. On the same scenes, [Active Search (Sattler et al. 2012)](https://www.graphics.rwth-aachen.de/media/papers/sattler_eccv12_preprint_1.pdf) and DenseVLAD reached position errors at the 0.2 m level. The PoseNet family rarely got below errors of several meters. Regressing absolute pose from a single image had a limit in principle.

---

## 12.2 DeepVO

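If one of PoseNet's limits is its single-image input, what about feeding in a sequence? Sen Wang (Heriot-Watt, Edinburgh) and co-authors presented [Wang et al. 2017. DeepVO](https://arxiv.org/abs/1709.08429) at ICRA 2017. A CNN influenced by FlowNet extracts optical-flow features from pairs of consecutive frames, and an LSTM accumulates temporal context to output VO directly.

> 🔗 **Borrowing.** DeepVO's training labels are KITTI's GPS/IMU ground truth. The feature extraction design was borrowed directly from the optical flow CNN of [FlowNet (Dosovitskiy et al. 2015)](https://arxiv.org/abs/1504.06852). This is "deep VO" leaning on classical sensor measurements and on earlier deep learning research at the same time.

With the LSTM handling temporal modeling, the hope was that drift would be suppressed. The paper reported lower drift than DVO-SLAM or VISO2-M on some KITTI sequences. But there were conditions: driving patterns similar to the training sequences, similar lighting, similar urban scenery, similar speed profiles. When the conditions did not match, the "context" the LSTM had accumulated turned into bias.

[Zhou et al. 2017. SfMLearner](https://arxiv.org/abs/1704.07813), published the same year by Tinghui Zhou (UC Berkeley), approached from another angle. It estimates depth and ego-motion jointly through self-supervised learning, with a photometric reprojection loss as the training signal. Its strength was that it could be trained without labels.

> 🔗 **Borrowing.** SfMLearner's photometric loss is mathematically identical to the intensity residual used by classical direct SLAM. It moved the photometric principle of [DSO (Engel et al. 2018)](https://arxiv.org/abs/1607.02565) into a differentiable learning framework. This lineage survived: SfMLearner's self-supervision idea passes through MonoDepth2 and eventually becomes one of the preconditions of DROID-SLAM.

As standalone VO, however, SfMLearner ended up on the official KITTI leaderboard with less than half the performance of ORB-SLAM.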

---

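## 12.3 Three Causes of Failure

Between 2019 and 2020, papers in this area began a shared self-critique. In a 2019 presentation, Sudeep Pillai (MIT, later TRI) systematized the structural limits of the end-to-end approach. The causes came down to three.

**First: the absence of inductive bias.** Classical SLAM had carved decades of accumulated geometric constraints into the structure of its algorithms: the epipolar constraint, the rigid body motion assumption, scale invariance, the continuity of space. A CNN had to learn these from data all over again. ImageNet's photographs of cats and cars do not teach the metric geometry of 3D space. Even when a regression network appeared to get the pose right, it was hard to tell whether it did so because it understood 3D space or because it had memorized particular combinations of lighting, color, and texture.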

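**Second: failure to generalize.** Outside the training set, performance collapsed. PoseNet trained on Cambridge Landmarks was unusable on the streets of Oxford. DeepVO trained on KITTI saw its drift grow rapidly on other vehicle datasets. Classical ORB-SLAM also lost tracking when feature detection failed or lighting changed drastically, but that failure was predictable and the system could reinitialize. End-to-end was quietly wrong, without knowing how wrong.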

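**Third: the absence of uncertainty quantification.** SLAM is more than a pose estimator because downstream systems (path planning, obstacle avoidance) require the covariance of the position estimate. EKFs and factor graphs propagate covariance naturally. Bayesian PoseNet tried to estimate variance with dropout, but it was hard to verify that this variance was calibrated against the actual position error. On inputs outside the training distribution in particular, Bayesian PoseNet produced confident wrong answers. For a robot system, being confidently wrong is more dangerous than simply being wrong.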

---

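## 12.4 A Record of Reflection

Kendall did not look away from this failure. In 2019, having finished his PhD, he moved to Wayve and turned to imitation learning and world models for autonomous driving. This followed from his judgment that the problem definition, "regress absolute pose from a single image," was wrong. He did not give up on learning-based localization as such.

Federico Tombari's group (TU Munich, later Google) tried [CNN-SLAM (Tateno et al. 2017)](https://arxiv.org/abs/1704.03489) in the same period. The approach fused dense depth predicted by a CNN with the depth measurements of direct monocular SLAM. Since the learned part was confined to dense depth it was not fully end-to-end, but it was one branch of the hope that "a CNN might solve monocular SLAM's scale and low-texture problems." Performance varied from scene to scene and showed no consistent advantage in accuracy.

> 📜 **Prophecy vs. Reality.** In the PoseNet paper (2015), Kendall named uncertainty estimation, integration of temporal information, and extension to larger scenes as the next tasks. All three directions were carried out: Bayesian PoseNet (2016), LSTM PoseNet (2016), and a number of outdoor extension experiments. But each attempt ran into a new wall, and researchers eventually abandoned the approach as a whole. A reasonable prophecy is of no use if the platform itself is wrong. `[abandoned]`

Some attempts survived in another direction. SfMLearner's photometric self-supervision was absorbed into MonoDepth2 (Godard 2019) and, further on, into the training strategy of DROID-SLAM (Teed & Deng 2021). The LSTM-based temporal modeling that DeepVO demonstrated reappeared in modified form in visual-inertial learning research. Only the use of the ideas changed.

> 📜 **Prophecy vs. Reality.** In the SfMLearner paper (2017), Zhou presented the handling of dynamic objects and robustness to photometric noise as remaining tasks. Follow-up self-supervised work, [GeoNet (Yin & Shi 2018)](https://arxiv.org/abs/1803.02276) among it, pushed partway in this direction. But the path of replacing SLAM with self-supervised VO alone never joined the mainstream. Photometric self-supervision itself carried the lineage on, while the goal of end-to-end VO was rejected by the field. `[technology shift]`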

---

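## 12.5 The Lesson Settles

Around 2020 the field reached a consensus. "Geometry is for the algorithm, learning is for features and priors": that was roughly the direction.

> 🔗 **Borrowing.** This principle is put into practice in CodeSLAM (Bloesch 2018) and DROID-SLAM (Teed & Deng 2021), covered in Chapter 13. Both systems keep a geometric skeleton, a factor graph or bundle adjustment, and confine the learned part to feature extraction or forming a depth prior. It confirms that the skeleton PoseNet threw away was in fact something that could not be given up.

It was not that the classical pipeline was consistently superior to the learning-based alternatives. ORB-SLAM also failed often in textureless environments, at night, and in rain. The problem was that end-to-end errors were more opaque and less predictable.

The cause of the failure lay in neither the datasets nor the architectures. The path that ran straight from image to pose was missing thirty years of geometric knowledge in its entirety.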

---

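## 🧭 Still Open

**Which inductive bias to inject, and how.** The principle "geometry through the algorithm" is sound, but which geometry to encode, and at what level, is still an open question. Rigid body motion, or the epipolar constraint? In the foundation model era this boundary is blurring again. GaussianSLAM and other 3DGS-based systems are experimenting with dissolving geometry into a learned representation.

**Calibration of learned uncertainty.** This problem has not been solved since the failure of Bayesian PoseNet. How well deep learning uncertainty estimates are calibrated against actual error, especially on out-of-distribution inputs, remains open as of 2026. Autonomous driving is putting practical pressure on the question.

**Redefining what "end-to-end" means.** End-to-end as PoseNet defined it (image → pose, by learning alone) failed. But since foundation models arrived in 2023, the meaning of end-to-end has been changing. Which SLAM modules to fill with learning and which to keep as algorithms: this dividing line itself is being renegotiated.

The principle "geometry through the algorithm, learning for features" hardened in this period. In 2018, its first substantial implementation came out of Andrew Davison's lab. The place was Kensington, Imperial College London. Its name was CodeSLAM.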

---

# Ch.13 — The Hybrid Victory: From CodeSLAM to DROID-SLAM

When Michael Bloesch presented CodeSLAM at CVPR 2018, his affiliation was the Dyson Robotics Lab at Imperial College London, working under Andrew Davison. In the same lab Richard Newcombe had built DTAM in 2011, Jan Czarnowski would release DeepFactors in 2020, and Edgar Sucar and Tristan Laidlow carried the lineage on. That is why CodeSLAM is more than a single paper. The creed Davison had been building since 2002, "SLAM is probabilistic inference," and the impulse deep learning brought in the mid-2010s, "representations can be learned," first met in substance in Bloesch's 2018 paper.

---

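## 13.1 CodeSLAM — Latent Codes and the Map

In traditional monocular SLAM, depth was something to be estimated. Whether as a few hundred sparse landmarks or, as in [DTAM](https://www.doc.ic.ac.uk/~ajd/Publications/newcombe_etal_iccv2011.pdf) (Newcombe et al. 2011), every pixel, depth was in the end an optimization variable. The dimension of that variable space was proportional to image resolution. A dense depth map for one keyframe at 640×480 means 307,200 independent variables. Optimization is heavy, initialization is sensitive, and a prior is hard to put in.

The idea of [Bloesch et al. 2018. CodeSLAM](https://doi.org/10.1109/CVPR.2018.00271) was simple. Instead of optimizing the depth map itself, optimize a low-dimensional latent vector (the **latent code**) that generates the depth map. Train a variational autoencoder (VAE) to learn the distribution of real depth, and its bottleneck latent space approximates the manifold on which "realistic depth maps" live. Optimization moves only on that manifold. The number of variables drops from hundreds of thousands to hundreds.

> 🔗 **Borrowing.** CodeSLAM's latent depth representation borrowed the encoder-decoder latent space structure established in [Kingma & Welling 2013. VAE](https://arxiv.org/abs/1312.6114). Training follows the VAE framework, but at SLAM inference time **z** is treated directly as a MAP optimization variable, with no stochastic sampling. A tool that generative modeling researchers devised for image synthesis reappeared five years later as the low-dimensional representation space for SLAM optimization.

The structure is as follows. For each keyframe, the VAE encoder extracts a latent code **z** from the image. The decoder reconstructs a dense depth map from **z**. Camera pose and **z** are optimized jointly. The photometric loss enforces consistency, and the latent prior regularizes **z** to stay near the prior distribution.

Written as a formula, the objective is: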

$$E(\mathbf{z}, T) = \sum_{i,j} \rho\bigl(I_j(\pi(T_{ij}, D_\mathbf{z}(u_i), u_i)) - I_i(u_i)\bigr) + \lambda \|\mathbf{z}\|^2$$

Here $D_\mathbf{z}$ is the decoder, $\pi$ the projection, $\rho$ a robust cost, and $T_{ij}$ the relative pose between keyframes. The latent prior term $\lambda\|\mathbf{z}\|^2$ corresponds to the negative log-likelihood of the standard normal prior $p(\mathbf{z}) = \mathcal{N}(0, I)$; it is the regularizer that appears naturally in MAP inference under a Gaussian prior.
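The link between the prior and the regularizer can be written out in one step. The derivation below is a standard identity for the standard normal, not a quotation from the CodeSLAM paper; $d$ denotes the dimension of $\mathbf{z}$ and $\mathcal{I}$ stands for the image observations, both symbols introduced here for the illustration.

$$-\log p(\mathbf{z}) = -\log \mathcal{N}(\mathbf{z}; 0, I) = \tfrac{1}{2}\|\mathbf{z}\|^2 + \tfrac{d}{2}\log 2\pi$$

$$(\hat{\mathbf{z}}, \hat{T}) = \arg\max_{\mathbf{z}, T}\, p(\mathcal{I} \mid \mathbf{z}, T)\, p(\mathbf{z}) = \arg\min_{\mathbf{z}, T} \bigl[-\log p(\mathcal{I} \mid \mathbf{z}, T) + \tfrac{1}{2}\|\mathbf{z}\|^2\bigr]$$

The constant $\tfrac{d}{2}\log 2\pi$ does not depend on $\mathbf{z}$ and drops out of the minimization. If the photometric term were an exact negative log-likelihood, $\lambda$ would be $\tfrac{1}{2}$; in practice $\lambda$ also absorbs the relative scaling between the robust photometric cost and the prior.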

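> 🔗 **Borrowing.** The factor graph (Dellaert & Kaess's [GTSAM](https://gtsam.org/tutorials/intro.html)) provided the backend skeleton of DeepFactors. The pose-latent coupling that CodeSLAM handled by joint optimization was reformulated by Czarnowski as an explicit factor graph. It was with DeepFactors that a latent variable produced by learning joined the graph as one more variable beside the traditional pose nodes. The interface between the two worlds was the graph's edge.

Its ability to fill in geometry from sparse input surpassed existing methods. CodeSLAM itself, however, was not real-time. VAE inference and the optimization loop were slow. The paper said so plainly.

> 📜 **Prophecy vs. Reality.** CodeSLAM showed that a compact learned representation could be brought into dense SLAM, but it left room on both speed and scale. DeepFactors (2020), from the same Imperial group, took a further step toward real time but did not reach deployment grade, and general performance across monocular, stereo, and RGB-D was ultimately achieved by another team (Teed and Deng, Princeton) with a different design: a learned frontend plus dense BA. `[ongoing + technology shift]`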

---

## 13.2 DeepFactors — Imperial Dyson Lab and Factor Graph Integration

In 2020, Jan Czarnowski, also under Davison at the Imperial Dyson Robotics Lab, published [Czarnowski et al. 2020. DeepFactors](https://doi.org/10.1109/LRA.2020.2969036). Czarnowski's goal was to bring CodeSLAM's idea into an actual SLAM pipeline.

DeepFactors kept CodeSLAM's latent depth representation, placed it in an explicit factor graph, separated tracking from mapping, and introduced keyframe selection criteria. On an NVIDIA GTX 1080, tracking against a keyframe ran at about 250 Hz, but computing the network Jacobian took hundreds of milliseconds per keyframe and was the bottleneck of the whole pipeline. It showed the direction but fell short of deployment-grade real time.

What DeepFactors proved more importantly was a principle. A learned representation can enter a factor graph as a node, and geometric optimization can operate on that latent space. The conclusion Czarnowski reached was simple: replacing part of the pipeline with a learnable module is the realistic path.

In the same period, Daniel Cremers's group at TU Munich arrived at the same principle. Only the starting point differed from Imperial's. Where the Davison lineage stacked a factor graph on top of CodeSLAM's VAE latent, the Cremers group kept direct sparse odometry ([DSO](https://arxiv.org/abs/1607.02565)), which they had released in 2016, as the skeleton and injected neural predictions into it. [Yang, Wang, Stückler, Cremers 2018. DVSO](https://arxiv.org/abs/1807.02570) injected neural depth into monocular DSO as "virtual stereo," hallucinating a second camera in a monocular setting, and [Yang, von Stumberg, Wang, Cremers 2020. D3VO](https://arxiv.org/abs/2003.01060) added three kinds of self-supervised neural prediction (depth, pose, and uncertainty) to DSO's factor graph as extra factors. [Wimbauer et al. 2021. MonoRec](https://arxiv.org/abs/2011.11814) and [Wimbauer et al. 2023. Behind the Scenes](https://arxiv.org/abs/2301.07668) carried the same lineage toward dense reconstruction of dynamic scenes and single-view density fields. The lineage of people is separate from the Imperial group, but the design principle converged: absorb neural prediction into a classical optimization structure.

In 2021 that principle appeared again, in yet another form, at Princeton.

---

## 13.3 RAFT — recurrent optical flow

Zachary Teed and Jia Deng (Princeton) presented [Recurrent All-Pairs Field Transforms (RAFT)](https://arxiv.org/abs/2003.12039) at ECCV 2020. RAFT was not a SLAM paper. It was an optical flow estimation paper.

Yet RAFT's design later becomes the core of DROID-SLAM. The structure has three parts.

1. Feature encoder: a CNN extracts feature maps from the two images
2. Correlation volume: similarity between all pixel pairs, arranged as a 4D volume, with a 4-level pyramid
3. Update operator: iterative refinement based on a Gated Recurrent Unit (GRU), which looks up the correlation volume and updates the flow field

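The "all-pairs" in the name sums up what sets this architecture apart. It considers every candidate location at once and refines the flow field incrementally at a fixed resolution. Unlike earlier coarse-to-fine methods (PWC-Net and others), it keeps a single flow field at one fixed high resolution and looks up the correlation pyramid around the current estimate. Relative to the best published results at the time, the paper reported a 16% error reduction on KITTI (F1-all 5.10%) and a 30% error reduction on the Sintel final pass (end-point error 2.855 px).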
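The correlation volume and its pyramid can be sketched on a toy 1-D "image". This is an illustration under simplifying assumptions, not RAFT's implementation: pixels are flattened into a list, pooling runs only over the target-pixel axis, and the function names are hypothetical.

```python
def all_pairs_correlation(f1, f2):
    """Dot product between every feature in f1 and every feature in f2.

    f1, f2: lists of feature vectors, one per (flattened) pixel.
    Returns corr[i][j] = <f1[i], f2[j]> / sqrt(D).
    """
    scale = len(f1[0]) ** 0.5
    return [[sum(a * b for a, b in zip(u, v)) / scale for v in f2] for u in f1]


def pool_target_axis(corr, k=2):
    """Average-pool the target-pixel axis by k to form the next pyramid level."""
    return [[sum(row[j:j + k]) / k for j in range(0, len(row) - k + 1, k)]
            for row in corr]


def correlation_pyramid(f1, f2, levels=4):
    """Level 0 is the full all-pairs volume; each later level halves the target axis."""
    pyramid = [all_pairs_correlation(f1, f2)]
    for _ in range(levels - 1):
        pyramid.append(pool_target_axis(pyramid[-1]))
    return pyramid


# Toy usage: 8 pixels with 2-D features.
features = [[1.0, 0.0], [0.0, 1.0]] * 4
pyr = correlation_pyramid(features, features, levels=4)
```

The source-pixel axis keeps its full length at every level, which is why the update operator can refine one fixed-resolution flow field while reading coarser and coarser context.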

RAFT did not come out of the SLAM lineage. But Teed noticed that the same update operator is structurally similar to the iterative bundle adjustment used in SLAM. How different, really, is a GRU that refines a flow field from an optimization step that refines pose and depth?

---

## 13.4 DROID-SLAM: update operator and BA

NeurIPS 2021: [Teed & Deng. DROID-SLAM](https://arxiv.org/abs/2108.10869). The DROID in the title stands for "Differentiable Recurrent Optimization-Inspired Design".

Walking through the architecture reveals the intent behind the hybrid design.

The frontend has the same structure as RAFT. A CNN encoder extracts feature maps, an all-pairs correlation volume is built, and a GRU update operator iteratively estimates dense flow. The difference is that flow is estimated simultaneously on every edge of the keyframe graph.

The backend is Dense Bundle Adjustment (DBA). The optimization variables are pose and inverse depth. The 2D correspondences supplied by the flow estimates serve as constraints for jointly optimizing pose and depth. The linear system is solved efficiently with the Schur complement trick.

The link between the two is the **DBA layer**. The flow and uncertainty estimated by the GRU are fed into the DBA. Once the DBA updates pose and depth, the result refreshes the reference for the next GRU iteration. The two modules are connected in a loop.
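The Schur complement trick mentioned above can be sketched on a tiny linear system. This is a generic illustration, not DROID-SLAM's code: it assumes the normal equations have the block form `[[B, E], [E^T, C]]` with a diagonal depth block `C`, and all names are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def schur_solve(B, E, c_diag, v, w):
    """Solve [[B, E], [E^T, C]] [dx, dz] = [v, w] with C = diag(c_diag).

    dx: pose update (small, dense block). dz: depth update (large, diagonal block).
    """
    p, q = len(B), len(c_diag)
    # Reduced system on poses only: S = B - E C^-1 E^T, rhs = v - E C^-1 w.
    S = [[B[i][j] - sum(E[i][k] * E[j][k] / c_diag[k] for k in range(q))
          for j in range(p)] for i in range(p)]
    rhs = [v[i] - sum(E[i][k] * w[k] / c_diag[k] for k in range(q))
           for i in range(p)]
    dx = solve(S, rhs)
    # Back-substitution: each depth is recovered independently.
    dz = [(w[k] - sum(E[i][k] * dx[i] for i in range(p))) / c_diag[k]
          for k in range(q)]
    return dx, dz


dx, dz = schur_solve(B=[[4.0, 1.0], [1.0, 3.0]],
                     E=[[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]],
                     c_diag=[5.0, 6.0, 7.0],
                     v=[1.0, 2.0], w=[3.0, 4.0, 5.0])
```

Because `C` is diagonal, inverting it costs one division per depth variable, so the only dense solve is over the much smaller pose block.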

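> 🔗 **Borrowing.** The idea of dense BA itself goes back a decade. Newcombe's DTAM (2011) pioneered photometric optimization over every pixel. DROID-SLAM combined that idea with a more robust input, learned flow. Newcombe and Teed have different affiliations, but the logical lineage connects.

> 🔗 **Borrowing.** DROID-SLAM's update operator was transplanted directly from RAFT, by the same authors (Teed and Deng). The key insight was that the all-pairs recurrent refinement designed for optical flow is structurally compatible with the iterative optimization of bundle adjustment. That the same people wrote both papers is what made this borrowing possible.

On the EuRoC MAV dataset, DROID-SLAM recorded a lower ATE RMSE than ORB-SLAM3, the state of the art at the time. The advantage held on both TartanAir (synthetic) and real indoor and outdoor sequences, and the system was notably more robust than feature-based methods under illumination change and low texture. The figures Teed later reported in his Handbook retrospective for EuRoC sequence V1_02 are striking: an ATE of 16.5cm with the frontend alone fell to 1.2cm after global optimization. It is a picture of classical BA converging to about a centimeter on constraints supplied by learned correspondence.

Looking back at why the pure end-to-end approaches of Chapter 12 failed shows where DROID-SLAM diverged. PoseNet regressed pose directly with no geometric constraint, and it failed to generalize. Teed and Deng split the roles: dense correspondence estimation was left to learning, and enforcing geometric constraints was the job of BA. The neural network played to its strengths in feature extraction and dense matching, and the geometry optimizer played to its strengths in consistency enforcement and uncertainty propagation. Hand-designed features were replaced with learned ones while the optimization structure stayed intact. That is where the hybrid of 2021 and the end-to-end of 2015 parted ways.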

---

## 13.5 Imperial Dyson Lab genealogy

The flow from CodeSLAM to DROID-SLAM comes into focus when you follow the people of the Imperial Dyson Robotics Lab.

After MonoSLAM in 2003, Andrew Davison led SLAM research at Imperial for some twenty years. His students and collaborators created the branch points one after another.

- **Richard Newcombe** (under Davison, Imperial): DTAM (2011), [KinectFusion](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/ismar2011.pdf) (2011). Later Oculus→Meta Reality Labs
- **Michael Bloesch** (under Davison, Imperial): CodeSLAM (2018), touch and inertial SLAM research
- **Jan Czarnowski** (under Davison, Imperial): DeepFactors (2020)
- **Edgar Sucar** (Davison group, Imperial): [iMAP](https://arxiv.org/abs/2103.12352) (2021), later connecting to the NeRF-SLAM lineage
- **Tristan Laidlow** (Davison group, Imperial): dense 3D reconstruction, later connecting to the neural implicit SLAM lineage

This lineage is one of the cases where the word "school" is no exaggeration. The factor graph + uncertainty philosophy carried on while changing shape, from monocular sparse to dense latent, and from dense latent to implicit representation. In [FutureMapping](https://arxiv.org/abs/1803.11288) (2018) and [FutureMapping 2](https://arxiv.org/abs/1910.14139) (2019, with Ortiz), Davison himself sketched the computational structure and representations that a Spatial AI system should have. The argument was to tie diverse geometric and semantic representations together on a single probabilistic graph. CodeSLAM and DeepFactors were the first experiments on that sketch.

Teed and Deng stand outside this lineage: Princeton, an independent path. But the logical predecessor of the dense BA that DROID-SLAM adopted is DTAM, and DTAM is the work of Newcombe and Davison. A lineage can skip the personal connections and still continue through logic.

---

## 13.6 2023-2025: extensions after DROID

In the years after DROID-SLAM, a body of work grew on top of it or alongside it.

[GO-SLAM](https://arxiv.org/abs/2309.02436) (Zhang et al. 2023) extended DROID-SLAM's tracking with online loop closing and full bundle adjustment, and handed mapping to an Instant-NGP-style neural implicit representation (multi-resolution hash encoding). Tracking is DROID-style dense flow + BA; the map is an implicit representation. It is a second layer of hybrid.

[NICER-SLAM](https://arxiv.org/abs/2302.03594) (Zhu et al. 2023) took a different road. It solved tracking and mapping together on a single hierarchical neural implicit representation. It shares the goal of RGB-only dense SLAM but follows a different path, attacking the same problem from the outskirts of the DROID lineage.

[SplaTAM](https://arxiv.org/abs/2312.02126) (Keetha et al. 2024) switched the map representation to 3D Gaussian Splatting and rebuilt tracking on silhouette-guided differentiable rendering. It is a merger with the 3DGS lineage, not a direct extension of the DROID lineage.

[DPV-SLAM](https://arxiv.org/abs/2408.01654) (Lipson, Teed, Deng 2024) came from the same Princeton group as DROID-SLAM. Building on [DPVO](https://github.com/princeton-vl/DPVO) (Deep Patch Visual Odometry), it added proximity-based loop closure and a CUDA block-sparse BA, producing a system about 2.5 times faster than DROID-SLAM with a smaller memory footprint. The core is a patch-based sparse representation plus efficient loop closure.

Outside the DROID lineage, extensions in 2024-2025 followed the path opened by Naver Labs' [DUSt3R](https://arxiv.org/abs/2312.14132) (Wang et al. 2023). After DUSt3R redefined the SfM procedure itself by outputting pointmaps directly from two images (covered in detail in Chapter 16), the same group around Revaud introduced a symmetric multi-view extension and working memory in [Cabon et al. 2025. MUSt3R](https://arxiv.org/abs/2503.01661), stretching the image-pair architecture to many frames. It is an attempt to handle offline SfM and online VO/SLAM with the same network. More interesting is that DROID-family tools are being reused inside this ecosystem. [Li et al. 2024. MegaSAM](https://arxiv.org/abs/2412.04463) pushed DROID-SLAM's differentiable dense BA toward dynamic scenes and uncalibrated video, jointly optimizing even the camera intrinsics during inference. NVIDIA's [Huang et al. 2025. ViPE](https://arxiv.org/abs/2508.10934) combined three kinds of constraint (DROID-SLAM's dense flow network, cuvslam's sparse points, and a monocular depth network) in a single dense BA and industrialized it as a YouTube-scale annotation pipeline for in-the-wild video. The learned frontend + classical backend design of DROID in 2021 is being replayed in 2025 under the harder conditions of calibration-free input and dynamic scenes.

The pattern is not a single path. Several roads ran in parallel: stacking a neural map on DROID tracking as GO-SLAM does, redesigning for lightness with patch odometry as DPV-SLAM does, rewriting tracking on implicit or splatting representations as NICER-SLAM and SplaTAM do, and pushing DROID's dense BA into uncalibrated and dynamic conditions as MegaSAM and ViPE do. The learned frontend + classical backend framework that Teed and Deng put forward in 2021 became the common starting point for these branches.

> 📜 **Prediction vs. reality.** DROID-SLAM set the reference point for hybrids that combine differentiable BA with end-to-end learning. DPV-SLAM, released by the same group three years later, carried that reference point toward efficiency. The GO-SLAM, NICER-SLAM, and SplaTAM family instead branched off by swapping the map representation for implicit fields or Gaussian splatting. DROID's "learned frontend + classical backend" design is being varied along several branches, and as of 2026 it is still undecided which branch will become the general-purpose solution. `[ongoing]`

---

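## 🧭 Still open

Out-of-distribution generalization of learned priors is the first problem. The VAEs in CodeSLAM and DeepFactors learn the depth distribution of their training data. In completely different environments (outdoor open-world scenes, non-uniform texture, night), the learned prior can pull the optimization in the wrong direction. DROID-SLAM's flow estimator also degrades outside its training domain. As of 2026 there is still no "learned SLAM that works in any environment". Training on diverse synthetic data such as TartanAir is one approach, but a sim-to-real gap remains.

The real-time constraint also persists. DROID-SLAM averages roughly 10-15 fps on an NVIDIA RTX 2080Ti, and it slows further as the keyframe graph grows. Dense BA is the bottleneck. For applications that need real-time (30Hz+), low-power deployment, such as mobile robots or AR/VR, it is still impractical as of 2026. Lightweight variants exist (fewer keyframes, approximate BA), but they come with a performance trade-off.

Learned integration of loop closure is unresolved as well. DROID-SLAM does not handle loop closure explicitly. Teed himself plainly conceded in his later Handbook retrospective that "DROID-SLAM doesn't include any relocalization module, so large loops with lots of drift cannot be closed". The keyframe graph is maintained in a sliding-window fashion, and global consistency is limited. There have been some attempts to integrate learned loop closure (the place recognition work of Chapter 12) into DROID's factor graph, but they have not yet converged into a single system. The point where the NetVLAD lineage of Chapter 12 meets the DROID lineage of Chapter 13 is still open.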

---

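At this point one question remains. Must the map representation stay with points, lines, and planes? DROID-SLAM's inverse depth map was the best dense representation available in 2021. But in 2020, [NeRF](https://arxiv.org/abs/2003.08934) (Neural Radiance Field) had already suggested an entirely different possibility. What if a scene is represented as a continuous function instead of points or a mesh? If rendering is differentiable, photometric consistency can be enforced in a new way.

That is where Part 4 (the learning fusion period) closes with hybrid optimization, and also where Part 5 (the representation revolution) begins. Chapter 14 follows the moment NeRF collides with SLAM.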

---

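# Ch.14: The NeRF Shock and Its Graft onto SLAM, iMAP→NICE-SLAM

In Ch.13, DROID-SLAM showed that a learned representation could enter the core loop of SLAM (tracking) directly. A CNN and a recurrent network produced features, and pose was optimized on top of them. But what if a learned representation were used not for tracking but for *the map itself*? Edgar Sucar at Imperial College answered that question in 2021 with iMAP, and he took the material for the answer from outside SLAM. It was NeRF.

In March 2020, [Mildenhall et al. 2020. NeRF](https://arxiv.org/abs/2003.08934), posted to arXiv by Ben Mildenhall and colleagues, synthesized photographs from new viewpoints out of a few dozen to about a hundred posed images of a scene. Those photographs carried the grain of light and shadow. The SLAM community at first saw this as a rendering problem, a way to *display* a map. It took about a year for that view to shift. When Sucar posted iMAP to arXiv in March 2021 and presented it at ICCV later that year, it became clear that NeRF could serve not as a rendering tool but as the map representation itself. iMAP continued the lineage of KinectFusion (Ch.9). It was the first implementation of the hypothesis that an implicit neural field could replace the TSDF voxel grid.

NeRF did not come out of thin air. Over the course of 2019, three strands of representing 3D with a coordinate-based MLP broke out almost simultaneously. [Park et al.'s DeepSDF](https://arxiv.org/abs/1901.05103) described object surfaces implicitly with an MLP that takes a coordinate and returns a signed distance, [Mescheder et al.'s Occupancy Networks](https://arxiv.org/abs/1812.03828) returned an occupancy probability from the same coordinate input, and [Sitzmann et al.'s SRN](https://arxiv.org/abs/1906.01618) mapped each coordinate to a scene feature vector and synthesized images by differentiable ray marching. The three shared one mathematical frame: put in a coordinate, get out a field value. Mildenhall et al. 2020 NeRF added the volume rendering integral and positional encoding to this frame and closed the loop through to view synthesis. What iMAP inherited was that entire one-year lineage.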

---

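## NeRF: MLP-based spatial representation

The core idea of NeRF is that a single MLP implicitly memorizes an entire 3D space. The input is a spatial coordinate $(x, y, z)$ and a viewing direction $(\theta, \phi)$. The output is the color $(r, g, b)$ and density $\sigma$ at that location. How does this alone represent a whole scene?

Rendering is done with the volume rendering equation. A ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$ leaving the camera origin $\mathbf{o}$ in direction $\mathbf{d}$ is sampled along the parameter $t$: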

$$\hat{C}(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma\!\left(\mathbf{r}(t)\right) \mathbf{c}\!\left(\mathbf{r}(t), \mathbf{d}\right)\, dt$$

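Here $T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\, ds\right)$ is the accumulated transmittance, the probability that the ray reaches $t$ without being blocked. In practice the integral is approximated by a piecewise sum over samples along the ray.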
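A minimal sketch of that discrete approximation, assuming piecewise-constant density between samples so that each sample contributes with weight $T_i\,(1 - e^{-\sigma_i \delta_i})$. The function name and argument layout are hypothetical, chosen for illustration.

```python
import math


def render_ray(sigmas, colors, deltas):
    """Discretised volume rendering along one ray.

    sigmas[i]: density at sample i
    colors[i]: (r, g, b) at sample i
    deltas[i]: distance from sample i to the next sample
    Returns the pixel colour and the per-sample weights.
    """
    transmittance = 1.0  # T_i: fraction of light still unblocked
    pixel = [0.0, 0.0, 0.0]
    weights = []
    for sigma, color, delta in zip(sigmas, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        w = transmittance * alpha
        weights.append(w)
        for ch in range(3):
            pixel[ch] += w * color[ch]
        transmittance *= 1.0 - alpha
    return pixel, weights


# Empty space, then an effectively opaque green surface, then something behind it.
pixel, weights = render_ray(
    sigmas=[0.0, 1e9, 5.0],
    colors=[(1, 0, 0), (0, 1, 0), (0, 0, 1)],
    deltas=[0.1, 0.1, 0.1],
)
```

Every operation here is differentiable in `sigmas` and `colors`, which is what lets the rendered pixel act as a training signal for the MLP.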

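> 🔗 **Borrowing.** The volume rendering equation comes from the classic graphics paper by [Kajiya & Von Herzen (1984)](https://courses.cs.duke.edu/cps296.8/spring03/papers/RayTracingVolumeDensities.pdf). What had been a physically based tool for offline rendering for nearly 40 years, Mildenhall turned into the loss function of an inverse optimization.

The Mildenhall et al. (2020) NeRF paper solved the problem that MLPs fail to learn high-frequency spatial signals with positional encoding. Projecting the coordinates $(x, y, z)$ onto sine and cosine functions across several frequencies lets the network learn fine texture and sharp boundaries: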

$$\gamma(p) = \left(\sin(2^0 \pi p),\, \cos(2^0 \pi p),\, \ldots,\, \sin(2^{L-1} \pi p),\, \cos(2^{L-1} \pi p)\right)$$
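The encoding $\gamma(p)$ above translates directly into a few lines of code. This is a sketch for a single scalar coordinate; applying it to each of $x$, $y$, $z$ and concatenating the results is assumed, and the function name is hypothetical.

```python
import math


def positional_encoding(p, num_freqs):
    """gamma(p): sin and cos of p at frequencies 2^0 * pi ... 2^(L-1) * pi."""
    out = []
    for k in range(num_freqs):
        angle = (2.0 ** k) * math.pi * p
        out.extend([math.sin(angle), math.cos(angle)])
    return out


# One coordinate becomes 2 * L numbers; with L = 10 that is 20 inputs per axis.
encoded = positional_encoding(0.3, num_freqs=10)
```

Two nearby coordinates that are almost identical as raw inputs differ strongly in the high-frequency entries, which gives the MLP something to separate them by.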

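> 🔗 **Borrowing.** NeRF's positional encoding is part of the original Mildenhall et al. (2020) paper. The same year, [Tancik et al. (2020)](https://arxiv.org/abs/2006.10739), "Fourier Features Let Networks Learn High Frequency Functions", explained why the technique works using NTK (neural tangent kernel) theory.

NeRF training runs in the inverse direction. Images taken from known camera poses are compared with the rendered results, and a per-pixel L2 loss is minimized. When the optimization finishes, the MLP weights themselves store the geometry and appearance of the scene. No voxels, no mesh, no point cloud. The space lives inside the network parameters.

The original NeRF had clear weaknesses, though. Training took hours, the model was specific to one scene, and camera poses had to be obtained in advance from an external SfM tool such as COLMAP. Transplanting this into SLAM requires estimating pose and learning the map simultaneously, and close to real time.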

---

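## iMAP: the first neural implicit SLAM

[Sucar et al. 2021. iMAP](https://doi.org/10.1109/ICCV48922.2021.00612), presented at ICCV 2021 by Edgar Sucar of the Imperial College Dyson Robotics Lab, was that attempt. **iMAP** (Implicit MAP) takes input from an RGB-D camera and uses a single MLP as the map while optimizing pose at the same time.

The structure is two alternating optimization loops. The *mapping* loop samples rays from the current keyframe and from randomly sampled past keyframes and updates the MLP. The *tracking* loop freezes the MLP and optimizes the pose of the current frame against the rendering loss. Both loops operate on the one shared MLP.

There are two loss terms: a color loss $\mathcal{L}_{\text{color}} = \|\hat{C} - C\|_2^2$ and a depth loss $\mathcal{L}_{\text{depth}} = \|\hat{D} - D\|_2^2$. Because the input is RGB-D, depth supervision is available and geometry learning was stable.

iMAP was a proof of concept. It worked in small indoor scenes but had two structural problems. First, the single MLP forgot earlier regions as new regions were added. This is the catastrophic forgetting problem of neural networks. Sucar partly mitigated it with keyframe replay, but that was not a fundamental fix. Second, the representational power of a single MLP ran short as the scene grew, because the MLP's forward pass treats the whole space as one function no matter how many parameters it has.

> 📜 **Prediction vs. reality.** In the Conclusion of the iMAP paper, Sucar wrote that "future directions for iMAP include how to make more structured and compositional representations that reason explicitly about the self similarity in scenes". The direction of structured, compositional representation did become the main trunk of follow-up work. Nine months after the iMAP preprint, the NICE-SLAM preprint from ETH Zürich split space hierarchically with a multi-resolution voxel feature grid, and Wang et al.'s Co-SLAM (2023) combined a hash grid with coordinate encoding and pushed to near real time, 10-17Hz on an RTX 3090Ti. The part about "reasoning explicitly about self-similarity", however, did not develop much in the NeRF-SLAM mainstream, and the lineage that refines a single MLP was also pushed out of the center. `[hit + partial]`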

---

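## NICE-SLAM: hierarchical grids and scalability

The direct answer to iMAP's single-MLP problem came in [Zhu et al. 2022. NICE-SLAM](https://arxiv.org/abs/2112.12130), presented at CVPR 2022 by Zihan Zhu and Songyou Peng of ETH Zürich. **NICE-SLAM** (Neural Implicit Scalable Encoding for SLAM) combined a multi-resolution voxel feature grid with small MLP decoders in place of a single MLP.

The idea is to divide space into an explicit voxel grid and place a learnable feature vector at each voxel. At render time, the features of the voxels around a sample coordinate are combined by trilinear interpolation and passed through a small MLP to obtain color and occupancy. The MLP does not need to be large, because most of the spatial information is stored in the grid.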
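The trilinear lookup described above can be sketched as follows. It assumes a dense nested-list grid indexed in voxel units and hypothetical names; a real system would store the grid as a GPU tensor and feed the interpolated feature to the decoder MLP.

```python
def trilinear(grid, x, y, z):
    """Interpolate a feature vector at continuous position (x, y, z).

    grid[i][j][k] is the feature vector stored at integer corner (i, j, k).
    Coordinates are in voxel units and must lie inside the grid.
    """
    # Lower corner of the enclosing cell, clamped so the upper boundary is valid.
    i0 = min(int(x), len(grid) - 2)
    j0 = min(int(y), len(grid[0]) - 2)
    k0 = min(int(z), len(grid[0][0]) - 2)
    tx, ty, tz = x - i0, y - j0, z - k0
    dim = len(grid[0][0][0])
    out = [0.0] * dim
    for di, wx in ((0, 1.0 - tx), (1, tx)):
        for dj, wy in ((0, 1.0 - ty), (1, ty)):
            for dk, wz in ((0, 1.0 - tz), (1, tz)):
                w = wx * wy * wz  # weight of this corner
                corner = grid[i0 + di][j0 + dj][k0 + dk]
                for c in range(dim):
                    out[c] += w * corner[c]
    return out


# A 2x2x2 grid of 1-D features; the corner (i, j, k) stores i + 2j + 4k.
grid = [[[[float(i + 2 * j + 4 * k)] for k in range(2)]
         for j in range(2)] for i in range(2)]
center_feature = trilinear(grid, 0.5, 0.5, 0.5)
```

Only the eight corners of one cell are touched per query, so a gradient step for a new observation changes a handful of local features and leaves the rest of the map alone.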

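NICE-SLAM stacked grids at three resolution levels hierarchically. The coarse grid holds the overall shape of the geometry, the middle grid holds structural detail, and the fine grid holds texture. When a new region is added, only the features of the affected voxels need updating, so catastrophic forgetting in other regions is greatly reduced.

For tracking, NICE-SLAM, like iMAP, froze the MLP and the grid features and optimized pose. For mapping, it updated the grid features. On the Replica and ScanNet datasets it handled larger spaces than iMAP and reproduced detail at higher quality.

But there were limits. The memory of the grid itself grew with the cube of the resolution. One or two indoor rooms were manageable, but extending to multi-story buildings or the outdoors remained unsolved. Speed was also far from real time.

[Müller et al. 2022. Instant-NGP](https://nvlabs.github.io/instant-ngp/), by Thomas Müller and colleagues at SIGGRAPH 2022, attacked this bottleneck from a different angle. Its hash-table-based feature encoding resolved the memory explosion of voxel grids and cut NeRF training time from hours to seconds or minutes. Instant-NGP was not a SLAM paper, but almost all later NeRF-SLAM work adopted hash encoding.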
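The core of hash encoding is mapping an integer grid corner to a slot in a fixed-size table. The sketch below uses the XOR-of-primes spatial hash described in the Instant-NGP paper; the 32-bit mask, the function names, and the table size in the example are assumptions for illustration.

```python
# Per-axis multipliers of the spatial hash (the first is 1, the others large primes).
PRIMES = (1, 2654435761, 805459861)


def hash_index(ix, iy, iz, table_size):
    """Map a non-negative integer grid corner to a slot in the feature table."""
    h = (ix * PRIMES[0]) ^ (iy * PRIMES[1]) ^ (iz * PRIMES[2])
    return (h & 0xFFFFFFFF) % table_size  # emulate 32-bit arithmetic


def dense_grid_entries(resolution):
    """Number of corners a dense grid would need at this resolution."""
    return (resolution + 1) ** 3


table_size = 2 ** 19
slot = hash_index(17, 4, 250, table_size)
dense = dense_grid_entries(512)  # far more corners than table slots
```

The table size is fixed regardless of resolution, so memory no longer grows cubically. Distinct corners may collide in one slot, and the design relies on training to sort out which corner's gradients dominate.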

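> 🔗 **Borrowing.** NICE-SLAM's multi-resolution feature grid overlapped in time with Instant-NGP's hash encoding and was designed independently, but in actual NeRF-SLAM implementations Instant-NGP's hash grid quickly replaced the NICE-SLAM grid. It is also a lineage in which the logical successor of KinectFusion (Ch.9), which stored a TSDF in a grid, went on to store features in a grid.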

---

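## Co-SLAM and NeRF-SLAM: two directions of integration

After iMAP and NICE-SLAM, from late 2022 onward, systems split into several branches. One direction made the implicit representation more efficient; the other combined the robust backend of traditional SLAM with a NeRF map.

UCL's [Wang et al. (2023) **Co-SLAM**](https://arxiv.org/abs/2304.14377) belongs to the former. It used a joint coordinate and parametric encoding, combining a multi-resolution hash grid with one-blob encoding. The two representations were designed to complement each other, aiming at fast convergence and surface completeness together: the hash grid quickly fills in the densely observed regions, and the coordinate encoding supplies a smooth prior in unobserved regions. The system ran at 10-17Hz on an RTX 3090Ti. It was the point at which NeRF-based SLAM first touched near real time.

The same year, at the same CVPR, [**ESLAM** by Johari et al.](https://arxiv.org/abs/2211.11704) of Idiap/EPFL solved a similar problem from a different angle. It used multi-scale axis-aligned feature planes instead of a 3D feature grid, lowering memory growth from $O(n^3)$ to $O(n^2)$, and it decoded a TSDF instead of volume density to speed up convergence.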
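The difference between cubic and quadratic growth is easy to check with arithmetic. The sketch below counts stored floats for one dense 3D grid versus three axis-aligned planes at the same side resolution; the resolution and feature dimension are illustrative values, and it ignores multi-scale levels and other details of the actual systems.

```python
def grid_floats(n, feat_dim):
    """Floats stored by a dense n x n x n feature grid."""
    return n ** 3 * feat_dim


def triplane_floats(n, feat_dim):
    """Floats stored by three axis-aligned n x n feature planes (xy, xz, yz)."""
    return 3 * n ** 2 * feat_dim


n, feat_dim = 256, 16
grid = grid_floats(n, feat_dim)        # 268,435,456 floats
planes = triplane_floats(n, feat_dim)  # 3,145,728 floats
ratio = grid / planes                  # n / 3, about 85x at n = 256
```

Doubling `n` multiplies the grid by 8 but the planes by only 4, so the gap widens as scenes get larger or resolution gets finer.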

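[**NeRF-SLAM**](https://arxiv.org/abs/2210.13641), which Antoni Rosinol (MIT) published in 2023, took a different approach. It kept the tracking and backend (factor graph optimization) of a conventional SLAM pipeline and replaced only the map representation with NeRF. Both the poses and the dense depth came from the DROID-SLAM frontend. Rosinol took these poses and depths, together with their uncertainty, as input and built an Instant-NGP-based map in parallel.

> 🔗 **Borrowing.** NeRF-SLAM's backend runs on Dellaert's factor graph optimization (Ch.6). Even under the hypothesis that "NeRF can change the map", the core mathematics of pose estimation stayed on the graph structure established since 2005.

Rosinol chose modularity. Instead of forcing NeRF into the entire pipeline, he swapped it in only at the map representation layer. As a result, traditional SLAM functions such as loop closure stayed in place.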

---

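## iMAP's structural limits and what they mean

In hindsight, iMAP's significance was conceptual. It was the first system to show that a single MLP can remember an entire scene, and that the MLP can be updated close to real time while pose is optimized too.

The fundamental problem of a single MLP is the absence of locality. Rendering any part of space passes through the entire MLP. Two consequences follow. First, learning a new region changes all of the weights and damages the representation of existing regions (catastrophic forgetting). Second, as the scene grows, the spatial variety a single MLP must hold increases, which calls for a larger network and more iterations. Representational capacity is tied linearly to parameter count, while scene complexity grows with spatial volume. A representation without locality is at a growing disadvantage as scale increases.

The grid of NICE-SLAM, the hash encoding of Instant-NGP, and the dual encoding of Co-SLAM were all answers to this locality problem. If space is divided locally so that each part remembers only its own region, adding new information intrudes less on existing memory, and the cost of rendering a given region is decoupled from the size of the whole scene.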

---

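## 🧭 Still open

**Real-time NeRF-SLAM.** As of 2023, iMAP and NICE-SLAM were far from real time, and although Co-SLAM reached near real time at 10-17Hz on an RTX 3090Ti, it still fell short of real time on mobile and embedded robot platforms. Gaussian Splatting (Ch.15) solved the speed problem another way by returning to an explicit representation, but classical real-time SLAM on an implicit neural field itself (30fps or more, without a consumer GPU) remains unfinished. Even though Instant-NGP raised rendering speed dramatically, the overall throughput of the simultaneous tracking and mapping loop is still constrained.

**Large-scale outdoor environments.** There have been attempts to partition space into multiple local NeRFs, such as [Block-NeRF](https://arxiv.org/abs/2202.05263) (2022, Tancik et al.), but they did not mesh cleanly with SLAM's requirements for loop closure and global consistency. City-scale NeRF-SLAM is an open problem.

**Semantic, editable implicit maps.** A NeRF map is optimized for rendering, which makes inserting semantic labels or editing after the fact difficult. Operations such as "remove this object from the map" or "reclassify this region for another use" are far more awkward than with a TSDF or a point cloud. Research on language-guided NeRF editing ([LERF](https://arxiv.org/abs/2303.09553), the [Nerfstudio](https://arxiv.org/abs/2302.04264) ecosystem) is under way, but real-time integration with a SLAM pipeline is still at the research stage as of 2026.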

---

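While iMAP and NICE-SLAM were pushing implicit fields to their limit, part of the research community was looking the other way. Instead of locking the map implicitly inside MLP weights or a feature grid, one could scatter it as millions of small ellipsoids placed explicitly in space, and rendering might be fast and editing intuitive. Until Bernhard Kerbl's paper appeared at SIGGRAPH 2023, that was still a hypothesis.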

---

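# Ch.15: The Gaussian Splatting Era, from 3DGS to GS-SLAM

In Ch.14, iMAP and NICE-SLAM showed what remembering space with an MLP could do. But there was a price. The MLP was opaque. There was no telling which neuron was responsible for which part of space, and every new observation meant touching the whole network. NICE-SLAM's processing rate of under 1fps on an RTX 3090 was hard to reconcile with the phrase "real-time SLAM". The scene was locked inside the network parameters, and there was no way to look in.

The [paper](https://arxiv.org/abs/2308.04079) presented at SIGGRAPH in August 2023 by Bernhard Kerbl (INRIA), Georgios Kopanas, Thomas Leimkuhler, and George Drettakis moved the computer vision community quickly. Kerbl kept the differentiable-rendering paradigm that NeRF had built up over three years but made a different choice of representation. While iMAP, NICE-SLAM, and Co-SLAM locked scenes into MLPs and voxel grids, Kerbl scattered millions of small ellipsoids, Gaussian primitives, through space. There was a reason the SLAM community swung to this representation within six months. Kerbl's choice was rooted in a 20-year-old graphics technique, Matthias Zwicker's EWA splatting (2001), and carried over the spirit of NeRF's differentiable rendering intact. The difference lay in the form of the representation.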

---

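## The structure of 3DGS

Kerbl represented the scene as an explicit set of Gaussians. Each Gaussian has a position (mean) $\boldsymbol{\mu} \in \mathbb{R}^3$, a covariance matrix $\boldsymbol{\Sigma} \in \mathbb{R}^{3 \times 3}$, an opacity $\sigma \in (0,1]$, and a color expressed as spherical harmonics coefficients. For training stability the covariance is factored into a scale vector $\mathbf{s}$ and a unit quaternion $\mathbf{q}$, with $\mathbf{S} = \operatorname{diag}(\mathbf{s})$ and $\mathbf{R}$ the rotation matrix of $\mathbf{q}$: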

$$\boldsymbol{\Sigma} = \mathbf{R}\mathbf{S}\mathbf{S}^\top\mathbf{R}^\top$$
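A minimal sketch of that factorization, assuming a quaternion in $(w, x, y, z)$ order and hypothetical function names. Building $\boldsymbol{\Sigma}$ as $\mathbf{M}\mathbf{M}^\top$ with $\mathbf{M} = \mathbf{R}\mathbf{S}$ guarantees a symmetric positive semi-definite matrix whatever values the optimizer assigns to $\mathbf{s}$ and $\mathbf{q}$.

```python
def quat_to_rot(q):
    """Rotation matrix of a quaternion (w, x, y, z), normalised first."""
    w, x, y, z = q
    n = (w * w + x * x + y * y + z * z) ** 0.5
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def covariance(scale, quat):
    """Sigma = R S S^T R^T, computed as M M^T with M = R S."""
    R = quat_to_rot(quat)
    M = [[R[i][j] * scale[j] for j in range(3)] for i in range(3)]
    return [[sum(M[i][k] * M[j][k] for k in range(3)) for j in range(3)]
            for i in range(3)]


# Identity rotation: the covariance is diag(s^2).
axis_aligned = covariance((1.0, 2.0, 3.0), (1.0, 0.0, 0.0, 0.0))
# A 90-degree rotation about z swaps the x and y variances.
c = 0.5 ** 0.5
rotated = covariance((1.0, 2.0, 3.0), (c, 0.0, 0.0, c))
```

Optimizing the entries of $\boldsymbol{\Sigma}$ directly could leave the set of valid covariances after a gradient step, which is the stability concern the factorization avoids.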

Rendering alpha-blends the projected 2D Gaussians in depth order. The effective opacity $\alpha_i$ of each Gaussian is the product of its learnable opacity $\sigma_i$ and the 2D Gaussian density $G_i(\mathbf{x})$ evaluated at the pixel location. The pixel color $C$ is

$$C = \sum_{i \in N} c_i \alpha_i \prod_{j<i}(1 - \alpha_j), \quad \alpha_i = \sigma_i \cdot G_i(\mathbf{x})$$
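The blending sum can be sketched for a single pixel as follows. The tuple layout and function name are hypothetical, and the projection step that would produce each $G_i(\mathbf{x})$ is assumed to have run already.

```python
def composite(gaussians):
    """Front-to-back alpha blending of the Gaussians covering one pixel.

    gaussians: list of (depth, color, opacity, density), where density is
    G_i(x), the projected 2D Gaussian evaluated at this pixel.
    Returns the pixel colour and the remaining transmittance.
    """
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # running product of (1 - alpha_j) for nearer Gaussians
    for _, color, opacity, density in sorted(gaussians, key=lambda g: g[0]):
        alpha = opacity * density
        for ch in range(3):
            pixel[ch] += color[ch] * alpha * transmittance
        transmittance *= 1.0 - alpha
    return pixel, transmittance


# A half-transparent green Gaussian in front of an opaque red one.
pixel, remaining = composite([
    (2.0, (1.0, 0.0, 0.0), 1.0, 1.0),  # far, red
    (1.0, (0.0, 1.0, 0.0), 0.5, 1.0),  # near, green
])
```

The structure matches the discretised NeRF sum, except that the per-sample opacities come from a sorted list of primitives instead of densities queried along a ray.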

Unlike NeRF, which approximates the volume rendering integral numerically, 3DGS sits directly on the GPU rasterization pipeline. The tile-based rasterizer implements both the forward and backward pass as custom CUDA kernels. It runs at 30fps or more on a single RTX 3090. Compared with NICE-SLAM at under 1fps on the same card, that is rendering tens of times faster.

Initialization uses the sparse point cloud from SfM. During training, a **densification** procedure then repeatedly splits, clones, and prunes Gaussians. When the view-space positional gradient exceeds a threshold, a Gaussian with a large scale is split into two children and a Gaussian with a small scale is cloned at the same position. Gaussians with low opacity are removed periodically.
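The split, clone, and prune rule can be written as a small decision function. This is a sketch of the logic described above, not the reference implementation; the function name and the default thresholds are placeholders for illustration.

```python
def densify_action(view_grad, max_scale, opacity,
                   grad_thresh=0.0002, scale_thresh=0.01, min_opacity=0.005):
    """Decide what to do with one Gaussian during densification.

    view_grad: magnitude of the view-space positional gradient
    max_scale: largest of the Gaussian's three scale values
    opacity:   learnable opacity of the Gaussian
    """
    if opacity < min_opacity:
        return "prune"   # nearly transparent: remove
    if view_grad <= grad_thresh:
        return "keep"    # well reconstructed already
    if max_scale > scale_thresh:
        return "split"   # large Gaussian covering too much: two smaller children
    return "clone"       # small Gaussian in an under-covered area: duplicate it


actions = [
    densify_action(view_grad=0.001, max_scale=0.05, opacity=0.8),   # split
    densify_action(view_grad=0.001, max_scale=0.002, opacity=0.8),  # clone
    densify_action(view_grad=0.00001, max_scale=0.05, opacity=0.8), # keep
    densify_action(view_grad=0.001, max_scale=0.05, opacity=0.001), # prune
]
```

Because every decision is local to one primitive, the same rule can run each time new views arrive, which is part of why the procedure fits incremental mapping.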

> 🔗 **Borrowing.** The rasterization-based splatting of 3DGS directly inherits [EWA splatting (2001)](https://www.cs.umd.edu/~zwicker/publications/EWAVolumeSplatting-VIS01.pdf) by Zwicker et al. To render point clouds, Zwicker placed an elliptical weighted average kernel over each point. Kerbl replaced that kernel with a learnable Gaussian and accelerated it with a GPU tile rasterizer.

---

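## The structural fit between 3DGS and SLAM

Implicit representations were a poor fit for SLAM. An MLP-based NeRF had to retrain the whole network with every new observation, and catastrophic forgetting made incremental updates difficult. Extending the map meant resizing the network. NICE-SLAM's voxel grid eased the problem but could not escape the trade-off between resolution and memory.

3DGS solved this structurally. Gaussians are objects that exist explicitly in space, so when a new keyframe arrives it is enough to add Gaussians in the corresponding region. The densification procedure meshed naturally with keyframe insertion, and rendering quality stayed at NeRF level. Real-time processing was possible too. This calculation is why GS-SLAM papers poured out in late 2023.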

---

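## GS-SLAM: the first attempt

Chi Yan (University of Hong Kong) and co-authors posted [Yan et al. 2023. GS-SLAM](https://arxiv.org/abs/2311.11700) to arXiv in November 2023. It was the first system to integrate 3DGS into a SLAM pipeline.

GS-SLAM follows the traditional SLAM framework. Tracking estimates the pose of the current frame, and Mapping updates the Gaussian map. Yan made two contributions. The first is adaptive Gaussian expansion, a mechanism that inserts Gaussians into low-coverage regions when a new keyframe is added. The second is geometry-aware Gaussian selection, which optimizes only the Gaussians that contribute most when backpropagating the rendering loss, buying speed.

Tracking optimizes pose against a rendering photometric loss. GS-SLAM's tracking loss is an L1 color loss over sampled pixels: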

$$\mathcal{L}_{track} = \sum_m \|\mathbf{C}_m - \hat{\mathbf{C}}_m\|_1$$

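In the mapping stage, Yan uses a weighted sum of color L1 and depth L1. Meanwhile the training loss of the original 3DGS paper, $(1-\lambda)\mathcal{L}_1 + \lambda\mathcal{L}_{D\text{-}SSIM}$ with $\lambda=0.2$, is the default form inherited when learning a Gaussian map. What makes this loss differentiable with respect to pose is the differentiable rasterizer of 3DGS.

On the Replica dataset it raised processing speed while maintaining PSNR relative to NICE-SLAM. The limits were also clear. It assumed an RGB-D camera and was not validated in large-scale outdoor environments.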

---

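## SplaTAM: silhouette-based densification

[Keetha et al. 2024. SplaTAM (CVPR)](https://arxiv.org/abs/2312.02126), by Nikhil Keetha (Carnegie Mellon University), differed from GS-SLAM in design philosophy. Instead of an elaborate selection mechanism, Keetha opted for simple densification driven by a silhouette mask.

Keetha's key idea is the **silhouette mask**. New Gaussians are added in the regions of the current view that the existing Gaussians do not explain, that is, where the rendered mask is empty. The method looks at where Gaussians are missing and fills those places in. It was a simple rule.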
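The silhouette rule can be sketched in a few lines. This covers only the silhouette test described above and leaves out any depth-error criterion a full system might add; the 0.5 threshold and the function names are assumptions for illustration.

```python
def silhouette_value(alphas):
    """Accumulated opacity at one pixel: 1 - prod(1 - alpha_i)."""
    transmittance = 1.0
    for alpha in alphas:
        transmittance *= 1.0 - alpha
    return 1.0 - transmittance


def pixels_needing_gaussians(silhouette, threshold=0.5):
    """Return (row, col) of pixels the current map does not yet explain."""
    return [(r, c)
            for r, row in enumerate(silhouette)
            for c, value in enumerate(row)
            if value < threshold]


# A 2x3 rendered silhouette: the right-hand column is newly revealed space.
silhouette = [
    [0.98, 0.91, 0.10],
    [0.95, 0.60, 0.00],
]
to_fill = pixels_needing_gaussians(silhouette)  # [(0, 2), (1, 2)]
```

Each returned pixel would then be back-projected with its measured depth to place a new Gaussian, so the map grows exactly where the camera has seen something the map has not.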

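Tracking optimizes pose, and Mapping optimizes the Gaussian parameters. Keeping the two stages strictly separate is the basis of its stability. Keetha's structure avoids the interference that arises when GS-SLAM interleaves tracking and mapping and optimizes them in alternation.

> 🔗 **Borrowing.** SplaTAM's keyframe-based map management is a case of PTAM's idea (Klein & Murray, 2007) running again on a new representation. The way PTAM maintained its map by inserting keyframes selectively was converted in SplaTAM into the trigger for Gaussian densification.

On the Replica dataset, SplaTAM recorded a PSNR of 34.11 dB. In the table of the same paper, NICE-SLAM stood at 24.42 dB. The gap in rendering quality was clear.

In the Limitations & Future Work section of the CVPR 2024 paper, Keetha named sensitivity to motion blur, depth noise, and aggressive rotation as limitations, and removing the dependence on known intrinsics and dense depth as the next task. Improving scalability was mentioned as well.

> 📜 **Prediction vs. reality.** Of the tasks SplaTAM (2024) set out, removing the depth dependence was answered the same year by Matsuki's MonoGS (CVPR 2024) in a monocular RGB setting. The intrinsics-free and large-scale issues were still in progress through 2024-2025. `[partial hit + ongoing]`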

---

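## MonoGS: monocular RGB

[Matsuki et al. 2024. MonoGS (CVPR)](https://arxiv.org/abs/2312.06741), by Hidenobu Matsuki (Imperial College Dyson Robotics Lab), removed one constraint. It drives 3DGS SLAM with a monocular RGB camera alone, without a depth sensor.

The central difficulty of the monocular setting is scale. Recovering metric scale without depth is a problem that SfM has not solved either. Matsuki's answer was to optimize the Gaussian geometry directly, adding a geometric consistency loss between the rendered depth and neighboring Gaussians along with a regularizer on Gaussian shape: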

$$\mathcal{L}_{iso} = \sum_k \| \mathbf{s}_k - \bar{s}_k \mathbf{1} \|_1$$

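Here $\mathbf{s}_k \in \mathbb{R}^3$ is the scale vector of the $k$-th Gaussian and $\bar{s}_k = \frac{1}{3}\sum_j s_{k,j}$ is the mean of its three axis scales. This isotropy regularization keeps Gaussians from degenerating into excessively thin, elongated shapes. In the monocular case, with no depth supervision, it suppresses Gaussians from stretching along the viewing direction and producing artifacts.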
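The regularizer $\mathcal{L}_{iso}$ is a direct sum and can be sketched in plain Python. The function name is hypothetical, and in a real system the same expression would be evaluated on a tensor of all scale vectors so that gradients flow back to them.

```python
def isotropic_loss(scales):
    """Sum over Gaussians of the L1 distance between s_k and its own mean.

    scales: list of per-Gaussian scale vectors (three floats each).
    """
    total = 0.0
    for s in scales:
        mean = sum(s) / len(s)
        total += sum(abs(v - mean) for v in s)
    return total


round_gaussian = isotropic_loss([[2.0, 2.0, 2.0]])   # 0.0: perfectly isotropic
needle = isotropic_loss([[1.0, 2.0, 3.0]])           # 2.0: |1-2| + |2-2| + |3-2|
```

The penalty is zero only when all three axes match, so it pulls each Gaussian toward a sphere without fixing its overall size.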

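For tracking, Matsuki optimized pose directly against the photometric loss of the Gaussian rendering. At first-frame initialization, a monocular depth prior sets the initial positions of the Gaussians, and in later frames a rendering-based refinement runs from the previous pose as its starting point. The key to holding scale without a depth sensor is the combination of isotropic regularization and a geometric consistency loss across keyframes.

> 🔗 **Borrowing.** MonoGS's use of a monocular depth prior inherits the ideas of Godard's MonoDepth2 family covered in Ch.11. It is a case in which the insight of self-supervised monocular depth, that structure can be recovered without depth supervision, was absorbed as a Gaussian initialization strategy.

Matsuki comes from the Imperial Dyson Robotics Lab, the lab led by Davison, in the same lineage as Sucar (iMAP) and Bloesch (CodeSLAM). MonoGS represents the lab's shift of interest from implicit MLPs to explicit Gaussian representations.

On the TUM-RGBD dataset, MonoGS recorded an average ATE RMSE of 4.44cm in monocular mode and 1.58cm with RGB-D. Rendering quality, at an average PSNR of 37.50 dB on Replica in the RGB-D setting, was comparable to the GS-SLAM family of the same generation.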

---

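## RTG-SLAM and real-time processing

The next problem for the GS-SLAM family was speed. GS-SLAM and SplaTAM could hardly be called real time. [RTG-SLAM (SIGGRAPH 2024)](https://arxiv.org/abs/2404.19706), published in 2024 by Zhexi Peng's team at Zhejiang University, set real time as its explicit goal.

RTG-SLAM's strategy is to control the number of Gaussians. It selects and optimizes only the Gaussians that contribute significantly to the current camera view. It initializes Gaussians from surfels (surface elements), which preserves geometry while reducing their count. On the Replica dataset it reached processing speeds close to real time.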

---

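## The competitors that faded

From 2024 onward, TSDF and occupancy grids were pushed out of the mainstream of SLAM mapping. They are still used in embedded systems and safety-critical environments, but at the research front they retreated to a supporting role. NeRF-based SLAM, as of the same year, also moved to a secondary position, losing to 3DGS on rendering speed and update flexibility.

This is a shift in representation and also a shift in hardware affinity. GPU rasterizers are far better optimized than GPU ray marchers. That 3DGS runs naturally on the existing graphics pipeline sped up its adoption relative to NeRF.

> 🔗 **Borrowing.** The differentiable rendering spirit of 3DGS is inherited directly from NeRF. The idea of optimizing a scene representation by gradient, and the way a photometric loss ties observation to rendering, are the legacy of Mildenhall et al. (2020). Kerbl swapped the representation (implicit MLP → explicit Gaussian) while inheriting the paradigm.

> 📜 **Prediction vs. reality.** In §7.4 Limitations of 3DGS (2023), Kerbl et al. listed elongated artifacts and popping in under-observed regions, the absence of regularization, and memory consumption (over 20GB during training, several hundred MB when rendering large scenes) as limitations. As future work they proposed antialiasing, more principled culling, and borrowing point-cloud compression techniques. The memory axis (compression) was answered directly in 2024 by the [Compact 3DGS](https://arxiv.org/abs/2311.13681) family and [Niedermayr et al.](https://arxiv.org/abs/2401.02436). Dynamic scene extensions ([4DGS](https://arxiv.org/abs/2310.08528), [Deformable 3DGS](https://arxiv.org/abs/2309.13101)) and generation and editing ([DreamGaussian](https://arxiv.org/abs/2309.16653), [GaussianEditor](https://arxiv.org/abs/2311.14521)) are areas the original paper did not raise directly, but they branched off as separate lines around 2024. `[hit + expanded]`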

---

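## 🧭 Still open

Memory scaling. The number of Gaussians grows linearly with scene size. What sufficed at a few hundred thousand in the indoor Replica dataset grows to tens of millions for an outdoor city block. Gaussian pruning and level-of-detail hierarchies are being studied, but there is no consensus yet on how to manage the trade-off between memory and rendering quality sensibly in large environments. The Compact 3DGS family (Lee et al. 2024, Niedermayr et al. 2024) is exploring the compression direction.

Semantic integration. Attempts to attach language or semantic features to a scene representation appeared in 2023-2024, with [LERF](https://arxiv.org/abs/2303.09553) on radiance fields and [LangSplat](https://arxiv.org/abs/2312.16084) on Gaussians. But there is still no method that updates semantic Gaussians in real time inside a SLAM pipeline while maintaining tracking quality. The core problem is how to handle the interference that arises when semantics and geometry are optimized jointly.

Dynamic scenes. 4DGS and Deformable 3DGS proposed adding a time dimension to the Gaussians. In a SLAM setting, dynamic objects move differently from the background and have to be handled separately. GS-SLAM (Yan et al. 2023), SplaTAM (Keetha et al. 2024), and MonoGS (Matsuki et al. 2024) all keep the static-world assumption. How Gaussians should represent and track moving objects in dynamic SLAM was open as of 2025.

3DGS left one more question, though. Where do the Gaussians come from? From an SfM point cloud, or from a depth sensor. Poses must be known to place Gaussians, and Gaussians must exist to estimate poses. This chicken-and-egg problem kept the GS-SLAM family dependent on external initialization. DUSt3R and its successors, covered in Ch.16, chose a different starting point: learning the geometry itself from scratch.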

---

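# Ch.15b: Where the Static-World Assumption Breaks Down (Dynamic and Deformable SLAM)

In 2015 Javier Fuentes-Pacheco, with colleagues Ruiz-Ascencio and Rendón-Mancha, published [*Visual simultaneous localization and mapping: a survey*](https://link.springer.com/article/10.1007/s10462-012-9365-8) in Artificial Intelligence Review. The last chapter of that survey was "Dynamic and Deformable Environments". Papers had dealt with moving objects before, but most treated them as outliers for RANSAC to filter out. Fuentes-Pacheco's survey was the first document to declare dynamic environments a field of their own. Ten years later, in 2025, the SLAM Handbook devotes 37 pages to the topic, the most of any chapter. There are six authors: Lukas Schmid of MIT, Daniel Cremers of TU München, Shoudong Huang of UTS, and Montiel, Neira, and Civera of Zaragoza. The reason three schools from three continents gathered in one chapter lies in what happened between 2015 and 2025. The static world was the starting point of every SLAM system, but real robots (autonomous cars on the street, service robots in the home, endoscopes inside organs) had to operate outside that assumption.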

---

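## 15b.1 Three axes

The frame Schmid et al. draw in Handbook Ch.15 §15.1 rewrites the earlier definition of "dynamic SLAM". Whether an environment is dynamic or static is a property of the *observation*, not of the environment. The same physical motion is short-term dynamic for one robot and long-term dynamic for another. What decides it is the ratio between the observation rate $\text{Obs}$ and the rate of change $\text{Dyn}$. When $\text{Dyn} \ll \text{Obs}$, motion is visible between frames; when $\text{Dyn} \gg \text{Obs}$, the scene has changed between visits.

This view yields three axes. The observation axis separates short-term from long-term. The reconstruction axis decides whether to estimate pose only, to recover scene geometry as well, or to go as far as 4D spatio-temporal understanding. The time axis separates online from offline. Earlier accounts compressed the field into the single question "how do we remove dynamic objects?", but seen in this three-axis space that question occupies just one corner. This is why the field fractured. Researchers standing in different corners of the space have been using the same words differently.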

---

## 15b.2 Short-term: from masking to multi-object SLAM

The first solution was simple. Erase whatever moves.

Berta Bescos, then a PhD student at Zaragoza, published [DynaSLAM](https://arxiv.org/abs/1806.05620) in RA-L in 2018. The system inserted Mask R-CNN into the frontend of ORB-SLAM2 to pre-mask people and cars. Masked regions were excluded from keypoint extraction. It was simple, but it worked. On the walking sequences of TUM-RGBD, ATE dropped to single-digit centimeters.

Around the same time Martin Rünz at UCL made a different choice: do not erase moving objects, track them separately. Supervised by Lourdes Agapito, he released [Co-Fusion (Rünz & Agapito, 2017)](https://arxiv.org/abs/1706.06629) and, the following year, [MaskFusion (Rünz et al., 2018)](https://arxiv.org/abs/1804.09194). Each object was assigned its own surfel model, and the camera trajectory and object trajectories were estimated simultaneously. [StaticFusion](https://arxiv.org/abs/1806.05628), presented at ICRA 2018 by Raluca Scona (Edinburgh) and co-authors, was yet another route. It separated dynamic regions using residual clustering alone, with no semantic segmentation, a direction that does not depend on segmentation being correct.

Here the concept shifts once more. What if moving objects are included in the state and estimated jointly? [VDO-SLAM (Zhang et al., 2020)](https://arxiv.org/abs/2005.11052), led by Jun Zhang at ANU, promoted each dynamic object to a variable in the factor graph. The camera pose $T_i^w \in SE(3)$ and the pose of object $k$, $T_{k,i}^w \in SE(3)$, coexisted in the same graph. A constant-velocity factor imposed a continuity constraint on the object's linear and angular velocity. Joint optimization ran on the product manifold of the camera SE(3) and the object SE(3)s. Bescos at Zaragoza implemented the same idea on an ORB-SLAM2 base in [DynaSLAM II (Bescos et al., 2021)](https://arxiv.org/abs/2010.07820). [AirDOS](https://arxiv.org/abs/2109.09903), published in 2022 by Yuheng Qiu at CMU, extended this to articulated bodies, covering jointed objects such as humans.

> 🔗 **Borrowing.** VDO-SLAM's factor graph extension directly inherits the iSAM tradition that Dellaert and Kaess established in the graph SLAM of Ch.6. Adding one variable and attaching one more factor became, in dynamic SLAM, the act of putting a moving car onto the map.

A third angle came in from the inertial side. [DynaVINS](https://arxiv.org/abs/2208.11500), published in RA-L in 2022 by Song, Lim, Lee, and Myung of KAIST URL, used neither semantic masks nor multi-object tracking. Observations that disagreed with the pose prior from IMU preintegration had their factor weights lowered in bundle adjustment, cutting off the route by which dynamic features leak into the joint state. [DynaVINS++](https://arxiv.org/abs/2410.15373), from the same group in RA-L in 2024, recast the idea as adaptive truncated least squares and also caught the failure mode in which dynamic features propagate back into the IMU bias estimate and cause divergence.

The Handbook organizes this lineage as §15.2.3 "Dense Dynamic SLAM" and places Schmid's own [Dynablox (Schmid et al., 2023)](https://arxiv.org/abs/2304.10049) as the current form of LiDAR moving object segmentation (MOS). [AnyCam](https://arxiv.org/abs/2503.23282), from 2025, uses a transformer to extract 4D directly from everyday video. It is the 2025 edition of the "simultaneous tracking + reconstruction" line that Rünz opened in 2017.

---

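## 15b.3 Long-term: maps that cross time

If short-term is motion between frames, long-term is change between visits. The chair seen yesterday has been pushed aside today. This problem grew up in a different lineage.

[RTAB-Map](https://introlab.github.io/rtabmap/), developed from 2013 by Mathieu Labbé at Sherbrooke under Michaud's supervision, borrowed directly from models of human memory. It keeps a hierarchy of short-term, working, and long-term memory and moves nodes between them according to time and observation frequency. Within a session a node stays in working memory; if it is not revisited often it moves down to long-term memory; and when it loses relevance it is discarded. In a 2019 JFR paper, Labbé laid out how this structure scales in multi-session SLAM. [ERASOR](https://arxiv.org/abs/2103.04316), published in 2021 by Hyungtae Lim in Hyun Myung's group at KAIST, chose a different angle. It treated the problem of producing a clean map as scene differencing: when the same place is traversed twice, it finds the points that have disappeared.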

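One frame runs through the whole of Handbook §15.3: **absence of evidence vs evidence of absence**. A system has to tell whether the chair is gone or whether it simply was not seen. Without this distinction, map cleaning erases legitimate objects and change detection misjudges occluded regions. [Panoptic Multi-TSDF](https://arxiv.org/abs/2109.10165), published by Schmid in 2022, solved the problem with a submap structure. Each object is managed as an independent submap, and active and inactive submaps are distinguished under local consistency. [Khronos](https://arxiv.org/abs/2402.13817), which Schmid and colleagues released in 2024, went a step further. It makes association robust with graduated non-convexity and runs deformable geometric change detection even after loop closure, estimating the time at which each object changed. This is the point where a metric-semantic map becomes a 4D spatio-temporal map.

> 🔗 **Borrowing.** The submap structure of Panoptic Multi-TSDF rebuilds, with different material, the multi-map management that Atlas established for ORB-SLAM in Ch.7. Keyframe submaps became panoptic object submaps, but the principle "when one map gets too large, split it" is unchanged.

The same question rolled along separately on the LiDAR side. [Chamelion](https://arxiv.org/abs/2602.08189), released in 2026 by Jang, Lee, Nahrendra, and Myung of KAIST URL, places scene-mixing augmentation on top of a dual-head network and runs change detection without ground truth in transient environments whose structure is overturned from moment to moment, such as construction sites or frequently rearranged interiors. Where Khronos built 4D on the RGB-D and panoptic side, Chamelion carries the question toward long-term map maintenance on point clouds.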

Another axis of this lineage deals with repetition. **Frequency maps**, developed from 2014 by Tomáš Krajník and colleagues, then at the University of Lincoln, model periodic events (commuter traffic, the lighting change between day and night) on a Fourier basis. Maps of Dynamics (MoD), consolidated in 2019 by the Örebro University group of Achim Lilienthal and Martin Magnusson, encode *typical motion patterns* directly in the map. "In this corridor people walk to the left" becomes part of the map. [Changing-SLAM (Soares et al., 2023)](https://arxiv.org/abs/2301.09479) is an attempt to handle both at once on top of an ORB-SLAM extension, with a Kalman filter for the short term and semantic class matching for the long term.

---

## 15b.4 Deformable: when shape itself changes

What happens when even the background moves? Civera and Montiel at Zaragoza stood before this question for a long time.

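The start was elsewhere. The CVPR 2015 best paper was [DynamicFusion](https://grail.cs.washington.edu/projects/dynamicfusion/), presented by Newcombe, Fox, and Seitz of the University of Washington. It placed an embedded deformation graph on top of KinectFusion's canonical TSDF and reconstructed objects deforming in front of the camera (faces, torsos) as non-rigid surfaces in real time. A deformation graph with a rotation and a translation assigned to each node was optimized at every frame. In the same family, Matthias Innmann of Erlangen-Nuremberg added color information with [VolumeDeform](https://arxiv.org/abs/1603.08161) in 2016, and in 2017 [KillingFusion](https://campar.in.tum.de/pub/slavcheva2017cvpr/slavcheva2017cvpr.pdf) by Miroslava Slavcheva of TU München brought in Killing vector field regularization and allowed even topology changes, such as a hand touching and then leaving the torso. [SurfelWarp](https://arxiv.org/abs/1904.13073) (2019), by Wei Gao under Tedrake at MIT, chose surfels over a TSDF so that the model extends more easily as new surface is explored.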

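> 🔗 **Borrowed.** DynamicFusion's embedded deformation graph comes straight from the ED graph that Sumner, Schmid, and Pauly published in computer graphics in 2007. What had been a sparse control graph for mesh deformation became the variable representation of real-time non-rigid SLAM.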

The monocular story unfolded at Zaragoza. Juan Lamarca, who did his PhD under Montiel, published [DefSLAM](https://arxiv.org/abs/1908.08918) in IEEE T-RO in 2021. It recomputes a template at every keyframe with isometric NRSfM and maintains tracking by mixing an ORB frontend with Lucas-Kanade optical flow. Its limitation was the assumption of planar topology. Juan J. Gómez Rodríguez of the same group removed that limit in 2023 with [NR-SLAM](https://arxiv.org/abs/2308.04036), which handles arbitrary topology with a dynamic deformable graph and adds temporal regularization through a visco-elastic model. Handbook §15.4.2 files this lineage as "the monocular line of deformable SLAM".

Applications cluster in medicine. [MIS-SLAM](https://ieeexplore.ieee.org/document/8458232), published in 2018 by Song at the University of Technology Sydney, tracked organ deformation during surgery with stereo endoscopy. EMDQ (Expectation Maximization + Dual Quaternion), developed by Jayender's group at Brigham and Women's Hospital, estimated a smooth deformation field over SURF features. What these systems aim at is running intra-operative navigation under the real conditions of minimally invasive surgery.

Handbook §15.4.1 stresses one fundamental problem: **Floating Map Ambiguity**. Without a prior, the rigid motion of a non-rigid object cannot be distinguished from the rigid motion of the camera. Whether the hand moved 30 cm or the camera moved 30 cm, observation alone cannot say. This differs in kind from the old scale ambiguity of monocular SLAM: not only scale but trajectory and deformation couple together and make the problem ill-posed. DefSLAM and NR-SLAM partially break the ambiguity with isometric and visco-elastic priors, but as of 2026 there is no principled solution.

> 📜 **Prediction vs. reality.** In §7 Future Work of DynamicFusion (2015), Newcombe named "extension to larger scenes and topology changes" and "integration with loop closure" as the next tasks. KillingFusion answered topology changes in 2017. Surfel-based SurfelWarp (2019) partly solved larger scenes. Integration with loop closure appeared only in 2024, in Khronos, under the name deformable geometric change detection. That took nine years. `[partial hit + delayed]`

---

## 15b.5 The intellectual lineage of three schools

The very arrangement of the six authors of Handbook Ch.15 is the evidence.

The **Zaragoza school** (Montiel, Neira, Civera, Lamarca, Rodríguez) is the home of deformable geometry, running from MonoSLAM (Ch.5) through ORB-SLAM (Ch.7), DynaSLAM, DefSLAM, and NR-SLAM. Its tradition of pushing geometry to the limit in the monocular setting has held for twenty years. The **Imperial/TUM line** (Davison, Newcombe, Rünz, Cremers) covers the dense and learning-based axis. It ran from KinectFusion (Ch.9) to DynamicFusion, and from SLAM++ (Ch.18) to Co-Fusion and MaskFusion. As the Cremers group shifted toward change-aware SLAM in the 2020s, it became the center of a new lineage. The **Cambridge/ETH/MIT line** (Schmid, Leutenegger, Agapito) converged on panoptic 4D. Schmid himself finished his PhD at ETH Zürich and then passed through Carlone's group at MIT on the way to JPL. That trajectory overlaps the sequence Panoptic Multi-TSDF → Dynablox → Khronos.

The six-author roster of Handbook Ch.15 reproduces these three schools almost exactly. That the field splits three ways is evidenced by the author list itself.

---

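## 🧭 Still open

**Absence of evidence vs evidence of absence.** Telling whether an object has disappeared from the map or was merely occluded remains a root difficulty of long-term SLAM. Schmid's Panoptic Multi-TSDF gave a partial answer with its active submap structure, but in large outdoor environments and in settings with occlusion above 60%, decision error is still large. As of 2026, no paper claims a principled solution.

**Floating Map Ambiguity.** In deformable SLAM, separating the camera's rigid motion from the object's rigid motion is only sidestepped by isometric and visco-elastic priors. Under what conditions the two motions are identifiable without a prior, and which observations break the ambiguity, are unresolved. Lamarca's [2023 IJRR paper](https://arxiv.org/abs/2302.03710) lays out some of the observability conditions, but there is no general theory yet.

**Online deformable SLAM.** DefSLAM and NR-SLAM have come close to real time, but no system runs Khronos-level change-aware integration online from monocular RGB. The optimization cost exceeds the real-time budget. GPU acceleration and learned priors open possibilities, but there is no validated pipeline yet.

**The real-world gap in medical MIS.** MIS-SLAM and NR-SLAM work on phantom and ex vivo data, but robustness drops in real surgery, with blood, smoke, tool occlusion, and abrupt lighting changes. Gaussian-based attempts such as EndoGS (2024) are appearing, but no system at deployment level has been reported.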

---

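## Note: recommended reframing of Ch.18 §18.4

Seen from the far side of this chapter, the title of Ch.18 §18.4, "The overheating and failure of semantic SLAM", stands on different ground. Dense dynamic SLAM, change-aware SLAM, and deformable SLAM used semantics as an *auxiliary cue* and achieved real success in 2020-2025. What failed was the SLAM++-style prophecy that "semantics will dominate the SLAM frontend", that is, the **object-as-landmark** path. A natural revision is to narrow the §18.4 title to "The contraction of the object-as-landmark path" and to cross-reference this chapter. The concrete edits are handled separately in Phase D3-B.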

---

# Ch.16 — Foundation 3D: from DUSt3R to VGGT

When Philippe Weinzaepfel and Jerome Revaud of Naver Labs Europe published CroCo in 2022, they proposed cross-view self-supervised pretraining, which learns visual representations from the cue that two images show the same scene. It looked like a feature learning paper. A year later the same team built, on CroCo's architecture, a system that outputs pointmaps directly without calibration, and DUSt3R became a paper that redefined multi-view geometry as a whole. The lineage that began at Naver Labs Europe continued into Oxford's VGG group, and as of 2026 it is rewriting the very question "what is SfM".

---

## 16.1 DUSt3R — learned pointmap

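For ten years from 2013, 3D reconstruction followed the same procedure: find features, match them, estimate camera intrinsics and extrinsics, build a point cloud by triangulation, and refine everything with bundle adjustment. [COLMAP (Schönberger & Frahm, 2016)](https://openaccess.thecvf.com/content_cvpr_2016/html/Schonberger_Structure-From-Motion_Revisited_CVPR_2016_paper.html) was the most complete form of this pipeline. Error shrank, but the structure of the procedure did not change.

[Shuzhe Wang et al. 2023. DUSt3R: Geometric 3D Vision Made Easy](https://arxiv.org/abs/2312.14132) bypasses this procedure. It takes two images and directly outputs a 3D coordinate for every pixel. It requires no intrinsics (focal length, principal point). This output, called a pointmap, lives in a shared 3D space (in DUSt3R, the frame of the first camera) rather than in image coordinates. You do not need to know what lens the camera carries.

Why is this possible? DUSt3R's transformer uses the encoder-decoder structure inherited from CroCo. Each image is encoded independently, and in the decoder cross-attention consults the features of the other image. Self-attention handles relations among pixels within one image; cross-attention implicitly learns the correspondence between the two. Which pixel sees the same 3D point as which other pixel is absorbed as a pattern from large-scale data. No hand-coded rules are involved. DUSt3R's training data is millions of image pairs from MegaDepth, ScanNet, ARKitScenes, BlendedMVS, and others. Part of that ground truth, MegaDepth's for example, was produced by COLMAP. Here the reversal occurs: classical SfM supplies the ground truth of the learning era.

> 🔗 **Borrowed.** DUSt3R's backbone comes from ViT ([Dosovitskiy et al. 2020](https://arxiv.org/abs/2010.11929)). The decisive foothold, though, is CroCo ([Weinzaepfel et al. 2022](https://arxiv.org/abs/2210.10716)), prior work inside Naver Labs Europe. CroCo proposed cross-view self-supervised pretraining in which the masked regions of one image are reconstructed using information from the other. DUSt3R inherited CroCo's encoder-decoder structure as is and only changed the task to "pointmap prediction".

Once a pair of pointmaps is obtained from two images, the camera pose is found by rigid alignment between them, a generalization of Procrustes alignment. Pose estimation becomes a by-product of the pointmap.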
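To make "pose as a by-product of alignment" concrete, here is a minimal sketch of rigid Procrustes alignment in 2D, where the optimal rotation has a closed form. It is a toy under stated assumptions: correspondences are known and noise-free, there is no scale factor and no outlier rejection, and the function name `rigid_align_2d` is introduced here. DUSt3R itself works on 3D pointmaps and its exact pose recovery procedure is not reproduced.

```python
import math

def rigid_align_2d(src, dst):
    """Least-squares rotation angle and translation mapping src onto dst.

    src, dst: equal-length lists of (x, y) with one-to-one correspondence.
    """
    n = len(src)
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    dx = sum(q[0] for q in dst) / n
    dy = sum(q[1] for q in dst) / n
    dot = cross = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        ax, ay = px - sx, py - sy          # centred source point
        bx, by = qx - dx, qy - dy          # centred target point
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx
    theta = math.atan2(cross, dot)         # optimal rotation in closed form
    c, s = math.cos(theta), math.sin(theta)
    tx = dx - (c * sx - s * sy)            # translation moves the rotated centroid
    ty = dy - (s * sx + c * sy)
    return theta, (tx, ty)

# Usage: recover a known 30 degree rotation and a shift of (0.5, -1.2).
angle = math.radians(30.0)
src = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
dst = [(math.cos(angle) * x - math.sin(angle) * y + 0.5,
        math.sin(angle) * x + math.cos(angle) * y - 1.2) for x, y in src]
theta, t = rigid_align_2d(src, dst)
```

The recovered `theta` and `t` are the relative pose between the two point sets. In 3D the same idea is usually solved with an SVD of the cross-covariance matrix rather than with `atan2`.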

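When extending to three or ten images, DUSt3R solves a global alignment: an optimization that registers the pointmaps of all image pairs into one common coordinate frame. Only here does something resembling bundle adjustment appear, but it proceeds without feature matching or a camera model.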

---

## 16.2 Swallowing matching: MASt3R

DUSt3R's output is closer to reconstruction than to novel view synthesis. Yet one important subtask of reconstruction, finding precise pixel correspondences between two images (feature matching), is handled only implicitly. Replacing the explicit matching done by SuperPoint+SuperGlue or LightGlue needed an extra component.

[Vincent Leroy et al. 2024. Grounding Image Matching in 3D with MASt3R (ECCV)](https://arxiv.org/abs/2406.09756) adds a matching head to DUSt3R. It is trained to output a feature descriptor per pixel alongside the pointmap, with joint learning so that 3D position and feature stay consistent. The resulting features are anchored in 3D space. Matching reduces to nearest-neighbor search over these descriptors.
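Nearest-neighbor matching over descriptors can be sketched in a few lines. The version below keeps only mutual (reciprocal) nearest neighbors, a common filter for dense descriptors. It is a brute-force illustration with hypothetical names and toy 2D descriptors, not MASt3R's accelerated matcher.

```python
def mutual_nearest_neighbors(desc_a, desc_b):
    """Return index pairs (i, j) where a[i] and b[j] pick each other.

    desc_a, desc_b: lists of equal-length descriptor tuples.
    """
    def sqdist(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))

    def nearest(query, pool):
        return min(range(len(pool)), key=lambda j: sqdist(query, pool[j]))

    a_to_b = [nearest(d, desc_b) for d in desc_a]   # best b for every a
    b_to_a = [nearest(d, desc_a) for d in desc_b]   # best a for every b
    # Keep a match only if both directions agree.
    return [(i, j) for i, j in enumerate(a_to_b) if b_to_a[j] == i]

# Usage: the third descriptor in image A has no counterpart in image B.
desc_a = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
desc_b = [(1.1, 0.0), (0.0, 0.1)]
matches = mutual_nearest_neighbors(desc_a, desc_b)
```

The unmatched descriptor still has a nearest neighbor in B, but B's descriptor prefers a different partner, so the reciprocity check drops it.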

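> 🔗 **Borrowed.** MASt3R's 3D-anchored matching attacks, from another direction, the problem that SuperGlue ([Sarlin et al. 2020](https://arxiv.org/abs/1911.11763)) set out to solve: resolving the ambiguity of 2D descriptors through context. SuperGlue reduced 2D matching ambiguity with a graph neural network. MASt3R learns the 3D structure directly and so removes the source of the ambiguity.

Within months of MASt3R's release, several groups in the SLAM community reported experiments replacing the SuperPoint+SuperGlue combination with MASt3R. When [Riku Murai, Eric Dexheimer, Andrew Davison](https://arxiv.org/abs/2412.12392) of the Davison group at Imperial College London released MASt3R-SLAM at the end of 2024, the system used MASt3R matching as its frontend and graph-based global optimization as its backend. The shape of the classical SLAM architecture is kept while almost every internal part is replaced.

MASt3R's strength is that dense matching works without ground-truth calibration. As of 2026, inserting DUSt3R or MASt3R into COLMAP-based SfM pipelines is becoming standard in experimental setups.

> 📜 **Prediction vs. reality.** The DUSt3R paper has no separate "Future Work" section, but its pair-wise + global alignment structure itself implies sequence processing and real-time operation as the next tasks. Spann3R arrived in August 2024 and MASt3R-SLAM at the end of 2024. The two follow-ups answered sequential extension and SLAM integration, respectively, within 6-12 months. That speed is itself the strange thing about this field. `[in progress]`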

---

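## 16.3 Spann3R — sequential processing

Batch processing, however, has a fundamental constraint: in SLAM the images are never all available in advance.

DUSt3R and MASt3R take a set of images and process it in one batch, like spreading out all the photos from a bag at once and registering them. SLAM is different. Images arrive in time order, and the system must update the map at every frame.

[Hengyi Wang & Lourdes Agapito 2024. 3D Reconstruction with Spatial Memory (Spann3R)](https://arxiv.org/abs/2408.16061) reworks DUSt3R's structure for sequential processing. The key idea is spatial memory. Information from frames already processed is stored in a memory bank, and when a new frame arrives it cross-attends to this memory. Attention decides which past information each pixel of the new image relates to.

> 🔗 **Borrowed.** Conceptually, Spann3R's spatial memory resembles cross-attention memory. Structurally it inherits DUSt3R's pretrained ViT encoder-decoder as is, and adds memory keys that combine decoder output (geometric features) with image features, so that a memory lookup reflects appearance and distance together. The geometric representation DUSt3R captured is reused directly as the index of a sequential memory.

Spann3R keeps DUSt3R's ability to work without a calibrated camera. As each sequential image arrives it incrementally updates the map built so far. It is not fully real-time, but it is one step closer to SLAM use than DUSt3R's batch processing.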

---

## 16.4 VGGT — multi-view joint inference

Spann3R made sequential processing possible. But the pair-wise habit inherited from DUSt3R stayed: frames still enter one at a time against a memory rather than being reasoned over jointly. Enter Oxford's VGG group. In early 2025 Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny pushed DUSt3R's logic to the end: take an arbitrary number of images at once and output camera poses, depth, and point cloud in a single forward pass.

[Jianyuan Wang et al. 2025. VGGT: Visual Geometry Grounded Transformer](https://arxiv.org/abs/2503.11651) turned DUSt3R's pair-wise processing into true multi-view joint inference. To process N images, DUSt3R must compute pointmaps for N(N-1)/2 pairs and then solve global alignment. VGGT passes all N images through the transformer at once. Attention handles the relations among all image pairs simultaneously.
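The pair count grows quickly enough to be worth seeing as numbers. The helper below assumes complete pairing, every image against every other; the function name is introduced here for illustration.

```python
def dust3r_pair_count(n_images):
    """Unordered image pairs a pair-wise model must process under complete pairing."""
    return n_images * (n_images - 1) // 2

for n in (2, 10, 100, 1000):
    print(f"{n:>5} images -> {dust3r_pair_count(n):>7} pairs, versus one joint pass")
```

Ten images already mean 45 pair-wise inferences before global alignment even starts, and a hundred images mean 4,950.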

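> 🔗 **Borrowed.** Behind VGGT's redefinition of COLMAP's role lies an older lineage. The irony that COLMAP itself produces ground truth for the learning era was noted above. But each stage of what COLMAP actually does (pair-wise geometry estimation → graph construction → global optimization) is reproduced implicitly inside VGGT. What classical SfM implemented as algorithms, the foundation model has absorbed as weights.

In quantitative comparison with DUSt3R, VGGT showed a consistent advantage in camera pose accuracy and point cloud quality. It is also faster, since there is no global alignment optimization. And here a conceptual problem arises.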

---

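## 16.5 The boundary between pose estimation and reconstruction dissolves

Classical computer vision separated two problems. Camera pose estimation finds the current position in an already known map; 3D reconstruction recovers the geometry of an unknown environment. SLAM was hard because it solves both at once.

Systems from DUSt3R to VGGT are indifferent to this distinction. Predict the pointmap and the pose follows; get the pose and the reconstruction follows. The ordering itself, "camera first, points later" or "points first, camera later", disappears. One forward pass outputs everything at once.

Is the multi-view geometry we learned discarded, then? No. DUSt3R, MASt3R, and VGGT work well because they have learned, inside transformer weights, the geometric principles that the epipolar constraint, triangulation, and bundle adjustment implement. What was discarded is the explicit algorithmic implementation. The geometry itself survives implicitly.

For researchers, though, this is a real shift. You cannot debug DUSt3R the way you debugged Schönberger's COLMAP code. Where it failed and why are buried in attention weights. The interpretability problem returns in a new form.

> 📜 **Prediction vs. reality.** The MASt3R paper closes briefly, suggesting that matching without ground-truth calibration is open to many downstream tasks. It was not an explicit prophecy of pipeline restructuring. As of 2026, several photogrammetry packages are evaluating DUSt3R/MASt3R as an initialization stage, and it is settling in as a hybrid insertion. `[in progress]`

A single lab, Naver Labs Europe, walked the steps CroCo (2022) → DUSt3R (2023) → MASt3R (2024) within two years. That pace is unusual. One team published, in succession, the stack from pretraining method to matching system. The epicenter was Naver Labs Europe, not Google Brain, DeepMind, or Meta AI. It is the result of concentrated investment by a small team around Weinzaepfel, Revaud, and Leroy. The move to the SLAM stage was taken up by the Davison group at Imperial College London (MASt3R-SLAM).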

---

## 16.6 Another branch — semantic foundation enters the map

Everything so far is geometric foundation. DUSt3R, MASt3R, and VGGT deal with pointmaps, camera poses, and geometric structure. But around 2022 the term "foundation 3D" began to be used in the SLAM literature in two senses. One is the geometric lineage that started at Naver Labs Europe; the other is the semantic lineage that pulls CLIP, DINO, and SAM into the map. The former removed calibration; the latter removed the dictionary.

The semantic branch began in Luca Carlone's group at MIT. [Nathan Hughes et al. 2022. Hydra: A Real-time Spatial Perception System for 3D Scene Graph Construction and Optimization](https://arxiv.org/abs/2201.13360) laid a hierarchical scene graph of objects → places → rooms → buildings, online, on top of the metric-semantic mesh of Kimera (Rosinol 2020). As long as it used a closed-set classifier it stayed inside the constraint the Handbook pins down as a "100-1000 labels predefined dictionary", but Hydra was the first to show a hierarchical map running in real time.

Foundation models tore down the dictionary wall. [Songyou Peng et al. 2023. OpenScene: 3D Scene Understanding with Open Vocabularies (CVPR)](https://arxiv.org/abs/2211.15654) came from the ETH/Pollefeys group, followed soon by [Qiao Gu et al. 2024. ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning (ICRA)](https://arxiv.org/abs/2309.16650) from a Montréal-MIT collaboration. OpenScene distills CLIP features onto 3D points so that "how close is this point to a chair" can be answered by a natural-language query. ConceptGraphs went a step further: instead of class labels it attaches VLM-generated language descriptions as node attributes, and an LLM describes the relations between objects. With open-vocabulary features of the OpenScene kind joined to a Hydra-style hierarchy, scene graphs came to accept concepts absent from any dictionary.

[Dominic Maggio et al. 2024. Clio: Real-time Task-Driven Open-Set 3D Scene Graphs](https://arxiv.org/abs/2404.13696) turned this lineage toward the task. It interprets the natural-language task given to the robot through the information bottleneck and keeps in the scene graph only the level of abstraction the task needs. Given the instruction "clean near the coffee machine", the coffee machine and nearby objects are preserved and irrelevant detail is merged. Which layer of the hierarchical graph is exposed depends on the task.

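> 🔗 **Borrowed.** The ConceptGraphs and Clio lineage inherits Hydra's hierarchical structure. The scene graph definition the Carlone group built on (Armeni → Rosinol-Kimera → Hughes-Hydra → Maggio-Clio) accumulated over roughly five years, from 2019 to 2024, and then CLIP, VLMs, and LLMs were placed on top. The pattern of Part 5, "the representation changes but the structure survives", holds here too. What changed is the feature attached to each node; what survived is the objects-places-rooms hierarchy itself.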

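Putting semantics on the map is a separate trajectory that follows what Ch.18 §18.4 describes as the market contraction of the 2017-2019 object-as-landmark lineage. Two facts deserve recording: semantic SLAM has returned in the form of hierarchical scene graphs, and the geometric foundation (the DUSt3R lineage) and the semantic foundation (Hydra → ConceptGraphs → Clio) have not yet met in earnest as of 2026. An end-to-end system that attaches CLIP features to VGGT pointmaps, or a configuration that combines Clio's scene graph with DUSt3R's calibration-free geometry, has not been reported. The meeting point is deferred to the open problems of Ch.19.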

---

## 16.7 What remains of SLAM

It is telling that MASt3R-SLAM borrows the architecture of classical SLAM. Keyframe selection, loop closure, map management: these structures were still needed on top of the new representation. The DUSt3R family replaced the internals of feature matching and reconstruction, but the system-level decisions of SLAM are reused just as classical methods solved them.

This observation matches a pattern repeated throughout Part 5. NeRF-SLAM adopted NeRF as the map representation yet kept keyframe-based tracking. 3DGS-SLAM adopted Gaussians yet did loop closure the classical way (Ch.15). The dynamic SLAM of Ch.15b likewise swapped only the front-end, mask removal, and left the back-end alone. Change the representation and the system structure still survives.

The pattern repeats for foundation 3D. In 2025 Dominic Maggio and Luca Carlone at MIT released [VGGT-SLAM](https://arxiv.org/abs/2505.12549). VGGT reconstructs local submaps and a factor graph ties them into a global frame. The transformer absorbed the geometry, but the factor graph survived. Revaud himself wrote in Handbook Ch.13 that "a form of factor graph is still necessary". The pace of absorption is unusual, but the final form is still open. How far foundation 3D can carry real-time large-scale sequences, and where it will merge with the semantic branch of §16.6, are things to watch in 2026-2027.

---

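## 🧭 Still open

**The wall of large-scale sequences.** The transformers of DUSt3R and VGGT demand memory that grows quadratically with the number of images. A hundred images is realistic; 1,000 or 10,000 is another matter. Spann3R's incremental approach is a partial answer, but smooth handling of large outdoor environments is unsolved. Who is working on it now? Several groups are exploring sparse attention and hierarchical global alignment, but there is no agreed method.

**How to define loop closure in this frame.** In classical SLAM, loop closure recognizes a previously visited place and corrects accumulated error. In the DUSt3R family, how is "a previously visited place" represented, and how is a correction propagated through a pointmap-based map? MASt3R-SLAM handles it the existing way, but whether that is the best option or a principled solution is unknown.

**Generalizing metric scale.** DUSt3R's pointmap has relative scale. It recovers depth ratios but not absolute scale. Just as Metric3D and Depth Anything v2 aimed at metric depth, foundation 3D still faces the problem of generalizing metric scale. Camera-independent metric output is not easy even at foundation scale. The physical constraint on fixing absolute scale without GPS or IMU exists regardless of data size.

**Is this current the future of SLAM, or a separate branch?** Like 3DGS in Ch.15, foundation 3D is being absorbed by the SLAM community. With MASt3R-SLAM and VGGT-SLAM appearing in succession in 2024-2025, the path of absorption has taken shape. But real-time operation on large sequences, and the junction with the semantic branch of §16.6 (Hydra → ConceptGraphs → Clio), remain unclear. The form in which geometric and semantic foundations meet in one system is a central axis of the open problems in Ch.19.

---

The three chapters of Part 5 reach the same conclusion. NeRF or foundation model, changing the representation changes the internals of reconstruction and localization. But the system-level structure of SLAM (keyframes, loop closure, map management) survives on top of the new representation. Part 6 looks beyond the boundary of this pattern. Ch.17 moves to a parallel trajectory centered on LiDAR rather than the camera: a lineage that solved the same problem in the same period with a different sensor and a different culture.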

---

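# Ch.17 — LiDAR parallel universe: from LOAM to FAST-LIO

The lineage from photogrammetry in Ch.1 to Foundation 3D in Ch.16 stands on one shared premise: the sensor is a camera. MonoSLAM, PTAM, ORB-SLAM, DSO, DUSt3R: all of these names sit inside the tradition of reading the world through pixels. At the same time, inside the same robotics community, an entirely different lineage was growing. The LiDAR lineage built its own grammar on the skeleton of ICP, independent of the camera camp's keypoints, photometric consistency, and feature descriptors. The two lineages hardly cited each other, and their benchmarks and venues differed.

When Ji Zhang presented LOAM at RSS 2014, the visual SLAM community paid the paper little attention. The visual camp was busy with LSD-SLAM that year and with ElasticFusion soon after. The indifference was mutual. LOAM shared no methodology or code with camera-based methods, and the research communities did not overlap. Under the same name of robotics, the two lineages ran for ten years hardly looking at each other. LOAM stood on the old skeleton of [ICP (Besl·McKay, 1992)](https://graphics.stanford.edu/courses/cs164-09-spring/Handouts/paper_icp.pdf), and the factor graph of Graph SLAM crossed to the LiDAR side only long after it had become standard in the visual camp. The parallel universes matured without exchange.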

---

## 17.1 LOAM: edge and plane, and the conquest of KITTI

In 2014 the Google program that preceded Waymo was driving on public roads, and the aftershock of the DARPA Urban Challenge had not yet faded. A Velodyne HDL-64E cost $75,000. The groups that could afford to study LiDAR were roughly CMU, MIT, and Stanford. Among them was the Autonomous Mobile Robot Lab at the CMU Robotics Institute, the lab of Sanjiv Singh.

Attempts to build maps with LiDAR existed before LOAM. [Lu & Milios 1997. "Globally Consistent Range Scan Alignment for Environment Mapping" (Autonomous Robots)](https://doi.org/10.1023/A:1008854305733) proposed treating 2D range scans as nodes, tying them together with relative scan-to-scan constraints as edges, and optimizing the whole trajectory at once; pose-graph SLAM later traces its origin to this "network of poses" idea (see Ch.6). For matching itself, besides the ICP of Besl·McKay there was NDT, proposed in [Biber·Straßer 2003. "The Normal Distributions Transform" (IROS)](https://doi.org/10.1109/IROS.2003.1249285), a distribution-based matching that aligns to per-cell Gaussians, and later Magnusson's 3D extension; both coexisted as alternatives to ICP. All of these were 2D or offline 3D. LOAM's share was real-time 3D.

Under Singh's supervision Ji Zhang produced [Zhang & Singh 2014. "LOAM: Lidar Odometry and Mapping in Real-time" (RSS)](https://www.roboticsproceedings.org/rss10/p07.pdf). It sorts LiDAR points into two kinds of feature. An **edge point** is a point where the smoothness value $c$ is high (high curvature); a **planar point** is one where $c$ is low (low curvature). Instead of registering all points as ICP does, only these two feature sets are matched. Edge points are constrained to edge lines in the neighboring scan by point-to-line distance, planar points to local planes by point-to-plane distance. Computation drops, and real time comes within reach.
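The smoothness value can be sketched directly. The code below follows the form given in the LOAM paper, $c = \frac{1}{|S| \cdot \lVert X_i \rVert} \lVert \sum_{j \in S, j \neq i} (X_i - X_j) \rVert$, on a single scan line. The window size and the two thresholds are assumptions chosen for this toy example; the original system instead picks a limited number of highest-$c$ and lowest-$c$ points per sub-region.

```python
import math

def smoothness(scan, i, half_window=2):
    """LOAM-style smoothness c of point i from its neighbours on one scan line."""
    xi = scan[i]
    neighbours = [scan[j] for j in range(i - half_window, i + half_window + 1) if j != i]
    diff = [sum(xi[k] - p[k] for p in neighbours) for k in range(3)]
    return math.sqrt(sum(d * d for d in diff)) / (
        len(neighbours) * math.sqrt(sum(v * v for v in xi)))

def classify(scan, half_window=2, edge_thresh=0.02, plane_thresh=0.005):
    """Label interior points as 'edge' or 'planar'; ambiguous points get no label."""
    labels = {}
    for i in range(half_window, len(scan) - half_window):
        c = smoothness(scan, i, half_window)
        if c > edge_thresh:
            labels[i] = "edge"
        elif c < plane_thresh:
            labels[i] = "planar"
    return labels

# Usage: a flat wall at y = 2 that turns 45 degrees at index 4.
scan = [(-0.4, 2.0, 0.0), (-0.3, 2.0, 0.0), (-0.2, 2.0, 0.0), (-0.1, 2.0, 0.0),
        (0.0, 2.0, 0.0),
        (0.1, 1.9, 0.0), (0.2, 1.8, 0.0), (0.3, 1.7, 0.0), (0.4, 1.6, 0.0)]
labels = classify(scan)
```

On a straight, evenly sampled segment the neighbor differences cancel and $c$ is close to zero; at the corner they add up. Points next to the corner fall between the two thresholds and are left unlabeled.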

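The algorithm splits into two stages. Lidar Odometry estimates the 6-DoF transform between scans at 10 Hz. Lidar Mapping registers against the whole map at a lower rate (1 Hz) and corrects the error. Separating high-rate odometry from low-rate mapping suppresses drift while keeping real-time operation. This two-tier structure became the basic grammar of later LiDAR SLAM.

On the KITTI benchmark LOAM took first place soon after release and held it for years, more precisely until visual-LiDAR fusion methods appeared. The relative translation error Zhang reported on sequence 00 was 0.78%. Against the best visual odometry of the time, at around 1%, LiDAR's structural advantage is clear.

> 🔗 **Borrowed.** LOAM's feature-based point registration starts from the ICP of Besl·McKay (1992). The difference is that it matches only edge and planar features. By reusing classical registration selectively it gained both speed and accuracy.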

---

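## 17.2 LeGO-LOAM: cut out the ground first

LOAM's weakness was that it did not treat the ground plane explicitly. In outdoor driving, a large share of the point cloud is road surface. Lumping it into edge/planar features produces matching noise.

At the Robust Field Autonomy Lab of Stevens Institute of Technology, Tixiao Shan and his advisor Brendan Englot separated ground segmentation into a first stage in [Shan & Englot 2018. LeGO-LOAM](https://doi.org/10.1109/IROS.2018.8594299). The point cloud is projected to a range image, ground points are separated first, and the non-ground points are then clustered. Ground features are used to estimate roll, pitch, and height; features from the clusters are used to estimate yaw and horizontal translation. It is a two-step optimization.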
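A column-wise slope test on the range image is one simple way to realize "separate the ground first". The sketch below is an illustration in that spirit, not the released LeGO-LOAM code; the function name and the 10 degree threshold are assumptions.

```python
import math

def label_ground(column, max_angle_deg=10.0):
    """Mark ground points in one range-image column.

    column: (x, y, z) points at a single azimuth, ordered from the lowest
            laser ring upward. Two vertically adjacent points are treated
            as ground when the segment joining them is nearly horizontal.
    """
    ground = [False] * len(column)
    for k in range(len(column) - 1):
        (x0, y0, z0), (x1, y1, z1) = column[k], column[k + 1]
        slope = math.degrees(math.atan2(z1 - z0, math.hypot(x1 - x0, y1 - y0)))
        if abs(slope) <= max_angle_deg:
            ground[k] = ground[k + 1] = True
    return ground

# Usage: three returns from the road, then two from a wall 8 m ahead.
column = [(2.0, 0.0, -1.0), (4.0, 0.0, -1.0), (8.0, 0.0, -0.95),
          (8.1, 0.0, 0.5), (8.1, 0.0, 1.5)]
flags = label_ground(column)
```

The failure mode named in the text shows up immediately in such a test: on rough terrain or ramps the slope between adjacent rings exceeds the threshold and real ground is rejected.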

The result was a cut in computation relative to LOAM. Where the original LOAM struggled to run in real time on a Velodyne VLP-16, LeGO-LOAM runs with the same sensor even on an embedded platform (NVIDIA Jetson). The lightness has a price. In sparse point clouds or where ground structure is irregular (occluded stretches, rough terrain, building interiors), segmentation fails and odometry wobbles.

But LeGO-LOAM's real contribution was a design principle: "preprocess the sensor input through structured modules, then run odometry". FAST-LIO and LIO-SAM later adopt this principle.

Around the same time as LeGO-LOAM, Jens Behley and Cyrill Stachniss at the University of Bonn brought **surfels** (surface elements), rather than edge/plane features, to outdoor LiDAR. **SuMa**, from [Behley & Stachniss 2018. "Efficient Surfel-Based SLAM using 3D Laser Range Data in Urban Environments" (RSS)](http://www.roboticsproceedings.org/rss14/p16.pdf), summarizes each point neighborhood as a disc-shaped surfel and performs scan-to-model registration; the follow-up [Chen et al. 2019. "SuMa++" (IROS)](https://doi.org/10.1109/IROS40897.2019.8967704) added semantic segmentation to filter moving objects at the surfel level. This is the moment the surfel representation used by ElasticFusion in the indoor Kinect RGB-D lineage (Ch.9) crossed over to outdoor Velodyne. Three branches, feature selection (LOAM), segmentation first (LeGO-LOAM), and surfel accumulation (SuMa), were competing simultaneously around 2018.

---

## 17.3 FAST-LIO — tightly coupled LiDAR-IMU

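A LiDAR scans at 10-20 Hz. Fast motion within a scan produces motion distortion in the point cloud, because the sensor pose at the end of the scan differs from the pose at the start. This is the main reason the LOAM family wobbles on fast platforms.

An IMU runs at 100-400 Hz, enough to fill the gaps between LiDAR scans. But performance depends on how LiDAR and IMU are combined. Loosely coupled systems estimate each independently and then fuse; tightly coupled systems process both inside one state estimator. The latter is theoretically superior but harder to implement.

Wei Xu and his advisor Fu Zhang at the MaRS Lab of the University of Hong Kong (HKU) published [**FAST-LIO**](https://arxiv.org/abs/2010.08196) in RA-L in 2021. It came from a drone control lab. The field motivation was that LiDAR odometry must hold up on UAVs with heavy rotor vibration and fast maneuvers. Their tool of choice was the **iterated Extended Kalman Filter (iEKF)**. In the measurement update the iEKF repeatedly re-linearizes around the current estimate. It is consistently better in fast nonlinear motion than the basic EKF, which stops after a single linearization.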
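The difference between one linearization and repeated re-linearization shows up even in a scalar toy problem. The sketch below assumes a made-up measurement model $h(x) = x^2$ and hypothetical function names; FAST-LIO's actual filter runs on a manifold state with LiDAR point-to-plane residuals and is not reproduced here.

```python
def h(x):
    """Toy nonlinear measurement model (an assumption for illustration)."""
    return x * x

def h_jac(x):
    return 2.0 * x

def ekf_update(x_prior, p_prior, z, r):
    """Standard EKF update: linearize once, at the prior mean."""
    H = h_jac(x_prior)
    K = p_prior * H / (H * p_prior * H + r)
    return x_prior + K * (z - h(x_prior)), (1.0 - K * H) * p_prior

def iekf_update(x_prior, p_prior, z, r, iters=20, tol=1e-10):
    """Iterated EKF update: re-linearize around the latest estimate."""
    x = x_prior
    for _ in range(iters):
        H = h_jac(x)
        K = p_prior * H / (H * p_prior * H + r)
        # The extra H * (x_prior - x) term keeps the prior anchored at x_prior.
        x_new = x_prior + K * (z - h(x) - H * (x_prior - x))
        converged = abs(x_new - x) < tol
        x = x_new
        if converged:
            break
    H = h_jac(x)
    K = p_prior * H / (H * p_prior * H + r)
    return x, (1.0 - K * H) * p_prior

# Usage: true state 3.0 observed as z = 9.0, with a poor prior of 2.0.
x_ekf, _ = ekf_update(2.0, 1.0, 9.0, 0.01)
x_iekf, _ = iekf_update(2.0, 1.0, 9.0, 0.01)
```

With an accurate measurement and a poor prior, the single-step EKF overshoots to about 3.25, while the iterated update settles near 3.0. The gap widens as the motion between updates grows, which is the situation on an agile UAV.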

**FAST-LIO2** ([Xu et al. 2022](https://doi.org/10.1109/TRO.2022.3141876)), published in TRO the following year, added the ikd-Tree. A conventional kd-Tree pays a high rebuild cost as points are added. The ikd-Tree is incremental and performs only partial rebuilds. Real-time nearest-neighbor search remains possible even with millions of map points. Experiments showed consistent performance on UAV, handheld, and autonomous car platforms. Drift stayed low on drones as well.

The next move in the FAST-LIO line was to remove motion distortion altogether. [He et al. 2023. "Point-LIO: Robust High-Bandwidth Light Detection and Ranging Inertial Odometry" (Advanced Intelligent Systems)](https://doi.org/10.1002/aisy.202200459), from the same MaRS Lab, updates the state each time a LiDAR point arrives: a point-by-point measurement update. Fusing each point at its own timestamp leaves no room for distortion to arise. Lower drift than FAST-LIO2 was reported on highly agile platforms.

> 🔗 **Borrowed.** FAST-LIO's tightly coupled IMU integration transplants to LiDAR a formulation that was first worked out in the Visual-Inertial SLAM camp. IMU preintegration theory was completed in [Forster et al. 2016. "On-Manifold Preintegration" (TRO)](https://doi.org/10.1109/TRO.2016.2597321), and FAST-LIO re-implemented its spirit in iEKF form.

---

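## 17.4 LIO-SAM: the factor graph crosses to LiDAR

In the same period, factor graphs were already standard in the visual SLAM camp. [GTSAM (Dellaert·Kaess, 2012)](https://gtsam.org/) was established as the backend of visual-inertial systems. The LiDAR camp, however, was still EKF-based or scan-matching-based. The main benefit of graph optimization, correcting the whole trajectory after loop closure, was not properly used by LiDAR systems.

[Shan et al. 2020. LIO-SAM](https://doi.org/10.1109/IROS45743.2020.9341176), which Tixiao Shan produced after LeGO-LOAM, explicitly adopted GTSAM's factor graph as the backend of a LiDAR-IMU system. IMU preintegration factors, LiDAR odometry factors, GPS factors, and loop closure factors are combined in one graph. Each keyframe becomes a node and sensor constraints become edges. Marginalization keeps the graph size under control.

> 🔗 **Borrowed.** LIO-SAM's factor graph backend brings the graph optimization that GTSAM standardized in the visual SLAM camp straight into a LiDAR system. The factor graph framework that Dellaert laid out (from 2006 on) became a common language of robotic state estimation regardless of sensor type. LIO-SAM is a case that shows the migration.

LIO-SAM is stronger than FAST-LIO2 in scenarios where drift accumulates, because it has loop closure. On the other hand its computational cost is higher, and without GPS or other additional sensor input the advantage of the factor graph shrinks. The two systems have different design goals. FAST-LIO2 targets maximum speed and accuracy in a minimal real-time LiDAR-IMU configuration; LIO-SAM targets consistency in multi-sensor long-term mapping.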

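There was also a paradoxical return. For nearly ten years after LOAM, LiDAR odometry ran toward ever more complexity (feature selection, surfels, neural descriptors), and in 2023 [Vizzo et al. 2023. "KISS-ICP: In Defense of Point-to-Point ICP" (RA-L)](https://doi.org/10.1109/LRA.2023.3236571) from the University of Bonn went the other way. With no feature extraction and no learned descriptors, just point-to-point ICP with an adaptive threshold that needs almost no tuning, it showed competitive odometry on KITTI. As the name says: Keep It Small and Simple. The authors' argument came close to a revision of history: "LOAM arose because the engineering was lacking". What made the return to classical registration possible ten years on was practical progress in compute and in nearest-neighbor data structures.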

---

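## 17.5 Falling sensor prices and adoption: 2007–2024

In the history of LiDAR SLAM, sensor price matters as much as the technical papers.

The Velodyne HDL-64E that leading teams mounted at the 2007 DARPA Urban Challenge cost $75,000 per unit. It was out of reach for anyone but autonomous driving teams and defense projects. In 2012 the HDL-32E was still around $30,000. By 2014, when LOAM was published, the VLP-16 had come down to $7,999, still a large part of a research budget.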

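Over the following ten years the picture reversed. The Chinese startup Livox (a DJI affiliate) launched the Livox Mid-40 at $599 in 2019. Ouster pulled 128-channel sensors down to the several-thousand-dollar range. In 2023-2024 solid-state LiDAR from RoboSense, Innovusion, and Livox fell below $500. Measured from the $75,000 of 2007, a price drop of more than 100× took about seventeen years.

Adoption did not outrun algorithm development. Unlike the spinning type, solid-state LiDAR has a limited field of view (FoV): 70°×70° or narrower. It is not the 360° omnidirectional scan that LOAM and LeGO-LOAM assumed, so algorithms built on that assumption do not work out of the box. The spread of cheap sensors created new algorithmic research problems at the same time.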

---

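## 17.6 Why the visual and LiDAR lineages split

Although visual SLAM and LiDAR SLAM developed in the same era, the two communities did not interact for a long time. The reasons lay on more than one layer.

First is the sensor itself. A camera sees texture and color; LiDAR sees range and geometry. While camera-based methods developed around keypoints, descriptors, and photometric consistency, LiDAR differentiated into edges, planes, and range images. The problem formulation itself was different.

The venues differed too. CVPR and ICCV were the main stage for camera-based methods, while LiDAR SLAM appeared mostly at ICRA, IROS, and RSS. The researcher populations did not overlap. In the early-to-mid 2010s, when Velodyne was supplying Google and the autonomous driving industry, LiDAR SLAM researchers clustered on the autonomous-driving side of robotics.

Place recognition differed as well. Cameras use visual appearance, as in DBoW2 and NetVLAD. LiDAR uses structural features of the 3D point cloud, as in [Scan Context (Kim·Kim, 2018)](https://gisbi-kim.github.io/publications/gkim-2018-iros.pdf) or [PointNetVLAD](https://arxiv.org/abs/1804.03492). Even at the same place, the signal being recognized is different.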
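What "structural features of the point cloud" means can be seen in a stripped-down Scan Context style descriptor: bin the points of one scan by range ring and azimuth sector, and store the maximum height per bin. The grid size, the range limit, and the assumption that `z` is a non-negative height are choices made for this sketch, not the published configuration.

```python
import math

def scan_context(points, num_rings=4, num_sectors=8, max_range=20.0):
    """Polar max-height descriptor of one scan, as a rings x sectors grid.

    points: iterable of (x, y, z) in the sensor frame, z taken as height >= 0.
    """
    desc = [[0.0] * num_sectors for _ in range(num_rings)]
    for x, y, z in points:
        r = math.hypot(x, y)
        if r >= max_range:
            continue                                   # outside the descriptor
        ring = int(r / max_range * num_rings)
        azimuth = math.atan2(y, x) % (2.0 * math.pi)   # wrap into [0, 2*pi)
        sector = min(int(azimuth / (2.0 * math.pi) * num_sectors), num_sectors - 1)
        desc[ring][sector] = max(desc[ring][sector], z)
    return desc

# Usage: two points share a bin (the taller wins), one lies further out.
cloud = [(5.0, 0.1, 1.0), (5.2, 0.2, 0.4), (1.0, 12.0, 2.5), (30.0, 0.0, 9.0)]
desc = scan_context(cloud)
```

Rotating the sensor about its vertical axis shifts the columns of this grid, so two visits with different headings can be compared by trying column shifts. No pixel appearance is involved, which is why the camera and LiDAR place recognition literatures developed separately.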

The first signs of convergence appeared in the early 2020s. Papers on LiDAR-camera fusion began to show up at CVPR, and [LVI-SAM (2021)](https://arxiv.org/abs/2104.10831) by Tixiao Shan attached a visual-inertial subsystem to LIO-SAM. The authors presented it as a tightly coupled factor graph, but since the two subsystems (LIS and VIS) run independently and help each other on failure, fully unified state estimation is still open.

---

## 17.7 Attempts at visual-LiDAR convergence: 2024-2025

Around 2024 the mood changed. As foundation models moved toward extracting features regardless of sensor, attempts to process camera and LiDAR in one frame increased. There are two branches.

One is multi-modal pretrained features: aligning LiDAR and camera in the same embedding space. Just as [CLIP (Radford et al., 2021)](https://arxiv.org/abs/2103.00020) achieved image-text alignment, this approach uses LiDAR-image contrastive learning. Several groups were at the experimental stage in 2023-2024.

The other is unified sensor abstraction: merging sensor outputs into geometric primitives or a neural field and processing them in a single backend. This is still at the research-paper stage, and systems that have shown real-time operation are rare.

Neither direction has yet produced a single lineage that truly unifies LiDAR SLAM and visual SLAM. FAST-LIO2 and ORB-SLAM3 are still used independently.

---

## 17.8 Radar is outside the scope of this book

Right next to the LiDAR parallel universe lies yet another. Radar SLAM matured as an independent subfield on two hardware branches, spinning radar (the Navtech CIR family) and SoC-based 4D mmWave radar. It rests on direct measurement of Doppler radial velocity, which makes correspondence-free odometry possible, and on noise models peculiar to radio waves such as speckle, multipath, and receiver saturation. The lineage runs from the Oxford radar localisation of [Cen & Newman 2018](https://doi.org/10.1109/ICRA.2018.8460687) through **CFEAR** by Adolfsson and Magnusson and its successor **TBV-SLAM** to the continuous-time ICP of Burnett and Barfoot, and dedicated datasets such as Oxford Radar RobotCar, Boreas, and MulRan form its benchmark base. The practical motive, seeing through bad weather and smoke, is clear, but the contact with the lineage this book has traced (photogrammetry → SfM → Visual SLAM → learning → 3D foundation) is thin. Radar is left as "a neighbor that will join later", and this book does not write its separate history; for details see Handbook of SLAM (2026) Ch.9.

---

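## 📜 Prediction vs. reality

> In the Conclusion of the 2014 LOAM paper, Zhang and Singh named two items explicitly as future work. First, introducing loop closure to correct drift. Second, combining the IMU output with their method through a Kalman filter. Both came true within ten years. IMU fusion was settled in tightly coupled form with an iEKF by FAST-LIO (2021) and FAST-LIO2 (2022), and loop closure was integrated through a factor graph backend by LIO-SAM (2020). The path the authors sketched was implemented quite accurately. Beyond these two axes, however, there was a task that did not appear in the paper's Conclusion but kept surfacing in practice: dynamic object handling. Separating moving pedestrians and vehicles from LiDAR points in real time still depends mainly on deep learning segmentation in 2026, and a solution built into the SLAM algorithm itself is still absent. `[hit + in progress]`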

---

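## 🧭 Still open

**Full visual+LiDAR fusion.** Even after LVI-SAM, no system that handles the two sensors tightly coupled inside one state estimator has reached practical use. Scenarios where the camera fails in fog or rain and LiDAR must fill the gap are a clear requirement in autonomous driving. Algorithmic difficulty and sensor calibration remain the barrier. In 2024-2025 several groups are experimenting with transformer-based fusion, without consistent results.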

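**Algorithms tuned to solid-state LiDAR.** LOAM and LeGO-LOAM assume a 360° spinning LiDAR. Solid-state products such as Livox's use a non-repetitive scan pattern: rather than retracing the same lines, each sweep lands on new points, so coverage of the field of view fills in as scans accumulate. Feature extraction and motion distortion correction suited to this property need separate research. Livox LOAM exists, and FAST-LIO2 sidesteps part of the problem by registering raw points, but generality is still limited.

**Dynamic objects.** This problem sits where it sat when LOAM was published in 2014. The static-world assumption is an old premise of SLAM, and LiDAR is no exception. Separating moving objects from the point cloud in real time is currently delegated to a segmentation network as a workaround. Geometry-based handling inside SLAM is computationally expensive and unstable in accuracy. Companies such as Waymo run in-house solutions, but these are not published general algorithms.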

---

The LiDAR lineage matured without crossing the visual main axis. Each lineage has built its own language, and translation between those languages is still under way.

---

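# Ch.18 — Failures and vanished lineages

While LOAM and FAST-LIO2 were maturing, there were people inside the robotics community walking in other directions: lineages that were neither camera nor LiDAR. That they were not chosen does not mean they were absent from history.

The history of SLAM is not made of successful lineages alone. In every decade there were approaches that had enough papers and early results and still failed to enter the mainstream. Either engineering scale-up stalled or a more practical alternative settled in first. That is a different matter from technical failure.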

---

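## 18.1 RatSLAM — a topological map built on place cells

RatSLAM, presented at ICRA 2004 by [Milford et al. 2004](https://doi.org/10.1109/ROBOT.2004.1302555), approached place recognition in an entirely different way. It imitated the firing patterns of **place cells** and **head direction cells** in the rat hippocampus, so that the robot forms place representations naturally as it explores. The computational model was a **Continuous Attractor Network (CAN)**. The activation of the neurons forms a continuous 'bump' on a grid over position and heading, and the bump moves across the grid according to the robot's velocity and rotation inputs (path integration). When visual input arrives it is compared with stored place representations and the bump position is corrected. This loop, bump propagation from motion and correction from visual matching, is RatSLAM's core operating principle.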
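The propagate-then-correct loop can be sketched on a one-dimensional ring of cells. This is a deliberately reduced toy: RatSLAM's network spans position and heading together, and the kernel, inhibition constant, and injection strength below are assumptions chosen only so that the bump behaves visibly.

```python
def can_step(activity, shift, kernel=(0.06, 0.24, 0.40, 0.24, 0.06), inhibition=0.02):
    """One update of a ring attractor: move, excite locally, inhibit globally."""
    n = len(activity)
    half = len(kernel) // 2
    # Path integration: shift the whole activity packet by the motion input.
    moved = [activity[(i - shift) % n] for i in range(n)]
    # Local excitation: each cell excites its neighbours (circular convolution).
    excited = [sum(w * moved[(i + k - half) % n] for k, w in enumerate(kernel))
               for i in range(n)]
    # Global inhibition, then normalisation keeps total activity constant.
    inhibited = [max(0.0, a - inhibition) for a in excited]
    total = sum(inhibited)
    return [a / total for a in inhibited]

def inject(activity, cell, strength):
    """Visual correction: a recognised view adds energy at its stored cell."""
    out = list(activity)
    out[cell] += strength
    return out

def peak(activity):
    return max(range(len(activity)), key=lambda i: activity[i])

# Usage: start at cell 10, move one cell per step, then see a familiar view.
cells = [0.0] * 40
cells[10] = 1.0
for _ in range(5):
    cells = can_step(cells, shift=1)
after_motion = peak(cells)                      # bump carried by odometry alone
cells = can_step(inject(cells, 30, 2.0), shift=0)
after_correction = peak(cells)                  # bump pulled to the recognised place
```

Motion alone carries the bump from cell 10 to cell 15; a strong enough visual match relocates it to cell 30. The map that results says which place the robot is in, not where that place is in meters.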

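> 🔗 **Borrowed.** The discovery of place cells by [O'Keefe and Dostrovsky (1971)](https://pubmed.ncbi.nlm.nih.gov/5124915/) began in neuroscience and led to cognitive map theory. RatSLAM was the first complete attempt to move that biological mechanism into an engineering system. The door was opened, but few engineers walked in.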

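Milford and Gordon Wyeth, then at the University of Queensland, repeated outdoor driving experiments on suburban roads in Brisbane between 2004 and 2008. The test car carried a camera on its roof through residential streets. RatSLAM took that image stream, recognized roads it had already traveled, and closed loops. The [Milford & Wyeth 2008](https://doi.org/10.1109/TRO.2008.2004520) IEEE T-RO paper reports processing tens of thousands of images over a 66 km route. Geometric SLAM systems of the time were struggling at scales of a few hundred meters, so by the numbers alone RatSLAM was ahead.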

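But engineering scale-up stopped there. The CAN's computational cost rose with the number of places. The deeper problem was precision. The topological map RatSLAM builds could judge "I have been here before" but could not reliably deliver metric position at the meter level. Autonomous driving and manipulation demand accurate coordinates. A cognitive map did not meet that demand.

> 📜 **Prediction vs. reality.** In the Conclusion of the 2008 T-RO paper, Milford and Wyeth claimed that RatSLAM is an alternative approach to vision-only SLAM and that it performs repeated, reliable loop closure in environments that would challenge existing state-of-the-art SLAM: long routes, large accumulated error, visual ambiguity. The claim was alternative, not replacement. In practice it was partly right. RatSLAM was competitive on particular benchmarks. But in the field as a whole, graph-based SLAM and visual odometry pulled ahead in both accuracy and speed after 2012; topological maps still appear in some place recognition research, but RatSLAM's original ambition of metric-topological integration was not carried on in another form. `[partial hit + lapsed]`

What RatSLAM left behind was the idea that "place representation is possible without geometry". That idea seeped into the place recognition literature. [SeqSLAM](https://doi.org/10.1109/ICRA.2012.6224623) came from the same Milford group in 2012, and place recognition by image-sequence comparison became one axis of visual place recognition benchmarks. The lineage itself survived; only its form changed.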

---

## 18.2 The engineering limits of biologically-inspired SLAM

RatSLAM was the most complete example of biologically-inspired SLAM, but it was not alone. From the mid-2000s to the early 2010s, SLAM variants imitating cognitive maps, entorhinal grid cells, and hippocampal replay appeared steadily. All of them carried similar problems.

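A biological model describes *how* the brain represents space. *Why* it does so in that way, and whether that way also suits engineering purposes, are different questions. The rat hippocampus is a structure shaped by hundreds of millions of years of evolution to fit particular environments and behaviours. Those are not the conditions a robot operates under.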

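Engineering SLAM demands sub-metre localization accuracy, real-time processing, fast adaptation to new environments, and verifiable error bounds. Cognitive models struggled to guarantee these. Neuroscience and robotics can draw inspiration from each other, but the distance between them was not small.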

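In the 2020s there is room for this discussion to reopen. The observation has been made that the way foundation models form large-scale representations on their own structurally resembles the emergent properties of place cells. Whether this is rediscovery or convergence along a different route is not yet known.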

---

## 18.3 Event SLAM: the maturity gap between hardware and algorithms

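The [Dynamic Vision Sensor (DVS)](https://doi.org/10.1109/JSSC.2007.914337), developed by Patrick Lichtsteiner, Christoph Posch, and Tobi Delbruck at the ETH Zürich Institute of Neuroinformatics (INI), was first shown at ISSCC 2006 and published in full in the IEEE Journal of Solid-State Circuits in 2008. Each pixel independently compares the change in log intensity against a threshold and asynchronously outputs events of positive (ON) or negative (OFF) polarity. With no global shutter, each pixel records its own firing time at microsecond resolution. It was a camera without frames.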
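
The per-pixel rule can be sketched for a single pixel. This is a simplified model under stated assumptions, not the DVS circuit: intensity is given as discrete samples rather than a continuous signal, the ON and OFF thresholds are equal, there is no noise or refractory period, and the function name `dvs_events` and its parameters are hypothetical.

```python
import math


def dvs_events(samples, threshold=0.2):
    """Generate (timestamp_us, polarity) events for one pixel.

    samples: list of (timestamp_us, intensity) with intensity > 0.
    An event fires each time the log intensity has moved by `threshold`
    away from the reference level stored at the previous event.
    """
    events = []
    _, first_intensity = samples[0]
    reference = math.log(first_intensity)
    for t_us, intensity in samples[1:]:
        diff = math.log(intensity) - reference
        # A large change between two samples yields several events in a row.
        while abs(diff) >= threshold:
            polarity = 1 if diff > 0 else -1      # +1 = ON, -1 = OFF
            events.append((t_us, polarity))
            reference += polarity * threshold
            diff = math.log(intensity) - reference
    return events


if __name__ == "__main__":
    pixel = [(0, 100.0), (10, 100.0), (20, 150.0), (30, 60.0)]
    for event in dvs_events(pixel):
        print(event)
```

A constant pixel produces no output at all, which is why the data rate of such a sensor follows scene motion rather than a frame clock.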

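> 🔗 **Borrowed.** The DVS event sensor (Lichtsteiner et al. 2008) was hardware inspired by the change-detection mechanism of the biological retina. Event SLAM started with this sensor already in hand. Hardware ran ahead of algorithms, and closing that gap took ten years.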

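The advantages of event cameras made a good list: microsecond temporal resolution, no blur under fast motion, high dynamic range (HDR) that copes with both tunnels and direct sunlight, and power consumption a few tens of times lower than a conventional camera. These were numbers that looked good in a paper.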

At ICRA 2014, [Weikersdorfer et al. 2014](https://doi.org/10.1109/ICRA.2014.6906882) presented event-based 3D SLAM. The same year, other groups produced event-based optical flow and depth estimation. Between 2016 and 2018, Henri Rebecq (in Davide Scaramuzza's group, the RPG lab at the University of Zurich) published [EVO](https://doi.org/10.1109/LRA.2016.2645143) (RA-L 2017) and [ESIM](https://proceedings.mlr.press/v87/rebecq18a.html) (CoRL 2018), among others, and the event SLAM pipeline took concrete shape.

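In real environments the results fell short of expectations. The problem lay in two places. The first was resolution. Early DVS sensors had 128×128 pixels. That was far below even a conventional VGA camera, and in SLAM, where feature matching and map building depend directly on resolution, the constraint was severe. The second was the algorithmic paradigm itself. Existing frame-based algorithms could not be applied to an event stream as they were. A new approach was needed, and developing it took time.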

From 2014 to 2018, event SLAM produced good results in controlled environments and under low-texture conditions, but it did not overtake conventional visual-inertial odometry in general environments.

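Meanwhile the reach of event-based approaches quietly spread beyond odometry. [EventVLAD](https://ieeexplore.ieee.org/document/9635907/) (Lee & Kim, IROS 2021) aggregated edge images reconstructed from the event stream into a NetVLAD descriptor and showed that place re-recognition remained possible under abrupt illumination change and motion blur. It was an attempt to have events take on the environments that frame-based VPR found hard.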

---

## 18.4 Semantic SLAM: the narrowing of the object-as-landmark route

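From 2017 to 2019, "semantic" was a fixture in session titles at CVPR, ECCV, and IROS. Deep learning was opening one breakthrough after another in instance segmentation and object detection. SLAM researchers asked: "What happens if we integrate this semantic understanding into SLAM?" The question itself was not wrong; the execution failed to keep up with the discourse.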

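**SLAM++**, by [Salas-Moreno et al. 2013](https://doi.org/10.1109/CVPR.2013.178), was the first major declaration in that lineage. Salas-Moreno and the group of his advisor Andrew Davison at Imperial College took *objects*, rather than points or patches, as the basic unit of the map. They stored predefined 3D object models such as chairs, desks, and monitors in a database, and during SLAM they recognized those objects in the RGB-D input through ICP (Iterative Closest Point) based registration and placed them in the map. Representing the map with dozens of objects instead of thousands of points would shrink the map and let place recognition and loop closure operate at a more semantic level.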

> 🔗 **Borrowed.** SLAM++'s object-level representation combined the scene graph representation of computer graphics with model-based recognition from computer vision. The idea led on to LERF (Language Embedded Radiance Fields) and LangSplat in the 2020s. Only the unit of representation changed, from object to language feature; the intuition that "the map should be semantic" survived.

After SLAM++, between 2017 and 2019, [SemanticFusion](https://arxiv.org/abs/1609.05130) (McCormac et al., ICRA 2017), [MaskFusion](https://arxiv.org/abs/1804.09194) (Rünz et al., ISMAR 2018), and the family of features built on [SuperPoint](https://arxiv.org/abs/1712.07629) (DeTone et al., 2018) appeared in quick succession. The period shared one claim: "deep semantic features are more robust to environmental change than geometric features, and SLAM with integrated semantic understanding is the next step."

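Events unfolded differently. Through 2020, what pushed performance up on autonomous driving benchmarks was traditional geometric pipelines such as ORB-SLAM2, VINS-Mono, and LIO-SAM. Systems that integrated deep semantic features were competitive only in specific indoor environments with fixed object classes. There were also cases where, with new object categories or previously unseen environments, the semantic prior actually increased drift.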

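> 📜 **Prediction vs. outcome.** In the Conclusion of the SLAM++ paper, Salas-Moreno described the approach as a first step toward more generic SLAM methods, and hoped it would extend to objects with low-dimensional shape variation and, in the long run, to systems that segment and define object classes on their own. The paper's introduction further claimed that an object-level representation brings a large compression of map storage along with gains in efficiency and robustness. The outcome was only a partial hit. Object-level maps found a place in AR and in specific manipulation applications, and the compression and efficiency benefits were reconfirmed in indoor environments with repeated objects. But mainstream geometric SLAM still keeps sparse points and keyframe-based graphs as of 2026, and the stage at which a system segments and defines objects by itself has not been reached. Semantics ended up settling not inside SLAM but in downstream tasks such as semantic mapping and task planning. `[partial hit + route change]`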

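Why did semantic-first SLAM not become mainstream? There were two causes. One was dependency. Semantic SLAM needed accurate segmentation, and when segmentation was wrong the whole map was contaminated. Geometric pipelines, by contrast, survived partial failures of feature matching through robust estimation. The other was generalization. A semantic prior trained on specific object classes was useless outside those classes. The environments SLAM has to enter were far wider than the world that prior assumed.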

What narrowed was the object-as-landmark route. Another route survived over the same period. [SuMa++](https://doi.org/10.1109/IROS40897.2019.8967704) (Chen et al., IROS 2019) overlaid semantic classes on LiDAR point clouds to filter out dynamic objects, and [Kimera](https://doi.org/10.1109/ICRA40945.2020.9196885) (Rosinol et al., ICRA 2020) tied together a metric-semantic mesh and a 3D scene graph. [Hydra](https://doi.org/10.15607/RSS.2022.XVIII.050) (Hughes et al., RSS 2022) extended that scene graph to real-time, hierarchical operation, and with [ConceptGraphs](https://doi.org/10.1109/ICRA57147.2024.10610243) (Gu et al., ICRA 2024) and [Clio](https://doi.org/10.1109/LRA.2024.3451395) (Maggio et al., RA-L 2024) open-vocabulary foundation features were layered on top. Semantics survived by moving up to a higher layer of the map. This lineage is still in progress as of 2026, and is taken up again in [Ch.15b](chapter_15b_dynamic.md) (the semantic return in dynamic/static separation), [Ch.16](chapter_16_foundation_3d.md) (foundation 3D and the metric-semantic core), and [Ch.19 §19.7](chapter_19_open_problems.md#197-semantic-표현의-귀환과-open-world) (the return of semantics).

---

## 18.5 The Manhattan-world assumption: scope and disappearance

Around the same time, another lineage was quietly tried and quietly disappeared: SLAM built on the Manhattan-world assumption.

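The assumption itself was simple. Taking up the Manhattan world concept of [Coughlan & Yuille 1999](https://doi.org/10.1109/ICCV.1999.790349), it held that indoor environments are mostly aligned with three orthogonal axes (x, y, z) of the world frame. Walls, floors, and ceilings set those directions. A bundle of parallel lines in the image converges to a vanishing point, and each vanishing point is tied to the camera rotation matrix R and the direction vector d by `v = K R d` in homogeneous coordinates, up to scale (K: the camera intrinsic matrix). Finding the three orthogonal vanishing points therefore recovers the three columns of R directly, each as the normalized `K⁻¹ v` for one axis, up to sign. In other words, rotational drift can be suppressed with geometric constraints alone, without an IMU or feature matching. Attempts to combine this idea with visual odometry appeared in this period.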
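
The recovery of R from vanishing points can be sketched in a few lines. This is an illustrative example under stated assumptions, not any particular system's code: it assumes a pinhole camera with square pixels and no skew (focal length `f`, principal point `cx`, `cy`), finite vanishing points, and noise-free detections. The function names are hypothetical. Each recovered column is defined only up to sign, because a vanishing point does not say which way along the axis the direction points.

```python
import math


def project_direction(direction, f, cx, cy):
    """Vanishing point (pixels) of a 3D direction expressed in the camera frame.

    This is K @ d followed by division by the third coordinate.
    """
    dx, dy, dz = direction
    return (f * dx / dz + cx, f * dy / dz + cy)


def rotation_columns_from_vps(vps, f, cx, cy):
    """Recover the columns of R (up to sign) from three vanishing points.

    For world axis e_i, v_i ~ K R e_i = K r_i, so r_i ~ K^-1 v_i.
    """
    columns = []
    for u, v in vps:
        ray = ((u - cx) / f, (v - cy) / f, 1.0)     # K^-1 [u, v, 1]^T
        norm = math.sqrt(sum(c * c for c in ray))
        columns.append(tuple(c / norm for c in ray))
    return columns


if __name__ == "__main__":
    # Ground-truth rotation R = Rx(20 deg) @ Ry(30 deg), written column by column.
    a, b = math.radians(20.0), math.radians(30.0)
    ca, sa, cb, sb = math.cos(a), math.sin(a), math.cos(b), math.sin(b)
    true_cols = [(cb, sa * sb, -ca * sb), (0.0, ca, sa), (sb, -sa * cb, ca * cb)]

    vps = [project_direction(c, 800.0, 320.0, 240.0) for c in true_cols]
    for est, ref in zip(rotation_columns_from_vps(vps, 800.0, 320.0, 240.0), true_cols):
        print(abs(sum(e * r for e, r in zip(est, ref))))   # close to 1.0
```

With noisy detections the three recovered columns are no longer exactly orthogonal, so a practical pipeline would project the result back onto a valid rotation; that step is omitted here.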

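In long corridors and rectangular rooms, drift really did fall. The problem was everywhere else. Outdoors, around curved structures, or inside irregular industrial sites, the Manhattan-world assumption simply did not hold. A prior fitted tightly to one environment became a liability outside it. As general-purpose visual-inertial odometry matured after 2015, this lineage lost attention. It survives as an auxiliary constraint in some indoor mapping tools, but as an independent research lineage it has disappeared.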

---

## 18.6 How extinct lineages get rediscovered

What it means for a lineage to die differs from case to case. RatSLAM's topological map idea carried on into SeqSLAM, and its descendants are alive in visual place recognition. The object-level map intuition of SLAM++ came back in a different form after 2022, when NeRF and Gaussian splatting were combined with language. [LERF](https://arxiv.org/abs/2303.09553) (Kerr et al., 2023) and [LangSplat](https://arxiv.org/abs/2312.16084) (Qin et al., 2023) are the cases in point.

Event camera SLAM followed a different path. The hardware had simply not arrived yet. After 2022, event cameras of 640×480 and above reached the market, and the need in high-speed drones and HDR environments became clear. The event vision community centred on [Guillermo Gallego](https://arxiv.org/abs/1904.08405) (TU Berlin) produced competitive results in event-based depth estimation and ego-motion estimation between 2020 and 2024.

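Even when the inspiration is good, engineering takes time to catch up, and even when the sensor is new, the algorithms have to be built separately. How long it takes to close that gap depends on the maturity of the algorithms and the pace at which the hardware becomes practical. Whether a better alternative takes the ground first in the meantime is also a variable.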

---

## 🧭 Still open

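**Biologically-inspired SLAM.** The way foundation models form spatial representations through large-scale unsupervised learning shares structural traits with the cognitive map. Reports that units resembling place cells were observed inside transformers appeared in 2023-2024. Whether this is convergence or coincidence is unknown. It remains possible that the RatSLAM-style lineage returns under another name inside the foundation model paradigm.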

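**Mainstreaming of event camera SLAM.** The spread of commercial high-resolution event cameras since 2022 has broadened the research base. But the algorithmic paradigm for processing event data effectively has not yet settled into a stable common framework. Integration with frame-based pipelines, new event representations, and the diversification of real-world benchmarks and evaluation criteria are all under way at once. As of 2026 it is still too early to judge whether it will go mainstream.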

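**Where the "semantic map" concept is heading.** After the overheating of semantic SLAM around 2017 cooled, semantic representation was pushed outside SLAM, into downstream tasks. In 2023-2025, LERF and language-embedded Gaussian splatting produced a different form by combining language features with dense scene representations. Whether this leads to internalization or stays downstream again is unknown. The crux is whether the pattern that geometry must be right before semantics is useful repeats this time too, or whether a change in representation itself reverses that order. Of the things treated as "solved" as of 2026, the names a future writer will add to this chapter are probably among those that have not yet shown themselves.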

---

# Ch.19 — Today's Map and Tomorrow's Blanks

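Ch.0 described the landscape of 2026 like this: AR layers stick to walls, indoor delivery robots tell the kitchen from the conference room without a map, and a few photos thrown at a DUSt3R-family model yield 3D structure within seconds. That description is accurate. It both supports this book's premise and undermines it.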

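What has been solved is the problem of 2003. A static scene, stable lighting, a bounded space, the geometry of a monocular camera: on top of these assumptions the EKF worked, graph SLAM closed loops, and ORB-SLAM managed keyframes. Each answer is a real answer, and each assumption was a seriously chosen simplification.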

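The final section of each of the 18 chapters carries the same marker: Still open. This chapter spreads out in one place the flags that each chapter planted right beside the spot it declared solved.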

---

## 19.1 Lighting and environmental change: the reality cameras cannot handle

One problem has followed visual SLAM since the moment it went outdoors. The conditions a camera's photometric model cannot handle are always the first to show up in the field.

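Learned descriptors outperform ORB within their training domain but are inconsistent in underwater, thermal, and low-light settings, and as of 2026 there is no consensus on which is better (see Ch.2 §2.7). The low-light and dynamic tracking failures recorded in Ch.5 persist. The reason PTAM limited its own scope to "Small AR Workspaces" in 2007 still survives as an implicit assumption in most feature-based SLAM (see Ch.5 §🧭).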

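For direct methods the problem is more structural. The foundational premise of brightness constancy collapses immediately under auto-exposure, backlighting, and tunnel-to-outdoor transitions, and there is no complete solution that estimates the illumination model dynamically (see Ch.8 §🧭). Place recognition has faced the same wall in the same place for a decade. Although [DINOv2](https://arxiv.org/abs/2304.07193)-based methods have narrowed the gap, no single model links snow-covered winter to leafy summer at 99% accuracy under the extreme seasonal and illumination changes of [Nordland](https://nikosuenderhauf.github.io/projects/placerecognition/) and [Oxford RobotCar](https://robotcar-dataset.robots.ox.ac.uk/) (see Ch.10 §10.7).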

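ORB-SLAM's long-term map reuse is blocked at the same boundary. Atlas made multi-map operation possible, but recognizing the evening with a map built in the morning fails in the face of lighting (see Ch.7 §🧭). Ch.2, 5, 7, 8, and 10 simply reported the same wall, each in its own language.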

---

## 19.2 The static-world assumption: the limits of the oldest simplification

The static-world assumption is SLAM's oldest simplification. It is also the assumption on which the most chapters planted their own flags.

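In the SfM lineage, dynamic objects are a shared weakness of every current system including COLMAP, and as of 2026 no Dynamic SfM implementation has COLMAP-level generality (see Ch.3 §3.7). Everything from KinectFusion to BundleFusion rests on the static-scene premise, and the attempts by DynaSLAM and MaskFusion to couple in real-time segmentation fall short of deployment level in both cost and robustness (see Ch.9 §🧭).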

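In monocular depth, self-supervised methods dodge moving objects by masking them (see Ch.11 §🧭). 3DGS SLAM still rested on the static-world assumption in 2025; [4DGS](https://arxiv.org/abs/2310.08528) and [Deformable 3DGS](https://arxiv.org/abs/2309.13101) are exploring the time dimension, but there is no unified approach for the SLAM setting (see Ch.15 §🧭). LiDAR SLAM is not exempt either. The dynamic-handling problem Zhang foresaw in 2014 is where it was, and the in-house solutions of Waymo and Argo AI are not public algorithms (see Ch.17 §🧭). That the same question returns in five chapters probably means the right approach itself has not yet appeared.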

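The long-term dynamic/deformable items harvested in [Ch.15b](chapter_15b_dynamic.md) sit at the same level. **Absence vs evidence of absence** (has the object gone, or is it occluded?) got a partial answer from [Schmid's Panoptic Multi-TSDF](https://doi.org/10.1109/LRA.2022.3148854) (2022), but the decision error is large in large-scale outdoor scenes and under occlusion above 60%. **Floating Map Ambiguity** (separating rigid camera motion from rigid object motion) is only sidestepped with isometric or visco-elastic priors; the prior-free identifiability conditions are unresolved. There is no change-aware online integrated system at the level of Khronos for monocular RGB, and in medical MIS robustness drops once one moves beyond phantom and ex vivo settings into real surgery. The four items from Ch.15b remain open here again.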

---

## 19.3 Scale and representation memory: when the size changes, the problem changes

Each time a SLAM system expanded from a single room to a building, and from a building to a city, the same question returned in a new form.

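Monocular scale ambiguity is a geometric fact that SfM theory had already proven in the 1980s. It can be sidestepped with an IMU or a depth sensor, but the question of holding metric scale with a pure monocular camera keeps coming back in new forms (see Ch.5 §🧭). In Ch.11 the same question reappears in another language. [Metric3D v2](https://arxiv.org/abs/2404.15506) and [Depth Anything v2](https://arxiv.org/abs/2406.09414) delivered intrinsic-conditioned metric depth, but situations where the intrinsics are unknown (smartphones, CCTV, archives, satellites) are common, and camera-independent metric depth is not easy even at foundation scale (see Ch.11 §🧭).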
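
The ambiguity itself fits in a few lines of arithmetic. The sketch below is a toy under stated assumptions (an ideal pinhole camera with identity rotation, a hypothetical focal length, made-up point and camera positions): scaling the whole scene and the camera baseline by the same factor leaves every pixel measurement unchanged, so images alone cannot fix that factor.

```python
import math


def project(point, cam_center, focal=500.0):
    """Pinhole projection of a 3D point seen from a camera at cam_center.

    The camera has identity rotation; coordinates are relative to the
    principal point.
    """
    x, y, z = (p - c for p, c in zip(point, cam_center))
    return (focal * x / z, focal * y / z)


def scale_world(vec, s):
    """Scale a 3D position (scene point or camera centre) by s."""
    return tuple(s * v for v in vec)


if __name__ == "__main__":
    point = (1.0, 2.0, 10.0)
    cameras = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0)]   # two views, 0.5 apart
    for s in (0.1, 3.0, 42.0):
        for cam in cameras:
            original = project(point, cam)
            scaled = project(scale_world(point, s), scale_world(cam, s))
            same = all(math.isclose(a, b, rel_tol=1e-9)
                       for a, b in zip(original, scaled))
            print(s, cam, same)
```

A known baseline (stereo), an accelerometer, or a depth measurement breaks the tie by pinning one length in metres, which is the sidestep the paragraph above refers to.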

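In the TSDF lineage, the memory problem surfaced as a limit of the representation. [Voxblox](https://arxiv.org/abs/1611.03631) and [OctoMap](https://octomap.github.io/) cut the cost, yet dense representations of a building floor or a city block still run to tens of GB, and there is no general solution for an adaptive-resolution map that decides automatically which resolution to use where (see Ch.9 §🧭). NeRF-SLAM hit the same ceiling: city scale remains open (see Ch.14 §🧭). In Gaussian Splatting the number of Gaussians grows linearly with scene size, reaching hundreds of thousands indoors and tens of millions outdoors; compression in the vein of [Compact 3DGS](https://arxiv.org/abs/2311.13681) (Lee et al. 2024) is being explored, but there is no agreed method (see Ch.15 §🧭). In foundation 3D the problem is recast as a physical limit of the transformer. A memory requirement quadratic in the number of images is realistic at 100 images, but 1,000 or 10,000 images is a different problem, and Spann3R's incremental approach is a partial answer (see Ch.16 §🧭). The representation changes; the wall of size stays where it was.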
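
The quadratic growth is easy to put in numbers. The following back-of-envelope sketch assumes dense global self-attention over all image tokens and a hypothetical 1,024 tokens per image; neither figure is taken from a specific model, and memory-efficient attention kernels change the constant but not the pairwise count.

```python
def attention_pairs(n_images, tokens_per_image):
    """Token-to-token pairs in one dense global self-attention pass."""
    n_tokens = n_images * tokens_per_image
    return n_tokens * n_tokens


def growth_vs_baseline(n_images, baseline_images=100, tokens_per_image=1024):
    """How many times more pairs than the baseline image count requires."""
    return (attention_pairs(n_images, tokens_per_image)
            // attention_pairs(baseline_images, tokens_per_image))


if __name__ == "__main__":
    for n in (100, 1_000, 10_000):
        print(n, growth_vs_baseline(n))
    # 100 -> 1, 1000 -> 100, 10000 -> 10000
```

Going from 100 to 10,000 images multiplies the image count by 100 but the pairwise cost by 10,000, which is why incremental schemes that avoid attending over everything at once are attractive.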

The other face of the size problem is the **cost of data movement**. It is not capacity but the physical cost of moving bits between processor and memory that eats power. In Handbook Ch.18 §18.8, Davison proposes "on-device data movement, measured in bits × millimetres" as the twelfth SLAM metric, redefining the metric in the language of hardware engineering. The claim by [Hughes et al.](https://doi.org/10.15607/RSS.2022.XVIII.050) that a hierarchical scene graph compresses storage from $O(L \cdot V/\delta^3)$ for a flat voxel map to $O(N_\text{sub} + N_\text{obj} + N_\text{rooms})$ belongs to the same context (Handbook Ch.16 Eq. 16.34-16.36). How widely Davison's twelfth metric will be accepted is unsettled.

---

## 19.4 Uncertainty calibration in learning-based systems

Ever since Julier and Uhlmann proved the inconsistency of the EKF (Ch.4), the question "how accurately does a SLAM system know that it does not know where it is?" has remained with the field.

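Non-Gaussian uncertainty touches the core assumption of the EKF. Real sensor errors are often multi-modal or heavy-tailed, and while Stein particles, normalizing flows, and learned uncertainty have been tried, real-time validation is limited (see Ch.4 §4.8). In graph SLAM, the choice of robust cost function also leans on intuition: there is no principled way to decide in advance which kernel among Huber, Cauchy, and Geman-McClure fits a given environment and sensor (see Ch.6 §🧭). The tightness bound in [Ch.6b](chapter_06b_certifiable.md) sits at the same level. SE-Sync's exact recovery gives only the sufficient condition that noise stay below $\beta$, and there is no way to compute $\beta$ in advance for a real instance. Extending certifiable methods to visual SLAM and VIO, and online certification (re-solving the SDP to update the certificate when new measurements arrive), also remain open.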
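
To see what the kernel choice actually changes, it helps to compare the weights the three kernels assign to the same residual in iteratively reweighted least squares, where the weight is $w(r) = \rho'(r)/r$. The sketch below uses the common textbook forms of the kernels; the scale parameters `k` and `c` are hypothetical tuning values, and libraries differ in how they parameterize them.

```python
def huber_weight(r, k=1.0):
    """Huber: quadratic inside |r| <= k, linear outside."""
    a = abs(r)
    return 1.0 if a <= k else k / a


def cauchy_weight(r, c=1.0):
    """Cauchy: rho(r) = (c^2 / 2) * log(1 + (r/c)^2)."""
    return 1.0 / (1.0 + (r / c) ** 2)


def geman_mcclure_weight(r, c=1.0):
    """Geman-McClure: rho(r) = (r^2 / 2) / (1 + (r/c)^2)."""
    return 1.0 / (1.0 + (r / c) ** 2) ** 2


if __name__ == "__main__":
    print("residual  huber  cauchy  geman-mcclure")
    for r in (0.5, 1.0, 4.0, 20.0):
        print(r, huber_weight(r), cauchy_weight(r), geman_mcclure_weight(r))
```

At a residual of 4 scale units the three kernels already disagree by more than an order of magnitude on how much an edge should count, so a wrong loop closure that Geman-McClure all but ignores can still pull noticeably on the solution under Huber. That spread is the reason the choice matters and why picking it by intuition is uncomfortable.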

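For learning-based methods the problem is sharper. Even after the failure of Bayesian PoseNet, whether learned uncertainty is calibrated on OOD inputs remains open (see Ch.12 §🧭). As seen in the DROID-SLAM lineage, learned priors degrade silently outside the training domain: learned failures show up in plausible-looking form, whereas geometric failures are explicit. Even with synthetic data such as [TartanAir](https://arxiv.org/abs/2003.14338), a sim-to-real gap remains (see Ch.13 §🧭).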

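In foundation 3D the problem leads to a redefinition of loop closure. In the DUSt3R family, MASt3R-SLAM handles the propagation of pointmap-based corrections with conventional methods, but whether that is a principled solution is uncertain (see Ch.16 §🧭). Autonomous driving and medical robotics require calibrated uncertainty, yet systems at that level are rare.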

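Davison reformulates the problem in Handbook Ch.18: when a network has built a 3D model from 100 images and one more image arrives, does everything have to be run again? (p.528). The moment long-term representation and fusion are admitted, probabilistic state estimation and a modular scene representation become necessary. The alternative he points to, [GBP Learning](https://arxiv.org/abs/2312.14294) (Nabarro et al.), puts neural network weights into a factor graph as random variables, in a direction that erases the distinction between *"training time"* and *"test time"* (p.543). Whether this is a principled answer or a relocation of the problem is too early to judge.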

---

## 19.5 Sensor fusion and new modalities: integration unfinished

Visual SLAM and LiDAR SLAM solved the same problem in different languages over the same period. The two lineages have never substantively merged.

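LVI-SAM attached visual odometry to LIO-SAM, but its visual and LiDAR subsystems stayed loosely coupled, and in scenarios that autonomous driving cannot avoid, such as fog and rain, the algorithmic and calibration difficulty of tightly coupled fusion is still a barrier (see Ch.17 §🧭). The algorithmic gap opened by the spread of solid-state LiDAR sits at the same level. Unlike the 360° spinning scan that LOAM-style pipelines assumed, the non-repetitive scan patterns of Livox and RoboSense sensors need separate research, and the level of generalization is still inadequate (see Ch.17 §🧭).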

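Wide-baseline matching is another angle on fusion. Beyond 45 degrees of viewpoint change, Harris and ORB performance drops sharply; DUSt3R opened a breakthrough by avoiding matching itself, but whether that is the end of the descriptor problem or a detour around it is too early to judge (see Ch.2 §2.7). The integration of place recognition and metric localization is likewise a disconnect at the pipeline level. There were attempts in 2023-2025 to unify the two processes in a single representation, but none has achieved precision and speed at once (see Ch.10 §10.7).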

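Event cameras show how far algorithms lag when a modality is new. Commercial high-resolution event cameras have spread since 2022, but integration with frame-based pipelines, event representations, and real-world benchmarks are all still in progress at once (see Ch.18 §🧭). The order is the same as when Kinect launched in 2010 and KinectFusion followed a year later: hardware first, algorithms after.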

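There are modalities this book left out of scope: **4D imaging radar** and **legged/proprioceptive SLAM**. Radar is the only commercial sensor that covers the conditions, fog and rain, in which cameras and LiDAR fail together, and with Oxford Radar RobotCar (2019), NuScenes, and, since 2023, 4D imaging radar (Arbe, Mobileye), it has entered the autonomous-driving mainstream. Legged SLAM opened a separate lineage of kinematic and contact prior fusion alongside the outdoor deployment of ANYmal, Spot, and Unitree robots in the 2020s. Both have origins and benchmarks distinct from visual, LiDAR, and foundation 3D, and each is large enough to need its own history.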

---

## 19.6 Reuniting computational structure and hardware

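An axis that rarely appeared in histories of SLAM moved to the front in Davison's Handbook Ch.18 in the second half of the 2020s: the problem of matching the graph structure of the algorithm to the graph structure of the silicon.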

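With the breakdown of Dennard scaling, single-core clock speed has been stuck near 4 GHz since the mid-2000s (*"this has stopped being true"*, Handbook Ch.18, p.528), while the constraint on wearable Spatial AI remains a single pair of glasses: 65 g, under 1 W. That gap pushes the field toward **heterogeneous, specialized, parallel** architectures.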

Concrete silicon examples had accumulated by the mid-2020s. [Apple Vision Pro R1](https://www.apple.com/apple-vision-pro/specs/) (2023) carries a dedicated chip that processes sensor data within 12 ms, and [Meta ARIA Gen 2](https://www.projectaria.com/ariagen2/) (2025) uses custom silicon for "ultra low power and on-device machine perception." The [Graphcore IPU](https://www.graphcore.ai/products/ipu) connects thousands of cores through local memory and message passing, Manchester's [SCAMP5](https://personalpages.manchester.ac.uk/staff/p.dudek/papers/carey-iscas2013.pdf) performs 256×256 per-pixel in-plane processing at 1.2 W, and [SpiNNaker](https://apt.cs.manchester.ac.uk/projects/SpiNNaker/) runs as a neuromorphic structure of up to one million ARM cores. Each demands a different graph topology, and there is not yet a systematic theory of how to map which algorithm onto which silicon.

On this axis Davison's later research track, **Gaussian Belief Propagation**, found its place. [Ortiz et al.](https://arxiv.org/abs/2203.11618) (2022) accelerated bundle adjustment 30× over CPU using GBP on an IPU, and [Murai et al. Robot Web](https://arxiv.org/abs/2306.04620) (2024) showed multi-robot SLAM in which several robots share fragments of a factor graph over Wi-Fi and converge through asynchronous message passing. *"We must get away from the idea that a 'god's eye view' of the whole structure of the graph will ever be available"* (Handbook Ch.18, p.541) is the philosophy of this lineage. With the factor graph as the master representation and the full posterior given up, messages "bubble" across the graph and converge locally. Whether this approach will combine with transformer-based systems such as MASt3R-SLAM, or remain a separate branch to the end, has no answer yet.
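
What "messages converging locally" means can be shown on the smallest possible case. The following is a toy sketch under stated assumptions, not the IPU bundle adjustment or Robot Web code: three scalar poses on a chain, linear relative measurements `x_b - x_a = z`, synchronous message updates, and Gaussians kept in information form (`eta = mean * precision`, `lam = precision`). The function name `gbp_chain` and its argument layout are hypothetical.

```python
def gbp_chain(priors, edges, n_iters=10):
    """Scalar Gaussian belief propagation on a graph of relative measurements.

    priors: list of (eta, lam) per variable; lam == 0.0 means "no prior".
    edges:  list of (a, b, z, prec) meaning x_b - x_a = z with precision prec.
    Returns a list of (mean, variance) marginals.
    """
    # One message per (edge, variable) pair, initialised to "no information".
    msgs = {}
    for e, (a, b, _, _) in enumerate(edges):
        msgs[(e, a)] = (0.0, 0.0)
        msgs[(e, b)] = (0.0, 0.0)

    def belief(v, exclude=None):
        """Prior plus incoming messages, optionally leaving one edge out."""
        eta, lam = priors[v]
        for (e, u), (m_eta, m_lam) in msgs.items():
            if u == v and e != exclude:
                eta += m_eta
                lam += m_lam
        return eta, lam

    for _ in range(n_iters):
        new_msgs = {}
        for e, (a, b, z, prec) in enumerate(edges):
            eta_a, lam_a = belief(a, exclude=e)
            eta_b, lam_b = belief(b, exclude=e)
            # Marginalise the other variable out of the factor; each node
            # only needs what its neighbours sent, never the whole graph.
            new_msgs[(e, b)] = (prec * (eta_a + lam_a * z) / (prec + lam_a),
                                prec * lam_a / (prec + lam_a))
            new_msgs[(e, a)] = (prec * (eta_b - lam_b * z) / (prec + lam_b),
                                prec * lam_b / (prec + lam_b))
        msgs.update(new_msgs)

    marginals = []
    for v in range(len(priors)):
        eta, lam = belief(v)
        marginals.append((eta / lam, 1.0 / lam))
    return marginals


if __name__ == "__main__":
    # x0 ~ N(0, 1); odometry says each step is +1; a second prior says x2 ~ N(3, 1).
    priors = [(0.0, 1.0), (0.0, 0.0), (3.0, 1.0)]
    edges = [(0, 1, 1.0, 1.0), (1, 2, 1.0, 1.0)]
    for mean, var in gbp_chain(priors, edges):
        print(round(mean, 6), round(var, 6))
```

On a tree-shaped graph such as this chain the result matches the exact least-squares marginals (means 0.25, 1.5, 2.75); on graphs with loops GBP is iterative and approximate in the variances, which is part of what "giving up the full posterior" refers to.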

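Of the twelve metrics Davison proposed, number 11 "power usage" and number 12 "on-device data movement" are the new metrics from hardware engineering. The proposal is to evaluate by **power and distance moved** as much as by accuracy, and there is no consensus on whether it will be absorbed into mainstream benchmarks such as TUM, KITTI, and EuRoC. This lies outside the bias of this book, which is algorithm-centred, and that bias is itself being newly questioned in the second half of the 2020s.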

---

## 19.7 The return of semantic representation and the open world

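The verdict of [Ch.18 §18.4](chapter_18_dead_ends.md#184-semantic-slam--object-as-landmark-경로의-축소) that semantics shrank from the landmark role is true in the narrow sense. Neither ORB-SLAM3 nor MASt3R-SLAM uses object-level primitives. Over the same period, however, semantics moved up to a **higher layer of the map** and built a trajectory of real success. It is a branch that the Ch.1-18 narrative did not bring out fully.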

The trajectory is clear. [Kimera](https://doi.org/10.1109/ICRA40945.2020.9196885) (2020) tied together a metric-semantic mesh and a 3D scene graph, and [Hydra](https://doi.org/10.15607/RSS.2022.XVIII.050) (2022) extended this to real-time, hierarchical operation: *"first online system to produce fully hierarchical scene graphs that included objects, places, and rooms"* (Handbook Ch.16, §16.4.2). Foundation features were then layered on top. [ConceptFusion](https://arxiv.org/abs/2302.07241) and [VLMaps](https://arxiv.org/abs/2210.05714) (2023) put CLIP into dense maps, [ConceptGraphs](https://doi.org/10.1109/ICRA57147.2024.10610243) (2024) into open-vocabulary object nodes, [Clio](https://doi.org/10.1109/LRA.2024.3451395) (2024) into a task-driven hierarchy, and [LERF](https://arxiv.org/abs/2303.09553) and [LangSplat](https://arxiv.org/abs/2312.16084) into radiance fields and Gaussian splatting. Semantic SLAM raised its layer of representation.

But this trajectory opened more than it solved. The open problem Hughes and Carlone single out is *"performing uncertainty quantification in hierarchical representations mixing discrete and continuous variables is still a largely unexplored problem"* (p.488). There is no principled answer for propagating uncertainty through a graph that mixes discrete variables such as object category and room ID with continuous variables such as pose and surface. Extending scene graphs to outdoor and unstructured settings, and generalizing the dynamic reconfiguration of task-driven hierarchies (Clio's Information Bottleneck, Handbook Ch.16 Eq. 17.8), are also open.

The bigger question is "are maps still needed?" Paull and the editors take it up directly in Handbook Ch.17 §17.4.2, "Revisiting the Question of the Need for Maps." If every past frame is fed into a long-context VLM, is planning possible without an explicit scene graph? The results from [OpenEQA](https://open-eqa.github.io/) and [Mobility VLA](https://arxiv.org/abs/2407.07775) (2024) are that map-free approaches work for short, simple tasks but fail as the spatial and temporal horizons lengthen. *"the need for an explicit map representation ... largely depend[s] on the spatial and temporal horizons of the considered tasks and remains an active area of research"* (p.515). Neither a declaration that the question is settled nor a declaration that maps are unnecessary has been made.

The relationship between SLAM and generative robot policies sits on the same horizon. Do VLA models such as [RT-2](https://robotics-transformer2.github.io/) (2023), [OpenVLA](https://arxiv.org/abs/2406.09246) (2024), and [π₀](https://www.physicalintelligence.company/blog/pi0) (2024) replace SLAM, or stand on top of it? The Handbook's **last sentence** answers. *"true generalization and scalability to compositional tasks ... could be achieved through some form of explicit structure that is learned through a process such as SLAM. ... these two paradigms ... are entirely complementary"* (Paull/Carlone, Handbook Ch.17, p.520). Its 527 pages converge on one sentence: the two lineages need each other. This is the position closest to consensus, but what architectural coupling "complementary" refers to is still open.

---

## 19.8 The structure of the open questions

Gather the open items from the 18 chapters this book traced and a pattern appears.

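The open problems have not all stayed open in the same way. The monocular scale ambiguity of Ch.5 is a geometric fact already proven in SfM theory, and in 2026 it stands in the same formulation. The static-world assumption, by contrast, has kept returning for 20 years in changing forms. It appeared differently each time: in the language of SfM in Ch.3, of dense SLAM in Ch.9, of Gaussians in Ch.15, and of LiDAR in Ch.17. How to redefine loop closure in the foundation 3D lineage, and how to calibrate learned uncertainty, were named as problems only recently; as of 2026 they have carried that name for just a few years.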

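Ch.0 described an era in which SLAM is considered solved. That description is accurate. When the five editors of the SLAM Handbook, also in 2026, jointly wrote in its Epilogue *"If someone tells you 'SLAM is solved,' don't listen to them"*, they were looking at the same landscape from the inside. The history of SLAM has been a process of learning when to let go of what. The moment an assumption is let go, a problem that had been closed returns in a new form. When the EKF's linear assumption was set down, particle filters followed; when sparse features were let go, dense methods took their place; and when geometric priors were let go, learned priors did. Each transition was a move to a new system of assumptions.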

Most of what is considered solved in 2026 also sits somewhere in this cycle. When the assumptions we are sure of today start to shake, the blanks open up again.

---

## 19.9 Lineage sketch map

```mermaid
graph TD
 PM[Photogrammetry 1858]
 BA[Bundle Adjustment Brown 1958]
 SfM[Photo Tourism 2006]
 COLMAP[COLMAP 2016]

SC[Smith-Cheeseman 1986]
  Mono[MonoSLAM 2003]
  PTAM[PTAM 2007]
  ORB[ORB-SLAM 2015]
  ORB3[ORB-SLAM3 2020]

LSD[LSD-SLAM 2014]
  DSO[DSO 2016]
  VIDSO[VI-DSO 2018]

LM[Lu-Milios 1997]
 FG[Factor Graph Dellaert 2000s]
 iSAM[iSAM 2008]
 iSAM2[iSAM2 2012]
 g2o[g2o 2011]

Forster[Preintegration Forster 2016]
 VINS[VINS-Mono 2018]

Kinect[KinectFusion 2011]
  Elastic[ElasticFusion 2015]

SESync[SE-Sync 2019]
  TEASER[TEASER 2020]

LOAM[LOAM 2014]
  FAST[FAST-LIO 2021]

NeRF[NeRF 2020]
  iMAP[iMAP 2021]
  NICE[NICE-SLAM 2021]

GS3D[3DGS 2023]
  Spla[SplaTAM 2024]
  MonoGS[MonoGS 2024]

DROID[DROID-SLAM 2021]
  DPV[DPV-SLAM 2024]

DUSt3R[DUSt3R 2023]
  MASt[MASt3R 2024]
  VGGT[VGGT 2025]
  MASlam[MASt3R-SLAM 2025]

Hydra[Hydra 2022]
  Clio[Clio 2024]

PM --> BA --> SfM --> COLMAP
  SC --> Mono --> PTAM --> ORB --> ORB3
  PTAM -.-> LSD --> DSO --> VIDSO
  LM --> FG --> iSAM --> iSAM2
  FG --> g2o
  iSAM2 -.-> SESync --> TEASER
  Forster --> VINS --> ORB3
  Forster --> VIDSO
  Kinect --> Elastic
  Elastic -.-> LOAM --> FAST
  NeRF --> iMAP --> NICE
  NICE -.-> GS3D --> Spla --> MonoGS
  PTAM -.-> DROID --> DPV
  COLMAP -.-> DUSt3R --> MASt --> VGGT
  MASt --> MASlam
  ORB3 -.-> Hydra --> Clio

click PM "#chapter-1" "Ch.1 Prehistory — Photogrammetry"
  click BA "#chapter-1" "Ch.1 Prehistory — Bundle Adjustment"
  click SfM "#chapter-3" "Ch.3 Structure from Motion"
  click COLMAP "#chapter-3" "Ch.3 SfM — COLMAP"
  click SC "#chapter-4" "Ch.4 EKF-SLAM — Smith-Cheeseman"
  click Mono "#chapter-5" "Ch.5 MonoSLAM·PTAM"
  click PTAM "#chapter-5" "Ch.5 MonoSLAM·PTAM"
  click ORB "#chapter-7" "Ch.7 ORB-SLAM 계열"
  click ORB3 "#chapter-7" "Ch.7 ORB-SLAM3"
  click LSD "#chapter-8" "Ch.8 Direct Methods — LSD-SLAM"
  click DSO "#chapter-8" "Ch.8 Direct Methods — DSO"
  click VIDSO "#chapter-8" "Ch.8 Direct Methods — VI-DSO"
  click LM "#chapter-6" "Ch.6 Graph SLAM — Lu-Milios"
  click FG "#chapter-6" "Ch.6 Graph SLAM — Factor Graph"
  click iSAM "#chapter-6" "Ch.6 Graph SLAM — iSAM"
  click iSAM2 "#chapter-6" "Ch.6 Graph SLAM — iSAM2"
  click g2o "#chapter-6" "Ch.6 Graph SLAM — g2o"
  click Forster "#chapter-7" "Ch.7b IMU Preintegration (after Ch.7)"
  click VINS "#chapter-7" "Ch.7 — VINS-Mono"
  click Kinect "#chapter-9" "Ch.9 RGB-D — KinectFusion"
  click Elastic "#chapter-9" "Ch.9 RGB-D — ElasticFusion"
  click SESync "#chapter-6" "Ch.6b Certifiable (after Ch.6)"
  click TEASER "#chapter-6" "Ch.6b Certifiable — TEASER"
  click LOAM "#chapter-17" "Ch.17 LiDAR — LOAM"
  click FAST "#chapter-17" "Ch.17 LiDAR — FAST-LIO"
  click NeRF "#chapter-14" "Ch.14 NeRF-SLAM"
  click iMAP "#chapter-14" "Ch.14 NeRF-SLAM — iMAP"
  click NICE "#chapter-14" "Ch.14 NeRF-SLAM — NICE-SLAM"
  click GS3D "#chapter-15" "Ch.15 Gaussian Splatting"
  click Spla "#chapter-15" "Ch.15 — SplaTAM"
  click MonoGS "#chapter-15" "Ch.15 — MonoGS"
  click DROID "#chapter-13" "Ch.13 Hybrid — DROID-SLAM"
  click DPV "#chapter-13" "Ch.13 — DPV-SLAM"
  click DUSt3R "#chapter-16" "Ch.16 Foundation 3D — DUSt3R"
  click MASt "#chapter-16" "Ch.16 — MASt3R"
  click VGGT "#chapter-16" "Ch.16 — VGGT"
  click MASlam "#chapter-16" "Ch.16 — MASt3R-SLAM"
  click Hydra "#chapter-16" "Ch.16 §16.6 Semantic Foundation"
  click Clio "#chapter-16" "Ch.16 §16.6 — Clio"
```

# Ch.0 — SLAM Solved?

In 2026, you pick up a phone and an AR layer sticks to the wall. Indoor delivery robots tell the kitchen from the conference room without being handed a map. Throw a few photos at a [DUSt3R](https://arxiv.org/abs/2312.14132)-family model and a 3D structure comes out in seconds. These are less demos than products by now, and largely background. So there is a mood that treats SLAM as a more-or-less solved problem.

---

Go back to 2003 and the scene is different. Andrew Davison, in a lab at Imperial College London, demonstrated real-time 3D tracking with one laptop and one webcam. The system, called [MonoSLAM](https://www.doc.ic.ac.uk/~ajd/Publications/davison_iccv2003.pdf), ran at 30Hz on a desktop, watched about ten features per frame, and held a sparse map on the order of a few dozen landmarks. One desk in one room; when the camera left the desk, the map diverged. That was the state of the art.

The state of the art then was a few hundred times below the number of features a phone AR session tracks at any given moment today, and it took 23 years to close that gap. More striking than the gap itself is *which path* it was closed along.

---

SLAM's history is not a single development curve. It is the trace of four separate traditions running independently, then colliding and absorbing each other. Photogrammetrists solved bundle adjustment by hand a century ago. Roboticists began treating maps in the language of probability with [Smith-Cheeseman](https://arxiv.org/abs/1304.3111)'s 1986 stochastic spatial-relations framework, and the name "SLAM" was attached to this problem setting nine years later, in [Durrant-Whyte & Leonard's 1995 survey](https://ieeexplore.ieee.org/document/476131). Computer vision researchers were fixated on real-time feature tracking. And the 2020s deep learning community is trying to absorb all of it into a single network.

The question this book puts is not "how" but "why this way." Was the replacement of EKF-based SLAM by graph-based SLAM a natural technical evolution, or a contingency that a few people's choices decided? Was the split between feature-based and direct methods foreseen from the start? Why has deep learning been so slow to replace the geometry pipeline? Counterfactuals are worth asking only when the alternatives actually existed. This book shows that they did.

---

Tracing that path needs tools. Listing years gives you a chronicle; explaining techniques gives you a textbook. This book reads the history through two lenses: lineage and prediction. Where did an idea come from? And how did the future researchers saw from their vantage point diverge from the future that actually unfolded?

There are four repeating devices in this book. They can serve as guides as you read each chapter.

**Lineage openings** sit in the first paragraph or two of a chapter. They show, through names and years, which intellectual inheritance the chapter's protagonist took on. No idea in SLAM was born in a vacuum. Follow the lineage and the terrain of borrowing becomes visible.

**🔗 Borrowed boxes** are margin annotations that state in one or two sentences where a specific technique came from. Something like "ORB-SLAM's structure here came from Strasdat 2011." Researchers cite, but they often do not make the lineage explicit. The box does.

**📜 Prediction vs. outcome boxes** contrast what the original paper's Conclusion, Future Work, or Summary section pointed to against what actually happened. [Triggs 1999](https://dblp.org/rec/conf/dagstuhl/TriggsMHF99.html)'s bundle adjustment (BA) synthesis paper, in §12 "Summary and Recommendations," left exploitation of large-scale sparse structure as its core guidance. In the 2010s that guidance was fulfilled from another angle, when [COLMAP](https://openaccess.thecvf.com/content_cvpr_2016/papers/Schonberger_Structure-From-Motion_Revisited_CVPR_2016_paper.pdf) turned SfM on tens of thousands of images into an open-source production tool. The direction of the prediction was right; the route was different. The gap between the future a researcher saw from their moment and the future that arrived is what this device is about.

**🧭 Still open** sits at the end of the chapter. These are items on the chapter's subject that, as of 2026, remain unresolved. They pull out the open problems hiding inside the perception that SLAM is solved. Ch.19 harvests these items across all chapters and reassembles them by theme.

---

The book runs in six parts.

**Part 1: Prehistory** traces the tools that photogrammetry and classical computer vision built up before SLAM was born in robotics. It asks why bundle adjustment is still the skeleton of every optimization backend.

**Part 2: The Birth of SLAM** follows the period in which robots first started building their own maps, from Smith-Cheeseman's 1986 stochastic framework to Davison's MonoSLAM. The problem setting was fixed in 1986; the acronym "SLAM" and the standard terminology settled in the community with Durrant-Whyte and Leonard's 1995 survey. How EKF became the dominant paradigm as a tool, and why its limits were structural.

**Part 3: The Parallel Revolution** covers the decade from PTAM splitting mapping and camera tracking in 2007 through graph-based SLAM and loop closure, up to ORB-SLAM. The ten years in which "real-time SLAM" became possible on a desktop.

**Part 4: Methodological Divergence** handles the split between feature-based and direct methods, the arrival of RGB-D, and the process by which place recognition broke off into its own subfield. How different assumptions produced different ecosystems.

**Part 5: The Inflow of Learning** covers monocular depth estimation, end-to-end SLAM, Neural Radiance Fields, and 3D Gaussian Splatting. The speed at which deep learning absorbs the geometry pipeline, and the sources of friction along the way.

**Part 6: Dead Ends and Open Problems** pulls out the failed routes in SLAM's history and the structural unresolved problems still sitting behind today's perception that it is "solved."

---

Setting the scope is what turns the book into a map. Questions like whether foundation models will replace SLAM are not this book's concern. What happened in the past and why is the material. Arguing that some research was wrong is not the goal either. The closer goal is to show what a given choice meant under the constraints of its moment. Homogeneous coordinates, epipolar geometry, and EKF formulas are assumed known. This book's job is tracing lineage, not explaining concepts, and which camera or LiDAR to pick is a different book's subject.

If you want the equations, the theorems, and the proofs laid out systematically, there is the [SLAM Handbook](https://github.com/SLAM-Handbook-contributors/slam-handbook-public-release). Edited by Carlone, Kim, Barfoot, Cremers, and Dellaert, published by Cambridge University Press in 2026, it covers the current theory and systems of SLAM in 18 chapters. This book records the path that led to that state.

One of the lines the five editors left jointly in that Handbook's Epilogue is *"If someone tells you 'SLAM is solved,' don't listen to them."* The "mood that treats it as solved," mentioned at the opening of this chapter, is an observed phenomenon inside the field, not a consensus of the field.

---

When Davison stood in front of his webcam in 2003, he did not know exactly what he was starting. That demo video is still on the internet. The shaky frame, the blinking landmark dots, the sparse map on the order of dozens of points. This book records what happened between there and here.

The record starts well before MonoSLAM. Before the acronym "SLAM" settled in the 1995 survey, even before Smith-Cheeseman wrote a probabilistic map down as equations, photogrammetrists were already recovering 3D structure from cameras. The next chapter traces that prehistory.

---

# Ch.1 — Photogrammetry and Bundle Adjustment: The 100 Years Before Triggs

The skeleton of today's SLAM optimization backend was born in German surveying. In the early twentieth century, Carl Pulfrich's method of hand-computing two-view triangulation on glass plates combined with Albrecht Meydenbauer's photogrammetric system to form a single surveying tradition. That tradition passed through Duane C. Brown's numerical formulation in 1958, and in 1999 Bill Triggs, Philip McLauchlan, Richard Hartley, and Andrew Fitzgibbon translated it into the language of computer vision. Bundle adjustment was not an invention in itself but the work of carrying a century-old surveying inheritance into a language the computer vision community could use. Triggs et al. (1999) inherited the parallax principle from Pulfrich's geometry and the reprojection formulation from Brown's (1958) military surveying. The solver skeleton came from Levenberg-Marquardt.

---

## 1. Early twentieth-century glass plates and stereophotogrammetry

In 1901, [Carl Pulfrich](https://en.wikipedia.org/wiki/Carl_Pulfrich) presented the **stereocomparator**, built by the Zeiss optical works, at the Hamburg conference of natural scientists (this was the formal unveiling, following a prototype stereoscopic rangefinder shown in Munich in 1899). The device photographed the same point from two camera viewpoints and computed distance by reading the coordinate difference on the glass plates. The principle was simple: the parallax between two views is inversely proportional to depth. The mathematics was Greek-era trigonometry; what was new was the precision of the optical instrument.
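
The inverse relation between parallax and depth fits in one line of arithmetic. A minimal sketch, assuming a rectified pinhole stereo pair; the function name and the sample focal length, baseline, and disparities are illustrative values, not figures from Pulfrich's instrument:

```python
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth Z = f * B / d for a rectified stereo pair (pinhole model)."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of both cameras")
    return focal_px * baseline_m / disparity_px

# Halving the parallax doubles the depth.
z_near = depth_from_disparity(focal_px=700.0, baseline_m=0.12, disparity_px=42.0)
z_far = depth_from_disparity(focal_px=700.0, baseline_m=0.12, disparity_px=21.0)
```

The same relation shows why far points are hard: at small disparity, a one-pixel reading error moves the depth estimate a long way.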

A generation earlier, [Albrecht Meydenbauer](https://de.wikipedia.org/wiki/Albrecht_Meydenbauer) had systematized **architectural photogrammetry** for the preservation of buildings. In 1858, after a fall while surveying the exterior of Wetzlar Cathedral, he conceived of using photographs in place of direct measurement. In 1885 he founded the Royal Prussian Photogrammetric Institute (Königlich Preussische Messbild-Anstalt).

These two streams joined into the tradition that carried into twentieth-century aerial surveying — aerotriangulation, in which a plane photographed the terrain from above and two-view photographs yielded three-dimensional maps. It was the age of hand calculators.

> 🔗 **Borrowed.** Modern SLAM's stereo depth estimation runs on the same principle as Pulfrich's stereocomparator. Depth comes from the baseline between two cameras and the parallax. The glass plate of 125 years ago has only become a pixel array.

---

## 2. 1958, Brown, and numerical bundle adjustment

What Pulfrich and Meydenbauer had solved with optical instruments, Brown moved into equations.

[Duane C. Brown](https://digital.hagley.org/08206139_solution) was a surveying engineer inside the United States Air Force ballistic missile development program. He worked on the problem of jointly estimating satellite orbits and ground coordinates — that is, simultaneously optimizing many camera viewpoints and many ground control points.

In his 1958 report "A Solution to the General Problem of Multiple Station Analytical Stereotriangulation" (RCA-MTP Data Reduction Technical Report No. 43, AFMTC-TR-58-8), Brown left one of the early documents that formulated **bundle adjustment (BA)** numerically (Helmut Schmid is named as a co-inventor from the same period).

The core is the **reprojection error**. Minimize the difference between the 2D image coordinate $x_{ij} \in \mathbb{R}^2$ observed in camera $i$ and the predicted coordinate $\pi(K_i, R_i, t_i, X_j)$ obtained by projecting the 3D point $X_j \in \mathbb{R}^3$ through the intrinsic matrix $K_i$ and extrinsic matrix $[R_i | t_i]$:

$$E = \sum_{i,j} \| x_{ij} - \pi(K_i, R_i, t_i, X_j) \|^2$$
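
The sum above can be evaluated directly. The sketch below assumes an ideal pinhole camera without lens distortion; `project`, `reprojection_error`, and the toy cameras and points are hypothetical names and values chosen for illustration:

```python
import numpy as np

def project(K, R, t, X):
    """Pinhole projection pi(K, R, t, X) of 3D points X (N, 3) into pixels (N, 2)."""
    Xc = X @ R.T + t                  # world frame -> camera frame
    uvw = Xc @ K.T                    # apply intrinsics
    return uvw[:, :2] / uvw[:, 2:3]   # perspective division

def reprojection_error(cameras, points, observations):
    """Sum of squared residuals E over all (camera i, point j, pixel x_ij) triples."""
    total = 0.0
    for i, j, x_ij in observations:
        K, R, t = cameras[i]
        r = x_ij - project(K, R, t, points[j][None, :])[0]
        total += float(r @ r)
    return total

K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
cameras = [(K, np.eye(3), np.zeros(3)), (K, np.eye(3), np.array([-0.5, 0.0, 0.0]))]
points = np.array([[0.0, 0.0, 4.0], [1.0, -0.5, 5.0]])
# Noise-free observations: every point seen by every camera.
observations = [(i, j, project(*cameras[i], points[j][None, :])[0])
                for i in range(2) for j in range(2)]
E_true = reprojection_error(cameras, points, observations)
E_perturbed = reprojection_error(cameras, points + 0.01, observations)
```

With the true structure the error is zero; moving the points by a centimetre makes it positive, and that difference is the signal bundle adjustment descends on.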

The name "bundle" comes from the bundle of rays extending out from each camera center to the observed 3D points. Camera poses and point locations are adjusted together so that those rays meet at the 3D points. It took forty years for a technique that began in military and intelligence applications to be absorbed into academia.

> 🔗 **Borrowed.** Bundle techniques from the satellite geolocation field entered the computer vision community in the 1990s. During the years they were held under military classification, academia independently rediscovered the same problem. Triggs 1999 is the confluence point of those two streams.

---

## 3. Levenberg and Marquardt — pioneers of nonlinear optimization

Brown had the objective function to minimize in hand; the tool that would actually solve it came from somewhere else entirely.

Reprojection error minimization is a nonlinear least-squares problem. There is no analytical solution, so iterative numerical optimization is needed.

In 1944, [Kenneth Levenberg](https://cs.uwaterloo.ca/~y328yu/classics/levenberg.pdf) published a method that interpolated between Gauss-Newton and steepest descent with a damping parameter $\lambda$. Larger $\lambda$ moves toward steepest descent for safe convergence; smaller $\lambda$ uses the fast convergence of Gauss-Newton. The damping enters by adding $\lambda \mathbf{I}$ to the Gauss-Newton normal matrix $\mathbf{J}^\top \mathbf{J}$ (the approximate Hessian), not to the objective function itself, which also improves numerical stability. It was twenty years ahead of computer vision. In 1963, [Donald Marquardt](https://epubs.siam.org/doi/10.1137/0111030) independently rediscovered the same idea and formulated it more explicitly. The name settled as the **Levenberg-Marquardt (LM) algorithm**.
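
The damping idea is short enough to write out. This is a minimal sketch of a damped Gauss-Newton loop, not Levenberg's or Marquardt's exact update rule: the factor-of-ten schedule for $\lambda$, the function name, and the exponential toy problem are all assumptions made for illustration.

```python
import numpy as np

def levenberg_marquardt(residual, jacobian, p0, lam=1e-2, iters=50):
    """Minimize 0.5 * ||residual(p)||^2 with a damped Gauss-Newton step."""
    p = np.asarray(p0, dtype=float)
    cost = 0.5 * np.sum(residual(p) ** 2)
    for _ in range(iters):
        r, J = residual(p), jacobian(p)
        H = J.T @ J                      # Gauss-Newton approximation of the Hessian
        g = J.T @ r                      # gradient of the cost
        step = np.linalg.solve(H + lam * np.eye(len(p)), -g)
        new_cost = 0.5 * np.sum(residual(p + step) ** 2)
        if new_cost < cost:              # accept: trust the Gauss-Newton model more
            p, cost, lam = p + step, new_cost, lam * 0.1
        else:                            # reject: lean toward steepest descent
            lam *= 10.0
    return p

# Toy problem: fit y = a * exp(b * x) to noise-free samples.
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(-1.5 * x)
residual = lambda p: p[0] * np.exp(p[1] * x) - y
jacobian = lambda p: np.stack([np.exp(p[1] * x), p[0] * x * np.exp(p[1] * x)], axis=1)
p_hat = levenberg_marquardt(residual, jacobian, p0=[1.0, 0.0])
```

In bundle adjustment the parameter vector holds every camera and every point, and the residual is the reprojection error; the loop itself has the same shape.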

It took about another thirty-five years for the LM algorithm to become the standard BA solver in computer vision, not because the technology was missing but because of the walls between fields.

---

## 4. 1999, Triggs et al. — a hundred years of inheritance integrated

Thirty-five years after Levenberg-Marquardt had readied the numerical tool, computer vision finally picked it up.

At the 1999 Vision Algorithms Workshop, Bill Triggs, Philip McLauchlan, Richard Hartley, and Andrew Fitzgibbon presented ["Bundle Adjustment — A Modern Synthesis"](https://link.springer.com/chapter/10.1007/3-540-44480-7_21).

What this paper did was translate and synthesize the BA theory scattered across twentieth-century surveying and aerial photogrammetry into the language of the computer vision community. Triggs et al. contributed two things. First, they made the structural properties of sparse BA explicit. Using the sparse block structure of the Hessian matrix (the Schur complement trick), joint camera-point optimization can be performed far more efficiently. Second, they treated gauge freedom (the arbitrariness of the reference frame) explicitly.
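
The Schur complement trick can be checked on a toy system. The sketch below uses a random stand-in Jacobian with the bundle-adjustment sparsity pattern (each observation touches one camera block and one point block) in place of real projection derivatives; the sizes, the damping value, and all variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cam, n_pt = 4, 10                       # 6 parameters per camera, 3 per point
dc, dp = 6 * n_cam, 3 * n_pt

# Each 2-row observation block touches exactly one camera and one point.
J = np.zeros((2 * n_cam * n_pt, dc + dp))
row = 0
for i in range(n_cam):
    for j in range(n_pt):
        J[row:row + 2, 6 * i:6 * i + 6] = rng.standard_normal((2, 6))
        J[row:row + 2, dc + 3 * j:dc + 3 * j + 3] = rng.standard_normal((2, 3))
        row += 2
r = rng.standard_normal(J.shape[0])

H = J.T @ J + 1e-2 * np.eye(dc + dp)      # damped normal equations
g = -J.T @ r
B, E, C = H[:dc, :dc], H[:dc, dc:], H[dc:, dc:]

# C is block-diagonal, so inverting it is n_pt independent 3x3 inversions.
C_inv = np.zeros_like(C)
for j in range(n_pt):
    s = slice(3 * j, 3 * j + 3)
    C_inv[s, s] = np.linalg.inv(C[s, s])

# Reduced camera system: (B - E C^-1 E^T) dx_cam = g_cam - E C^-1 g_pt
S = B - E @ C_inv @ E.T
dx_cam = np.linalg.solve(S, g[:dc] - E @ C_inv @ g[dc:])
dx_pt = C_inv @ (g[dc:] - E.T @ dx_cam)   # back-substitution for the points

dx_direct = np.linalg.solve(H, g)         # dense reference solution
```

The only dense solve left is of size `dc`, the number of camera parameters. In real problems points outnumber cameras by orders of magnitude, which is where the saving comes from.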

Seven years after this paper, Noah Snavely's [Photo Tourism (2006)](https://phototour.cs.washington.edu/Photo_Tourism.pdf) automatically reconstructed famous landmarks like Notre-Dame and the Trevi Fountain from hundreds of photographs scattered across the Internet. Ten years after that, Johannes Schönberger's [COLMAP (2016)](https://openaccess.thecvf.com/content_cvpr_2016/papers/Schonberger_Structure-From-Motion_Revisited_CVPR_2016_paper.pdf) open-sourced robust incremental structure from motion (SfM) at the scale of tens to hundreds of thousands of images, bringing a research stream that had already reached the million-image range into a tool anyone could reproduce. Without Triggs's language, that path would have been much slower.

---

## 5. Reprojection Error — the formation of the concept

If Triggs et al. described which error function was being minimized, it is worth tracing separately how that function itself settled into its current form.

There were two transitions before this error function took its present shape.

Early twentieth-century aerial triangulators measured error as "distance difference in the ground coordinate frame." Because the comparison was made directly in 3D space, a misaligned camera lens or poor calibration would dissolve into the ground coordinate residual and become invisible.

Brown moved the object of comparison to the image plane in his 1958 report. The method matches "the projected location of a 3D point in the image" to "the actual image observation" in image-plane units (plate coordinates in 1958, pixels today). Calibration error, lens distortion, and extrinsic parameter error all surface together in one residual. It is also cleaner statistically. Camera image noise can be modeled as an isotropic Gaussian in pixel units, and under that model reprojection error minimization becomes maximum likelihood estimation.
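
The maximum-likelihood claim takes one line to spell out. As a sketch, assume each observation is the prediction plus independent isotropic Gaussian noise, $x_{ij} = \pi(K_i, R_i, t_i, X_j) + \epsilon_{ij}$ with $\epsilon_{ij} \sim \mathcal{N}(0, \sigma^2 \mathbf{I}_2)$. The noise level $\sigma$ and the shorthand $\theta$ for all cameras and points are notation introduced here, not Brown's. The negative log-likelihood is then:

$$-\log p(\{x_{ij}\} \mid \theta) = \frac{1}{2\sigma^2} \sum_{i,j} \| x_{ij} - \pi(K_i, R_i, t_i, X_j) \|^2 + \text{const}$$

The constant does not depend on $\theta$, so maximizing the likelihood and minimizing $E$ select the same cameras and points. When the noise is not Gaussian (outlier matches, for instance), that equivalence no longer holds.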

Triggs et al. (1999) polished that formulation into the language of computer vision textbooks and standardized it. This reprojection error minimization is, as of 2026, the core measurement function of factor graph-based SLAM backends.

> 🔗 **Borrowed.** The observation model for a visual landmark in SLAM, $z = \pi(K, T, p) + \epsilon$, directly inherits Brown's (1958) reprojection formula. A SLAM backend that minimizes this with Gauss-Newton has the same mathematical structure as a 1958 aerial triangulation solver.

---

## 6. The skeleton of the SLAM backend — through 2026

[Durrant-Whyte and Leonard's 1995 survey](https://ieeexplore.ieee.org/document/476131) fixed the acronym "SLAM" as standard terminology, but the mathematics of that backend inherits Brown's 1958 reprojection formulation, traced in this chapter, almost unchanged. Look at today's SLAM optimization backends. ORB-SLAM3 jointly optimizes SE(3) poses and 3D landmark locations through g2o. LIO-SAM smooths its GTSAM factor graph incrementally with iSAM2. DROID-SLAM gets its update direction from GRU-based optical flow, but the final bundle adjustment layer still uses the Schur complement trick.

Lie groups and factor graphs replaced the matrix notation of 1999, and neural networks took over descriptor computation, but the substance of the computation is unchanged. The reprojection error of points observed from multiple viewpoints is minimized to estimate camera poses and the map together. Pulfrich's glass plate has become a pixel array, and hand calculation has become the GPU — that is all.

This continuity is the field's strength and its weakness. Strength: a hundred years of convergence proofs and practical validation come along for free. Weakness: when BA's assumptions (static world, point features, Gaussian noise) break against the real environment, there is no alternative in hand.

---

> 📜 **Prediction vs. outcome.** Triggs et al. (1999) named scaling BA to large problems (thousands of cameras, millions of points) as the main challenge. That direction was achieved over the following twenty years. In 2006, Snavely's Photo Tourism reconstructed landmarks from hundreds of Internet photographs; in 2016, COLMAP standardized the robust incremental SfM implementation of that line. It was not, however, the "direct scaling" Triggs imagined. What arrived was an engineering layer — incremental BA and visibility graph pruning, with vocabulary tree loop closure on top. `[hit]`

---

## 🧭 Still open

**Global optimum guarantees for nonlinear BA.** The LM algorithm converges to a local minimum. With a bad initial value, it converges to the wrong structure. Methods for initialization (the 5-point algorithm, PnP, epipolar geometry estimation) appeared in turn, but these themselves depend internally on RANSAC and iterative optimization. Convex relaxation approaches that guarantee a global optimum in large-scale environments are being researched, but they are not yet practical at the speed and scale of real-time SLAM.

**The gap between photogrammetric accuracy and Visual SLAM.** Aerial photogrammetry standardly demands subpixel (below 0.1 pixel) accuracy. It has calibrated cameras and high-quality GCPs (ground control points), and the optimization runs offline. Real-time Visual SLAM uses the same formulation but operates under the constraints of GPS-denied environments, low-resolution cameras, and immediate estimation. The environments in which Visual SLAM systematically reaches the surveying field's accuracy standard (RMSE < 5 cm at 500 m range) are limited, and attempts to unify the accuracy standards of the two fields in a single framework are ongoing.

---

BA's assumptions (static world, point features, Gaussian noise) begin to break when the camera meets a moving object. Surveyors measured bridges, not robot soccer fields. The inheritance passed to computer vision in the form of a single question: which pixels in a moving image should serve as the "corresponding points" that bundle adjustment would later receive?

---

# Ch.2 — The Classical CV Toolbox: Harris to SIFT, and on to ORB

Bundle adjustment requires "corresponding points" — the same physical location found independently in two or more images. The surveyor planted targets in the field by hand; computer vision had to hand that role off to an algorithm. Feature detection and description is the problem that started there.

In the late 1970s, Hans Moravec tried to locate salient points in the environment with a camera on the Stanford Cart project. The work was written up in his 1980 Stanford doctoral thesis, ["Obstacle Avoidance and Navigation in the Real World by a Seeing Robot Rover"](https://frc.ri.cmu.edu/~hpm/project.archive/robot.papers/1975.cart/1980.html.thesis/index.html). The intuition that texture-rich corners are good to track was there, but no mathematical definition. In 1988, eleven years after Moravec's 1977 corner operator, Chris Harris and Mike Stephens formalized that intuition as eigenvalues of the autocorrelation matrix. Lucas and Kanade had laid down the frame for pixel tracking in 1981, seven years before Harris. Lowe absorbed both ideas and built a descriptor invariant to scale and rotation. Rublee did the same thing faster and without a patent. The SLAM front-end runs on top of this lineage.

---

## 2.1 The idea of a corner: from Moravec to Harris

A point where an image patch changes a lot under a small camera motion is called a "corner". Moravec's (1977) criterion was simple. Compute the Sum of Squared Differences (SSD) between a small window and copies of it shifted by one pixel in each of a few fixed directions. If the SSD is large for every shift, the point counts as a corner; the smallest of those SSD values serves as the score.

Harris and Stephens replaced this with continuous differentiation at the 1988 Alvey Vision Conference in ["A Combined Corner and Edge Detector"](https://www.bmva.org/bmvc/1988/avc-88-023.html). For image $I$, shifting a window $W$ around point $(x,y)$ by $(\Delta x, \Delta y)$ and expanding the intensity to first order approximates the SSD change as the quadratic form $(\Delta x \;\; \Delta y)\, M\, (\Delta x \;\; \Delta y)^\top$, where $I_x$, $I_y$ are the image derivatives and:

$$M = \sum_{(x,y) \in W} \begin{pmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{pmatrix}$$

The two eigenvalues $\lambda_1, \lambda_2$ of $M$ classify the point: both large indicates a corner, one large an edge, and both small a flat region. Harris avoided the eigenvalue decomposition altogether by using the score $R = \det(M) - k \cdot \text{tr}(M)^2$. $k$ is typically 0.04–0.06.
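
The three cases can be reproduced on a synthetic image. A minimal sketch, assuming central-difference gradients and an unweighted square window (Harris and Stephens used a Gaussian-weighted window); `harris_response` and the test image are illustrative:

```python
import numpy as np

def harris_response(img, y, x, half=3, k=0.04):
    """Harris score R = det(M) - k * tr(M)^2 for the window centred on (y, x)."""
    Iy, Ix = np.gradient(img.astype(float))     # derivatives along rows, columns
    win = (slice(y - half, y + half + 1), slice(x - half, x + half + 1))
    a = np.sum(Ix[win] ** 2)
    b = np.sum(Ix[win] * Iy[win])
    c = np.sum(Iy[win] ** 2)
    return (a * c - b * b) - k * (a + c) ** 2

# A bright square occupying the lower-right quadrant: one corner, two edges.
img = np.zeros((40, 40))
img[20:, 20:] = 1.0
R_corner = harris_response(img, 20, 20)   # at the corner of the square
R_edge = harris_response(img, 30, 20)     # on the vertical edge
R_flat = harris_response(img, 5, 5)       # in the uniform background
```

The score is positive at the corner, negative on the edge, and zero in the flat region, so a single threshold on $R$ separates corners without any eigenvalue decomposition.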

> 🔗 **Borrowed.** Harris's (1988) autocorrelation-matrix idea refined Moravec's (1977) SSD-based corner search through continuous differentiation. The prototype of the concept was in the Stanford Cart report.

In 1994, Jianbo Shi and Carlo Tomasi showed in ["Good Features to Track"](https://cecas.clemson.edu/~stb/klt/shi-tomasi-good-features-cvpr1994.pdf) (CVPR 1994) that using $\min(\lambda_1, \lambda_2)$ directly, in place of the Harris score, is more stable for optical flow tracking. This criterion is the Shi-Tomasi corner detector. OpenCV implemented it as the `goodFeaturesToTrack` function. Thirty years later, that function is still the same.
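
For a $2\times2$ symmetric matrix the smaller eigenvalue has a closed form, so the Shi-Tomasi score costs about as much as the Harris score. A small sketch; the function name and the sample matrix entries are hypothetical, and this is not OpenCV's code:

```python
import math

def min_eigenvalue(a, b, c):
    """Smaller eigenvalue of the symmetric matrix [[a, b], [b, c]]."""
    return 0.5 * (a + c) - math.sqrt(0.25 * (a - c) ** 2 + b * b)

lam_corner = min_eigenvalue(2.0, 0.25, 2.0)   # strong gradients in two directions
lam_edge = min_eigenvalue(3.5, 0.0, 0.0)      # gradient in one direction only
```

An edge scores zero however strong its gradient is, which matches the tracking intuition: a window on an edge can slide along it unnoticed.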

---

## 2.2 The archetype of tracking: Lucas-Kanade and KLT

Harris's matrix $M$ finds the point. Finding the same point again in the next frame is a separate problem. Bruce Lucas and Takeo Kanade, in the 1981 paper ["An Iterative Image Registration Technique with an Application to Stereo Vision"](https://www.ijcai.org/Proceedings/81-2/Papers/017.pdf), formulated inter-frame pixel motion as a minimization problem under the brightness constancy assumption.

Brightness constancy assumption: the intensity of pixel $(x,y)$ is the same before and after the motion.

$$I(x, y, t) = I(x + u, y + v, t + 1)$$

A Taylor expansion followed by linearization gives:

$$I_x u + I_y v + I_t = 0$$

One equation, two unknowns. Lucas-Kanade adds the assumption that pixels inside a $3\times3$ or $5\times5$ window move with the same $(u,v)$, producing an overdetermined system solved by least squares.

$$\begin{pmatrix} \sum I_x^2 & \sum I_x I_y \\ \sum I_x I_y & \sum I_y^2 \end{pmatrix} \begin{pmatrix} u \\ v \end{pmatrix} = -\begin{pmatrix} \sum I_x I_t \\ \sum I_y I_t \end{pmatrix}$$

The matrix on the left is the same structure matrix $M$ as Harris's. Corner detection and optical flow sit on the same math.
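
The $2\times2$ system above can be solved for one window in a few lines. This sketch performs a single linearized step on a smooth synthetic texture with a known sub-pixel shift; it omits the iteration and the image pyramid of a full tracker, and the texture, window size, and names are assumptions for illustration.

```python
import numpy as np

def lucas_kanade_window(I0, I1, y, x, half=10):
    """Estimate one (u, v) for the window centred on (y, x) by least squares."""
    Iy, Ix = np.gradient(I0)                 # spatial derivatives
    It = I1 - I0                             # temporal derivative
    win = (slice(y - half, y + half + 1), slice(x - half, x + half + 1))
    ix, iy, it = Ix[win].ravel(), Iy[win].ravel(), It[win].ravel()
    M = np.array([[ix @ ix, ix @ iy], [ix @ iy, iy @ iy]])
    b = -np.array([ix @ it, iy @ it])
    return np.linalg.solve(M, b)

# Smooth synthetic texture shifted by a known sub-pixel amount.
yy, xx = np.mgrid[0:64, 0:64].astype(float)
texture = lambda x, y: np.sin(0.35 * x) + np.sin(0.30 * y)
u_true, v_true = 0.30, -0.20
I0 = texture(xx, yy)
I1 = texture(xx - u_true, yy - v_true)       # so that I0(x, y) = I1(x + u, y + v)
u_est, v_est = lucas_kanade_window(I0, I1, y=32, x=32)
```

The estimate lands within a few hundredths of a pixel of the true shift. The residual bias comes from the linearization and the finite-difference gradients, which is why practical trackers iterate the step.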

Tomasi and Kanade, in the 1991 tech report ["Detection and Tracking of Point Features"](https://cecas.clemson.edu/~stb/klt/tomasi-kanade-techreport-1991.pdf), gave a concrete implementation that selects tracking-window quality by the eigenvalue criterion and refines displacement through Newton-Raphson iteration. Bouguet (Intel, 2000) later added an image-pyramid-based coarse-to-fine strategy so the tracker would converge under large motion, and this combination settled into the KLT (Kanade-Lucas-Tomasi) tracker. Real-time VIO systems like [VINS-Mono](https://arxiv.org/abs/1708.03852) (2018) still run a front-end from this lineage. A least-squares tracker from 1981 runs inside the VIO of smartphones and drones forty-odd years later.

> 🔗 **Borrowed.** Lucas-Kanade (1981) → KLT tracker → Qin et al.'s VINS-Mono (2018): a 37-year-old optical flow survives unchanged as the feature-tracking backbone of real-time VIO.

---

## 2.3 SIFT — invariance and the patent

KLT fits the case of a single camera moving a little at a time. Connecting the same point across images taken by different cameras, on different days, is a different order of problem. A change in viewpoint alters the patch's shape, size, and orientation for the same point, and a plain pixel comparison no longer works. That is why a **descriptor** is needed.

David Lowe (UBC) presented the idea at ICCV 1999. The talk was titled "Object Recognition from Local Scale-Invariant Features", and the demo compared 128-dimensional vectors to match the same object across different photographs. Five years later, in 2004, the complete version, ["Distinctive Image Features from Scale-Invariant Keypoints"](https://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf), appeared in IJCV, and this is the paper cited today as SIFT. SIFT (Scale-Invariant Feature Transform) runs in two stages.

**Detection stage.** Compute DoG (Difference of Gaussians) at several scales and select local extrema as keypoints. DoG is an approximation of the Laplacian of Gaussian. With $L(x,y,\sigma) = G(x,y,\sigma) * I(x,y)$ as the Gaussian-smoothed image:

$$D(x, y, \sigma) = L(x, y, k\sigma) - L(x, y, \sigma)$$

Here $k$ is the ratio between adjacent scales (typically $2^{1/s}$, where $s$ is the number of scales per octave). Searching for extrema across multiple octaves makes it possible to detect the same point under scale change.
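
The scale schedule inside one octave follows directly from $k = 2^{1/s}$. A small sketch; the base scale of 1.6 and the $s+3$ blurred images per octave follow the values reported in Lowe's 2004 paper, while the function name is made up here:

```python
def octave_sigmas(sigma0: float = 1.6, s: int = 3) -> list[float]:
    """Blur levels for one octave: s + 3 images, adjacent levels a factor k apart."""
    k = 2.0 ** (1.0 / s)
    return [sigma0 * k ** i for i in range(s + 3)]

sigmas = octave_sigmas()
```

After $s$ steps the blur has doubled, which is the point at which the image is downsampled by two and the next octave begins. Subtracting adjacent levels gives $s+2$ DoG images, enough to search $s$ scales for extrema with a neighbor above and below each.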

**Descriptor stage.** A $16\times16$ window around the keypoint is divided into a $4\times4$ grid of cells, and the 8-bin gradient-orientation histograms of the 16 cells are concatenated into a $4\times4\times8=128$-dimensional vector. Since the patch is first rotated to the keypoint's dominant gradient direction, rotation invariance is obtained as well.

The result was a 128-dimensional descriptor robust to scale, rotation, and partial affine deformation. That is why researchers had to use SIFT in the era before KITTI, before SLAM benchmarks existed.

Lowe filed a patent on SIFT in March 2000, and it was granted in March 2004 (US6711293B1, with priority from March 1999). The patent imposed licensing fees for commercial use, and until it expired in March 2020 it was one of the motivations for efforts to replace SIFT.

> 📜 **Prediction vs. outcome.** In "9 Conclusions" of the 2004 SIFT paper, Lowe listed the descriptor's possible extensions as "view matching for 3D reconstruction, motion tracking and segmentation, robot localization, image panorama assembly, epipolar calibration". Most of the directions landed — SfM, SLAM, panoramas, and early vision-based robot localization leaned on SIFT in the late 2000s. In the long-term correspondence problem, however, SIFT's position wobbled after CNNs arrived. After AlexNet in 2012, demand on the object-recognition side shifted to CNNs, and the local-descriptor slot for SLAM was gradually taken over by learned descriptors like SuperPoint and R2D2. The application domains were predicted correctly; the descriptor form diverted. `[partial hit]`

---

## 2.4 SURF — a speed–accuracy compromise

SIFT's 128-dimensional descriptor was accurate but slow. Hundreds of milliseconds per image on the desktop CPUs of the time. Not usable for real-time SLAM. Herbert Bay (ETH Zürich) presented ["SURF: Speeded-Up Robust Features"](https://people.ee.ethz.ch/~surf/eccv06.pdf) at ECCV 2006. Two ideas sit at the core.

Detect keypoints with the *determinant of the Hessian matrix* instead of DoG. Approximate the second Gaussian derivatives with box filters on an integral image to speed up computation. The descriptor is 64-dimensional, half of SIFT's. The neighborhood of the keypoint is split into $4\times4$ subregions, and in each subregion four values from Haar wavelet responses $d_x, d_y$, $(\sum d_x,\, \sum d_y,\, \sum|d_x|,\, \sum|d_y|)$, are concatenated into a $4\times4\times4=64$-dimensional vector. A 128-dimensional extension (SURF-128) exists, but the default is 64-dimensional.
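
What makes box filters cheap is the integral image: once it is built, any rectangular sum costs four lookups. A minimal sketch with illustrative names, not the SURF reference implementation:

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero row and column prepended."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def box_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1, x0:x1] from four table lookups, independent of box size."""
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

rng = np.random.default_rng(1)
img = rng.integers(0, 256, size=(32, 48)).astype(float)
ii = integral_image(img)
fast = box_sum(ii, 4, 7, 20, 31)
slow = img[4:20, 7:31].sum()
```

Because the cost does not depend on the box size, SURF can enlarge the filter to reach coarser scales instead of repeatedly blurring and shrinking the image.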

SURF was 3–7 times faster than SIFT. But the accuracy gap between 128 and 64 dimensions remained, and Bay could not avoid a patent either (ETH Zürich patent). SIFT was edged out over speed; SURF was edged out over accuracy and the patent both. What solved both problems at once was ORB.

> 🔗 **Borrowed.** Lowe's (1999/2004) DoG scale-space → Bay's (2006) Hessian integral image: two answers for achieving scale invariance. DoG is theoretically elegant; the Hessian approximation is engineered to be fast.

---

## 2.5 ORB — binary descriptor and release from the patent

In 2011, Ethan Rublee (Willow Garage), Vincent Rabaud, Kurt Konolige, and Gary Bradski presented ["ORB: An Efficient Alternative to SIFT or SURF"](https://www.gwylab.com/download/ORB_2012.pdf) at ICCV. The title is direct. Willow Garage was also the birthplace of ROS. The motive to build a feature that robotics researchers could actually use is spelled out in the title.

ORB combines and improves two existing techniques.

**Detection.** [FAST](https://www.edwardrosten.com/work/rosten_2006_machine.pdf) (Features from Accelerated Segment Test, Rosten & Drummond 2006). Examines the 16 pixels on a circle around a candidate pixel and declares it a corner if a contiguous arc of them is sufficiently brighter or darker than the center. More than 10 times faster than SIFT's DoG. ORB adds a Harris score on top of FAST and keeps only the strong responses.

**Descriptor.** [BRIEF](https://www.cs.ubc.ca/~lowe/525/papers/calonder_eccv10.pdf) (Binary Robust Independent Elementary Features, Calonder et al. 2010). Compares the intensities of randomly chosen point pairs in the patch around a keypoint to produce a bit string. 256 bits by default. Matching uses Hamming distance instead of Euclidean distance, so comparison is an XOR followed by a bit count.
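
Hamming matching on packed descriptors can be sketched as follows. The 32-byte layout matches the 256-bit default mentioned above; the random descriptors and the function name are illustrative, and a production matcher would use a hardware popcount where this sketch unpacks bits.

```python
import numpy as np

def hamming_distance(d1, d2):
    """Number of differing bits between two packed binary descriptors (uint8 arrays)."""
    return int(np.unpackbits(np.bitwise_xor(d1, d2)).sum())

rng = np.random.default_rng(2)
a = rng.integers(0, 256, size=32, dtype=np.uint8)   # 32 bytes = 256 bits
b = a.copy()
b[0] ^= 0b00000101                                   # flip two bits in the first byte
dist_same = hamming_distance(a, a)
dist_flipped = hamming_distance(a, b)
```

The distance is an integer between 0 and 256, with no floating-point arithmetic anywhere in the comparison.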

BRIEF's weak point was the lack of rotation invariance. Rublee rotated the sampling pattern to the direction of the FAST corner's intensity centroid (the paper calls this steered BRIEF), then chose a low-correlation subset of binary tests by learning; the result is **rBRIEF (rotated BRIEF)**. With orientation estimation in place, BRIEF finally became a descriptor usable in practice.

$$\theta = \text{atan2}(m_{01},\, m_{10}), \quad m_{pq} = \sum_{x,y} x^p y^q I(x,y)$$
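
The moment formula translates directly. A minimal sketch, assuming a square patch with coordinates measured from its center (ORB restricts the moments to a circular region, which is omitted here); the function name and test patches are illustrative:

```python
import numpy as np

def patch_orientation(patch):
    """theta = atan2(m01, m10), with x and y measured from the patch centre."""
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    xs -= (w - 1) / 2.0
    ys -= (h - 1) / 2.0
    m10 = np.sum(xs * patch)
    m01 = np.sum(ys * patch)
    return np.arctan2(m01, m10)

patch = np.zeros((15, 15))
patch[:, 10:] = 1.0                        # intensity mass on the +x side
theta_right = patch_orientation(patch)
theta_down = patch_orientation(patch.T)    # mass on the +y side (image y points down)
```

The angle points from the patch center toward the intensity centroid: zero when the bright side is to the right, $\pi/2$ when it is below. Rotating the BRIEF sampling pattern by this angle is what cancels in-plane rotation.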

Computation was 100 times faster than SIFT, there was no patent, and it was integrated into OpenCV right away. [ORB-SLAM](https://arxiv.org/abs/1502.00956) (Mur-Artal et al. 2015), as the name says, was built on ORB, and the line continued through the trilogy. ORB-SLAM3 had still not changed the front-end as of 2021.

> 🔗 **Borrowed.** Calonder et al.'s (2010) BRIEF → Rublee et al.'s (2011) ORB: adding intensity-centroid-based orientation estimation to a binary descriptor secured rotation invariance.

---

## 2.6 Learned descriptors

If ORB is the practical peak, the next question is natural. Are learned rules better than hand-designed ones? Yi et al.'s 2016 [LIFT](https://arxiv.org/abs/1603.09114) (Learned Invariant Feature Transform, ECCV 2016) tried to replace the three stages — detection, orientation estimation, descriptor — with CNNs. Three separately trained networks wired into a pipeline.

In 2018, DeTone et al.'s [SuperPoint](https://arxiv.org/abs/1712.07629) (CVPRW 2018) trained keypoint detection and a 256-dimensional descriptor jointly, under a self-supervised scheme called homographic adaptation. Pretrained on synthetic data, adapted to real images. The first learned descriptor to catch attention in the SLAM community.

Even so, as of 2026, the traditional descriptors have not disappeared. ORB is faster than SuperPoint on embedded devices, and it behaves more predictably than learned descriptors, which generalize unstably on out-of-domain images. DINOv2-based features have entered place recognition through work like [AnyLoc](https://arxiv.org/abs/2308.00688) (Keetha et al. 2023), but ORB-SLAM3, since its 2021 release, still uses ORB. Moravec's 1977 intuition runs on robots in the 2020s.

---

## 2.7 🧭 Still open

**Generalization limits of learned descriptors.** SuperPoint, R2D2, DISK, and others beat the classical methods inside the training domain, but in new environments (underwater, thermal, low-light) they are inconsistent. There is no consensus on which side is better. The question is still open in 2026.

**Failure modes of wide-baseline matching.** Harris- or ORB-based matching degrades sharply once the camera viewpoint change exceeds 45 degrees. Affine-covariant detectors (ASIFT, MSER) patched part of the gap, but there is no complete solution. [DUSt3R](https://arxiv.org/abs/2312.14132) (Wang et al. 2023) opened a path by bypassing matching itself, though it is still too early to judge whether this is the end of the descriptor problem or a detour around it.

---

Harris's intuition and Lowe's invariance laid the base, and Rublee's speed optimization dragged it onto the shop floor. The toolbox was complete. These techniques were each designed to work between one or two images. Connecting dozens or hundreds of images simultaneously, in a geometrically consistent way, needed another layer.

---

# Ch.3 — Structure from Motion: From Longuet-Higgins to COLMAP

While Harris and Lowe were sharpening how to pick out the "points worth looking at" inside a single image, a different lineage asked what could be known when those points were captured in two photographs at once. *Detecting* features and *reconstructing space* from features developed side by side through the same years, and only in the mid-2000s did they merge into one pipeline.

In 1981, H.C. Longuet-Higgins, a theoretical chemist turned cognitive scientist, by then at the University of Sussex, published a three-page paper in *Nature*. The title was "[A Computer Algorithm for Reconstructing a Scene from Two Projections](https://cseweb.ucsd.edu/classes/fa01/cse291/hclh/SceneReconstruction.pdf)". He showed that from eight pairs of coordinates for the same points captured in two photographs, one could simultaneously solve for how the camera had moved and what shape the scene took in three dimensions. He was neither a roboticist nor a computer vision researcher. Structure from Motion (SfM) began in those three pages, and the mathematics became engineering only in 2016, when Johannes Schönberger released COLMAP.

---

## 3.1 Essential Matrix and the 8-point Algorithm

Longuet-Higgins started from a simple point. When two cameras capture the same point, an algebraic constraint holds between the pair of image coordinates. Once the coordinate system is normalized, this constraint collapses into a single matrix. He defined it as the **essential matrix** $\mathbf{E}$.

Let the two camera centers be $\mathbf{O}_1$ and $\mathbf{O}_2$, and let the corresponding points in normalized coordinates be $\mathbf{x}_1$ and $\mathbf{x}_2$. The constraint is:

$$\mathbf{x}_2^\top \mathbf{E} \mathbf{x}_1 = 0$$

$\mathbf{E}$ factors through the rotation $\mathbf{R}$ and translation $\mathbf{t}$ between the cameras as $\mathbf{E} = [\mathbf{t}]_\times \mathbf{R}$, where $[\mathbf{t}]_\times$ is the skew-symmetric matrix of $\mathbf{t}$.

Once scale ambiguity is removed, the essential matrix has five degrees of freedom. But before the non-linear 5-point algorithm ([Nistér 2004](http://www.cad.zju.edu.cn/home/gfzhang/training/SFM/2004-PAMI-David%20Nister-An%20Efficient%20Solution%20to%20the%20Five-Point%20Relative%20Pose%20Problem.pdf)) that solved it with five correspondences, the standard approach was to fix one of the nine matrix entries as unit scale, treat the remaining eight as unknowns, and solve a linear system from eight correspondences. The essential-matrix constraints (rank 2, with two equal singular values) were enforced on the result only afterwards. This is the **8-point algorithm**. Longuet-Higgins himself gave a procedure that produced a unique solution from exactly eight points. The implementation was simple, and the computational cost was small.

The problem was numerical stability. When image coordinates run in the hundreds or thousands of pixels, the magnitudes of the coefficient matrix entries diverge sharply, and the SVD becomes unstable.

> 🔗 **Borrowed.** Hartley's 1997 normalized 8-point algorithm ([In Defense of the Eight-Point Algorithm](https://www.cse.unr.edu/~bebis/CS485/Handouts/hartley.pdf)) applied a linear transform to image coordinates so that their mean was zero and their average distance from the origin was $\sqrt{2}$, and then ran the same linear estimate (the paper states it for the fundamental matrix, the uncalibrated counterpart of $\mathbf{E}$). The geometry of Longuet-Higgins was left untouched; only the numerical conditioning was fixed. Every textbook afterward adopted this normalized version as the standard.

The fundamental matrix $\mathbf{F}$ generalizes the essential matrix. Even without knowing the camera intrinsics $\mathbf{K}$, the relation $\mathbf{x}_2^\top \mathbf{F} \mathbf{x}_1 = 0$ holds, with $\mathbf{x}_1$, $\mathbf{x}_2$ now read as homogeneous pixel coordinates instead of normalized ones. With intrinsics $\mathbf{K}_1$, $\mathbf{K}_2$ for the two cameras, the relationship is $\mathbf{F} = \mathbf{K}_2^{-\top} \mathbf{E} \mathbf{K}_1^{-1}$. For images from the same camera ($\mathbf{K}_1 = \mathbf{K}_2 = \mathbf{K}$) it simplifies to $\mathbf{F} = \mathbf{K}^{-\top} \mathbf{E} \mathbf{K}^{-1}$. In an SfM pipeline, when $\mathbf{K}$ is unknown $\mathbf{F}$ is estimated first; when $\mathbf{K}$ is known, $\mathbf{E}$ is solved directly.
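
The pieces of this section (the linear system, Hartley's normalization, the rank-2 projection, and the relation between $\mathbf{F}$ and $\mathbf{E}$) fit in one short script. A sketch on noise-free synthetic data, with no RANSAC and no nonlinear refinement; the scene, the camera motion, and all function names are assumptions made for illustration.

```python
import numpy as np

def normalize(pts):
    """Hartley normalization: zero mean, average distance sqrt(2) from the origin."""
    mean = pts.mean(axis=0)
    scale = np.sqrt(2.0) / np.mean(np.linalg.norm(pts - mean, axis=1))
    T = np.array([[scale, 0.0, -scale * mean[0]],
                  [0.0, scale, -scale * mean[1]],
                  [0.0, 0.0, 1.0]])
    return np.column_stack([pts, np.ones(len(pts))]) @ T.T, T

def eight_point(pts1, pts2):
    """Estimate F with x2^T F x1 = 0 from N >= 8 pixel correspondences."""
    x1, T1 = normalize(pts1)
    x2, T2 = normalize(pts2)
    # Each correspondence gives one row of the homogeneous system A f = 0.
    A = np.column_stack([x2[:, [0]] * x1, x2[:, [1]] * x1, x1])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    U, s, Vt = np.linalg.svd(F)               # enforce rank 2
    F = U @ np.diag([s[0], s[1], 0.0]) @ Vt
    return T2.T @ F @ T1                      # undo the normalization

rng = np.random.default_rng(3)
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
a = 0.1                                       # rotation about the y axis, in radians
R = np.array([[np.cos(a), 0.0, np.sin(a)], [0.0, 1.0, 0.0], [-np.sin(a), 0.0, np.cos(a)]])
t = np.array([0.5, 0.05, 0.1])
X = rng.uniform([-2, -2, 4], [2, 2, 8], size=(20, 3))   # points in front of both cameras

def to_pixels(Xc):
    uvw = Xc @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

pts1 = to_pixels(X)                           # camera 1 at the origin
pts2 = to_pixels(X @ R.T + t)                 # camera 2: X_c2 = R X + t
F = eight_point(pts1, pts2)
h1 = np.column_stack([pts1, np.ones(len(X))])
h2 = np.column_stack([pts2, np.ones(len(X))])
residuals = np.abs(np.sum(h2 * (h1 @ F.T), axis=1))     # |x2^T F x1| per match

E_hat = K.T @ F @ K                           # invert F = K^-T E K^-1
t_cross = np.array([[0.0, -t[2], t[1]], [t[2], 0.0, -t[0]], [-t[1], t[0], 0.0]])
E_true = t_cross @ R                          # E = [t]x R, defined up to scale and sign
```

On clean data the recovered `E_hat` matches `E_true` up to scale and sign. With real matches, the same routine sits inside a RANSAC loop and its output seeds a nonlinear refinement.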

---

## 3.2 Tomasi-Kanade Factorization

For ten years after 1981, SfM was studied mostly as the geometry between two photographs. Processing many photographs at once was a separate problem. Its outline came into view in 1992, when Carlo Tomasi and Takeo Kanade at CMU published the **[factorization method](https://people.eecs.berkeley.edu/~yang/courses/cs294-6/papers/TomasiC_Shape%20and%20motion%20from%20image%20streams%20under%20orthography.pdf)**.

The idea runs as follows. Given $F$ frames observing $P$ points, the image coordinates stack into a $2F \times P$ matrix $\mathbf{W}$. Each entry $w_{fp}$ is the coordinate of point $p$ in frame $f$. Under an orthographic camera model, once the mean of each row is subtracted (which removes the camera translation), $\mathbf{W}$ has rank at most 3 in the absence of noise. The original paper (Tomasi & Kanade 1992) started from exactly this assumption. Then:

$$\mathbf{W} = \mathbf{M} \mathbf{S}$$

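where $\mathbf{M}$ is a $2F \times 3$ motion matrix and $\mathbf{S}$ is a $3 \times P$ structure matrix. Keeping only the top three singular values of $\mathbf{W}$ through SVD gives $\mathbf{M}$ and $\mathbf{S}$ at once, up to an invertible $3 \times 3$ matrix that is then fixed by requiring each frame's two camera axes to be orthonormal.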

The heart of the method was that a single SVD estimated the motion of every frame and the 3D position of every point together. The whole computation was one SVD of a $2F \times P$ matrix, roughly $O(FP \cdot \min(F, P))$ with a dense solver, and the implementation was easy.
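
The rank-3 structure is easy to verify on synthetic data. The following is a minimal sketch under the orthographic assumption, using NumPy; the scene, camera rotations, and variable names are invented, and the rows are interleaved per frame rather than grouped as in the paper, which does not change the rank.

```python
import numpy as np

rng = np.random.default_rng(0)
num_frames, num_points = 6, 20

S_true = rng.normal(size=(3, num_points))           # synthetic 3D points
rows = []
for _ in range(num_frames):
    # Orthogonal matrix from a QR decomposition; its first two rows act as
    # the orthographic image axes of this frame.
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    offset = rng.normal(size=(2, 1))                # per-frame image translation
    rows.append(Q[:2] @ S_true + offset)
W = np.vstack(rows)                                 # 2F x P measurement matrix

# Registration: subtract each row's mean so the translation drops out.
W_reg = W - W.mean(axis=1, keepdims=True)

# Rank-3 truncation of the SVD splits W_reg into motion and structure.
U, s, Vt = np.linalg.svd(W_reg, full_matrices=False)
M_hat = U[:, :3] * np.sqrt(s[:3])                   # 2F x 3 motion factor
S_hat = np.sqrt(s[:3])[:, None] * Vt[:3]            # 3 x P structure factor
residual = np.linalg.norm(W_reg - M_hat @ S_hat)
```

`M_hat` and `S_hat` are determined only up to an invertible $3 \times 3$ matrix at this point. The metric upgrade that removes this ambiguity is left out of the sketch.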

> 🔗 **Borrowed.** Nistér, Naroditsky, and Bergen's 2004 CVPR paper "Visual Odometry" is cited in the later literature as redirecting real-time ego-motion estimation into an applied branch of this lineage. Instead of using Tomasi-Kanade's batch factorization directly, the work moved toward solving the relative pose between frames inside a short window, and it stands as an early point of the shift that traded batch accuracy for latency.

The limitation sat in the orthographic/affine assumption. An affine camera ignores perspective distortion. The model holds only when the depth variation in the scene is small compared to the distance to the camera — that is, for distant, small objects. For close scenes, wide-angle lenses, or scenes with large foreground–background depth differences, the error grew. From the late 1990s, extensions to the perspective camera were attempted from several directions, and these efforts led to the rediscovery of bundle adjustment.

---

## 3.3 Hartley & Zisserman and the canonization

If Tomasi-Kanade's factorization framed the multiple-view problem, what remained was extension to the perspective camera and tying the scattered mathematics together in a single language.

In 2000, Richard Hartley and Andrew Zisserman's textbook *[Multiple View Geometry in Computer Vision](https://www.robots.ox.ac.uk/~vgg/hzbook/)* appeared. 680 pages. It consolidated the SfM mathematics scattered from 1981 through the 1990s into the language of projective geometry.

Hartley & Zisserman did more than arrange. They derived the essential matrix, the fundamental matrix, the homography, camera calibration, and bundle adjustment all from a single projective-geometry framework. For the first time, it became clear that concepts that had run separately came from the same root.

Bundle adjustment got particular weight in this book. Hartley & Zisserman placed the reprojection-error minimization problem that Triggs et al. (1999) had formalized (see Ch.1) inside the projective-geometry framework, and put a *robust cost function* $\rho$ on it explicitly. To keep optimization from breaking on real data with outliers mixed in, the error was suppressed through Huber or Cauchy functions. The solver was Levenberg-Marquardt, and the sparsity of the Jacobian was exploited to cut computation.
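
What "suppressed" means is easiest to see by comparing the costs on a small and a large residual. The sketch below uses common textbook parameterizations of the Huber and Cauchy losses; the scale parameters `delta` and `c` and the sample residuals are illustrative assumptions.

```python
import math

def squared(r):
    """Plain least-squares cost."""
    return 0.5 * r * r

def huber(r, delta=1.0):
    """Quadratic near zero, linear beyond delta."""
    a = abs(r)
    return 0.5 * r * r if a <= delta else delta * (a - 0.5 * delta)

def cauchy(r, c=1.0):
    """Grows only logarithmically for large residuals."""
    return 0.5 * c * c * math.log1p((r / c) ** 2)

inlier, outlier = 0.5, 20.0   # hypothetical reprojection residuals in pixels
costs = {name: (f(inlier), f(outlier))
         for name, f in (("squared", squared), ("huber", huber), ("cauchy", cauchy))}
```

For the inlier, squared and Huber give the same cost and Cauchy gives slightly less. For the 20-pixel outlier the squared cost is 200, Huber gives 19.5, and Cauchy stays near 3, so a single bad match can no longer dominate the sum.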

Most SLAM and visual odometry (VO) papers in the early 2000s cited this textbook as their standard reference. With concept definitions unified through this one book, large-scale applications like Photo Tourism could focus on implementation without redefining the basics.

---

## 3.4 Photo Tourism and Bundler — Internet-scale SfM

In 2006, Noah Snavely, Steven Seitz, and Richard Szeliski published the SIGGRAPH paper "[Photo Tourism](https://doi.org/10.1145/1179352.1141964)". They gathered photographs of tourist sites uploaded to the internet (the Florence Duomo, the Trevi Fountain in Rome) and tried to reconstruct them in 3D.

The setting itself was a challenge. Cameras and weather varied, composition varied, and some of the images were unrelated indoor shots. This was not a systematically captured dataset but thousands of images uploaded in no particular order by thousands of people.

Snavely's pipeline ran in the following order. SIFT feature detection and matching found correspondences between image pairs. RANSAC with the fundamental matrix removed geometrically inconsistent matches. Starting from image pairs with high connectivity, cameras were added one at a time in incremental SfM. Each time a camera was added, bundle adjustment re-optimized the full set of poses and points.
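
The "one camera at a time" step needs a rule for which camera comes next. A common heuristic in incremental SfM, assumed here for illustration, is to pick the unregistered image that sees the most already-triangulated tracks. The data layout and names below are hypothetical, not Bundler's internals.

```python
def pick_next_camera(observations, registered, triangulated):
    """Return (camera, count) for the unregistered camera that observes the
    most tracks whose 3D position is already known.

    observations: dict mapping camera name -> set of track ids seen in it.
    registered:   set of camera names already in the reconstruction.
    triangulated: set of track ids that already have a 3D point.
    """
    best, best_count = None, 0
    for cam, tracks in sorted(observations.items()):
        if cam in registered:
            continue
        count = len(tracks & triangulated)
        if count > best_count:
            best, best_count = cam, count
    return best, best_count

# Toy scene: two cameras registered, three candidates waiting.
observations = {
    "img_a": {1, 2, 3, 4},
    "img_b": {2, 3, 4, 5},
    "img_c": {3, 4, 5, 9},
    "img_d": {5, 10, 11},
    "img_e": {20, 21},
}
registered = {"img_a", "img_b"}
triangulated = {2, 3, 4, 5}
next_cam, shared = pick_next_camera(observations, registered, triangulated)
```

In this toy scene `img_c` wins with three shared tracks, while `img_e` shares none and would have to wait until the reconstruction grows toward it.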

The datasets reported in the paper included the Notre Dame Cathedral (597 registered out of 2,635 candidates), the Trevi Fountain in Rome (360 out of 466), Yosemite Half Dome (325 out of 1,882), the Great Wall (82 out of 120), and Trafalgar Square (278 out of 1,893), with an average reprojection error of about 1.5 pixels on 1,611×1,128 images. No prior attempt had reconstructed anything at this scale from uncontrolled internet images.

The implementation of this pipeline was Bundler. Snavely released it as open source, and it became the default starting point for SfM researchers.

---

## 3.5 COLMAP — engineering maturity

> 📜 **Prediction vs. outcome.** The "Discussion and future work" section of Snavely et al. 2006 states "Ultimately, we wish to scale up our reconstruction algorithm to handle millions of photographs" and lists better image-registration ordering, lens-distortion modeling, repeated-structure handling, and disconnected-structure reconstruction as remaining problems. Scale-up was taken up by COLMAP (Schönberger 2016) and OpenSfM, reaching tens to hundreds of thousands of images, while real-time and online processing was answered separately by the SLAM lineage — not through incremental refinement, but through fixed-lag smoothers and loop closure. Of the items Snavely listed, scale-up was the one most clearly filled in. `[partial hit]`

In 2016, Johannes Schönberger and Jan-Michael Frahm published the CVPR paper "[Structure-from-Motion Revisited](https://openaccess.thecvf.com/content_cvpr_2016/papers/Schonberger_Structure-From-Motion_Revisited_CVPR_2016_paper.pdf)". The "Revisited" in the title was modest, but the paper was a systematic redesign that bundled ten years of improvements since Bundler.

COLMAP differed from Bundler most in three places.

First, the order in which cameras were added. Bundler started from pairs with high connectivity, but had no systematic criterion for which pair to extend first. COLMAP automated the choice of initial image pair and the camera-registration order using triangulation angle, feature-track length, and visibility score. The stability of reconstruction rose substantially.

Second, the bundle-adjustment cadence. Running a full bundle adjustment after every camera addition is expensive. COLMAP alternated local bundle adjustment (optimizing only the recently added camera together with cameras that shared many points with it) and periodic global bundle adjustment.
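
One way to picture the cadence is a schedule that runs local bundle adjustment after every registration and a global pass only once the model has grown by some fraction since the last global pass. The sketch below assumes that growth-triggered rule, and the 25% threshold is an illustrative value, not a COLMAP default.

```python
def ba_schedule(num_images, growth_ratio=0.25, start=2):
    """Return a list of (registered_image_count, kind) decisions.

    kind is "global" when the model has grown by at least growth_ratio since
    the last global bundle adjustment, otherwise "local".
    """
    schedule = []
    size_at_last_global = start          # the initial pair counts as optimized
    for n in range(start + 1, num_images + 1):
        if n >= size_at_last_global * (1.0 + growth_ratio):
            schedule.append((n, "global"))
            size_at_last_global = n
        else:
            schedule.append((n, "local"))
    return schedule

schedule = ba_schedule(100)
num_global = sum(1 for _, kind in schedule if kind == "global")
num_local = len(schedule) - num_global
```

Under this rule the global passes are frequent while the model is small and cheap to optimize, then thin out as it grows, which keeps the total cost well below one global pass per image.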

Third, geometric verification. For each image pair with putative feature matches, COLMAP ran RANSAC with two models in parallel: fundamental matrix and homography. The fundamental matrix covered general non-planar scenes; the homography covered planar scenes or pure rotation. COLMAP compared the inlier counts of the two models to classify the scene type, and filtered out matches that fit neither. It held up better than Bundler on poor matches and planar-degeneracy situations.
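
The classification by inlier counts can be sketched as a small decision function. The thresholds `min_inliers` and `planar_ratio` and the label strings are assumptions made for the example; they are not taken from the COLMAP source.

```python
def classify_pair(inliers_f, inliers_h, min_inliers=15, planar_ratio=0.8):
    """Label an image pair from the RANSAC inlier counts of two models.

    inliers_f: inliers of the fundamental-matrix model.
    inliers_h: inliers of the homography model.
    """
    if max(inliers_f, inliers_h) < min_inliers:
        return "rejected"                 # neither model explains the matches
    if inliers_f == 0 or inliers_h / inliers_f >= planar_ratio:
        return "planar_or_pure_rotation"  # a homography explains almost everything
    return "general"                      # real parallax on a non-planar scene

labels = [
    classify_pair(inliers_f=220, inliers_h=60),    # typical outdoor pair
    classify_pair(inliers_f=200, inliers_h=190),   # facade or panorama-like pair
    classify_pair(inliers_f=8, inliers_h=5),       # unrelated images
]
```

A pair labelled as planar or rotation-only is a poor choice for initialization or triangulation, which is why the distinction matters upstream of the reconstruction loop.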

> 🔗 **Borrowed.** COLMAP's incremental bundle adjustment strategy modularized Snavely's Bundler pipeline and added quality control at each stage. The core mathematics of the algorithm (essential matrix estimation, triangulation, Levenberg-Marquardt) came from the Hartley & Zisserman textbook. COLMAP's contribution sat not in new mathematics but in the systematization of engineering judgment.

COLMAP became the de facto standard for reasons beyond raw performance. The codebase was well organized and the documentation was adequate, and CUDA acceleration handled thousands of images within reasonable time. After NeRF appeared in 2020, most NeRF training codebases took COLMAP's output (camera poses + sparse point cloud) as input. 3D Gaussian Splatting did the same. More than an SfM tool, COLMAP became the entryway to 3D reconstruction research.

---

## 3.6 The split between SfM and SLAM

SfM and SLAM use the same mathematics yet solve different problems. The distinction came into sharp relief in the early 2000s.

SfM is *offline*. All images are gathered before processing, so there is no time constraint, and global bundle adjustment can be run multiple times with full access to the whole dataset. If a camera pose was wrong, one can go back and recompute.

SLAM is *online*. Sensor data arrives in real time, and the robot's current position has to come out on the spot. Past data cannot be referenced indefinitely, the computation grows as the map does, and when the robot loops back to a place it visited earlier, the accumulated drift has to be corrected.

The place where the two fields diverge most is loop closure. In SfM, global bundle adjustment cleans up every inconsistency. In SLAM, the moment a loop closes has to be detected, and the drift at that moment has to be corrected locally. The techniques for this (visual place recognition, pose graph optimization, covisibility-based local optimization) were problems unique to SLAM, with no counterpart in SfM.

Uncertainty propagation differed as well. SLAM tracks the uncertainty of the current pose in real time and updates it with each new observation. A probabilistic representation in the form of EKF or factor graph is needed. In SfM, covariance can be computed after optimization finishes, and real-time tracking is not required.

Davison's [MonoSLAM (2003)](https://www.doc.ic.ac.uk/~ajd/Publications/davison_iccv2003.pdf) called itself "real-time SfM". But the structure that kept the camera pose and landmarks together in an EKF state vector differed from SfM's global batch. Over the 2000s, the two fields split into independent lineages, each with its own problem setting.

---

## 3.7 🧭 Still open

**SfM with dynamic objects.** Every current SfM system, including COLMAP, assumes a static world. Bundle adjustment is solved on the premise that every point in the scene is stationary, so in scenes crowded with cars or pedestrians, contaminated matches distort the optimization. RANSAC filters some of them, but it is not a root fix. Research into dynamic SfM (segmentation integration, per-object independent motion estimation) is in progress, but as of 2026 there is no general-purpose implementation at COLMAP's level.

**The blurring boundary between SfM and SLAM.** In 2023, [DUSt3R](https://arxiv.org/abs/2312.14132) (Wang et al.) took two images into a single pretrained network and produced a dense point map and camera poses at once. No feature matching, no RANSAC, and no bundle-adjustment initialization. Extended as [MASt3R](https://arxiv.org/abs/2406.09756) (2024), it handled tens of images. Each module of the traditional SfM pipeline is being replaced one at a time. If COLMAP was the entryway to NeRF and 3DGS, the DUSt3R line is trying to replace that entryway itself. Whether this paradigm will actually push COLMAP out, or win only in specific domains, is still unknown.

---

While SfM was refining precise offline reconstruction, a different kind of question was piling up elsewhere. Not photographs, but a moving robot. The images have not been gathered yet. A pose estimate has to come out right now. The question Randall Smith and Peter Cheeseman [posed in 1986](https://people.csail.mit.edu/brooks/idocs/Smith_Cheeseman.pdf), how to propagate uncertain spatial relations, grew under that pressure into a separate field called SLAM.

---

# Ch.4 — Smith-Cheeseman and the Rise and Fall of EKF-SLAM

The photogrammetry, SfM, and bundle adjustment covered in Part I all assumed one thing. The camera sits still, or there is enough time after capture to crunch all the images offline as a batch. Hartley-Zisserman's geometry, RANSAC's robust estimation, Levenberg-Marquardt's iterative optimization — these tools knew how to measure the world, but did not ask where a moving robot was *right now*. Part II starts from that question. Building a map while knowing your own position, refusing to give up on the estimate while uncertainty piles up. The problem of probabilistic mapping opened in a small memo out of SRI International.

In 1986 Randall Smith and Peter Cheeseman wanted to treat mathematically how uncertain a robot's measurement is when it measures something in space. The idea that came out of SRI International inherited Kalman's (1960) filter mathematics but extended it in a different direction, propagating uncertainty not over a single state estimate but over a whole *network of spatial relationships*. Several years later Hugh Durrant-Whyte in Sydney and John Leonard at MIT welded onto that mathematics the problem statement "a robot estimates its own position while building a map." The acronym "SLAM" is the product of that join.

---

## 4.1 The mathematics of uncertain spatial relationships — Smith, Self, Cheeseman (1988)

Smith and Cheeseman's starting question at SRI International was how error propagates when a robot accumulates measurements across several places. Their working notes, written with Matthew Self, came out in 1988 as ["Estimating Uncertain Spatial Relationships in Robotics"](https://arxiv.org/abs/1304.3111). The question itself was clear. When a robot measures B from A and then C from B, how does the uncertainty from A to C get computed?

The [Kalman filter](https://www.cs.unc.edu/~welch/kalman/kalmanPaper.html) already existed. It had been used since 1960 for radar tracking, ballistic calculation, and satellite orbit correction. What Smith and Cheeseman did was reformulate Kalman's covariance propagation equations to fit the composition of spatial transforms. Put the robot pose $\mathbf{x}_r$ and the landmark positions $\mathbf{m}_i$ into a single state vector, and maintain the joint covariance $\mathbf{P}$ over all of it.

$$\mathbf{x} = [\mathbf{x}_r^\top,\ \mathbf{m}_1^\top,\ \ldots,\ \mathbf{m}_N^\top]^\top$$

$$\mathbf{P} = \begin{bmatrix} \mathbf{P}_{rr} & \mathbf{P}_{rm} \\ \mathbf{P}_{mr} & \mathbf{P}_{mm} \end{bmatrix}$$

The off-diagonal block $\mathbf{P}_{rm}$ was the point. Robot-pose uncertainty and landmark-position uncertainty are *correlated*, and only by tracking that correlation can the estimate stay consistent. The paper proved this explicitly, and the whole SLAM field took its starting position from there.
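
The A to B to C question that opens this section has a compact first-order answer for planar relations $(x, y, \theta)$. The sketch below compounds two relations and propagates their covariances through the Jacobians of the composition. It assumes the two relations are independent, and the numbers are invented.

```python
import numpy as np

def compound(x_ab, P_ab, x_bc, P_bc):
    """First-order compounding of two planar relations (x, y, theta).

    Returns the relation from A to C and its covariance, assuming the
    A->B and B->C estimates are statistically independent.
    """
    x1, y1, th1 = x_ab
    x2, y2, th2 = x_bc
    c, s = np.cos(th1), np.sin(th1)
    x_ac = np.array([x1 + c * x2 - s * y2,
                     y1 + s * x2 + c * y2,
                     th1 + th2])
    J1 = np.array([[1.0, 0.0, -s * x2 - c * y2],    # d(x_ac) / d(x_ab)
                   [0.0, 1.0, c * x2 - s * y2],
                   [0.0, 0.0, 1.0]])
    J2 = np.array([[c, -s, 0.0],                    # d(x_ac) / d(x_bc)
                   [s, c, 0.0],
                   [0.0, 0.0, 1.0]])
    P_ac = J1 @ P_ab @ J1.T + J2 @ P_bc @ J2.T
    return x_ac, P_ac

# Hypothetical noise: 10 cm std in x and y, 0.02 rad std in heading.
P = np.diag([0.01, 0.01, 0.0004])
x_ab = np.array([2.0, 0.0, 0.0])
x_bc = np.array([3.0, 0.0, 0.0])
x_ac, P_ac = compound(x_ab, P, x_bc, P)
```

Two effects show up in `P_ac`. The lateral variance grows beyond the simple sum because heading error at B is multiplied by the 3 m lever arm, and a nonzero off-diagonal term appears between $y$ and $\theta$. That coupling is the small-scale version of the $\mathbf{P}_{rm}$ block.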

> 🔗 **Borrowed.** Smith-Cheeseman's (1988) spatial-relationship mathematics inherits directly from Kalman's (1960) covariance propagation. A technique for tracking a single moving object became a framework for tracking a robot and every element of its map at once.

---

## 4.2 How the name "SLAM" settled in

There is no "SLAM" in the 1988 Smith-Cheeseman paper. Hugh Durrant-Whyte, who had moved from Oxford to Sydney, and John Leonard at MIT were, in the early 1990s, each calling the same problem a different name in their own labs. Once the two groups started citing each other, a shared term was needed, and "SLAM" converged into place. Researchers' memories differ on which document used it first. No canonical first-use paper exists.

Leonard and Durrant-Whyte's 1991 paper, ["Simultaneous Map Building and Localization for an Autonomous Mobile Robot"](https://doi.org/10.1109/IROS.1991.174711), is often cited as an early instance that put this problem front and center in a mainstream robotics title. That mapping and localization are inseparably entangled, and must be done simultaneously — that intuition sat there before the acronym did.

"Simultaneous Localization and Mapping," abbreviated SLAM. For the next ten years this name was the center of gravity the field converged around.

> 🔗 **Borrowed.** [Bar-Shalom's multi-target tracking](https://archive.org/details/trackingdataasso0000bars) (collected as a 1988 monograph) supplied a framework for estimating the states of many objects at once. Leonard and Durrant-Whyte can be read as mapping "target position" to "landmark position" and "tracker position" to "robot pose" inside that framework. A case of radar technology translated into indoor robot mapping.

---

## 4.3 The EKF-SLAM formulation

Applying the Extended Kalman Filter (EKF) to SLAM was less a choice than a natural convergence. The EKF had been the standard tool for nonlinear state estimation since the 1960s, and it runs in two stages: predict and update.

Predict stage: when the robot moves, the motion model $f(\cdot)$ predicts the state, and the Jacobian $\mathbf{F}$ propagates the covariance.

$$\hat{\mathbf{x}}^- = f(\hat{\mathbf{x}}, \mathbf{u})$$
$$\mathbf{P}^- = \mathbf{F}\mathbf{P}\mathbf{F}^\top + \mathbf{Q}$$

Update stage: when a sensor measurement $\mathbf{z}$ arrives, the Jacobian $\mathbf{H}$ of the observation model $h(\cdot)$ yields a Kalman gain $\mathbf{K}$ that updates the state and covariance.
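
In the same notation as the predict equations, the update stage in its standard textbook form reads as follows (a generic statement of the EKF, not a transcription from any one SLAM paper; $\mathbf{R}$ is the measurement noise covariance):

$$\mathbf{S} = \mathbf{H}\mathbf{P}^-\mathbf{H}^\top + \mathbf{R}, \qquad \mathbf{K} = \mathbf{P}^-\mathbf{H}^\top\mathbf{S}^{-1}$$
$$\hat{\mathbf{x}} = \hat{\mathbf{x}}^- + \mathbf{K}\left(\mathbf{z} - h(\hat{\mathbf{x}}^-)\right)$$
$$\mathbf{P} = (\mathbf{I} - \mathbf{K}\mathbf{H})\mathbf{P}^-$$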

Iterating these two stages is the whole of EKF-SLAM. The structure is simple, and that simplicity carried a scalability ceiling from the start.
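
A one-dimensional toy makes the loop concrete. The sketch assumes a robot on a line, a single landmark, and a measurement $z = m - x_r$, so the models are linear and the Jacobians are exact. The noise values are invented, and the readings are noise-free for clarity.

```python
import numpy as np

x = np.array([0.0, 0.0])          # state: [robot position, landmark position]
P = np.diag([0.0, 1e6])           # robot known at start, landmark unknown
Q = np.diag([0.04, 0.0])          # motion noise touches only the robot
R = np.array([[0.01]])            # measurement noise
H = np.array([[-1.0, 1.0]])       # Jacobian of h(x) = m - x_r

def predict(x, P, u):
    F = np.eye(2)                 # Jacobian of f(x, u) = [x_r + u, m]
    return x + np.array([u, 0.0]), F @ P @ F.T + Q

def update(x, P, z):
    S = H @ P @ H.T + R           # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)
    x = x + (K @ (z - H @ x)).ravel()
    P = (np.eye(2) - K @ H) @ P
    return x, P

true_landmark = 5.0
x, P = update(x, P, np.array([true_landmark - 0.0]))   # first sighting at the origin
x, P = predict(x, P, u=1.0)                            # move one unit
P_before = P.copy()
x, P = update(x, P, np.array([true_landmark - 1.0]))   # second sighting
```

After the second update the off-diagonal entry of `P` is positive: the robot and landmark errors have become correlated, and the robot's variance has dropped below its post-motion value because the landmark now anchors it.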

The problem is state dimension. Put a 6DOF pose together with $N$ 3D landmarks into the state, and the state vector has dimension $6 + 3N$; the covariance matrix has $(6+3N)^2$ entries, an $O(N^2)$ structure. A single update costs $O(N^2)$ as well. The innovation covariance $\mathbf{S} = \mathbf{H}\mathbf{P}^-\mathbf{H}^\top + \mathbf{R}$ that is inverted for the Kalman gain is only as large as the measurement, but the covariance update rewrites the whole of $\mathbf{P}$. 100 landmarks gives $306 \times 306 \approx$ 94k entries; 1,000 landmarks gives $3006 \times 3006 \approx$ 9M. The number of landmarks a regular PC in the early 2000s could keep in real time topped out in the tens to low hundreds.
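
The arithmetic behind those figures fits in a few lines. The helper name `ekf_slam_size` is made up for this example.

```python
def ekf_slam_size(num_landmarks, pose_dim=6, landmark_dim=3):
    """Return (state dimension, number of covariance entries)."""
    dim = pose_dim + landmark_dim * num_landmarks
    return dim, dim * dim

sizes = {n: ekf_slam_size(n) for n in (10, 100, 1000)}
for n, (dim, entries) in sizes.items():
    print(f"N={n:>5}: state dim {dim:>5}, covariance entries {entries:>10,}")
```

Going from 100 to 1,000 landmarks multiplies the landmark count by 10 and the covariance size by roughly 100, which is the quadratic wall in numbers.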

That [Andrew Davison's MonoSLAM (2003)](https://www.doc.ic.ac.uk/~ajd/Publications/davison_iccv2003.pdf) was locked to a few dozen landmarks in its live demos was no accident. The $O(N^2)$ wall of EKF-SLAM set that number.

---

## 4.4 The scalability wall

When Davison ran real-time 3D tracking from a single webcam at ICCV 2003, he was mapping a desk-sized space with a few dozen features. In an environment with no commercial SLAM systems, a real-time monocular demo was a rare thing to see. The issue was that its ceiling came not from the algorithm but from the size of the covariance matrix.

At 100 landmarks the covariance matrix is $306 \times 306$ (6DOF pose + 100 3D landmarks, state dimension $6 + 3 \times 100 = 306$). At 1,000 it is $3006 \times 3006$. Every time step that matrix has to be updated, along with a matrix inversion for the gain. On top of that, because the EKF keeps the full joint distribution in one block, adding a new landmark immediately generates cross-correlations with every existing landmark. As the map grows, update cost grows quadratically.

The fix attempted through the mid-2000s was submaps. Partition the whole map into overlapping small regions, run an EKF inside each submap, and connect submaps by a separate structure. [Chong and Kleeman (1999)](http://www.cs.cmu.edu/afs/cs/Web/People/motionplanning/papers/sbp_papers/integrated1/chong_feature_map.pdf) proposed an early form. Information loss at submap boundaries, difficulty of loop closure, and implementation complexity added friction to putting submap approaches into practice.

> 🔗 **Borrowed.** The submap-partitioning idea of Chong-Kleeman (1999) carries forward into the local-window optimization of modern SLAM. ORB-SLAM's local map and VINS-Mono's sliding window sit conceptually on the same principle. Only the implementation tool changed, from EKF to bundle adjustment.

---

## 4.5 The consistency problem: Julier-Uhlmann's counterexample

A deeper flaw in EKF-SLAM surfaced at ICRA 2001. Simon Julier and Jeffrey Uhlmann analyzed the behavior of EKF-based SLAM through numerical experiments and showed that the filter trusts itself too much. Their paper title at IEEE ICRA was ["A Counter Example to the Theory of Simultaneous Localization and Map Building"](https://doi.org/10.1109/ROBOT.2001.933257). Provocative, and the content matched.

The point the secondary literature draws from this paper is that EKF-SLAM becomes *overconfident* over time. The actual estimation error grows, while the covariance (uncertainty) the filter reports shrinks below its true value. This is inconsistency.

The cause sits in linearization error. The EKF approximates nonlinear motion and observation models by a first-order Taylor expansion. When this approximation error accumulates step by step, the covariance begins to underestimate the real error. Once the robot becomes overconfident that "I am here," the filter trusts subsequent measurements less, and errors pile up without correction.
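
A single scalar example shows how a first-order expansion can underestimate uncertainty. It is a toy chosen for illustration, not the Julier-Uhlmann experiment: a landmark at fixed range $r$ seen at heading $\theta \sim \mathcal{N}(0, \sigma^2)$, with the quantity of interest $x = r\cos\theta$. The values of `r` and `sigma` are arbitrary.

```python
import math

r = 10.0        # hypothetical range to the landmark, treated as exact
sigma = 0.3     # heading standard deviation in radians
theta0 = 0.0    # linearization point (the current estimate)

# First-order (EKF-style) variance: (dx/dtheta)^2 * sigma^2 at theta0.
jacobian = -r * math.sin(theta0)
linearized_var = jacobian ** 2 * sigma ** 2

# Exact variance for a Gaussian heading, using
# E[cos(theta)] = exp(-sigma^2 / 2) and E[cos(theta)^2] = (1 + exp(-2 sigma^2)) / 2.
mean_cos = math.exp(-sigma ** 2 / 2.0)
mean_cos_sq = 0.5 * (1.0 + math.exp(-2.0 * sigma ** 2))
exact_var = r ** 2 * (mean_cos_sq - mean_cos ** 2)
```

The Jacobian vanishes at the linearization point, so the linearized variance is exactly zero, while the exact variance is about 0.37. A filter that propagates the first-order value believes it knows $x$ perfectly when it does not.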

In 2007 [Shoudong Huang and Gamini Dissanayake](https://doi.org/10.1109/TRO.2007.903811) dissected the cause of this inconsistency more precisely. The paper's central diagnosis was that basic constraints among the Jacobians evaluated at the current state estimate break down, and this is the main cause of EKF-SLAM's inconsistency; as a result, the variance of the robot's heading angle (yaw) can wrongly converge to zero when it should be maintained. Later observability-based analyses, in which the system's observable degrees of freedom change with the point of linearization and the filter injects spurious information into unobservable directions, take this paper as their starting point.

> 📜 **Prediction vs. outcome.** After Julier and Uhlmann's 2001 counterexample, attempts to design a consistent estimator kept coming. Variants in the filter family (the Unscented Kalman Filter (UKF), Invariant EKF, robust covariance) were proposed for nearly a decade. Looking back from 2026, though, the practical resolution of this problem came *not from the filter but from optimization*. [iSAM](https://www.cs.cmu.edu/~kaess/pub/Kaess08tro.pdf) (Kaess et al., 2008), [g2o](http://ais.informatik.uni-freiburg.de/publications/papers/kuemmerle11icra.pdf) (Kümmerle et al., 2011), and GTSAM effectively replaced the filter. Refreshing the Jacobian linearization through iterative optimization rather than freezing it at the current estimate avoids the inconsistency structurally. The seat the counterexample reserved for a "new filter" was eventually filled not by a filter but by a different structure. `[abandoned]`

---

## 4.6 FastSLAM — divide and conquer

What attacked EKF-SLAM's $O(N^2)$ wall from a different direction was [FastSLAM](https://cdn.aaai.org/AAAI/2002/AAAI02-089.pdf). Michael Montemerlo and Sebastian Thrun (then at CMU) presented it at AAAI 2002 with Daphne Koller and Ben Wegbreit (Stanford).

The key observation is Rao-Blackwellization. Given the robot path $x_{0:t}$, the position estimates of each landmark become *mutually independent*. So one can represent the path with a particle filter (each particle standing for one possible path) and run a separate landmark EKF independently for each particle.
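
Written out, with $z_{1:t}$ the measurements, $u_{1:t}$ the controls, and data association assumed known, the conditional independence gives the factorization

$$p(x_{0:t}, \mathbf{m}_{1:N} \mid z_{1:t}, u_{1:t}) = p(x_{0:t} \mid z_{1:t}, u_{1:t}) \prod_{i=1}^{N} p(\mathbf{m}_i \mid x_{0:t}, z_{1:t})$$

The first factor is the path posterior that the particles sample. Each term in the product is one small landmark filter attached to a particle.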

With $K$ particles and $N$ landmarks the per-step complexity is $O(K \log N)$, logarithmic in $N$ where EKF-SLAM is $O(N^2)$. The $\log N$ comes from storing each particle's landmark estimates in a balanced binary tree, so that resampling can copy a particle without copying every landmark. As landmark count rises, per-particle EKFs stay mutually independent, so there is no need to keep the full $N \times N$ covariance. $K$ is fixed at tens to hundreds, and the practical gain was large.

FastSLAM worked. The AAAI paper reports runs with as many as 50,000 landmarks, far beyond what a single joint EKF could hold, and the technology transferred quickly. But problems accumulated. Particle depletion was the first: as the map grows, most particles come to represent bad paths, and effective sample count drops sharply. Reweighting paths in a loop-closure event is hard, and adding more particles did not solve drift accumulation in large-scale environments.
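
"Effective sample count" has a standard estimate, $N_{\text{eff}} = 1 / \sum_k w_k^2$ over normalized weights. The sketch below evaluates it on two invented weight sets to show what depletion looks like numerically.

```python
def effective_sample_size(weights):
    """Estimate N_eff = 1 / sum(w_k^2) after normalizing the weights."""
    total = sum(weights)
    normalized = [w / total for w in weights]
    return 1.0 / sum(w * w for w in normalized)

healthy = [1.0] * 100                # every particle equally plausible
depleted = [1.0] + [1e-6] * 99       # one particle carries almost all the weight

ess_healthy = effective_sample_size(healthy)
ess_depleted = effective_sample_size(depleted)
```

With uniform weights all 100 particles count. In the depleted case the filter nominally holds 100 paths but effectively tracks one, so a loop closure that contradicts that path has no alternative hypothesis to fall back on.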

[FastSLAM 2.0](https://www.ijcai.org/Proceedings/03/Papers/165.pdf) (Montemerlo et al. 2003) improved the proposal distribution, but as long as the methodology stayed inside the filter paradigm there was a ceiling to scalability. The method that ultimately got around that ceiling was not in the filter family.

---

## 4.7 The EKF's exit

Graph-based approaches became real from 2005 onward, and EKF-SLAM stepped back from the main line. Once [Feng Lu and Evangelos Milios's 1997 graph idea](https://doi.org/10.1023/A:1008854305733) combined with [Olson-Leonard-Teller's (2006)](https://april.eecs.umich.edu/pdfs/olson2006icra.pdf) efficient solver, and then with the real-time factorization techniques of g2o, GTSAM, and iSAM2, the EKF's strength of "incremental update" was no longer a differentiator.

The difference showed up at loop closure, the step that corrects accumulated map error when the robot returns to a place it has already visited. The EKF has to update the entire covariance matrix at that moment. Cost: $O(N^2)$. Graph optimization adds one new edge to the pose graph and refactors a sparse matrix. Using the sparse structure, the cost is far lower.
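
A one-dimensional pose graph shows what "adds one new edge" does. The sketch assumes five poses on a line, drifting odometry, and equal weights on all edges; it solves the linear least-squares problem densely with NumPy, whereas real back ends exploit the sparsity visible in `A`.

```python
import numpy as np

num_poses = 5                        # x0..x4 on a line; x0 is fixed at 0
odometry = [1.1, 1.1, 1.1, 1.1]      # hypothetical drifting steps (true step is 1.0)
loop_closure = 4.0                   # place recognition: x4 is 4.0 away from x0

rows, rhs = [], []
for i, step in enumerate(odometry):  # odometry edge: x_{i+1} - x_i = step
    row = np.zeros(num_poses - 1)    # unknowns are x1..x4
    row[i] = 1.0
    if i > 0:
        row[i - 1] = -1.0
    rows.append(row)
    rhs.append(step)

loop_row = np.zeros(num_poses - 1)   # loop edge: x4 - x0 = loop_closure
loop_row[-1] = 1.0
rows.append(loop_row)
rhs.append(loop_closure)

A, b = np.array(rows), np.array(rhs)
dead_reckoning = np.cumsum(odometry)                 # estimate without the loop edge
optimized = np.linalg.lstsq(A, b, rcond=None)[0]     # estimate with it
```

Dead reckoning puts the last pose at 4.4. With the loop edge the solution pulls it to 4.08 and spreads the correction evenly over the chain. Each row of `A` has at most two nonzero entries, which is the sparsity that graph solvers factorize incrementally.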

Around 2010, choosing the EKF as a backend for a new SLAM system became uncommon. It survived only under special constraints — very limited compute resources, real-time filter requirements.

> 📜 **Prediction vs. outcome.** Durrant-Whyte and Bailey's [2006 IEEE Robotics & Automation Magazine tutorial](https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/Durrant-Whyte_Bailey_SLAM-tutorial-I.pdf) discussed SLAM's scalability and projected submap decomposition and the information filter as the solutions for large-scale environments. The information filter (the EKF's inverse-covariance form) was expected to keep computation from slowing with landmark count, using a sparse information matrix. The actual development ran differently. The information-filter family (SEIF and so on) accrued marginalization error in the course of forcing sparsity. Submaps were absorbed into some systems but did not become the mainstream solution. What dominated the 2010s was factor graph plus iterative optimization. `[diverted]`

---

## 4.8 🧭 Still open

**Filter vs. optimization coexistence.** The EKF stepping back from backend primacy does not mean it disappeared. As of 2026, some autonomous-driving implementations still prefer filter-based backends. Optimization-based SLAM needs iterative convergence, and real-time guarantees are sometimes hard. In low-cost embedded systems, sparse EKFs and UKFs reappear. The filter did not die; use case and constraint decide the mix.

**Non-Gaussian uncertainty.** The most basic assumption of the EKF is that uncertainty follows a Gaussian distribution. Real-world sensor errors are often multi-modal or heavy-tailed. A single Gaussian severely oversimplifies actual uncertainty, especially under asymmetric perceptual aliasing (different places looking the same). Particle filters can, in theory, represent non-Gaussian distributions but are impractical in high-dimensional state. Stein particles, normalizing flows, and learning-based uncertainty estimation are being tried, but as of 2026 the forms validated inside real-time SLAM are limited.

---

While EKF-SLAM was hitting its real-time ceiling around 100 landmarks, at Imperial College Andrew Davison was using that small number to prove something else. A single camera, no other sensors, in real time. The numerical limit stayed the same; the way it was handled changed.

---

# Ch.5 — MonoSLAM → PTAM: The Real-Time Daydream and the Split Revolution

The previous chapter showed how EKF-SLAM completed a probabilistically consistent recipe for building maps, and how its covariance matrix ran into a structural wall that grew as $O(N^2)$ in the number of landmarks $N$. Not because the theory was wrong — that is just how the design was shaped. From there Davison and Klein walked in different directions.

In 2003, Davison plugged a single webcam into a laptop in an Imperial College lab. He carried over the probabilistic spatial-relations math that Smith and Cheeseman had laid down in 1988 and the EKF-SLAM scaffolding Leonard and Durrant-Whyte had stacked on top of it, but the sensor was a single camera. With no IMU and no stereo rig, he bolted on the Shi-Tomasi 1994 corner detector and a Kalman predict-update loop and ran the whole thing in real time. By the standards of the day, it was a reckless combination.

Four years later, in 2007, the same year MonoSLAM's journal version appeared, Klein and Murray at Oxford put forward a different answer. They split tracking and mapping into two threads. That split became the backbone of Visual SLAM for the next ten years.

---

## 1. The 2003 demo

At ICCV 2003, Davison's [Real-Time Simultaneous Localisation and Mapping with a Single Camera](https://doi.org/10.1109/ICCV.2003.1238654) stirred the room. Not because the result was astonishing. What was shocking was *that it was possible at all*.

The mainstream of SLAM at the time was laser sensors. LiDAR delivered 2D ranges directly, and stereo cameras recovered depth at the pixel level. A monocular camera had no depth information to begin with. Estimating 3D structure from a single camera required at least two frames, and the uncertainty of the initial depth estimate propagated through the entire EKF state vector. Possible in theory, but running it in real time was a different question.

Davison chose monocular for practical reasons. An IMU was extra hardware, and stereo carried a calibration burden. What he wanted was "to prove it with one camera." If the proof went through, the rest could be stacked on top. That logic was right. What remained in doubt was whether the EKF was the kind of structure that could actually take on that "rest."

---

## 2. The beauty and the wall of the EKF

[MonoSLAM](https://doi.org/10.1109/TPAMI.2007.1049), published in IEEE PAMI in 2007 with Davison, Ian Reid, Nicholas Molton, and Olivier Stasse as co-authors, was the finished-paper form of the ICCV 2003 demo.

MonoSLAM's state vector transplanted the formulation of [Smith-Cheeseman (1988)](https://arxiv.org/abs/1304.3111) and [Leonard-Durrant-Whyte (1991)](https://ieeexplore.ieee.org/document/174711/) (Ch.4) directly onto a monocular camera. A camera state $\mathbf{x}_v \in \mathbb{R}^{13}$ — position 3, quaternion orientation 4, velocity 3, angular velocity 3 — together with the set of landmarks $\mathbf{y}_i \in \mathbb{R}^3$ were packed into a single vector $\mathbf{x} = (\mathbf{x}_v^\top, \mathbf{y}_1^\top, \ldots, \mathbf{y}_N^\top)^\top \in \mathbb{R}^{13+3N}$, and the full $(13+3N)\times(13+3N)$ covariance $\mathbf{P}$ was maintained frame by frame in a predict-update loop. The predict step propagated the covariance through the Jacobian $\mathbf{F}$ of the camera motion model $f$ ($\mathbf{P}^- = \mathbf{F}\mathbf{P}\mathbf{F}^\top + \mathbf{Q}$); the update step computed the Kalman gain from the Jacobian $\mathbf{H}_i$ of the projection function and refreshed state and covariance. The EKF predict-update equations themselves are identical to those in Ch.4 §4.3. What changed was that the state vector now carried camera velocity and angular velocity along with pose (a camera is a moving body and needs its dynamics modeled).

The dominant cost in the covariance update $(\mathbf{I} - \mathbf{K}_i\mathbf{H}_i)\mathbf{P}^-$ came from a $(13+3N)^2$ matrix multiplication — $O(N^2)$ in the number of landmarks $N$. §III of the paper states that the upper limit on the number of features sustainable at 30 Hz real-time processing was "about 100."

> 🔗 **Borrowed.** MonoSLAM's EKF state vector structure transplants the Smith-Cheeseman-Durrant-Whyte (1988-1991) probabilistic spatial-relations representation directly onto a monocular camera. The Kalman filter itself had been around since 1960, but the practice of putting robot pose and landmarks into the same "augmented state vector" was established in the Leonard-Durrant-Whyte 1991 style.

That number exposed the system's ceiling. Davison knew it. In the paper he suggested extensions to a sub-mapping strategy as a future direction. Building hierarchical structure inside an EKF was hard, though. The covariance matrix carried every correlation between every pair of landmarks, with none omitted.

Reading [Shi-Tomasi (1994)](https://doi.org/10.1109/CVPR.1994.323794) corners as MonoSLAM's visual feature of choice fits the same context. The selection criterion in "Good Features to Track" was defined by trackability itself: a patch qualifies when the smaller eigenvalue of its $2 \times 2$ image-gradient matrix clears a threshold, which is the condition under which its motion is well determined. If only corners unlikely to fail tracking in the first place enter the state vector, the EKF update stays more stable. The PAMI paper states that map management is configured so that about 12 features stay stably visible per frame with a wide-angle lens. As long as that bounded set of features all tracked well, the EKF ran.
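
The Shi-Tomasi score is the smaller eigenvalue of the $2 \times 2$ matrix built from image gradients over a patch. The sketch below computes it in closed form for two hand-made gradient lists; the patches are invented to contrast an edge with a corner.

```python
import math

def shi_tomasi_score(gradients):
    """Smaller eigenvalue of [[a, b], [b, c]] built from patch gradients.

    gradients: list of (Ix, Iy) image-gradient pairs inside one patch.
    """
    a = sum(ix * ix for ix, _ in gradients)
    b = sum(ix * iy for ix, iy in gradients)
    c = sum(iy * iy for _, iy in gradients)
    return 0.5 * (a + c) - math.sqrt((0.5 * (a - c)) ** 2 + b * b)

edge_patch = [(1.0, 0.0)] * 9                         # intensity varies along x only
corner_patch = [(1.0, 0.0)] * 5 + [(0.0, 1.0)] * 4    # variation in both directions

edge_score = shi_tomasi_score(edge_patch)
corner_score = shi_tomasi_score(corner_patch)
```

The edge scores zero because motion along the edge is invisible in the patch. The corner scores 4, so both components of its motion can be recovered, and that is the property a tracked feature needs.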

> 🔗 **Borrowed.** The Shi-Tomasi 1994 corner detector was already in use in MonoSLAM, not first in PTAM. The design philosophy of "select good features, then track them" is the direct Shi-Tomasi → MonoSLAM → PTAM lineage.

---

## 3. 2007, the same year

That bounded number exposed the ceiling of the EKF. The person at Oxford looking up at that ceiling was Klein.

In 2007, Klein and Murray put [Parallel Tracking and Mapping for Small AR Workspaces](https://doi.org/10.1109/ISMAR.2007.4538852) into ISMAR. The same year, PAMI carried the finished version of Davison's MonoSLAM. The two papers appearing in a single year was not coincidence.

Klein was then a doctoral student in Murray's group. The Murray group was the direct descendant of the Oxford Active Vision Laboratory, the very room where, a few years earlier, Davison had been a doctoral student. Murray had been Davison's advisor. Klein could not have missed MonoSLAM. What he took from it was not the EKF but the bare fact that a monocular camera ran in real time.

The possibility had been confirmed, and what remained was "how to scale it up." Klein decided to throw out the EKF.

---

## 4. The split

PTAM's core idea was one thing. Separate tracking (camera pose tracking) and mapping (3D map construction) and run them in two parallel threads.

In the EKF the two were tangled inside the same loop. One predict-update cycle per frame: predict the state when the camera moves, then update again once landmarks are found in the image.

PTAM unwound this. The tracking thread does nothing but estimate the camera pose every frame. It matches the 2D projections of the 3D points visible from the current keyframe set against the actual observations and computes the pose in real time. The mapping thread runs bundle adjustment every time a new keyframe is added. Because the tracking thread runs independently, a slow mapping thread does not stall it.
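The two-thread split can be sketched with Python's standard `threading` and `queue` modules. This is a toy under loud assumptions: pose estimation and bundle adjustment are replaced by placeholders, keyframes are chosen every tenth frame, and the name `run_ptam_like` is hypothetical. It only shows the control flow, in which the tracker never waits for the mapper.

```python
import queue
import threading
import time

def run_ptam_like(num_frames=30, keyframe_every=10, mapping_delay=0.01):
    keyframe_queue = queue.Queue()   # tracker -> mapper hand-off
    map_keyframes = []               # stand-in for the shared map
    map_lock = threading.Lock()
    tracked = []                     # (frame_id, map size seen by the tracker)

    def tracking():
        for frame_id in range(num_frames):
            with map_lock:
                map_size = len(map_keyframes)     # read whatever map exists right now
            tracked.append((frame_id, map_size))  # placeholder for pose estimation
            if frame_id % keyframe_every == 0:
                keyframe_queue.put(frame_id)      # hand the keyframe over, do not wait
        keyframe_queue.put(None)                  # sentinel: no more keyframes

    def mapping():
        while True:
            keyframe = keyframe_queue.get()
            if keyframe is None:
                break
            time.sleep(mapping_delay)             # placeholder for slow bundle adjustment
            with map_lock:
                map_keyframes.append(keyframe)

    threads = [threading.Thread(target=tracking), threading.Thread(target=mapping)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return tracked, map_keyframes

tracked, map_keyframes = run_ptam_like()
print(len(tracked), map_keyframes)
```

Every frame is tracked regardless of how long the mapper sleeps; the mapper simply works through the keyframe queue at its own pace.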

The bundle adjustment on the mapping thread minimized the sum of reprojection errors over a keyframe set $\mathcal{K}$ and a 3D point set $\mathcal{P}$:
$$\min_{\{\mathbf{T}_k\}, \{\mathbf{p}_j\}} \sum_{k \in \mathcal{K}} \sum_{j \in \mathcal{P}_k} \rho\!\left(\left\|\mathbf{z}_{kj} - \pi(\mathbf{T}_k,\, \mathbf{p}_j)\right\|^2_{\mathbf{\Sigma}_{kj}}\right)$$
where $\mathbf{T}_k \in SE(3)$ is the pose of keyframe $k$, $\mathbf{p}_j \in \mathbb{R}^3$ is a 3D point, $\pi$ is the camera projection function, $\mathbf{z}_{kj}$ is the observed pixel coordinate of point $j$ in keyframe $k$, $\mathbf{\Sigma}_{kj}$ is the measurement covariance, and $\rho$ is a robust kernel such as the Huber function. The mapping thread solved this optimization iteratively with Levenberg–Marquardt. Running asynchronously, it did not affect the real-time behavior of the tracking thread.
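The cost above can be evaluated directly, as in the following minimal sketch. Several things here are assumptions for illustration only: a pinhole model with made-up intrinsics, an isotropic measurement covariance $\mathbf{\Sigma}_{kj} = \sigma^2 \mathbf{I}$, the Huber kernel written as a function of the squared error, and the helper names `project`, `huber`, and `ba_cost`. It evaluates the objective only and does not run Levenberg–Marquardt.

```python
import numpy as np

def project(R, t, p, f=500.0, c=(320.0, 240.0)):
    """Pinhole projection of world point p under pose (R, t); intrinsics are assumed."""
    pc = R @ p + t
    return np.array([f * pc[0] / pc[2] + c[0], f * pc[1] / pc[2] + c[1]])

def huber(s, delta=1.0):
    """Huber kernel applied to a squared error s = ||r||^2."""
    return s if s <= delta ** 2 else 2.0 * delta * np.sqrt(s) - delta ** 2

def ba_cost(poses, points, observations, sigma=1.0):
    """Sum of robustified, whitened reprojection errors over all (keyframe, point) pairs."""
    total = 0.0
    for (k, j), z in observations.items():
        R, t = poses[k]
        r = (z - project(R, t, points[j])) / sigma
        total += huber(float(r @ r))
    return total

poses = {0: (np.eye(3), np.zeros(3)),
         1: (np.eye(3), np.array([-0.1, 0.0, 0.0]))}
points = {0: np.array([0.0, 0.0, 2.0]),
          1: np.array([0.5, -0.2, 3.0])}
observations = {(k, j): project(*poses[k], points[j]) for k in poses for j in points}
cost_clean = ba_cost(poses, points, observations)

# Corrupt one observation by 10 pixels: the Huber kernel charges it
# 2 * 1 * 10 - 1 = 19 instead of the quadratic 100.
observations_bad = dict(observations)
observations_bad[(1, 1)] = observations[(1, 1)] + np.array([10.0, 0.0])
cost_outlier = ba_cost(poses, points, observations_bad)
print(cost_clean, cost_outlier)
```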

> 🔗 **Borrowed.** The bundle adjustment run on PTAM's mapping thread is a direct application of [Triggs et al. 1999 "Bundle Adjustment — A Modern Synthesis"](https://doi.org/10.1007/3-540-44480-7_21). The hundred-year photogrammetry tradition covered in Part I finally took its proper place in a SLAM backend here. In the EKF, full BA had been impossible because of the size limits of the covariance matrix. Splitting threads lifted that limit.

The split looks simple, but its consequences were not. Because the mapping thread ran bundle adjustment asynchronously, the number of landmarks that could enter the map stepped outside the EKF's $O(N^2)$ constraint. PTAM used keyframes in the hundreds. Each keyframe held hundreds of patch features. That was a different world from MonoSLAM's tens of landmarks.

The initial map-building method was different too. In PTAM's initialization, with the user slowly moving the camera, the system estimated the essential matrix with a 5-point algorithm along the [Nistér 2004](https://doi.org/10.1109/TPAMI.2004.17) line (the PTAM paper cites the follow-up Stewénius·Engels·Nistér 2006) and recovered the initial 3D structure from the first keyframe pair. That too was borrowed.

The essential matrix $\mathbf{E}$ is a $3\times 3$ matrix capturing the pure geometric relation between two camera frames, satisfying ${\mathbf{p}'}^\top \mathbf{E}\, \mathbf{p} = 0$ for corresponding point pairs $(\mathbf{p}, \mathbf{p}')$ in normalized image coordinates. $\mathbf{E}$ decomposes as $\mathbf{E} = \mathbf{t}_\times \mathbf{R}$ ($\mathbf{t}_\times$ is the skew-symmetric matrix of the translation, $\mathbf{R}$ is the rotation). Rotation contributes 3 degrees of freedom and translation 3, and the overall scale is unobservable, so $\mathbf{E}$ has 5 degrees of freedom. Five point correspondences therefore suffice to narrow $\mathbf{E}$ down to a finite set of candidates (up to ten real solutions), and additional correspondences select among them. Nistér's contribution was to reduce this 5-point system to a tenth-degree polynomial that could be solved fast enough to run inside a RANSAC loop in real time; the Stewénius·Engels·Nistér 2006 follow-up recast the solver with a Gröbner basis. PTAM used this kind of solver together with RANSAC during initialization to estimate the relative pose between the first two keyframes and triangulated the initial 3D point cloud.
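The epipolar constraint is easy to check numerically. The sketch below builds $\mathbf{E} = \mathbf{t}_\times \mathbf{R}$ from an arbitrary relative pose, assuming the convention $\mathbf{X}' = \mathbf{R}\mathbf{X} + \mathbf{t}$ between the two camera frames, and verifies the constraint on five synthetic points. It is not a 5-point solver; the pose, the points, and the helper names are made up for illustration.

```python
import numpy as np

def skew(t):
    """Skew-symmetric matrix t_x such that t_x @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def rot_y(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

R = rot_y(0.1)                       # arbitrary relative rotation
t = np.array([0.2, 0.0, 0.05])       # arbitrary relative translation
E = skew(t) @ R

rng = np.random.default_rng(1)
X = rng.uniform([-1, -1, 3], [1, 1, 6], size=(5, 3))   # points in camera-1 frame
X2 = X @ R.T + t                                        # same points in camera-2 frame
p = X / X[:, 2:3]                                       # normalized image coordinates
p2 = X2 / X2[:, 2:3]

residuals = np.einsum("ni,ij,nj->n", p2, E, p)          # p'^T E p for each pair
singular_values = np.linalg.svd(E, compute_uv=False)
print(residuals, singular_values)
```

The residuals vanish up to floating-point error, and they would vanish for any rescaling of `E` as well, which is the lost degree of freedom. The singular values come out as two equal values and a zero, the signature of an essential matrix.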

> 🔗 **Borrowed.** PTAM's 5-point essential-matrix initialization follows the minimal-solver lineage opened by David Nistér 2004 "An Efficient Solution to the Five-Point Relative Pose Problem" (the PTAM paper directly cites the follow-up Stewénius·Engels·Nistér 2006 ISPRS). The 5-point solver was a minimal solver that used the smallest number of correspondences needed to build a monocular camera's initial map, and PTAM put this solver inside a RANSAC loop to estimate the relative pose of the first two keyframes at near-real-time speed.

> 🔗 **Borrowed.** PTAM's keyframe structure traces back to the Leonard-Durrant-Whyte submap idea. The notion that "if the full map is hard to optimize at once, break it into regions" was expressed in PTAM as a set of keyframes. The covisibility graph of the subsequent ORB-SLAM is a more refined version of this keyframe management.

---

## 5. The diffusion of the new architecture

PTAM was designed for AR (augmented reality) workspaces. The paper's title states "Small AR Workspaces" explicitly. The tracking thread had good reproducibility and solid real-time behavior, so it could be dropped into AR applications directly.

Commercial absorption was fast. In the early 2010s Metaio (a German AR startup, acquired by Apple in 2015) and Qualcomm's Vuforia SDK adopted a tracking/mapping split structure similar to PTAM's. Stable planar AR ran on consumer smartphones for the first time.

The academic effect was more direct. [ORB-SLAM](https://arxiv.org/abs/1502.00956), published in 2015 by Raul Mur-Artal, J.M.M. Montiel, and Juan D. Tardós, inherited PTAM's structure. It swapped patch features for ORB descriptors, refined keyframe management with a covisibility graph, and added loop closure on top. Without PTAM, ORB-SLAM's blueprint would have been different.

Qin, Li, and Shen's [VINS-Mono](https://arxiv.org/abs/1708.03852) (2018) also has a two-threaded structure of sliding-window optimization plus loop closure. It is a case of the tracking/mapping-split lineage extending into Visual-Inertial Odometry (VIO).

---

## 6. Davison vs Klein & Murray — a view comparison

Two papers came out in 2007. MonoSLAM PAMI was the finished version of the 2003 demo. PTAM came out the same year, with a new structure that broke through MonoSLAM's limits.

The reason MonoSLAM stayed with the EKF lay in probabilistic consistency. The EKF managed state uncertainty explicitly through the covariance matrix. The math tracked how uncertain each landmark in the map was and how covariances between landmarks were connected. From this viewpoint, bundle adjustment was a least-squares optimization — a trade that gave up uncertainty representation in exchange for scalability.

Klein & Murray paid that price willingly. What mattered in AR applications was real-time tracking of the camera pose. There was no need to track map uncertainty at the centimeter level. Refining the map periodically through bundle adjustment was enough.

After that, the field tilted toward this trade. From the 2010s on, graph-based optimization and bundle adjustment became mainstream, and EKF-SLAM mostly stepped back from the front except in applications where compute resources are extremely limited. The probabilistic concern MonoSLAM held on to did not vanish, however. Instead of jumping across into the PTAM lineage, Davison's lab shifted direction over several steps toward factor-graph-based estimation and then toward Gaussian Belief Propagation (GBP) and the Robot Web. Twenty-three years later in SLAM Handbook Ch.18, Davison reads the same flow as a sequence of representation changes going EKF → BA → factor graph → GBP. There is no passage where he names MonoSLAM and evaluates it directly; he recasts the whole thing into the general principle that each change of representation triggers a redesign of the system.

---

## 📜 Prediction vs. outcome

> **Davison 2007 PAMI MonoSLAM**: In the Conclusion, Davison named larger indoor and outdoor environments, faster motion, and complex scenes with occlusion and lighting changes as the next tasks. As concrete means he raised a sub-map strategy and CMOS cameras running above 100 Hz, and he also mentioned room for extending the sparse map to a dense representation of "higher-order entities" (surfaces, etc.).
>
> These predictions met different fates. The sub-map idea was partially absorbed via PTAM's keyframe structure and ORB-SLAM's covisibility graph. No system, however, achieved hierarchical scaling while keeping the EKF — the layering arrived together with the shift to BA-based architectures. High-frame-rate cameras took concrete form along a different path in 2010s event-camera research. Robustness to dynamic scenes is still open as of 2026. Attempts like DynaSLAM and FlowSLAM exist, but no "solution built into the baseline pipeline" has appeared yet. IMU integration was not flagged directly by Davison in Future Work (though related work is referenced in the body), and the 2010s VIO research boom took that direction. The probabilistic-consistency concern itself was not scrapped but migrated toward factor graphs and GBP — twenty-three years on, in Handbook Ch.18, Davison himself describes this migration as "a sequence of representation changes" and relocates MonoSLAM as one step in the lineage. `[in progress]`

> **Klein & Murray 2007 PTAM**: In §8 (Failure modes / Mapping inadequacies), Klein and Murray listed the system's limitations: corner-based tracking's vulnerability to motion blur, the geometric poverty of a point-cloud-centric map, and "not designed to close large loops in the SLAM sense." They stated plainly that global consistency across large loops was outside PTAM's design scope.
>
> In 2015 ORB-SLAM aimed squarely at those limitations. [DBoW2](http://doriangalvez.com/papers/GalvezTRO12.pdf)-based appearance loop closure and covisibility-graph-based keyframe management were added, and the features were swapped from patches to ORB descriptors. Where PTAM drew a line saying "not our problem," ORB-SLAM picked up the map-scaling task. Klein & Murray themselves did not write that "appearance-based loop closure is the answer" in so many words, but their flagging of the limit points landed exactly as the starting point of the subsequent lineage. `[partial hit]`

---

## 🧭 Still open

**Monocular scale recovery.** From MonoSLAM to PTAM, every monocular system carries scale ambiguity. That absolute distance cannot be known from a single image is a geometric fact. Add an IMU and scale becomes observable through gravity direction and accelerometer readings. In pure monocular systems without an IMU, however, scale recovery remains unsolved even in 2026. Learning-based monocular depth estimation ([MiDaS](https://arxiv.org/abs/1907.01341), [Depth Anything](https://arxiv.org/abs/2401.10891)) estimates relative depth from a single image, but converting it to metric scale still requires an external reference (a ground-plane assumption, a known object size, and so on).
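The geometric fact behind the ambiguity fits in a few lines. In this minimal sketch (synthetic points, an arbitrary translation, and an arbitrary scale factor `s`, all assumed for illustration), scaling the scene and the camera translation by the same factor leaves every image measurement unchanged.

```python
import numpy as np

def project(R, t, X):
    """Normalized pinhole projection of points X (N x 3) under pose (R, t)."""
    Xc = X @ R.T + t
    return Xc[:, :2] / Xc[:, 2:3]

rng = np.random.default_rng(2)
X = rng.uniform([-1, -1, 2], [1, 1, 5], size=(20, 3))   # synthetic scene
R = np.eye(3)
t = np.array([0.3, 0.0, 0.1])
s = 7.5                                                 # arbitrary scale factor

obs = project(R, t, X)
obs_scaled = project(R, s * t, s * X)                   # scene and baseline both scaled
print(np.max(np.abs(obs - obs_scaled)))
```

Since the two observation sets coincide, no amount of image data alone can distinguish the two worlds; an IMU or an external reference has to supply the missing scale.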

**Environmental generality of a single VO system.** MonoSLAM handled only indoor desktop scenes. PTAM self-limited its scope to "Small AR Workspaces." ORB-SLAM2 later tried to span indoor, outdoor, and RGB-D, but tracking failure still occurs in scenes with extreme lighting changes or low-texture spaces. As of 2026 no single pipeline robustly handles indoor corridors, outdoor downtowns, nighttime environments, and textureless white walls all at once. Multi-modal fusion (camera + LiDAR + IMU) covers some of this, but the generality of a camera-only system is still unsettled.

**Feature tracking in low light and dynamic scenes.** What MonoSLAM required was sufficient lighting and a static scene. PTAM in 2007 was the same. As of 2026 these two assumptions still hold implicitly in most feature-based SLAM systems. ORB features fail to detect at all in low light, and in scenes crowded with moving people, dynamic points get misclassified as static. Attempts to route around the problem with learning-based optical flow or semantic segmentation exist, but no system has settled in as a real-time, general-purpose solution.

---

The tracking/mapping split that PTAM established left one thing unsolved. As keyframes piled up, the accumulated error blew up at loops. The map grew; the drift grew with it. Closing a loop (correcting the accumulated error when the camera returned to a known place) was declared out of scope in the PTAM paper itself. The answer to that problem had been in preparation elsewhere for a decade.

---

# Ch.6 — The Graph SLAM Revolution

York University in Toronto, 1997. Feng Lu and Evangelos Milios were wrestling with the problem of aligning multiple laser scans into a globally consistent map. EKF was the default choice, but the two of them took a different path. They modeled the relative measurements between poses directly as a graph, and ran least-squares optimization on that graph. The result was the kind of global consistency the Kalman family had never reached. Lu-Milios were not the sole founders of this direction. More than a decade earlier, [Chatila and Laumond (1985)](https://www.semanticscholar.org/paper/Position-referencing-and-consistent-world-modeling-Chatila-Laumond/c34a678e40a7d80cb3683f07fc837179fd9bf3ee) at LAAS had already discussed reference frames and consistent world models for mobile robots in the language of smoothing; in 1999, [Gutmann and Konolige](https://www.semanticscholar.org/paper/Incremental-mapping-of-large-cyclic-environments-Gutmann-Konolige/3c1bda51b8ca59f1836ed1b96c485d905804989a) applied pose graph matching to incremental mapping of large cyclic environments; and in the early 2000s Thrun's group formalized the approach as the *full SLAM* problem and put it on a commercial trajectory. [Folkesson and Christensen (2004)](http://www.hichristensen.net/hic-papers/folkesson-icra2004.pdf), Konolige, and Dellaert followed with formulations of their own. The reason Lu-Milios 1997 is the most cited today is that it presented a complete pipeline ("laser scan matching plus batch least-squares") in finished form, not that it opened the direction alone. If Smith-Cheeseman laid the mathematical foundation of probabilistic mapping and Davison proved the feasibility of real-time monocular SLAM, these parallel contributors were, nearly simultaneously, making several different moves that redefined SLAM as a graph inference problem.

Klein and Murray's PTAM (2007) had split tracking and mapping to achieve real-time performance, but it left large-loop closure outside its design scope, and in EKF-based systems the $O(N^2)$ update cost remained the bottleneck as landmarks accumulated. The answer to both problems had been in preparation for a decade, in its own forms, in labs at York, LAAS, CMU, Stanford, and KTH.

---

## 6.1 From Laser Scans to Pose Graphs: Lu-Milios 1997

Before [Lu & Milios 1997, "Globally Consistent Range Scan Alignment"](https://doi.org/10.1023/A:1008854305733) appeared, alignment of successive laser scans was often handled by stitching together local matches from the ICP (Iterative Closest Point) family. ICP aligned two scans well locally, but as drift accumulated the map twisted after tens of meters. When the robot came back to close a loop, the starting point and the map no longer matched.

The Lu-Milios idea was simple. Represent the robot's pose sequence $x_1, x_2, \ldots, x_n$ as nodes, and the relative measurement between each pose pair as an edge; then the map-building problem becomes an energy minimization on the graph. Each edge carries the measured relative transform $z_{ij}$ between two poses and its information matrix $\Omega_{ij}$. The full cost function is

$$F = \sum_{(i,j) \in \mathcal{E}} e_{ij}^T \Omega_{ij} e_{ij}, \quad e_{ij} = z_{ij} - h(x_i, x_j)$$

where $h(x_i, x_j)$ computes the expected relative transform from the two poses, $z_{ij}$ is the actual measured relative transform, and $\Omega_{ij} = \Sigma_{ij}^{-1}$ is the information matrix, the inverse of the measurement uncertainty.

The core of this formulation is the natural inclusion of loop closures. When the robot later revisits the same place and obtains a new relative measurement, adding it to the graph as an edge makes the full optimization adjust all poses to reflect that constraint. In EKF, closing a loop was a heavy operation that updated the covariance at $O(N^2)$ cost. In a pose graph, adding a single edge is enough.
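A one-dimensional toy makes the effect of a loop-closure edge visible. In this minimal sketch the poses are scalars and $h(x_i, x_j) = x_j - x_i$, so the cost is linear least squares and one solve of the normal equations is exact. The measurement values, the weights, the prior that anchors $x_0$, and the function name `solve_pose_graph_1d` are all assumptions chosen for illustration; real pose graphs live on $SE(2)$ or $SE(3)$ and need iterative relinearization.

```python
import numpy as np

def solve_pose_graph_1d(num_poses, edges, prior_weight=1.0):
    """Minimize sum of omega * (z - (x_j - x_i))^2 with x_0 anchored at 0.

    edges: list of (i, j, z_ij, omega_ij).
    """
    H = np.zeros((num_poses, num_poses))   # information matrix of the whole graph
    b = np.zeros(num_poses)
    H[0, 0] += prior_weight                # prior on x_0 removes the gauge freedom
    for i, j, z, omega in edges:
        a = np.zeros(num_poses)
        a[i], a[j] = -1.0, 1.0             # x_j - x_i == a @ x
        H += omega * np.outer(a, a)
        b += omega * z * a
    return np.linalg.solve(H, b)

# Odometry over-estimates each 1.0 m step as 1.1 m.
odometry = [(0, 1, 1.1, 1.0), (1, 2, 1.1, 1.0), (2, 3, 1.1, 1.0)]
# A single, more trusted loop-closure edge says pose 3 is 3.0 m from pose 0.
loop_closure = [(0, 3, 3.0, 10.0)]

x_open = solve_pose_graph_1d(4, odometry)
x_closed = solve_pose_graph_1d(4, odometry + loop_closure)
print(x_open)
print(x_closed)
```

Without the loop closure the solution simply integrates odometry. With it, the correction is spread evenly over all three steps instead of being applied only to the last pose, which is the behavior the paragraph above describes.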

> 🔗 **Borrowed.** The Lu-Milios formulation of pose graph optimization rests on the nonlinear least-squares algorithms of [Levenberg (1944)](https://www.ams.org/qam/1944-02-02/S0033-569X-1944-10666-0/) and [Marquardt (1963)](https://www.stat.cmu.edu/technometrics/70-79/VOL-14-03/v1403757.pdf). A numerical optimization technique developed decades earlier for nonlinear parameter estimation arrived at the backend of indoor laser mapping.

The Lu-Milios solution of the time was a batch linear system that solved for all poses at once. As the number of scans grew, the size of the linear system grew with it. So it was closer to a proof of concept than a field-ready system. It did, however, show two things clearly. Global consistency is achievable. And the tool is optimization, not a filter. In the same period Gutmann-Konolige emphasized incrementality, Folkesson-Christensen robustness of data association, and Thrun's group application at real-world scale — each carving a different facet of the same conclusion.

---

## 6.2 Discovering Sparsity: The Information Matrix and the Pose-Graph Extension

In the five years after the Lu-Milios idea was published, several groups pushed extensions in the same direction. The common discovery was the **sparsity** of the information matrix ($\Omega = \Sigma^{-1}$).

The covariance matrix $\Sigma$ of EKF-SLAM is dense. Every time the robot observes a new landmark, its correlation with every existing landmark is updated. With the robot pose marginalized and $n$ 2D landmarks, $\Sigma$ is a $2n \times 2n$ matrix and the update cost is $O(n^2)$. That is why real-time performance collapsed around 100 landmarks.

The information matrix of a pose graph is different. A nonzero term appears in the $(i,j)$ block of $\Omega$ only when poses $x_i$ and $x_j$ are in a direct measurement relation. Under continuous motion only nearby poses are connected by edges; distant poses are not connected directly. $\Omega$ therefore has a banded sparse structure that reflects the graph topology. In a pure driving scenario without loop closures the structure is block-tridiagonal.
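The contrast between a sparse $\Omega$ and a dense $\Sigma$ shows up already on an eight-pose chain. The sketch below assumes scalar poses, unit-weight edges, and a unit prior on the first pose (all illustrative choices), and counts nonzero entries.

```python
import numpy as np

def information_matrix(num_poses, edges):
    """Information matrix of a 1D pose graph with unit-weight relative edges."""
    omega = np.zeros((num_poses, num_poses))
    omega[0, 0] += 1.0                     # prior anchoring the first pose
    for i, j in edges:
        omega[i, i] += 1.0
        omega[j, j] += 1.0
        omega[i, j] -= 1.0
        omega[j, i] -= 1.0
    return omega

def count_nonzeros(M, tol=1e-9):
    return int(np.sum(np.abs(M) > tol))

n = 8
chain = [(k, k + 1) for k in range(n - 1)]                 # odometry only
omega_chain = information_matrix(n, chain)
omega_loop = information_matrix(n, chain + [(0, n - 1)])   # plus one loop closure
sigma_chain = np.linalg.inv(omega_chain)                   # the matching covariance

print(count_nonzeros(omega_chain), count_nonzeros(omega_loop), count_nonzeros(sigma_chain))
```

The chain's information matrix is tridiagonal, the loop closure adds exactly one symmetric pair of off-band entries, and the covariance of the very same chain has no zero entries at all.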

Sebastian Thrun's group's [Sparse Extended Information Filter (SEIF)](http://www.cs.cmu.edu/~thrun/papers/thrun.tr-seif02.pdf) and Edwin Olson's work started exploiting this sparsity explicitly. With a sparse linear solver, the computational cost could drop far below $O(n^2)$. The actual complexity depended on the graph structure, but in realistic scenarios where the robot moves within a bounded region, $O(n \log n)$ became reachable.

> 🔗 **Borrowed.** Thrun's sparse information filter (SEIF) and [Eustice's exactly sparse delayed-state filter](https://web.mit.edu/2.166/www/handouts/eustice_et_al_ieeetro_2006.pdf) showed that the sparsity of the information matrix was usable even in filter-based form. This sparsity insight frames the context that leads into Dellaert's factor graph formulation and the Bayes tree data structure.

At ICRA 2006, [Olson, Leonard, and Teller](https://april.eecs.umich.edu/pdfs/olson2006icra.pdf) presented a method for optimizing pose graphs with stochastic gradient descent. There was no convergence guarantee. It still ran fast enough on graphs of hundreds of nodes, and Olson's implementation spread across the community afterward.

---

## 6.3 Factor Graphs and Square Root SAM

In 2006, Dellaert and his then-PhD-student Kaess published [Square Root SAM](https://doi.org/10.1177/0278364906072768), which rewrote the way SLAM backends were understood. Dellaert had been working on probabilistic graphical models at Georgia Tech. He saw SLAM as a Bayesian inference problem and judged that performing that inference on a factor graph was the most natural form.

In a **factor graph** (a bipartite graph with two kinds of nodes, variables and factors, in which an edge joins each factor to the variables it constrains), variable nodes are the robot poses and landmark positions, and factor nodes are observations or priors. A factor $f_k(x_{i_1}, x_{i_2}, \ldots)$ expresses a probabilistic constraint among the variables it connects. The full joint probability is

$$p(X) \propto \prod_k f_k(X_k)$$

and MAP estimation finds the $X^*$ that maximizes this probability. Under Gaussian factors, this becomes a nonlinear least-squares problem.
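The step from Gaussian factors to least squares can be written out in one line. The symbols $h_k$ (measurement prediction), $z_k$ (measurement), and $\Sigma_k$ (measurement covariance) are introduced here for illustration, following the pose-graph notation of §6.1, with $\|e\|^2_{\Sigma} = e^T \Sigma^{-1} e$. Assuming each factor has the form

$$f_k(X_k) \propto \exp\!\left(-\tfrac{1}{2}\,\|h_k(X_k) - z_k\|^2_{\Sigma_k}\right),$$

taking the negative logarithm turns the product into a sum:

$$X^* = \arg\max_X \prod_k f_k(X_k) = \arg\min_X \sum_k \|h_k(X_k) - z_k\|^2_{\Sigma_k}.$$

Linearizing each $h_k$ around the current estimate and stacking the whitened rows gives the Jacobian $J$ on which the rest of this section operates.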

Dellaert's insight came from the structure of this least-squares problem. Applying QR decomposition to the Jacobian matrix $J$ leaves an upper triangular matrix $R$ with $R^T R = J^T J = \Omega$, which makes $R$ the "square root information matrix". The sparsity of $R$ is determined not by the Jacobian alone but by the variable elimination order together with the factor graph topology. Choosing a good ordering heuristic (e.g., AMD, COLAMD) keeps fill-in low and yields a sparse $R$.
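Both claims can be checked on small dense matrices. The sketch below is illustrative only: the Jacobian is random, the second part uses a hand-built information matrix in which one "hub" variable (for example a landmark seen from every pose) is connected to all others, and dense Cholesky stands in for the sparse factorization a real backend would use. No ordering heuristic such as COLAMD is run; the two orderings are written by hand.

```python
import numpy as np

# Part 1: the R factor of J is a square root of the information matrix.
rng = np.random.default_rng(3)
J = rng.standard_normal((12, 5))
R = np.linalg.qr(J, mode="r")                   # 5x5 upper triangular
sqrt_ok = np.allclose(R.T @ R, J.T @ J)

# Part 2: the elimination order decides how much fill-in R suffers.
def count_nonzeros(M, tol=1e-12):
    return int(np.sum(np.abs(M) > tol))

n = 6
omega_hub_last = 2.0 * np.eye(n)
omega_hub_last[-1, :] = -1.0                    # hub variable touches every other one
omega_hub_last[:, -1] = -1.0
omega_hub_last[-1, -1] = float(n)

perm = [n - 1] + list(range(n - 1))             # same matrix, hub eliminated first
omega_hub_first = omega_hub_last[np.ix_(perm, perm)]

R_hub_last = np.linalg.cholesky(omega_hub_last).T
R_hub_first = np.linalg.cholesky(omega_hub_first).T
print(sqrt_ok, count_nonzeros(R_hub_last), count_nonzeros(R_hub_first))
```

Eliminating the hub first couples all remaining variables and fills the whole upper triangle; eliminating it last keeps $R$ as sparse as $\Omega$ itself.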

This formulation is numerically more stable than the EKF covariance update. The full map of landmarks and poses can be optimized together in a consistent way, and a loop closure is expressed as the addition of a new factor.

---

## 6.4 iSAM and iSAM2: Online Incremental Inference

Square Root SAM was a batch method. Redoing the full factorization every time a new observation came in cost up to $O(n^3)$ in the dense worst case, and even with sparsity the work grew with the length of the whole trajectory. That was not practical in an online robot system.

In 2008, [Kaess, Ranganathan, and Dellaert published **iSAM** (incremental Smoothing and Mapping)](https://www.cs.cmu.edu/~kaess/pub/Kaess08tro.pdf), which approached this problem with Givens rotations. When a new variable and factor were added, instead of redoing the QR decomposition from scratch, only the new rows were appended and $R$ was updated via Givens rotations.
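The row-append step can be sketched as follows. This is a dense toy version written for illustration, not the iSAM implementation: the function name `givens_update` is hypothetical, the right-hand side vector that iSAM updates alongside $R$ is omitted, and no sparse storage is used.

```python
import numpy as np

def givens_update(R, new_row):
    """Fold one new measurement row into upper-triangular R using Givens rotations."""
    R = np.array(R, dtype=float)
    w = np.array(new_row, dtype=float)
    n = R.shape[1]
    for k in range(n):
        if w[k] == 0.0:
            continue                      # sparse rows skip most rotations
        r = np.hypot(R[k, k], w[k])
        c, s = R[k, k] / r, w[k] / r
        Rk = R[k, k:].copy()
        R[k, k:] = c * Rk + s * w[k:]     # rotated row of R
        w[k:] = -s * Rk + c * w[k:]       # entry w[k] is now zero
    return R

rng = np.random.default_rng(4)
J = rng.standard_normal((10, 4))
R0 = np.linalg.qr(J, mode="r")            # batch factor of the old problem

a = np.array([0.0, 0.0, 1.5, -0.7])       # new factor touching only the last two variables
R1 = givens_update(R0, a)

J_aug = np.vstack([J, a])                 # what a batch re-factorization would start from
print(np.allclose(R1.T @ R1, J_aug.T @ J_aug))
```

Because the new row is zero in its first two columns, only the last two rows of `R` are touched, which is why appending a local measurement is cheap.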

The intrinsic limit of iSAM1 was the relinearization schedule. The $R$ obtained by linearizing nonlinear factors is only a first-order approximation near the current estimate. As the robot moved and the estimate drifted from the linearization point, approximation error accumulated. iSAM1's response was **periodic full relinearization**. Every few dozen steps, the entire factor graph was relinearized from scratch and the QR decomposition was redone from scratch. Fill-in in $R$ caused by loop closures damaging the sparse structure was the visible symptom that triggered this schedule, but the root of the cost was the schedule itself — "periodically re-solve the whole thing". An algorithm that looked incremental reverted to a batch algorithm every cycle.

In 2012 [iSAM2](https://doi.org/10.1177/0278364911430419) solved this problem with a data structure called the Bayes tree. The Bayes tree is a tree built from the chordal Bayes net obtained by applying variable elimination to the factor graph. Its nodes are the cliques of the Bayes net, and its edges are the separators (shared variables between cliques). When a new factor is added, the cliques affected in the Bayes tree are identified, and only that subtree is turned back into a factor graph, relinearized, and reoptimized. The core is **fluid relinearization**. Only factors whose linearization error exceeds a threshold are selectively relinearized, and the effect propagates through Bayes tree separators only as far as needed. iSAM1's "everything, every cycle" schedule was replaced by "only the necessary factors, only the affected cliques". Even when a loop closure occurred, the set of connected cliques was often locally bounded, and full recomputation could be avoided.

> 🔗 **Borrowed.** The data-structural idea of the Bayes tree extends the junction tree (join tree) algorithm lineage in the probabilistic graphical models literature — the elimination-order and chordal-graph-based inference techniques covered in standard textbooks like Koller-Friedman's [*Probabilistic Graphical Models*](https://mitpress.mit.edu/9780262013192/probabilistic-graphical-models/) are representative. A line of technique from the AI inference community was transplanted into real-time robot SLAM.

iSAM2 was packaged as the [GTSAM (Georgia Tech Smoothing and Mapping)](https://gtsam.org) library. A C++ core with Python bindings on top. GTSAM development did not stop even after Dellaert later moved to Google. As of 2026 it is in effect the standard SLAM backend in areas ranging from autonomous driving and drones to robot arm calibration.

---

## 6.5 g2o: The Standard of the ROS Ecosystem

While the Georgia Tech group concentrated on refining the theory, Rainer Kümmerle, Giorgio Grisetti, Hauke Strasdat, Kurt Konolige, and Wolfram Burgard concentrated on a practical open-source implementation. Kümmerle, Grisetti, and Burgard were at Freiburg, Strasdat at Imperial College London, and Konolige at Willow Garage. At ICRA 2011, they presented [g2o](https://doi.org/10.1109/ICRA.2011.5979949) (general graph optimization), designed around the principle of "handle any kind of graph optimization in a plug-in manner". The authorship itself was already hybrid. The Freiburg robotics tradition of Burgard and Grisetti, Strasdat's monocular SLAM experience, and the industrial engineering sensibility Konolige brought merged into a single framework.

The g2o design separates three concepts. Vertices (variable nodes) and edges (factors or constraints) form the graph, and a solver handles the sparse linear system. The user defines vertex types and the error function and Jacobian of edges, and g2o then runs the full optimization with Gauss-Newton or Levenberg-Marquardt. The sparse solver can be chosen among Cholmod, CSparse, and Eigen, or swapped out for an external library.
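The three-way separation can be mimicked in a few dozen lines of Python. To be clear about what is assumed: this is not the g2o API. The classes `Vertex` and `RelativeEdge` and the function `gauss_newton` are hypothetical names, the edge type measures a plain vector difference rather than an $SE(3)$ transform, and a dense `np.linalg.solve` stands in for the pluggable sparse solver.

```python
import numpy as np

class Vertex:
    """A variable node holding a vector estimate."""
    def __init__(self, estimate, fixed=False):
        self.estimate = np.atleast_1d(np.asarray(estimate, dtype=float))
        self.fixed = fixed

class RelativeEdge:
    """Hypothetical edge type: measures v_j - v_i with information matrix omega."""
    def __init__(self, vi, vj, z, omega):
        self.vertices = (vi, vj)
        self.z = np.atleast_1d(np.asarray(z, dtype=float))
        self.omega = np.atleast_2d(omega)

    def error(self):
        vi, vj = self.vertices
        return (vj.estimate - vi.estimate) - self.z

    def jacobians(self):
        d = self.z.size
        return (-np.eye(d), np.eye(d))    # d(error)/d(v_i), d(error)/d(v_j)

def gauss_newton(vertices, edges, iterations=5):
    """Generic optimizer: it knows nothing about what the edges measure."""
    free = [v for v in vertices if not v.fixed]
    offset, dim = {}, 0
    for v in free:
        offset[id(v)] = dim
        dim += v.estimate.size
    for _ in range(iterations):
        H, b = np.zeros((dim, dim)), np.zeros(dim)
        for e in edges:
            err, jacs = e.error(), e.jacobians()
            for va, Ja in zip(e.vertices, jacs):
                if va.fixed:
                    continue
                ia = offset[id(va)]
                b[ia:ia + Ja.shape[1]] += Ja.T @ e.omega @ err
                for vb, Jb in zip(e.vertices, jacs):
                    if vb.fixed:
                        continue
                    ib = offset[id(vb)]
                    H[ia:ia + Ja.shape[1], ib:ib + Jb.shape[1]] += Ja.T @ e.omega @ Jb
        dx = np.linalg.solve(H, -b)       # a real backend would call a sparse solver here
        for v in free:
            i = offset[id(v)]
            v.estimate = v.estimate + dx[i:i + v.estimate.size]

v0 = Vertex([0.0, 0.0], fixed=True)
v1, v2 = Vertex([0.0, 0.0]), Vertex([0.0, 0.0])
edges = [RelativeEdge(v0, v1, [1.0, 0.0], np.eye(2)),
         RelativeEdge(v1, v2, [0.0, 1.0], np.eye(2)),
         RelativeEdge(v0, v2, [1.2, 1.0], np.eye(2))]   # slightly inconsistent "loop" edge
gauss_newton([v0, v1, v2], edges)
print(v1.estimate, v2.estimate)
```

Adding a new kind of constraint means writing another class with `error()` and `jacobians()`; the optimizer loop stays untouched. That is the plug-in idea in miniature.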

As ROS (Robot Operating System) established itself in the early 2010s as the standard platform for mobile robot research, g2o became the de facto SLAM backend standard. ORB-SLAM and LSD-SLAM built their backends on g2o, and Cartographer solved the same kind of pose-graph problem with Ceres. A researcher starting a new pose graph SLAM implementation would consider g2o as the first option.

---

## 6.6 Why the Field Converged Here

Chatila-Laumond (1985), Lu-Milios (1997), Gutmann-Konolige (1999), Folkesson-Christensen (2004), Thrun's group, Dellaert (2006), and Kaess (2012) — several groups, each with their own tools, at different times, arrived at the same conclusion that the EKF backend alone was not enough.

The core of the shift was not a swap of algorithms but a shift in problem modeling. EKF-SLAM maintains the best estimate of the current state and its uncertainty while marginalizing away the past. In this filter paradigm, past poses disappear and accumulated error hides inside the current estimate. Closing a loop requires a heavy update on the current covariance.

Graph SLAM does not throw the past away. Poses, landmarks, and observations all remain in the graph, and a loop closure is expressed as the addition of a new edge. Reoptimization adjusts the full trajectory consistently (for lines that reformulate the graph as a time-continuous trajectory rather than discrete keyframes, see Ch.7c Continuous-Time SLAM). That past poses themselves remain revisable is the essential difference from filters.

The computational cost was also different. EKF's update cost is $O(N^2)$ (in the number of landmarks $N$), and information storage is $O(N^2)$. Graph methods, using sparse Cholesky (or QR) decomposition, can cut the complexity substantially. In a realistic scenario where the robot moves within a bounded region — a sparsely connected graph — updates at the level of $O(N \log N)$ are generally possible. In large-scale long-term SLAM, this gap is hard to close.

> 📜 **Prediction vs. outcome.** The limitation of the batch form presented by Dellaert's Square Root SAM (2006) led directly to the incremental direction in the same group. iSAM in 2008 handled it with Givens-rotation-based incremental updates, and iSAM2 in 2012 raised the efficiency even for loop closure situations using the Bayes tree. GTSAM, Ceres, and g2o all compete on the same structure. In that the three papers form a lineage that resolves the same problem statement in stages, this line is close to a trajectory realized almost as predicted. `[hit]`

The flexibility of marginalization played a part as well. When an old pose in the graph is marginalized, its information is preserved as a linking factor among the remaining variables. The filter threw information away; the graph can compress while retaining it. Engineering trade-offs like sliding-window optimization and keyframe selection come in here.
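In information form, marginalization is a Schur complement, and the "linking factor" is the new off-diagonal entry it creates. A minimal sketch on a three-pose scalar chain follows; the matrix values and the function name `marginalize` are assumptions for illustration.

```python
import numpy as np

def marginalize(omega, idx):
    """Marginalize variable idx out of an information matrix via the Schur complement."""
    keep = [k for k in range(omega.shape[0]) if k != idx]
    O_kk = omega[np.ix_(keep, keep)]
    O_km = omega[np.ix_(keep, [idx])]
    return O_kk - O_km @ O_km.T / omega[idx, idx]

# Chain x0 - x1 - x2 with a unit prior on x0 and unit-weight odometry edges.
omega = np.array([[2.0, -1.0, 0.0],
                  [-1.0, 2.0, -1.0],
                  [0.0, -1.0, 1.0]])

omega_marg = marginalize(omega, 1)          # drop the middle pose x1
sigma = np.linalg.inv(omega)
sigma_sub = sigma[np.ix_([0, 2], [0, 2])]   # covariance of (x0, x2) in the full problem
print(omega_marg)
```

Before marginalization $x_0$ and $x_2$ share no direct entry in $\Omega$; afterwards they do. The inverse of the reduced matrix matches the $(x_0, x_2)$ block of the full covariance, so nothing about the remaining variables has been lost. The same mechanism is what produces fill-in when many old poses are removed.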

---

## 6.7 Nonlinearity and Robustness: The Layer of Practical Engineering

There is a gap between the theoretical elegance of graph optimization and the actual implementation. Closing that gap took up a good part of 2010s SLAM engineering.

The first problem is initial-value dependence. Gauss-Newton or LM optimization can converge to a local minimum when the initial pose estimate is far from the true value. A wrong correspondence in a loop closure corrupts the initial values. That is why loop closure verification and outlier rejection became a core task of the pre-backend stage. The line that bypasses this local-minimum problem itself through convex relaxation (SDP) and solves it in a form that provably certifies global optimality is treated separately in Ch.6b (Certifiable SLAM).

The second problem is outliers. Standard least-squares is fragile to them, as practice quickly revealed. Using a robust kernel like Huber or Cauchy cost reduces the influence of wrong matches. Both g2o and GTSAM make robust kernels selectable. Which kernel to use depends on environment and sensor characteristics, and as of 2026 this choice still rests on the engineer's experience.
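How a robust kernel tames a wrong match can be seen on the simplest possible estimation problem, a scalar mean with one gross outlier. The sketch below uses iteratively reweighted least squares with the standard Huber and Cauchy weight functions. The sample values, the kernel widths, and the function names are illustrative assumptions, and this is not how g2o or GTSAM expose their kernels.

```python
import numpy as np

def huber_weight(r, delta=1.0):
    """IRLS weight for the Huber kernel: 1 inside the band, delta/|r| outside."""
    return delta / np.maximum(np.abs(r), delta)

def cauchy_weight(r, c=1.0):
    """IRLS weight for the Cauchy kernel: decays quadratically for large residuals."""
    return 1.0 / (1.0 + (r / c) ** 2)

def irls_mean(samples, weight_fn, iterations=20):
    """Robust location estimate by iteratively reweighted least squares."""
    x = np.median(samples)
    for _ in range(iterations):
        w = weight_fn(samples - x)
        x = np.sum(w * samples) / np.sum(w)
    return x

# Five consistent measurements near 1.0 and one gross outlier (a "wrong match").
samples = np.array([1.0, 1.1, 0.9, 1.05, 0.95, 50.0])

plain = samples.mean()
huber_est = irls_mean(samples, huber_weight)
cauchy_est = irls_mean(samples, cauchy_weight)
print(plain, huber_est, cauchy_est)
```

The plain mean is dragged far from the inliers. Huber limits the outlier's pull to a constant, so a small bias remains, while Cauchy suppresses it almost entirely. That difference in behavior is exactly why the kernel choice depends on how bad the expected outliers are.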

The third problem is marginalization approximation. iSAM2's Bayes tree provides exact incremental inference, but as the variable count grows, the tree keeps growing. In real systems, old poses are marginalized to keep the tree size manageable. The fill-in generated during this marginalization can make the information matrix dense. How to truncate it, and how to approximate with a prior factor, separates implementation quality.

> 📜 **Prediction vs. outcome.** The "handle any graph optimization problem as a plug-in" generality that g2o claimed has been partly extended by systems that internally use complex geometric constraints like lines and planes (OpenVINS, the VINS-Fusion family, and so on) (for preintegration, the standard way to place IMU factors in the graph, see Ch.7b Preintegration). As of 2026, however, the g2o library itself puts more weight on interface stability and compatibility with existing users than on broad extension of the built-in factors, and new factor types are commonly stacked on from the user side via inheritance, forks, or wrappers. `[in progress]`

---

## 🧭 Still open

Which robust kernel to choose. Huber, Cauchy, Geman-McClure, DCS, and others are available, but there is no principled method for deciding in advance which kernel is optimal for a given environment and sensor. The choice still rests on the engineer's intuition and experience. There is research on optimizing the cost function itself in a learning-based way, but integrating it into an online incremental system is an unresolved problem.

Representing non-Gaussian situations inside a factor graph is still open. At present, factors in GTSAM and g2o nearly all assume Gaussian noise. Accurately representing situations like loop-closure mismatch probability and multi-hypothesis poses is hard both theoretically and computationally. Attempts like the max-mixture model exist, but there is no general solution.

The Bayes tree is efficient when the number of loop closures is small. In scenarios where a vehicle drives tens of kilometers over hours and generates thousands of loop closures, the tree structure becomes complex and memory efficiency drops. This bottleneck has been reported in GTSAM's actual application to autonomous driving data, and combining hierarchical tree management or submap partitioning is one of the current research directions.

---

Entering the 2010s, the backend debate quieted down. As g2o and GTSAM became de facto standards, researchers' attention moved to what to put on top of the backend. Rather than "how to close a loop", the question became "with which features, and from how far away, to recognize a loop". The front end was the new competitive stage.

One thread left unresolved here is whether the solutions g2o and GTSAM return are actually the global minimum. That question — certifiability — is the subject of Ch.6b, which can be read as a supplement before continuing to Ch.7. The main line proceeds to Ch.7 regardless.

---

# Ch.6b — Certifiable SLAM: Past the Local Minimum

The lineage Ch.6 recorded, from Lu-Milios to g2o and GTSAM, left one thing unresolved. Pose graph optimization is non-convex. The solutions Gauss-Newton and LM return may be local minima. Practitioners lived with a folk observation: "with an odometry initial guess, it usually solves fine" — but in some deployments the backend converged at the wrong spot and no alarm went off. In 2015 Luca Carlone at MIT began to replace that folklore with mathematics. From Carlone's Lagrangian duality attempt to Rosen's SE-Sync in 2019, then Briales-Gonzalez-Jimenez's Cartan-Sync, Yang-Carlone's TEASER, and Papalia's CORA, this lineage rewrites the SLAM backend from "a non-convex optimization that empirically solves well" to "a convex surrogate whose global optimality is provably certifiable." The tools all came from outside SLAM: Shor relaxation from operations research, Burer-Monteiro factorization from mathematical optimization, Riemannian optimization from differential geometry, Kirchhoff's Matrix-Tree from graph theory. The names of the people who gathered these onto one table over ten years are the body of this chapter.

---

## 6b.1 The Old Anxiety of Local Minima

Ch.6 §6.7 flagged initial-value dependency as the first problem of the graph SLAM backend. The cost function is non-convex on rotation variables $\boldsymbol{R}_i \in \mathrm{SO}(3)$, so when the initial estimate is far from the truth, Gauss-Newton gets pulled into the wrong basin. The parking garage example in Handbook §6.1 shows the symptom starkly: four random initializations, one sticks to the global minimum SE-Sync reaches, the other three settle at twisted local minima with the garage floor visibly folded.

Through the late 2000s the community's response ran along two lines: trust odometry for initial-value quality, or do loop closure verification and outlier removal thoroughly at the front end. Both worked, but neither was a tool for judging whether the converged value was the true minimum. The issue Huang and Dissanayake pointed out around 2010 was simple. No matter how good the initial guess, if the data itself is ambiguous the optimizer can stop at a wrong answer. That PGO is NP-hard was formalized around the same time. And yet in the field g2o usually solved fine. That gap — theory speaking to the worst case while practice looks at the average — is where the backend theorists of the mid-2010s dug in. Every point where Gauss-Newton halts is *locally* optimal. The gradient is zero and the Hessian positive definite. The answers, however, are completely different. The moment the backend signals "converged" is also the moment when failure is least visible.

> 🔗 **Borrowed.** Ch.6's robust kernels (Huber, Cauchy) and this chapter's GNC share a root in [Black & Rangarajan (1996)](https://cs.brown.edu/people/mjblack/Papers/ijcv1996.pdf)'s robust statistics and duality theorem. One branch changed cost weights to reduce outlier influence; the other diverted the same principle to avoid non-convexity.

---

## 6b.2 Shor Relaxation — A Weapon From Outside

PGO's non-convexity comes from the rotation constraint $\boldsymbol{R}_i \in \mathrm{SO}(d)$. That constraint can in fact be written with quadratic equations: orthogonality $\boldsymbol{R}^\top \boldsymbol{R} = \boldsymbol{I}$ is quadratic as it stands, and the determinant condition $\det(\boldsymbol{R}) = +1$, which is cubic when written directly for $d = 3$, is either rewritten as quadratic right-handedness conditions on the columns or dropped during relaxation and checked afterward. The objective is quadratic too. PGO therefore falls onto a **QCQP** (Quadratically Constrained Quadratic Program). And QCQP had a convex relaxation tool vetted in operations research since the late 1980s: [Naum Shor's 1987 relaxation](https://link.springer.com/article/10.1007/BF01582220).

Using the identity $\boldsymbol{x}^\top \boldsymbol{M}\boldsymbol{x} = \mathrm{tr}(\boldsymbol{M}\boldsymbol{x}\boldsymbol{x}^\top)$, Shor introduces the lifted variable $\boldsymbol{X} \triangleq \boldsymbol{x}\boldsymbol{x}^\top$, so the original QCQP becomes a linear-objective problem under "$\boldsymbol{X} \succeq 0$ and rank-1," then drops the rank-1 constraint to leave a convex **semidefinite program (SDP)**. The search space grows from $n$ unknowns to $n(n+1)/2$ in exchange for convexity.

$$d^* = \min_{\boldsymbol{X}\in\mathbb{S}^n} \mathrm{tr}(\boldsymbol{C}\boldsymbol{X}) \;\; \text{s.t.} \;\; \mathrm{tr}(\boldsymbol{A}_i\boldsymbol{X})=b_i,\; \boldsymbol{X}\succeq 0.$$

The usefulness sits in the duality inequality $d^* \le p^*$, where $p^*$ is the optimal value of the original QCQP. The SDP minimum is a lower bound on the original QCQP minimum. Given any feasible candidate $\hat{\boldsymbol{x}}$ with cost $f(\hat{\boldsymbol{x}})$, the quantity $f(\hat{\boldsymbol{x}}) - d^*$ upper-bounds how far that candidate sits from optimality. This is where the name "certifiable" comes from. Even without globally solving the original problem, one can bound how bad the solution is. If the SDP solution $\boldsymbol{X}^*$ happens to be rank-1, then $\boldsymbol{X}^* = \boldsymbol{x}^*\boldsymbol{x}^{*\top}$ and $\boldsymbol{x}^*$ is the global minimum of the original QCQP. How often this "favorable situation" occurs in SLAM becomes the subject of the papers that follow.
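
The lifting identity and the bound $f(\hat{\boldsymbol{x}}) - d^*$ can be checked on the smallest possible QCQP, $\min \boldsymbol{x}^\top \boldsymbol{C}\boldsymbol{x}$ subject to $\boldsymbol{x}^\top\boldsymbol{x} = 1$. This toy problem is an assumption chosen for illustration, not PGO. Its Shor relaxation $\min \mathrm{tr}(\boldsymbol{C}\boldsymbol{X})$ s.t. $\mathrm{tr}(\boldsymbol{X}) = 1$, $\boldsymbol{X} \succeq 0$ has the closed-form optimum $\lambda_{\min}(\boldsymbol{C})$, so the sketch needs no SDP solver.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
C = (A + A.T) / 2                      # symmetric cost matrix (hypothetical data)

def f(x):
    """Original QCQP objective x^T C x."""
    return float(x @ C @ x)

# A feasible candidate on the unit sphere and its lifted matrix X = x x^T.
x_hat = rng.standard_normal(4)
x_hat /= np.linalg.norm(x_hat)
X_hat = np.outer(x_hat, x_hat)
lift_ok = np.isclose(f(x_hat), np.trace(C @ X_hat))   # x^T C x == tr(C X)

# Relaxation optimum: smallest eigenvalue, attained at a rank-1 X*.
evals, evecs = np.linalg.eigh(C)       # eigenvalues in ascending order
d_star = evals[0]
X_star = np.outer(evecs[:, 0], evecs[:, 0])

gap_candidate = f(x_hat) - d_star      # suboptimality bound for x_hat
gap_optimal = f(evecs[:, 0]) - d_star  # zero: the relaxation is tight here
rank_X_star = np.linalg.matrix_rank(X_star)
print(lift_ok, gap_candidate, gap_optimal, rank_X_star)
```

The random candidate receives a nonnegative gap, while the eigenvector candidate receives a gap of zero, which is the certificate. In PGO the same logic applies with many constraints and no closed form, which is why a solver and a tightness argument are needed.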

Carlone's two papers at IROS and ICRA in 2015 ([Carlone et al. 2015 "Lagrangian duality in 3D SLAM"](https://arxiv.org/abs/1506.00746) and [Carlone & Dellaert 2015 "Planar pose graph optimization"](https://doi.org/10.1109/ICRA.2015.7139264)) are the starting point. They showed empirically that the duality gap is mostly zero in 2D PGO and suggested extension to 3D. Carlone had just finished his 2014 TRO survey of g2o/GTSAM initialization techniques and had seen how often, when odometry conflicted with loop closures, the optimizer halted at the wrong point. The 2015 paper reported "the empirical fact that the duality gap is typically zero" without giving a closed condition for when it holds.

In the same period [Briales & Gonzalez-Jimenez (2017)](https://arxiv.org/abs/1702.03235)'s Cartan-Sync pushed the program through SO(3) synchronization. On the math side Boumal-Absil-Sepulchre were refining Riemannian optimization, and Burer-Monteiro's low-rank SDP factorization had been around since 2003. The scattered materials got assembled in one paper in 2019.

---

## 6b.3 SE-Sync — What Rosen 2019 Assembled

[Rosen, Carlone, Bandeira, Leonard's SE-Sync (IJRR 2019)](https://arxiv.org/abs/1612.07386) is the canon of certifiable SLAM. Rosen finished his doctorate with John Leonard at MIT, and Leonard was the MIT researcher who, with Ch.4's Durrant-Whyte, helped settle the name "SLAM" in the mid-1990s. Co-author Afonso Bandeira, a math-side expert on SDP and synchronization, took the theoretical section proving the globality of rank-deficient second-order critical points. The four authors' differing backgrounds (robotics, SLAM, mathematical optimization, applied mathematics) tell you what the paper assembled. What this paper did was not invention but assembly: Shor relaxation, translation elimination, Burer-Monteiro low-rank parameterization, and Boumal's Riemannian staircase, ingredients aged ten-odd years each in different lineages, meshed together on the single problem of PGO.

The assembly runs in three steps. First, from the observation that translation becomes linear least squares once rotation is fixed, $\boldsymbol{t}$ is eliminated in closed form (Problem 6.2). Carlone had pointed this out in his 2014 TRO survey, and Rosen slotted it as the first step of the convex relaxation. Second, apply Shor relaxation to the remaining rotation-only problem $\min_{\boldsymbol{R}\in\mathrm{SO}(d)^n} \mathrm{tr}(\tilde{\boldsymbol{Q}}\boldsymbol{R}^\top\boldsymbol{R})$ and lift to an SDP (Problem 6.3). Third, the $dn \times dn$-dimensional SDP would crush interior-point methods at a few thousand poses, so reparameterize with Burer-Monteiro $\boldsymbol{Z} = \boldsymbol{Y}^\top \boldsymbol{Y}$ and turn it into a low-dimensional unconstrained problem on the Stiefel manifold (Problem 6.4).

Two theorems justify the assembly. Theorem 6.1 is **exact recovery**: if measurement noise is below some constant $\beta$, the SDP relaxation's unique solution carries the global minimum of the original MLE in factored form, $\boldsymbol{Z}^* = \boldsymbol{R}^{*\top}\boldsymbol{R}^*$ (rank $d$, the matrix analogue of the rank-1 case in §6b.2). It was the first quantitative answer to "how much noise can we tolerate." The caveat: $\beta$ depends on the ground-truth matrix, so it is unknown before seeing the instance. Theorem 6.2 brings over a result of Boumal et al., guaranteeing that a second-order critical point on the Stiefel manifold, if rank-deficient, is the global minimum. These enable the Riemannian Staircase: start at small rank, find a second-order critical point, check rank-deficiency; if it fails, raise rank by one. Once rank hits $dn + 1$, every $\boldsymbol{Y}$ becomes row rank-deficient, so the process must halt in finite steps. On practical datasets, one step usually suffices.

On standard benchmarks like sphere, torus, and garage, SE-Sync converged at g2o/GTSAM speed while returning an a posteriori certificate. g2o and GTSAM were fast but silent on when to trust the answer; Rosen's algorithm returns one more number at the end: a suboptimality bound. When it is zero, the solution is provably globally optimal. More than twenty years after Lu-Milios, backend researchers could mark yes or no on "is this solution actually the minimum."

> 📜 **Prediction vs. outcome.** In §8.2 of the IJRR 2019 paper, Rosen wrote that "the algebraic simplification we have shown could be extended to anisotropic noise, outliers, and a variety of sensor modalities." The prediction partially held. Holmes-Barfoot's 2023 landmark-SLAM extension, Papalia's 2024 CORA range-measurement extension, and Yang-Carlone's TEASER line actually followed. But the most ambitious extension — "SE-Sync covers the perspective projection of visual SLAM" — did not arrive even in 2026. A structural barrier surfaced: projection is a rational function and does not fold easily into polynomial optimization. `[diverted]`

> 🔗 **Borrowed.** The Burer-Monteiro factorization at the heart of SE-Sync is the low-rank SDP method of [Burer & Monteiro (2003)](https://link.springer.com/article/10.1007/s10107-002-0352-8), later sharpened by [Boumal-Voroninski-Bandeira (2016)](https://arxiv.org/abs/1605.08101)'s Riemannian proof of the globality of second-order critical points, which Rosen brought into SLAM. From pure math to the robot backend, sixteen years.

---

## 6b.4 The Unexpected Equivalence of Graph Laplacian and Fisher Information

Handbook §6.2 asks a different question. Once the global minimum is found, how close is that estimate to the truth? The answer runs through the Cramér-Rao Lower Bound and the Fisher Information Matrix (FIM). In a simplified PGO model with rotations fixed, Rosen-Khosoussi-Barfoot showed, strikingly, that the FIM falls out exactly as a Kronecker product of the graph's weighted reduced Laplacian.

$$\mathcal{I} = \boldsymbol{J}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{J} = \boldsymbol{L}_w \otimes \boldsymbol{I}_3.$$

Knowing the graph structure alone yields an approximation of estimation accuracy without actual measurements. By Kirchhoff's Matrix-Tree Theorem the determinant of the reduced Laplacian equals the weighted number of spanning trees, which ties the tree count to D-optimality (the determinant of the information matrix). Algebraic connectivity (the Fiedler value) corresponds to E-optimality (worst-case variance). A theorem Kirchhoff proved in 1847 for electrical circuits had become, 180 years later, the theoretical basis for measurement selection and active SLAM. In active SLAM "maximize FIM" translates into spectral manipulation of the Laplacian.
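
The link between spanning trees, the reduced Laplacian, and the Kronecker-structured FIM can be checked numerically on a small graph. The four-node graphs and unit weights below are hypothetical, and node 0 is taken as the anchored pose.

```python
import numpy as np

def laplacian(n, edges):
    """Weighted graph Laplacian from (i, j, weight) triples."""
    L = np.zeros((n, n))
    for i, j, w in edges:
        L[i, i] += w
        L[j, j] += w
        L[i, j] -= w
        L[j, i] -= w
    return L

def reduced(L):
    """Reduced Laplacian: drop the row and column of the anchored node 0."""
    return L[1:, 1:]

cycle = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0), (3, 0, 1.0)]   # odometry + one loop
cycle_plus_chord = cycle + [(0, 2, 1.0)]                       # one extra loop closure

trees_cycle = np.linalg.det(reduced(laplacian(4, cycle)))
trees_chord = np.linalg.det(reduced(laplacian(4, cycle_plus_chord)))

# Fiedler value: second-smallest eigenvalue of the full Laplacian.
fiedler_cycle = np.sort(np.linalg.eigvalsh(laplacian(4, cycle)))[1]

# FIM of the simplified model: reduced Laplacian Kronecker the 3x3 identity.
fim = np.kron(reduced(laplacian(4, cycle)), np.eye(3))
det_fim = np.linalg.det(fim)

print(trees_cycle, trees_chord, fiedler_cycle, det_fim)
```

Adding the chord doubles the spanning-tree count, and the FIM determinant is the tree count cubed, so under this model a candidate loop closure can be ranked from topology alone.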

[The post-2014 work of Kasra Khosoussi and Timothy Barfoot](https://arxiv.org/abs/1709.08601) established this connection. Khosoussi did his doctorate in Sydney under Dissanayake and Huang, then went through MIT and Toronto. In the form generalized to 3D PGO, the Kronecker combination of the Laplacian and SE(3) adjoint representation appears, letting topological and geometric information be handled separately. That the "measurement selection criterion" can be approximated by a Laplacian six times smaller than the full FIM provides the mathematical basis for "loop closure selection," which Ch.6 left in place without developing.

The EKF-SLAM consistency problem Ch.4 §4.8 flagged touches this result too. The over-confidence phenomenon Julier-Uhlmann pointed out in 2001, re-read through CRLB, says the linearization approximation over-estimates Fisher information. That is why Handbook §6.2 places the FIM chapter next to convex relaxation. Finding the global minimum and knowing how accurate that minimum is have to be treated as a pair.

> 🔗 **Borrowed.** [Kirchhoff's Matrix-Tree Theorem (1847)](https://en.wikipedia.org/wiki/Kirchhoff%27s_theorem), born as an analysis tool for electrical circuits, was transplanted through combinatorics into measurement-design literature and, in the 2010s through Khosoussi, became the language of SLAM active perception. A 180-year migration path.

---

## 6b.5 Extensions and Limits — TEASER, CORA, and Lasserre's Wall

After SE-Sync the front widened in two directions. First, certifiable estimators robust to outliers. Second, extended measurement models such as range, landmark, and anisotropic noise.

The outlier side pressed first. As Ch.6 §6.7 noted, real-world pose graphs have mismatches when loop closure verification is imperfect, and even with Huber or Cauchy kernels, above a certain outlier ratio the optimization collapses. By around 2017 the pressure on the certifiable lineage to say something about outliers was clear.

The representative is [Yang, Shi, Carlone's TEASER (TRO 2020)](https://arxiv.org/abs/2001.07715), which finds the global optimum in 3D point cloud registration with up to 99% outliers. It solves truncated least squares inside a GNC wrapper with SDP relaxation on the rotation subproblem, returning a certificate alongside. The trick was splitting scale, translation, and rotation into separate certifiable subproblems, each passed along with a global-optimality guarantee. The follow-up [Yang & Carlone (2022)](https://arxiv.org/abs/2109.03349) generalized this through Lasserre moment relaxation as "certifiably robust estimation."

Range-aided SLAM is [Papalia et al. CORA (2024)](https://arxiv.org/abs/2403.09295)'s position. The range measurement $(\|\boldsymbol{t}_j - \boldsymbol{t}_i\| - \tilde r_{ij})^2$, used as is, becomes quartic and escapes QCQP, so Papalia introduced an auxiliary unit vector $\boldsymbol{b}_{ij} \in S^{d-1}$ and through bearing lifting put it back inside QCQP. CORA showed that while relaxation is tight in the single-robot case, in multi-robot settings it is generally not exact, narrowing "when does Shor work."
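
The effect of the auxiliary unit vector can be checked numerically. One way to write the lifted cost, assumed here as a sketch rather than quoted from the CORA paper, is $\|\boldsymbol{t}_j - \boldsymbol{t}_i - \tilde r_{ij}\boldsymbol{b}_{ij}\|^2$ with $\|\boldsymbol{b}_{ij}\| = 1$. It is quadratic in the unknowns, and minimizing over $\boldsymbol{b}_{ij}$ recovers the original range cost. The positions and range value are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
t_i = np.array([0.0, 0.0, 0.0])
t_j = np.array([3.0, 4.0, 0.0])
r_meas = 4.5                                    # hypothetical range measurement

delta = t_j - t_i
original_cost = (np.linalg.norm(delta) - r_meas) ** 2

def lifted_cost(b):
    """Quadratic surrogate: ||t_j - t_i - r * b||^2 for a unit vector b."""
    return float(np.sum((delta - r_meas * b) ** 2))

# The minimizing unit vector points along the baseline between the two positions.
b_opt = delta / np.linalg.norm(delta)
cost_at_b_opt = lifted_cost(b_opt)

# Random unit vectors never do better than b_opt.
B = rng.standard_normal((1000, 3))
B /= np.linalg.norm(B, axis=1, keepdims=True)
sampled_min = min(lifted_cost(b) for b in B)

print(original_cost, cost_at_b_opt, sampled_min)
```

The price of the lifting is one extra unit-norm constraint per range measurement, which is itself quadratic and therefore stays inside the QCQP form.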

On the landmark side, [Holmes & Barfoot (2023)](https://arxiv.org/abs/2308.05631) used Schur complement to eliminate landmarks in advance, leaving a PGO SE-Sync takes directly. That Holmes, Khosoussi, and Rosen co-authored Ch.6 of the Handbook is evidence this lineage gathered at one table in 2025.

Walls also appeared. Generalizing anisotropic noise and truncated-quadratic outliers to POP (Polynomial Optimization Problem) calls for Lasserre's moment relaxation, but the derived SDP is **degenerate**: constraint qualification fails and the Riemannian Staircase's convergence breaks. Yang's 2022 sparse monomial basis offers a workaround, but the specialized solver is still slower than a general local solver. An algorithm holding both speed and provability has yet to appear. Visual SLAM and VIO face a deeper wall — the structural incompatibility of perspective projection and IMU preintegration — treated in the 🧭 section.

> 📜 **Prediction vs. outcome.** At ICRA 2015 Carlone wrote that "theoretical explanation for why most instances have a tight Lagrangian dual is needed." Ten years on, only part of the answer has arrived. The exact recovery theorem of Rosen-Carlone-Bandeira-Leonard gave a sufficient condition "noise below $\beta$," but no way to compute $\beta$ in advance on an actual SLAM instance. An **a priori** condition for when tightness breaks is, as of 2026, still replaced by per-instance certificates. `[in progress]`

---

## 🧭 Still open

**The boundary where tightness breaks.** SE-Sync's exact recovery theorem offered the sufficient condition "noise below $\beta$," but there is no way to compute $\beta$ on an actual instance. Judging in advance when tightness holds is what would advance algorithm design. Systematic study of how relaxation fails under heavy outliers or extremely sparse graphs is still early.

**Integration with visual SLAM and VIO.** Perspective projection $\pi(\boldsymbol{X}) = [X/Z, Y/Z]$ is rational, not polynomial. Multiplying through the denominator adds a new variable and auxiliary constraint per feature, and ORB-SLAM3's thousands of map points push the SDP outside real-time size. Forster's 2015 IMU preintegration tangles the exponential map with bias drift, resisting POP incorporation. As of 2026, Ch.7, Ch.8, and Ch.13's visual/VIO mainstream sit outside certifiable guarantees. The largest unsolved position for the lineage.

**Online certification and scale.** SE-Sync is batch. Incremental certifiable SLAM, re-solving the SDP at each new measurement, is not mature. The incrementalization that iSAM2 solved for batch SAM has to be solved again on the certifiable side. Warm-starting, incremental rank increases, composing partial certificates — all open, and city-scale moment-relaxation solver speed is still a problem.

**Outlier-majority.** Current certifiable robust estimators rest on the "minority outlier" assumption. When the majority is contaminated, multi-hypothesis certification such as list-decodable regression is needed, and that direction has just begun on the statistics side. Follow-up work by Cheng-Shi-Carlone around 2024 tried to extend in this direction, but no standard tool has settled in the way TEASER did.

---

This whole chapter is a footnote to the one-line "local-minimum convergence" sentence Ch.6 passed through (§6.7) — that folk observation replaced by a ten-year theoretical program. The fact that Ch.6 of *The SLAM Handbook*, co-authored by Carlone-Khosoussi-Rosen-Holmes-Barfoot-Dissanayake, treats this subject across 34 pages is itself evidence of the lineage's present weight. Over the same ten years, Ch.12, Ch.13, and Ch.16's learning-based SLAM moved along a different path: one side toward proving the globality of the solution, the other toward having a neural network predict it. Whether the two lineages will meet has no answer even in 2026. In Ch.19 the 🧭 items of this chapter are harvested as "gaps in backend theory."

The main line resumes in Ch.7, where the backend is taken as settled and the question shifts to what runs on top of it.

---

# Ch.7 — Feature-based Lineage: The ORB-SLAM Trilogy

Ch.6's graph SLAM revolution locked in pose graph optimization as the standard language of SLAM. Kümmerle's g²o (2011) and Kaess's iSAM2 (2012) made iterative optimization feasible on large maps and brought the cost of loop closure down to a practical level. That current, with optimization theory racing toward completion, is the starting line of Part 3. Where Part 2 asked "how do we reduce error," Part 3 begins with the premise that the question already has an answer. What remained was the front end. Which features do we extract, and how do we track them.

When Klein and Murray split tracking and mapping into two threads with PTAM in 2007, it was a lab demo. The idea had been proven in depth, not breadth, and it broke down beyond small indoor scenes. When Raúl Mur-Artal carried that structure over at the University of Zaragoza in 2015, he brought three things along: Rublee's ORB descriptor (2011), Gálvez-López's DBoW2 visual vocabulary (2012), and Strasdat's Essential graph idea (2011). PTAM was a fast prototype; ORB-SLAM was a ten-year standard.

---

## 7.1 ORB-SLAM (2015): A Tripod of Design Choices

[Mur-Artal, Montiel & Tardós 2015. ORB-SLAM](https://doi.org/10.1109/TRO.2015.2463671) is a paper in IEEE Transactions on Robotics. The title is plain. SLAM that uses ORB features. Pull the paper apart, though, and every choice is a design judgment.

The system's skeleton is three threads: Tracking, Local Mapping, Loop Closing. PTAM had two threads (Tracking and Mapping); Mur-Artal added Loop Closing as a third. Loop Closing recognizes places with DBoW2, optimizes the pose graph through the Essential graph, and finally runs a global bundle adjustment (BA). This separation lets Tracking hold real time without waiting for map edits.

> 🔗 **Borrowed.** The Tracking–Mapping split of PTAM (Klein & Murray, 2007) carried directly into ORB-SLAM's Tracking–LocalMapping. Mur-Artal named the debt in §3 of the paper. ORB-SLAM extended the two-thread structure into three threads and isolated loop closure as an independent module.

Mur-Artal had a reason for picking the ORB (Oriented FAST and Rotated BRIEF) descriptor at the front end. SIFT and SURF came with patent trouble, and BRIEF was fast but weak under rotation. ORB adds rotation invariance on top of FAST keypoints, and Rublee et al. presented it at ICCV 2011. Computation is tens of times faster than SIFT, and since it is binary, matching runs on Hamming distance. It runs in real time on CPU.

The way ORB gets scale invariance is an image pyramid. The original image is shrunk by a scale factor s (1.2 in ORB-SLAM) over 8 levels to build a pyramid, and FAST keypoints are detected independently at each level. The orientation of a keypoint is defined by the intensity centroid: the first-order moment of pixel intensity inside the patch gives the center, and the orientation angle θ is applied to the BRIEF bit-comparison pairs to produce a rotation-invariant descriptor. The result is a 256-bit binary vector. Similarity between two descriptors is computed by XOR followed by popcount, that is, Hamming distance.
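
The three ingredients of that paragraph (pyramid scales, intensity-centroid orientation, Hamming matching) can be sketched in a few lines. This is a simplified illustration that assumes a square patch and synthetic data; it is not the OpenCV or ORB-SLAM implementation.

```python
import numpy as np

# Pyramid: 8 levels, each shrunk by a factor of 1.2 relative to the previous one.
scales = 1.2 ** np.arange(8)

def intensity_centroid_angle(patch):
    """Orientation theta = atan2(m01, m10) from first-order intensity moments."""
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xs = xs - (w - 1) / 2.0           # coordinates relative to the patch center
    ys = ys - (h - 1) / 2.0
    m10 = np.sum(xs * patch)
    m01 = np.sum(ys * patch)
    return np.arctan2(m01, m10)

def hamming(d1, d2):
    """Hamming distance between two 256-bit descriptors stored as 32 uint8 bytes."""
    return int(np.unpackbits(np.bitwise_xor(d1, d2)).sum())

# Synthetic patches: brightness concentrated on the right, then at the bottom.
patch_right = np.zeros((31, 31))
patch_right[:, 20:] = 1.0
patch_down = np.zeros((31, 31))
patch_down[20:, :] = 1.0
angle_right = intensity_centroid_angle(patch_right)
angle_down = intensity_centroid_angle(patch_down)

# Two descriptors differing in exactly three bits.
d1 = np.zeros(32, dtype=np.uint8)
d2 = d1.copy()
d2[0] = 0b00000111
dist = hamming(d1, d2)

print(scales[-1], angle_right, angle_down, dist)
```

Because image rows grow downward, the bottom-heavy patch yields an angle of $\pi/2$. Rotating the BRIEF sampling pattern by this angle before the bit comparisons is what makes the descriptor rotation-invariant.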

> 🔗 **Borrowed.** The descriptor from [Rublee et al. 2011. ORB](https://doi.org/10.1109/ICCV.2011.6126544) became the name of the system itself. ORB was not designed by the Zaragoza team. Mur-Artal took the tools that existed and assembled a pipeline. It is a case where a front-end choice named the system for the next ten years.

The keyframe selection policy departs from PTAM's. PTAM added keyframes aggressively. ORB-SLAM removes redundancy based on a covisibility graph. The **covisibility graph** is a graph whose edge weights are the number of landmarks shared between keyframes. A pair of keyframes connects when they share 15 or more landmarks. Local Mapping uses this graph to pick a local window and runs BA only inside that window.

On KITTI sequence 00 (a full 4.5 km loop) ORB-SLAM recorded 1.2% translation drift. PTAM, the comparison target at the time, cannot close the loop. It has no scale to begin with. That ORB-SLAM closed the loop on the same sequence and absorbed the drift was thanks to the Essential graph and DBoW2.

The **Essential graph** is a subgraph of the covisibility graph. It keeps only edges with 100 or more shared landmarks, the spanning tree, and the loop-closure edges. When a loop is detected the whole graph is optimized as a pose graph. Even with thousands of keyframes the edges of the Essential graph stay sparse. Optimization finishes within seconds.
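
The relation between the two graphs can be sketched with plain sets. The observation table below is hypothetical, and the spanning-tree rule (each keyframe attaches to the earlier keyframe with which it shares the most landmarks) is an assumption made for the sketch.

```python
from itertools import combinations

# Hypothetical observations: keyframe id -> set of landmark ids it sees.
observations = {
    0: set(range(0, 200)),
    1: set(range(50, 250)),
    2: set(range(180, 400)),
    3: set(range(385, 600)),
}

def covisibility_edges(obs, min_shared=15):
    """Edges weighted by shared-landmark count, kept when the count is >= 15."""
    edges = {}
    for a, b in combinations(sorted(obs), 2):
        shared = len(obs[a] & obs[b])
        if shared >= min_shared:
            edges[(a, b)] = shared
    return edges

def spanning_tree(obs):
    """Attach each keyframe to the earlier keyframe sharing the most landmarks."""
    ids = sorted(obs)
    tree = []
    for idx, k in enumerate(ids[1:], start=1):
        parent = max(ids[:idx], key=lambda p: len(obs[p] & obs[k]))
        tree.append((parent, k))
    return tree

def essential_edges(covis, tree, loop_edges, strong=100):
    """Strong covisibility edges (>= 100 shared) plus spanning tree plus loop edges."""
    keep = {e for e, w in covis.items() if w >= strong}
    return keep | set(tree) | set(loop_edges)

covis = covisibility_edges(observations)
tree = spanning_tree(observations)
essential = essential_edges(covis, tree, loop_edges=[])
print(len(covis), len(essential))
```

Even on four keyframes the Essential graph is smaller than the covisibility graph. On thousands of keyframes that gap is what keeps loop-closure pose graph optimization fast.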

> 🔗 **Borrowed.** The Essential graph idea came from the hierarchical optimization structure of [Strasdat et al. 2011. Double Window Optimisation](https://doi.org/10.1109/ICCV.2011.6126517). Strasdat separated a local window from a global window to cut optimization cost. Mur-Artal generalized this into a sparse pose graph called the Essential graph.

Place recognition for loop closure is handled by DBoW2. [Gálvez-López & Tardós 2012. DBoW2](https://doi.org/10.1109/TRO.2012.2197158) is a vocabulary tree for binary descriptors. ORB descriptors are hierarchically clustered with k-medians (k-means++ seeding) to build a tree-structured vocabulary. Once the branching factor $k_w$ and depth $L_w$ are fixed, the number of leaf nodes (words) becomes $k_w^{L_w}$. The DBoW2 paper reports an example with $k_w=10$, $L_w=6$ trained into a vocabulary of one million words, and the public ORB-SLAM implementation uses a vocabulary of similar size. Each word carries a TF-IDF (Term Frequency–Inverse Document Frequency) weight: the more frequently a given word appears across the entire keyframe database, the lower its IDF weight, so discriminative words carry more influence. A keyframe is represented by this weighted BoW vector and stored in an inverted index. When a new frame arrives, each descriptor descends the vocabulary tree in $L_w$ levels (roughly $k_w L_w$ Hamming comparisons), that is, $O(\log_{k_w} k_w^{L_w}) = O(L_w)$ in the vocabulary size, and the inverted index pulls up candidate keyframes directly. The whole map is never traversed.
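
The vocabulary size, the TF-IDF weighting, and the inverted-index lookup can be sketched on a toy database. The word ids and keyframe names are hypothetical, and the weighting follows the textbook TF-IDF form rather than DBoW2's exact scoring code.

```python
import math
from collections import Counter

k_w, L_w = 10, 6
n_words = k_w ** L_w                   # leaves of the vocabulary tree

# Hypothetical database: each keyframe is the list of word ids of its features.
database = {
    "kf0": [1, 1, 2, 7],
    "kf1": [1, 3, 3, 7],
    "kf2": [1, 4, 5, 7],
}

def idf_table(db):
    """IDF = log(N / n_w): words seen in every keyframe get weight zero."""
    n = len(db)
    df = Counter(w for words in db.values() for w in set(words))
    return {w: math.log(n / c) for w, c in df.items()}

def bow_vector(words, idf):
    """TF-IDF weighted bag-of-words vector, L1-normalized."""
    tf = Counter(words)
    v = {w: (c / len(words)) * idf.get(w, 0.0) for w, c in tf.items()}
    norm = sum(abs(x) for x in v.values())
    return {w: x / norm for w, x in v.items()} if norm > 0 else v

def inverted_index(db):
    """word id -> keyframes containing it, so a query touches only candidates."""
    index = {}
    for kf, words in db.items():
        for w in set(words):
            index.setdefault(w, []).append(kf)
    return index

idf = idf_table(database)
index = inverted_index(database)
v0 = bow_vector(database["kf0"], idf)
candidates = {kf for w in [3, 4] for kf in index.get(w, [])}
print(n_words, idf[1], idf[2], v0, candidates)
```

Words 1 and 7 appear in every keyframe and end up with zero weight, so only the rarer words decide the match. That is the "discriminative words carry more influence" behavior described above.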

The Tracking thread estimates the current pose every frame. With the previous frame's pose as an initial value, feature matching runs, and then **EPnP** (Efficient Perspective-n-Point) computes the pose $\mathbf{T}_{cw} \in SE(3)$ from the 3D–2D correspondences $\{(\mathbf{X}_i, \mathbf{u}_i)\}$. EPnP itself is a non-iterative solver; the quantity it targets, and the one the refinement that follows minimizes explicitly, is the reprojection error:

$$\mathbf{T}^* = \arg\min_{\mathbf{T}} \sum_i \left\| \mathbf{u}_i - \pi(\mathbf{T}\mathbf{X}_i) \right\|^2$$

Here $\pi$ is the camera projection function, $\mathbf{X}_i$ is the world coordinate of a map point, and $\mathbf{u}_i$ is the image coordinate. After the initial estimate, RANSAC removes outliers, and a g²o-based local BA on inliers alone jointly optimizes the pose of the current keyframe and its covisibility-graph neighbors, along with the map points.
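
The reprojection cost in the equation above can be evaluated directly. The sketch assumes a pinhole camera with a hypothetical intrinsic matrix `K`, synthetic map points, and noise-free observations; it only evaluates the objective and does not implement EPnP, RANSAC, or the g²o optimizer.

```python
import numpy as np

def project(K, T_cw, X_w):
    """Pinhole projection pi(T X) of N x 3 world points; T_cw is a 4 x 4 pose."""
    X_h = np.hstack([X_w, np.ones((len(X_w), 1))])   # homogeneous points, N x 4
    X_c = (T_cw @ X_h.T).T[:, :3]                    # points in the camera frame
    x_n = X_c / X_c[:, 2:3]                          # divide by depth Z
    return (K @ x_n.T).T[:, :2]                      # pixel coordinates

def reprojection_cost(K, T_cw, X_w, u_obs):
    """Sum of squared pixel residuals u_i - pi(T X_i)."""
    residual = u_obs - project(K, T_cw, X_w)
    return float(np.sum(residual ** 2))

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])                      # hypothetical intrinsics
X_w = np.array([[0.0, 0.0, 4.0], [1.0, -0.5, 5.0],
                [-1.0, 0.5, 6.0], [0.5, 1.0, 3.0]])  # synthetic map points

T_true = np.eye(4)
T_true[:3, 3] = [0.1, 0.0, 0.2]                      # true camera translation
u_obs = project(K, T_true, X_w)                      # noise-free observations

cost_true = reprojection_cost(K, T_true, X_w, u_obs)
cost_guess = reprojection_cost(K, np.eye(4), X_w, u_obs)
print(cost_true, cost_guess)
```

The cost vanishes at the true pose and is positive at the identity guess. An optimizer such as the local BA described above drives this value down over the pose and, in BA, over the map points as well.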

---

## 7.2 ORB-SLAM2 (2017) — Stereo/RGB-D

ORB-SLAM (2015) was mono-only. A single camera cannot know scale. There is no way to read out of image pixels whether "this corridor is 10 m or 100 m." That was the problem that sent Mur-Artal and Tardós back to work in 2016.

[Mur-Artal & Tardós 2017. ORB-SLAM2](https://doi.org/10.1109/TRO.2017.2705103) solves the problem by adding stereo and RGB-D. Stereo knows the baseline, so depth is triangulated directly. RGB-D has a depth sensor giving a measurement. In both cases scale arrives.

The structure is the same three threads as mono. Only the front end changes with the sensor type. Stereo extracts ORB from a rectified image pair and computes depth by left-right matching. Feature points whose depth is small relative to the baseline are classified as close **stereo landmarks**, while far points whose depth cannot be estimated reliably are classified as **monocular landmarks**. This mixed approach uses the strengths of stereo and mono at once.

**Stereo initialization**, unlike mono, runs immediately from the first frame. Mono initialization builds the map through the Essential Matrix or Homography between two frames and leaves scale ambiguity behind. Stereo computes depth at the first keyframe from the horizontal disparity $d$ between left and right images together with the baseline $b$ and the focal length $f$:

$$Z = \frac{b \cdot f}{d}$$

Feature points with depth $Z$ below the threshold $Z_{\max}=40b$ are registered as 3D map points at once. RGB-D initialization works on the same principle. The depth value $Z$ at pixel $(u, v)$ is read from the depth image, and back-projection yields the 3D coordinate. In both cases, because scale is fixed, Local BA can run right after the first frame.
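
The depth formula, the $40b$ threshold, and back-projection fit in a few lines. The baseline, focal length, principal point, and disparities below are hypothetical numbers chosen for easy arithmetic.

```python
def stereo_depth(disparity_px, baseline_m, focal_px):
    """Z = b * f / d for a rectified stereo pair."""
    return baseline_m * focal_px / disparity_px

def is_close_point(depth_m, baseline_m, factor=40.0):
    """Close points (Z < 40 b) can be registered as 3D map points right away."""
    return depth_m < factor * baseline_m

def back_project(u, v, depth_m, fx, fy, cx, cy):
    """Pixel (u, v) with depth Z -> 3D point in the camera frame."""
    return ((u - cx) * depth_m / fx, (v - cy) * depth_m / fy, depth_m)

b, f = 0.5, 700.0                       # hypothetical baseline [m] and focal length [px]
z_near = stereo_depth(35.0, b, f)       # large disparity -> small depth
z_far = stereo_depth(7.0, b, f)         # small disparity -> large depth

near_is_close = is_close_point(z_near, b)
far_is_close = is_close_point(z_far, b)
point = back_project(420.0, 240.0, z_near, f, f, 320.0, 240.0)
print(z_near, z_far, near_is_close, far_is_close, point)
```

With these numbers the threshold sits at 20 m, so the first feature becomes a map point immediately and the second does not. The same `back_project` call applies to RGB-D, with `depth_m` read from the depth image instead of computed from disparity.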

On the Machine Hall 01 sequence of the EuRoC MAV (Micro Aerial Vehicle) dataset, ORB-SLAM2 (stereo) recorded an absolute translation error of 0.035 m in Table II. The same table has Stereo LSD-SLAM as the comparison target, so the precision advantage of the feature-based line at the time shows up as numbers. ORB-SLAM2 also ranked near the top among then-published methods on KITTI odometry.

On the day in May 2017 when the paper appeared in IEEE TRO, Mur-Artal and Tardós pushed the source to GitHub alongside it. Two people in the Zaragoza team released mono, stereo, and RGB-D modes on a single codebase. GitHub stars passed several thousand afterward, and ROS wrappers came out of the community.

---

## 7.3 ORB-SLAM3 (2021): Atlas and Visual-Inertial

[Campos et al. 2021. ORB-SLAM3](https://doi.org/10.1109/TRO.2021.3075644), published in IEEE Transactions on Robotics in 2021, has a different author list. The first author is not Mur-Artal but Carlos Campos. Mur-Artal is listed as a coauthor alongside Tardós. Campos had done his PhD at the University of Zaragoza under Tardós. The lineage moved down a generation.

ORB-SLAM3 has two core extensions: **Atlas** (multi-map) and **Visual-Inertial** mode.

Atlas is a structure that holds several separate maps simultaneously. When tracking fails, the existing map is closed and a new map is started; later, when the same place is revisited, the two maps are merged. In ORB-SLAM and ORB-SLAM2, a tracking failure was fatal. Lose it once and you started over. Campos pointed to this as the limitation he had run into most often through his PhD, and Atlas was the answer. ORB-SLAM3 re-initializes after a failure and remembers the previous map.

Visual-Inertial (VI) mode integrates IMU data. Campos carried over the formulation that Forster et al. proposed at RSS 2015 as "IMU Preintegration on Manifold" and extended in IEEE TRO 2016 as [On-Manifold Preintegration for Real-Time Visual-Inertial Odometry](https://doi.org/10.1109/TRO.2016.2597321). The IMU fills in the tracking that Visual SLAM tends to lose in fast motion. VI-SLAM also resolves the scale ambiguity of a monocular camera. The accelerometer measurements of the IMU provide absolute scale together with the direction of gravity.

The core of preintegration is that IMU measurements between keyframes $i$ and $j$ are integrated once and stored as relative increments ($\Delta\mathbf{R}_{ij}$, $\Delta\mathbf{v}_{ij}$, $\Delta\mathbf{p}_{ij}$) on the SO(3) manifold. When bias shifts during BA, a first-order Jacobian correction is applied without re-integration. ORB-SLAM3 adds these preintegrated terms as inertial edges in the factor graph and optimizes them jointly with the visual reprojection residual. (Ch.7b traces the full derivation — from Lupton's Euler-angle attempt to Forster's manifold formulation — for readers who want the mathematics behind this single line of ORB-SLAM3.)
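
The "integrate once, store as relative increments" idea can be sketched with a simple discrete accumulation. This is a minimal illustration under stated assumptions: constant time step, fixed biases, no noise propagation, and no bias Jacobians. The function names are hypothetical and this is not the ORB-SLAM3 or GTSAM code.

```python
import numpy as np

def so3_exp(phi):
    """Rodrigues formula: rotation vector -> rotation matrix."""
    theta = np.linalg.norm(phi)
    K = np.array([[0.0, -phi[2], phi[1]],
                  [phi[2], 0.0, -phi[0]],
                  [-phi[1], phi[0], 0.0]])
    if theta < 1e-9:
        return np.eye(3) + K            # first-order approximation near zero
    return (np.eye(3) + np.sin(theta) / theta * K
            + (1.0 - np.cos(theta)) / theta ** 2 * (K @ K))

def preintegrate(gyro, accel, dt, bg=np.zeros(3), ba=np.zeros(3)):
    """Accumulate (dR, dv, dp) between two keyframes from raw IMU samples."""
    dR, dv, dp = np.eye(3), np.zeros(3), np.zeros(3)
    for w, a in zip(gyro, accel):
        a_rot = dR @ (a - ba)                       # acceleration in keyframe-i frame
        dp = dp + dv * dt + 0.5 * a_rot * dt ** 2   # position increment
        dv = dv + a_rot * dt                        # velocity increment
        dR = dR @ so3_exp((w - bg) * dt)            # rotation increment on SO(3)
    return dR, dv, dp

n, dt = 100, 0.01                                   # 1 s of samples at 100 Hz

# Case 1: pure rotation about z at pi/2 rad/s, no acceleration.
gyro = np.tile([0.0, 0.0, np.pi / 2], (n, 1))
dR_rot, dv_rot, dp_rot = preintegrate(gyro, np.zeros((n, 3)), dt)

# Case 2: constant 1 m/s^2 acceleration along x, no rotation.
accel = np.tile([1.0, 0.0, 0.0], (n, 1))
dR_acc, dv_acc, dp_acc = preintegrate(np.zeros((n, 3)), accel, dt)
print(dR_rot.round(3), dv_acc, dp_acc)
```

The increments depend only on the IMU samples and the bias estimate, not on the absolute pose at keyframe $i$. That is why they can be computed once and reused across BA iterations, with only the first-order bias correction applied when the bias estimate moves.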

> 🔗 **Borrowed.** Campos took the Forster et al. On-Manifold Preintegration formulation (TRO 2016, originating at RSS 2015) as the core of the ORB-SLAM3 Inertial integration. Forster's formulas give a way to integrate continuous IMU measurements on the SO(3) manifold with bias correction. ORB-SLAM3 hooked this formulation into factor graph optimization.

On the mean RMSE ATE (Absolute Trajectory Error) across all 11 EuRoC MAV sequences, ORB-SLAM3 (mono-inertial) is reported at 0.043 m in Table II. In the same table VINS-Mono comes in at 0.110 m, and Kimera (stereo-inertial) at 0.119 m.

When VI mode and Atlas are combined, a UAV or a handheld device can return to a previous map even through lighting changes or lost tracking. It was not a version number change so much as a change in the character of the system.

---

## 7.4 Why It Is Still the Baseline in the 2020s

In 2023, conference papers still put ORB-SLAM3 in the comparison table. When a new method is announced, "how much better than ORB-SLAM3" is the reference line. The algorithm stopped moving in 2021 and the benchmark role did not.

Robustness comes first. ORB features hold up reasonably under illumination change, the binary descriptor is fast to compute, and many of them can be extracted in real time to reduce tracking failures. Learned features are more accurate on certain datasets, but can fall apart in new environments. The behavior of ORB is predictable.

Reproducibility is there too. The code is public, ROS integration is solid, and thousands of real-world use cases are documented. Running ORB-SLAM3 as the first step when evaluating a new system in a lab has been standard practice for a long time. Because a single codebase supports mono, stereo, RGB-D, and IMU, one can line up "our method vs ORB-SLAM3 (stereo)" or "our method vs ORB-SLAM3 (mono-inertial)" in parallel. One baseline covers several settings.

The learned alternatives do not beat it consistently. DROID-SLAM (Teed & Deng, 2021) beats ORB-SLAM3 on several sequences. But as the paper itself reports, large sequences such as EuRoC and TartanAir need a 24 GB-class GPU, and on TartanAir it runs at 8 fps, not real time. ORB-SLAM3, by contrast, runs CPU-only, and community reports confirm basic operation on ARM and embedded platforms.

---

## 📜 Prediction vs. outcome

> 📜 **Prediction vs. outcome.** Mur-Artal laid out two Future Work directions in Section IX-C of the 2015 ORB-SLAM paper. One was "Points at Infinity," the idea of using distant points that lack parallax, and so cannot be folded in as ordinary map points, for rotation estimation. The other was "Dense Map Reconstruction," a suggestion that a compact keyframe selection would provide a good skeleton for dense reconstruction. Looking back ten years later, the first direction was partly absorbed by VI-SLAM and follow-on work, and the second direction ended up realized by the NeRF-SLAM and Gaussian Splatting line of the 2020s, which implement a "sparse skeleton + dense overlay" structure with different materials. The follow-on modality extensions the authors pointed to — RGB-D, stereo, IMU — were added not in this Section but in ORB-SLAM2 (2017) and ORB-SLAM3 (2021), under separate problem statements. `[partial hit]`

> 📜 **Prediction vs. outcome.** Campos et al. in the Conclusions of the 2021 ORB-SLAM3 paper admitted that the main failure mode of ORB-SLAM3 is low-texture environments, and pointed to the development of photometric techniques suited to the four data association problems as the next direction (citing endoscopic imagery as one example). Between 2023 and 2025 the more prominent current ran ahead of that direction: research that transplanted learned front ends such as SuperPoint and LightGlue into ORB-SLAM3, while integration of the photometric line continued separately in the DSO and LDSO context. The main branch of the official ORB-SLAM3 repository still holds the traditional ORB descriptor as of 2026. The authors' central axis (photometric) and the actual attention of the community (learned feature) rolled on past each other. `[in progress + diverted]`

---

## 🧭 Still open

Long-term map reuse. Atlas made multi-map maintenance possible, but in environments with large lighting changes map merging still fails. The goal is to recognize the place of a morning map and an evening revisit as the same place, but when appearance shifts are large, DBoW2's place recognition misses. Research groups that need long-term autonomy in outdoor settings with seasonal change are holding on to this problem. As of 2024 there is no full answer.

The place of the pure-vision baseline. Learned-feature systems have begun to beat ORB-SLAM3 on standard benchmarks. The SuperPoint + SuperGlue combination, LightGlue, and DINOv2-based features show lower error on particular sequences. Generalization is a different problem. Cases are reported where, outside the training distribution, learned features produce worse results than traditional ORB. Making the claim "consistently outperforms" will need broader experiments than exist today.

Drift at large outdoor scale. ORB-SLAM3 still lags LiDAR SLAM on urban driving and paths beyond several kilometers. Urban-scale localization in GPS-denied environments with a pure camera remains unsolved as of 2026. When changes in visual conditions, dynamic objects, and textureless stretches combine, drift accumulates. The gap to LiDAR survey precision is narrowing but has not closed.

---

In the same years that the ORB-SLAM trilogy was setting the standard for the feature-based lineage, Newcombe and Engel were making the opposite choice. Do not pull out feature points; use the brightness information of the whole image directly. The two lineages grew side by side through the 2010s, and by taking each other as the reference they exposed each other's limits. While ORB-SLAM3 led the EuRoC benchmark in 2021, DSO beat ORB-SLAM2 in the TUM corridors. Same timetable, different starting point.

---

# Ch.7b — From a Shaking Sensor to a Constraint: The Invention of IMU Preintegration

Sydney, 2009. At ACFR (the Australian Centre for Field Robotics), a doctoral student named Todd Lupton was working through a problem in front of his advisor Salah Sukkarieh. When a drone moves aggressively, the IMU spits out measurements at 200 Hz, and a factor graph has no place to put all of them. Keyframes arrive a few times per second; how do you bundle the dozens or hundreds of IMU measurements that fall between them into a single unit? The answer Lupton submitted to IROS was the seed of preintegration. Six years later, at RSS 2015, Christian Forster, together with Davide Scaramuzza, Luca Carlone, and Frank Dellaert, carried that seed onto the SO(3) manifold, and the IMU became a first-class citizen of the factor graph. This chapter goes inside the equations that Ch.7's ORB-SLAM3, Ch.8's VI-DSO, and Ch.17's LIO-SAM and FAST-LIO dispatch with the line "we used Forster 2016."

---

## 7b.1 MEMS and the "democratization of sensors"

Preintegration became necessary because IMUs became cheap.

Strapdown inertial navigation has its roots in 1950s aerospace. Submarine and missile ring laser gyros were tens-of-thousands-of-dollars hardware, and the robotics community had no occasion to use them. What shifted the current was MEMS (Micro-Electro-Mechanical Systems). Analog Devices' ADXL and InvenSense's MPU series pulled a six-axis IMU down to a few dollars. The iPhone shipped with an accelerometer in 2007 and gained a gyroscope in 2010, and by the early 2010s research drones and handheld devices carried a MEMS IMU as a matter of course. The moment when billions of smartphones drove the unit price down overlapped with the moment when Visual SLAM started taking monocular scale ambiguity (Ch.5 §🧭) seriously.

The measurement model is simple. The accelerometer gives the specific force $\tilde{\mathbf{a}} = \mathbf{R}_w^b(\mathbf{a}^w - \mathbf{g}^w) + \mathbf{b}^a + \boldsymbol{\eta}^a$ with gravity baked in, and the gyroscope gives the angular velocity $\tilde{\boldsymbol{\omega}} = \boldsymbol{\omega}_b^b + \mathbf{b}^g + \boldsymbol{\eta}^g$. Here $\mathbf{b}$ is bias and $\boldsymbol{\eta}$ is white noise. The facts the equations forced were more important than the equations themselves. Gravity is always mixed in, bias drifts slowly with time (random walk), and MEMS noise is high-frequency. The IMU was a fussy companion that forced a gravity-aligned world frame and whose bias changed a little with every shift in temperature and power state.
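The two measurement equations can be written out directly. The sketch below is a minimal illustration, assuming a z-up world frame with $\mathbf{g}^w = (0, 0, -9.81)$ and a body that is only rolled about its x axis; the function names are illustrative and the noise term defaults to zero.

```python
import math

GRAVITY_W = (0.0, 0.0, -9.81)  # assumed convention: world z axis points up


def rot_world_to_body_roll(angle):
    """R_w^b for a body rolled by `angle` about its x axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[1.0, 0.0, 0.0], [0.0, c, s], [0.0, -s, c]]


def matvec(m, v):
    """3x3 matrix times 3-vector."""
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]


def accel_reading(r_world_to_body, accel_w, bias, noise=(0.0, 0.0, 0.0)):
    """a_tilde = R_w^b (a^w - g^w) + b^a + eta^a."""
    specific_force_w = [a - g for a, g in zip(accel_w, GRAVITY_W)]
    rotated = matvec(r_world_to_body, specific_force_w)
    return [r + b + n for r, b, n in zip(rotated, bias, noise)]


def gyro_reading(omega_b, bias, noise=(0.0, 0.0, 0.0)):
    """omega_tilde = omega_b + b^g + eta^g."""
    return [w + b + n for w, b, n in zip(omega_b, bias, noise)]


# A sensor at rest still reads gravity as an upward specific force.
level = accel_reading(rot_world_to_body_roll(0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0))

# Rolled by 30 degrees with a small x-axis bias: gravity spreads over y and z.
rolled = accel_reading(rot_world_to_body_roll(math.radians(30.0)),
                       (0.0, 0.0, 0.0), (0.1, 0.0, 0.0))
roll_from_gravity = math.atan2(rolled[1], rolled[2])
```

The stationary reading is not zero, which is the "gravity baked in" point, and the roll angle can be read back from how gravity is distributed over the axes. The bias shows up as an offset that cannot be told apart from true acceleration in a single sample.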

---

## 7b.2 First attempt — Lupton & Sukkarieh (2009 / 2012)

The problem was the time axis of the factor graph. Kaess's iSAM2, which Ch.6 covered, takes keyframe-rate poses as nodes. But the IMU throws dozens of measurements between keyframes. Making every measurement a node blows up the graph; discarding them loses information.

The answer from Lupton and Sukkarieh's [Visual-Inertial-Aided Navigation for High-Dynamic Motion (IROS 2009, TRO 2012)](https://doi.org/10.1109/TRO.2011.2170332) was a detour. Numerically integrate the IMU measurements between keyframes $i$ and $j$ *once* to build a relative increment. Treat that increment as a single factor, and the raw IMU measurements never need to enter the graph. The name "pre-integration" came from here.
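The "integrate once" step can be sketched as a short loop. This is an illustration under stated assumptions rather than Lupton's implementation: rotation is kept as a matrix instead of the Euler angles the original paper used, the bias is held fixed during integration, and `preintegrate` and its argument layout are hypothetical.

```python
import math


def matmul(A, B):
    """3x3 matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec(A, v):
    """3x3 matrix times 3-vector."""
    return [sum(A[i][k] * v[k] for k in range(3)) for i in range(3)]


def so3_exp(phi):
    """Rodrigues' formula: rotation vector to rotation matrix."""
    x, y, z = phi
    theta = math.sqrt(x * x + y * y + z * z)
    K = [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]
    K2 = matmul(K, K)
    a = 1.0 if theta < 1e-9 else math.sin(theta) / theta
    b = 0.5 if theta < 1e-9 else (1.0 - math.cos(theta)) / (theta * theta)
    return [[(1.0 if i == j else 0.0) + a * K[i][j] + b * K2[i][j]
             for j in range(3)] for i in range(3)]


def preintegrate(samples, dt, bias_g=(0.0, 0.0, 0.0), bias_a=(0.0, 0.0, 0.0)):
    """Fold all (gyro, accel) samples between two keyframes into (dR, dv, dp)."""
    dR = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    dv, dp = [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]
    for gyro, accel in samples:
        # Bias-corrected acceleration, expressed in the frame of keyframe i.
        a = matvec(dR, [accel[k] - bias_a[k] for k in range(3)])
        dp = [dp[k] + dv[k] * dt + 0.5 * a[k] * dt * dt for k in range(3)]
        dv = [dv[k] + a[k] * dt for k in range(3)]
        dR = matmul(dR, so3_exp([(gyro[k] - bias_g[k]) * dt for k in range(3)]))
    return dR, dv, dp


# One second of 200 Hz data: hovering while yawing at 90 degrees per second.
rate_hz = 200
samples = [((0.0, 0.0, math.pi / 2), (0.0, 0.0, 9.81))] * rate_hz
dR, dv, dp = preintegrate(samples, 1.0 / rate_hz)
```

Two hundred raw samples collapse into one rotation, one velocity increment, and one position increment. Note that `dv` comes out as 9.81 m/s along z even though the device is hovering: gravity has not been removed, so the increments are inputs to a factor rather than physical motion.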

The idea was right, but the implementation had two obstacles. The rotation representation was Euler angles, which suffer from gimbal lock and do not respect the manifold structure of SO(3). The more crippling issue was bias. Every time BA runs, the bias estimate shifts, and when the bias shifts, so does the increment. Re-integrating hundreds of measurements per keyframe at every pass would cut into real time. Lupton's journal version already answered with a first-order bias correction, but it was linearized in the same Euler-angle coordinates and inherited their singularities. That limitation delayed the idea's spread by six years.

---

## 7b.3 The decisive turn — Forster-Carlone (2015 / 2017)

At RSS 2015, Christian Forster, a doctoral student in Scaramuzza's group at the University of Zurich (UZH), together with Scaramuzza, Carlone (Georgia Tech, later MIT), and Dellaert (Georgia Tech, creator of GTSAM), released [IMU Preintegration on Manifold for Efficient Visual-Inertial Maximum-a-Posteriori Estimation](https://www.roboticsproceedings.org/rss11/p06.pdf). The extended version appeared in IEEE TRO as [On-Manifold Preintegration for Real-Time Visual-Inertial Odometry](https://doi.org/10.1109/TRO.2016.2597321), online in 2016 and in print in the February 2017 issue, which is why citations give either year. The author list itself is a lineage. UZH's agile drone experiments, Georgia Tech's GTSAM factor-graph language, and Carlone's optimization theory met in one paper.

There were three redefinitions. First, they defined $\Delta\mathbf{R}_{ij}$ rigorously as a relative rotation on the SO(3) manifold, and redefined $\Delta\mathbf{v}_{ij}, \Delta\mathbf{p}_{ij}$ so as to be *independent of gravity and the initial state*. These quantities are not physical increments but quantities constructed to be state-independent in the math. Because of this, the IMU factor could be evaluated knowing only the poses and velocities at the two ends. Second, they propagated the covariance $\boldsymbol{\Sigma}_{ij}$ analytically with a right-Jacobian trick that pushes noise to the tail of the exponential map.

The decisive move was the third. **Linear correction via the bias first-order Jacobian.** When the bias shifts a little during BA iteration, do not re-integrate the whole increment — apply a first-order correction using precomputed partial derivatives. The same idea as Lupton's Euclidean linearization, but operating on SO(3). Compute it once when you first integrate between keyframes, and no matter how many hundreds of times graph optimization iterates, the Jacobian does not have to be touched again. A re-integration of several milliseconds shrank to a Jacobian-vector product of several microseconds. This was where the IMU factor entered real-time BA.
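The saving can be checked numerically. The sketch below is a deliberately narrow illustration: it handles only the accelerometer bias, holds the gyroscope bias at zero, and uses hypothetical names throughout. For the accelerometer bias the velocity and position increments are linear in the bias, so the stored-Jacobian correction matches a full re-integration up to rounding. The gyroscope-bias correction goes through the SO(3) right Jacobian and is accurate only to first order, which this sketch does not cover.

```python
import math


def matmul(A, B):
    """3x3 matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec(A, v):
    """3x3 matrix times 3-vector."""
    return [sum(A[i][k] * v[k] for k in range(3)) for i in range(3)]


def so3_exp(phi):
    """Rodrigues' formula: rotation vector to rotation matrix."""
    x, y, z = phi
    theta = math.sqrt(x * x + y * y + z * z)
    K = [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]
    K2 = matmul(K, K)
    a = 1.0 if theta < 1e-9 else math.sin(theta) / theta
    b = 0.5 if theta < 1e-9 else (1.0 - math.cos(theta)) / (theta * theta)
    return [[(1.0 if i == j else 0.0) + a * K[i][j] + b * K2[i][j]
             for j in range(3)] for i in range(3)]


def preintegrate(samples, dt, bias_a):
    """Return dv, dp and their Jacobians with respect to the accel bias."""
    dR = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    dv, dp = [0.0] * 3, [0.0] * 3
    J_v = [[0.0] * 3 for _ in range(3)]
    J_p = [[0.0] * 3 for _ in range(3)]
    for gyro, accel in samples:
        a = matvec(dR, [accel[k] - bias_a[k] for k in range(3)])
        for r in range(3):
            for c in range(3):
                # d(dp)/d(b_a) uses the Jacobian of dv from before this step.
                J_p[r][c] += J_v[r][c] * dt - 0.5 * dR[r][c] * dt * dt
                J_v[r][c] -= dR[r][c] * dt
        dp = [dp[k] + dv[k] * dt + 0.5 * a[k] * dt * dt for k in range(3)]
        dv = [dv[k] + a[k] * dt for k in range(3)]
        dR = matmul(dR, so3_exp([g * dt for g in gyro]))
    return dv, dp, J_v, J_p


dt = 0.005
samples = [((0.1, -0.2, 0.3), (0.5, 0.0, 9.81))] * 100
bias0 = (0.02, -0.01, 0.03)
delta = (0.005, 0.002, -0.004)   # bias update proposed by the optimizer

# Integrate once at the initial bias estimate and keep the Jacobians.
dv0, dp0, J_v, J_p = preintegrate(samples, dt, bias0)

# Cheap path: one matrix-vector product per increment.
dv_corrected = [x + y for x, y in zip(dv0, matvec(J_v, delta))]
dp_corrected = [x + y for x, y in zip(dp0, matvec(J_p, delta))]

# Expensive path: walk through all 100 samples again at the new bias.
new_bias = tuple(b + d for b, d in zip(bias0, delta))
dv_redo, dp_redo, _, _ = preintegrate(samples, dt, new_bias)
```

The cheap path touches two 3x3 matrices regardless of how many samples lie between the keyframes, while the expensive path scales with the sample count and would run again at every optimizer iteration.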

The final wedge was Forster's implementation landing in GTSAM as a reference. Later systems did not rewrite the equations. They `#include`d `ImuFactor`.

> 🔗 **Borrowed.** Forster's manifold preintegration uses, as is, the SO(3) right-Jacobian apparatus that [Barfoot 2017. *State Estimation for Robotics*](https://doi.org/10.1017/9781316671528) later organized in textbook form. The Lie-group calculation that substitutes small rotational variations with exponential maps and Jacobians was the common language of state estimation in robotics, and Forster rewrote IMU preintegration in this language. Where Lupton had been stuck in Euler angles, the same problem, moved into the SO(3) dialect, came loose.

> 🔗 **Borrowed.** The bias first-order Jacobian idea itself was first put forward by [Lupton & Sukkarieh 2012](https://doi.org/10.1109/TRO.2011.2170332). Forster et al. TRO 2016 §VIII-B acknowledges the debt explicitly: "we follow [Lupton-Sukkarieh] but operate directly on SO(3)." Carrying the Euclidean approximation onto the manifold, the same math became real-time.

---

## 7b.4 The three schools of practical VIO

Once Forster's formulation settled in, Visual-Inertial Odometry (VIO) systems branched into three lines between 2017 and 2022.

The first line is the filter family, and its root goes back further than Forster. The starting point was [MSCKF (Multi-State Constraint Kalman Filter)](https://doi.org/10.1109/ROBOT.2007.364024), which Anastasios Mourikis and Stergios Roumeliotis, then at the University of Minnesota, presented at ICRA 2007. The scheme uses stochastic cloning to keep a sliding window of past camera poses in the filter state, and marginalizes the observed 3D points by projecting their residuals onto the left null space of the feature Jacobian. It was the first case of running Visual-Inertial in real time on an EKF skeleton without preintegration. In 2021, the estimator that NASA JPL's Mars helicopter Ingenuity ran on Mars was an EKF from the same state-augmentation lineage. Guoquan Huang's group at the University of Delaware open-sourced [OpenVINS](https://doi.org/10.1109/ICRA40945.2020.9196524) in 2020.

The second line is the optimization family. Its signature work is HKUST's Shaojie Shen group and his doctoral student Tong Qin with [VINS-Mono](https://doi.org/10.1109/TRO.2018.2853729) in TRO 2018. They took Forster's formulation as is, planted it as an IMU factor inside a sliding-window tightly-coupled BA, and laid out a procedure for separately estimating scale and gravity direction during initialization. The code was released, and from 2019 to 2022 it became the VIO baseline at conferences. When Ch.7's ORB-SLAM3 was reported at an average ATE of 0.043 m over the eleven EuRoC sequences, the side compared at 0.110 m in the same table was VINS-Mono.

The third line is the direct family. VI-DSO (2018), Basalt (2019), and [DM-VIO (2022)](https://doi.org/10.1109/LRA.2021.3140129), covered in Ch.8, belong here. TUM's Cremers group stacked Forster's inertial factor on top of DSO's photometric BA. DM-VIO added *delayed marginalization*. If you marginalize rashly before IMU initialization converges, a wrong prior locks in and causes long-term drift; the method maintains two marginalization priors in parallel and merges them into the final prior after gravity and scale are observed.

---

## 7b.5 Observability — what cannot be seen

Visual-Inertial systems do not see everything.

The analyses Huang's group organized from the early 2010s converged on one conclusion. The null space of a visual-inertial system with no external reference is **four-dimensional**. Three dimensions of global position and one dimension of yaw around gravity. Absolute coordinates and rotation about the gravity axis can never be known from IMU and camera alone. Add GPS, and position comes back; add a magnetic field or external anchor, and yaw comes back. Pure VIO structurally cannot see this four-dimensional subspace.

What is interesting is that *roll and pitch are visible*. The accelerometer reads the horizontal through gravity. This is also where the monocular scale ambiguity Ch.5 pointed out gets resolved by attaching an IMU.
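The split between what is visible and what is not can be checked on the inertial term alone. The sketch below is an illustration with hypothetical names and arbitrary example states. It evaluates the relative quantities an IMU factor constrains, then moves the whole trajectory by a global transform. Camera reprojection residuals are unaffected by any rigid transform applied to poses and landmarks together, so only the inertial quantities are tested here.

```python
import math

GRAVITY = (0.0, 0.0, -9.81)  # assumed z-up world frame


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec(A, v):
    return [sum(A[i][k] * v[k] for k in range(3)) for i in range(3)]


def transpose(A):
    return [[A[j][i] for j in range(3)] for i in range(3)]


def rot_z(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rot_x(a):
    c, s = math.cos(a), math.sin(a)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def increments(state_i, state_j, dt):
    """Relative rotation, velocity, position terms (R is body-to-world)."""
    R_i, v_i, p_i = state_i
    R_j, v_j, p_j = state_j
    Rt = transpose(R_i)
    dR = matmul(Rt, R_j)
    dv = matvec(Rt, [v_j[k] - v_i[k] - GRAVITY[k] * dt for k in range(3)])
    dp = matvec(Rt, [p_j[k] - p_i[k] - v_i[k] * dt - 0.5 * GRAVITY[k] * dt * dt
                     for k in range(3)])
    return [x for row in dR for x in row] + dv + dp


def transform(state, R_g, t_g):
    """Apply one global rotation and translation to a (R, v, p) state."""
    R, v, p = state
    return (matmul(R_g, R), matvec(R_g, v),
            [a + b for a, b in zip(matvec(R_g, p), t_g)])


def max_change(R_g, t_g):
    before = increments(state_i, state_j, dt)
    after = increments(transform(state_i, R_g, t_g),
                       transform(state_j, R_g, t_g), dt)
    return max(abs(a - b) for a, b in zip(before, after))


dt = 1.0
state_i = (matmul(rot_z(0.3), rot_x(0.1)), [1.0, 0.0, 0.2], [0.0, 0.0, 1.0])
state_j = (matmul(rot_z(0.5), rot_x(-0.2)), [1.5, 0.3, 0.1], [1.2, 0.1, 1.1])

yaw_and_shift = max_change(rot_z(0.7), [10.0, -4.0, 2.0])  # the 4-D gauge
roll_only = max_change(rot_x(0.2), [0.0, 0.0, 0.0])        # tilts gravity
```

A yaw about gravity combined with an arbitrary translation leaves every inertial quantity untouched, so no measurement can pin those four degrees of freedom. A global roll changes the velocity and position terms because it rotates the gravity vector relative to the trajectory, which is why roll and pitch are recoverable.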

The trickier side is degenerate motion. Under pure straight-line motion, global orientation is unobservable; under pure rotation, feature depth is unobservable; under constant acceleration, monocular scale in its entirety is unobservable. This is why VIO scale wavers when a drone hovers or a car goes straight at constant speed. Practitioners know it from experience. At the moment of takeoff, the moment of braking, the moment of cornering, scale "locks in."

> 📜 **Prediction vs. outcome.** Forster et al. in TRO 2017 §IX named three directions: integrating time synchronization with online extrinsic calibration, validating the bias random-walk assumption under long-term operation, and extending to asynchronous sensors such as event cameras and rolling shutter. As of 2026, the first has been standardized as VINS-Mono, Kalibr, and OpenVINS put the time offset onto the state vector; the second holds for navigation-grade IMUs but remains affected by temperature and power-supply variation on consumer MEMS; the third has found one branch of the answer in Le Gentil's GP continuous-time preintegration. The predictions were largely on target, but instead of the single extension the authors sketched, the line split into three. `[partial hit + diverted]`

---

## 7b.6 The branch into continuous-time

At RSS 2021, Cédric Le Gentil and his advisor Teresa Vidal-Calleja at UTS (University of Technology Sydney) released [Continuous Integration over SO(3) for IMU Preintegration](https://roboticsproceedings.org/rss17/p075.pdf). Sydney again. A few kilometers from Lupton's ACFR, the same problem was looked at again from a different angle.

Forster's preintegration is discrete. It assumes the IMU measurements are piecewise-constant between samples and does Euler integration. The assumption breaks when asynchronous sensors such as LiDAR or event cameras get mixed in. It is ambiguous which discrete bin to attach a LiDAR point that arrived in the middle of a scan, and interpolation error accumulates. Le Gentil's answer was to model the IMU as a **Gaussian Process**, treating the angular velocity as a continuous function. The state can be evaluated at an arbitrary time $\tau$, so asynchronous measurements enter naturally. This direction, which meets the B-spline, STEAM, and GPMP lineages, deserves its own separate lineage.

---

## 7b.7 Terrain of borrowings

> 🔗 **Borrowed.** The skeleton that evaluates and optimizes an IMU factor on a factor graph is the [Dellaert GTSAM](https://gtsam.org/) tradition from Ch.6 unchanged. Forster's `ImuFactor` plugs into GTSAM's `NoiseModelFactor` interface and is optimized inside a single `Values` object alongside visual reprojection factors. The inheritance was of software structure, not of mathematics.

> 🔗 **Borrowed.** The practice of treating bias as a random walk comes from the Kalman-filter state-propagation convention Ch.4 recorded. Well before Lupton, the navigation community used a model that "puts the bias in the state and gives it small process noise," and in the preintegration era this was reinterpreted as the bias random-walk factor.

---

## 🧭 Still open

**Real-time detection of visual-inertial observability.** The four-dimensional null space and the table of degenerate motions are theoretically settled, but the mechanism by which a running system decides "I am currently in a degenerate regime" is unfinished. The FEJ (First-Estimate Jacobian) line of Hesch, Li, and Huang preserves the null space at the linearization point, but as of 2026 there is no standard method to catch the start and end of degenerate conditions at runtime and feed them back into the control loop. In systems where drone control and VIO estimation run on the same CPU, this gap leads to actual crashes.

**Unification of preintegration and continuous-time.** Forster's discrete increments and Le Gentil's GP continuous representation solve the same problem in different mathematical languages. When mixing LiDAR, event, and frame cameras, which representation to lay at the bottom is still a matter of engineering choice. B-spline continuous-time BA has given a partial answer, but most deployed systems still use Forster's discrete factor.

**Learning-based IMU bias models.** The bias random-walk assumption holds on navigation-grade IMUs, but on consumer MEMS it drifts because of temperature hysteresis and power-supply transients. The TLIO and RoNIN line learned IMU-only odometry, regressing displacement or velocity from windows of raw inertial data with ResNet and LSTM backbones, and more recently attempts have appeared that model the bias distribution itself with conditional diffusion. How this approach fits inside a Forster factor, and how much of preintegration's mathematical elegance will hold up once learning is doing the propping, is the next question.

---

The idea Lupton started in Sydney sat for six years against the wall of Euler angles; Forster moved it to SO(3) and picked the lock of the bias Jacobian; Le Gentil, in Sydney again, branched it into continuous time. Three generations of work are stacked behind one line in ORB-SLAM3, one line in VI-DSO, one line in LIO-SAM. The continuous-time branch that Le Gentil opened is the subject of Ch.7c; readers following the main visual lineage continue at Ch.8, where the direct methods (DSO and VI-DSO) receive the preintegration factor from Forster's hand.

---

# Ch.7c — When Time Must Flow Smoothly: Continuous-Time Trajectory

The preintegration Ch.7b organized was an engineering move that compressed IMU measurements into a relative factor between discrete keyframes. That compression presumes the unit called "keyframe." For the hundred inertial samples arriving between two keyframes to fold into one factor, both endpoints of the factor must carry a definite timestamp. In a system where the camera shutter opens and closes globally once per frame, the assumption is harmless. The trouble sat elsewhere.

In 2012, in Toronto, [Paul Furgale, Timothy Barfoot, and Gabe Sibley](https://asrl.utias.utoronto.ca/~tdb/bib/furgale_iros12.pdf) formalized the question in an IROS paper. In one image captured by a rolling shutter, each row is projected from a different pose in time. While a spinning LiDAR completes one rotation, the vehicle travels several meters. The IMU pours out samples at 1 kHz while the camera runs at 30 Hz. The most natural way to tie these sensors into a single optimization was to treat the pose not as a "frame" but as a "function of time $t$." Furgale, Barfoot, and Sibley picked the B-spline, and that choice became the official starting point of the branch known as continuous-time trajectory estimation.

Ten years later, the Handbook places this branch, alongside manifolds, as one of the "two fundamental tools" of SLAM. The reason earlier chapters of this book have never pulled this tool out is that the grammar of Visual-Inertial was largely completed on discrete keyframes. This chapter fills that gap.

---

## 7c.1 Limits of discrete-time

What Ch.7b's preintegration solved was a single axis: "the IMU is faster than the camera." It left four other axes unsolved.

First, rolling shutter. A consumer CMOS camera reads one frame top-to-bottom over tens of milliseconds. In a fast-moving camera the first row and the last row are captured from different poses. When Ch.8's DSO and LSD-SLAM assumed photometric consistency, this distortion sat outside the model. That is why the Cremers group mounted a B-spline trajectory onto [Basalt](https://arxiv.org/abs/1904.06504) in 2019.

Second, spinning LiDAR motion distortion. As seen in Ch.17, the Velodyne HDL-64E spins at 10 Hz, one rotation every 100 ms. If the vehicle runs at 10 m/s during those 100 ms, the first and last points of a single scan are captured from poses 1 m apart. LOAM corrected this distortion indirectly inside the odometry loop, but the principled solution was a trajectory representation that allowed querying "the pose at the instant each point was captured."

Third, event cameras. The DVS recorded in Ch.18 pours asynchronous events at μs granularity per pixel. Events have no "frame." [Mueggler et al. 2015](https://arxiv.org/abs/1502.00796) formulated event SLAM on an SE(3) B-spline trajectory because no other option was available.

Fourth, fusing a high-rate IMU with several sensors of heterogeneous frequency at once. When a system ingests a 200 Hz IMU, a 20 Hz camera, and a 10 Hz LiDAR, placing a discrete state node at every measurement time is not realistic. The moment the number of states tracks the number of measurements, the factor graph swells.

The four problems share one structure. The measurement time $t_i$ is not controlled. Observations arrive whenever, and the estimator has to know the pose at that time. The decoupling of "measurement time / estimation time / query time" is the essential advantage of a continuous-time representation.

---

## 7c.2 Parametric spline: the Furgale line

The tool Furgale, Barfoot, and Sibley chose in 2012 was the B-spline. Write the trajectory as a sum of basis functions, $\mathbf{p}(t) = \sum_k \Psi_k(t)\,\mathbf{c}_k$, and take the coefficients $\mathbf{c}_k$ as the optimization variables. The core of the B-spline is local support. At any time $t$ only a handful of bases (four for a cubic spline) are nonzero; the rest are exactly zero. Querying the pose at an arbitrary time $t_i$ costs constant time, and the sparsity of the factor graph is preserved.
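Local support is easy to see in code. The following is a minimal sketch for a uniform cubic B-spline in Euclidean space, with an assumed indexing convention in which the segment starting at knot $i$ uses coefficients $\mathbf{c}_i$ through $\mathbf{c}_{i+3}$; the function names are illustrative.

```python
import math


def cubic_basis(u):
    """The four nonzero uniform cubic B-spline weights at local time u in [0, 1)."""
    return ((1 - u) ** 3 / 6,
            (3 * u ** 3 - 6 * u ** 2 + 4) / 6,
            (-3 * u ** 3 + 3 * u ** 2 + 3 * u + 1) / 6,
            u ** 3 / 6)


def eval_spline(coeffs, t, knot_dt):
    """Evaluate p(t) = sum_k Psi_k(t) c_k by touching only four coefficients."""
    i = int(math.floor(t / knot_dt))
    i = max(0, min(i, len(coeffs) - 4))   # clamp to a valid segment
    u = t / knot_dt - i
    w = cubic_basis(u)
    dim = len(coeffs[0])
    value = [sum(w[j] * coeffs[i + j][d] for j in range(4)) for d in range(dim)]
    return value, i


coeffs = [[float(k)] for k in range(10)]   # ten 1-D control coefficients
value, first = eval_spline(coeffs, 0.37, 0.1)

# Moving a coefficient outside the four-wide window leaves the query unchanged.
perturbed = [[100.0]] + coeffs[1:]
value_perturbed, _ = eval_spline(perturbed, 0.37, 0.1)
```

The query cost does not depend on the length of `coeffs`, and each measurement at time $t_i$ links to exactly four coefficient nodes, which is what keeps the factor graph sparse.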

> 🔗 **Borrowed.** The mathematical skeleton of the B-spline is the classic [de Boor (1978) *A Practical Guide to Splines*](https://link.springer.com/book/10.1007/978-1-4612-6333-3). What Furgale did was lift that skeleton onto SE(3) and place the coefficients as variable nodes in the factor graph. It is the path by which a tool from the numerical-analysis textbook was transplanted into SLAM optimization.

The weaknesses of the form were clear. Spacing the coefficients tightly overfits; spacing them widely misses fast motion. The spacing choice depended on experience. And applying the Euclidean basis sum directly to SE(3) elements makes the interpolated result leave the manifold.

In 2013 [Steven Lovegrove et al.](https://www.roboticsproceedings.org/rss09/p11.html), working with Gabe Sibley at George Washington University, proposed the cumulative B-spline. Rearranging the basis not as a sum but as a cumulative product, $T(t) = \prod_k \exp\bigl(\tilde\Psi_k(t) \log(T_k T_{k-1}^{-1})\bigr) \cdot T_0$, closes each factor on the Lie group. This form became the native language of subsequent rolling-shutter, event-camera, and VIO papers. Basalt, the [Mueggler event SLAM](https://arxiv.org/abs/1502.00796), and [Kerl et al. 2015 dense rolling shutter VO](https://doi.org/10.1109/ICCV.2015.172) all stood on the cumulative B-spline.
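The cumulative form can be sketched on SO(3), which keeps the code short while showing the same mechanism as SE(3). Assumptions in this illustration: a uniform cubic spline over four control rotations, the product ordered so that the highest index multiplies from the left (one reading of the formula above), a logarithm that is valid only away from a rotation angle of $\pi$, and hypothetical function names.

```python
import math


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose(A):
    return [[A[j][i] for j in range(3)] for i in range(3)]


def so3_exp(phi):
    """Rodrigues' formula: rotation vector to rotation matrix."""
    x, y, z = phi
    theta = math.sqrt(x * x + y * y + z * z)
    K = [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]
    K2 = matmul(K, K)
    a = 1.0 if theta < 1e-9 else math.sin(theta) / theta
    b = 0.5 if theta < 1e-9 else (1.0 - math.cos(theta)) / (theta * theta)
    return [[(1.0 if i == j else 0.0) + a * K[i][j] + b * K2[i][j]
             for j in range(3)] for i in range(3)]


def so3_log(R):
    """Rotation matrix to rotation vector (not valid near angle pi)."""
    cos_theta = max(-1.0, min(1.0, (R[0][0] + R[1][1] + R[2][2] - 1.0) / 2.0))
    theta = math.acos(cos_theta)
    scale = 0.5 if theta < 1e-9 else theta / (2.0 * math.sin(theta))
    return [scale * (R[2][1] - R[1][2]),
            scale * (R[0][2] - R[2][0]),
            scale * (R[1][0] - R[0][1])]


def cumulative_basis(u):
    """Cumulative cubic weights for k = 1, 2, 3 (the k = 0 weight is 1)."""
    return ((5 + 3 * u - 3 * u ** 2 + u ** 3) / 6,
            (1 + 3 * u + 3 * u ** 2 - 2 * u ** 3) / 6,
            u ** 3 / 6)


def eval_cumulative(controls, u):
    """R(u) = E_3 E_2 E_1 R_0 with E_k = Exp(w_k(u) Log(R_k R_{k-1}^T))."""
    R = controls[0]
    for k, w in enumerate(cumulative_basis(u), start=1):
        delta = so3_log(matmul(controls[k], transpose(controls[k - 1])))
        R = matmul(so3_exp([w * d for d in delta]), R)
    return R


# Control rotations about a single axis, so the result has a scalar reference.
angles = [0.0, 0.4, 0.5, 1.2]
controls = [so3_exp([0.0, 0.0, a]) for a in angles]
R = eval_cumulative(controls, 0.3)
angle = math.atan2(R[1][0], R[0][0])
```

Every factor in the product is itself a rotation, so the result stays on the group by construction. When all control rotations share one axis, the interpolated angle coincides with the ordinary sum-form B-spline applied to the angles, which is a convenient sanity check.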

The parametric spline has been used steadily in real-time VIO and event systems for its light computation and simple code. What it lacks is a natural way to layer a prior (motion prior) over the trajectory. In stretches where observations are sparse, the spline is smooth but smooth without grounds. Another branch fills that gap.

---

## 7c.3 SDE-based GP: the Barfoot line and STEAM

In 2014, the Barfoot group in Toronto opened a second branch. [Barfoot, Tong, and Särkkä 2014, "Batch Continuous-Time Trajectory Estimation as Exactly Sparse Gaussian Process Regression"](https://www.roboticsproceedings.org/rss10/p01.pdf): the title was already the claim. Treat the trajectory not as a basis sum but as a Gaussian process. The prior over the trajectory is given by a kernel $\mathcal{K}(t, t')$, and when observations arrive the posterior closes as a conditional Gaussian.

The pure form of GP has one problem. For a large observation count $N$, inverting the kernel matrix $K$ costs $O(N^3)$. What Barfoot, Tong, and Särkkä showed was that a family of kernels exists for which this cost can be avoided. When the trajectory is defined as the solution of a linear time-invariant stochastic differential equation $\dot{\mathbf{x}}(t) = A\mathbf{x}(t) + L\mathbf{w}(t)$, the inverse $K^{-1}$ of its kernel $K$ has a block-tridiagonal structure. Read as a factor graph: binary factors exist only between consecutive state nodes, and none between distant nodes.
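The sparsity claim can be observed on the simplest member of that kernel family. The sketch below assumes a scalar state driven directly by white noise, $\dot{x}(t) = w(t)$ (so $A = 0$, $L = 1$), whose kernel is $K(t, t') = \min(t, t')$. It builds the dense kernel matrix at six query times and inverts it with a small hand-written Gauss-Jordan routine. This is only a numerical illustration, not the construction used in the paper.

```python
def invert(M):
    """Gauss-Jordan inverse with partial pivoting (small dense matrices only)."""
    n = len(M)
    A = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [x / p for x in A[col]]
        for r in range(n):
            if r != col:
                f = A[r][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]
    return [row[n:] for row in A]


# Kernel of a white-noise-driven scalar state: K(t, t') = min(t, t').
times = [k / 10 for k in range(1, 7)]
K = [[min(a, b) for b in times] for a in times]
K_inv = invert(K)

# Largest entry more than one step away from the diagonal.
n = len(times)
off_band = max(abs(K_inv[i][j]) for i in range(n) for j in range(n)
               if abs(i - j) > 1)
```

The kernel matrix itself is fully dense, yet its inverse is nonzero only on the diagonal and the two neighboring bands. With a vector-valued state such as pose plus velocity, each scalar entry becomes a block, which gives the block-tridiagonal structure described above.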

> 🔗 **Borrowed.** The frame "reinterpret the GP posterior as a prior on the factor graph" is the SDE-GP connection laid out in [Särkkä 2013 *Bayesian Filtering and Smoothing*](https://users.aalto.fi/~ssarkka/pub/cup_book_online_20131111.pdf), which the Barfoot group pulled into SLAM. The Rasmussen-Williams GP textbook writes the kernel in closed form, but real-time SLAM wants a sparse inverse. Särkkä's SDE representation was the bridge.

The practical payoff is **STEAM** (Simultaneous Trajectory Estimation and Mapping). At RSS 2015, [Sean Anderson and Barfoot 2015, "Full STEAM Ahead"](https://www.roboticsproceedings.org/rss11/p45.pdf) formalized a constant-velocity-prior STEAM. Augment the state with pose $\mathbf{p}(t)$ and velocity $\mathbf{v}(t)$, and drive the acceleration with white noise, so that velocity is the integral of white noise and pose is the integral of velocity. That same year Anderson tightened the sparsity proof, and that piece became the backbone of every continuous-time paper out of the Barfoot group thereafter.

STEAM's second advantage was GP interpolation. Keep only a small number of control poses, and query the pose at any time between them as the posterior mean. Within one scan of a spinning LiDAR, even if 10,000 points are captured at 10,000 different instants, the control points number only one per scan. The computation scales with the number of control points, not with the number of observations.

In 2019, the release of Tang and Barfoot's [STEAM open source](https://github.com/utiasASRL/steam) gave academia and industry a directly usable library. The same year, the Dellaert group's GTSAM received a GP continuous-time factor in contrib. It was the convergence of the two paths.

---

## 7c.4 Continuous-time on the Lie group

Parametric or nonparametric, SLAM wants a trajectory on SE(3). Lifting a Euclidean spline or GP onto SE(3) is not technically simple. The common approach is to lean on the tangent space: linearly interpolate there, then lay the result back onto the manifold with the exponential map.

On the B-spline side, [Sommer, Demmel et al. 2020, "Efficient Derivative Computation for Cumulative B-Splines on Lie Groups"](https://arxiv.org/abs/1911.08860) compiled the SE(3) cumulative-spline Jacobian in closed form. This CVPR paper supplied a standard formulation for B-spline trajectories with real-time derivatives that rolling-shutter VIO, event cameras, and visual-inertial systems could all use. Basalt and the follow-up work from the Cremers group stood on this result.

On the GP side, Anderson and Barfoot proposed a "local variable" construction. Near each control pose $T_k$ define a local perturbation $\xi_k(t) = \log(T(t)\,T_k^{-1})$, and run the GP on it. Defining a GP directly on the global manifold is hard, but on the tangent space around each control point a Euclidean GP stands. When crossing between control points, an adjoint appears, and the reason that adjoint must be there shares the root of Ch.7b preintegration's on-manifold discussion. The fact that both tools share the same Lie-group grammar became clear from 2015 on.

> 🔗 **Borrowed.** The path of transplanting a GP into a Lie-group local variable was first systematized by [Anderson-Barfoot 2015 ICRA](https://doi.org/10.1109/ICRA.2015.7138984). Every continuous-time LiDAR and VIO paper thereafter inherited their trick: run a GP only between two consecutive control points, and correct with the adjoint when stepping across control points.

The practical difference between spline and GP is the presence or absence of a motion prior. The spline estimates coefficients directly, with no prior. The GP carries a prior derived from an SDE, such as constant velocity (white noise on acceleration) or constant acceleration (white noise on jerk). Where observations are sparse, the prior fills in for the GP; the spline is filled by its neighboring observations. Attempts to combine the two (Johnson et al. 2020) have appeared, but in practice one or the other is chosen per application.

---

## 7c.5 The line descends to applications: LiDAR and VIO

It took ten years for the theory to come down into applications. Starting around 2022, continuous-time became a de facto standard in three arenas.

First, LiDAR motion distortion. Paris's [Pierre Dellenbach et al. 2022, "CT-ICP"](https://arxiv.org/abs/2109.12979) parameterized each scan with two poses (a "start pose" and an "end pose") and linearly interpolated between them. A simple continuous-time model, yet it beat the accuracy of prior LOAM and FAST-LIO on the KITTI, NCLT, and Newer College benchmarks. The same year, Toronto's [Keenan Burnett et al. 2022, "Are We Ready for Radar to Replace Lidar?"](https://arxiv.org/abs/2206.05432) and [STEAM-ICP](https://github.com/utiasASRL/steam_icp) applied GP-based continuous-time to the Aeva FMCW LiDAR. The Aeva sensor outputs Doppler velocity alongside each point, and that velocity maps directly to STEAM's velocity state. Without the continuous-time representation, this information had no way of being used.
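The two-pose idea can be sketched in the plane. This illustration makes loud simplifications: poses are (x, y, yaw) rather than full SE(3), the start and end poses are taken as known instead of being estimated inside an ICP loop as CT-ICP does, and all names are hypothetical. The numbers reuse the 10 m/s, 100 ms example from Section 7c.1.

```python
import math


def interpolate_pose(start, end, alpha):
    """Linear interpolation of (x, y, yaw); yaw follows the shortest arc."""
    dyaw = math.atan2(math.sin(end[2] - start[2]), math.cos(end[2] - start[2]))
    return (start[0] + alpha * (end[0] - start[0]),
            start[1] + alpha * (end[1] - start[1]),
            start[2] + alpha * dyaw)


def to_world(pose, point):
    """Map a sensor-frame point into the world frame."""
    x, y, yaw = pose
    c, s = math.cos(yaw), math.sin(yaw)
    return (x + c * point[0] - s * point[1], y + s * point[0] + c * point[1])


def deskew(points, stamps, t_start, t_end, start, end):
    """Place every point with the pose interpolated at its own timestamp."""
    out = []
    for p, t in zip(points, stamps):
        alpha = (t - t_start) / (t_end - t_start)
        out.append(to_world(interpolate_pose(start, end, alpha), p))
    return out


# One 100 ms scan while driving straight at 10 m/s: the sensor moves 1 m.
start_pose, end_pose = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
stamps = [0.0, 0.025, 0.05, 0.075, 0.1]

# The same static landmark at (5, 2), as seen from the moving sensor.
points = [(5.0 - 10.0 * t, 2.0) for t in stamps]

deskewed = deskew(points, stamps, 0.0, 0.1, start_pose, end_pose)
naive = [to_world(start_pose, p) for p in points]  # one pose for the whole scan
```

With per-point poses, all five returns land on the same world location. Using a single pose for the whole scan smears the landmark over a full meter, which is the distortion the continuous-time parameterization removes.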

Second, rolling-shutter VIO. Basalt, the [Cremers group rolling-shutter VO](https://doi.org/10.1109/CVPR.2016.71), and the follow-ups to [OKVIS](https://doi.org/10.1177/0278364914554813) query each image row's capture time on a B-spline trajectory. Unlike the earlier VIO that assumed a global shutter and routed around the problem, the rolling shutter itself is handled inside the model.

Third, event cameras. After the 2010s frustrations recorded in Ch.18, the event SLAM of the 2020s nearly all stood on continuous-time trajectories. The μs timestamp of each event is queried against a B-spline or GP to obtain the pose at that instant, and the residual is computed with event-image consistency. The fact that an event is a "frameless sensor" and that continuous-time is "a representation that needs no frame assumption" locked together naturally.

> 🔗 **Borrowed.** CT-ICP lays intra-scan continuous-time linear interpolation on top of a point-to-plane variant of [Besl and McKay 1992 ICP](https://graphics.stanford.edu/courses/cs164-09-spring/Handouts/paper_icp.pdf). Classic registration and Furgale's continuous-time spirit met inside one system, thirty years apart.

---

## 📜 Prediction vs. outcome

> In the Future Work of their 2012 IROS paper, Furgale, Barfoot, and Sibley wrote two expectations. One was that "continuous-time representation will become the natural language for unifying rolling shutter and high-rate IMU sampling"; the other was "follow-up work proving compatibility with a sparse factor graph." Both landed within a decade. Barfoot, Tong, and Särkkä 2014 closed the sparse GP proof, and the rolling-shutter VIO and event SLAM of the 2020s use the cumulative B-spline as their native language. One development the authors did not predict: in 2012 an implicit division of labor was assumed, in which discrete keyframe-based systems (the line that later produced ORB-SLAM) would be the mainstream and continuous-time would serve specialty sensors. What actually came out was a push in the opposite direction too. When Burnett released STEAM-ICP using the Doppler velocity of an FMCW LiDAR, continuous-time became not an appendix for handling specialty sensors but an active representation that draws out a sensor's capability. `[hit]`

---

## 🔗 Borrowed (summary)

Beyond the three boxes scattered above, here is one more round-up of the other lines this chapter has leaned on.

Without Särkkä's SDE-GP textbook, Barfoot, Tong, and Särkkä 2014 would have had no anchor for its equations. Without de Boor's 1978 spline classic, Furgale 2012 would have had to stack basis functions from scratch. Without the local-variable technique of Anderson and Barfoot 2015, transplanting GP to the Lie group would have taken longer. Continuous-time trajectory estimation is the spot where three streams, numerical analysis, probability theory, and Lie-group differential geometry, converge on the narrow point called SLAM.

---

## 🧭 Still open

**Learning-based continuous-time prior.** The motion prior that an SDE provides embeds physical assumptions such as constant-velocity or white-jerk. Real driving, walking, and UAV trajectories often violate these assumptions. In 2023-2024, attempts appeared to learn data-driven priors with neural SDE or neural ODE and plug them into the continuous-time factor graph. A system that layers a learned prior while keeping real-time sparse structure is still at the validation stage.

**Integration of VIO and continuous-time.** Ch.7b's preintegration remains the de facto standard for keyframe-based VIO. Whether continuous-time trajectories can replace preintegration, or whether a hybrid where the two tools coexist is better, has no conclusion as of 2026. Le Gentil's [GP-augmented preintegration line](https://arxiv.org/abs/2007.04144) is trying to lay a bridge, but at the deployment-system level of ORB-SLAM3 and VINS-Fusion, discrete-time preintegration is still the workhorse.

**Online sliding window for edge deployment.** STEAM- and B-spline-based systems slow down as control points accumulate. The problem of marginalizing out past control points while preserving the consistency of the continuous-time posterior is technically tricky. If continuous-time SLAM is to be pushed as a ten-year standard on embedded platforms like cars and drones, this gap has to be filled first.

---

If Ch.7b was the engineering that squeezed out discrete-time efficiency to the end, this chapter traced a line that grew outside it: the case where time must flow smoothly. The two tools do not compete. Configurations that place an IMU preintegration factor and a continuous-time LiDAR factor side by side in a single SLAM system have been reported steadily since 2024. Ch.8 picks up the visual line directly: it is where VI-DSO deploys the Forster factor on top of DSO, and where the direct photometric approach, independent of both preintegration supplements, comes to its own conclusions.

---

# Ch.8 — The Direct Lineage: From DTAM to DSO

Richard Newcombe was Andrew Davison's doctoral student. Having witnessed MonoSLAM's 30-landmark ceiling firsthand at Imperial College, in 2011 he made the opposite bet—use every pixel. Where Davison had proved real-time viability by leaning on the EKF's logic that "tracking a few points is enough," Newcombe strapped on a single GPU and showed that real-time was still possible even when the full frame was in play. DTAM is a direct descendant of MonoSLAM, but its methodological DNA is inverted.

The ORB-SLAM lineage from Ch.7 was a method of extracting features first and tracking only those features. Harris corners and ORB descriptors filtered the image down to a few hundred points, and the rest of the pixels were thrown away. The direct lineage refused this trade. There are no pixels to discard—the image itself is the measurement.

In Munich that same year, Daniel Cremers was walking a different path. He was porting computer vision's variational machinery (Gauss-Newton image alignment, the formal language of optical flow) into SLAM as a whole. Cremers's student Jakob Engel delivered LSD-SLAM in 2014 and DSO in 2016. The two papers posed the same question at different densities. What happens if, instead of extracting features, you compare pixel intensities directly?

---

## 1. Every Pixel: DTAM

[Newcombe, Lovegrove & Davison 2011. DTAM](https://doi.org/10.1109/ICCV.2011.6126513), presented at ICCV 2011, stands for "Dense Tracking and Mapping in Real-Time." As the name says, it does tracking and mapping at once, using every pixel, in real time.

The system has two parts. The tracking stage performs photometric alignment by comparing the whole current frame against a view rendered from the dense model. No feature extraction, no descriptor matching: only pixel intensity differences are minimized. The mapping stage estimates a depth map via multi-baseline stereo, accumulating photometric error over many frames in a cost volume, and maintains a smooth dense 3D model with total variation regularization.

$$E(\mathbf{u}) = \sum_{i} \rho\left( I_i\bigl(\pi(KT_i\mathbf{p}(\mathbf{u}))\bigr) - I_r\bigl(\pi(\mathbf{p}(\mathbf{u}))\bigr) \right) + \lambda \,\text{TV}(\mathbf{u})$$

Here $\mathbf{u}$ is the inverse depth map, $\mathbf{p}(\mathbf{u})$ the 3D point obtained by back-projecting $\mathbf{u}$, $K$ the camera intrinsic matrix, $T_i$ the rigid body transform of frame $i$ relative to the reference frame, $\pi$ the perspective projection, $\rho$ the Huber loss, and $\text{TV}(\mathbf{u}) = \|\nabla \mathbf{u}\|_1$ the total variation regularizer. Running this optimization in real time requires a GPU. DTAM did not hide that premise. It ran on a single Nvidia GTX 480 (the commodity system configuration described in the paper's §3).
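The two ingredients of the energy as written above, a robust data term and a TV regularizer, can be sketched as follows. This is only an illustration of the formula's shape under stated assumptions: the photometric residuals are taken as precomputed, TV is discretized with anisotropic forward differences, and `delta` and `lam` are hypothetical parameter names. It is not DTAM's actual solver.

```python
import numpy as np

def huber(r, delta):
    """Huber loss: quadratic near zero, linear in the tails."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))

def total_variation_l1(u):
    """Anisotropic TV: sum of absolute forward differences of the inverse depth map."""
    return np.abs(np.diff(u, axis=1)).sum() + np.abs(np.diff(u, axis=0)).sum()

def energy(residuals, u, lam, delta=0.1):
    """Robust photometric data term plus lambda-weighted smoothness term."""
    return huber(residuals, delta).sum() + lam * total_variation_l1(u)

# Assumed toy maps: a flat inverse depth map and the same map with one spike.
flat = np.full((4, 4), 0.5)
spiky = flat.copy()
spiky[1, 1] += 0.2
tv_flat = total_variation_l1(flat)     # no variation at all
tv_spiky = total_variation_l1(spiky)   # the spike is penalized on all four sides
losses = huber(np.array([0.05, 0.5]), 0.1)
```

With identical data terms, the regularizer alone makes the spiky map more expensive, which is how the smooth dense model is favored where the photometric evidence is weak.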

> 🔗 **Borrowed.** DTAM's dense volumetric approach took partial inspiration from depth-camera work, in particular the TSDF idea of [Curless & Levoy 1996](https://doi.org/10.1145/237170.237269), but the key difference was its application to a monocular camera. A reverse current then followed: [KinectFusion](https://doi.org/10.1109/ISMAR.2011.6092378) (2011, ISMAR), led by Newcombe himself, completed the same idea in depth-sensor form.

Video of an entire indoor scene being reconstructed in real time went up on YouTube right after the 2011 ICCV talk and racked up tens of thousands of views. The weaknesses were equally plain. It did not run without a GPU and it was fragile under lighting changes. Scaling it to large outdoor environments was out of reach.

---

## 2. Tracking the Edges: LSD-SLAM

[Engel, Schöps & Cremers 2014. LSD-SLAM](https://doi.org/10.1007/978-3-319-10605-2_54) gave up DTAM's dense formulation and dropped the GPU dependency along with it. "Large-Scale Direct Monocular SLAM" works in semi-dense mode, tracking only those pixels whose gradient magnitude exceeds a threshold. Flat regions of a wall are ignored; only pixels near edges with sufficient gradient survive. No corner detector is used—the strength of the intensity gradient is the sole criterion for pixel selection.
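The semi-dense selection rule can be sketched in a few lines. This is a minimal illustration assuming a grayscale image stored as a 2D array and central-difference gradients; the function name and the threshold value are hypothetical and not LSD-SLAM's own.

```python
import numpy as np

def select_high_gradient_pixels(image, threshold):
    """Return a boolean mask of pixels whose gradient magnitude exceeds threshold."""
    gy, gx = np.gradient(image.astype(np.float64))   # derivatives along rows, columns
    magnitude = np.hypot(gx, gy)
    return magnitude > threshold

# Assumed toy image: dark left half, bright right half, i.e. one vertical edge.
image = np.zeros((6, 8))
image[:, 4:] = 100.0
mask = select_high_gradient_pixels(image, threshold=10.0)
num_selected = int(mask.sum())   # only the two columns straddling the edge survive
```

The flat halves of the image contribute nothing, which is the sense in which a textureless wall is simply ignored.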

The tracking stage is a direct image alignment in SE(3). The current frame is warped directly onto a keyframe and the photometric residual is minimized by Gauss-Newton. The map is keyframe-based, and each keyframe carries its own semi-dense depth map. Keyframe connections are maintained as a pose graph, and loop closure finds candidates by appearance-based relocalization and then verifies them with a depth consistency check.

> 🔗 **Borrowed.** Gauss-Newton photometric registration is a classic of the image alignment field. The [Lucas & Kanade 1981](https://www.ijcai.org/Proceedings/81-2/Papers/017.pdf) tracker and its inverse compositional reformulation ([Baker & Matthews 2004](https://doi.org/10.1023/B:VISI.0000011205.11775.fd)) are the direct ancestors of LSD-SLAM's frontend. The Cremers group transplanted the language of the variational image-processing community into the entire SLAM pipeline.
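The Gauss-Newton photometric alignment that this lineage shares can be shown on a one-dimensional toy problem. This is a sketch under strong simplifications: a 1D signal instead of an image, a single translation parameter instead of an SE(3) pose, and linear interpolation for the warp. The function name and signal are hypothetical.

```python
import numpy as np

def align_1d(image, template, x, d0=0.0, iters=20):
    """Estimate the shift d that minimizes sum((image(x + d) - template(x))**2)."""
    grad = np.gradient(image)                  # image derivative on the sample grid
    d = d0
    for _ in range(iters):
        warped = np.interp(x + d, x, image)    # image sampled at the warped positions
        J = np.interp(x + d, x, grad)          # Jacobian of the residual w.r.t. d
        r = warped - template                  # photometric residual
        d -= (J @ r) / (J @ J)                 # Gauss-Newton update
    return d

# Assumed toy signal: a Gaussian bump, with the template shifted by 3 samples.
x = np.arange(100.0)
bump = lambda s: np.exp(-(s - 50.0) ** 2 / (2.0 * 8.0 ** 2))
image = bump(x)
template = bump(x + 3.0)
d_est = align_1d(image, template, x)
```

No correspondence search appears anywhere: the intensity difference itself drives the update, and the method only works while the initial guess lies inside the basin set by the signal's smoothness.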

The practical significance of LSD-SLAM was that it ran in real time on a CPU. The structure—keyframes only, pose graph optimization—resembled PTAM's tracking/mapping split on the surface, but underneath it was different. There are no binary descriptors like ORB or BRIEF; pixel intensity is the only measurement.

LSD-SLAM also released footage of operation in large outdoor environments. A demo in which a semi-dense map was built while riding a bicycle for tens of meters showed that the direct approach could scale. On the KITTI benchmark it was competitive with the top-tier feature-based methods of the time.

Lighting changes, however, were the problem. Entering a tunnel, backlight through a window, a sudden flash: because the method assumed photometric consistency, scenes like these destabilized the system immediately.

---

## 3. Sparse Direct Perfected: DSO

[Engel, Koltun & Cremers 2018. DSO (PAMI)](https://doi.org/10.1109/TPAMI.2017.2658577) first appeared on arXiv in 2016. "Direct Sparse Odometry" carries its positioning in the name. It is sparser than LSD-SLAM's semi-dense maps and uses far fewer pixels than DTAM, and in exchange it handles photometric calibration thoroughly.

The system keeps roughly 2,000 high-gradient points active across its window of keyframes. That is more than ORB-SLAM2's default setting (nFeatures=1000), and far fewer than LSD-SLAM's semi-dense set (all pixels with gradient). On these pixels it performs a sliding window bundle adjustment in which the optimization variables include camera pose, inverse depth, and affine brightness parameters $(a_i, b_i)$. Frames that fall out of the window are removed by marginalization, so the problem size stays bounded, and the Schur complement eliminates the inverse-depth block so that each solve scales roughly linearly, O(N), in the number of points $N$.
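The Schur-complement step can be sketched as follows. This is a minimal illustration, assuming the inverse-depth block of the Hessian is diagonal (one inverse depth per point) and using a small random system in place of a real photometric one; the variable names and sizes are hypothetical.

```python
import numpy as np

def solve_with_schur(H_cc, H_cd, h_dd_diag, b_c, b_d):
    """Solve [[H_cc, H_cd], [H_cd^T, diag(h_dd)]] [dx_c, dx_d] = [b_c, b_d]."""
    inv_dd = 1.0 / h_dd_diag                   # a diagonal block inverts elementwise
    S = H_cc - (H_cd * inv_dd) @ H_cd.T        # reduced system on camera variables
    g = b_c - (H_cd * inv_dd) @ b_d
    dx_c = np.linalg.solve(S, g)               # small dense solve
    dx_d = inv_dd * (b_d - H_cd.T @ dx_c)      # back-substitution, linear in points
    return dx_c, dx_d

# Assumed toy problem: 6 camera parameters and 50 inverse depths.
rng = np.random.default_rng(0)
n_cam, n_pts = 6, 50
H_cd = rng.normal(size=(n_cam, n_pts))
h_dd = rng.uniform(1.0, 2.0, size=n_pts)
H_cc = (H_cd / h_dd) @ H_cd.T + np.eye(n_cam)  # keeps the full system positive definite
b_c, b_d = rng.normal(size=n_cam), rng.normal(size=n_pts)

dx_c, dx_d = solve_with_schur(H_cc, H_cd, h_dd, b_c, b_d)

# Reference: the same system solved densely, without exploiting the structure.
H_full = np.block([[H_cc, H_cd], [H_cd.T, np.diag(h_dd)]])
dx_full = np.linalg.solve(H_full, np.concatenate([b_c, b_d]))
```

The only dense solve is on the camera-sized system `S`; everything that touches the points is an elementwise or matrix-vector operation, which is where the linear scaling in the number of points comes from.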

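DSO separates the camera's photometric model into three layers. First, vignetting (the falloff in brightness toward the edges of the lens) is corrected through prior calibration. Second, the camera response function (gamma curve, the sensor's non-linear recording of light) is also inverted in advance to convert measurements into a linear intensity domain. Third, the per-frame exposure time $t_i$ and the affine brightness change $(a_i, b_i)$ enter the residual at run time, with the affine parameters estimated as optimization variables. For a point $\mathbf{p}$ hosted in frame $i$ and observed in frame $j$, the photometric error is

$$E_{\mathbf{p}j} = \sum_{\mathbf{q} \in \mathcal{N}_{\mathbf{p}}} w_{\mathbf{q}} \left\| \bigl(I_j[\mathbf{q}'] - b_j\bigr) - \frac{t_j e^{a_j}}{t_i e^{a_i}} \bigl(I_i[\mathbf{q}] - b_i\bigr) \right\|_\gamma$$

Here $\mathcal{N}_{\mathbf{p}}$ is the small pixel pattern around $\mathbf{p}$, $\mathbf{q}'$ the reprojection of $\mathbf{q}$ into frame $j$, $w_{\mathbf{q}}$ a gradient-dependent weight, $t_i, t_j$ are exposure times, $(a_i, b_i)$ and $(a_j, b_j)$ the affine brightness parameters of each frame (gain and bias), and $\|\cdot\|_\gamma$ the Huber loss. Vignetting is corrected in the preprocessing step through photometric calibration, and the residual above is applied to the corrected intensities. Splitting the camera's exposure variation, vignetting, and response curve between a separate calibration stage and real-time optimization variables was something direct SLAM saw for the first time in DSO.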
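The effect of the exposure and affine brightness terms can be checked on a single pixel. This is a minimal sketch, assuming intensities that are already vignetting-corrected and linearized, and a residual of the form `(I_j - b_j) - (t_j e^{a_j}) / (t_i e^{a_i}) * (I_i - b_i)` as in the DSO paper; the function names and numeric values are hypothetical.

```python
import math

def huber(r, k):
    """Huber loss with threshold k."""
    a = abs(r)
    return 0.5 * r * r if a <= k else k * (a - 0.5 * k)

def photometric_residual(I_i, I_j, t_i, t_j, a_i, b_i, a_j, b_j):
    """Brightness-compensated difference between host frame i and target frame j."""
    scale = (t_j * math.exp(a_j)) / (t_i * math.exp(a_i))
    return (I_j - b_j) - scale * (I_i - b_i)

# Assumed toy case: the same surface seen with the exposure time doubled.
I_i, I_j = 50.0, 100.0
r_naive = I_j - I_i                       # plain brightness constancy sees a big error
r_compensated = photometric_residual(I_i, I_j, t_i=0.01, t_j=0.02,
                                     a_i=0.0, b_i=0.0, a_j=0.0, b_j=0.0)
cost_small, cost_large = huber(0.5, 1.0), huber(3.0, 1.0)
```

The naive residual would be read as a 50-level mismatch, while the compensated residual vanishes, so a pure exposure change no longer pulls on the pose or the depth.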

> 🔗 **Borrowed.** The formal basis of photometric camera calibration traces to the HDR-recovery work of [Debevec & Malik 1997](https://doi.org/10.1145/258734.258884). The photometric model they set up to recover a camera response function from multiple photographs was imported by DSO as real-time SLAM optimization variables.

On the TUM monocular dataset, DSO was reported to outperform ORB-SLAM2 across several sequences. In feature-poor environments (indoor corridors with large flat walls) DSO achieved a lower ATE than ORB-SLAM2. This was the empirical grounding for the claim that using photometric information is using more information.

> 📜 **Prediction vs. outcome.** DSO required prior photometric calibration, and that dependency soon became the target of follow-up work. One direction came in 2018 from Bergmann, Wang, and Cremers with [online photometric calibration](https://doi.org/10.1109/LRA.2017.2777002)—estimate exposure, response, and vignetting jointly during the SLAM run rather than in advance. Even so, the deployment barrier from the end-user perspective remained as of 2026. The process of reliably extracting photometric parameters from consumer cameras has not been fully automated and still requires per-camera presetting. `[in progress]`

> 📜 **Prediction vs. outcome.** DTAM was real-time dense SLAM that leaned on a single GPU, and widening access to dense reconstruction sat as the natural next task. The path to realization was not a straight line. Pure monocular dense did not arrive in real-time-deployable form until NeRF and 3DGS emerged in the 2020s. Instead, KinectFusion, led by Newcombe himself, completed GPU dense reconstruction using an RGB-D depth sensor right in 2011—routing around the problem by swapping the sensor. `[diverted]`

---

## 4. VI-DSO and the Lineage Extended

At ICRA 2018, von Stumberg, Usenko, and Cremers presented [VI-DSO](https://doi.org/10.1109/ICRA.2018.8462905), which combined DSO with an IMU. The motivation was simple. In the failure mode that hurt photometric direct methods most, rapid lighting change, inertial measurements from an IMU could support pose tracking. The IMU could also resolve the scale ambiguity of the monocular camera.

VI-DSO adds an IMU preintegration factor to DSO's windowed photometric bundle adjustment. The IMU preintegration scheme was borrowed from [Forster et al.'s 2017 paper](https://doi.org/10.1109/TRO.2016.2597321). The result: scale was recovered, and robustness improved under extreme lighting.

Follow-up work from the Cremers group, [Basalt](https://arxiv.org/abs/1904.06504) (2019) and [DM-VIO](https://doi.org/10.1109/LRA.2021.3140129) (2022), continued in the same direction. The structure is a direct photometric frontend with a tightly coupled inertial backend on top. This lineage proceeded in parallel with feature-based VIO (VINS-Mono, OpenVINS), and each formed its own ecosystem.

> 🔗 **Borrowed.** VI-DSO's IMU preintegration uses the manifold preintegration formulation of [Forster et al. 2017. On-Manifold Preintegration (IEEE TRO)](https://doi.org/10.1109/TRO.2016.2597321) directly. It is a stacked structure: Forster's inertial layer placed on top of DSO's photometric layer.

---

## 5. Limits of the Direct Method

Direct methods use, in principle, more information. They pull the pixels that feature detectors throw away—regions where the gradient is low but consistent—into tracking. The photometric residual gives a continuous optimization landscape without the discretization that descriptor matching imposes.

And yet, as of 2026, the majority of deployed systems are feature-based. The reasons sit in several layers.

First, the dependence on photometric calibration. The vignetting correction, response curve correction, and exposure control that DSO assumes are not simply available from a consumer camera. Smartphone cameras apply HDR fusion, auto-exposure, and real-time white balance internally, and that pipeline is not exposed to the user. DSO's photometric model breaks its basic assumptions on these cameras.

Second, lighting change. Under auto-exposure or backlight—situations where inter-frame brightness changes sharply—the direct method's core assumption of photometric consistency collapses immediately. DSO's affine brightness model can only absorb gradual drift, so scenes where a cloud passes outdoors or a fluorescent tube flickers indoors remained a leading cause of tracking failure.

Third, even where DSO beat ORB-SLAM2 on controlled-dataset sequences, engineers putting something on an actual robot picked ORB-SLAM. ORB-SLAM runs on many camera models without separate photometric calibration. Swap the camera and it still works. DSO required per-camera vignetting and response-curve calibration.

Fourth, learned features such as [SuperPoint](https://arxiv.org/abs/1712.07629) (2018) and [LightGlue](https://arxiv.org/abs/2306.13643) (2023) blunted the direct method's central critique that "features discard information." They preserve far more information than handcrafted descriptors did while keeping the practical advantages of descriptor matching. At the very point where direct was attacking feature-based, learned features filled the gap.

---

## 🧭 Still open

**Direct tracking under rapid lighting change.** The foundational premise of the direct method—that the brightness distribution of a scene is preserved across frames—collapses immediately under auto-exposure cameras, strong backlight, or tunnel-to-outdoor transitions. VI-DSO's IMU assistance eases this partially, but a full solution that dynamically estimates the lighting model itself does not yet exist. Learning-based photometric correction is being explored as an alternative, but has not arrived in a real-time-deployable form.

**The double weakness of textureless + direct.** Feature-based methods fail in front of a wall with no corners. Direct methods see the residual vanish on surfaces without gradient. Both approaches are weak in indoor corridors, large warehouses, and homogeneous outdoor terrain. Semi-dense LSD-SLAM hedged by selectively using pixels that do have gradient, but it did not solve the degeneracy that arises when those pixels are not distributed densely enough.

**A possible transition to a learned photometric model.** Current direct SLAM's photometric model is expressed as a simple affine brightness correction or a fixed camera response function. Work in the neural radiance field family is exploring how to represent scene appearance with a neural network. Whether this can enter the photometric layer of real-time direct SLAM, and if it does, where the boundary between direct and learned will be drawn, are open questions as of 2026.

Meanwhile, running parallel to the direct lineage, a different exit had already been opened in 2011. Newcombe himself showed it via KinectFusion. Rather than keeping the photometric assumptions of a monocular camera, change the sensor. An RGB-D camera, which measures depth directly, enabled dense reconstruction regardless of brightness change. Where the direct method tried to uphold photometric consistency through its equations, RGB-D struck the assumption itself off the list of questions.

---

# Ch.9 — Dense/RGB-D: From KinectFusion to BundleFusion

In October 2011, when Richard Newcombe (Imperial College London) presented KinectFusion at ISMAR, the audience's attention gathered around the demo video more than the paper. A single handheld Kinect sensor filled an entire room with a 3D mesh in real time. It was the thing Newcombe's own DTAM, presented the same year, had been dreaming of with a monocular camera, now actually achieved with an RGB-D sensor. The lineage is clear: the TSDF representation that Curless and Levoy devised for the graphics community in 1996, the ICP tracker that Besl and McKay handed to robotics in 1992, and the Kinect sensor that Microsoft released at $150 in 2010. Where those three strands crossed, the short and intense age of dense SLAM opened. The same framework that Davison's MonoSLAM (Ch.5) used to track sparse landmarks with a monocular camera (real-time tracking, CPU only, no GPU) now reached a different conclusion in front of Kinect's depth stream. Newcombe's DTAM (Ch.8) had opened the door on GPUs by attempting dense reconstruction through direct photometric optimization, and KinectFusion closed the same door with an RGB-D sensor.

---

## 9.1 Dense reconstruction before Kinect

Dense 3D reconstruction was possible before 2011. What was not possible was *real-time*.

Offline pipelines merged point clouds acquired by stereo or structured-light scanners, given enough time. Indoor scanning rigs cost hundreds of thousands of dollars. No one outside the lab used the technology. The SLAM community was already getting practical results from sparse landmarks, and dense reconstruction was filed away as a graphics problem.

[Curless and Levoy's 1996 SIGGRAPH paper, "A Volumetric Method for Building Complex Models from Range Images"](https://graphics.stanford.edu/papers/volrange/volrange.pdf), represents the graphics-side approach of this era. The central idea was the **TSDF (Truncated Signed Distance Function)**. Partition 3D space into a uniform voxel grid, and at each voxel accumulate the signed distance to the nearest surface. The sign convention: moving from the sensor toward the surface, the front of the surface (free space) is positive, and the back (inside solid) is negative. Truncated means clipping that value within a threshold $t$ in absolute value, giving the form $\text{TSDF}(x) = \text{clip}(d(x), -t, +t)$. Each incoming depth frame updates the value as a weighted average, so that noise averages out and the surface sharpens over time. Surface extraction applies marching cubes to the TSDF's zero-crossing.
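The truncation and running weighted average can be sketched along a single camera ray. This is a minimal illustration, assuming voxels indexed by their depth along the ray, a constant per-frame weight, and the common practice of skipping voxels that lie more than the truncation distance behind the surface; the names and values are hypothetical.

```python
import numpy as np

def integrate(tsdf, weight, voxel_depth, measured_depth, trunc, w_new=1.0):
    """Fuse one depth measurement into the voxels along a ray."""
    sdf = measured_depth - voxel_depth         # positive in front of the surface
    valid = sdf >= -trunc                      # voxels far behind the surface are unseen
    d = np.clip(sdf, -trunc, trunc)            # truncation
    w_sum = weight + w_new
    fused = (weight * tsdf + w_new * d) / w_sum
    return np.where(valid, fused, tsdf), np.where(valid, w_sum, weight)

# Assumed toy ray: five voxels, two noisy measurements of a surface near 1 m.
voxel_depth = np.array([0.5, 0.9, 1.0, 1.1, 1.5])
trunc = 0.2
tsdf = np.full(5, trunc)                       # unobserved voxels start as free space
weight = np.zeros(5)
for depth in (1.00, 1.02):
    tsdf, weight = integrate(tsdf, weight, voxel_depth, depth, trunc)
```

After the two updates the sign change sits between the voxels at 1.0 m and 1.1 m, and interpolating it linearly gives a surface at 1.01 m, the mean of the two measurements. The voxel at 1.5 m is never touched because it stays hidden behind the surface.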

The method was accurate. But the voxel grid ate memory, and real-time update was impossible on the hardware of the day. Curless-Levoy's paper stayed in graphics textbooks for the next fifteen years.

In those fifteen years two things changed. GPUs entered the GPGPU era, and Kinect appeared.

---

## 9.2 KinectFusion and TSDF

Microsoft released Kinect for the Xbox 360 at $150 in 2010. The sensor measured depth via structured light and streamed VGA-resolution depth maps at 30Hz. Precision was below research-grade ToF (Time-of-Flight) cameras, but the price was roughly one-hundredth. Hackers responded first. Within weeks of launch, open-source drivers appeared, and researchers followed.

Newcombe had by then moved to Microsoft Research Cambridge, where he was preparing a GPU-based dense SLAM with Shahram Izadi's team. By the time Kinect launched they already had the outline of a pipeline. The depth stream Kinect supplied filled in the rest. The result, presented at ISMAR 2011, was [Newcombe et al. 2011. KinectFusion](https://doi.org/10.1109/ISMAR.2011.6092378).

> 🔗 **Borrowed.** KinectFusion's core representation, the TSDF, was devised by Curless & Levoy (1996) for offline 3D scanning. Newcombe's team made it real-time via GPU parallel voxel updates.

The pipeline has four stages.

Depth preprocessing: denoise the raw depth map with a bilateral filter and compute surface normals.

ICP tracking: align the current frame's point cloud to the virtual surface ray-cast from the previous TSDF. A point-to-plane variant of [Besl & McKay (1992)](https://graphics.stanford.edu/courses/cs164-09-spring/Handouts/paper_icp.pdf)'s **ICP (Iterative Closest Point)** is iterated on the GPU over a coarse-to-fine pyramid, with every pixel's correspondence processed in parallel. The output is the camera's 6-DoF pose.

The point-to-plane ICP objective is as follows. Transform the current frame's point $\mathbf{p}_i$ by $T = (R, \mathbf{t})$ and take its correspondence $\hat{\mathbf{p}}_i$ (the ray-cast surface) with normal $\hat{\mathbf{n}}_i$; minimize

$$E(R, \mathbf{t}) = \sum_i \bigl(\hat{\mathbf{n}}_i^\top (R\,\mathbf{p}_i + \mathbf{t} - \hat{\mathbf{p}}_i)\bigr)^2$$

Unlike the original Besl-McKay point-to-point cost ($\|R\mathbf{p}_i + \mathbf{t} - \hat{\mathbf{p}}_i\|^2$), this measures only the error along the normal, so it is less sensitive to sliding along the surface. Applying the small-rotation approximation $R \approx I + [\boldsymbol{\omega}]_\times$, $E$ turns into a linear least-squares problem in the 6-DoF vector $(\boldsymbol{\omega}, \mathbf{t})$. The GPU accumulates the corresponding $6 \times 6$ normal equations by parallel reduction, and the small system is solved once per iteration.

> 🔗 **Borrowed.** KinectFusion's tracking stage inherits Besl & McKay (1992) ICP directly. A technique from the classical robotics literature, pulled back out at GPU density.
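One linearized point-to-plane step can be sketched as follows. This is a minimal illustration, assuming correspondences are already given and using random unit normals purely to build a synthetic test; real systems take the normals from the ray-cast surface and repeat the step several times. The function names are hypothetical.

```python
import numpy as np

def point_to_plane_step(p, p_hat, n_hat):
    """Solve for (omega, t) minimizing sum((n_hat . ((I + [omega]x) p + t - p_hat))**2)."""
    # n.(omega x p) = omega.(p x n), so each row of the Jacobian is [p x n, n].
    A = np.hstack([np.cross(p, n_hat), n_hat])          # shape (N, 6)
    b = np.einsum("ij,ij->i", n_hat, p_hat - p)         # shape (N,)
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x[:3], x[3:]

def rodrigues(w):
    """Rotation vector -> rotation matrix, used only to build the synthetic data."""
    theta = np.linalg.norm(w)
    W = np.array([[0.0, -w[2], w[1]], [w[2], 0.0, -w[0]], [-w[1], w[0], 0.0]])
    return (np.eye(3) + np.sin(theta) / theta * W
            + (1.0 - np.cos(theta)) / theta**2 * (W @ W))

# Assumed synthetic data: 200 points moved by a small known rigid motion.
rng = np.random.default_rng(0)
p = rng.uniform(-1.0, 1.0, size=(200, 3))
n_hat = rng.normal(size=(200, 3))
n_hat /= np.linalg.norm(n_hat, axis=1, keepdims=True)
omega_true = np.array([0.004, -0.006, 0.005])
t_true = np.array([0.01, 0.02, -0.005])
p_hat = p @ rodrigues(omega_true).T + t_true

omega, t = point_to_plane_step(p, p_hat, n_hat)
```

For a motion this small a single step already recovers the transform up to the second-order error of the small-rotation approximation, which is why frame-rate tracking with small inter-frame motion suits this linearization.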

TSDF integration: project the depth map into the voxel grid at the estimated pose and update the TSDF values. The paper's headline setting is a 512³ voxel grid covering a room-scale volume about 3m on a side (§4.2, Fig. 13).

Surface rendering: find the TSDF's zero-crossing by ray marching and render the mesh in real time. The result becomes the reference surface for the next ICP step.

Newcombe presented DTAM, monocular-camera dense SLAM, the same year. KinectFusion is its sister work. Where DTAM used the GPU to optimize monocular photometric consistency, KinectFusion threw the same GPU at depth integration. Hence the overlap in the two papers' author lists.

> 🔗 **Borrowed.** KinectFusion and DTAM are two dense systems presented the same year by the same researchers. The GPU dense pipeline philosophy of DTAM transferred naturally into KinectFusion; only the sensor differed.

The 512³ TSDF updated at 30Hz, and a single indoor room could be reconstructed as a dense mesh within minutes. Tracking drift was far smaller than that of feature-based methods, because ICP aligns each frame against the accumulated global surface rather than against the previous frame alone.

The limits were just as clear. A 512³ voxel grid handles only a fixed spatial extent. Leave the room and voxels saturate or overwrite each other. No loop closure. And Kinect's IR structured light did not work under sunlight. Outdoors was out of scope from the start.

---

## 9.3 Kintinuous — rolling volume

Right after KinectFusion appeared, Thomas Whelan, then at the National University of Ireland Maynooth and working with MIT collaborators, attacked the fixed-volume limit. If the fixed-size TSDF volume was the problem, move it with the camera.

In July 2012, at the RSS workshop (RGB-D: Advanced Reasoning with Depth Cameras, Sydney), [Whelan et al. presented Kintinuous](https://www.cs.cmu.edu/~kaess/pub/Whelan12rssw.pdf), which introduced a "rolling TSDF volume." As the camera approached the boundary of the volume, slices on the far side were extracted as mesh and released, while new slices were attached in front. Memory stayed constant while the camera could move indefinitely.
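The rolling-volume idea can be sketched with a plain array shift. This is a simplified illustration, assuming motion along one axis only and using `np.roll` in place of the cyclic GPU buffer a real system would use; the function name is hypothetical, and the returned slab stands in for the data that would be meshed before release.

```python
import numpy as np

def shift_volume(tsdf, n_slices, empty_value=1.0):
    """Advance the volume by n_slices along axis 0; return (new volume, released slab)."""
    if n_slices <= 0:
        return tsdf, tsdf[:0]
    leaving = tsdf[:n_slices].copy()           # slab falling out behind the camera
    shifted = np.roll(tsdf, -n_slices, axis=0)
    shifted[-n_slices:] = empty_value          # fresh, unobserved slab in front
    return shifted, leaving

# Assumed toy volume: 4 slices of 2 x 2 voxels with distinct values.
volume = np.arange(16, dtype=float).reshape(4, 2, 2)
shifted, leaving = shift_volume(volume, 1)
```

The array shape never changes, which is the constant-memory property the text describes, while the released slabs accumulate outside the volume as the map.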

The demo of walking down an entire indoor corridor showed what KinectFusion could not. But loop closure was still missing. When you walked a long corridor back to the origin, the mismatch between the two ends was unresolved. Reconstruction quality was also behind the submap alignment methods that sparse SLAM had accumulated.

---

## 9.4 ElasticFusion: Surfels and non-rigid deformation

Whelan changed direction after Kintinuous. Instead of TSDF voxels, he chose surfels.

A **surfel (surface element)** is a point carrying position, normal, radius, and color. In computer graphics, [Pfister et al. (2000)](https://www.merl.com/publications/docs/TR2000-10.pdf) proposed the concept as a rendering representation. Compared to a voxel grid, the structure is irregular and hugs the surface.

> 🔗 **Borrowed.** ElasticFusion's surfel representation carries Pfister et al.'s (2000) graphics rendering technique over as a SLAM map representation.

[Whelan et al. 2016. ElasticFusion](https://doi.org/10.1177/0278364916669237) has two core contributions. First, a surfel-based dense map. Second, loop closure via *non-rigid deformation*.

Loop closure in prior dense SLAM was hard. Editing a global mesh or voxel grid to match loop-closure information was expensive. ElasticFusion connected the surfel set to a deformation graph and, when a loop closure was detected, deformed the graph to distribute error across the whole map. It was non-rigid deformation applied to the surfel map itself.

Concretely, each node $g_k$ of the deformation graph carries a position $\mathbf{v}_k$, a rotation $R_k$, and a translation $\mathbf{t}_k$. A surfel $s$ lies within the influence of its $K$ nearest nodes, written $\mathcal{N}(s)$, and the surfel's deformed position is computed as

$$\tilde{\mathbf{p}}_s = \sum_{k \in \mathcal{N}(s)} w_k \bigl(R_k (\mathbf{p}_s - \mathbf{v}_k) + \mathbf{v}_k + \mathbf{t}_k\bigr)$$

(the weight $w_k$ is a distance-based falloff, normalized so that the weights of a surfel sum to one). When a loop closure constraint is added, the $(R_k, \mathbf{t}_k)$ of the graph nodes are optimized by Gauss-Newton to distribute the error globally. That is why the whole dense map could be corrected consistently without rebuilding the TSDF from scratch.
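Applying a deformation graph to one surfel can be sketched as follows. This is a minimal illustration of the blending formula only, assuming a Gaussian distance falloff for the weights (the exact falloff is an assumption here) and node transforms that are already optimized; the function name and values are hypothetical.

```python
import numpy as np

def deform_surfel(p, node_pos, node_R, node_t, k=4, sigma=0.5):
    """Blend the rigid transforms of the k nearest graph nodes at surfel position p."""
    dist = np.linalg.norm(node_pos - p, axis=1)
    nearest = np.argsort(dist)[:k]
    w = np.exp(-dist[nearest] ** 2 / (2.0 * sigma ** 2))
    w /= w.sum()                               # weights sum to one
    out = np.zeros(3)
    for w_k, idx in zip(w, nearest):
        out += w_k * (node_R[idx] @ (p - node_pos[idx]) + node_pos[idx] + node_t[idx])
    return out

# Assumed toy graph: five nodes that all carry the same pure translation.
node_pos = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
                     [1.0, 1.0, 0.0], [0.5, 0.5, 1.0]])
node_R = np.stack([np.eye(3)] * 5)
node_t = np.tile(np.array([0.1, 0.0, 0.0]), (5, 1))
p = np.array([0.2, 0.3, 0.0])
moved = deform_surfel(p, node_pos, node_R, node_t)

# A single node at the origin rotating 90 degrees about z.
Rz = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
rotated = deform_surfel(np.array([1.0, 0.0, 0.0]), np.zeros((1, 3)),
                        Rz[None], np.zeros((1, 3)), k=1)
```

When every node agrees, the surfel simply follows them; when neighbors disagree, the normalized weights interpolate smoothly, which is what lets a loop-closure correction spread over the map without tearing it.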

Measured not on KITTI or TUM RGB-D but on indoor reconstruction quality itself, ElasticFusion was state of the art at the time. On the ICL-NUIM synthetic dataset, sequences kt0, kt1, and kt2 recorded ATE RMSE of at most 1.4cm, with kt0 and kt1 at 0.9cm (kt3, where the global loop closure fires, was an exceptionally large outlier). No prior system had reached that level while staying real-time.

---

## 9.5 BundleFusion: offline-SfM quality, online

In 2017, Dai, Nießner, Zollhöfer, Izadi, and Theobalt published [Dai et al. 2017. BundleFusion](https://doi.org/10.1145/3072959.3054739) in ACM Transactions on Graphics, approaching the problem from a different direction. Where the KinectFusion line tried to raise quality without compromising real-time, BundleFusion's goal was to throw as much GPU compute as possible at the problem and run SfM-grade bundle adjustment even inside an online system.

The central idea is hierarchical optimization. At the fastest layer, dense depth alignment between the current and previous frame sets an initial pose. The layer above corrects it with sparse frame-to-frame alignment using SIFT features, and at the third layer a sliding-window global bundle adjustment re-optimizes the poses of the accumulated frames. As frames accumulate, bundle adjustment re-estimates past poses as well. Called "retroactive pose correction," this approach aimed to reach online something close to what an offline SfM pipeline achieves by aligning data after collecting it all. Frames whose poses change are removed from the TSDF and re-integrated at their corrected poses, so tracking errors do not stay baked into the map.

The numbers Dai's team reported on TUM RGB-D beat ElasticFusion. Visual reconstruction quality came close to the offline COLMAP pipeline by the standards of the day.

> 📜 **Prediction vs. outcome.** BundleFusion claimed real-time online global bundle adjustment at "unprecedented speed," presenting a route to raise offline SfM quality online. GPU compute kept climbing after that, but attention moved from dense SLAM to NeRF (Neural Radiance Field). For high-quality indoor reconstruction the de facto standard since 2021 became the COLMAP + NeRF pipeline. The route BundleFusion tried to open was routed around through different technology. `[diverted]`

---

## 9.6 Co-evolution of hardware and algorithm

One way to read the six years from KinectFusion to BundleFusion is as algorithmic progress. A more accurate way is as a process in which hardware and algorithms pushed each other.

The first-generation Kinect used structured light. Depth precision was a few millimeters in the meter range, but the IR pattern was not picked up under sunlight. The Kinect 2, released in 2013, switched to ToF. Precision went up and dynamic range improved. Intel's RealSense series followed. As sensor options grew, the depth quality an algorithm could assume changed, and researchers experimented with either exploiting smaller noise or tolerating larger noise.

On the GPU side, the CUDA ecosystem matured. Between the Fermi-generation GPUs available to KinectFusion in 2011 and the Pascal architecture at the time of BundleFusion in 2017, floating-point throughput grew more than tenfold. That Whelan in ElasticFusion and Dai in BundleFusion could run ever-heavier optimization in real time was not the algorithm's achievement alone.

Had Kinect been $15,000 rather than $150, this flow would have started five years later. A consumer-market sensor set the pace of the research.

> 📜 **Prediction vs. outcome.** The limits KinectFusion exposed in 2011 with its fixed 512³ volume — spatial extent, drift, outdoor unsuitability — became the roadmap of the following research. Volume extension was attacked in turn by Kintinuous, ElasticFusion, and BundleFusion. The outdoor question reached a different conclusion. IR structured light does not pick up a pattern under sunlight. RGB-D-based dense SLAM stayed bound to indoors, and outdoor was taken over by LiDAR. `[diverted]`

---

## 9.7 The exit of dense-only

Between 2011 and 2017, dense RGB-D SLAM looked like it would become the main direction of Visual SLAM. The actual unfolding did not go that way.

Sparse backends kept dominating. The post-2015 practical SLAM systems represented by [ORB-SLAM2](https://arxiv.org/abs/1610.06475) and [VINS-Mono](https://arxiv.org/abs/1708.03852) did not take the dense map as their default. The reasons compounded. A 512³ TSDF holds about 134 million voxels, so at 4 bytes per voxel it needs 512 MiB before any color is stored, which is hard to afford on mobile or embedded platforms. Octree or hash-map variants ([Voxblox](https://arxiv.org/abs/1611.03631), [OctoMap](https://www.hrl.uni-bonn.de/papers/wurm10octomap.pdf)) tried to ease this, but the gap against sparse efficiency remained. Real-time dense processing presupposed a GPU, and running a KinectFusion-grade pipeline on an autonomous vehicle's embedded processor or a drone's lightweight platform was hard. Kinect's IR depth not working outdoors also held things back: the fields where commercialization moved fastest, such as autonomous driving and drones, operate mostly outdoors.
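The memory figure can be checked with a few lines of arithmetic. The sketch below assumes 4 bytes per voxel (for example a 16-bit truncated distance plus a 16-bit weight); that layout and the function name `dense_tsdf_bytes` are assumptions for illustration, and systems that also store color need more.

```python
def dense_tsdf_bytes(side: int, bytes_per_voxel: int = 4) -> int:
    """Memory of a dense cubic voxel grid. bytes_per_voxel is an assumed layout."""
    return side ** 3 * bytes_per_voxel


for side in (256, 512, 1024):
    mib = dense_tsdf_bytes(side) / 2 ** 20
    print(f"{side}^3 voxels at 4 B/voxel: {mib:.0f} MiB")
```

Doubling the side length multiplies the footprint by eight, which is why fixed dense grids run out of room quickly and why the hashed and tree-structured variants focus on storing only voxels near a surface.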

In that same span, the lineage of dense map data structures themselves scattered KinectFusion's 512³ fixed volume in several directions. [Museth's VDB (2013)](https://doi.org/10.1145/2487228.2487235) proposed a structure combining block hashing with an internal tree so that sparse regions are left empty and only the neighborhood of the surface is refined hierarchically; released as OpenVDB, it became the backbone of autonomous-driving dense maps today (connecting to Ch.17 LiDAR's nvblox line). [Reijgwart et al. (2023)'s wavemap](https://arxiv.org/abs/2306.08125) compressed occupancy with a wavelet transform to re-tune the resolution-memory trade-off. Another line, led by Ramos and Ott, moved the representation to a continuous function altogether. [O'Callaghan and Ramos (2012)'s GPOM (Gaussian Process Occupancy Map)](https://doi.org/10.1177/0278364911435991) linked depth measurements through Gaussian Process regression to fill even unmeasured voxels probabilistically, and [Ramos and Ott (2016)'s Hilbert Map](https://doi.org/10.1177/0278364916684382) learned Hilbert-space features with logistic regression to provide streamable probabilistic occupancy. [Behley and Stachniss (2018)'s SuMa](https://www.ipb.uni-bonn.de/wp-content/papercite-data/pdf/behley2018rss.pdf) took the surfel representation ElasticFusion used for indoor RGB-D out to outdoor LiDAR and built a surfel-based SLAM that worked on KITTI (→ Ch.17). Where KinectFusion stopped at a single room, these lines advanced outward: outdoor, city-scale, and into probabilistic uncertainty.

Around 2020, when NeRF appeared, demand for high-quality dense reconstruction shifted to NeRF and 3D Gaussian Splatting. RGB-D SLAM, inside a structure separating localization and mapping, narrowed to using depth as an auxiliary tracking cue.

The dense era was short, but traces remained. The TSDF representation carried into autonomous-driving occupancy maps, and ICP became the standard tracking tool in LiDAR SLAM. The approach retreated, but the parts scattered into other systems.

---

## 🧭 Still open

Large-scale outdoor dense reconstruction. The sunlight weakness of IR structured light is a general problem of active depth sensors. LiDAR handles longer range but is poor on color and fine surface detail. As of 2026, there is still no way to densely process outdoor large-scale environments with RGB-D. Stereo depth estimation is advancing quickly on learned models, and some research is exploring alternatives, but the limits in dark regions, reflective surfaces, and long range are not resolved.

Dense reconstruction of dynamic scenes. Every system from KinectFusion to BundleFusion was designed for static scenes. Densely reconstructing a space where people walk around requires separating dynamic objects, and that requires combining real-time semantic segmentation with dense SLAM. [DynaSLAM](https://arxiv.org/abs/1806.05620) and [MaskFusion](https://arxiv.org/abs/1804.09194) tried, but neither compute cost nor robustness reached practical deployment.

Memory efficiency of the TSDF family. Voxblox's hash structure and OctoMap's octree compression reduced the memory cost of the voxel grid. Yet dense representation at the building-floor or city-block scale still runs to tens of gigabytes. An adaptive-resolution dense map that automatically decides which resolution to keep in which region still has no general solution. Implicit neural representations such as [Instant-NGP](https://arxiv.org/abs/2201.05989) are approaching this problem, but real-time update and query speed still trade off.

While dense SLAM was filling one indoor room with mesh, the problem of returning to that room was being handled in a separate lineage that had been running since before KinectFusion existed. Place recognition — the question of "have we seen this place before?" — had been developing at Oxford since 2003, parallel to and independent of the dense mapping track. KinectFusion had no loop closure. The researchers who had already been working on that question were not building denser maps; they were asking a different one.

---

# Ch.10 — The Parallel Line of Place Recognition: From FAB-MAP to NetVLAD, and on to AnyLoc

Around 2003, while Davison was proving real-time 3D tracking with a single webcam, Mark Cummins and Paul Newman at the Oxford Mobile Robotics Group were holding onto a different question. "How does a robot recognize a place it has visited before?" As long as visual odometry (VO) suffered from accumulated drift, no SLAM system could close the loop without an answer to that question. Place recognition developed through the 2000s in parallel with the other components of Visual SLAM, but along a lineage of its own. FAB-MAP transplanted Josef Sivic's bag-of-words (BoW) idea into robot space, DBoW2 made it practical, and NetVLAD broke through with learning. In 2023 AnyLoc pulled in features from a foundation model as-is.

While Ch.7 closed out the feature-based lineage with the ORB-SLAM trilogy, Ch.8 traced the direct lineage through DSO, and Ch.9 followed the arc of dense mapping from KinectFusion to BundleFusion, place recognition drew a line unlike any of them. It was neither tracking nor mapping, and derived from neither — an independent problem. Even so, all three lineages were incomplete without loop closure, and the "where have we seen this" judgment behind loop closure was supplied by place recognition.

---

## 10.1 Place recognition before BoW

To close a loop without GPS, whether indoors, in a tunnel, or in an urban canyon, a robot has to find, among thousands of candidate images, the one most similar to the current observation, and do so quickly. Pixel-level comparison is a linear scan, O(N), and once the image count passes tens of thousands, real time is out of reach.

In early-2000s computer vision, the first to touch this problem were Sivic and Zisserman. Published at ICCV 2003, ["Video Google"](https://www.robots.ox.ac.uk/~vgg/publications/2003/Sivic03/sivic03.pdf) applied TF-IDF from document retrieval to images. SIFT descriptors were clustered with k-means to form "visual words," and each image was represented as a frequency vector over those words. Retrieval through an inverted index touched only the images that share a word with the query, so in practice the cost stopped growing linearly with the size of the database. Place recognition researchers picked up the idea right away.
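A minimal sketch of the retrieval idea, assuming images have already been converted to lists of integer visual-word IDs. The function names, the toy database, and the unnormalized dot-product score are illustrative choices, not the scoring used in "Video Google" itself.

```python
import math
from collections import Counter, defaultdict


def build_index(db):
    """db: {image_id: [visual word ids]}. Returns (inverted index, idf table)."""
    inverted = defaultdict(set)
    for image_id, words in db.items():
        for w in set(words):
            inverted[w].add(image_id)
    n_images = len(db)
    idf = {w: math.log(n_images / len(ids)) for w, ids in inverted.items()}
    return inverted, idf


def tfidf(words, idf):
    """Term frequency times inverse document frequency; unknown words weigh 0."""
    counts = Counter(words)
    total = len(words)
    return {w: (c / total) * idf.get(w, 0.0) for w, c in counts.items()}


def query(words, db, inverted, idf):
    """Score only the images that share at least one word with the query."""
    q = tfidf(words, idf)
    candidates = set()
    for w in q:
        candidates |= inverted.get(w, set())
    scores = {}
    for image_id in candidates:
        v = tfidf(db[image_id], idf)
        scores[image_id] = sum(q[w] * v.get(w, 0.0) for w in q)
    return sorted(scores.items(), key=lambda kv: -kv[1])


# Toy database: three places described by hypothetical word IDs.
db = {"corridor": [1, 1, 2, 3], "kitchen": [4, 5, 5, 6], "lobby": [2, 3, 7, 7]}
inverted, idf = build_index(db)
print(query([4, 5, 9], db, inverted, idf))
```

Only `kitchen` is scored for this query, because the inverted index never visits images that share no word with it. That skip is where the speed-up over a linear scan comes from.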

---

## 10.2 FAB-MAP — probabilistic BoW and the Chow-Liu tree (2008)

Mark Cummins and Paul Newman, at the Oxford Mobile Robotics Group, published [Cummins & Newman. FAB-MAP: Probabilistic Localization and Mapping in the Space of Appearance](https://doi.org/10.1177/0278364908090961) in 2008.

The core question behind FAB-MAP (**Fast Appearance-Based Mapping**) is: "Is this scene a place already in the database, or somewhere entirely new?" A plain similarity score cannot settle that. If dozens of corridors all look alike, the highest similarity score does not guarantee the right answer.

Cummins and Newman framed this as a Bayesian inference problem. Given an observation $z_t$ (the set of visual-word occurrences), they computed the probability that the current location is each database place $\ell_i$:

$$P(\ell_i \mid z_t) \propto P(z_t \mid \ell_i) P(\ell_i)$$

The hard part is $P(z_t \mid \ell_i)$. Assuming visual words are independent gives a naïve Bayes model, but in practice visual words are correlated. If the word "door" appears, the word "doorknob" tends to appear along with it. The independence assumption distorts the probability.

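FAB-MAP modeled this correlation with a **Chow-Liu tree**. A Chow-Liu tree is the tree-structured graphical model that best approximates the joint distribution of the words in the Kullback-Leibler sense; it is obtained by keeping the edges with the highest pairwise mutual information. The mutual information between two words $e_i, e_j$ is defined as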

$$I(e_i; e_j) = \sum_{e_i, e_j} P(e_i, e_j) \log \frac{P(e_i, e_j)}{P(e_i)P(e_j)}$$

and the Chow-Liu algorithm uses this as an edge weight to build a maximum spanning tree. Factorizing the joint likelihood through this tree gives

$$P(z_t \mid \ell_i) = \prod_k P(z_t^k \mid z_t^{\text{pa}(k)}, \ell_i)$$

where $z_t^k \in \{0,1\}$ is the occurrence of the $k$-th word and $\text{pa}(k)$ is its parent in the tree. Compared with naïve Bayes (independence), this reflects co-occurrence patterns across words, and so lowers false positives in places that look visually alike, such as corridors. During training, the vocabulary and the tree are learned together from a large image set.
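The tree construction can be sketched in a few lines. The version below assumes binary word-occurrence rows as training data, estimates mutual information from raw counts without smoothing, and grows the maximum spanning tree with Prim's algorithm; the function names and toy data are illustrative, and FAB-MAP's actual training pipeline is more involved.

```python
import math
from itertools import product


def mutual_information(data, i, j):
    """Empirical I(e_i; e_j) for binary columns i and j of data."""
    n = len(data)
    mi = 0.0
    for a, b in product((0, 1), repeat=2):
        p_ab = sum(1 for row in data if row[i] == a and row[j] == b) / n
        p_a = sum(1 for row in data if row[i] == a) / n
        p_b = sum(1 for row in data if row[j] == b) / n
        if p_ab > 0:  # zero-probability cells contribute nothing
            mi += p_ab * math.log(p_ab / (p_a * p_b))
    return mi


def chow_liu_parents(data, root=0):
    """Maximum spanning tree over pairwise MI (Prim). Returns {word: parent}."""
    n_words = len(data[0])
    parent = {root: None}
    while len(parent) < n_words:
        best = max(
            ((mutual_information(data, u, v), u, v)
             for u in parent for v in range(n_words) if v not in parent),
            key=lambda t: t[0],
        )
        parent[best[2]] = best[1]
    return parent


# Columns: 0 = "door", 1 = "doorknob" (always co-occurring), 2 = "window".
data = [[1, 1, 0], [1, 1, 1], [0, 0, 0], [0, 0, 1], [1, 1, 0], [0, 0, 1]]
print(chow_liu_parents(data))
```

In the toy data "doorknob" attaches directly to "door", so the factorized likelihood conditions one on the other instead of counting the same evidence twice.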

FAB-MAP also handled, explicitly, the possibility that the current location is a new place not in the database. Adding the "new place" hypothesis cut false positives. In loop closure, a false positive leads to catastrophic failure. That was the practical heart of the matter.
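The effect of the "new place" hypothesis can be seen in a simplified Bayesian update. This sketch uses a naive Bayes likelihood instead of the Chow-Liu factorization, and models an unseen place with background word frequencies; the function name, the prior `prior_new`, and all probabilities are made-up illustrative values, not FAB-MAP's actual observation model.

```python
import math


def place_posterior(obs, places, background, prior_new=0.1):
    """Posterior over known places plus a 'new place' hypothesis.

    obs:        list of 0/1 word occurrences
    places:     {name: [P(word_k = 1 | place)]}, probabilities strictly in (0, 1)
    background: [P(word_k = 1)] used as the model of an unseen place
    """
    def log_lik(probs):
        return sum(math.log(p if z else 1.0 - p) for z, p in zip(obs, probs))

    prior_known = (1.0 - prior_new) / len(places)
    log_post = {name: log_lik(p) + math.log(prior_known)
                for name, p in places.items()}
    log_post["<new place>"] = log_lik(background) + math.log(prior_new)

    # Normalize in a numerically stable way.
    m = max(log_post.values())
    unnorm = {k: math.exp(v - m) for k, v in log_post.items()}
    total = sum(unnorm.values())
    return {k: v / total for k, v in unnorm.items()}


places = {"corridor A": [0.9, 0.8, 0.1], "kitchen": [0.1, 0.2, 0.9]}
background = [0.3, 0.3, 0.3]
print(place_posterior([1, 1, 0], places, background))  # matches corridor A
print(place_posterior([0, 0, 0], places, background))  # fits no known place well
```

Without the extra hypothesis, the second observation would be forced onto whichever known place fits least badly. With it, the probability mass can go to "somewhere new" and no loop closure is proposed.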

> 🔗 **Borrowed.** FAB-MAP's visual-word approach was transplanted directly from Sivic & Zisserman's "Video Google" (2003). The inverted-index logic of document retrieval was applied to a robot's memory of places.

In 2011 Cummins and Newman published [FAB-MAP 2.0](https://www.robots.ox.ac.uk/~mjc/Papers/cummins_newman_ijrr_fabmap2_2010_preprint.pdf). The goal was to push the processable map scale to around 1,000 km. They showed experimentally that it ran on a city-scale dataset.

---

## 10.3 DBoW2 — binary descriptors and the vocabulary tree (2012)

FAB-MAP was built on floating-point descriptors like SIFT. Around 2012 the SLAM community was moving toward faster binary descriptors, in particular BRIEF, ORB, and BRISK. Keeping a SIFT vocabulary as-is was a compute-cost problem.

In 2012 Dorian Gálvez-López and Juan D. Tardós (Universidad de Zaragoza) published [Gálvez-López & Tardós. Bags of Binary Words for Fast Place Recognition in Image Sequences](https://doi.org/10.1109/TRO.2012.2197158). **DBoW2** is a vocabulary tree that uses binary descriptors, with Hamming-distance comparisons that made word assignment tens of times faster than SIFT.

DBoW2's structure is a vocabulary tree built by hierarchical k-means. The BoW vector representing an image is a TF-IDF–weighted binary-word frequency vector. Each leaf node $w_i$ of a tree with branching factor $k$ and depth $d$ carries the TF-IDF weight

$$\eta_i = \frac{n_i}{n} \cdot \log \frac{N}{N_i}$$

where $n_i$ is the word count of $w_i$ in the image, $n$ is the total word count, $N$ is the number of database images, and $N_i$ is the number of images containing $w_i$. The similarity between two images $a$, $b$ is given by the L1-norm

$$s(\mathbf{v}_a, \mathbf{v}_b) = 1 - \frac{1}{2} \left\| \frac{\mathbf{v}_a}{|\mathbf{v}_a|} - \frac{\mathbf{v}_b}{|\mathbf{v}_b|} \right\|_1$$

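Assigning a descriptor to a word means descending the tree, which costs about $k \cdot d$ Hamming comparisons and is therefore logarithmic in the vocabulary size. An inverted index then limits scoring to the database images that share at least one word with the query.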
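The weight and the score above translate directly into code. This is a minimal sketch that assumes descriptors have already been assigned to word IDs; the function names, the document-frequency table `images_with_word`, and the numbers are hypothetical, and the DBoW2 library's own API is different.

```python
import math
from collections import Counter


def bow_vector(words, n_db_images, images_with_word):
    """TF-IDF vector {word: eta_i} with eta_i = (n_i / n) * log(N / N_i)."""
    counts = Counter(words)
    n = len(words)
    return {w: (c / n) * math.log(n_db_images / images_with_word[w])
            for w, c in counts.items()}


def l1_score(va, vb):
    """s = 1 - 0.5 * || va/|va| - vb/|vb| ||_1, which lies in [0, 1]."""
    na = sum(abs(x) for x in va.values())
    nb = sum(abs(x) for x in vb.values())
    keys = set(va) | set(vb)
    diff = sum(abs(va.get(k, 0.0) / na - vb.get(k, 0.0) / nb) for k in keys)
    return 1.0 - 0.5 * diff


# Hypothetical document frequencies N_i for a database of N = 100 images.
images_with_word = {1: 10, 2: 50, 3: 5}
a = bow_vector([1, 1, 2], 100, images_with_word)
b = bow_vector([1, 2, 2], 100, images_with_word)
c = bow_vector([3, 3], 100, images_with_word)
print(l1_score(a, a), l1_score(a, b), l1_score(a, c))
```

Identical vectors score 1 and vectors with no word in common score 0, so a loop-closure threshold can be read as a fraction of shared, weighted vocabulary.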

> 🔗 **Borrowed.** DBoW2's vocabulary-tree concept traces its lineage to Nistér & Stewénius's 2006 ["Scalable Recognition with a Vocabulary Tree"](https://people.eecs.berkeley.edu/~yang/courses/cs294-6/papers/nister_stewenius_cvpr2006.pdf) (CVPR). DBoW2 transplanted that structure into the binary-descriptor world and tuned the weighting scheme for SLAM.

What mattered about DBoW2 was deployment more than algorithm. Released as open source, the library was adopted as the loop-closure module of ORB-SLAM (2015), and ORB-SLAM2 and ORB-SLAM3 used the same DBoW2. From 2015 through the mid-2020s, place recognition in the SLAM community was effectively DBoW2's job.

The Gálvez-López–Tardós partnership is also worth noting. Tardós was the figure who, together with Mur-Artal and Campos, later led the ORB-SLAM trilogy. DBoW2 was, in effect, the place-recognition layer prepared in advance for that project.

---

## 10.4 NetVLAD — CNN-based VPR (2016)

The BoW family had one fundamental limit. The vocabulary was trained against a specific descriptor and a specific environment. When lighting changed, or the season shifted, or the viewpoint moved far enough, the distribution of visual words shifted too, and a pre-trained vocabulary broke.

At CVPR 2016, Relja Arandjelović, Petr Gronat, Akihiko Torii, Tomáš Pajdla, and Josef Sivic published [NetVLAD: CNN Architecture for Weakly Supervised Place Recognition](https://doi.org/10.1109/CVPR.2016.572). Sivic, among the authors, is the same Sivic behind "Video Google" in 2003. The person who had introduced BoW to image retrieval at ICCV 2003 was a co-author, thirteen years later, on the paper that pushed past that approach's limits.

NetVLAD's idea was to make **VLAD (Vector of Locally Aggregated Descriptors)** aggregation differentiable.

VLAD is an aggregation scheme proposed in 2010 by [Jégou et al.](https://inria.hal.science/inria-00548637/file/jegou_compactimagerepresentation.pdf), which represents an entire image by accumulating how much each local descriptor contributes, as a "residual," to its nearest cluster center (visual word). The VLAD sub-vector for cluster center $k$ is

$$\mathbf{V}(k) = \sum_{\mathbf{x}_i : \text{NN}(\mathbf{x}_i)=k} (\mathbf{x}_i - \boldsymbol{\mu}_k)$$

and the full VLAD vector $\mathbf{V} = [\mathbf{V}(1)^\top, \ldots, \mathbf{V}(K)^\top]^\top$ is the concatenation across all clusters, L2-normalized. With $K$ clusters and $D$-dimensional descriptors, the final vector is $KD$-dimensional. Because it keeps the residual to each center rather than only counting assignments, the VLAD vector carries richer information than a BoW histogram over the same vocabulary.
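As a concrete instance of the formula, the sketch below computes hard-assignment VLAD with plain Python lists. The function name and the two-cluster toy data are illustrative; a real system would obtain the centers from k-means over training descriptors.

```python
import math


def vlad(descriptors, centers):
    """Sum residuals per nearest center, concatenate, then L2-normalize."""
    k, d = len(centers), len(centers[0])
    sub = [[0.0] * d for _ in range(k)]
    for x in descriptors:
        nearest = min(
            range(k),
            key=lambda c: sum((xi - ci) ** 2 for xi, ci in zip(x, centers[c])),
        )
        for j in range(d):
            sub[nearest][j] += x[j] - centers[nearest][j]
    flat = [v for row in sub for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat


centers = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [0.0, 2.0], [9.0, 10.0]]
v = vlad(descriptors, centers)
print(len(v), v)  # K * D = 4 entries
```

Before normalization the first cluster accumulates the residual (1, 2) and the second (-1, 0), so the output records in which direction the descriptors sit relative to each center, not only how many landed there.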

> 🔗 **Borrowed.** NetVLAD's aggregation design inherits directly from VLAD in Jégou et al.'s "Aggregating Local Descriptors into a Compact Image Representation" (CVPR 2010). What NetVLAD did was swap VLAD's hard assignment for a soft assignment and make the whole pipeline trainable end-to-end.

The NetVLAD layer softens the nearest-neighbor assignment of classical VLAD into a softmax:

$$\bar{a}_k(\mathbf{x}_i) = \frac{e^{\mathbf{w}_k^\top \mathbf{x}_i + b_k}}{\sum_{k'} e^{\mathbf{w}_{k'}^\top \mathbf{x}_i + b_{k'}}}$$

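Here $\mathbf{x}_i$ is a local feature extracted by the CNN, and $\mathbf{w}_k$ and $b_k$ are learnable parameters. In NetVLAD the cluster centers $\boldsymbol{\mu}_k$ are learned as well, decoupled from $\mathbf{w}_k$ and $b_k$. Accumulating the NetVLAD vector with this soft assignment gives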

$$\mathbf{V}(k) = \sum_i \bar{a}_k(\mathbf{x}_i)\,(\mathbf{x}_i - \boldsymbol{\mu}_k)$$

and the full vector $\mathbf{V} = [\mathbf{V}(1)^\top, \ldots, \mathbf{V}(K)^\top]^\top$, after intra-normalization (L2 on each sub-vector) and a final L2-normalization, becomes the VPR descriptor. Unlike hard-assignment VLAD, the gradient back-propagates, so the CNN backbone can be trained end-to-end with it.
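The forward pass of the layer can be written out without any framework. This sketch covers only the soft assignment, residual accumulation, and the two normalizations; the function names and toy parameters are illustrative, and an actual implementation would live in an autodiff library so that `w`, `b`, and `mu` receive gradients.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def netvlad_forward(features, w, b, mu):
    """features: N x D local features; w: K x D; b: K; mu: K x D cluster centers."""
    k, d = len(mu), len(mu[0])
    sub = [[0.0] * d for _ in range(k)]
    for x in features:
        # Soft assignment of x to every cluster (replaces the hard nearest neighbor).
        a = softmax([sum(wi * xi for wi, xi in zip(w[c], x)) + b[c]
                     for c in range(k)])
        for c in range(k):
            for j in range(d):
                sub[c][j] += a[c] * (x[j] - mu[c][j])
    sub = [l2_normalize(row) for row in sub]               # intra-normalization
    return l2_normalize([v for row in sub for v in row])   # final L2


w = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
mu = [[0.0, 0.0], [1.0, 1.0]]
features = [[0.5, 0.2], [0.9, 1.1], [0.1, 0.0]]
print(netvlad_forward(features, w, b, mu))
```

Every operation here is smooth in `w`, `b`, `mu`, and the input features, which is the property that lets the backbone be trained through the aggregation.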

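Training was also different. The authors used Google Street View Time Machine data, taking image pairs of the same place at different times as positive examples and images of different places as negatives, under a weakly supervised triplet ranking loss. GPS says only that two images were taken near each other, not that they show the same view, which is why the supervision is weak. GPS positions alone served as the training signal, so no manual labeling was needed.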
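One way to write such a loss, sketched under the assumption that each query comes with several GPS-nearby "potential positives" and a set of far-away negatives: the best-matching potential positive is pulled closer than every negative by a margin. The function names, the margin value, and the 2-D toy descriptors are illustrative.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def weak_triplet_loss(query, potential_positives, negatives, margin=0.1):
    """Hinge loss against the closest potential positive, summed over negatives."""
    best_pos = min(sq_dist(query, p) for p in potential_positives)
    return sum(max(0.0, best_pos + margin - sq_dist(query, n))
               for n in negatives)


query = [0.0, 0.0]
potential_positives = [[0.1, 0.0], [1.0, 1.0]]  # near in GPS, maybe different view
negatives = [[2.0, 0.0], [0.2, 0.0]]            # far away in GPS
print(weak_triplet_loss(query, potential_positives, negatives))
```

Taking the minimum over potential positives is what absorbs the weak labels: a nearby image that happens to face the other way is simply never selected as the match.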

On the Pittsburgh 250k and Tokyo 24/7 benchmarks, NetVLAD was well ahead of the DBoW family and of earlier VLAD-based methods. It was much more robust across changes in lighting and season and had some tolerance to viewpoint differences. Still, NetVLAD was not immediately integrated into practical SLAM pipelines. Its inference speed and memory footprint were heavier than DBoW2, and the ORB-SLAM ecosystem had already been built around DBoW2.

---

## 10.5 Patch-NetVLAD, MixVPR, AnyLoc (2020–2023)

After NetVLAD, Visual Place Recognition (VPR) research scattered into improvements on generalization.

In 2021 Hausler et al. put out [Patch-NetVLAD](https://arxiv.org/abs/2103.01486). Instead of judging a place from a single global descriptor, as NetVLAD did, the image is split into patches and the NetVLAD representation of each patch is combined spatially. On Tokyo 24/7, it raised Recall@1 by about 10 percentage points over NetVLAD. Patch-level processing raised inference cost along with it.

In 2023 Ali-bey et al.'s [MixVPR](https://arxiv.org/abs/2303.02190) produced global features through MLP-based feature mixing over CNN feature maps, in the spirit of MLP-Mixer and without self-attention. The target was a balance between light weight and performance. VPR papers from this period commonly took Mapillary Street Level Sequences (MSLS) and seasonal-change datasets such as Nordland as benchmarks. Extreme lighting and seasonal conditions emerged as a common barrier.

In 2023 Keetha et al.'s [AnyLoc: Towards Universal Visual Place Recognition](https://arxiv.org/abs/2308.00688) took a different route. Use DINOv2-based self-supervised features for place recognition, as-is, with no fine-tuning.

> 🔗 **Borrowed.** AnyLoc's feature extraction pulls pre-trained ViT representations from Oquab et al.'s [DINOv2](https://arxiv.org/abs/2304.07193) (Meta AI, 2023). AnyLoc layered VLAD aggregation on top. The BoW–VLAD lineage that began with FAB-MAP rejoins the story in the foundation-model era.

DINOv2 is a Vision Transformer (ViT) trained on large-scale internet images. It produces general-purpose features that are not biased to a particular city, a particular season, or a particular camera. What AnyLoc drew attention to is the **facet** from which features are read. Each attention layer in a ViT offers four candidates per patch: the query (Q), key (K), and value (V) projections, and the layer's output token. Keetha et al. confirmed experimentally that, among these four facets, the value (V) facet gives the most semantically stable representation for place recognition. Q and K facets lean toward structural and geometric information, while the V facet concentrates more on semantics, which is useful for a consistent place representation across season and lighting. Keetha et al. showed that linking this V-facet representation to VLAD aggregation yields a single model that works across very diverse environments worldwide: indoor and outdoor, underground, aerial views. Across seven or more environments (Pittsburgh, Tokyo, indoor factories, underground parking garages, libraries), a single model was competitive with, or ahead of, earlier specialized methods.

Once generality came loose along one axis, the next branch moved toward crossing modality boundaries. Lee et al.'s [(LC)²](https://arxiv.org/abs/2304.08660) (RA-L 2023) projected camera imagery and LiDAR point clouds into a shared 2.5D depth image, attempting cross-modal retrieval in which a 2D query looks up places in a LiDAR map. The follow-up LC²++ took a LoRA-adapted global retrieval stage and chained MINIMA-based local matching and PnP onto it, connecting place identification all the way to 6-DoF pose recovery in a single pipeline. Such cross-modal evaluation became possible only because of datasets like Lee et al.'s [ViViD++](https://arxiv.org/abs/2204.06183) (RA-L 2022), which synchronize visible, thermal, event, LiDAR, inertial, and depth streams across indoor, outdoor, and underground settings.

---

## 10.6 Toward integrating place recognition and metric localization (2024–2025)

Place recognition research has run in parallel with the rest of SLAM's components since the early 2000s. ORB-SLAM embedded DBoW2, but the place-recognition module was a black box isolated from mapping and tracking. The input was an image, the output a loop-candidate ID.

Moving into 2024–2025, this boundary began to blur. Berton et al.'s [EigenPlaces](https://arxiv.org/abs/2308.10832) (2023) and Izquierdo & Civera's [SALAD](https://arxiv.org/abs/2311.15937) (2023 arXiv / CVPR 2024) explored pulling place-recognition descriptors directly into metric localization. Not stopping at "where have we seen this place," they tried to read 6-DoF pose out of the place-recognition representation itself.

Around 2024 there were also attempts to combine Gaussian-map representations with place recognition. It was a direction aligned with the rise of 3DGS (3D Gaussian Splatting) as a map representation.

> 📜 **Prediction vs. outcome.** In their 2011 FAB-MAP 2.0 paper, Cummins and Newman pushed the scale limit of place recognition, demonstrating appearance-only loop closure on a 1,000 km trajectory. Measured against the early FAB-MAP experiments that had run the Oxford campus and parts of the city, that was a two-digit-factor leap. Later city-scale experiments using DBoW2 and large vocabularies reproduced the same scale in practical SLAM. The scale problem was solved this way, but the failure mode Cummins and Newman left behind, vocabulary-based representations vulnerable to seasonal and lighting change, was crossed using a different tool that deep learning brought. `[diverted]`

> 📜 **Prediction vs. outcome.** In the introduction of the 2016 NetVLAD paper, Arandjelović et al. named three challenges for solving place recognition — a CNN architecture, enough training data, and an end-to-end training procedure — and laid out their contribution to each. The architecture and training procedure were answered directly by NetVLAD, but in the seven years that followed, a string of VPR papers targeted generalization across appearance conditions (season, lighting, viewpoint). In 2023, AnyLoc showed the possibility of a single multi-environment model with fine-tuning-free foundation-model features. Less a complete solution than the axis shifting from specialized models toward general-purpose ones. `[in progress]`

---

## 10.7 🧭 Still open

**Extreme season and lighting change.** The Nordland (Norwegian railway, summer vs. winter) and Oxford RobotCar (a year of seasonal change) datasets have reported the same barrier for over a decade. DINOv2-based methods have narrowed the gap, but no single model yet recognizes the same place at 99% accuracy between snow-covered winter and leaf-heavy summer. Place recognition in environments with heavy appearance change remains an open problem as of 2026.

**Integration of place recognition and metric localization.** In most SLAM pipelines today, place recognition answers only "where have we seen this," and actual pose estimation is handled by a separate PnP or descriptor-matching stage. Attempts to merge the two processes into a single representation appeared between 2023 and 2025, but no method has yet reached deployment-level precision and speed at the same time.

**Privacy of recognizable place representations.** The place representations a VPR system stores can be used, through reconstruction attacks, to recover the original images or the 3D structure. For commercial robots mapping the interiors of homes, hospitals, and offices, this becomes a real concern. A place-representation scheme that guarantees privacy without sacrificing performance does not yet exist.

---

The three lineages of Part 3 (the mature period) close this way. While ORB-SLAM standardized the feature-based pipeline, DSO completed the photometric theory, and the KinectFusion line laid out the possibilities and limits of dense mapping, place recognition sat somewhere different from all of them. It did not grow inside SLAM — it grew out of the image-retrieval problem in computer vision, and when SLAM needed loop closure, it took the supplier's seat. That distance turned into an advantage. When the deep-learning wave hit, place recognition absorbed the new tools faster than the rest of the SLAM pipeline did.

When AnyLoc appeared in 2023, Sivic's name was in the references, not the acknowledgements. The person who plugged BoW into image retrieval in 2003, and who was co-author on NetVLAD in 2016 when that approach was pushed past its limits. At the end of that lineage, AnyLoc pushed the door Sivic had opened over toward the foundation-model side.

Part 4 begins with a different kind of intrusion — not a place-recognition module borrowed from vision, but the geometry-first assumption at the core of every lineage in Part 3. The next chapter follows the crack from an unexpected direction: a graduate student at NYU, a depth-estimation CNN, and the slow unraveling of the premise that geometry must come from geometry.

---

# Ch.11 — The Return of Depth Estimation: From Eigen to Depth Anything

In Part 3 (Ch.7–10) the feature-based, direct, RGB-D, and place recognition lineages each reached maturity on their own terms. Geometry was everything: ORB-SLAM reconstructed the world with epipolar geometry, DSO relied on photometric consistency, KinectFusion stacked surfaces with ICP, and RGB-D fusion pipelines closed loops on geometric features. There was no room for learning to break in, or so it seemed. Part 4 is the story of that boundary collapsing, and the crack came from an unexpected direction — not from a SLAM researcher but from the computer vision side, from a single paper by a graduate student at NYU.

Monocular depth estimation was one of the oldest ill-posed problems in computer vision. Recovering depth from a single image is in principle impossible. A camera throws away depth information when it projects the 3D world onto 2D. Yet humans judge depth with one eye. Perspective, occlusion, texture gradient, surface shading. What if these could be learned statistically? In 2014, David Eigen at NYU put that question to a CNN. That one experiment was the start of a lineage that would rewrite the SLAM pipeline ten years later.

---

## 1. Eigen 2014 — the first CNN depth

There was monocular depth estimation research before 2014. Ashutosh Saxena (Make3D, Stanford) published [a system in 2005 that used a discriminatively trained Markov Random Field (MRF) over hand-crafted multi-scale image features to predict a depth map from a single image](https://papers.nips.cc/paper/2921-learning-depth-from-single-monocular-images). The results were coarse and barely worked in structured indoor environments.

[Eigen et al. 2014](https://arxiv.org/abs/1406.2283), by Eigen, Puhrsch, and Fergus, changed the approach itself. A two-stage CNN in which a coarse network predicted global structure and a fine network refined local detail. The training data came from NYU Depth v2: roughly 120,000 training images drawn from indoor video collected with a Kinect RGB-D camera. The numbers improved on Make3D by the standards of the day, but the proof of concept mattered more. Depth is learnable.

One decisive weakness remained, though. **Scale ambiguity.** The network learns relative depth structure, but absolute scale is tied to the distribution of the training data. Point a model trained on NYU indoors at an outdoor scene and the scale is wrong. This limit stayed with the whole field as an open problem until 2024.

> 🔗 **Borrowed.** Eigen 2014 inherited the depth estimation task itself from Make3D (Saxena 2005). Replacing the hand-crafted features and MRF with a CNN was the core swap; the task definition and evaluation metrics (RMSE, threshold accuracy) carried over.

---

## 2. Garg → Godard — self-supervised depth

The bottleneck in supervised depth learning was data. The Kinect works well indoors, but outdoors, especially in sunlight, the infrared pattern washes out. Building a large-scale outdoor RGB-D dataset is expensive.

In 2016, [Ravi Garg (University of Adelaide) opened another path](https://arxiv.org/abs/1603.04992). Use stereo image pairs as the training signal. Predict depth from the left image, then use that depth and the camera baseline to warp the right image into the left view. The left image already exists, so a photometric loss between it and the warped result provides supervision. No labels needed.
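The training signal can be illustrated on a single scanline of a rectified stereo pair. The sketch assumes the usual pinhole relation, disparity = focal length × baseline / depth, and uses inverse warping: the left view is reconstructed by sampling the right image at `x - disparity`. Function names and numbers are toy values chosen so the arithmetic is exact.

```python
def warp_right_to_left(right_row, depth_row, focal_px, baseline_m):
    """Reconstruct a left scanline by sampling the right one at x - f*B/Z."""
    width = len(right_row)
    out = []
    for x, z in enumerate(depth_row):
        src = x - focal_px * baseline_m / z      # source column in the right image
        src = min(max(src, 0.0), width - 1.0)    # clamp at the image border
        x0 = int(src)
        x1 = min(x0 + 1, width - 1)
        a = src - x0
        out.append((1 - a) * right_row[x0] + a * right_row[x1])  # linear interp
    return out


def photometric_l1(left_row, reconstructed):
    return sum(abs(a - b) for a, b in zip(left_row, reconstructed)) / len(left_row)


right = [0.0, 10.0, 20.0, 30.0, 40.0, 50.0]
left = [0.0, 0.0, 0.0, 10.0, 20.0, 30.0]   # same scene, shifted by 2 px
good_depth = [50.0] * 6                    # 200 * 0.5 / 50 = 2 px disparity
bad_depth = [100.0] * 6                    # 1 px disparity, which is wrong

print(photometric_l1(left, warp_right_to_left(right, good_depth, 200.0, 0.5)))
print(photometric_l1(left, warp_right_to_left(right, bad_depth, 200.0, 0.5)))
```

The correct depth reproduces the left scanline and the wrong depth does not, so the photometric error alone tells the network which way to move its prediction.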

Clément Godard (UCL) systematized the idea as **MonoDepth** in [Godard et al. 2017](https://doi.org/10.1109/CVPR.2017.699). Left-right consistency: a two-way constraint that the depth predicted from the left must match the depth predicted from the right. Adding structural similarity (SSIM) to the photometric loss raised stability in textureless regions. The key was that stereo pairs are only needed at training time. Inference runs on a single image. On the KITTI benchmark it was the best self-supervised method at the time.

> 🔗 **Borrowed.** Garg's and Godard's photometric loss comes out of the stereo matching literature. The intensity consistency constraint from disparity estimation, [as organized by Scharstein and Szeliski (2002)](https://vision.middlebury.edu/stereo/taxonomy-IJCV.pdf), was repurposed as the training signal for a depth network.

In 2019 Godard's *MonoDepth2* ([Godard et al. 2019, ICCV](https://arxiv.org/abs/1806.01260)) moved further into self-supervision, using monocular video instead of stereo pairs. A depth network and a pose network train together. The pose network predicts camera motion between consecutive frames, and the depth network's output warps the previous frame into the current one. The two networks jointly optimize to reduce the warping error. Two key devices were added. First, **minimum reprojection loss**: for each pixel, keep only the source frame with the lowest photometric error, which reduces errors in occluded regions. Second, **auto-masking**: automatically exclude pixels whose appearance does not change between frames, which happens when an object moves at the same velocity as the camera or when the camera itself is stationary.
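Both devices reduce to per-pixel comparisons once the photometric errors are available. The sketch below assumes those errors are already computed, one flat list per source frame, and masks a pixel whenever leaving the source frame unwarped explains it at least as well as warping does. The function name and the numbers are illustrative.

```python
def min_reprojection_with_automask(warped_errors, identity_errors):
    """warped_errors[s][p]:   error at pixel p after warping source frame s.
    identity_errors[s][p]: error at pixel p using source frame s unwarped.
    Returns (mean loss over kept pixels, keep-mask)."""
    n_pixels = len(warped_errors[0])
    total, mask = 0.0, []
    for p in range(n_pixels):
        warped = min(e[p] for e in warped_errors)      # minimum reprojection
        identity = min(e[p] for e in identity_errors)
        keep = warped < identity                       # auto-mask test
        mask.append(keep)
        if keep:
            total += warped
    kept = sum(mask)
    return (total / kept if kept else 0.0), mask


# Two source frames (previous and next), three pixels.
warped = [[0.1, 0.5, 0.3], [0.4, 0.2, 0.3]]
identity = [[0.6, 0.7, 0.05], [0.5, 0.9, 0.05]]
print(min_reprojection_with_automask(warped, identity))
```

The third pixel looks almost identical in the unwarped frames, as it would on a car driving at the camera's speed, so it is dropped instead of pushing the depth network toward infinite depth.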

A clean design, but problems remained. Moving objects and reflective surfaces broke photometric consistency, and sky was worse because it has no texture. Scale was still ambiguous too — video supervision only resolves scale relatively between frames.

---

## 3. MiDaS — mixing datasets

[Ranftl et al. 2020](https://doi.org/10.1109/TPAMI.2020.3019967), **MiDaS** (Mixing Datasets for Zero-shot Cross-dataset Transfer), led by René Ranftl at Intel, asked a different question. What if you train on many datasets at once instead of one?

The problem was that depth units and scales differ between datasets. NYU is indoor in meters, KITTI is outdoor LiDAR points, ReDWeb is stereo from movies, MegaDepth is SfM reconstruction. Mixing them as-is confuses the network.

Ranftl's fix was an **affine-invariant loss.** Inside the loss, each image's prediction and ground truth are normalized by an affine transformation (scale plus shift) before they are compared. Specifically, subtract the median from each map to remove shift, then divide by the mean absolute deviation from that median to remove scale. This scale-and-shift-invariant normalization eliminates unit mismatches between datasets. The network then learns "which is farther, relatively" rather than "how far."
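The normalization is short enough to verify by hand. This sketch assumes the scale estimate is the mean absolute deviation from the median and compares the normalized maps with a plain mean absolute error; the function names are illustrative, the input is a flat list of valid pixels, and a constant map (zero deviation) is not handled.

```python
from statistics import median


def normalize(d):
    """Remove shift (median) and scale (mean absolute deviation from it)."""
    t = median(d)
    s = sum(abs(x - t) for x in d) / len(d)
    return [(x - t) / s for x in d]


def affine_invariant_l1(pred, gt):
    p, g = normalize(pred), normalize(gt)
    return sum(abs(a - b) for a, b in zip(p, g)) / len(p)


gt = [1.0, 2.0, 4.0, 8.0, 16.0]
same_up_to_affine = [3.0 * x + 5.0 for x in gt]   # other units, other offset
wrong_order = [16.0, 8.0, 4.0, 2.0, 1.0]

print(affine_invariant_l1(same_up_to_affine, gt))  # essentially zero
print(affine_invariant_l1(wrong_order, gt))        # clearly nonzero
```

A prediction that differs from the ground truth only by a positive scale and a shift incurs no loss, while one that reverses the depth ordering is penalized. That is the sense in which the mixed datasets stop disagreeing about units.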

Trained on a mix of five datasets totaling roughly 1.9 million images in the original paper (later releases grew the mix to 12 datasets), MiDaS showed practical cross-dataset generalization for the first time. It produced plausible relative depth on outdoor scenes, indoor scenes, historical photographs, and movie frames. No absolute scale, but depth ordering and structure held.

Ranftl's team later released [**DPT** (Dense Prediction Transformer)](https://arxiv.org/abs/2103.13413) separately in 2021, swapping the MiDaS backbone for a ViT-based one. From MiDaS v3 onward DPT became the default backbone, and v3.1 (2022) was its refinement. Performance jumped.

> 🔗 **Borrowed.** MiDaS v3 and later Depth Anything adopted CLIP, DINOv2, and ViT-family backbones as-is. A backbone swap alone producing a performance jump is a common pattern in the foundation model era, but its first large-scale confirmation in depth estimation was DPT (Ranftl 2021).

---

## 4. Depth Anything — foundation scale

In January 2024, **Depth Anything** by Lihe Yang's team at TikTok Research ([Yang et al. 2024](https://arxiv.org/abs/2401.10891)) attacked the problem with sheer data volume: 1.5M labeled images (a merger of existing datasets) and 62M unlabeled images. Pseudo-labels were generated for the unlabeled set and included in training. To raise pseudo-label quality, semantic segmentation features served as auxiliary supervision.

The result surpassed MiDaS and all earlier methods across every major benchmark — KITTI, NYU, ScanNet, DIODE. The model size was 335M parameters on a ViT-L backbone. Inference speed was nowhere near real time, but quality came first.

[**Depth Anything v2**](https://arxiv.org/abs/2406.09414), released later the same year, added large amounts of synthetic data (Virtual KITTI 2, Hypersim, and others). Synthetic data covers regions that are hard to annotate in real data, such as reflective and transparent surfaces. v2 visibly improved over v1 in edge detail and thin structures.

Depth Anything still produces relative depth, though. No scale.

[**ZoeDepth** (Shariq Farooq Bhat et al. 2023)](https://arxiv.org/abs/2302.12288) and [**Metric3D v2** (2024)](https://arxiv.org/abs/2404.15506) attacked this last problem from two angles. ZoeDepth fine-tunes a relative-depth backbone with metric heads on datasets that carry metric ground truth. Metric3D v2 uses camera intrinsics (focal length in particular) as explicit information, so that the network can account for the fact that for the same scene a different focal length yields a different depth distribution. Metric depth results on in-the-wild data were qualitatively different from before. Not perfect, but usable in many practical scenarios.
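
The focal-length dependence is plain pinhole arithmetic: an object of size S at depth Z spans f·S/Z pixels, so doubling both f and Z leaves the image unchanged. The sketch below shows one way intrinsics can resolve this, by rescaling depth to a fixed reference focal length before prediction and back afterwards. The function names and the reference value of 1000 px are assumptions for illustration, not any model's published API.

```python
def apparent_size_px(object_size_m, depth_m, focal_px):
    """Pinhole projection: image size of an object of given metric size and depth."""
    return focal_px * object_size_m / depth_m

def to_canonical_depth(depth_m, focal_px, canonical_focal_px=1000.0):
    """Depth the same pixels would imply under the reference focal length."""
    return depth_m * canonical_focal_px / focal_px

def from_canonical_depth(canonical_depth, focal_px, canonical_focal_px=1000.0):
    """Undo the rescaling using the real camera's focal length."""
    return canonical_depth * focal_px / canonical_focal_px

# A 2 m door at 4 m (f = 500 px) and at 8 m (f = 1000 px) look identical.
print(apparent_size_px(2.0, 4.0, 500.0), apparent_size_px(2.0, 8.0, 1000.0))
# In canonical space both map to the same depth, so one prediction serves both.
print(to_canonical_depth(4.0, 500.0), to_canonical_depth(8.0, 1000.0))
```

Without the focal length, nothing in the pixels distinguishes the two cases, which is why metric recovery stays tied to knowing the intrinsics.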

---

## 5. Re-entry into SLAM

Around 2021, SLAM researchers started pulling monocular depth models into their pipelines. The entry point was initialization. Monocular SLAM is structurally awkward to initialize. Triangulating from two frames needs a sufficient baseline, and scale is ambiguous from the first step.

Injecting a depth prior into the first frame speeds initialization and roughly fixes scale. [DROID-SLAM, released in 2021 by Teed and Deng](https://arxiv.org/abs/2108.10869), ties recurrent optical flow to BA; follow-up work in that lineage experimented with bolting monocular depth priors onto the geometric initialization.

Scale recovery was more direct. Monocular visual odometry (VO) accumulates scale drift as it runs. Using depth network predictions as periodic scale anchors suppresses this drift. Not a perfect solution, a practical patch, but it held over much longer distances than pure VO.
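
A scale anchor of this kind can be as simple as a robust ratio between the network's depths and the up-to-scale depths the VO has triangulated at the same pixels. The following is a minimal sketch under that assumption; the function names are hypothetical, and real systems add outlier gating and temporal smoothing on top.

```python
import numpy as np

def scale_from_depth_prior(vo_depths, net_depths):
    """Median ratio between network depths and up-to-scale VO depths."""
    ratios = np.asarray(net_depths, dtype=float) / np.asarray(vo_depths, dtype=float)
    return float(np.median(ratios))  # median tolerates a few bad predictions

def rescale_positions(positions, scale):
    """Apply the recovered scale to camera positions expressed in VO units."""
    return np.asarray(positions, dtype=float) * scale

vo = [1.0, 2.0, 4.0, 0.5, 3.0]    # depths in arbitrary VO units
net = [2.0, 4.2, 7.8, 5.0, 6.0]   # network depths; the fourth one is an outlier
s = scale_from_depth_prior(vo, net)
print(s, rescale_positions([[0.0, 0.0, 1.0]], s))
```

Re-estimating the ratio periodically, rather than once, is what turns it into a drift anchor.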

> 📜 **Prediction vs. outcome.** Eigen mentioned in the 2014 paper that combining with 3D geometry information such as surface normals was a natural direction for extension. Joint multi-task learning was partly realized later in PAD-Net, VPD, and others. But as of 2024 the real impact arguably came less from combining tasks and more from sharing a ViT backbone. The predicted direction and the actual path diverged. `[diverted]`

> 📜 **Prediction vs. outcome.** MiDaS (2020) chose the detour of giving up absolute scale through its scale-and-shift invariant loss and focusing only on relative depth, which lined up with the sense that metric recovery is hard without camera parameters. In 2024 Depth Anything v2 and Metric3D v2 attacked this direction head-on by taking camera intrinsics as input, and in-the-wild metric depth came close to practical quality, though full camera independence is not there yet. `[in progress]`

---

## 🧭 Still open

**Depth on reflective and transparent surfaces.** With glass, water, and metallic reflections, what the camera captures is not the actual surface. This is a problem at the level of physical optics. Even with more synthetic training data, generalization on real-world reflective scenes remains unstable. Specialized approaches exist, such as [ClearGrasp (Sajjan et al. 2020)](https://arxiv.org/abs/1910.02550), but no general solution. Even foundation-scale models show structurally large errors in this regime.

**Separating ego-depth and object-depth in dynamic scenes.** In scenes with moving cars and pedestrians, photometric consistency breaks. Self-supervised methods mask out moving objects as a workaround, which avoids the problem rather than solving it. Jointly solving for the depth of moving objects separately from ego-motion has been attempted by several follow-up works including [Ranjan et al. (2019)](https://arxiv.org/abs/1805.09806), but remains a hard problem at the practical level.

**Generalization of metric scale.** Metric3D v2 and Depth Anything v2 have started producing metric depth conditioned on camera intrinsics. But situations where intrinsics are unknown are common. There are hundreds of smartphone models, and CCTV and historical archive photos do not even carry EXIF. Camera-independent metric depth is hard even at foundation model scale. As of 2025 this is the remaining core question in monocular depth.

---

In 2024, while Depth Anything was turning over the benchmarks, a paper from Cambridge had already spent nine years as an unfinished piece of homework in the SLAM community. Pull absolute pose straight out of a single image. No feature extraction, no optimization. No map to begin with. [PoseNet](https://arxiv.org/abs/1505.07427) was the name of that dream.

---

# Ch.12 — The End-to-End Frustration

Chapter 11 made it possible to recover depth from a single monocular camera. Eigen's network pulled metric depth out of pixels, and SfMLearner produced geometric supervision without labels. Learning had been shown to see shape, and the obvious next question followed: pose estimation, loop closure, could the whole of SLAM be finished off in a single network? Between 2015 and 2018, this question failed to find its answer.

In 2015, Alex Kendall, a PhD student in the Cambridge Department of Engineering working under Roberto Cipolla, completed a project that trained a neural network on photographs labeled with camera poses from structure-from-motion, then took a single photograph as input and output a 6-DoF pose. The paper, [PoseNet (Kendall et al. 2015)](https://doi.org/10.1109/ICCV.2015.336), drew immediate attention at ICCV in Santiago de Chile. What if SLAM's thirty-year equation (feature extraction, matching, optimization, map management) could be compressed into a single CNN? Between 2015 and 2018 this question produced dozens of papers that, with almost no exception, reached the same conclusion.

---

## 12.1 PoseNet

What PoseNet inherited was the ImageNet-trained deep CNN, the line opened by [AlexNet (Krizhevsky et al. 2012)](https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks). Kendall took the finding that deep CNNs trained for classification on ImageNet form high-level visual representations and repurposed that feature hierarchy for pose estimation.

> 🔗 **Borrowed.** PoseNet's backbone is the [GoogLeNet (Inception, Szegedy et al. 2014)](https://arxiv.org/abs/1409.4842) architecture. Remove the classification head, attach a 7-dimensional regression head (x, y, z, and four quaternion components) — that was the whole of it. A direct transplant of the feature hierarchy learned from ImageNet into localization.
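
A 7-dimensional output needs a loss that balances meters against quaternion units. The sketch below shows the general shape of such a loss: a position term plus a weighted orientation term. The weight `beta`, its default value, and the choice to normalize the predicted quaternion are assumptions for illustration, not the paper's exact settings.

```python
import numpy as np

def pose_regression_loss(pred, gt_xyz, gt_quat, beta=500.0):
    """Position error plus beta-weighted quaternion error for a 7-D pose vector.

    pred: 7 numbers, (x, y, z) followed by four quaternion components.
    beta: hypothetical weight balancing meters against quaternion units.
    """
    pred = np.asarray(pred, dtype=float)
    xyz, quat = pred[:3], pred[3:]
    quat = quat / np.linalg.norm(quat)  # regressed quaternions are not unit length
    pos_err = np.linalg.norm(xyz - np.asarray(gt_xyz, dtype=float))
    rot_err = np.linalg.norm(quat - np.asarray(gt_quat, dtype=float))
    return float(pos_err + beta * rot_err)

# Off by one meter along z with the right orientation costs exactly the position error.
print(pose_regression_loss([1, 2, 4, 1, 0, 0, 0], [1, 2, 3], [1, 0, 0, 0]))
```

Nothing in this loss knows about projection or epipolar geometry; the network is free to fit the labels any way it can.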

On the Cambridge Landmarks dataset Kendall collected himself — King's College Chapel, streets, a former hospital, and several other outdoor scenes — PoseNet reached position errors around 2 m and orientation errors of 5-8° depending on the scene (per §5 of the original paper). By 2015 standards these were impressive numbers. A single GPU returned an answer within 5 ms, without feature extraction, RANSAC, or map lookup.

The paper triggered immediate follow-ups. [Bayesian PoseNet (Kendall & Cipolla 2016)](https://arxiv.org/abs/1509.05909) tried to estimate pose uncertainty via Monte Carlo Dropout. LSTM PoseNet integrated sequence information. Variants with added geometric loss appeared, including Kendall's own 2017 version, which learned the weighting between position and orientation terms and added a reprojection-based geometric loss.

But as benchmark ceilings rose, the gap became visible. On the same scenes, structure-based localization such as [Active Search (Sattler et al. 2012)](https://www.graphics.rwth-aachen.de/media/papers/sattler_eccv12_preprint_1.pdf) achieved position errors on the order of 0.2 m, while the PoseNet family rarely got below a meter and was often no better than image retrieval with DenseVLAD. Regressing an absolute pose from a single image had a principled limit.

---

## 12.2 DeepVO

If one of PoseNet's limits was single-image input, what about feeding in a sequence? Sen Wang (Heriot-Watt, Edinburgh) and co-authors presented [Wang et al. 2017. DeepVO](https://arxiv.org/abs/1709.08429) at ICRA in 2017. A CNN influenced by FlowNet extracted optical flow features from consecutive frame pairs, and an LSTM accumulated temporal context to output VO directly.

> 🔗 **Borrowed.** DeepVO's training labels are KITTI's GPS/IMU ground truth, and its feature extraction design is borrowed directly from the optical flow CNN architecture of [FlowNet (Dosovitskiy et al. 2015)](https://arxiv.org/abs/1504.06852).

With the LSTM handling temporal modeling, the hope was drift suppression. The paper showed results where DeepVO had lower drift than the monocular VISO2 baseline (VISO2-M) on parts of the KITTI sequences. But there were conditions. Driving patterns similar to the training sequences, similar lighting, similar urban scenes, similar speed profiles. When conditions diverged, the "context" the LSTM had accumulated turned into bias instead.

[Zhou et al. 2017. SfMLearner](https://arxiv.org/abs/1704.07813), released the same year by Tinghui Zhou (UC Berkeley), came at the problem from a different angle. Self-supervised learning estimated depth and ego-motion jointly, using photometric reprojection loss as the training signal. The strength was that training needed no labels.
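
The geometric core of a photometric reprojection loss is a single mapping: lift a target pixel to 3D with its predicted depth, move it by the predicted pose, and project it into the source image. The sketch below is a minimal version of that step, assuming a pinhole camera with intrinsics `K` and a pose `(R, t)` that maps target-camera coordinates to source-camera coordinates. All names and values are illustrative.

```python
import numpy as np

def reproject(u, v, depth, K, R, t):
    """Map target pixel (u, v) with known depth to its location in the source view."""
    p = depth * (np.linalg.inv(K) @ np.array([u, v, 1.0]))  # back-project to 3D
    q = K @ (R @ p + t)                                      # move to source camera, project
    return q[0] / q[2], q[1] / q[2]

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
t = np.array([0.1, 0.0, 0.0])  # points sit 0.1 m further along +x in the source frame

# The same pixel lands in different places depending on the depth hypothesis:
print(reproject(320, 240, 2.0, K, np.eye(3), t))  # shift of f * tx / Z = 25 px
print(reproject(320, 240, 4.0, K, np.eye(3), t))  # shift of 12.5 px
```

The loss compares the source image sampled at the reprojected location with the target pixel. A wrong depth or pose sends the lookup to the wrong place, and the intensity mismatch is the training signal.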

> 🔗 **Borrowed.** SfMLearner's photometric loss has the same form as the intensity residual of classical direct SLAM. It moves the photometric principle of [DSO (Engel et al., arXiv 2016, TPAMI 2018)](https://arxiv.org/abs/1607.02565) into a differentiable learning framework, and the self-supervision idea survived downstream through MonoDepth2 and into DROID-SLAM.

That said, SfMLearner-only VO finished on the official KITTI leaderboard at less than half of ORB-SLAM's performance.

---

## 12.3 Three causes of failure

Between 2019 and 2020, papers in this area began a shared self-criticism. Sudeep Pillai (MIT, later TRI) in a 2019 talk systematized the structural limits of the end-to-end approach along three axes.

**First: absence of inductive bias.** Classical SLAM had carved into its algorithmic structure the geometric constraints accumulated over decades — epipolar constraint, rigid body motion assumption, scale invariance, spatial continuity. CNNs had to learn these from data. ImageNet's cats and cars do not teach the metric geometry of 3D space. Even when a regression network appears to hit the right pose, it was hard to tell whether that was because it understood 3D space or because it had memorized specific combinations of lighting, color, and texture.

**Second: generalization failure.** Step outside the training set and performance collapsed. A PoseNet trained on Cambridge Landmarks was unusable on Oxford streets. A DeepVO trained on KITTI saw drift grow rapidly on other driving datasets. Classical ORB-SLAM also failed (it lost tracking when feature detection failed or lighting changed drastically), but that failure was predictable and it could reinitialize. End-to-end tended to be quietly wrong, and wrong without any signal of how wrong.

**Third: absence of uncertainty quantification.** The reason SLAM does not end as a simple pose estimator is that downstream systems — path planning, obstacle avoidance — demand the covariance of the localization estimate. EKF and factor graphs propagate covariance naturally. Bayesian PoseNet tried to estimate variance through dropout, but it was hard to verify whether that variance held a calibrated relationship with actual position error. Especially on inputs outside the training distribution, Bayesian PoseNet returned confident wrong answers instead, which for a robotic system is worse than failing visibly.

---

## 12.4 A record of reflection

Kendall did not look away from this failure. He co-founded Wayve in 2017, around the end of his PhD, and turned toward imitation learning and world model research for autonomous driving. He had not given up on learning-based localization; he had judged that the problem statement "regress absolute pose from a single image" was wrong.

Federico Tombari's group (TU Munich, later Google) attempted [CNN-SLAM (Tateno et al. 2017)](https://arxiv.org/abs/1704.03489) around the same time. The approach fused dense depth predicted by a CNN with the depth measurement of direct monocular SLAM. It was not fully end-to-end in the sense that the learned part was confined to dense depth, but it was one branch of the hope that "maybe CNNs can solve the scale and low-texture problems of monocular SLAM." The results were uneven across scenes, and accuracy did not consistently come out ahead.

> 📜 **Prediction vs. outcome.** Kendall, in the PoseNet paper (2015), pointed to uncertainty estimation, temporal integration, and extension to larger-scale scenes as the next tasks. All three directions were pursued — Bayesian PoseNet (2016), LSTM PoseNet (2016), multiple outdoor extension experiments. But each attempt hit a new wall, and researchers in the end abandoned the whole approach. A reasonable prediction means nothing when the platform itself is wrong. `[abandoned]`

Some attempts survived in other directions. SfMLearner's photometric self-supervision was absorbed into the training strategy of MonoDepth2 (Godard 2019) and further into DROID-SLAM (Teed & Deng 2021). The LSTM-based temporal modeling DeepVO demonstrated reappeared in a modified form in visual-inertial learning research. The ideas did not disappear; their use changed.

> 📜 **Prediction vs. outcome.** Zhou, in the SfMLearner paper (2017), flagged dynamic object handling and robustness to photometric noise as remaining tasks. Follow-up self-supervised work including [GeoNet (Yin & Shi 2018)](https://arxiv.org/abs/1803.02276) partially pushed in this direction. But the path of self-supervised VO alone replacing SLAM never joined the mainstream. Photometric self-supervision carried its lineage forward, but the goal of end-to-end VO was rejected by the field. `[diverted]`

---

## 12.5 Settling of the lesson

Around 2020 the field reached a consensus. "Geometry as algorithm, learning as feature and prior" — roughly along those lines.

> 🔗 **Borrowed.** This principle's realization takes shape in Chapter 13's CodeSLAM (Bloesch 2018) and DROID-SLAM (Teed & Deng 2021), both of which keep the geometric skeleton of factor graph or bundle adjustment and confine the learned part to feature extraction or depth prior formation — the skeleton PoseNet threw out turned out not to be discardable.

The classical pipeline was not uniformly superior to the learning-based alternatives. ORB-SLAM, too, often failed in textureless environments, at night, in the rain. What decided the matter was not that classical SLAM was sturdier, but that end-to-end's errors were more opaque and more unpredictable.

The failure was not a problem of dataset or architecture. The path running straight from image to pose was missing thirty years of geometric knowledge.

---

## 🧭 Still open

**Which inductive bias to inject, and how.** The principle "geometry as algorithm" is right, but which geometry at which level should be coded is still an open question. Rigid body motion? Epipolar constraint? In the foundation model era this boundary is blurring again. GaussianSLAM and 3DGS-based systems are experimenting with ways to dissolve geometry into the learned representation.

**Calibration of learned uncertainty.** Even after Bayesian PoseNet's failure this problem is unsolved. Whether deep learning-based uncertainty estimates hold a calibrated relationship with actual error — especially on out-of-distribution inputs — remains open as of 2026. Autonomous driving is applying practical pressure on this question.

**Redefining "end-to-end."** The end-to-end as PoseNet defined it (image → pose, by learning alone) failed. But since the arrival of foundation models in 2023, the meaning of end-to-end is shifting. Which modules of SLAM to fill with learning and which to keep as algorithm — that dividing line itself is being renegotiated.

The principle "geometry as algorithm, learning as feature" settled in this period. In 2018, out of Andrew Davison's lab at Imperial College London in Kensington, came the first substantive implementation of that principle: CodeSLAM.

---

# Ch.13 — The Hybrid Victory: From CodeSLAM to DROID-SLAM

When Michael Bloesch presented CodeSLAM at CVPR 2018, his affiliation was the Dyson Robotics Lab at Imperial College London, where he worked as a postdoctoral researcher in Andrew Davison's group. In the same group, Richard Newcombe had built DTAM in 2011; in the same lab, Jan Czarnowski would release DeepFactors in 2020, and Edgar Sucar and Tristan Laidlow would extend the lineage. That is why CodeSLAM is more than a single paper. The creed Davison had been building since 2002 ("SLAM is probabilistic inference") met for the first time, in a substantive way, the impulse brought by mid-2010s deep learning that "representations can be learned," and the meeting took place in Bloesch's 2018 paper.

---

## 13.1 CodeSLAM — latent code and the map

In traditional monocular SLAM, depth was something to be estimated. Whether as a few hundred sparse landmarks or, as in [DTAM](https://www.doc.ic.ac.uk/~ajd/Publications/newcombe_etal_iccv2011.pdf) (Newcombe et al. 2011), every pixel, depth was ultimately an optimization variable. The dimension of that variable space scaled with image resolution. A dense depth map for a single keyframe at 640×480 means 307,200 independent variables. Optimization is heavy, initialization is sensitive, and priors are hard to inject.

The idea of [Bloesch et al. 2018. CodeSLAM](https://doi.org/10.1109/CVPR.2018.00271) was simple. Instead of optimizing the depth map itself, optimize a low-dimensional latent vector (**latent code**) that generates it. Train a variational autoencoder (VAE) on real depth distributions, and its bottleneck latent space approximates the manifold on which "realistic depth maps" live. The optimization moves only on that manifold. The variables drop from hundreds of thousands to hundreds.

> 🔗 **Borrowed.** CodeSLAM's latent depth representation borrows the encoder-decoder latent space structure established in [Kingma & Welling 2013. VAE](https://arxiv.org/abs/1312.6114). The training phase follows the VAE frame, but at SLAM inference time **z** is treated directly as a MAP optimization variable without stochastic sampling. A tool generative-model researchers devised for image synthesis re-emerged a decade later as the low-dimensional representation space for SLAM optimization.

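The structure goes as follows. The autoencoder is conditioned on the keyframe image: during training the encoder compresses a ground-truth depth map into a latent code **z**, and the decoder reconstructs a dense depth map from **z** and the image. At SLAM time no depth map exists to encode, so each keyframe's **z** starts at zero (the image-only depth guess) and is refined by optimization. Camera pose and **z** are jointly optimized. A photometric loss enforces consistency, and a latent prior regularizes **z** toward the prior distribution.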

Written as an equation, the objective is:

$$E(\mathbf{z}, T) = \sum_{i,j} \rho\bigl(I_j(\pi(T_{ij}, D_\mathbf{z}(u_i), u_i)) - I_i(u_i)\bigr) + \lambda \|\mathbf{z}\|^2$$

$D_\mathbf{z}$ is the decoder, $\pi$ is the projection, $\rho$ is a robust cost, and $T_{ij}$ is the relative pose between keyframes. The latent prior term $\lambda\|\mathbf{z}\|^2$ corresponds, up to an additive constant and the scaling absorbed into $\lambda$, to the negative log-likelihood of the standard normal prior $p(\mathbf{z}) = \mathcal{N}(0, I)$, and is the regularizer that falls out naturally from MAP inference under a Gaussian prior.
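
A toy version makes the variable-count argument concrete. The sketch below is not CodeSLAM: it swaps the photometric term for a handful of direct depth observations, and it uses a random linear map as a stand-in decoder (`W`, `d0`, and the function name are all hypothetical). What it keeps is the structure of the objective, a data term plus $\lambda\|\mathbf{z}\|^2$, solved over a 4-dimensional code instead of 100 independent depths.

```python
import numpy as np

def optimize_code(W, d0, obs_idx, obs_depth, lam=1e-6):
    """MAP estimate of a latent code under a linear decoder D(z) = d0 + W z.

    Minimizes ||D(z)[obs_idx] - obs_depth||^2 + lam * ||z||^2 in closed form.
    """
    A = W[obs_idx]                 # decoder rows for the observed pixels
    r = obs_depth - d0[obs_idx]    # residual of the zero-code prediction
    z = np.linalg.solve(A.T @ A + lam * np.eye(W.shape[1]), A.T @ r)
    return z, d0 + W @ z           # code and the dense depth it decodes to

rng = np.random.default_rng(0)
W = rng.normal(size=(100, 4))      # stand-in decoder: 100 pixels, 4-D code
d0 = np.full(100, 2.0)             # depth decoded from the zero code
depth_true = d0 + W @ np.array([0.5, -1.0, 0.3, 0.8])

obs_idx = np.arange(0, 100, 10)    # only 10 of the 100 pixels are observed
z, dense = optimize_code(W, d0, obs_idx, depth_true[obs_idx])
print(np.abs(dense - depth_true).max())  # the other 90 pixels are filled in by the decoder
```

With a nonlinear decoder and a photometric residual the closed form disappears and the solve becomes an iterative Gauss-Newton step, but the variables being updated are still the code and the pose.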

> 🔗 **Borrowed.** The factor graph (Dellaert and Kaess's [GTSAM](https://gtsam.org/tutorials/intro.html)) provided the backend skeleton of DeepFactors. The pose–latent coupling that CodeSLAM handled with joint optimization was reformulated by Czarnowski as an explicit factor graph. It was only at DeepFactors that a learned latent variable took its place next to the traditional pose nodes as another graph variable. The interface between the two worlds was the graph's edge.

The ability to fill in geometry from sparse input surpassed prior methods. CodeSLAM itself, however, was not real-time. The VAE inference and optimization loop were slow. The paper said so plainly.

> 📜 **Prediction vs. outcome.** CodeSLAM showed that a compact learned representation could be brought inside dense SLAM, but left room on both speed and scale. The follow-up DeepFactors (2020), from the same Imperial group, pushed one step further toward real-time but did not reach deployment-grade performance, and the general-purpose coverage across monocular, stereo, and RGB-D was eventually achieved by a different team (Teed and Deng, Princeton) through a different design — learned frontend plus Dense Bundle Adjustment (DBA). `[in progress + diverted]`

---

## 13.2 DeepFactors — Imperial Dyson Lab, factor graph integration

In 2020, Jan Czarnowski, also supervised by Davison at the Imperial Dyson Robotics Lab, released [Czarnowski et al. 2020. DeepFactors](https://doi.org/10.1109/LRA.2020.2969036). Czarnowski's aim was to pull the CodeSLAM idea inside an actual SLAM pipeline.

DeepFactors kept CodeSLAM's latent depth structure, recast the joint pose and code optimization as an explicit factor graph, separated tracking and mapping, and introduced a keyframe selection criterion. On an NVIDIA GTX 1080, tracking ran at about 250 Hz against keyframes, but the network Jacobian computation took several hundred milliseconds per keyframe and was the bottleneck of the whole pipeline. It pointed the direction but did not reach deployment-grade real-time.

What DeepFactors proved was a principle. A learned representation can enter the factor graph as a node, and geometry optimization can operate on that latent space. The conclusion Czarnowski reached was simple. The realistic path is not end-to-end replacement, but swapping parts of the pipeline for learnable modules.

Around the same time, Daniel Cremers's group at TU Munich reached the same principle. Their starting point differed from Imperial's. Where the Davison lineage stacked a factor graph on top of CodeSLAM's VAE latent, the Cremers group took their own 2016 direct sparse odometry ([DSO](https://arxiv.org/abs/1607.02565)) as the skeleton and injected neural prediction into it. [Yang, Wang, Stückler, Cremers 2018. DVSO](https://arxiv.org/abs/1807.02570) injected neural depth into monocular DSO as a "virtual stereo," hallucinating a second camera in a monocular setting; [Yang, von Stumberg, Wang, Cremers 2020. D3VO](https://arxiv.org/abs/2003.01060) added three kinds of self-supervised neural prediction — depth, pose, and uncertainty — as additional factors in DSO's factor graph. [Wimbauer et al. 2021. MonoRec](https://arxiv.org/abs/2011.11814) and [Wimbauer et al. 2023. Behind the Scenes](https://arxiv.org/abs/2301.07668) carried the same lineage toward dynamic scene dense reconstruction and single-view density fields. The human lineage is separate from the Imperial group, but the design principle, absorbing neural prediction into classical optimization structure, converged.

That principle appeared again in 2021, in another form, at Princeton.

---

## 13.3 RAFT — recurrent optical flow

Zachary Teed and Jia Deng (Princeton) presented [Recurrent All-Pairs Field Transforms (RAFT)](https://arxiv.org/abs/2003.12039) at ECCV 2020. RAFT was not a SLAM paper. It was an optical flow estimation paper.

Yet RAFT's design becomes the core of DROID-SLAM later. The structure splits into three parts.

1. Feature encoder: a CNN extracts feature maps from two images
2. Correlation volume: similarities between all pixel pairs are built into a 4D volume. 4-level pyramid
3. Update operator: Gated Recurrent Unit (GRU) based iterative refinement. Looks up the correlation volume and updates the flow field

The all-pairs in the name summarizes what sets this structure apart. Rather than looking only at specific neighboring pixels, it considers every candidate location at once and refines the flow field progressively at a fixed resolution. Unlike earlier coarse-to-fine methods (PWC-Net and others), it keeps a single flow field at one resolution (1/8 of the input, upsampled at the end) while looking up the correlation pyramid. In the paper's reported results it cut error relative to the best published methods by 16% on KITTI and by 30% on the Sintel final pass.
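
The correlation volume and its pyramid from the list above can be sketched in NumPy. This is an illustration of the data structure only, with random arrays standing in for the feature maps. The function names are hypothetical, and the division by the square root of the feature dimension is a common normalization that is assumed here.

```python
import numpy as np

def all_pairs_correlation(f1, f2):
    """Dot product between every pixel of f1 and every pixel of f2.

    f1, f2: (H, W, C) feature maps. Returns an (H, W, H, W) volume.
    """
    C = f1.shape[-1]
    return np.einsum("ijc,klc->ijkl", f1, f2) / np.sqrt(C)

def correlation_pyramid(corr, levels=4):
    """Average-pool the last two dimensions by 2 at each level.

    The first two dimensions (pixels of image 1) stay at full feature
    resolution, so the flow field never changes size.
    """
    pyramid = [corr]
    for _ in range(levels - 1):
        H, W, h, w = pyramid[-1].shape
        c = pyramid[-1][:, :, : h - h % 2, : w - w % 2]
        pyramid.append(c.reshape(H, W, h // 2, 2, w // 2, 2).mean(axis=(3, 5)))
    return pyramid

rng = np.random.default_rng(0)
f1, f2 = rng.normal(size=(16, 16, 8)), rng.normal(size=(16, 16, 8))
pyr = correlation_pyramid(all_pairs_correlation(f1, f2))
print([p.shape for p in pyr])
```

The update operator then reads small windows out of each level around the current flow estimate, which gives it both fine and coarse matching evidence at every iteration.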

RAFT is not an ancestor in the SLAM lineage. But Teed noticed that its update operator closely resembles SLAM's iterative bundle adjustment. How different is a GRU refining a flow field from an optimization step refining pose and depth?

---

## 13.4 DROID-SLAM — the update operator and BA

NeurIPS 2021, [Teed & Deng. DROID-SLAM](https://arxiv.org/abs/2108.10869). The DROID in the title stands for "Differentiable Recurrent Optimization-Inspired Design."

Walking through the architecture reveals the intent of the hybrid design.

The frontend has the same structure as RAFT. A CNN encoder extracts feature maps, an all-pairs correlation volume is built, and a GRU update operator iteratively estimates dense flow. The difference is that flow is estimated not between a single pair of images but simultaneously on every edge of a keyframe graph.

The backend is DBA. Pose and inverse depth are the optimization variables. The 2D correspondences supplied by flow estimation are used as constraints to jointly optimize pose and depth. The Schur complement trick solves the linear system efficiently.
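
The Schur complement trick relies on the block structure of the normal equations: there are few pose variables and many depth variables, and each depth variable couples only to poses, so the depth block is diagonal. The sketch below solves such a system in NumPy. The block names `B`, `E`, `c_diag`, `v`, `w` are illustrative, and the test system is random rather than derived from real residuals.

```python
import numpy as np

def schur_solve(B, E, c_diag, v, w):
    """Solve [[B, E], [E.T, diag(c)]] @ [dx_pose, dx_depth] = [v, w].

    B: (p, p) pose block, E: (p, n) pose-depth coupling,
    c_diag: (n,) diagonal of the depth block.
    """
    c_inv = 1.0 / c_diag                        # inverting a diagonal block is elementwise
    S = B - (E * c_inv) @ E.T                   # reduced (p, p) system on poses only
    dx_pose = np.linalg.solve(S, v - E @ (c_inv * w))
    dx_depth = c_inv * (w - E.T @ dx_pose)      # back-substitute for every depth
    return dx_pose, dx_depth

rng = np.random.default_rng(0)
n = 50
A = rng.normal(size=(6, 6))
B = A @ A.T + 10.0 * np.eye(6)
E = 0.1 * rng.normal(size=(6, n))
c = 1.0 + rng.random(n)
v, w = rng.normal(size=6), rng.normal(size=n)

dx_pose, dx_depth = schur_solve(B, E, c, v, w)
full = np.linalg.solve(np.block([[B, E], [E.T, np.diag(c)]]), np.concatenate([v, w]))
print(np.allclose(dx_pose, full[:6]), np.allclose(dx_depth, full[6:]))
```

The only dense solve is on the small pose system, so cost grows with the number of keyframes rather than with the number of pixels.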

The connecting tissue is the **DBA layer**. The flow and uncertainty estimated by the GRU feed into the DBA. When the DBA updates pose and depth, the result refreshes the reference for the next GRU iteration. The two modules are connected in a loop.

> 🔗 **Borrowed.** The idea of DBA goes back ten years. Newcombe's DTAM (2011) was the precursor of photometric bundle adjustment using every pixel. DROID-SLAM married that idea to the more robust input of learned flow. Newcombe's and Teed's affiliations differ, but the logical lineage runs through.

> 🔗 **Borrowed.** The update operator in DROID-SLAM was ported directly from RAFT by the same authors (Teed and Deng). The insight was that all-pairs recurrent refinement, designed for optical flow, is structurally compatible with the iterative optimization of bundle adjustment. That the same person wrote both papers made the borrowing possible.

On the EuRoC MAV dataset, DROID-SLAM recorded a lower RMSE ATE than ORB-SLAM3, the then-state-of-the-art. The advantage showed on TartanAir (synthetic) as well as on real indoor and outdoor sequences, and it was most pronounced under lighting change and texture scarcity, where feature-based methods struggle. The numbers Teed later reported on the EuRoC V1_02 sequence in his Handbook retrospective are striking. An ATE of 16.5 cm with the frontend alone dropped to 1.2 cm after global optimization. Classical BA converged to single-digit centimeters on the constraints supplied by learned correspondence.
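
For readers unfamiliar with the metric, RMSE ATE is the root-mean-square position error after the estimated trajectory has been aligned to ground truth. The sketch below assumes a similarity alignment (rotation, translation, and scale, in the Umeyama least-squares form), as is usual for monocular systems whose scale is arbitrary. It is a generic illustration, not the evaluation script behind the numbers above.

```python
import numpy as np

def align_sim3(est, gt):
    """Least-squares similarity (s, R, t) such that s * R @ est_i + t ~ gt_i."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    e, g = est - mu_e, gt - mu_g
    U, D, Vt = np.linalg.svd(g.T @ e / len(est))
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                      # guard against a reflection
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / e.var(axis=0).sum()
    t = mu_g - s * R @ mu_e
    return s, R, t

def ate_rmse(est, gt):
    """Root-mean-square position error after similarity alignment."""
    s, R, t = align_sim3(est, gt)
    aligned = s * (est @ R.T) + t
    return float(np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1))))

rng = np.random.default_rng(0)
gt = rng.normal(size=(20, 3))
th = 0.7
R0 = np.array([[np.cos(th), -np.sin(th), 0.0],
               [np.sin(th), np.cos(th), 0.0],
               [0.0, 0.0, 1.0]])
est = 0.5 * (gt @ R0) + np.array([1.0, 2.0, 3.0])  # same path, other frame and scale
print(ate_rmse(est, gt))                            # numerically zero after alignment
```

Because the alignment absorbs any global rotation, translation, and scale, ATE measures the shape of the trajectory rather than the choice of coordinate frame.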

Looking back at why Ch.12's pure end-to-end approach failed reveals why DROID-SLAM was different. PoseNet regressed pose directly without geometric constraints and failed to generalize. Teed and Deng divided the roles. Dense correspondence estimation was left to learning; geometric constraint enforcement was handled by BA. Neural networks played to their strengths in feature extraction and dense matching; the geometry optimizer played to its strengths in consistency enforcement and uncertainty propagation. Hand-designed features were replaced with learned features, but the optimization structure was kept. That is where the 2021 hybrid differed from the 2015 end-to-end.

---

## 13.5 The Imperial Dyson Lab lineage

The flow from CodeSLAM to DROID-SLAM is best seen by tracing the human lineage of the Imperial Dyson Robotics Lab.

Andrew Davison led SLAM research at Imperial for twenty years after MonoSLAM in 2003. His students and collaborators created branching points one after another.

- **Richard Newcombe** (Davison advisee, Imperial): DTAM (2011), [KinectFusion](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/ismar2011.pdf) (2011). Later Oculus → Meta Reality Labs
- **Michael Bloesch** (postdoc in Davison's group, Imperial): CodeSLAM (2018), earlier visual-inertial state estimation research
- **Jan Czarnowski** (Davison advisee, Imperial): DeepFactors (2020)
- **Edgar Sucar** (Davison group, Imperial): [iMAP](https://arxiv.org/abs/2103.12352) (2021), later extended to the NeRF-SLAM lineage
- **Tristan Laidlow** (Davison group, Imperial): dense 3D reconstruction, later extended to the neural implicit SLAM lineage

This lineage is one of the cases where "school" is not an exaggeration. The factor graph plus uncertainty philosophy carried across, changing shape from monocular sparse to dense latent, and from dense latent to implicit representation. In [FutureMapping](https://arxiv.org/abs/1803.11288) (2018) and [FutureMapping 2](https://arxiv.org/abs/1910.14139) (2019, with Ortiz), Davison laid out explicitly the computational structure and representations a Spatial AI system should carry. The argument was to bind diverse geometric and semantic representations into one probabilistic graph. CodeSLAM and DeepFactors were the first experiments under that sketch.

Teed and Deng are outside this lineage. Princeton, independent path. But the logical predecessor of DROID-SLAM's DBA is DTAM, and DTAM is Newcombe and Davison's work. Lineages can extend logically without human connection.

---

## 13.6 2023–2025 — extensions after DROID

In the years after DROID-SLAM, research grew on top of it, and alongside it.

[GO-SLAM](https://arxiv.org/abs/2309.02436) (Zhang et al. 2023) extended DROID-SLAM's tracking with online loop closing and full bundle adjustment, and ran mapping on an Instant-NGP style neural implicit representation (multi-resolution hash encoding). Tracking is DROID-family dense flow plus BA; the map is an implicit representation. A second layer of hybrid.

[NICER-SLAM](https://arxiv.org/abs/2302.03594) (Zhu et al. 2023) took a different road. Instead of borrowing from DROID, it solved tracking and mapping simultaneously on a single hierarchical neural implicit representation. The goal of RGB-only dense SLAM is shared, but the path differs. It is a way of hitting the same problem from outside the DROID lineage.

[SplaTAM](https://arxiv.org/abs/2312.02126) (Keetha et al. 2024) swapped the map representation for 3D Gaussian Splatting, and rewrote tracking not as DROID-style dense flow but on silhouette-guided differentiable rendering. It is a coupling with the 3DGS lineage rather than a direct extension of the DROID lineage.

[DPV-SLAM](https://arxiv.org/abs/2408.01654) (Lipson, Teed, Deng 2024) came out of the same Princeton group as DROID-SLAM. Based on [DPVO](https://github.com/princeton-vl/DPVO) (Deep Patch Visual Odometry) rather than DROID, it added proximity-based loop closure and CUDA block-sparse BA to build a system roughly 2.5× faster and with a smaller memory footprint than DROID-SLAM. The core is not a feature swap but a patch-based sparse representation combined with efficient loop closure.

Outside the DROID lineage, extensions in 2024–2025 followed the path Naver Labs opened with [DUSt3R](https://arxiv.org/abs/2312.14132) (Wang et al. 2023). After DUSt3R redefined the procedure of SfM itself by directly outputting pointmaps from two images (covered in detail in Ch.16), the same Revaud group introduced symmetric multi-view extension and working memory in [Cabon et al. 2025. MUSt3R](https://arxiv.org/abs/2503.01661), extending the image-pair-based structure to many frames. It is an attempt to let a single network handle both offline SfM and online VO/SLAM. More interesting is that DROID-family tools are recycled inside this ecosystem. [Li et al. 2024. MegaSAM](https://arxiv.org/abs/2412.04463) pushed DROID-SLAM's differentiable DBA toward dynamic scenes and uncalibrated video, jointly optimizing camera intrinsics during inference as well. NVIDIA's [Huang et al. 2025. ViPE](https://arxiv.org/abs/2508.10934) combined three kinds of constraints (DROID-SLAM's dense flow network, cuVSLAM's sparse points, and a monocular depth network) in a single DBA, and scaled it into an annotation pipeline for large volumes of in-the-wild video. The 2021 DROID design of learned frontend plus classical backend is being repeated in 2025 under harder conditions: calibration-free and dynamic scenes.

The pattern is not a single path. The route of stacking a neural map on top of DROID tracking as in GO-SLAM, the route of lightweight redesign with patch odometry as in DPV-SLAM, the route of rewriting tracking on implicit or splatting representations as in NICER-SLAM or SplaTAM, and the route of pushing DROID's DBA into uncalibrated and dynamic regimes as in MegaSAM and ViPE, all ran in parallel. The learned frontend plus classical backend framework that Teed and Deng released in 2021 became the common starting point for those branches.

> 📜 **Prediction vs. outcome.** DROID-SLAM set the reference point for a hybrid combining differentiable DBA with end-to-end learning. DPV-SLAM, released three years later by the same group, carried that reference point forward on efficiency. In contrast, the GO-SLAM, NICER-SLAM, and SplaTAM line branched off toward swapping the map representation for implicit or Gaussian splatting. DROID's "learned frontend plus classical backend" design is being varied along several branches, and which branch becomes the general-purpose solution has not yet been settled as of 2026. `[in progress]`

---

## 🧭 Still open

Generalization of learned priors outside the training distribution is the first problem. CodeSLAM's and DeepFactors's VAEs learn the depth distribution of the training data. In fully different environments (outdoor open-world, non-uniform texture, nighttime), a learned prior can even pull the optimization in the wrong direction. DROID-SLAM's flow estimator also drops in performance outside its training domain. As of 2026, "learned SLAM that works in any environment" does not yet exist. Approaches that train on diverse synthetic data such as TartanAir exist, but a sim-to-real gap remains.

The real-time constraint is still there. DROID-SLAM runs on average at about 10–15 fps on an NVIDIA RTX 2080Ti. It gets slower as the keyframe graph grows. DBA is the bottleneck. For applications that need real-time (30Hz+) low-power deployment, such as mobile robots or AR/VR, it is still not practical as of 2026. Lightweighting attempts (reducing keyframe count, approximate BA) exist, but carry performance trade-offs.

Learned integration of loop closure also remains unresolved. DROID-SLAM does not handle loop closure explicitly. Teed himself admitted, calmly, in his later Handbook retrospective that "DROID-SLAM doesn't include any relocalization module, so large loops with lots of drift cannot be closed." The keyframe graph is maintained in a sliding-window fashion, and global consistency is limited. Some efforts have tried to integrate learned loop closure (the place recognition work in Ch.12) into DROID's factor graph, but they have not converged into a single system. The point where the Ch.10 NetVLAD lineage and the Ch.13 DROID lineage meet is still open.

---

One question remains once we are here. Must the map representation stay as points, lines, and planes? DROID-SLAM's inverse depth map was the best dense representation available in 2021. But in 2020, [NeRF](https://arxiv.org/abs/2003.08934) (Neural Radiance Field) suggested a wholly different possibility. What if a scene is represented not as points or meshes but as a continuous function? If rendering is differentiable, photometric consistency can be enforced in a new way.

In 2021, at Imperial College, Edgar Sucar asked the same question — not as a rendering exercise but as a SLAM problem. Could an MLP replace the TSDF voxel grid entirely? The answer, and the fourteen months it took to arrive at it, is where Ch.14 begins.

---

# Ch.14 — The NeRF Shock and Its Graft onto SLAM: iMAP → NICE-SLAM

What if a learned representation were used not for tracking but for *the map itself*? Edgar Sucar at Imperial College answered that question with iMAP in 2021, and the material for the answer came from outside SLAM. It was NeRF.

In March 2020, Ben Mildenhall and colleagues posted [Mildenhall et al. 2020. NeRF](https://arxiv.org/abs/2003.08934) to arXiv, and the paper synthesized new-viewpoint photographs from eight images. Those photographs carried the grain of light and shadow. The SLAM community first read this as a rendering problem. Not a way to *build* a map, but a way to *display* one. Shifting that reading took fourteen months. At ICCV 2021, when Sucar presented iMAP, it became clear that NeRF could serve not as a rendering tool but as the map representation itself. iMAP extended the line of KinectFusion (Ch.9). It was the first implementation of the hypothesis that an implicit neural field could replace a TSDF voxel grid.

NeRF did not come out of thin air. Over the course of 2019 three strands of coordinate-based MLP representation of 3D went off almost simultaneously. [Park et al.'s DeepSDF](https://arxiv.org/abs/1901.05103) described object surfaces implicitly with an MLP that took coordinates in and returned signed distance; [Mescheder et al.'s Occupancy Networks](https://arxiv.org/abs/1812.03828) made the same coordinate input emit occupancy probability; [Sitzmann et al.'s SRN](https://arxiv.org/abs/1906.01618) stored a scene feature vector at each coordinate and composed images through differentiable ray marching. The same mathematical frame: feed coordinates in, get a field value out. Mildenhall et al. 2020 NeRF added volume rendering integration and positional encoding to this frame and closed it onto view synthesis. What iMAP inherited was not one paper but that whole one-year lineage.

---

## NeRF: MLP-based spatial representation

NeRF's core idea is that a single MLP holds the entire 3D space implicitly. The input is a spatial coordinate $(x, y, z)$ and a viewing direction $(\theta, \phi)$. The output is the color $(r, g, b)$ and density $\sigma$ at that position. How does this represent a full scene?

Rendering uses the volume rendering equation. A ray leaving camera origin $\mathbf{o}$ in direction $\mathbf{d}$ is sampled along parameter $t$:

$$\hat{C}(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma\!\left(\mathbf{r}(t)\right) \mathbf{c}\!\left(\mathbf{r}(t), \mathbf{d}\right)\, dt$$

Here $T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\, ds\right)$ is the accumulated transmittance — the probability the ray reaches $t$ unblocked. The integral is approximated by a piecewise Riemann sum.
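
The quadrature fits in a few lines. The following is a minimal NumPy sketch, assuming the discretization used in the NeRF paper, where a segment of length $\delta_i$ contributes opacity $1 - e^{-\sigma_i \delta_i}$. The function name `render_ray` and the array layout are illustrative choices, not taken from any released code.

```python
import numpy as np

def render_ray(sigma, color, t):
    """Quadrature estimate of the rendered color along one ray.

    sigma: (N,) densities at the N samples
    color: (N, 3) RGB values at the N samples
    t:     (N + 1,) sorted segment edges between t_n and t_f
    """
    sigma = np.asarray(sigma, dtype=np.float64)
    color = np.asarray(color, dtype=np.float64)
    delta = np.diff(np.asarray(t, dtype=np.float64))   # segment lengths
    alpha = 1.0 - np.exp(-sigma * delta)               # opacity of each segment
    # Transmittance before segment i: exp(-sum_{j<i} sigma_j * delta_j)
    optical_depth = np.concatenate([[0.0], np.cumsum(sigma * delta)[:-1]])
    trans = np.exp(-optical_depth)
    weights = trans * alpha
    return weights @ color, weights
```

The weights telescope: their sum equals one minus the transmittance of the whole ray, so whatever is missing from the sum is the light that passed through without hitting anything.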

> 🔗 **Borrowed.** The volume rendering equation comes from [Kajiya & Von Herzen (1984)](https://courses.cs.duke.edu/cps296.8/spring03/papers/RayTracingVolumeDensities.pdf), a classical graphics paper. For nearly forty years it was a physics-based tool for offline rendering; Mildenhall turned it into the loss function of a reverse-direction optimization.

The problem that MLPs fail to learn high-frequency spatial signals was solved in the Mildenhall et al. (2020) NeRF paper by positional encoding. Projecting coordinates $(x, y, z)$ through sine and cosine functions at multiple frequencies lets the network learn fine textures and sharp boundaries:

$$\gamma(p) = \left(\sin(2^0 \pi p),\, \cos(2^0 \pi p),\, \ldots,\, \sin(2^{L-1} \pi p),\, \cos(2^{L-1} \pi p)\right)$$
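
The encoding $\gamma$ is applied to each coordinate independently. A minimal NumPy sketch, assuming the coordinates have already been normalized to a bounded range; the function name and the choice of `L` as a plain argument are illustrative.

```python
import numpy as np

def positional_encoding(p, L):
    """Map coordinates of shape (..., D) to features of shape (..., 2 * L * D).

    For every coordinate the output is ordered as
    sin(2^0 pi p), cos(2^0 pi p), ..., sin(2^(L-1) pi p), cos(2^(L-1) pi p).
    """
    p = np.asarray(p, dtype=np.float64)
    freqs = (2.0 ** np.arange(L)) * np.pi                       # (L,)
    angles = p[..., None] * freqs                               # (..., D, L)
    enc = np.stack([np.sin(angles), np.cos(angles)], axis=-1)   # (..., D, L, 2)
    return enc.reshape(*p.shape[:-1], -1)
```

With `L = 10`, a 3D point becomes a 60-dimensional vector. Two points that differ by a small offset end up far apart in the highest-frequency channels, which is what lets the MLP separate them.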

> 🔗 **Borrowed.** NeRF's positional encoding appeared in the original Mildenhall et al. (2020) paper. The same year, [Tancik et al. (2020)](https://arxiv.org/abs/2006.10739) "Fourier Features Let Networks Learn High Frequency Functions" explained why this technique works through neural tangent kernel (NTK) theory.

NeRF's training runs backward. Images taken from known camera poses are compared with rendered outputs, and the per-pixel L2 loss is minimized. Once optimization ends, the MLP weights themselves store the scene's geometry and appearance. No voxels, no mesh. Space sits inside the network parameters.

The original NeRF had clear weaknesses, though. Training took hours, and the model was specialized to one scene with camera poses obtained beforehand from external SfM such as COLMAP. Transplanting this to SLAM demands that pose estimation and map learning happen jointly, at something close to real time.

---

## iMAP: the first neural implicit SLAM

Edgar Sucar at Imperial College's Dyson Robotics Lab presented [Sucar et al. 2021. iMAP](https://doi.org/10.1109/ICCV48922.2021.00612) at ICCV 2021, and that was the attempt. **iMAP** (Implicit MAP) took RGB-D camera input and optimized poses while using a single MLP as the map.

The structure is two alternating optimization loops. The *mapping* loop samples rays from the current keyframe and a set of randomly sampled past keyframes and updates the MLP. The *tracking* loop freezes the MLP and optimizes the current frame's pose against rendering loss. Both loops operate on one shared MLP.

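Two loss terms. A photometric loss $\mathcal{L}_{\text{color}} = \|\hat{C} - C\|_1$ and a geometric loss $\mathcal{L}_{\text{depth}} = \|\hat{D} - D\|_1$, the latter normalized by the variance of the rendered depth so that uncertain regions such as object borders are down-weighted. Because iMAP uses RGB-D, depth supervision is available and geometry learning stayed stable.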

iMAP was a proof of concept. It ran on small indoor scenes but had two structural problems. First, a single MLP forgot earlier regions as new regions were added. The catastrophic forgetting problem of neural networks. Sucar partially mitigated this through keyframe replay, but that was not a root solution. Second, as scenes grew, the representational capacity of a single MLP ran short. An MLP's forward pass treats the whole space as one function regardless of parameter count.

> 📜 **Prediction vs. outcome.** Sucar wrote in the iMAP Conclusion that "future directions for iMAP include how to make more structured and compositional representations that reason explicitly about the self similarity in scenes." The structured and compositional direction did become the central stem of follow-up work. Five months later ETH Zürich's pre-release of NICE-SLAM cut space hierarchically with a multi-resolution voxel feature grid, and Wang et al.'s Co-SLAM (2023) fused hash grid and coordinate encoding to push near-real-time performance of 10–17 Hz on an RTX 3090Ti. The "self-similarity, explicitly reasoned about" branch, however, did not develop much in the main line of NeRF-SLAM, and the lineage that refined a single MLP also drifted from the center. `[partial hit]`

---

## NICE-SLAM: hierarchical grid and scalability

The direct answer to iMAP's single-MLP problem came from Zihan Zhu and Songyou Peng at ETH Zürich in [Zhu et al. 2022. NICE-SLAM](https://arxiv.org/abs/2112.12130), presented at CVPR 2022. **NICE-SLAM** (Neural Implicit Scalable Encoding for SLAM) combined a multi-resolution voxel feature grid with a small MLP decoder in place of a single MLP.

The idea is to split space into an explicit voxel grid but place a learnable feature vector at each voxel. At rendering time, features from the voxels around the sample coordinate are combined by trilinear interpolation and pushed through the small MLP to produce color and occupancy. The MLP does not need to be large. Most of the spatial information lives in the grid.
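
The grid lookup itself is plain trilinear interpolation. A minimal NumPy sketch of that single step, assuming a dense feature array indexed by integer voxel corners and a query point given in voxel units; the decoder MLP is left out, and the function name and layout are illustrative rather than NICE-SLAM's own code.

```python
import numpy as np

def trilinear(grid, x):
    """Interpolate a feature vector at continuous position x.

    grid: (X, Y, Z, C) features stored at integer voxel corners, each axis >= 2
    x:    (3,) query position in voxel units
    """
    upper = np.array(grid.shape[:3]) - 1
    x = np.clip(np.asarray(x, dtype=np.float64), 0.0, upper)
    i0 = np.minimum(np.floor(x).astype(int), upper - 1)   # lower corner index
    f = x - i0                                            # fractional offset in [0, 1]
    out = np.zeros(grid.shape[3])
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = ((f[0] if dx else 1.0 - f[0]) *
                     (f[1] if dy else 1.0 - f[1]) *
                     (f[2] if dz else 1.0 - f[2]))
                out += w * grid[i0[0] + dx, i0[1] + dy, i0[2] + dz]
    return out
```

Only the eight corner features around the query receive gradient, which is the locality that a single MLP lacks.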

NICE-SLAM stacked three resolutions of geometry grid hierarchically: the coarse grid holds overall scene shape, the middle grid holds structural detail, and the fine grid adds a high-frequency residual on top, with color kept in a separate feature grid. When a new region is added, only the features of the corresponding voxels need updating, so catastrophic forgetting of other regions was largely reduced.

In tracking, NICE-SLAM, similarly to iMAP, froze the MLP and grid features and optimized pose. In mapping it updated the grid features. On Replica and ScanNet it handled larger spaces than iMAP and reached higher detail quality.

There were limits. The grid's own memory grew as the cube of resolution. A room or two indoors could be handled, but scaling to multi-story buildings or outdoors remained unsolved. Speed was also far from real time.

Thomas Müller's [Müller et al. 2022. Instant-NGP](https://nvlabs.github.io/instant-ngp/) attacked this bottleneck from another angle at SIGGRAPH 2022. A hash-table-based feature encoding solved the memory explosion of voxel grids and cut training time from hours to seconds or minutes. Instant-NGP was not a SLAM paper, but subsequent NeRF-SLAM work almost uniformly adopted hash encoding.
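
The indexing trick behind hash encoding is small. A minimal NumPy sketch of the spatial hash, assuming non-negative integer voxel indices and the three multipliers listed in the Instant-NGP paper; `hash_index` and the table size are illustrative names and values. The full encoding keeps one table per resolution level and interpolates the eight corner features, which is omitted here.

```python
import numpy as np

# Per-axis multipliers; the first is 1 so that the x axis stays cache-coherent.
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)

def hash_index(ijk, table_size):
    """Map integer voxel coordinates (..., 3) to slots in a fixed-size table."""
    ijk = np.asarray(ijk, dtype=np.uint64)
    h = ijk * PRIMES                            # wraps modulo 2^64
    idx = h[..., 0] ^ h[..., 1] ^ h[..., 2]     # XOR across the three axes
    return (idx % np.uint64(table_size)).astype(np.int64)
```

The table size is fixed in advance, so memory no longer grows with grid resolution. Collisions are not resolved explicitly: colliding voxels share a slot and training sorts out which one dominates.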

> 🔗 **Borrowed.** NICE-SLAM's multi-resolution feature grid overlaps in time with Instant-NGP's hash encoding and was designed independently, but in actual NeRF-SLAM implementations Instant-NGP's hash grid quickly replaced the NICE-SLAM grid. It is also the logical succession from KinectFusion (Ch.9), which stored TSDF in a grid, to storing features in a grid.

---

## Co-SLAM and NeRF-SLAM: two integration directions

From late 2022 onward, several systems split after iMAP and NICE-SLAM. One direction was to make the implicit representation more efficient; the other was to couple the robust backend of classical SLAM with a NeRF map.

UCL's [Wang et al. (2023) **Co-SLAM**](https://arxiv.org/abs/2304.14377) belongs to the first. It used joint coordinate and parametric encoding, combining a multi-resolution hash grid with one-blob encoding. The two representations were designed to complement each other, aiming at fast convergence and surface completeness together. The hash grid quickly filled observed dense regions, while the coordinate encoding provided a smooth prior over unobserved regions. 10–17 Hz on an RTX 3090Ti. The first point at which NeRF-based SLAM reached near-real-time territory.

In the same CVPR that year, Idiap/EPFL's [Johari et al.'s **ESLAM**](https://arxiv.org/abs/2211.11704) solved a similar problem from a different angle. Instead of a 3D feature grid it used multi-scale axis-aligned feature planes, cutting memory growth from $O(n^3)$ to $O(n^2)$, and took TSDF rather than volume density as the decoding target to accelerate convergence.
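
The gap between cubic and quadratic growth is easy to put in numbers. A back-of-the-envelope sketch, assuming float32 features, a single resolution level, and three axis-aligned planes; the resolution and channel count are hypothetical and are not the actual configuration of NICE-SLAM or ESLAM.

```python
def dense_grid_bytes(n, channels, bytes_per_value=4):
    """Memory of an n x n x n feature grid."""
    return n ** 3 * channels * bytes_per_value

def tri_plane_bytes(n, channels, bytes_per_value=4):
    """Memory of three n x n feature planes (xy, xz, yz)."""
    return 3 * n ** 2 * channels * bytes_per_value

n, channels = 256, 32  # hypothetical values for illustration
grid = dense_grid_bytes(n, channels)
planes = tri_plane_bytes(n, channels)
ratio = grid / planes  # equals n / 3
```

At these values the dense grid takes 2 GiB and the planes take 24 MiB. The ratio is $n/3$, so doubling the resolution doubles the advantage of the planes.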

Antoni Rosinol (MIT) posted [**NeRF-SLAM**](https://arxiv.org/abs/2210.13641) to arXiv in October 2022, taking a different approach. It kept classical SLAM tracking and backend (factor graph optimization) as they were and replaced only the map representation with NeRF. Poses and dense depth were both delivered by the DROID-SLAM frontend. Rosinol took those poses, depths, and uncertainties as input and built an Instant-NGP-based map in parallel.

> 🔗 **Borrowed.** The NeRF-SLAM backend operates on top of Dellaert's factor graph optimization (Ch.6). Even under the hypothesis that "NeRF can change the map," the core mathematics of pose estimation sat unchanged on the graph structure established after 2005.

Rosinol chose modularity. Rather than forcing NeRF into the whole pipeline, he swapped only the map representation layer. Classical SLAM functions such as loop closure stayed in place.

---

## The structural limit of iMAP and what it means

Looking back, iMAP's significance lies in the concept more than the performance. It was the first system to show that a single MLP could hold the whole scene and be updated in near real time while pose was also optimized.

The root problem of a single MLP is the absence of locality. Rendering any part of space goes through the entire MLP. Two consequences follow. First, learning a new region shifts all the weights and degrades the representation of old regions (catastrophic forgetting). Second, as the scene grows, the spatial variety a single MLP must carry increases, requiring a larger network and more iterations. Representation capacity is tied linearly to parameter count, while scene complexity scales with spatial volume. A representation without locality grows more unfavorable as scale increases.

NICE-SLAM's grid, Instant-NGP's hash encoding, Co-SLAM's dual encoding all answered this locality problem. Partitioning space locally so that each part remembers only its own region makes new information less intrusive to old memory, and decouples rendering cost for a given region from overall scene size.

---

## 🧭 Still open

**Real-time NeRF-SLAM.** As of 2023, iMAP and NICE-SLAM were far from real time, and Co-SLAM reached 10–17 Hz on an RTX 3090Ti to touch near-real-time, yet still fell short of real time on mobile and robotic embedded hardware. Gaussian Splatting (Ch.15) later handled the speed problem differently, through a return to explicit representations, but classical real-time SLAM with implicit neural fields themselves (30 fps or more, without a consumer GPU) remains unfinished. Even as Instant-NGP pushed rendering speed up dramatically, overall throughput of the joint tracking and mapping loop remains constrained.

**Large-scale outdoor environments.** Efforts such as [Block-NeRF](https://arxiv.org/abs/2202.05263) (2022, Tancik et al.) that partition space into many local NeRFs exist, but have not meshed cleanly with SLAM's loop closure and global consistency requirements. City-scale NeRF-SLAM is an open problem.

**Semantic and editable implicit maps.** A NeRF map is optimized for rendering, which makes semantic label insertion and post-hoc editing difficult. Operations like "remove this object from the map" or "classify this region for a different purpose" are far more inconvenient than on a TSDF or point cloud. Language-guided NeRF editing work (the [LERF](https://arxiv.org/abs/2303.09553), [Nerfstudio](https://arxiv.org/abs/2302.04264) ecosystem) is in progress, but real-time integration with SLAM pipelines is still at the research stage as of 2026.

---

While iMAP and NICE-SLAM pushed implicit fields to their limit, a part of the research community was looking the other way. Rather than confining the map implicitly inside MLP weights or a feature grid, scattering space with millions of small ellipsoids explicitly placed in it could give fast rendering and intuitive editing. Before Bernhard Kerbl's paper at SIGGRAPH 2023 it was still a hypothesis.

---

# Ch.15 — The Gaussian Splatting Era: From 3DGS to GS-SLAM

In Ch.14, iMAP and NICE-SLAM showed what it meant to remember space with an MLP. There was a price. The MLP was opaque. No one could tell which neuron stored which region, and every new observation touched the whole network. NICE-SLAM's sub-1 fps rate on an RTX 3090 was hard to reconcile with the phrase "real-time SLAM." The scene sat locked inside network parameters, with no window into it.

At SIGGRAPH in August 2023, Bernhard Kerbl (INRIA), Georgios Kopanas, Thomas Leimkühler, and George Drettakis presented [their paper](https://arxiv.org/abs/2308.04079). Kerbl faced the implicit representation paradigm that NeRF had built up over three years and made a different choice. While iMAP, NICE-SLAM, and Co-SLAM locked scenes inside MLPs and voxel grids, Kerbl scattered millions of small ellipsoids (Gaussian primitives) into space. The SLAM community tilted toward this representation within six months. Kerbl's choice was not a new invention. The roots lay in a twenty-year-old graphics technique, Matthias Zwicker's EWA splatting (2001), and the differentiable rendering spirit of NeRF carried over intact. The difference was in the form of the representation.

---

## The structure of 3DGS

Kerbl represented scenes as an explicit set of Gaussians. Each Gaussian has a position (mean) $\boldsymbol{\mu} \in \mathbb{R}^3$, a covariance matrix $\boldsymbol{\Sigma} \in \mathbb{R}^{3 \times 3}$, a learnable opacity $\sigma \in (0,1]$, and a color expressed in spherical harmonics coefficients. For training stability the covariance is factored into a scale vector $\mathbf{s}$ and a unit quaternion $\mathbf{q}$, which keeps $\boldsymbol{\Sigma}$ positive semi-definite under gradient updates. With $\mathbf{S} = \mathrm{diag}(\mathbf{s})$ and $\mathbf{R}$ the rotation matrix of $\mathbf{q}$:

$$\boldsymbol{\Sigma} = \mathbf{R}\mathbf{S}\mathbf{S}^\top\mathbf{R}^\top$$
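
The factorization can be checked numerically. A minimal NumPy sketch, assuming the quaternion is stored in $(w, x, y, z)$ order and normalized before use; the function names are illustrative.

```python
import numpy as np

def quat_to_rot(q):
    """Rotation matrix of a quaternion given as (w, x, y, z)."""
    w, x, y, z = np.asarray(q, dtype=np.float64) / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def covariance(s, q):
    """Sigma = R S S^T R^T with S = diag(s)."""
    R = quat_to_rot(q)
    S = np.diag(np.asarray(s, dtype=np.float64))
    return R @ S @ S.T @ R.T
```

Whatever values the optimizer assigns to $\mathbf{s}$ and $\mathbf{q}$, the result is symmetric with eigenvalues $s_j^2 \ge 0$. Optimizing the nine entries of $\boldsymbol{\Sigma}$ directly would give no such guarantee.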

Rendering alpha-blends the projected 2D Gaussians in depth order. Each Gaussian's effective opacity $\alpha_i$ is the product of the learnable opacity $\sigma_i$ and the 2D Gaussian density $G_i(\mathbf{x})$ evaluated at the pixel location. The pixel color $C$ is

$$C = \sum_{i \in N} c_i \alpha_i \prod_{j<i}(1 - \alpha_j), \quad \alpha_i = \sigma_i \cdot G_i(\mathbf{x})$$
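
The blending sum for a single pixel can be written directly from the formula. A minimal NumPy sketch, assuming the Gaussians have already been projected to 2D and sorted front to back; the function names are illustrative, and the tile-based CUDA rasterizer of the real system is not modeled.

```python
import numpy as np

def gaussian_2d(x, mean, cov):
    """Unnormalized 2D Gaussian density G(x), equal to 1 at the mean."""
    d = np.asarray(x, dtype=np.float64) - np.asarray(mean, dtype=np.float64)
    return float(np.exp(-0.5 * d @ np.linalg.solve(cov, d)))

def composite(colors, alphas):
    """Front-to-back alpha blending of N depth-sorted splats at one pixel.

    colors: (N, 3) per-splat colors c_i
    alphas: (N,) effective opacities alpha_i = sigma_i * G_i(x)
    """
    colors = np.asarray(colors, dtype=np.float64)
    alphas = np.asarray(alphas, dtype=np.float64)
    # prod_{j<i} (1 - alpha_j): light remaining when splat i is reached
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = alphas * trans
    return weights @ colors, weights
```

The structure mirrors the NeRF quadrature: a per-sample opacity times the transmittance accumulated in front of it. The difference is that the samples are a short sorted list of splats rather than points marched along a ray.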

Unlike NeRF, which numerically approximates a volume rendering integral, 3DGS drops directly onto a GPU rasterization pipeline. The tile-based rasterizer implements both the forward pass and the backward pass as custom CUDA kernels. Over 30 fps on a single RTX 3090. Compared to NICE-SLAM, which was running under 1 fps on the same card, the rendering is dozens of times faster.

Initialization uses a sparse point cloud from SfM. Training then iterates a **densification** procedure that splits, clones, and prunes Gaussians. When the view-space position gradient crosses a threshold, Gaussians with large scale split into two children, and Gaussians with small scale clone at the same position. Gaussians with low opacity are pruned periodically.
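
The decision rule described above reduces to three boolean masks. A minimal NumPy sketch, assuming per-Gaussian statistics are already accumulated; the function name and all three threshold values are placeholders chosen for illustration, not the settings of the released implementation.

```python
import numpy as np

def densify_actions(grad_norm, scale_max, opacity,
                    grad_thresh=2e-4, scale_thresh=0.01, min_opacity=0.005):
    """Return boolean masks (split, clone, prune), one entry per Gaussian.

    grad_norm: (N,) average view-space position gradient magnitude
    scale_max: (N,) largest axis scale of each Gaussian
    opacity:   (N,) learnable opacity of each Gaussian
    """
    grad_norm = np.asarray(grad_norm, dtype=np.float64)
    scale_max = np.asarray(scale_max, dtype=np.float64)
    opacity = np.asarray(opacity, dtype=np.float64)
    prune = opacity < min_opacity                  # nearly transparent: remove
    hot = (grad_norm > grad_thresh) & ~prune       # badly fit and worth keeping
    split = hot & (scale_max > scale_thresh)       # too large: break into two
    clone = hot & (scale_max <= scale_thresh)      # too small: duplicate
    return split, clone, prune
```

The three masks are disjoint by construction, so each Gaussian receives at most one action per densification pass.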

> 🔗 **Borrowed.** 3DGS's rasterization-based splatting descends directly from Zwicker et al.'s [EWA splatting (2001)](https://www.cs.umd.edu/~zwicker/publications/EWAVolumeSplatting-VIS01.pdf). Zwicker wrapped each point in an elliptical weighted-average kernel to render point clouds. Kerbl replaced that kernel with a learnable Gaussian and accelerated it with a GPU tile rasterizer.

---

## Structural fit between 3DGS and SLAM

Implicit representations did not suit SLAM. MLP-based NeRF had to retrain the whole network for every new observation, and catastrophic forgetting made incremental updates hard. Map expansion meant resizing the network. NICE-SLAM's voxel grid eased the problem but could not escape the resolution–memory trade-off.

3DGS solved this structurally. A Gaussian is an object explicitly sitting in space, so when a new keyframe comes in, one simply adds Gaussians to the corresponding region. The densification procedure meshed naturally with keyframe insertion, and rendering quality stayed at NeRF level while running in real time. GS-SLAM papers poured in through late 2023 on that arithmetic.

---

## GS-SLAM: the first attempt

Chi Yan (HKU) and collaborators posted [Yan et al. 2023. GS-SLAM](https://arxiv.org/abs/2311.11700) on arXiv in November 2023. It was the first system to integrate 3DGS into a SLAM pipeline.

GS-SLAM's structure followed the classical SLAM framework. Tracking estimates the pose of the current frame; Mapping updates the Gaussian map. Yan's contributions were twofold. First, adaptive Gaussian expansion: a mechanism that inserts Gaussians into low-coverage regions when a new keyframe is added. Second, geometry-aware Gaussian selection: during backpropagation of the rendering loss, only Gaussians with large contribution are selected for optimization, buying speed.

Tracking optimizes the pose against a rendered photometric loss. GS-SLAM's tracking loss is an L1 color loss over sampled pixels:

$$\mathcal{L}_{track} = \sum_m \|\mathbf{C}_m - \hat{\mathbf{C}}_m\|_1$$

At the mapping step, Yan uses a weighted sum of color L1 and depth L1. Meanwhile the 3DGS original paper's training loss, the $(1-\lambda)\mathcal{L}_1 + \lambda\mathcal{L}_{D\text{-}SSIM}$ combination ($\lambda=0.2$), is the default form inherited when training the Gaussian map. The reason this loss is differentiable with respect to pose is 3DGS's differentiable rasterizer.

On the Replica dataset it matched NICE-SLAM's PSNR while raising throughput. The limits were plain too. It assumed an RGB-D camera and was not validated on large-scale outdoor environments.

---

## SplaTAM: silhouette-based densification

Nikhil Keetha (Carnegie Mellon)'s [Keetha et al. 2024. SplaTAM (CVPR)](https://arxiv.org/abs/2312.02126) differed from GS-SLAM in design philosophy. Instead of a complicated selection mechanism, Keetha went with a simple silhouette-mask-based densification.

Keetha's central idea is the **silhouette mask**. New Gaussians are added in the region of the current view that the existing Gaussians do not explain — the empty part of the rendered mask. Rather than computing where Gaussians should be, look at where they are not, and fill those. A simple rule.
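
The rule can be sketched as a threshold on the rendered silhouette followed by back-projection. A minimal NumPy sketch, assuming a pinhole intrinsics matrix `K`, a sensor depth image, and a silhouette image holding accumulated opacity per pixel; the threshold of 0.5 and the function name are assumptions for illustration, and any additional terms in the paper's full densification mask are left out.

```python
import numpy as np

def new_gaussian_centers(silhouette, depth, K, thresh=0.5):
    """Back-project pixels that the current map does not explain.

    silhouette: (H, W) accumulated opacity rendered from the existing Gaussians
    depth:      (H, W) sensor depth in meters, 0 where invalid
    K:          (3, 3) pinhole intrinsics
    Returns camera-frame centers (M, 3) and the boolean mask (H, W).
    """
    mask = (silhouette < thresh) & (depth > 0)
    v, u = np.nonzero(mask)                  # pixel rows and columns
    z = depth[v, u]
    x = (u - K[0, 2]) / K[0, 0] * z
    y = (v - K[1, 2]) / K[1, 1] * z
    return np.stack([x, y, z], axis=-1), mask
```

The returned centers are in the camera frame; transforming them by the current pose estimate places them in the map. Pixels already covered by the map contribute nothing, so the map grows only at its frontier.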

Tracking optimizes the pose; Mapping optimizes the Gaussian parameters. The strict separation of the two steps is the basis of stability. GS-SLAM's alternating tracking-and-mapping optimization causes interference, and Keetha's two-step structure sidesteps it.

> 🔗 **Borrowed.** SplaTAM's keyframe-based map management structure is a case of the PTAM (Klein & Murray, 2007) idea running again on a new representation. PTAM's way of selectively inserting keyframes to maintain the map became, in SplaTAM, the trigger for Gaussian densification.

On the Replica dataset SplaTAM recorded PSNR 34.11 dB. In the same paper's table, NICE-SLAM recorded 24.42 dB. The rendering-quality gap was clear.

In the Limitations & Future Work of his 2024 CVPR paper, Keetha listed sensitivity to motion blur, depth noise, and aggressive rotation, plus the direction of removing the dependence on known intrinsics and dense depth, as the next tasks. He also mentioned scalability improvements.

> 📜 **Prediction vs. outcome.** In the Limitations section of SplaTAM (2024) Keetha named sensitivity to motion blur, depth noise, and aggressive rotation, as well as removal of the known-intrinsics/dense-depth dependence, as explicit tasks. The depth-removal direction was answered the same year by Matsuki's MonoGS (2024 CVPR) with a monocular RGB setting. The intrinsics-free direction and the large-scale issue are still in progress as of 2024–2025. `[partial hit + in progress]`

---

## MonoGS: monocular RGB

Hidenobu Matsuki (Imperial College Dyson Robotics Lab)'s [Matsuki et al. 2024. MonoGS (CVPR)](https://arxiv.org/abs/2312.06741) removed one constraint. It runs 3DGS SLAM from a single monocular RGB camera, with no depth sensor.

The core difficulty in the monocular setting is scale. Recovering metric scale without depth is an unsolved problem even in SfM. Matsuki's answer was to optimize the Gaussian geometry directly, and to constrain the shape of the Gaussians with an isotropy regularizer on their scales:

$$\mathcal{L}_{iso} = \sum_k \| \mathbf{s}_k - \bar{s}_k \mathbf{1} \|_1$$

Here $\mathbf{s}_k \in \mathbb{R}^3$ is the scale vector of the k-th Gaussian, and $\bar{s}_k = \frac{1}{3}\sum_j s_{k,j}$ is the mean of the three-axis scales. This isotropy regularization prevents Gaussians from degenerating into excessively thin plate shapes. Without depth supervision, monocular Gaussians tend to stick to the camera plane, and the regularizer suppresses that failure mode.
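
The regularizer is a one-liner once the scales are stacked into an array. A minimal NumPy sketch of $\mathcal{L}_{iso}$ as written above; the function name is illustrative.

```python
import numpy as np

def isotropy_loss(scales):
    """Sum over Gaussians of || s_k - mean(s_k) * 1 ||_1.

    scales: (K, 3) per-axis scales of K Gaussians
    """
    scales = np.asarray(scales, dtype=np.float64)
    mean = scales.mean(axis=1, keepdims=True)   # (K, 1) mean scale per Gaussian
    return float(np.abs(scales - mean).sum())
```

A perfectly spherical Gaussian contributes zero. A Gaussian with scales $(1, 1, 4)$ has mean 2 and contributes $1 + 1 + 2 = 4$, so the penalty grows with how far the shape departs from a sphere.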

For tracking, Matsuki optimizes the pose directly with the Gaussian rendering photometric loss. At the first-frame initialization he uses a monocular depth prior to seed the Gaussian positions, and for later frames he runs a rendering-based refinement starting from the previous pose. Without a depth sensor, the core of keeping scale is the combination of isotropic regularization and geometry-consistency losses across keyframes.

> 🔗 **Borrowed.** MonoGS's use of a monocular depth prior descends from the Godard MonoDepth2 line covered in Ch.11. The insight of self-supervised monocular depth, that structure can be recovered without depth supervision, was absorbed here as a Gaussian initialization strategy.

Matsuki came out of Imperial's Dyson Robotics Lab. The same lineage as Sucar (iMAP) and Bloesch (CodeSLAM), both from the lab Davison supervised. MonoGS represents the lab's shift from implicit MLP to explicit Gaussian representations.

On the TUM-RGBD dataset, MonoGS recorded an average ATE RMSE of 4.44 cm monocular and 1.58 cm RGB-D. Rendering quality in the RGB-D setting on Replica reached an average PSNR of 37.50 dB, comparable to other same-generation GS-SLAM systems.

---

## RTG-SLAM and real-time processing

The next problem for the GS-SLAM lineage to solve was speed. GS-SLAM and SplaTAM were hard to call real-time. Zhexi Peng's group at Zhejiang University published [RTG-SLAM (SIGGRAPH 2024)](https://arxiv.org/abs/2404.19706) with real-time as an explicit target.

RTG-SLAM's strategy is to control the number of Gaussians. Instead of optimizing all Gaussians equally, it selects those with large contribution from the current camera view and optimizes only those. Gaussians are initialized on a surfel (surface-element) basis, keeping geometry while reducing count. On the Replica dataset it approached real-time throughput.

---

## Vanished competitors

Starting in 2024, TSDF and occupancy grids were pushed out of the mainstream of SLAM mapping. They still see use in embedded systems or safety-critical environments, but on the research frontier they retreated to a supporting role. NeRF-based SLAM in the same year also moved to a supporting position, outpaced by 3DGS in rendering speed and update flexibility.

This is a shift in representation and in hardware affinity at the same time. GPU rasterizers are much better optimized than GPU ray marchers. The fact that 3DGS runs naturally on the existing graphics pipeline raised its adoption speed relative to NeRF.

> 🔗 **Borrowed.** The differentiable rendering spirit of 3DGS descends directly from NeRF. The idea of optimizing scene representations with gradients, and of linking observation and rendering through a photometric loss, is the legacy of Mildenhall et al. (2020). Kerbl swapped the representation (implicit MLP → explicit Gaussian) but inherited the paradigm.

> 📜 **Prediction vs. outcome.** Kerbl et al., in §7.4 Limitations of 3DGS (2023), named elongated artifacts and popping in under-observed regions, the absence of regularization, and memory consumption (over 20 GB during training, several hundred MB when rendering large scenes) as limits. As future work they proposed antialiasing, more principled culling, and borrowing point-cloud compression techniques. The memory axis (compression) was answered directly in 2024 by the [Compact 3DGS](https://arxiv.org/abs/2311.13681) line and [Niedermayr et al.](https://arxiv.org/abs/2401.02436). Dynamic scene extensions ([4DGS](https://arxiv.org/abs/2310.08528), [Deformable 3DGS](https://arxiv.org/abs/2309.13101)) and generation/editing ([DreamGaussian](https://arxiv.org/abs/2309.16653), [GaussianEditor](https://arxiv.org/abs/2311.14521)) are areas the original paper did not directly raise, but they branched off as separate lines around 2024. `[hit]`

---

## 🧭 Still open

Memory scaling. The number of Gaussians grows linearly with scene size. What was sufficient at a few hundred thousand on the indoor Replica dataset grows to tens of millions in an outdoor city block. Gaussian pruning and level-of-detail hierarchies are being studied, but there is no consensus yet on how to reasonably manage the memory–rendering-quality trade-off in large-scale environments. The Compact 3DGS line (Lee et al. 2024, Niedermayr et al. 2024) is exploring the compression direction.

Semantic integration. Attempts to attach semantic or language features to Gaussians ([LangSplat](https://arxiv.org/abs/2312.16084), following the NeRF-based [LERF](https://arxiv.org/abs/2303.09553), and others) appeared in 2023–2024. But no method yet updates semantic Gaussians in real time within a SLAM pipeline while also preserving tracking quality. How to handle the interference that arises when semantics and geometry are jointly optimized is the core question.

Dynamic scene. 4DGS and Deformable 3DGS proposed the direction of adding a time dimension to Gaussians. In a SLAM setting, dynamic objects move differently from the background and must be handled separately. GS-SLAM (Yan et al. 2023), SplaTAM (Keetha et al. 2024), and MonoGS (Matsuki et al. 2024) all retain the static-world assumption. The full trajectory of how SLAM has dealt with moving objects, from mask-based outlier rejection through multi-object factor graphs to deformable reconstruction, is covered separately in Ch.15b.

But 3DGS left another question. Where do the Gaussians come from? From an SfM point cloud, or from a depth sensor? A pose must be known to place a Gaussian, and Gaussians must exist to estimate a pose. This chicken-and-egg problem kept the GS-SLAM lineage dependent on external initialization. DUSt3R and its successors, covered in Ch.16, picked a different starting point. Instead of putting geometry on top of a representation, they learn geometry from scratch.

---

# Ch.15b — Where the Static-World Assumption Breaks: Dynamic and Deformable SLAM

Ch.15 ended with Gaussian Splatting holding the static-world assumption intact. GS-SLAM, SplaTAM, and MonoGS all treated the scene as fixed. A parallel lineage had refused that assumption from the start.

In 2015, Jorge Fuentes-Pacheco of CENIDET in Mexico, together with Ruiz-Ascencio and Rendón-Mancha, published [*Visual simultaneous localization and mapping: a survey*](https://link.springer.com/article/10.1007/s10462-012-9365-8) in Artificial Intelligence Review. The last chapter of that survey was "Dynamic and Deformable Environments." Papers dealing with moving objects had appeared earlier, but most treated them as outliers for RANSAC to reject. The Fuentes-Pacheco survey was the first document to declare dynamic environments an independent field. Ten years later, in 2025, the SLAM Handbook devotes 37 pages to this topic, the largest allocation of any chapter. There are six authors: MIT's Lukas Schmid, TU München's Daniel Cremers, UTS's Shoudong Huang, and Zaragoza's Montiel, Neira, and Civera. Three schools across three continents converged on one chapter because of what happened between 2015 and 2025. The static world had been the starting point of every SLAM system, but the robots that actually existed (self-driving cars on the street, service robots in homes, endoscopes inside organs) had to work outside that assumption.

---

## 15b.1 Three axes

The frame Schmid et al. sketch in Handbook Ch.15 §15.1 rewrites the earlier definition of "dynamic SLAM." Whether an environment is dynamic or static is not a property of the environment but *a property of the observation*. The same physical motion becomes short-term dynamic for one robot and long-term dynamic for another. The ratio between observation rate $\text{Obs}$ and change rate $\text{Dyn}$ decides it. When $\text{Dyn} \ll \text{Obs}$, motion is visible between frames; when $\text{Dyn} \gg \text{Obs}$, the scene has changed between visits.

This perspective produces three axes. The observation axis splits short-term from long-term. The reconstruction axis decides whether to estimate pose only, scene geometry as well, or to go all the way to 4D spatio-temporal understanding. The time axis splits online from offline. Earlier accounts compressed the field into the single question "how do we remove dynamic objects," but seen in this three-axis space, that question occupies only one of the $2 \times 3 \times 2 = 12$ cells. This is why the field fractured. Researchers standing in different cells had been using the same words differently.

---

## 15b.2 Short-term: from masking to multi-object SLAM

The first solution was simple. Erase what moves.

Berta Bescos, then a doctoral student at Zaragoza, published [DynaSLAM](https://arxiv.org/abs/1806.05620) in RA-L in 2018. The system inserted Mask R-CNN into the ORB-SLAM2 frontend and masked people and cars in advance. Masked regions were excluded from keypoint extraction. Simple, but it worked. On the TUM-RGBD walking sequence, ATE dropped to single-digit centimeters.

In the same period, UCL's Martin Rünz made a different choice. Do not erase the moving objects — track them separately. Under Lourdes Agapito's supervision, he released [Co-Fusion (Rünz & Agapito, 2017)](https://arxiv.org/abs/1706.06629) and, the following year, [MaskFusion (Rünz et al., 2018)](https://arxiv.org/abs/1804.09194) back to back. Each object was assigned its own surfel model, and camera trajectory and object trajectories were estimated jointly. Edinburgh's Raluca Scona and Imperial's Stefan Leutenegger took yet another route at ICRA 2018 with [StaticFusion](https://arxiv.org/abs/1806.05628). Without semantic segmentation, they separated dynamic regions using residual clustering alone. A direction that does not depend on segmentation errors.

At this point the idea shifts once more. What if we bring moving objects into the state and estimate them together? ANU's Jun Zhang led [VDO-SLAM (Zhang et al., 2020)](https://arxiv.org/abs/2005.11052), which promoted each dynamic object to a variable in the factor graph. Camera pose $T_i^w \in SE(3)$ and object $k$'s pose $T_{k,i}^w \in SE(3)$ coexisted in the same graph. A constant-velocity factor imposed continuity on the object's linear and angular velocities. Joint optimization ran over the product manifold of the camera's SE(3) and each object's SE(3). Zaragoza's Bescos implemented the same idea on an ORB-SLAM2 base in 2021 with [DynaSLAM II (Bescos et al., 2021)](https://arxiv.org/abs/2010.07820). CMU's Yuheng Qiu extended it to articulated bodies (objects with joints, like humans) in [AirDOS](https://arxiv.org/abs/2109.09903), published in RA-L in 2022.
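One generic way to write such a constant-velocity factor is to penalize the change in an object's frame-to-frame motion. This is a sketch of the idea, not VDO-SLAM's exact parameterization; the symbols $\Delta_{k,i}$, $r_{k,i}$, $E_{\text{cv}}$, and the covariance $\Sigma$ are introduced here for illustration only:

$$
\Delta_{k,i} = \left(T_{k,i-1}^{w}\right)^{-1} T_{k,i}^{w}, \qquad
r_{k,i} = \operatorname{Log}\!\left(\Delta_{k,i}^{-1}\,\Delta_{k,i+1}\right) \in \mathbb{R}^{6}, \qquad
E_{\text{cv}} = \sum_{k}\sum_{i} \lVert r_{k,i} \rVert_{\Sigma}^{2}
$$

When object $k$ keeps the same body-frame twist across consecutive frames, $\Delta_{k,i+1} = \Delta_{k,i}$ and the residual vanishes. The term $E_{\text{cv}}$ is simply added to the reprojection and odometry terms already in the graph, which is what "one more variable and one more factor" means in practice.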

> 🔗 **Borrowed.** VDO-SLAM's factor-graph extension inherits directly from the iSAM tradition Dellaert and Kaess established in Ch.6 graph SLAM. Adding one more variable and one more factor, in dynamic SLAM, became the act of putting a single moving car on the map.

A third angle came in from the inertial side. KAIST URL's Song, Lim, Lee, and Myung published [DynaVINS](https://arxiv.org/abs/2208.11500) in RA-L in 2022 without using either semantic masks or multi-object tracking. Observations that disagreed with the pose prior from IMU preintegration had their factor weights reduced during bundle adjustment, cutting off the path through which dynamic features could leak into the joint state. The same group's [DynaVINS++](https://arxiv.org/abs/2410.15373), RA-L 2024, reformulated this as adaptive truncated least squares, catching even the failure mode where dynamic features back-propagate into IMU bias estimation and diverge.
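The mechanism can be illustrated on a toy problem. The sketch below is a generic truncated-least-squares reweighting loop on a scalar state, not the DynaVINS or DynaVINS++ formulation: the function names, the threshold `c`, and the synthetic observations are all assumptions made for illustration. The prior plays the role of the IMU-preintegrated pose, and observations that disagree with it receive zero weight.

```python
import numpy as np


def tls_weights(residuals, c):
    """Truncated least squares: weight 1 inside the threshold c, 0 outside."""
    return (np.abs(residuals) <= c).astype(float)


def estimate_with_prior(observations, prior, c=0.5, prior_weight=1.0, iters=10):
    """Iteratively reweighted least squares on a scalar state, started at the prior."""
    x = prior
    w = np.ones_like(observations)
    for _ in range(iters):
        w = tls_weights(observations - x, c)
        # Weighted least squares over surviving observations plus the prior term.
        x_new = (w @ observations + prior_weight * prior) / (w.sum() + prior_weight)
        if abs(x_new - x) < 1e-9:
            break
        x = x_new
    return x, w


rng = np.random.default_rng(0)
static_obs = 1.0 + 0.02 * rng.standard_normal(30)   # features on the static background
dynamic_obs = 3.0 + 0.02 * rng.standard_normal(20)  # features riding on a moving object
obs = np.concatenate([static_obs, dynamic_obs])

x_hat, w = estimate_with_prior(obs, prior=1.1)  # prior: a slightly biased IMU prediction
print(f"estimate={x_hat:.3f}, kept={int(w.sum())} of {len(obs)} observations")
```

Without the prior as a starting point, 20 consistent dynamic observations could pull a plain least-squares estimate toward the moving object; with it, they are cut off before they enter the solution.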

The Handbook organizes this lineage under §15.2.3 "Dense Dynamic SLAM" and places Schmid's own [Dynablox (Schmid et al., 2023)](https://arxiv.org/abs/2304.10049) as the current form of LiDAR moving object segmentation (MOS). The 2025 [AnyCam](https://arxiv.org/abs/2503.23282) pulls 4D directly from everyday video using a transformer backbone. It is the 2025 version of the "simultaneous tracking + reconstruction" line that Rünz opened in 2017.

---

## 15b.3 Long-term: maps across time

If short-term is motion between frames, long-term is change between visits. The chair I saw yesterday has been pushed aside today. This problem grew from a different lineage.

Sherbrooke's Mathieu Labbé, under Michaud's supervision, developed [RTAB-Map](https://introlab.github.io/rtabmap/) from 2013, borrowing directly from human memory models. It placed short-term, working, and long-term memory in a hierarchy and moved nodes according to time and observation frequency. Within a session a node stayed in working memory; if not frequently revisited it descended to long-term memory; if it lost meaning it was discarded. In a 2019 JFR paper, Labbé laid out how this structure scales to multi-session SLAM. At KAIST, in Hyun Myung's Urban Robotics Lab, Hyungtae Lim published [ERASOR](https://arxiv.org/abs/2103.04316) in 2021, taking a different angle. He turned the problem of making the map clean into scene differencing: finding points that had disappeared between two passes through the same place.

A frame that runs through all of Handbook §15.3: **absence of evidence vs evidence of absence**. One has to distinguish whether the chair is not there from whether one simply did not see it. Without this distinction, map cleaning erases legitimate objects and change detection misjudges occluded regions. Schmid's [Panoptic Multi-TSDF](https://arxiv.org/abs/2109.10165), RA-L 2022, addressed this with a submap structure. Each object was managed as an independent submap, and active and inactive were separated under local consistency. The same group's [Khronos](https://arxiv.org/abs/2402.13817), 2024, went one step further. It robustified association with graduated non-convexity, and even after loop closure it ran deformable geometric change detection, estimating the moment of change for each object. The point at which a metric-semantic map turns into a 4D spatio-temporal map.

> 🔗 **Borrowed.** The submap structure of Panoptic Multi-TSDF reweaves, in different material, the multi-map management Atlas introduced in Ch.7 ORB-SLAM. The keyframe submap has become a panoptic object submap, but the principle — split when a single map grows too large — remains.

The same question rolled along separately on the LiDAR side. KAIST URL's Jang, Lee, Nahrendra, and Myung released [Chamelion](https://arxiv.org/abs/2602.08189) in 2026, stacking scene-mixing augmentation on top of a dual-head network to run change detection without ground truth in transient environments (construction sites, frequently rearranged indoor spaces) where structure flips moment to moment. Where Khronos built 4D on the RGB-D and panoptic side, Chamelion carries the same question toward long-term map maintenance on point clouds.

Another axis in this lineage is research on recurrence. Örebro's Tomáš Krajník and Achim Lilienthal, in Sweden, developed **frequency maps** from 2014, modeling periodic events (commuter traffic flow, day-night lighting changes) on a Fourier basis. Martin Magnusson's group, also at Örebro, consolidated this in 2019 as Maps of Dynamics (MoD), encoding *typical motion patterns* directly into the map. "People walk to the left in this corridor" becomes part of the map. The 2023 [Changing-SLAM (Soares et al., 2023)](https://arxiv.org/abs/2301.09479) attempted to handle short-term with a Kalman filter and long-term with semantic class matching, simultaneously, on top of an ORB-SLAM extension.
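The frequency-map idea can be sketched in a few lines. This is a simplified illustration in the spirit of the Fourier-basis approach, not the published implementation: the function names, the hourly "door open" signal, and the choice to keep a single spectral component are assumptions made here.

```python
import numpy as np


def fit_frequency_model(states, n_components=1):
    """Keep the mean and the strongest n_components of the spectrum of a binary state."""
    n = len(states)
    spectrum = np.fft.rfft(states) / n
    mean = spectrum[0].real
    # Rank the non-DC bins by magnitude; +1 restores the index into `spectrum`.
    strongest = np.argsort(np.abs(spectrum[1:]))[::-1][:n_components] + 1
    return mean, [(int(k), spectrum[k]) for k in strongest], n


def predict(model, t):
    """Probability that the state is 'on' at time index t (may lie in the future)."""
    mean, components, n = model
    p = mean
    for k, coeff in components:
        # Factor 2 accounts for the conjugate bin (valid below the Nyquist bin).
        p += 2.0 * (coeff * np.exp(2j * np.pi * k * t / n)).real
    return float(np.clip(p, 0.0, 1.0))


# Two weeks of hourly observations: a door that is open from 08:00 to 18:00.
hours = np.arange(24 * 14)
door_open = ((hours % 24 >= 8) & (hours % 24 < 18)).astype(float)

model = fit_frequency_model(door_open, n_components=1)
p_noon = predict(model, t=24 * 14 + 12)   # noon on the day after the last observation
p_midnight = predict(model, t=24 * 14)    # midnight on that same day
print(f"dominant bin={model[1][0][0]}, P(open at noon)={p_noon:.2f}, "
      f"P(open at midnight)={p_midnight:.2f}")
```

The map cell no longer stores "open" or "closed" but a handful of spectral coefficients, which is what lets the robot query the state at a time it has not yet observed.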

---

## 15b.4 Deformable: when the shape itself changes

What happens when even the background moves? Civera and Montiel in Zaragoza stood in front of this question for a long time.

The start was elsewhere. The 2015 CVPR best paper was [DynamicFusion](https://grail.cs.washington.edu/projects/dynamicfusion/), by the University of Washington's Newcombe, Fox, and Seitz. An embedded deformation graph was layered on KinectFusion's canonical TSDF, reconstructing non-rigid objects (faces, torsos) in front of the camera in real time. A deformation graph with rotation and translation assigned to each node was optimized every frame. In the same line, TU München's Matthias Innmann added color information with [VolumeDeform](https://arxiv.org/abs/1603.08161) in 2016, and in 2017 Miroslava Slavcheva introduced [KillingFusion](https://campar.in.tum.de/pub/slavcheva2017cvpr/slavcheva2017cvpr.pdf), bringing in Killing vector field regularization to allow topology changes, such as a hand parting from the torso. At MIT, under Tedrake's supervision, Wei Gao's 2019 [SurfelWarp](https://arxiv.org/abs/1904.13073) chose surfels over TSDF to gain exploration friendliness.

> 🔗 **Borrowed.** DynamicFusion's embedded deformation graph was lifted directly from the ED graph Sumner, Schmid, and Pauly published in computer graphics in 2007. A sparse control graph for mesh deformation became the variable representation for real-time non-rigid SLAM.

The monocular story played out in Zaragoza. Juan Lamarca, who finished his doctorate under Montiel, published [DefSLAM](https://arxiv.org/abs/1908.08918) in RA-L in 2021. He recomputed a template at each keyframe with isometric NRSfM and mixed an ORB frontend with Lucas-Kanade optical flow to maintain tracks. Its limitation was that it assumed planar topology. The same group's Juan J. Gómez Rodríguez removed that limitation in 2023 with [NR-SLAM](https://arxiv.org/abs/2308.04036), handling arbitrary topology with a dynamic deformable graph and adding temporal regularization through a visco-elastic model. Handbook §15.4.2 organizes this lineage as the "monocular line of deformable SLAM."

The applications cluster on the medical side. Tsinghua's Song released [MIS-SLAM](https://ieeexplore.ieee.org/document/8458232) in 2018, tracking the deformation of intraoperative organs with stereo endoscopy. Children's National's Jayender group developed EMDQ (Expectation Maximization + Dual Quaternion), which estimated a smooth deformation field over SURF features. Intra-operative navigation in the real environment of minimally invasive surgery is the target of these systems.

One fundamental problem highlighted in Handbook §15.4.1: **Floating Map Ambiguity**. Without a prior, the rigid motion of a non-rigid object cannot be distinguished from the rigid motion of the camera. Whether the hand moved 30 cm or the camera moved 30 cm, observation alone cannot say which. This ambiguity differs in character from the long-standing scale ambiguity of monocular SLAM. Not only scale but also trajectory and deformation are coupled, so the problem becomes ill-posed. DefSLAM and NR-SLAM break this ambiguity partially with isometric and visco-elastic priors, but no principled solution exists as of 2026.

> 📜 **Prediction vs. outcome.** In §7 Future Work of DynamicFusion (2015), Newcombe listed "extension to larger scenes and topology changes" and "integration with loop closure" as the next challenges. Topology change was answered in 2017 by KillingFusion. Large-scale scenes were addressed in part by the surfel-based SurfelWarp (2019). Integration with loop closure did not appear until 2024, under the name deformable geometric change detection in Khronos. Nine years in total. `[partial hit]`

---

## 15b.5 The intellectual lineage of three schools

The arrangement of the six Handbook authors in this chapter is itself the evidence.

**The Zaragoza school** (Montiel, Neira, Civera, Lamarca, Rodríguez) is the home of deformable geometry, running from MonoSLAM (Ch.5) through ORB-SLAM (Ch.7), DynaSLAM, DefSLAM, and NR-SLAM. A twenty-year tradition of pushing geometry to the limit in the monocular setting. **The Imperial/TUM lineage** (Davison, Newcombe, Rünz, Cremers) takes the dense and learning-based axis. KinectFusion (Ch.9) led to DynamicFusion; SLAM++ (Ch.18) led to Co-Fusion and MaskFusion. As the Cremers group shifted toward change-aware SLAM in the 2020s, it became the center of a new lineage. **The Cambridge/ETH/MIT lineage** (Schmid, Leutenegger, Agapito) converged on panoptic 4D. Schmid himself completed his doctorate at ETH Zürich, passed through the Carlone group at MIT, and went on to JPL. That trajectory overlaps with the sequence KillingFusion → Panoptic Multi-TSDF → Dynablox → Khronos.

The six-author composition of Handbook Ch.15 reproduces these three schools almost exactly. That the field has split into three branches is self-evidenced by the author list.

---

## 🧭 Still open

**Absence of evidence vs evidence of absence.** Distinguishing whether an object has disappeared from the map or was merely occluded remains a foundational difficulty of long-term SLAM. Schmid's Panoptic Multi-TSDF gave a partial answer with an active-submap structure, but in outdoor large-scale environments and in settings with occlusion above 60%, the decision error remains large. As of 2026, no paper has claimed a principled solution.

**Floating Map Ambiguity.** In deformable SLAM, separating the camera's rigid motion from the object's rigid motion is still only worked around with isometric and visco-elastic priors. What conditions identify the two motions without a prior, and what observations break the ambiguity, are unresolved. Lamarca's [2023 IJRR paper](https://arxiv.org/abs/2302.03710) laid out some of the observation conditions, but no general theory yet.

**Online deformable SLAM.** DefSLAM and NR-SLAM come close to real time, but no system runs change-aware integration at the Khronos level online on a monocular RGB input. The optimization cost crosses the real-time limit. GPU acceleration and learned priors open possibilities, but no validated pipeline has appeared yet.

**The real-world gap in medical MIS.** MIS-SLAM and NR-SLAM work on phantoms and ex vivo data, but robustness drops in the actual surgical environment — blood, smoke, tool occlusion, abrupt lighting changes. Gaussian-based attempts such as 2024's EndoGS are appearing, but no system has been reported to reach deployment level.

---

## Note: a reframing recommendation for Ch.18 §18.4

Seen from this chapter, the title of Ch.18 §18.4, "The overheating and failure of Semantic SLAM," reads differently. Dense dynamic SLAM, change-aware SLAM, and deformable SLAM achieved real success between 2020 and 2025 by using semantics as an *auxiliary cue*. What failed was the prediction, in the SLAM++ mode, that "semantic will take over the SLAM frontend", that is, the **object-as-landmark** route. Narrowing the §18.4 title to "The contraction of the object-as-landmark route" and cross-referencing this chapter would be the natural revision.

---

The question Ch.15b has been circling, what is the right representation for a world that changes, connects back to the main line at a different level. In the GS-SLAM and NeRF-SLAM lineages, the representation problem was approached by making rendering faster or more compact. Ch.16 approaches it differently: not by improving the representation pipeline, but by learning the geometry prior itself. DUSt3R and its successors begin there.

---

# Ch.16 — Foundation 3D: From DUSt3R to VGGT

Philippe Weinzaepfel and Jerome Revaud of Naver Labs Europe released CroCo in 2022, proposing a cross-view self-supervised pretraining scheme that learned visual representations using the fact that two images captured the same scene as its only cue. It looked like a feature-learning paper. A year later, when the same team built a system on top of CroCo's architecture that output pointmaps directly without calibration, DUSt3R moved past feature learning and redefined multi-view geometry. The lineage that began at Naver Labs Europe carried over to the VGG group at Oxford, and as of 2026 it is rewriting the question of what SfM is.

---

## 16.1 DUSt3R — learned pointmap

For the ten years starting in 2013, 3D reconstruction followed the same procedure. Find feature points, match them, estimate intrinsic and extrinsic camera parameters, build a point cloud through triangulation, and refine the whole thing with bundle adjustment. [COLMAP (Schönberger & Frahm, 2016)](https://openaccess.thecvf.com/content_cvpr_2016/html/Schonberger_Structure-From-Motion_Revisited_CVPR_2016_paper.html) was the most complete form of this pipeline. Errors dropped but the structure of the procedure did not change.

[Shuzhe Wang et al. 2023. DUSt3R: Geometric 3D Vision Made Easy](https://arxiv.org/abs/2312.14132) routes around this procedure. It takes two images as input and outputs 3D coordinates for each pixel directly. It does not require intrinsic parameters (focal length, principal point). The output, called a pointmap, gives coordinates in a common 3D space rather than in image coordinates. You do not need to know what lens the camera is using.

DUSt3R's transformer uses the encoder-decoder structure inherited from CroCo. Each image is encoded independently, then the decoder refers to the other image's encoder output through cross-attention. If self-attention handles relations between pixels within a single image, cross-attention implicitly learns the correspondence between two images. Rather than coding as a rule which pixel sees the same 3D point as which, the model absorbs it as a pattern from large-scale data. DUSt3R's training data consists of millions of image pairs from MegaDepth, ScanNet, ARKitScenes, BlendedMVS, and others. For the SfM-derived part of that data, such as MegaDepth, COLMAP produced the ground truth. The inversion where classical SfM supplies the ground truth for the learning era happens here.

> 🔗 **Borrowed.** DUSt3R's backbone comes from ViT ([Dosovitskiy et al. 2020](https://arxiv.org/abs/2010.11929)). But the decisive foothold is CroCo ([Weinzaepfel et al. 2022](https://arxiv.org/abs/2210.10716)), prior work from within Naver Labs Europe. CroCo proposed a cross-view self-supervised pretraining where masked regions in one image are reconstructed from information in the other. DUSt3R inherited CroCo's encoder-decoder structure as is and only swapped the task to "pointmap prediction."

Once a pair of pointmaps is obtained from two images, camera pose is recovered by rigid alignment between these pointmaps. It is a generalization of Procrustes alignment. Pose estimation is no longer a separate step but a derivative of the pointmap.
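The alignment step has a closed form. The sketch below is the textbook SVD solution (Kabsch with Umeyama's scale term) for the similarity transform between two 3D point sets with known one-to-one correspondence, which is the situation when the same pixels are expressed in two camera frames. It is not DUSt3R's code: the function name and the synthetic data are assumptions, and a real pipeline would also weight points by predicted confidence.

```python
import numpy as np


def align_pointmaps(src, dst):
    """Least-squares similarity transform (s, R, t) such that dst ≈ s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding 3D points.
    """
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    cov = dst_c.T @ src_c / len(src)           # 3x3 cross-covariance
    U, S, Vt = np.linalg.svd(cov)
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[2, 2] = -1.0                         # guard against a reflection
    R = U @ D @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(S) @ D) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t


# Synthetic check: the same 200 points seen in two frames.
rng = np.random.default_rng(0)
points_cam1 = rng.uniform(-1.0, 1.0, size=(200, 3))
angle = np.deg2rad(30.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0,            0.0,           1.0]])
s_true, t_true = 1.7, np.array([0.2, -0.1, 0.5])
points_cam2 = s_true * points_cam1 @ R_true.T + t_true

s, R, t = align_pointmaps(points_cam1, points_cam2)
print(f"scale={s:.3f}, translation={np.round(t, 3)}")
```

Because every pixel already carries a 3D coordinate, the correspondence is given by pixel index and no feature matching or RANSAC over an essential matrix is needed to reach this step.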

When extending to three or ten images, DUSt3R solves a global alignment. It is an optimization problem that registers the pointmaps of all image pairs into one common coordinate frame. Only at this stage does something resembling bundle adjustment appear, but it proceeds without feature matching or camera models.
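Up to notation, the global alignment objective in the DUSt3R paper can be written as follows. Here $\mathcal{E}$ is the set of image pairs, $X_i^{v,e}$ is the pointmap value predicted for pixel $i$ of view $v$ in pair $e$, $C_i^{v,e}$ is its predicted confidence, $P_e$ and $\sigma_e$ are a per-pair rigid transform and scale, and $\chi_i^{v}$ is the world-frame point being solved for:

$$
\chi^{*} = \arg\min_{\chi,\,P,\,\sigma} \sum_{e \in \mathcal{E}} \sum_{v \in e} \sum_{i=1}^{HW} C_i^{v,e} \left\lVert \chi_i^{v} - \sigma_e P_e X_i^{v,e} \right\rVert, \qquad \prod_{e} \sigma_e = 1
$$

The constraint on the scales rules out the trivial solution $\sigma_e = 0$. The residual is a 3D distance between points rather than a 2D reprojection error, which is the main structural difference from classical bundle adjustment.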

---

## 16.2 Swallowing matching: MASt3R

DUSt3R's output is closer to reconstruction than to novel view synthesis. Yet an important sub-task in reconstruction, finding precise pixel correspondences between two images (feature matching), DUSt3R handles only implicitly. To replace the explicit matching performed by SuperPoint+SuperGlue or LightGlue, additional machinery was needed.

[Vincent Leroy et al. 2024. Grounding Image Matching in 3D with MASt3R (ECCV)](https://arxiv.org/abs/2406.09756) adds a matching head to DUSt3R. It is trained to output a feature descriptor for each pixel along with the pointmap, with joint learning that keeps the 3D position and the feature consistent. The resulting features are anchored in 3D space rather than in the image plane. Matching simplifies into nearest-neighbor search over these feature descriptors.
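A minimal sketch of that matching step is mutual (reciprocal) nearest-neighbor search over the per-pixel descriptors. The brute-force version below is for illustration only; the function name and the random stand-in descriptors are assumptions, and a dense matcher would need a faster search than a full similarity matrix.

```python
import numpy as np


def mutual_nearest_neighbors(desc_a, desc_b):
    """Return (M, 2) index pairs (i, j) where i and j are each other's nearest neighbor.

    desc_a: (Na, D) descriptors from image A; desc_b: (Nb, D) from image B.
    Similarity is cosine similarity.
    """
    a = desc_a / np.linalg.norm(desc_a, axis=1, keepdims=True)
    b = desc_b / np.linalg.norm(desc_b, axis=1, keepdims=True)
    sim = a @ b.T                      # (Na, Nb) cosine similarities
    nn_ab = sim.argmax(axis=1)         # best match in B for each descriptor in A
    nn_ba = sim.argmax(axis=0)         # best match in A for each descriptor in B
    idx_a = np.arange(len(a))
    keep = nn_ba[nn_ab] == idx_a       # keep only reciprocal agreements
    return np.stack([idx_a[keep], nn_ab[keep]], axis=1)


# Stand-in descriptors: image B sees the same 50 points, shuffled and slightly perturbed.
rng = np.random.default_rng(0)
desc_a = rng.standard_normal((50, 24))
perm = rng.permutation(50)
desc_b = desc_a[perm] + 0.01 * rng.standard_normal((50, 24))

matches = mutual_nearest_neighbors(desc_a, desc_b)
print(f"{len(matches)} reciprocal matches out of {len(desc_a)} descriptors")
```

The reciprocity check is what discards one-sided matches: a pixel in A may have a nearest neighbor in B even when the true correspondence is occluded, but that neighbor will usually prefer some other pixel in A.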

> 🔗 **Borrowed.** MASt3R's 3D-anchored matching attacks the problem SuperGlue ([Sarlin et al. 2020](https://arxiv.org/abs/1911.11763)) was trying to solve (resolving ambiguity in 2D descriptors through context) from a different direction. SuperGlue reduced ambiguity in 2D matching with a graph neural network. MASt3R learns 3D structure directly, eliminating the source of the ambiguity itself.

Within months of MASt3R's release, multiple groups in the SLAM community reported experiments that replaced the SuperPoint+SuperGlue combination with MASt3R. In late 2024, [Riku Murai, Eric Dexheimer, Andrew Davison](https://arxiv.org/abs/2412.12392) at Imperial College London released MASt3R-SLAM, using MASt3R's matching as the frontend and graph-based global optimization as the backend. The classical SLAM architecture kept its shape while nearly all of the internal parts were swapped out.

MASt3R's strength is that dense matching is possible without ground-truth calibration. As of 2026, inserting DUSt3R or MASt3R into a COLMAP-based SfM pipeline is becoming standard in experimental setups.

> 📜 **Prediction vs. outcome.** The DUSt3R paper itself did not include a dedicated "Future Work" section, but the structure of pair-wise + global alignment implies sequence processing and real-time operation as the next tasks. Spann3R arrived in August 2024, MASt3R-SLAM at the end of 2024. The two follow-up works responded to sequential extension and SLAM integration within 6–12 months. `[in progress]`

---

## 16.3 Spann3R — sequential processing

But batch processing has a fundamental constraint. In SLAM the images are not all available in advance.

DUSt3R and MASt3R take a set of images as input and process them in batch. The approach is to lay out all the images in a bag at once and register them. SLAM is different. Images arrive in temporal order, and the system must update the map at each frame.

[Hengyi Wang & Lourdes Agapito 2024. 3D Reconstruction with Spatial Memory (Spann3R)](https://arxiv.org/abs/2408.16061) reshapes DUSt3R's structure for sequential processing. The core idea is spatial memory. Information from already-processed frames is stored in a memory bank, and when a new frame comes in the system performs cross-attention against this memory. Attention decides which information from past frames each pixel of the new image is related to.
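The memory read can be pictured as ordinary scaled dot-product attention with the new frame's features as queries and stored entries as keys and values. The sketch below shows only that generic mechanism; the shapes, names, and single-head form are assumptions made here, and it omits Spann3R's learned projections and memory management.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def read_memory(queries, mem_keys, mem_values):
    """Single-head cross-attention read.

    queries: (Q, D) features of the incoming frame.
    mem_keys: (M, D) and mem_values: (M, V) entries written by earlier frames.
    Returns the fused features (Q, V) and the attention map (Q, M).
    """
    scale = 1.0 / np.sqrt(queries.shape[-1])
    attn = softmax(queries @ mem_keys.T * scale, axis=-1)
    return attn @ mem_values, attn


rng = np.random.default_rng(0)
mem_keys = rng.standard_normal((6, 64))     # six stored memory slots
mem_values = rng.standard_normal((6, 16))
queries = 4.0 * mem_keys[[2, 5]]            # two query tokens that resemble slots 2 and 5

fused, attn = read_memory(queries, mem_keys, mem_values)
print("most-attended slots:", attn.argmax(axis=1))
```

Each query row of `attn` is a distribution over past entries, so the question "which earlier observation does this pixel relate to" is answered softly rather than by an explicit data-association step.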

> 🔗 **Borrowed.** Spann3R's spatial memory mechanism is similar in concept to cross-attention memory. Structurally, it inherits DUSt3R's pretrained ViT encoder-decoder as is, and sets up memory keys that combine the decoder output (geometric features) with image features so that memory lookup reflects appearance and distance at once. It is a route where the geometric representation DUSt3R captured is reused as the index of the sequential memory.

Spann3R carries over DUSt3R's property of working without a calibrated camera. As each sequential image comes in, the map built so far is updated incrementally. It is not fully real-time, but it is one step closer to SLAM application than DUSt3R's batch approach.

---

## 16.4 VGGT — multi-view joint inference

Spann3R made sequential processing possible. But DUSt3R's basic skeleton of pair-wise pointmap + global alignment remained. The VGG group at Oxford enters. Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny pushed DUSt3R's logic all the way at the start of 2025. Rather than two images, they take an arbitrary number of images as simultaneous input and output camera pose, depth, and point cloud in a single forward pass.

[Jianyuan Wang et al. 2025. VGGT: Visual Geometry Grounded Transformer](https://arxiv.org/abs/2503.11651) turned DUSt3R's pair-wise processing into genuine multi-view joint inference. To process $N$ images with DUSt3R you compute $N(N-1)/2$ pairs of pointmaps and then solve global alignment. VGGT pushes $N$ images through the transformer at once. Attention processes the relations among all image pairs simultaneously.
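The two costs grow differently, and a few lines of arithmetic make the trade concrete. The sketch below counts image pairs for the pair-wise route and attention-score entries for one global attention layer over all tokens. The token count per image is an assumption (a 518×518 input cut into 14×14 patches), and the entry count is per head and per layer, ignoring any memory-efficient attention kernel.

```python
def dust3r_pair_count(n_images: int) -> int:
    """Number of unordered image pairs in an exhaustive pair-wise scheme."""
    return n_images * (n_images - 1) // 2


def global_attention_entries(n_images: int, tokens_per_image: int) -> int:
    """Size of one dense attention-score matrix over the tokens of all images."""
    total_tokens = n_images * tokens_per_image
    return total_tokens ** 2


TOKENS_PER_IMAGE = 1369  # assumption: 518 / 14 = 37 patches per side, 37 * 37 tokens

for n in (10, 100, 1000):
    print(f"N={n:>5}: pairs={dust3r_pair_count(n):>7,}  "
          f"attention entries={global_attention_entries(n, TOKENS_PER_IMAGE):.2e}")
```

Both quantities are quadratic in $N$. The joint route removes the iterative alignment stage but pays its quadratic term inside a single forward pass, where it shows up as GPU memory rather than as optimization time.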

> 🔗 **Borrowed.** In the context where VGGT redefines COLMAP's role, there is an older lineage. The irony that COLMAP itself produces the ground truth of the learning era was noted earlier. And each stage of what COLMAP actually performs (pair-wise geometry estimation → graph construction → global optimization) is implicitly reproduced inside VGGT. The form is one where what classical SfM implemented as an algorithm is absorbed by a foundation model as weights.

In quantitative comparisons with DUSt3R, VGGT showed consistent advantage in camera pose estimation accuracy and point cloud quality. Processing speed is also faster because there is no global alignment optimization. And this is where a conceptual problem arises.

---

## 16.5 The boundary between pose estimation and reconstruction dissolves

Traditional computer vision distinguished the two problems. Camera pose estimation finds the current position in an already-known map, and 3D reconstruction recovers the geometry of an unknown environment. SLAM was hard because it solved both at the same time.

The systems from DUSt3R to VGGT are indifferent to this distinction. Predict a pointmap and pose comes out; get a pose and reconstruction comes out. The ordering itself of "first get the camera, then the point cloud" or "first get the point cloud, then the camera" is gone. A single forward pass outputs everything at once.

The multi-view geometry learned so far is not discarded. DUSt3R, MASt3R, and VGGT work because they have learned, inside the transformer weights, the geometric principles that epipolar constraint, triangulation, and bundle adjustment implement. What has been discarded is not those principles but the way of implementing them as explicit algorithms. Geometry has not vanished; it has been absorbed implicitly.

For the researcher, however, this is a real shift. You cannot debug DUSt3R the way you debugged Schönberger's COLMAP code. Where it failed and why is buried inside attention weights. The interpretability problem returns in a new form.

> 📜 **Prediction vs. outcome.** The MASt3R paper closed briefly, suggesting that matching without ground-truth calibration was open to several downstream tasks. It was not an explicit prediction of pipeline reshaping. As of 2026 several photogrammetry software packages are evaluating DUSt3R/MASt3R as an initialization stage, and the pattern looks more like hybrid insertion than full replacement. `[in progress]`

One research institute, Naver Labs Europe, walked through CroCo (2022) → DUSt3R (2023) → MASt3R (2024) within two years, a single team releasing in succession the stack from pretraining methodology to matching system. Not a large place like Google Brain, DeepMind, or Meta AI, but Naver Labs Europe, carried by a small team centered on Weinzaepfel, Revaud, and Leroy. The work of moving to the SLAM stage was taken up by the Davison group at Imperial College London (MASt3R-SLAM).

---

## 16.6 Another branch — semantic foundation enters the map

The narrative so far has been geometric foundation. DUSt3R, MASt3R, and VGGT deal with pointmaps, camera poses, and geometric structure. But around 2022 the phrase "foundation 3D" began to be used in the SLAM literature along two branches. One is the geometric lineage that started at Naver Labs Europe. The other is the semantic lineage that pulls CLIP, DINO, and SAM into the map. The former removed calibration; the latter removed the dictionary.

The semantic branch began in Luca Carlone's group at MIT. [Nathan Hughes et al. 2022. Hydra: A Real-time Spatial Perception System for 3D Scene Graph Construction and Optimization](https://arxiv.org/abs/2201.13360) placed a hierarchical scene graph of objects → places → rooms → buildings online on top of Kimera's (Rosinol 2020) metric-semantic mesh. It still sat inside the closed-set constraint, which the Handbook pins down as a predefined dictionary of 100–1000 labels, but it showed for the first time that a hierarchical map could run in real time.

The wall of the dictionary was torn down by foundation models. [Songyou Peng et al. 2023. OpenScene: 3D Scene Understanding with Open Vocabularies (CVPR)](https://arxiv.org/abs/2211.15654) came out of the ETH/Pollefeys group, and shortly after [Qiao Gu et al. 2024. ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning (ICRA)](https://arxiv.org/abs/2309.16650) from a Montréal-MIT collaboration. OpenScene distilled CLIP features onto 3D point clouds, allowing queries like "how close is this point to a chair" to be answered by natural language. ConceptGraphs went a step further. Instead of class labels, a VLM generated language descriptions as node attributes, and an LLM narrated relations between objects. As Peng's open-vocabulary features combined with Hydra's hierarchical structure, scene graphs came to accept even concepts not in the dictionary.
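The query mechanism reduces to a similarity lookup once every 3D point carries a feature in the same embedding space as text. The sketch below uses random vectors as stand-ins for CLIP embeddings, so it demonstrates only the lookup, not the distillation; the function names and the synthetic "chair" cluster are assumptions made for illustration.

```python
import numpy as np


def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def query_points(point_features, text_embedding):
    """Cosine similarity between each 3D point's feature and one text embedding."""
    return normalize(point_features) @ normalize(text_embedding)


rng = np.random.default_rng(0)
dim = 64
chair_text = rng.standard_normal(dim)                           # stand-in for the text "a chair"
chair_points = chair_text + 0.3 * rng.standard_normal((5, dim)) # points lying on a chair
other_points = rng.standard_normal((5, dim))                    # points on unrelated surfaces
point_features = np.vstack([chair_points, other_points])

scores = query_points(point_features, chair_text)
top = np.argsort(scores)[::-1][:5]
print("highest-scoring point indices:", sorted(top.tolist()))
```

No class list appears anywhere in the lookup. Swapping the query string changes `text_embedding` and nothing else, which is the sense in which the dictionary was removed.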

[Dominic Maggio et al. 2024. Clio: Real-time Task-Driven Open-Set 3D Scene Graphs](https://arxiv.org/abs/2404.13696) turned this lineage toward tasks. A natural-language task given to the robot is interpreted as an information bottleneck, and only the level of abstraction needed for that task is left in the scene graph. In an instruction like "clean near the coffee machine," the coffee machine and the objects around it are preserved while unrelated details are grouped. Which layer of the hierarchical graph to expose varies by task.

> 🔗 **Borrowed.** ConceptGraphs and Clio inherit the Carlone-group scene graph (Armeni → Rosinol-Kimera → Hughes-Hydra → Maggio-Clio) accumulated over eight years, swapping node features for CLIP, VLM, and LLM outputs while the objects-places-rooms hierarchy survives.

This book does not cover this branch. The problem of loading semantics onto the map is a separate trajectory that follows Ch.18 §18.4's note on the market shrinkage of the object-as-landmark lineage in 2017–2019, and the full narration ends here. As a historical record, two points are left. One is that semantic SLAM did not "fail" but returned in the form of hierarchical scene graphs. The other is that geometric foundation (the DUSt3R lineage) and semantic foundation (the Hydra → ConceptGraphs → Clio lineage) have not yet met in earnest as of 2026. An end-to-end system that attaches CLIP features to VGGT's pointmap, or a configuration that combines Clio's scene graph with DUSt3R's calibration-free geometry, has not yet been reported. The point of encounter is handed over to Ch.19's open problems.

---

## 16.7 What remains of SLAM

MASt3R-SLAM borrows the architecture of classical SLAM. Keyframe selection, loop closure, map management — these structures were still needed on top of the new representation. The DUSt3R family replaced the insides of feature matching and reconstruction, but the system-level judgments of SLAM are reused in the form classical methods resolved them.

This observation is consistent with a pattern that repeats across Part 5. NeRF-SLAM adopted NeRF as a map representation yet kept keyframe-based tracking. 3DGS-SLAM adopted Gaussians yet did loop closure in the classical way (Ch.15). Dynamic SLAM in Ch.15b also only swapped the front end for mask removal while the back end stayed the same. The representation changes; the system structure survives.

The same pattern repeats for foundation 3D. In 2025, Dominic Maggio and Luca Carlone at MIT released [VGGT-SLAM](https://arxiv.org/abs/2505.12549). The configuration has VGGT reconstruct local submaps and a factor graph weave them into the global coordinate frame. The transformer absorbed geometry, but the factor graph survived. Revaud himself wrote in Handbook Ch.13 that "a form of factor graph is still necessary." The speed of absorption is unusual but the final form is still open. How far foundation 3D can handle real-time large-scale sequences, and where it will merge with the semantic branch noted in 16.6, are the things to watch in 2026–2027.

---

## 🧭 Still open

**The wall of large-scale sequence processing.** The transformers in DUSt3R and VGGT require memory that scales quadratically with the number of images. Up to 100 images is realistic, but 1,000 or 10,000 is another matter. Spann3R's incremental approach is a partial answer, but smooth handling of large outdoor environments is unresolved. Who is working on it now? Several groups are exploring sparse attention and hierarchical global alignment, but no method has consensus.

**How to define loop closure in this frame.** In classical SLAM, loop closure is the mechanism that recognizes a previously visited place and corrects accumulated error. In the DUSt3R family, how is "a previously visited place" represented, and in a pointmap-based map, how is correction propagated? MASt3R-SLAM handles it with the existing approach, but whether this is the best or a principled solution is unknown.

**Generalization of metric scale.** DUSt3R's pointmaps are in relative scale. The depth ratio between two images is recovered, but absolute scale is unknown. Just as Metric3D or Depth Anything v2 targeted metric depth, generalizing metric scale remains a problem for foundation 3D. Camera-independent metric is not easy even at foundation scale. The physical constraint of determining absolute scale without GPS or IMU exists regardless of data scale.
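
The scale ambiguity can be stated in one line. As a sketch under a standard pinhole model (the symbols $K$, $R$, $t$, $X$, and $s$ are notation assumed here, not taken from the text), scaling the scene and the camera translation by the same factor $s > 0$ leaves every pixel unchanged:

$$
\pi\big(K(R\,sX + s\,t)\big) = \pi\big(s\,K(RX + t)\big) = \pi\big(K(RX + t)\big),
$$

where $\pi([u, v, w]^\top) = [u/w,\; v/w]^\top$ is the perspective division. Images alone therefore cannot fix $s$; it has to come from a learned prior or from an external measurement.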

**Is this flow SLAM's future, or a separate branch?** Like 3DGS in Ch.15, foundation 3D is in the middle of being absorbed by the SLAM community. With MASt3R-SLAM and VGGT-SLAM arriving in succession in 2024–2025, the path of absorption has taken shape. But when real-time operation on large-scale sequences will arrive, and where the junction with the semantic branch noted in §16.6 (Hydra → ConceptGraphs → Clio) will form, both remain unclear. The form in which geometric foundation and semantic foundation meet in one system is a core axis of Ch.19's open problems.

---

The three chapters of Part 5 arrive at the same conclusion. Whether NeRF or foundation model, changing representation changes the insides of reconstruction and localization. The system-level structure of SLAM (keyframe, loop closure, map management) survives on top of the new representation. Ch.17 returns to a lineage that ran in parallel through this entire period: LiDAR-based SLAM. Different sensors, different culture, largely the same questions. Where Ch.14–16 asked what kind of representation holds a camera-based map, Ch.17 asks how the same engineering problems were approached when the input was a rotating laser rather than a lens.

---

# Ch.17 — The LiDAR Parallel Universe: From LOAM to FAST-LIO

The lineage that runs from Ch.1 photogrammetry through Ch.16 Foundation 3D rests on one shared premise. The sensor is a camera. MonoSLAM, PTAM, ORB-SLAM, DSO, DUSt3R — these names all sit inside the tradition of reading the world through pixels. During the same period, inside the same robotics community, an entirely different lineage was growing. The LiDAR lineage built its own grammar on the bones of ICP, unrelated to the camera camp's keypoints, photometric consistency, or feature descriptors. The two lineages did not cite each other's papers, and their benchmarks and conferences were separate.

LOAM, the chapter's central system, sits on three pre-existing bones. Iterative closest-point registration from Besl & McKay's 1992 ICP, which LOAM applies in point-to-line and point-to-plane form. The network-of-poses framing from Lu & Milios's 1997 work on globally consistent scan alignment. The real-time outdoor pressure carried over from the 2007 DARPA Urban Challenge. Ji Zhang's 2014 contribution was to split high-frequency odometry from low-frequency mapping on a spinning Velodyne, and that split became the default grammar of every LiDAR system that followed.

When Ji Zhang presented LOAM at RSS 2014, the Visual SLAM community paid little attention. That year the visual side was busy with LSD-SLAM and SVO. The LiDAR side was no different. LOAM shared no code with camera-based methods, and the research communities did not overlap. The two lineages ran for ten years under the shared name of robotics while barely looking at each other. LOAM stood on the old skeleton of [ICP (Besl·McKay, 1992)](https://graphics.stanford.edu/courses/cs164-09-spring/Handouts/paper_icp.pdf), and the factor graph of Graph SLAM had been standard on the visual side for a long while before it crossed over to LiDAR. The parallel universe matured without exchange.

---

## 17.1 LOAM: edges and planes, and the capture of KITTI

In 2014, Google's Waymo predecessor program was already driving on roads, and the aftershocks of the DARPA Urban Challenge had not yet faded. A Velodyne HDL-64E cost $75,000 per unit. Only groups at the scale of CMU, MIT, and Stanford could afford to take LiDAR as a research object. CMU Robotics Institute's Autonomous Mobile Robot Lab — Professor Sanjiv Singh's lab — was one of them.

Attempts to build maps with LiDAR existed before LOAM. [Lu & Milios 1997. "Globally Consistent Range Scan Alignment for Environment Mapping" (Autonomous Robots)](https://doi.org/10.1023/A:1008854305733) placed 2D range scans as nodes, tied them together with relative scan-to-scan constraints as edges, and jointly optimized the full trajectory; this "network of poses" idea would later be traced back as the origin point of pose-graph SLAM (see Ch.6). For the matching step itself, alongside Besl·McKay's ICP, [Biber·Straßer 2003. "The Normal Distributions Transform" (IROS)](https://doi.org/10.1109/IROS.2003.1249285) had proposed NDT — distribution-based matching that aligns to per-cell Gaussian distributions — and Magnusson's later 3D extension lived alongside ICP as an alternative. All of this was 2D, or offline 3D. LOAM's share was real-time 3D.

Ji Zhang, under Singh's supervision, released [Zhang & Singh 2014. "LOAM: Lidar Odometry and Mapping in Real-time" (RSS)](https://www.roboticsproceedings.org/rss10/p07.pdf). He classified LiDAR points into two kinds of features using a per-point term $c$ that the paper calls smoothness but that grows with local curvature. An **edge point** is a point with high $c$; a **planar point** is one with low $c$. Rather than registering the whole point set as ICP does, LOAM matches only these two feature sets. Edge points are constrained point-to-line against edge lines in the neighboring scan, and planar points are constrained point-to-plane against local planes. Computational cost drops, and real-time operation becomes feasible.
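
The feature split can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it follows the form of the paper's smoothness term, $c = \frac{1}{|S|\,\lVert X_i \rVert}\,\big\lVert \sum_{j \in S,\, j \neq i} (X_i - X_j) \big\rVert$, over a window $S$ of consecutive returns on one scan line, but the window size and both thresholds are assumed values, and LOAM itself picks the extreme-$c$ points per sub-region instead of using fixed cut-offs.

```python
import math

def smoothness(scan, i, half_window=5):
    """LOAM-style term c for point i on a single scan line.

    scan: list of (x, y, z) returns in the order the laser produced them.
    """
    xi = scan[i]
    neighbours = [scan[j] for j in range(i - half_window, i + half_window + 1) if j != i]
    # Sum of difference vectors (X_i - X_j) over the neighbourhood S.
    diff = [sum(xi[k] - p[k] for p in neighbours) for k in range(3)]
    norm_diff = math.sqrt(sum(d * d for d in diff))
    norm_xi = math.sqrt(sum(v * v for v in xi))
    return norm_diff / (len(neighbours) * norm_xi)

def classify(scan, half_window=5, edge_thresh=0.02, plane_thresh=0.005):
    """Label points as 'edge' (high c) or 'planar' (low c); thresholds are assumptions."""
    labels = {}
    for i in range(half_window, len(scan) - half_window):
        c = smoothness(scan, i, half_window)
        if c > edge_thresh:
            labels[i] = "edge"
        elif c < plane_thresh:
            labels[i] = "planar"
    return labels

# Synthetic scan line: a flat wall that turns a 90-degree corner at index 10.
wall = [(-1.0 + 0.1 * k, 5.0, 0.0) for k in range(11)]
side = [(0.0, 5.0 - 0.1 * k, 0.0) for k in range(1, 11)]
labels = classify(wall + side)
```

On this synthetic scan the corner point receives a much larger $c$ than the points in the middle of either wall, which is the property the selective matching relies on.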

The algorithm is split into two stages. Lidar Odometry estimates the 6-DoF transform between scans at 10 Hz. Lidar Mapping, at a lower frequency (1 Hz), registers against the full map to correct the error. Separating high-frequency odometry from low-frequency mapping suppresses drift while keeping real-time performance. This two-tier structure becomes the default grammar of LiDAR SLAM that follows.

On the KITTI benchmark, LOAM took first place shortly after release and held it for years, until visual-LiDAR fusion methods appeared. On sequence 00, Zhang reported a relative translation error of 0.78%. Compared with the contemporary best visual odometry at around 1%, the structural advantage of LiDAR is clear.

> 🔗 **Borrowed.** LOAM's feature-based point registration starts from Besl·McKay's (1992) ICP. The difference is that it selectively matches only edge and planar features rather than all points. Selective reuse of classical registration bought both speed and precision.

---

## 17.2 LeGO-LOAM: cut the ground first

LOAM's problem was that it did not treat the ground plane explicitly. In outdoor self-driving environments, a significant share of the point cloud is road surface. Lumping it in with edge/planar features produces matching noise.

At Stevens Institute of Technology's Robust Field Autonomy Lab, Tixiao Shan and his advisor Brendan Englot made ground segmentation the first stage in [Shan & Englot 2018. LeGO-LOAM](https://doi.org/10.1109/IROS.2018.8594299). The point cloud is projected onto a range image, the ground points are separated first, and the non-ground points are then clustered. The optimization runs in two steps: planar features from the ground constrain roll, pitch, and vertical translation, and edge features from the clusters then constrain yaw and horizontal translation.
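
The projection step and a simple ground test can be sketched as follows. This is an illustration under stated assumptions, not LeGO-LOAM's code: the 16×1800 layout, the vertical field of view, and the 10° slope limit are values assumed here for a VLP-16-like sensor, and the function names are hypothetical.

```python
import math

N_ROWS, N_COLS = 16, 1800           # assumed range-image size (rings x azimuth bins)
V_MIN_DEG, V_RES_DEG = -15.0, 2.0   # assumed lowest beam angle and beam spacing

def to_pixel(p):
    """Map a 3D point to (row, col, range) in the range image."""
    x, y, z = p
    rng = math.sqrt(x * x + y * y + z * z)
    elev = math.degrees(math.asin(z / rng))
    azim = math.degrees(math.atan2(y, x))            # in (-180, 180]
    row = int(round((elev - V_MIN_DEG) / V_RES_DEG))  # may fall outside 0..N_ROWS-1
    col = int((azim + 180.0) / 360.0 * N_COLS) % N_COLS
    return row, col, rng

def is_ground_pair(lower, upper, max_slope_deg=10.0):
    """Treat two vertically adjacent returns in one column as ground if the
    segment joining them is close to horizontal."""
    dx, dy, dz = (upper[k] - lower[k] for k in range(3))
    slope = math.degrees(math.atan2(abs(dz), math.hypot(dx, dy)))
    return slope <= max_slope_deg

road = is_ground_pair((5.0, 0.0, -1.7), (6.0, 0.0, -1.7))   # flat road surface
wall = is_ground_pair((5.0, 0.0, -1.0), (5.01, 0.0, 0.0))   # near-vertical surface
```

Once ground pixels are marked this way, the remaining pixels can be clustered in image space, which is far cheaper than clustering an unordered 3D cloud.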

The result was computation savings compared with LOAM. Where the original LOAM struggled to run in real time on a Velodyne VLP-16, LeGO-LOAM runs on the same sensor even on embedded platforms (NVIDIA Jetson). Lightweighting has its price. In sparse-point environments or environments with irregular ground structure — sections the laser occludes, rough off-road terrain, building interiors — segmentation fails and odometry wavers.

But LeGO-LOAM's real contribution was less the lightweighting itself than the design principle "preprocess sensor input into structured modules, then run odometry." FAST-LIO and LIO-SAM pick up this principle later.

Around the same time as LeGO-LOAM, at the University of Bonn, Jens Behley and Cyrill Stachniss brought **surfels** (surface elements), rather than edge/plane features, to outdoor LiDAR. Their **SuMa**, in [Behley & Stachniss 2018. "Efficient Surfel-Based SLAM using 3D Laser Range Data in Urban Environments" (RSS)](http://www.roboticsproceedings.org/rss14/p16.pdf), summarized each point's neighborhood as a disk-shaped surfel and performed scan-to-model registration; the follow-up [Chen et al. 2019. "SuMa++" (IROS)](https://doi.org/10.1109/IROS40897.2019.8967704) combined semantic segmentation to filter moving objects at the surfel level. This is the moment the surfel representation that ElasticFusion had used in the Kinect indoor RGB-D lineage (Ch.9) crossed over to outdoor Velodyne. Three branches were competing simultaneously around 2018: feature selection (LOAM), segmentation-first (LeGO-LOAM), and surfel accumulation (SuMa).

---

## 17.3 FAST-LIO — tightly coupled LiDAR-IMU

LiDAR scan frequency sits at 10–20 Hz. Fast motion in between scans produces motion distortion in the point cloud. The sensor position at the end of a scan differs from the position at the start, and this is the main reason the LOAM family wavers on high-speed platforms.
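
A minimal 2D sketch of the distortion and its correction, assuming constant velocity and constant yaw rate over one scan (a simplification chosen here for illustration; the function and parameter names are hypothetical):

```python
import math

def deskew(points, scan_duration, v, omega):
    """Re-express every return in the sensor frame at the END of the scan.

    points: list of (t, x, y); t is seconds since scan start and (x, y) is the
            return in the sensor frame at time t.
    v:      (vx, vy) constant velocity in the scan-start frame [m/s].
    omega:  constant yaw rate [rad/s].
    """
    def pose(t):
        # Simplified motion model: position and heading grow linearly in time.
        return v[0] * t, v[1] * t, omega * t

    ex, ey, eth = pose(scan_duration)
    ce, se = math.cos(eth), math.sin(eth)
    out = []
    for t, x, y in points:
        px, py, th = pose(t)
        c, s = math.cos(th), math.sin(th)
        # Sensor frame at time t -> scan-start frame.
        wx, wy = c * x - s * y + px, s * x + c * y + py
        # Scan-start frame -> sensor frame at scan end (inverse rigid transform).
        dx, dy = wx - ex, wy - ey
        out.append((ce * dx + se * dy, -se * dx + ce * dy))
    return out

# One static point 5 m ahead, seen at the start and at the end of a 0.1 s scan
# while the sensor drives forward at 10 m/s.
raw = [(0.0, 5.0, 0.0), (0.1, 4.0, 0.0)]
fixed = deskew(raw, scan_duration=0.1, v=(10.0, 0.0), omega=0.0)
```

In the raw scan the same physical point appears at two ranges one metre apart; after correction both returns coincide at 4 m in the scan-end frame.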

The IMU runs at 100–400 Hz. That is enough to fill the gaps in LiDAR. But performance depends on how LiDAR and IMU are combined. Loosely coupled: estimate each independently and then fuse. Tightly coupled: handle both inside a single state estimator. The latter is theoretically superior but harder to implement.

At the University of Hong Kong (HKU)'s MaRS Lab, Wei Xu and his advisor Fu Zhang presented [**FAST-LIO**](https://arxiv.org/abs/2010.08196) in RA-L 2021. The paper came out of a drone-control lab. There was a concrete field motivation: LiDAR odometry had to hold up on UAVs with heavy rotor vibration and fast maneuvering. The tool they chose was the **iterated Extended Kalman Filter (iEKF)**. The iEKF repeats the measurement update, re-linearizing the measurement model at the latest estimate on each pass. Under fast nonlinear motion this tends to track better than a basic EKF, which linearizes only once.
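
The difference between one linearization and several shows up even in a scalar toy problem. The sketch below is a textbook-style iterated update for a one-dimensional state and is not FAST-LIO's on-manifold formulation; the measurement model $h(x) = x^2$ and all numbers are assumptions chosen to make the nonlinearity visible.

```python
def iekf_update(x_prior, P, z, R, h, H, iters=5):
    """Scalar iterated EKF measurement update.

    x_prior, P: prior mean and variance.   z, R: measurement and its variance.
    h: measurement function.               H: its derivative.
    With iters=1 this reduces to the ordinary EKF update.
    """
    x = x_prior
    for _ in range(iters):
        Hk = H(x)                            # re-linearize at the current estimate
        K = P * Hk / (Hk * P * Hk + R)       # Kalman gain for this linearization
        x = x_prior + K * (z - h(x) - Hk * (x_prior - x))
    P_post = (1.0 - K * Hk) * P
    return x, P_post

h = lambda x: x * x
H = lambda x: 2.0 * x

# Prior belief x ~ 2.0 (variance 1.0); a precise measurement says x^2 = 9.
x_ekf, _ = iekf_update(2.0, 1.0, 9.0, 0.01, h, H, iters=1)
x_iekf, P_iekf = iekf_update(2.0, 1.0, 9.0, 0.01, h, H, iters=10)
```

The single-pass update overshoots because the slope at the prior (4) underestimates the slope near the solution (about 6); the iterated version settles close to $x = 3$, where the measurement is actually explained.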

The following year they published **FAST-LIO2** ([Xu et al. 2022](https://doi.org/10.1109/TRO.2022.3141876)) in TRO, adding the ikd-Tree. Conventional kd-Trees carry heavy reconstruction cost every time a point is added. The ikd-Tree is an incremental variant that performs only partial reconstruction. Real-time nearest-neighbor search stays feasible even with millions of map points. Experiments showed consistent performance on UAVs, handheld rigs, and autonomous cars. Drift stayed low even in drone environments.

The next move in the FAST-LIO lineage was to remove motion distortion entirely. [He et al. 2023. "Point-LIO: Robust High-Bandwidth Light Detection and Ranging Inertial Odometry" (Advanced Intelligent Systems)](https://doi.org/10.1002/aisy.202200459), also from the MaRS Lab, updates the state every time a LiDAR point arrives, instead of gathering points into a scan and updating all at once. Point-by-point observation update. Rather than correcting intra-scan distortion via constant velocity or IMU interpolation, each point is fused at its own timestamp, erasing the window in which distortion could occur. Drift reductions compared with FAST-LIO2 were reported on high-agility platforms.

> 🔗 **Borrowed.** FAST-LIO's tightly coupled IMU integration transplants to LiDAR a mathematical framework first sorted out in the Visual-Inertial SLAM camp. IMU preintegration theory was completed in [Forster et al. 2016. "On-Manifold Preintegration" (TRO)](https://doi.org/10.1109/TRO.2016.2597321), and FAST-LIO reimplements that spirit in iEKF form.

---

## 17.4 LIO-SAM: the factor graph crosses over to LiDAR

By then the factor graph was already standard on the Visual SLAM side. [GTSAM (Dellaert·Kaess, 2012)](https://gtsam.org/) had settled in as the backend of Visual-Inertial systems. The LiDAR side, however, was still working with EKF variants or plain scan matching. LiDAR systems were not making proper use of the main benefit of graph optimization: correcting the full trajectory after a loop closure.

Tixiao Shan, after LeGO-LOAM, released [Shan et al. 2020. LIO-SAM](https://doi.org/10.1109/IROS45743.2020.9341176), which explicitly adopted GTSAM's factor graph as the backend of a LiDAR-IMU system. IMU preintegration factors, LiDAR odometry factors, GPS factors, and loop closure factors are integrated into a single graph. Each keyframe becomes a node, each sensor constraint an edge. Marginalization keeps the graph size under control.
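
What a loop closure factor buys can be shown with a deliberately tiny example. The sketch below is a one-dimensional linear pose graph solved by plain least squares; it does not use GTSAM, and LIO-SAM's actual graph lives on 3D poses with incremental smoothing. The factor format, weights, and helper names are assumptions made for illustration.

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def optimize(n, factors):
    """Minimize sum of w * (x[j] - x[i] - z)^2 over all factors (i, j, z, w).

    i = None turns the factor into a prior on x[j].
    """
    H = [[0.0] * n for _ in range(n)]
    g = [0.0] * n
    for i, j, z, w in factors:
        H[j][j] += w
        g[j] += w * z
        if i is not None:
            H[i][i] += w
            H[i][j] -= w
            H[j][i] -= w
            g[i] -= w * z
    return solve(H, g)

prior = [(None, 0, 0.0, 100.0)]
odometry = [(k, k + 1, 1.1, 1.0) for k in range(4)]   # each step over-reports by 0.1
loop = [(0, 4, 4.0, 10.0)]                            # revisit: x4 is 4.0 from x0

drifted = optimize(5, prior + odometry)
corrected = optimize(5, prior + odometry + loop)
```

Without the loop factor the last pose sits at 4.4. With it, the last pose is pulled to about 4.01 and the correction is spread evenly over the intermediate poses instead of being applied only at the end, which is the behaviour a filter that has already marginalized past poses cannot reproduce.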

> 🔗 **Borrowed.** LIO-SAM imports directly into a LiDAR system the GTSAM factor graph backend that Dellaert (from 2006 onward) had standardized on the Visual SLAM side.

LIO-SAM is stronger than FAST-LIO2 in drift-accumulation scenarios. It has loop closure. On the other hand, computation cost is higher, and without GPS or additional sensor input the factor graph's advantage shrinks. The two systems have different design goals. FAST-LIO2 targets top speed and precision in a real-time single-sensor configuration; LIO-SAM targets consistency in multi-sensor long-term mapping.

There was also a paradoxical return. For close to ten years after LOAM, LiDAR odometry had run toward feature selection, surfels, and neural descriptors, getting steadily more complex. Then in 2023 Bonn's [Vizzo et al. 2023. "KISS-ICP: In Defense of Point-to-Point ICP" (RA-L)](https://doi.org/10.1109/LRA.2023.3236571) went in the opposite direction. With no feature extraction and no learned descriptors, a single point-to-point ICP with an adaptive threshold and almost no tuning showed competitive odometry on KITTI. Keep It Small and Simple, as the name says. The authors' claim, in paraphrase, was close to historical revisionism: LOAM did not arise because ICP was lacking; the engineering was. Behind the fact that a return to classical registration became possible after ten years lies practical progress in compute and in the data structures used for nearest-neighbor search.
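
For readers who have not seen it written out, point-to-point ICP is short. The sketch below is a 2D brute-force version for illustration only: it uses a fixed correspondence threshold `max_dist` where KISS-ICP adapts the threshold, and it omits the subsampling, motion prediction, and robust weighting a real pipeline would need. All names are hypothetical.

```python
import math

def best_rigid_2d(src, dst):
    """Closed-form least-squares rotation and translation for paired 2D points."""
    n = len(src)
    cs = (sum(p[0] for p in src) / n, sum(p[1] for p in src) / n)
    cd = (sum(p[0] for p in dst) / n, sum(p[1] for p in dst) / n)
    dot = cross = 0.0
    for (sx, sy), (dx, dy) in zip(src, dst):
        ax, ay = sx - cs[0], sy - cs[1]
        bx, by = dx - cd[0], dy - cd[1]
        dot += ax * bx + ay * by      # cosine component
        cross += ax * by - ay * bx    # sine component
    th = math.atan2(cross, dot)
    c, s = math.cos(th), math.sin(th)
    return th, cd[0] - (c * cs[0] - s * cs[1]), cd[1] - (s * cs[0] + c * cs[1])

def apply(T, pts):
    th, tx, ty = T
    c, s = math.cos(th), math.sin(th)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in pts]

def icp(src, dst, iters=20, max_dist=1.0):
    """Alternate nearest-neighbour matching and closed-form alignment.

    Assumes at least one correspondence survives the max_dist gate.
    """
    T = (0.0, 0.0, 0.0)
    for _ in range(iters):
        cur = apply(T, src)
        pairs = []
        for i, p in enumerate(cur):
            q = min(dst, key=lambda d: (d[0] - p[0]) ** 2 + (d[1] - p[1]) ** 2)
            if math.dist(p, q) <= max_dist:
                pairs.append((src[i], q))
        # Re-estimate the full src -> dst transform from the current matches.
        T = best_rigid_2d([a for a, _ in pairs], [b for _, b in pairs])
    return T

src = [(float(i), 0.0) for i in range(6)] + [(0.0, float(j)) for j in range(1, 4)]
true_T = (0.1, 0.3, -0.2)
est_T = icp(src, apply(true_T, src))
```

On this small L-shaped point set the initial misalignment is less than half the point spacing, so the first round of nearest-neighbour matches is already correct and the estimate recovers the true motion.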

---

## 17.5 Sensor price drop and diffusion: 2007–2024

In the history of LiDAR SLAM, sensor price is as important as any technical paper.

At the 2007 DARPA Urban Challenge, the Velodyne HDL-64E that top teams carried was $75,000 per unit. Unless you were an autonomous-driving research team or a defense project, the hardware was out of reach. In 2012 the HDL-32E was still around $30,000. By 2014, when LOAM came out, the VLP-16 had dropped to $7,999 — still a significant chunk of a research budget.

Over the following decade, the reversal happened. The Chinese startup Livox (part of DJI) released the Livox Mid-40 at $599 in 2019. Ouster brought 128-channel sensors into the low-thousands range. By 2023–2024, solid-state LiDARs from RoboSense, Innovusion, and Livox came in under $500. Measured against the HDL-64E's $75,000 in 2007, the price fell by more than 100x in about sixteen years; measured against the VLP-16's $7,999 in 2014, it fell by roughly 16x in ten.

Cheap hardware did not mean the algorithms carried over. Solid-state LiDARs, unlike spinning types, have a limited field of view (FoV): 70°×70° or narrower. That is not the 360° all-around scan that LOAM and LeGO-LOAM assumed, so algorithms built for spinning sensors do not work out of the box. The spread of cheap sensors created new algorithmic research questions at the same time.

---

## 17.6 Why the Visual and LiDAR lineages split

Visual SLAM and LiDAR SLAM developed in the same period, yet the two communities did not exchange ideas for a long time. There was more than one reason.

The first is the sensor itself. Cameras see texture and color; LiDAR sees range and geometry. While camera-based methods developed around keypoints, descriptors, and photometric consistency, LiDAR differentiated around edges, planes, and range images. The problem formulations themselves were different.

The conferences were also different. CVPR and ICCV were the main stages for camera-based methods; ICRA, IROS, and RSS were where LiDAR SLAM mostly appeared. The researcher populations did not overlap. During the early-to-mid 2010s, when Velodyne was shipping into Google and the autonomous-driving industry, the LiDAR SLAM researcher population was dense on the self-driving robotics side.

Place recognition methods diverged too. Cameras use visual appearance, as in DBoW2 and NetVLAD. LiDAR uses the structural features of a 3D point cloud, as in [Scan Context (Kim·Kim, 2018)](https://gisbi-kim.github.io/publications/gkim-2018-iros.pdf) or [PointNetVLAD](https://arxiv.org/abs/1804.03492). Even for the same location, the signal being recognized is different.
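
To make "structural features" concrete, here is a minimal sketch in the style of a Scan Context descriptor: the area around the sensor is split into rings and sectors, and each bin keeps the maximum height of the points that fall into it. The bin counts and `max_range` are small illustrative values assumed here, and the matching step of the original method is not shown.

```python
import math

def scan_context(points, n_rings=4, n_sectors=8, max_range=20.0):
    """Polar height map: desc[ring][sector] = max z of the points in that bin."""
    desc = [[0.0] * n_sectors for _ in range(n_rings)]
    for x, y, z in points:
        r = math.hypot(x, y)
        if r >= max_range:
            continue
        ring = int(r / max_range * n_rings)
        sector = int((math.atan2(y, x) + math.pi) / (2.0 * math.pi) * n_sectors) % n_sectors
        desc[ring][sector] = max(desc[ring][sector], z)
    return desc

def polar(r, deg, z):
    a = math.radians(deg)
    return (r * math.cos(a), r * math.sin(a), z)

cloud = [polar(3.0, 22.5, 2.0), polar(4.0, 22.5, 1.0), polar(12.0, -67.5, 5.0)]
turned = [polar(3.0, 112.5, 2.0), polar(4.0, 112.5, 1.0), polar(12.0, 22.5, 5.0)]
d0 = scan_context(cloud)
d1 = scan_context(turned)   # same scene after the sensor's heading changed by 90 degrees
```

A change in heading shows up as a circular shift of the descriptor's columns, so revisiting a place from a different direction can be handled by comparing descriptors over column shifts. Nothing in the descriptor depends on appearance, which is the contrast with DBoW2 or NetVLAD.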

The first signs of convergence showed up in the early 2020s. Papers on LiDAR-Camera fusion started appearing at CVPR, and Tixiao Shan's [LVI-SAM (2021)](https://arxiv.org/abs/2104.10831) was an attempt to bolt a visual-inertial subsystem onto LIO-SAM. The authors presented it as a tightly coupled factor graph, but the two subsystems (LIS and VIS) operate largely independently and help each other on failure, which means a fully unified single state estimate is still open.

---

## 17.7 Visual-LiDAR convergence attempts: 2024–2025

Starting in 2024, the mood shifted. As foundation models developed toward extracting features sensor-agnostically, attempts to handle camera and LiDAR in one frame increased. There are two branches.

One is multi-modal pretrained features: align LiDAR and camera data in the same embedding space. Just as [CLIP (Radford et al., 2021)](https://arxiv.org/abs/2103.00020) aligned images with text, these approaches use contrastive learning between LiDAR and images. As of 2023–2024, several groups were at the experimental stage.

The other is unified sensor abstraction. Integrate sensor outputs into geometric primitives or neural fields and then process them in a single backend. This side is still at the research-paper stage, and systems that have shown real-time operation are rare.

Neither direction has yet produced a single lineage that actually merges LiDAR SLAM and Visual SLAM. FAST-LIO2 and ORB-SLAM3 are still used independently.

---

## 17.8 Radar is outside this book's scope

Right next to the LiDAR parallel universe sits another parallel universe. Radar SLAM has matured as an independent subfield. It rests on two hardware branches, spinning radar (the Navtech CIR family) and SoC-based 4D mmWave radar; on direct measurement of Doppler radial velocity, which enables correspondence-free odometry; and on radio-specific noise models for speckle, multipath, and receiver saturation. The lineage runs from Oxford's radar localisation in [Cen & Newman 2018](https://doi.org/10.1109/ICRA.2018.8460687) through Adolfsson·Magnusson's **CFEAR** and its successor **TBV-SLAM** to Burnett·Barfoot's continuous-time ICP, and dedicated datasets such as Oxford Radar RobotCar, Boreas, and MulRan form the benchmark base for this area. The practical motivation, penetration through bad weather and smoke, is clear, but the contact surface with the lineage this book has been tracking (photogrammetry → SfM → Visual SLAM → learning → 3D foundation) is thin. Radar is left as "a neighbor to merge in later," and this book does not write its separate history; see Handbook of SLAM (2026) Ch.9 for details.

---

## 📜 Prediction vs. outcome

> Zhang·Singh named two items as explicit future work in the Conclusion of the 2014 LOAM paper. First, introducing loop closure to correct drift. Second, combining the method with IMU output through a Kalman filter. Both directions were realized within the next ten years. IMU integration was settled by FAST-LIO (2021) and FAST-LIO2 (2022) with a tightly coupled iEKF, and loop closure was folded in by LIO-SAM (2020) with a factor graph backend. The path the authors sketched was implemented fairly accurately. Beyond those two axes, however, there was a problem that never appeared in the paper's Conclusion but kept surfacing in the field. Dynamic object handling. Real-time separation of moving pedestrians and vehicles from LiDAR points still leans mostly on deep-learning segmentation as of 2026, and a solution built into the SLAM algorithm itself is still absent. `[hit + in progress]`

---

## 🧭 Still open

**Full Visual+LiDAR fusion.** Even after LVI-SAM, no system has reached practical status that handles the two sensors tightly coupled inside a single state estimator. The scenario where the camera fails in fog or rain and LiDAR has to fill the gap is a clear need in autonomous driving. Algorithmic and sensor-calibration difficulty are still the barriers. Several groups in 2024–2025 are experimenting with transformer-based fusion, but consistent results have not arrived.

**Algorithms optimized for solid-state LiDAR.** LOAM and LeGO-LOAM assume a 360° spinning LiDAR. Solid-state products such as Livox's use non-repetitive scan patterns: instead of retracing the same scan lines, the beam covers new parts of a narrow field of view over time, so coverage builds up by accumulation. Feature extraction and motion distortion correction that fit these characteristics need their own research. Livox LOAM exists, but the level of generalization is weak.

**Dynamic object handling.** This problem sits in the same place in 2026 as it did when LOAM appeared in 2014. The static-world assumption is an old premise of SLAM, and LiDAR is no exception. Real-time separation of moving objects from the point cloud is currently handed off to segmentation networks as a workaround. Methods that handle it inside SLAM on a geometric basis are expensive and unstable in accuracy. Companies such as Waymo run proprietary solutions, as Argo AI did before it shut down in 2022, but no general public algorithm exists.

---

The LiDAR lineage matured without crossing the visual main track. The two lineages acquired their own languages, and translation between those languages is still in progress. Ch.18 steps back from both tracks to look at what was tried alongside them — lineages that neither camera nor LiDAR absorbed, and what their failure meant.

---

# Ch.18 — Failed Cases and Lost Lineages

Across the same decades the camera and LiDAR lineages were building their grammars, other people inside the robotics community were walking in different directions. Lineages that were neither the camera lineage nor the LiDAR one — some began as early as RatSLAM in 2004, well before LOAM, others ran in parallel through the 2010s. That they were not chosen does not mean they were absent from the history.

Each dead lineage traced here has a live ancestor. Milford & Wyeth's RatSLAM (2004) pulls from O'Keefe & Dostrovsky's 1971 place-cell work by way of cognitive-map theory. Event SLAM inherits the silicon-retina line through Lichtsteiner, Posch, & Delbruck's 2008 DVS at ETH Zürich INI. Salas-Moreno et al.'s SLAM++ (2013) extends 1990s object-level scene understanding into the SLAM state. The chapter traces where those inheritances hit walls.

The history of SLAM is not made up only of successful lineages. Every decade there were approaches that accumulated enough papers and early results yet failed to enter the mainstream — engineering scaling hit a wall, or a more practical alternative took the seat first. That was a different matter from technical failure.

---

## 18.1 RatSLAM — place cell-based topological map

RatSLAM, presented by [Milford et al. 2004](https://doi.org/10.1109/ROBOT.2004.1302555) at ICRA 2004, approached the place recognition problem in an entirely different way. It imitated the firing patterns of **place cells** and **head direction cells** in the rat hippocampus so that the robot would form a place representation naturally while exploring. The computational model was called a **Continuous Attractor Network (CAN)**. The neurons' activation state forms a continuous activation 'bump' on a 2D grid, and the bump moves along the grid according to the robot's velocity and rotation input (path integration). When visual input arrives, it is compared against stored place representations and the bump position is corrected. This loop — bump propagation from motion, correction from visual matching — is the core operating principle of RatSLAM.
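
The loop of bump propagation and visual correction can be imitated with a toy attractor-style model. The sketch below uses a one-dimensional ring of cells in place of RatSLAM's pose-cell network, and its excitation kernel, inhibition constant, and injection strength are all assumed values chosen for illustration.

```python
KERNEL = {-2: 0.1, -1: 0.5, 0: 1.0, 1: 0.5, 2: 0.1}  # assumed local excitation weights

def attractor_step(act, inhibition=0.02):
    """Local excitation, uniform inhibition, then normalization to unit sum."""
    n = len(act)
    excited = [sum(w * act[(i - d) % n] for d, w in KERNEL.items()) for i in range(n)]
    inhibited = [max(0.0, a - inhibition) for a in excited]
    total = sum(inhibited)
    return [a / total for a in inhibited]

def path_integrate(act, cells_moved):
    """Shift the whole activity pattern around the ring by the motion input."""
    n = len(act)
    return [act[(i - cells_moved) % n] for i in range(n)]

def visual_correction(act, cell, strength):
    """Inject activity at the cell associated with a recognized view."""
    out = list(act)
    out[cell] += strength
    return out

def peak(act):
    return max(range(len(act)), key=lambda i: act[i])

act = [0.0] * 30
act[10] = 1.0
act = attractor_step(act)                            # a bump forms around cell 10
act = attractor_step(path_integrate(act, 3))         # motion carries it to cell 13
after_motion = peak(act)
act = attractor_step(visual_correction(act, 20, 2.0))
after_view = peak(act)                               # a strong view match relocates it
```

The bump follows the motion input until a sufficiently strong visual match outweighs it and relocalizes the estimate, the same division of labour the paragraph above describes.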

> 🔗 **Borrowed.** The place cell discovery by [O'Keefe and Dostrovsky (1971)](https://pubmed.ncbi.nlm.nih.gov/5124915/) began in neuroscience and led to the theory of cognitive maps. RatSLAM was the first completed attempt to carry that biological mechanism into an engineering system. The door opened, but not many engineers walked through it.

Milford and Gordon Wyeth, then at the University of Queensland (both later moved to the Queensland University of Technology, QUT), ran outdoor driving experiments repeatedly on suburban roads in Brisbane between 2004 and 2008. The test vehicle mounted a camera on the roof and drove through suburban residential areas. RatSLAM received that image stream, recognized paths it had already traveled, and closed loops. The [Milford & Wyeth 2008](https://doi.org/10.1109/TRO.2008.2004520) IEEE T-RO paper reported results from processing tens of thousands of images over a 66 km route. At the same time, geometric SLAM systems were struggling at the scale of a few hundred meters, so by the numbers alone RatSLAM was ahead.

The engineering scaling stopped there, however. CAN grew in computational complexity as the number of places increased. The deeper problem was precision. The topological map RatSLAM built could judge "I have been here before," but it could not reliably produce metric position estimates at the meter level. Autonomous driving and manipulation demanded exact coordinates. Cognitive maps did not fit that demand.

> 📜 **Prediction vs. outcome.** In the Conclusion of the 2008 T-RO paper Milford and Wyeth argued that RatSLAM was "an alternative approach to vision-only SLAM" and that it performed repeatable, reliable loop closure in environments — long routes, large accumulated error, visual ambiguity — that would challenge the contemporary state-of-the-art SLAM. The claim was a position as an alternative, not a replacement. The actual development was partially right on the claim itself. RatSLAM showed competitiveness on specific benchmarks. But in the overall flow of the field after 2012, graph-based SLAM and visual odometry pulled ahead on both accuracy and speed, and while topological maps still appear in some place recognition research, the original RatSLAM ambition of metric-topological integration did not carry forward in another form. `[partial hit + abandoned]`

What RatSLAM left behind was not the algorithm itself. The idea that "place representation is possible without geometry" seeped into the place recognition literature. In 2012 [SeqSLAM](https://doi.org/10.1109/ICRA.2012.6224623) came out of the same Milford group, and image-sequence-based place recognition became one axis of visual place recognition benchmarks. The lineage itself survived; only its form had changed.

---

## 18.2 The engineering limits of biologically-inspired SLAM

RatSLAM was the most complete case of biologically-inspired SLAM, but it was not alone. From the mid-2000s to the early 2010s, SLAM variants imitating cognitive maps, entorhinal grid cells, and hippocampal replay came out steadily. They all carried similar problems.

Biological models describe *how* the brain represents space, but *why* it does so in that way, and whether that way also fits engineering purposes, is a different question. The rat hippocampus is a structure that hundreds of millions of years of evolution shaped to match specific environments and behavioral patterns. The conditions a robot operates in are not the same.

Engineering SLAM requires sub-meter position estimation accuracy, real-time processing, fast adaptation to new environments, and verifiable error bounds. Cognitive models had trouble guaranteeing these conditions. Neuroscience and robotics can draw inspiration from each other, but the gap was not short.

Entering the 2020s, there was room for this discussion to reopen. Observations appeared that the way foundation models form large-scale representation on their own structurally resembles the emergent properties of place cells. Whether this is rediscovery or convergence along a different path is not yet known.

---

## 18.3 Event SLAM — the gap between hardware and algorithm maturity

The [Dynamic Vision Sensor (DVS)](https://doi.org/10.1109/JSSC.2007.914337), developed by Patrick Lichtsteiner, Christoph Posch, and Tobi Delbruck at the ETH Zürich Institute of Neuroinformatics (INI), was published in the IEEE Journal of Solid-State Circuits in 2008. Each pixel independently compares the change in logarithmic light intensity against a threshold and asynchronously outputs an event of positive (ON) or negative (OFF) polarity. Without a global shutter, the firing moment is recorded per pixel at microsecond resolution. It was a camera without frames.
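
The per-pixel rule is simple enough to simulate. The sketch below models a single pixel fed with timestamped intensity samples; the contrast threshold and the sample values are assumptions, and real sensors add noise, refractory periods, and per-pixel threshold mismatch that are ignored here.

```python
import math

def pixel_events(samples, threshold=0.2):
    """Idealized event generation for ONE pixel.

    samples: list of (t_us, intensity) with intensity > 0.
    Returns a list of (t_us, polarity), polarity +1 for ON and -1 for OFF.
    """
    events = []
    ref = math.log(samples[0][1])        # log intensity at the last event
    for t, intensity in samples[1:]:
        delta = math.log(intensity) - ref
        # A large change can cross the threshold several times at once.
        while abs(delta) >= threshold:
            pol = 1 if delta > 0 else -1
            events.append((t, pol))
            ref += pol * threshold
            delta = math.log(intensity) - ref
    return events

samples = [(0, 100.0), (10, 100.0), (20, 150.0), (30, 150.0), (40, 90.0)]
events = pixel_events(samples)
```

A pixel that sees constant brightness produces nothing at all, which is where the low data rate and low power draw come from; only the brightening at 20 μs and the darkening at 40 μs generate output.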

> 🔗 **Borrowed.** The DVS event sensor (Lichtsteiner et al. 2008) was hardware inspired by the change-detection mechanism of the biological retina. Event SLAM started with this sensor in hand. Hardware ran ahead of algorithms, and closing that gap took ten years.

The event camera's advantages made an attractive list: μs-level temporal resolution, no blur in high-speed motion, high dynamic range (HDR) that handled both tunnels and direct sunlight, and power consumption at a fraction of conventional cameras. These were numbers that read well in a paper.

At ICRA 2014 [Weikersdorfer et al. 2014](https://doi.org/10.1109/ICRA.2014.6906882) presented event-based 3D SLAM. The same year, other groups released event-based optical flow and depth estimation. Between 2016 and 2018, Henri Rebecq (Davide Scaramuzza's group, the RPG lab at the University of Zurich) released [EVO](https://doi.org/10.1109/LRA.2016.2645143) (RA-L 2017) and [ESIM](https://proceedings.mlr.press/v87/rebecq18a.html) (CoRL 2018), and the event SLAM pipeline took concrete shape.

In real environments the results fell short of expectations. The first problem was resolution. Early DVS sensors were 128×128 pixels, not comparable to existing VGA cameras, and for SLAM where feature matching and map building depend directly on resolution, this constraint was heavy. The second was the algorithm paradigm itself. Existing frame-based algorithms could not be applied to event streams as-is, and building a new approach took time.

From 2014 to 2018, event SLAM produced good results in controlled environments and low-texture conditions, but did not outperform existing visual-inertial odometry in general environments.

Meanwhile, the application scope of the event approach quietly spread beyond odometry. [EventVLAD](https://ieeexplore.ieee.org/document/9635907/) (Lee & Kim, IROS 2021) paired edge images reconstructed from event streams with NetVLAD descriptors, showing that place recognition was possible even under sudden illumination changes and motion blur. It took events into exactly the conditions where frame-based visual place recognition (VPR) struggled.

---

## 18.4 Semantic SLAM — the shrinking of the object-as-landmark path

From 2017 to 2019, "semantic" did not leave the session titles at CVPR, ECCV, and IROS. This was the period when deep learning was advancing quickly in instance segmentation and object detection, and the question of what happens when that semantic understanding is integrated into SLAM circulated widely. The question itself was not wrong, but the execution did not keep up with the discourse.

[Salas-Moreno et al. 2013](https://doi.org/10.1109/CVPR.2013.178)'s **SLAM++** was the first large-scale declaration of that lineage. The group of Salas-Moreno and his advisor Andrew Davison at Imperial College took *objects* as the basic unit of the map instead of points or patches. They stored pre-defined 3D object models — chair, desk, monitor — in a database, and during SLAM execution they recognized those objects in RGB-D input via ICP (Iterative Closest Point) alignment and placed them on the map. Representing the map with tens of objects instead of thousands of points could shrink map size and make place recognition and loop closure more semantically grounded.

> 🔗 **Borrowed.** SLAM++'s object-level representation combined the scene graph representation from graphics with model-based recognition from computer vision. That idea carried forward into LERF (Language Embedded Radiance Field) and LangSplat in the 2020s. Only the representation unit changed from object to language feature; the intuition that "the map should be semantic" survived.

After SLAM++, between 2017 and 2019, [SemanticFusion](https://arxiv.org/abs/1609.05130) (McCormac et al., 2017, ICRA), [MaskFusion](https://arxiv.org/abs/1804.09194) (Rünz et al., 2018, ISMAR), and [SuperPoint](https://arxiv.org/abs/1712.07629) (DeTone et al., 2018)-based feature lines were released in succession. The shared claim across that period ran: deep semantic features are more robust to environmental change than geometric features, and SLAM with semantic understanding integrated is the next stage.

The actual development was different. Through 2020, what pushed performance up on autonomous driving benchmarks was traditional geometric pipelines like ORB-SLAM2, VINS-Mono, and LIO-SAM. Systems integrating deep semantic features were competitive only in specific indoor environments and with fixed object classes. With new object categories or in unseen environments, there were cases where semantic priors actually increased drift.

> 📜 **Prediction vs. outcome.** In the Conclusion of the SLAM++ paper, Salas-Moreno described his method as "a first step toward a more generic SLAM method," hoping it would extend to objects with low-dimensional shape variation, and ultimately to systems that segment and define object classes on their own. The paper's introduction added that an object-unit representation would bring "large map compression" and "gains in efficiency and robustness." The actual development partially hit the mark. Object-level maps found a place in AR and certain manipulation applications, and the compression and efficiency advantages were confirmed again in indoor environments with repeated objects. But mainstream geometric SLAM still retains sparse points and keyframe-based graphs as of 2026, and the stage where objects are segmented and defined autonomously has not been reached. Semantic ended up settling not inside SLAM but in downstream tasks — semantic mapping, task planning. `[partial hit + diverted]`

Why did semantic-first SLAM fail to become mainstream? One cause was dependency. Semantic SLAM needed segmentation to be accurate, and when segmentation was wrong the whole map was polluted. A geometric pipeline, by contrast, survived partial failure of feature matching through robust estimation. The other cause was generalization. Semantic priors trained on specific object classes were useless outside those classes, and the environments SLAM had to enter were far wider than the world those priors assumed.

What shrank was the object-as-landmark path, not semantic itself. In the same period another path survived. [SuMa++](https://doi.org/10.1109/IROS40897.2019.8967704) (Chen et al., IROS 2019) overlaid semantic classes onto LiDAR point clouds to filter out dynamic objects, and [Kimera](https://doi.org/10.1109/ICRA40945.2020.9196885) (Rosinol et al., ICRA 2020) bundled metric-semantic mesh with a 3D scene graph. [Hydra](https://doi.org/10.15607/RSS.2022.XVIII.050) (Hughes et al., RSS 2022) extended that scene graph in real time and hierarchically, and by [ConceptGraphs](https://doi.org/10.1109/ICRA57147.2024.10610243) (Gu et al., ICRA 2024) and [Clio](https://doi.org/10.1109/LRA.2024.3451395) (Maggio et al., RA-L 2024), open-vocabulary foundation features were layered on top. Semantic survived by rising not to the landmark seat but to a higher layer of the map. This lineage is in progress as of 2026, and is picked up again in [Ch.15b](chapter_15b_dynamic.md) (the semantic return of dynamic-static separation), [Ch.16](chapter_16_foundation_3d.md) (foundation 3D and metric-semantic core), and [Ch.19 §19.7](chapter_19_open_problems.md#197-the-return-of-semantic-representation-and-open-world) (The return of semantic).

---

## 18.5 The Manhattan-world assumption — scope of application and extinction

Around the same time, another lineage was quietly tried and quietly disappeared. It was SLAM using the Manhattan-world assumption.

The assumption itself was simple. Inheriting the Manhattan world concept from [Coughlan & Yuille 1999](https://doi.org/10.1109/ICCV.1999.790349), it treated indoor environments as mostly aligned to three orthogonal axes (x, y, z) of the world coordinate frame. Walls, floors, and ceilings supply those directions. Bundles of parallel lines in the image converge to vanishing points, and each vanishing point is described, in homogeneous coordinates and up to scale, by the relation `v = K R d` between the camera's rotation matrix R and a direction vector d (K: camera intrinsic matrix). Finding three orthogonal vanishing points recovers the three columns of R directly, each up to sign. Rotational drift could therefore be suppressed with geometric constraints alone, without an IMU or feature matching. Attempts to combine this idea with visual odometry appeared in this period.
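The recovery of R from vanishing points can be illustrated with a short NumPy sketch. It assumes the three vanishing points have already been detected and assigned to the x, y, and z axes, which is the hard part in practice and is not shown. The function names and the example intrinsics are hypothetical, and the SVD step is one common way to restore orthogonality under noise, not a claim about any specific published system.

```python
import numpy as np

def vanishing_point(K, R, d):
    """Vanishing point (pixels) of world direction d under v ~ K R d."""
    v = K @ R @ np.asarray(d, dtype=float)
    return v[:2] / v[2]

def rotation_from_vanishing_points(K, vps):
    """Estimate R from the vanishing points of the three Manhattan axes.

    vps: three (u, v) pixel coordinates, ordered x, y, z.
    Each column is recovered only up to sign, because a vanishing point
    does not say which way along the axis the direction points.
    """
    K_inv = np.linalg.inv(K)
    cols = []
    for u, v in vps:
        r = K_inv @ np.array([u, v, 1.0])   # r is proportional to R d_i
        cols.append(r / np.linalg.norm(r))
    M = np.stack(cols, axis=1)
    # Noisy detections are not exactly orthogonal: snap M to the nearest
    # orthogonal matrix via SVD.
    U, _, Vt = np.linalg.svd(M)
    R = U @ Vt
    if np.linalg.det(R) < 0:
        R[:, 2] *= -1.0   # use the sign freedom to get a proper rotation
    return R

# Example with made-up intrinsics and a camera tilted off all three axes.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
a, b = 0.4, 0.5
Rx = np.array([[1, 0, 0], [0, np.cos(a), -np.sin(a)], [0, np.sin(a), np.cos(a)]])
Ry = np.array([[np.cos(b), 0, np.sin(b)], [0, 1, 0], [-np.sin(b), 0, np.cos(b)]])
R_true = Rx @ Ry
vps = [vanishing_point(K, R_true, axis) for axis in np.eye(3)]
R_est = rotation_from_vanishing_points(K, vps)
```

Because the estimate comes from a single frame and does not depend on earlier frames, its error does not accumulate, which is why the constraint suppresses drift where the assumption holds.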

In long corridors and rectangular rooms drift did decrease. The problem was outside. Once outdoors, in environments with round structures, or in irregular industrial environments, the Manhattan-world assumption itself did not hold. A prior tightly fit to an environment became a liability outside that environment. As general-purpose visual-inertial odometry matured after 2015, this lineage lost attention. It remains as an auxiliary constraint in some indoor mapping tools, but as an independent research lineage it is gone.

---

## 18.6 Rediscovery patterns of extinct lineages

What it means for a lineage to die differs by case. RatSLAM's topological map idea carried into SeqSLAM, and its descendants are alive in the visual place recognition field. The object-level map intuition of SLAM++ returned in a different form after 2022, as NeRF and Gaussian splatting combined with language. [LERF](https://arxiv.org/abs/2303.09553) (Kerr et al., 2023) and [LangSplat](https://arxiv.org/abs/2312.16084) (Qin et al., 2023) are such cases.

Event camera SLAM followed a different path. The algorithms were not stuck; the hardware had not come that far yet. After 2022, event cameras above 640×480 came to market, and the need in high-speed drones and HDR environments became clear. The event vision community centered around [Guillermo Gallego](https://arxiv.org/abs/1904.08405) (TU Berlin) produced competitive results in event-based depth estimation and ego-motion estimation between 2020 and 2024.

Inspiration can be good yet engineering takes time to catch up, and even when the sensor is new the algorithm has to be built separately. How long it takes to close that gap depends on algorithm maturity, on how fast the hardware reaches practical use, and on whether a better alternative takes the seat first in between.

---

## 🧭 Still open

**Biologically-inspired SLAM.** The way foundation models form spatial representations through large-scale unsupervised learning has properties structurally similar to cognitive maps. Reports that place-cell-like units have been observed inside transformers appeared in 2023–2024. Whether this is convergence or coincidence is not known. There is a possibility that the RatSLAM-class lineage returns under a different name inside the foundation model paradigm.

**Event camera SLAM going mainstream.** After 2022, as commercial high-resolution event cameras spread, the research base widened. But the algorithm paradigm for processing event data effectively has not yet settled on a stable common framework. Integration with frame-based pipelines, new event representations, and the diversification of real-world benchmarks together with the establishment of evaluation standards are all in progress simultaneously. Whether it becomes mainstream is still too early to judge as of 2026.

**The direction of the "semantic map" concept.** After the 2017 overheating of semantic SLAM cooled, semantic representation was pushed outside SLAM, into downstream tasks. From 2023 onward, LERF and Gaussian splatting combined language features with dense scene representation, and another form appeared. Whether it will be internalized into SLAM or remain downstream again is not known. The question is whether the old pattern, that geometry has to be right first for semantics to be useful, will repeat this time, or whether a change in representation will overturn that order. Somewhere among the things treated as "solved" as of 2026 is the name a future author will add to this chapter; it has simply not surfaced yet.

The open threads from Ch.17 and Ch.18 — full visual-LiDAR fusion, algorithms for solid-state sensors, dynamic object handling, event camera maturity, the second life of semantic maps — do not stay in their own chapters. Ch.19 collects them alongside the unresolved items from every earlier chapter and puts them in the same room.

---

# Ch.19 — Today's Map and Tomorrow's Blanks

Ch.0 described the 2026 landscape this way. AR layers stick to the wall, indoor delivery robots tell kitchens from conference rooms without being handed a map, and a DUSt3R-family model returns 3D structure from a few photos in seconds. The description is accurate, and it both props up and undermines this book's premise.

What got solved is the 2003 problem. Static scene, stable lighting, bounded space, the geometry of a single camera — on top of those assumptions the EKF ran, graph SLAM closed loops, and ORB-SLAM managed keyframes. Each answer is a real answer, and each assumption was a simplification chosen in earnest.

The final section of each of the 18 chapters carries the same mark: the things still open. This is a harvest, not an invention — the flags planted right next to the spots where each chapter declared something solved, laid out in one place.

---

## 19.1 Lighting and environmental change: reality the camera cannot handle

A problem has trailed Visual SLAM since the moment it stepped outdoors. Conditions the camera's photometric model cannot handle show up first in the field, every time.

Learned descriptors beat ORB inside the training domain but lose consistency in underwater, thermal, and low-light imagery; as of 2026 there is still no consensus on which is more robust (see Ch.2 §2.7). The low-light and dynamic-scene tracking failures Ch.5 recorded were the very reason PTAM (2007) bounded itself to "Small AR Workspaces", and that bound remains the implicit assumption of most feature-based SLAM today (see Ch.5 §🧭).

In the direct-method lineage the problem is more structural. The foundational premise of brightness preservation collapses immediately under auto-exposure, strong backlight, and tunnel-to-outdoor transitions, and no full solution that dynamically estimates the lighting model has arrived (see Ch.8 §🧭). In place recognition the same barrier has sat in the same spot for over ten years. Even with [DINOv2](https://arxiv.org/abs/2304.07193)-based methods narrowing the gap, no single model links the snow-covered winter and leafy summer of [Nordland](https://nikosuenderhauf.github.io/projects/placerecognition/) and [Oxford RobotCar](https://robotcar-dataset.robots.ox.ac.uk/) at 99% accuracy (see Ch.10 §10.7).

ORB-SLAM's long-term map reuse runs into the same wall. Atlas made multi-map maintenance possible, but recognizing a morning map in the evening fails under lighting change (see Ch.7 §🧭). That Ch.2, 5, 7, 8, and 10 report the same barrier in their own languages is no accident.

---

## 19.2 Dynamic-world assumption: the oldest simplification hits its limits

The static-world assumption is SLAM's oldest simplification, and it is the assumption the most chapters planted their own flag next to.

In the SfM lineage dynamic objects are a shared weak point of every current system, COLMAP included, and as of 2026 no Dynamic SfM implementation has COLMAP-level generality (see Ch.3 §3.7). Everything in Ch.9 from KinectFusion through BundleFusion assumed a static scene, and while DynaSLAM, MaskFusion, and others coupled real-time segmentation into dense SLAM, neither cost nor robustness reached practical deployment (see Ch.9 §🧭).

In monocular depth, self-supervised methods mask out moving objects, which avoids the problem rather than solving it (see Ch.11 §🧭). 3DGS SLAM still holds the static-world assumption in 2025; [4DGS](https://arxiv.org/abs/2310.08528) and [Deformable 3DGS](https://arxiv.org/abs/2309.13101) explore a time dimension but an integrated way to represent and track dynamic objects in a SLAM setting does not exist (see Ch.15 §🧭). LiDAR SLAM is not exempt either: the dynamic-object problem Zhang anticipated in 2014 sits in the same spot, and the in-house solutions of Waymo and Argo AI are not publicly available general algorithms (see Ch.17 §🧭). That the same question comes back across five chapters probably means the right approach itself has not appeared yet.

The long-term dynamic and deformable items harvested in [Ch.15b](chapter_15b_dynamic.md) sit on the same layer. **Absence vs evidence of absence** (did an object vanish or was it occluded) got a partial answer in [Schmid's Panoptic Multi-TSDF](https://doi.org/10.1109/LRA.2022.3148854) (2022), but misjudgments remain frequent in large-scale outdoor scenes and when occlusion exceeds 60%. **Floating Map Ambiguity** (separating camera rigid motion from object rigid motion) is only skirted with isometric and visco-elastic priors; the conditions under which the two can be identified without a prior are unresolved. No system runs Khronos-level change-aware integration online from monocular RGB, and performance in medical MIS (minimally invasive surgery) degrades once it moves beyond phantom and ex vivo data into actual surgical conditions. All four items from Ch.15b remain open here.

---

## 19.3 Scale and representational memory: the problem changes when the size does

Every time a SLAM system expanded from one room to a building, and from a building to a city, the same question came back in a new form.

Monocular scale ambiguity is a geometric fact already proven in 1980s SfM theory; IMUs and depth sensors route around it, but the question of how a purely monocular method can hold metric scale keeps returning in shifted form (see Ch.5 §🧭). In Ch.11 the same question appears in another language. [Metric3D v2](https://arxiv.org/abs/2404.15506) and [Depth Anything v2](https://arxiv.org/abs/2406.09414) produce metric depth conditional on intrinsics, but the intrinsics-unknown case (smartphones, CCTV, archives, satellites) is common, and camera-independent metric depth is not easy even at foundation scale (see Ch.11 §🧭).
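The geometric fact behind the first sentence fits in a few lines: scaling the scene and the camera positions by the same factor leaves every pixel unchanged, so images alone cannot fix metric scale. The sketch below assumes a pinhole camera with identity rotation; the function names and the focal length are hypothetical.

```python
def project(point, cam_pos, f=500.0):
    """Pinhole projection of a 3D point seen from an axis-aligned camera at cam_pos."""
    x, y, z = (p - c for p, c in zip(point, cam_pos))
    return (f * x / z, f * y / z)

def scaled(vec, s):
    """Multiply every coordinate of vec by s."""
    return tuple(s * v for v in vec)

point = (1.0, 0.5, 4.0)                        # a landmark
cameras = [(0.0, 0.0, 0.0), (0.2, 0.0, 0.1)]   # two camera positions

for s in (0.1, 1.0, 3.0):
    pixels = [project(scaled(point, s), scaled(cam, s)) for cam in cameras]
    print(s, pixels)   # the same pixel coordinates for every s, up to rounding
```

A dollhouse filmed from centimetres away and a building filmed from metres away produce the same image sequence. An IMU or a depth sensor breaks the tie by measuring something in metres.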

In the TSDF lineage the memory problem surfaced as a limit of the representation. [Voxblox](https://arxiv.org/abs/1611.03631) and [OctoMap](https://octomap.github.io/) reduced cost, but building-floor and city-block dense representation is still tens of gigabytes, and an adaptive-resolution map has no general-purpose solution (see Ch.9 §🧭). NeRF-SLAM hit the same ceiling — city-scale is open (see Ch.14 §🧭). In Gaussian Splatting the Gaussian count rises linearly with scene size, indoor hundreds of thousands becoming outdoor tens of millions, and the [Compact 3DGS](https://arxiv.org/abs/2311.13681) (Lee et al. 2024) family is exploring compression without agreement (see Ch.15 §🧭). In foundation 3D the problem is redefined as a physical limit of the transformer: memory quadratic in image count is realistic at 100 but a different problem at 1,000 or 10,000, and Spann3R's incremental approach is only partial (see Ch.16 §🧭). Even as the representation changes, the barrier of size sits in the same spot.
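The quadratic growth mentioned for foundation 3D is easy to put in numbers. The sketch assumes all-to-all attention over the tokens of every image and the same token count per image, so the per-image token count cancels and only the image ratio matters; the function name is hypothetical and real systems may reduce the cost with windowing or memory mechanisms.

```python
def relative_attention_memory(n_images, baseline=100):
    """Memory of all-to-all attention relative to a baseline image count.

    With T tokens per image, attention stores on the order of (n * T)^2
    pairwise entries, so the ratio to the baseline is (n / baseline)^2.
    """
    return (n_images / baseline) ** 2

for n in (100, 1_000, 10_000):
    ratio = relative_attention_memory(n)
    print(f"{n:>6} images -> {ratio:>8,.0f}x the 100-image memory")
```

Going from 100 to 10,000 images multiplies the input by 100 and the attention memory by 10,000, which is why the text calls it a different problem rather than a bigger one.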

The other face of the size problem is **data movement cost**: not representational capacity but the physical cost of pushing bits between processor and memory, which drains power. In Handbook Ch.18 §18.8 Davison proposes "on-device data movement, measured in bits × millimetres" as the 12th SLAM performance metric, redefining a SLAM metric in the language of a hardware engineer. [Hughes et al.'s claim](https://doi.org/10.15607/RSS.2022.XVIII.050) that a hierarchical scene graph compresses memory from $O(L \cdot V/\delta^3)$ to $O(N_\text{sub} + N_\text{obj} + N_\text{rooms})$ sits in the same register (Handbook Ch.16 Eq. 16.34-16.36). Whether this redefinition will be broadly accepted has no verdict yet.

---

## 19.4 Uncertainty calibration for learning-based systems

Since Julier and Uhlmann proved the inconsistency of the EKF (Ch.4), how accurately a SLAM system knows that it does not know where it is has remained an open question for the field.

Non-Gaussian uncertainty touches the core assumption of the EKF. Real sensor errors are often multimodal or heavy-tailed; Stein particles, normalizing flows, and learned uncertainty are being tried, but real-time validation is limited (see Ch.4 §4.8). In graph SLAM, robust cost function selection still leans on intuition: no principled method decides in advance which of Huber, Cauchy, or Geman-McClure fits a given environment and sensor (see Ch.6 §🧭). The tightness bounds of [Ch.6b](chapter_06b_certifiable.md) sit on the same layer. SE-Sync's exact recovery theorem gave the sufficient condition "noise below $\beta$" but no way to compute $\beta$ in advance on an actual instance. Extending the certifiable framework to Visual SLAM and VIO, and online certification that re-solves the SDP as new measurements come in, also remain open.
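To make the choice among the three kernels named above concrete, the sketch below compares the weight each one gives a residual in iteratively reweighted least squares, where the weight is the kernel's derivative divided by the residual. The function names and the unit scale parameters are assumptions for illustration; libraries parameterize these kernels in slightly different ways.

```python
def huber_weight(r, k=1.0):
    """Huber: full weight inside |r| <= k, then decays as k / |r|."""
    a = abs(r)
    return 1.0 if a <= k else k / a

def cauchy_weight(r, c=1.0):
    """Cauchy: weight 1 / (1 + (r/c)^2), decays quadratically."""
    return 1.0 / (1.0 + (r / c) ** 2)

def geman_mcclure_weight(r, c=1.0):
    """Geman-McClure: weight 1 / (1 + (r/c)^2)^2, decays fastest of the three."""
    return 1.0 / (1.0 + (r / c) ** 2) ** 2

for r in (0.5, 2.0, 10.0):
    print(f"r={r:5.1f}  huber={huber_weight(r):.4f}  "
          f"cauchy={cauchy_weight(r):.4f}  gm={geman_mcclure_weight(r):.6f}")
```

All three agree on small residuals and disagree on how hard to discount large ones. Which discount is right depends on how often large residuals are outliers in a given sensor and scene, and that is the part still decided by intuition.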

In learning-based methods the problem returns sharper. After Bayesian PoseNet's failure, how well calibrated a learned uncertainty estimate is under out-of-distribution (OOD) input remains open (see Ch.12 §🧭). As the DROID-SLAM lineage confirmed, learned priors degrade silently outside the training domain: geometric failure is explicit, whereas learned failure looks plausible. [TartanAir](https://arxiv.org/abs/2003.14338)-style synthetic training still leaves a sim-to-real gap (see Ch.13 §🧭).

In foundation 3D the same problem redefines loop closure. How to propagate a correction through a pointmap-based map is the open part: MASt3R-SLAM handles it with existing methods, but whether that is a principled solution is unknown (see Ch.16 §🧭). Autonomous driving and medical robotics require calibrated uncertainty, but few systems treat it at that level.

Davison reformulates this. *"If a network has built a 3D model from 100 images, does adding one more image require running the whole thing again"* (Handbook Ch.18, p.528). The moment one admits long-term representation and fusion are needed, probabilistic state estimation and modular scene representation become necessary. The [GBP Learning](https://arxiv.org/abs/2312.14294) lineage (Nabarro et al.) puts network weights as random variables of a factor graph, erasing the divide between *"training time"* and *"test time"* (p.543). Whether this is the principled answer or a transfer into another assumption system is too early to judge.

---

## 19.5 Sensor fusion and new modalities: integration unfinished

Visual SLAM and LiDAR SLAM solved the same problem in different languages at the same time. The two lineages have never substantively merged.

LVI-SAM coupled visual odometry into LIO-SAM but only at loosely-coupled level; scenarios where cameras fail in fog and rain and LiDAR must take over are a clear autonomous-driving requirement, yet the algorithmic and calibration difficulty of tightly coupled fusion remains a barrier (see Ch.17 §🧭). The algorithmic gap from solid-state LiDAR sits on the same layer — LOAM and FAST-LIO presume 360° spinning, and the non-repetitive scan patterns of Livox and RoboSense need separate research without adequate generalization (see Ch.17 §🧭).

Wide-baseline matching is another angle on modality fusion. Past 45 degrees of viewpoint change Harris- and ORB-based matching drops sharply, and DUSt3R opened a breakthrough by avoiding matching itself, but whether this is the end of the descriptor problem or a bypass is too early to judge (see Ch.2 §2.7). The integration between place recognition and metric localization is a pipeline-level disconnect; 2023-2025 attempts to unify them into a single representation exist but none achieves both precision and speed (see Ch.10 §10.7).

Event cameras show how far algorithms lag when a modality is new. Commercial high-resolution event cameras spread after 2022, but integration with frame-based pipelines, new event representations, and real-world benchmarks are all underway simultaneously (see Ch.18 §🧭). The same sequence as Kinect launching in 2010 and KinectFusion arriving a year later.

There are modalities this book left out of scope: **4D imaging radar** and **legged/proprioceptive SLAM**. Radar is the only commercial sensor covering the fog and rain conditions where camera and LiDAR both fail; with Oxford Radar RobotCar (2019), the radar channel of nuScenes, and 4D imaging radar (Arbe, Mobileye) after 2023, it entered the mainstream of autonomous-driving research. Legged SLAM opened a separate lineage fusing kinematic and contact priors, alongside the outdoor deployment of ANYmal, Spot, and Unitree robots in the 2020s. Both have different origins and benchmarks from the visual, LiDAR, and foundation-3D lineages, and each is large enough to need its own history.

---

## 19.6 Recoupling compute structure and hardware

An axis that rarely appeared in SLAM histories came to the front in Davison's Handbook Ch.18 in the late 2020s: matching the graph structure of the algorithm to the graph structure of the silicon.

Dennard scaling broke and single-core CPU clock speed stalled near 4 GHz in the mid-2000s; of the old expectation that serial processors keep getting faster, Davison writes *"this has stopped being true"* (Handbook Ch.18, p.528). Meanwhile the constraint of wearable Spatial AI stays at one pair of glasses: 65 g, under 1 W. That gap pushes the field into a **heterogeneous, specialized, parallel** architecture.

Concrete silicon cases gathered in the mid-2020s. [Apple Vision Pro R1](https://www.apple.com/apple-vision-pro/specs/) (2023) is a dedicated chip for 12 ms sensor processing, [Meta ARIA Gen 2](https://www.projectaria.com/ariagen2/) (2024) carries "ultra low power and on-device machine perception" custom silicon, the [Graphcore IPU](https://www.graphcore.ai/products/ipu) has thousands of independent cores with local memory communicating by message passing, Manchester's [SCAMP5](https://personalpages.manchester.ac.uk/staff/p.dudek/papers/carey-iscas2013.pdf) implements 256×256 per-pixel in-plane processing at 1.2W, and [SpiNNaker](https://apt.cs.manchester.ac.uk/projects/SpiNNaker/) ties up to one million ARM cores in a neuromorphic structure. Each demands a different graph topology, and there is no systematic theory yet for how to map Spatial AI algorithms onto which silicon.

On this axis Davison's later track — **Gaussian Belief Propagation** — found its place. [Ortiz et al.](https://arxiv.org/abs/2203.11618) (2022) accelerated bundle adjustment on the IPU with GBP by 30× over CPU, and [Murai et al. Robot Web](https://arxiv.org/abs/2306.04620) (2024) showed multi-robot SLAM where robots share factor graph fragments over Wi-Fi and converge through asynchronous message passing. *"We must get away from the idea that a 'god's eye view' of the whole structure of the graph will ever be available"* (Handbook Ch.18, p.541) is the philosophy. Take the factor graph as master representation, give up full-posterior computation, and let messages "bubble" over the graph, converging locally. Whether this approach combines with transformer-based systems like MASt3R-SLAM or remains a separate stem is still unanswered.
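The message-passing idea can be seen on the smallest possible case. The sketch below runs scalar Gaussian belief propagation on a 1D pose chain, with every message and belief held in information form as an (eta, lambda) pair. It is a toy under stated assumptions, not the IPU or Robot Web implementation: the function name `gbp_chain`, the factor tuples, and the synchronous update schedule are all hypothetical choices made for brevity.

```python
def gbp_chain(n, priors, odometry, iters=10):
    """Scalar Gaussian belief propagation on a 1D pose chain.

    priors:   list of (var, z, sigma)    unary factors,    x_var ~ z
    odometry: list of (a, b, d, sigma)   pairwise factors, x_b - x_a ~ d
    Returns the belief means after `iters` rounds of message passing.
    """
    # Factor-to-variable messages, initialised to "no information".
    msgs = {(k, v): (0.0, 0.0)
            for k, (a, b, _, _) in enumerate(odometry) for v in (a, b)}

    def beliefs():
        # A belief is the sum of everything arriving at the variable.
        eta, lam = [0.0] * n, [0.0] * n
        for v, z, sigma in priors:
            eta[v] += z / sigma**2
            lam[v] += 1.0 / sigma**2
        for (_, v), (m_eta, m_lam) in msgs.items():
            eta[v] += m_eta
            lam[v] += m_lam
        return eta, lam

    for _ in range(iters):
        eta, lam = beliefs()
        new = {}
        for k, (a, b, d, sigma) in enumerate(odometry):
            w = 1.0 / sigma**2
            # Variable-to-factor: belief minus what this factor sent before.
            ea, la = eta[a] - msgs[(k, a)][0], lam[a] - msgs[(k, a)][1]
            eb, lb = eta[b] - msgs[(k, b)][0], lam[b] - msgs[(k, b)][1]
            # Message to b: fold a's message into the factor, marginalise a.
            la_f, ea_f = w + la, -w * d + ea
            new[(k, b)] = (w * d + w * ea_f / la_f, w - w * w / la_f)
            # Message to a: the same with the roles swapped.
            lb_f, eb_f = w + lb, w * d + eb
            new[(k, a)] = (-w * d + w * eb_f / lb_f, w - w * w / lb_f)
        msgs = new  # synchronous round; each factor only used local data

    eta, lam = beliefs()
    return [e / l for e, l in zip(eta, lam)]

# Four poses: prior at 0, three unit steps, and a measurement of 3.6 at the end.
means = gbp_chain(
    n=4,
    priors=[(0, 0.0, 1.0), (3, 3.6, 1.0)],
    odometry=[(0, 1, 1.0, 1.0), (1, 2, 1.0, 1.0), (2, 3, 1.0, 1.0)],
)
print(means)
```

With all five factors at unit variance, the 0.6 of disagreement between odometry and the end measurement is split evenly, so the means settle at approximately 0.12, 1.24, 2.36, and 3.48. No step ever assembled or inverted a matrix over the whole graph; each factor touched only its own two variables, which is the property that maps onto many small cores or many robots.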

Of Davison's twelve metrics, number 11 "power usage" and number 12 "on-device data movement" are the new ones in the language of hardware engineering — a proposal to evaluate by **power and distance traveled** as much as by accuracy. Whether this will be absorbed into mainstream benchmarks like TUM, KITTI, and EuRoC is without consensus. It is an area outside the algorithm-centered bias of this book, and that bias itself is being newly problematized in the late 2020s.

---

## 19.7 The return of semantic representation and Open-World

The verdict that semantics shrank out of SLAM's landmark slot ([Ch.18 §18.4](chapter_18_dead_ends.md#184-semantic-slam--the-shrinking-of-the-object-as-landmark-path)) is true in the narrow sense. Neither ORB-SLAM3 nor MASt3R-SLAM uses object-level primitives. But over the same period semantics rose onto an **upper layer of the map** and built a trajectory of real success — a path the Ch.1-18 narrative does not fully surface.

The trajectory is clear. [Kimera](https://doi.org/10.1109/ICRA40945.2020.9196885) (2020) bundled metric-semantic mesh and 3D scene graph, and [Hydra](https://doi.org/10.15607/RSS.2022.XVIII.050) (2022) extended it in real time and hierarchically — *"first online system to produce fully hierarchical scene graphs that included objects, places, and rooms"* (Handbook Ch.16, §16.4.2). Foundation features sat on top. [ConceptFusion](https://arxiv.org/abs/2302.07241) and [VLMaps](https://arxiv.org/abs/2210.05714) (2023) put CLIP into a dense map, [ConceptGraphs](https://doi.org/10.1109/ICRA57147.2024.10610243) (2024) into open-vocabulary object nodes, [Clio](https://doi.org/10.1109/LRA.2024.3451395) (2024) into task-driven hierarchy, and [LERF](https://arxiv.org/abs/2303.09553) and [LangSplat](https://arxiv.org/abs/2312.16084) into radiance fields and Gaussian splatting. Semantic SLAM did not die; it raised its representational layer.

But this trajectory opened more than it resolved. The open problem Hughes and Carlone single out is that *"performing uncertainty quantification in hierarchical representations mixing discrete and continuous variables is still a largely unexplored problem"* (p.488). When discrete variables like object category and room ID mix with continuous ones like pose and surface in the same graph, how to propagate uncertainty has no principled answer. Extending scene graphs into outdoor and unstructured environments is also open, and dynamically reconfiguring task-driven hierarchy (Clio's Information Bottleneck, Handbook Ch.16 Eq. 17.8) lacks generalization.

A larger question is "do we still need a map". Paull and the editors take it up directly in Handbook Ch.17 §17.4.2 "Revisiting the Question of the Need for Maps". If one feeds every past frame into a long-context VLM, is planning possible without an explicit scene graph? [OpenEQA](https://open-eqa.github.io/) and [Mobility VLA](https://arxiv.org/abs/2407.07775) (2024) show that map-free approaches work on short and simple tasks but fail as spatial and temporal horizons lengthen. *"the need for an explicit map representation ... largely depend[s] on the spatial and temporal horizons of the considered tasks and remains an active area of research"* (p.515). Neither "solved" nor "unneeded" has been declared.

The relation of SLAM to generative robot policies sits open on the same horizon. Do VLA models like [RT-2](https://robotics-transformer2.github.io/) (2023), [OpenVLA](https://arxiv.org/abs/2406.09246) (2024), and [π₀](https://www.physicalintelligence.company/blog/pi0) (2024) replace SLAM, or sit on top of it? The **final sentence** of the whole Handbook answers. *"true generalization and scalability to compositional tasks ... could be achieved through some form of explicit structure that is learned through a process such as SLAM. ... these two paradigms ... are entirely complementary"* (Paull/Carlone, Handbook Ch.17, p.520). 527 pages converge into one sentence declaring that the two lineages need each other. It is the position closest to consensus, but what architectural combination "complementary" means is still open.

---

## 19.8 The shape of the open questions

Gathering the open items this book has tracked across 18 chapters reveals a pattern.

The open problems are not all open in the same way. The monocular scale ambiguity of Ch.5 is a geometric fact already proven in SfM theory, and it stays in the same formulation in 2026. The dynamic-world problem, in contrast, has come back in shifting forms over twenty years. It appeared differently in the SfM language of Ch.3, the dense-SLAM language of Ch.9, the Gaussian language of Ch.15, and the LiDAR language of Ch.17. How to redefine loop closure in the foundation-3D lineage, and how to calibrate learned uncertainty, earned the name of problem only recently. As of 2026 they have been named for only a few years.

Ch.0 described the era in which SLAM is taken to be solved. The description is accurate. That the five editors of the 2026 SLAM Handbook wrote jointly in the Epilogue *"If someone tells you 'SLAM is solved,' don't listen to them"* is the same landscape seen from inside. The history of SLAM is not a history of stacking up new things but a history of learning which assumption to let go of, and when. The moment one assumption is released, a problem previously closed returns in new form. When the EKF's linearity assumption was set down particle filters followed, and when sparse features were set down dense methods came in, each transition moving into a new assumption system rather than discarding the earlier method.

What is taken as solved in 2026 also sits somewhere inside this cycle. When the assumption now held with confidence begins to shake, the blanks open again.

---

## 19.9 Lineage map

```mermaid
graph TD
 PM[Photogrammetry 1858]
 BA[Bundle Adjustment Brown 1958]
 SfM[Photo Tourism 2006]
 COLMAP[COLMAP 2016]

SC[Smith-Cheeseman 1986]
  Mono[MonoSLAM 2003]
  PTAM[PTAM 2007]
  ORB[ORB-SLAM 2015]
  ORB3[ORB-SLAM3 2020]

LSD[LSD-SLAM 2014]
  DSO[DSO 2016]
  VIDSO[VI-DSO 2018]

LM[Lu-Milios 1997]
 FG[Factor Graph Dellaert 2000s]
 iSAM[iSAM 2008]
 iSAM2[iSAM2 2012]
 g2o[g2o 2011]

Forster[Preintegration Forster 2016]
 VINS[VINS-Mono 2018]

Kinect[KinectFusion 2011]
  Elastic[ElasticFusion 2015]

SESync[SE-Sync 2019]
  TEASER[TEASER 2020]

LOAM[LOAM 2014]
  FAST[FAST-LIO 2021]

NeRF[NeRF 2020]
  iMAP[iMAP 2021]
  NICE[NICE-SLAM 2021]

GS3D[3DGS 2023]
  Spla[SplaTAM 2024]
  MonoGS[MonoGS 2024]

DROID[DROID-SLAM 2021]
  DPV[DPV-SLAM 2024]

DUSt3R[DUSt3R 2023]
  MASt[MASt3R 2024]
  VGGT[VGGT 2025]
  MASlam[MASt3R-SLAM 2025]

Hydra[Hydra 2022]
  Clio[Clio 2024]

click PM "#chapter-1" "Ch.1 Prehistory — Photogrammetry"
  click BA "#chapter-1" "Ch.1 Prehistory — Bundle Adjustment"
  click SfM "#chapter-3" "Ch.3 Structure from Motion"
  click COLMAP "#chapter-3" "Ch.3 SfM — COLMAP"
  click SC "#chapter-4" "Ch.4 EKF-SLAM — Smith-Cheeseman"
  click Mono "#chapter-5" "Ch.5 MonoSLAM·PTAM"
  click PTAM "#chapter-5" "Ch.5 MonoSLAM·PTAM"
  click ORB "#chapter-7" "Ch.7 ORB-SLAM family"
  click ORB3 "#chapter-7" "Ch.7 ORB-SLAM3"
  click LSD "#chapter-8" "Ch.8 Direct Methods — LSD-SLAM"
  click DSO "#chapter-8" "Ch.8 Direct Methods — DSO"
  click VIDSO "#chapter-8" "Ch.8 Direct Methods — VI-DSO"
  click LM "#chapter-6" "Ch.6 Graph SLAM — Lu-Milios"
  click FG "#chapter-6" "Ch.6 Graph SLAM — Factor Graph"
  click iSAM "#chapter-6" "Ch.6 Graph SLAM — iSAM"
  click iSAM2 "#chapter-6" "Ch.6 Graph SLAM — iSAM2"
  click g2o "#chapter-6" "Ch.6 Graph SLAM — g2o"
  click Forster "#chapter-7" "Ch.7b IMU Preintegration (after Ch.7)"
  click VINS "#chapter-7" "Ch.7 — VINS-Mono"
  click Kinect "#chapter-9" "Ch.9 RGB-D — KinectFusion"
  click Elastic "#chapter-9" "Ch.9 RGB-D — ElasticFusion"
  click SESync "#chapter-6" "Ch.6b Certifiable (after Ch.6)"
  click TEASER "#chapter-6" "Ch.6b Certifiable — TEASER"
  click LOAM "#chapter-17" "Ch.17 LiDAR — LOAM"
  click FAST "#chapter-17" "Ch.17 LiDAR — FAST-LIO"
  click NeRF "#chapter-14" "Ch.14 NeRF-SLAM"
  click iMAP "#chapter-14" "Ch.14 NeRF-SLAM — iMAP"
  click NICE "#chapter-14" "Ch.14 NeRF-SLAM — NICE-SLAM"
  click GS3D "#chapter-15" "Ch.15 Gaussian Splatting"
  click Spla "#chapter-15" "Ch.15 — SplaTAM"
  click MonoGS "#chapter-15" "Ch.15 — MonoGS"
  click DROID "#chapter-13" "Ch.13 Hybrid — DROID-SLAM"
  click DPV "#chapter-13" "Ch.13 — DPV-SLAM"
  click DUSt3R "#chapter-16" "Ch.16 Foundation 3D — DUSt3R"
  click MASt "#chapter-16" "Ch.16 — MASt3R"
  click VGGT "#chapter-16" "Ch.16 — VGGT"
  click MASlam "#chapter-16" "Ch.16 — MASt3R-SLAM"
  click Hydra "#chapter-16" "Ch.16 §16.6 Semantic Foundation"
  click Clio "#chapter-16" "Ch.16 §16.6 — Clio"
```