In-domain Pretraining for Vision-and-Language Navigation

Pierre-Louis Guhur 1, 2, Makarand Tapaswi 3, Shizhe Chen 1, Ivan Laptev 1, 2, Cordelia Schmid 1, 2

1 Inria Paris, 2 ENS, CNRS, PSL Research University, 3 IIIT Hyderabad

How to follow instructions in environments with new objects?

VLN tasks are evaluated in unseen environments at test time, and these environments contain new objects. How can an agent follow an instruction that refers to a Christmas tree if it has never encountered one in its language or visual training corpus?

Build a large-scale dataset with navigation instructions from BnB listings

We build a large-scale, visually diverse, and in-domain dataset by creating path-instruction pairs close to a VLN-like setup and show the benefits of self-supervised pretraining.

Building path-instruction pairs from the BnB dataset

Though navigation instructions are rarely found on the Internet, image-caption pairs from home environments are abundant in online marketplaces (e.g., Airbnb), which include images and descriptions of rental listings.

We collect BnB, a new large-scale dataset with 1.4M indoor images and 0.7M captions. First, we show that in-domain image-caption pairs bring additional benefits for downstream VLN tasks when combined with generic web data [17]. To further reduce the domain gap between BnB pretraining and the VLN task, we present an approach that transforms static image-caption pairs into visual paths and navigation-like instructions, leading to large additional performance gains.
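The transformation from static image-caption pairs to path-instruction pairs can be illustrated with a minimal sketch. It assumes the simplest possible strategy: sample a few captioned images from one listing, treat their order as a visual path, and join their captions into a pseudo-instruction. The names `CaptionedImage` and `build_path_instruction`, the `path_len` parameter, and the " and then " connector are hypothetical and chosen for illustration only; they are not taken from the Airbert implementation.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class CaptionedImage:
    """One image of a rental listing with its (possibly empty) caption."""
    image_id: str
    caption: str


def build_path_instruction(
    listing: List[CaptionedImage],
    path_len: int = 4,
    rng: Optional[random.Random] = None,
) -> Optional[Tuple[List[str], str]]:
    """Turn one listing into a (path, instruction) pair.

    Illustrative assumption: the path is a random ordering of captioned
    images, and the instruction is their captions joined by a connector.
    """
    rng = rng or random.Random()
    # Only images that come with a caption can contribute to the instruction.
    captioned = [img for img in listing if img.caption.strip()]
    if len(captioned) < 2:
        return None  # a single image cannot form a path
    steps = rng.sample(captioned, min(path_len, len(captioned)))
    path = [step.image_id for step in steps]
    clauses = [step.caption.strip().rstrip(".") for step in steps]
    # Hypothetical connector that makes the captions read like a route.
    instruction = " and then ".join(clauses) + "."
    return path, instruction
```

Because the path and the instruction are built from the same ordered sample, the i-th clause of the instruction always describes the i-th image of the path, which is the alignment a VLN model is expected to learn.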

Airbert: A Pretrained VLN Model

Our pretrained model, Airbert, is a generic transformer backbone that can be readily integrated into both discriminative VLN tasks, such as path-instruction compatibility prediction [17], and generative VLN tasks [18], for R2R navigation [2] and REVERIE remote referring expressions [19]. We achieve state-of-the-art performance on these VLN tasks with our pretrained model. Beyond the standard evaluation, our in-domain pretraining opens an exciting new direction of one-shot and few-shot VLN, where the agent is trained on examples from only one or a few environments and is expected to generalize to other unseen environments.

We also propose a shuffling loss that improves the model’s temporal reasoning abilities by learning a temporal alignment between a path and the corresponding instruction.
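One way to picture a shuffling objective is as a multiple-choice problem: the correctly ordered path competes against copies of itself whose steps have been permuted, and the model is penalized when it does not rank the original order highest. The sketch below assumes that formulation with a softmax cross-entropy over candidate scores. The helper names `shuffled_negatives` and `shuffling_loss` are hypothetical, and the scores would come from a compatibility model that is not shown here; the exact loss used to train Airbert may differ.

```python
import math
import random
from typing import List, Sequence


def shuffled_negatives(
    path: Sequence[str], num_negatives: int, rng: random.Random
) -> List[List[str]]:
    """Return copies of `path` whose step order differs from the original."""
    if len(set(path)) < 2:
        raise ValueError("need at least two distinct steps to shuffle")
    original = list(path)
    negatives = []
    while len(negatives) < num_negatives:
        candidate = list(original)
        rng.shuffle(candidate)
        if candidate != original:  # reject the identity permutation
            negatives.append(candidate)
    return negatives


def shuffling_loss(scores: Sequence[float], positive_index: int = 0) -> float:
    """Softmax cross-entropy where the correctly ordered path is the target.

    `scores` holds one compatibility score per candidate path for the same
    instruction: the original order plus its shuffled negatives.
    """
    max_score = max(scores)  # subtract the max for numerical stability
    log_sum_exp = max_score + math.log(
        sum(math.exp(s - max_score) for s in scores)
    )
    return log_sum_exp - scores[positive_index]
```

The loss shrinks only when the ordered path outscores its permutations, so the model cannot rely on which images appear and has to use the order in which they appear.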


We evaluate Airbert on the test set by submitting our best method to the R2R leaderboard. As shown in the following table, our method achieves the highest success rate, 77%, and is ranked first as of the submission deadline.

Method                  PL      NE     SPL   OSR  SR
Speaker-Follower [27]   1,257   4.87   0.01  96   53
PreSS [16]              10.5    24.5   0.63  57   53
PREVALENT [14]          10.21   4.52   0.56  64   59
Self-Monitoring [28]    373     4.48   0.02  97   61
Reinforced CM [31]      358     4.03   0.02  96   63
EnvDrop [2]             687     3.26   0.01  99   69
AuxRN [51]              41      3.24   0.21  81   71
VLN-BERT [17]           687     3.09   0.01  99   73
Airbert (ours)          687     2.69   0.01  99   77
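For readers unfamiliar with the success rate reported above, the following sketch shows how it is typically computed, alongside SPL (success weighted by path length), a metric commonly reported with it on R2R. It assumes the usual convention that an episode succeeds when the agent stops within 3 m of the goal; the function names and the `threshold` parameter are illustrative, and the official leaderboard evaluation code may differ in details.

```python
from typing import Sequence


def success_rate(final_distances: Sequence[float], threshold: float = 3.0) -> float:
    """Fraction of episodes that end closer than `threshold` meters to the goal."""
    successes = sum(1 for d in final_distances if d < threshold)
    return successes / len(final_distances)


def spl(
    final_distances: Sequence[float],
    path_lengths: Sequence[float],
    shortest_lengths: Sequence[float],
    threshold: float = 3.0,
) -> float:
    """Success weighted by path length.

    Each successful episode contributes shortest / max(taken, shortest),
    so long detours are penalized even when the agent reaches the goal.
    Assumes all shortest-path lengths are strictly positive.
    """
    total = 0.0
    for dist, taken, shortest in zip(final_distances, path_lengths, shortest_lengths):
        if dist < threshold:
            total += shortest / max(taken, shortest)
    return total / len(final_distances)
```

Under these definitions, an agent that explores very long trajectories before stopping can reach a high success rate while scoring a low SPL.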