Airbert

In-domain Pretraining for Vision-and-Language Navigation

Pierre-Louis Guhur 🏠, Makarand Tapaswi 🏠 🏢, Shizhe Chen 🏠, Ivan Laptev 🏠, Cordelia Schmid 🏠 🛖

🏠 Inria Paris, 🏢 IIIT Hyderabad, 🛖 Google Research

Abstract

Vision-and-language navigation (VLN) aims to enable embodied agents to navigate in realistic environments using natural language instructions. Given the scarcity of domain-specific training data and the high diversity of image and language inputs, the generalization of VLN agents to unseen environments remains challenging.

Recent methods explore pretraining to improve generalization; however, the use of generic image-caption datasets or existing small-scale VLN environments is suboptimal and results in limited improvements.

In this work, we introduce BnB, a large-scale and diverse in-domain VLN dataset. We first collect image-caption (IC) pairs from hundreds of thousands of listings from online rental marketplaces. Using IC pairs we next propose automatic strategies to generate millions of VLN path-instruction (PI) pairs. We further propose a shuffling loss that improves the learning of temporal order inside PI pairs.

We use BnB to pretrain our Airbert model that can be adapted to discriminative and generative settings and show that it outperforms state of the art for Room-to-Room (R2R) navigation and Remote Referring Expression (REVERIE) benchmarks. Moreover, our in-domain pretraining significantly increases performance on a challenging few-shot VLN evaluation, where we train the model only on VLN instructions from a few houses.


1. Introduction

Figure 1: VLN tasks are evaluated on unseen environments at test time. Top: None of the training houses contain a Christmas theme making this test environment particularly challenging. Bottom: We build a large-scale, visually diverse, and in-domain dataset by creating path-instruction pairs close to a VLN-like setup and show the benefits of self-supervised pretraining.

In vision-and-language navigation (VLN), an agent is asked to navigate in home environments following natural language instructions [1], [2]. This task is attractive to many real-world applications such as domestic robotics and personal assistants. However, given the high diversity of VLN data across environments and the difficulty of the manual collection and annotation of VLN training data at scale, the performance of current methods remains limited, especially for previously unseen environments [3].

Our work is motivated by significant improvements in vision and language pretraining [4]–[9], where deep transformer models [10] are trained via self-supervised proxy tasks [11] using large-scale, automatically harvested image-text datasets [12], [13]. Such pretraining enables learning transferable multi-modal representations achieving state-of-the-art performance in various vision and language tasks. Similarly, with the goal of learning an embodied agent that generalizes, recent works [14]–[17] have explored different pretraining approaches for VLN tasks.

In [14], [15], annotated path-instruction pairs are augmented with a speaker model that generates instructions for random unseen paths. However, as these paths originate from a small set of 61 houses used during training, they are limited in visual diversity. The limited pretraining environments do not equip agents with visual understanding abilities that enable generalization to unseen houses, see Fig. 1. To address this problem, VLN-BERT [17] proposes to pretrain the agent on generic image-caption datasets that are abundant and cover diverse visio-linguistic knowledge. However, these image-caption pairs are quite different from the dynamic visual stream (path) and navigable instructions observed by a VLN agent, and such out-of-domain pretraining, although promising, only brings limited gains to the navigation performance. Besides the above limitations, existing pretraining methods do not place much emphasis on temporal reasoning abilities in their self-supervised proxy tasks such as one-step action prediction [14] and path-instruction pairing [17], while such reasoning is important to a sequential decision making task like VLN. As a result, even if performance in downstream tasks is improved, the pretrained models may still be brittle. For example, a simple corruption of instructions by swapping noun phrases within the instruction, or replacing them with other nouns, leads to significant confusion as models are unable to pick the correct original pair.

In this paper, we explore a different data source and proxy tasks to address the above limitations in pretraining a generic VLN agent. Though navigation instructions are rarely found on the Internet, image-caption pairs from home environments are abundant in online marketplaces (e.g., Airbnb), which include images and descriptions of rental listings. We collect BnB, a new large-scale dataset with 1.4M indoor images and 0.7M captions. First, we show that in-domain image-caption pairs bring additional benefits for downstream VLN tasks when combined with generic web data [17]. In order to further reduce the domain gap between BnB pretraining and the VLN task, we present an approach to transform static image-caption pairs into visual paths and navigation-like instructions (Fig. 1 bottom), leading to large additional performance gains. We also propose a shuffling loss that improves the model’s temporal reasoning abilities by learning a temporal alignment between a path and the corresponding instruction.

Our pretrained model, Airbert, is a generic transformer backbone that can be readily integrated in both discriminative VLN tasks such as path-instruction compatibility prediction [17] and generative VLN tasks [18] in R2R navigation [2] and REVERIE remote referring expression [19]. We achieve state-of-the-art performance on these VLN tasks with our pretrained model. Beyond the standard evaluation, our in-domain pretraining opens an exciting new direction of one/few-shot VLN where the agent is trained on examples only from one/few environment(s) and expected to generalize to other unseen environments.

In summary, the contributions of this work are three-fold. (1) We collect a new large-scale in-domain dataset, BnB, to promote pretraining for vision-and-language navigation tasks. (2) We curate the dataset in different ways to reduce the distribution shift between pretraining and VLN and also propose the shuffling loss to improve temporal reasoning abilities. (3) Our pretrained Airbert can be plugged into generative or discriminative architectures and achieves state-of-the-art performance on R2R and REVERIE datasets. Moreover, our model generalizes well under a challenging one/few-shot VLN evaluation, truly highlighting the capabilities of our learning paradigm. We will release the code, model, and data.

2. Related Work

Vision-and-language navigation. VLN [2] has received significant attention with a large number of followup tasks introduced in recent years [1], [19]–[26]. Early days of VLN saw the use of sequence-to-sequence LSTMs to predict low-level actions [2] or high-level directions in a panoramic action space [27]. For better cross-modal alignment, a visio-linguistic co-grounding attention mechanism is proposed in [28], and instructions are further disentangled into objects and directions in [29]. To alleviate exposure bias in supervised training of the agent, reinforcement learning has been adopted through planning [30], REINFORCE [31], A2C [32] and reward learning [33]. A few works also explore different search algorithms such as backtracking by monitoring progress [28], [34] or beam search [27], [32], [35] in environment exploration.

To improve an agent’s generalization to unseen environments, data augmentation is performed by using a speaker model [27] that generates instructions for random paths in seen environments, and environment dropout [32] is used to mimic new environments. While pretraining LSTMs to learn vision and language representations is adopted by [15], recently, there has been a shift towards transformer models [14] to learn generic multimodal representations. This is further extended to a recurrent model that significantly improves sequential action prediction [18]. However, the limited environments in pretraining [14], [15] constrain the generalization ability to unseen scenarios. Most related to this work, VLN-BERT [17] transfers knowledge from abundant, but out-of-domain image-text data to improve path-instruction matching. In this work, we not only create a large-scale, in-domain BnB dataset to improve visual diversity, but also propose effective pretraining strategies to mitigate the domain shift between web-crawled image-text pairs and VLN data.

Large-scale visio-linguistic pretraining. Thanks to large-scale vision-language pairs automatically collected from the web [12], [13], [36], [37], visio-linguistic pretraining (VLP) has made great breakthroughs in recent years towards learning transferable multimodal representations. Several VLP models [5]–[7], [38] have been proposed based on the transformer architecture [10]. These models are often pretrained with self-supervised objectives akin to those in BERT [11]: masked language modeling, masked region modeling and vision-text pairing. Fine-tuning them on downstream datasets achieves state-of-the-art performance on various VL tasks [39]–[42]. While such pretraining focuses on learning correlations between vision and text, it is not designed for sequential decision making as required in embodied VLN. The goal of this work is not to improve VLP architectures but to present in-domain training strategies that lead to performance improvements for VLN tasks.

The number of images from Matterport environments [44] refers to the number of panoramas. The speaker model [32] generates instructions for randomly selected trajectories, but is limited to panoramas from 60 training environments. Note that the data from Conceptual Captions (ConCaps) may feature some houses, but this is not its main category.

Table 1: Comparing BnB to other existing VLN datasets

Dataset Source #Envs #Imgs #Texts
R2R [2] Matterport 90 10.8K 21.7K
REVERIE [19] Matterport 86 10.6K 10.6K
Speaker [32] Matterport 60 7.8K 0.2M
ConCaps [13] Web - 3.3M 3.3M
BnB (ours) Airbnb 140K 1.4M 0.7M

3. BnB Dataset

Hosts that rent places on online marketplaces often upload attractive and unique photos along with descriptions. One such marketplace, Airbnb, has 5.6M listings from over 100K cities all around the world [43]. We propose to use this abundant and curated data for large-scale in-domain VLN pretraining. In this section, we first describe how we collect image-caption pairs from Airbnb. Then, we propose methods to transform images and captions into VLN-like path-instruction pairs to reduce the domain gap between web-crawled image-text pairs and VLN tasks (see Fig. 2).

Figure 2: We explore several strategies to automatically create navigation-like instructions from image-caption pairs.

3.1. Collecting BnB Image-Caption Pairs

Collection process. We restrict our dataset to listings from the US (about 10% of Airbnb) to ensure high quality English captions and visual similarity with Matterport environments [44]. The data collection proceeds as follows: (1) obtain a list of locations from Wikipedia; (2) find listings in these locations by querying the Airbnb search engine; (3) download listings and their metadata; (4) remove outdoor images as classified by a ResNet model pretrained on Places365 [45]; and (5) remove invalid image captions such as emails, URLs and duplicates.
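As an illustration of step (5), the following minimal sketch filters out captions containing emails or URLs and drops exact duplicates. The regular expressions and function name are hypothetical and only approximate the actual cleaning rules.

```python
import re

EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
URL_RE = re.compile(r"https?://\S+|www\.\S+")

def clean_captions(captions):
    """Drop invalid captions (emails, URLs, duplicates); illustrative filters only."""
    seen = set()
    kept = []
    for caption in captions:
        text = caption.strip()
        if not text or EMAIL_RE.search(text) or URL_RE.search(text):
            continue  # discard empty captions and those containing emails/URLs
        key = text.lower()
        if key in seen:
            continue  # discard exact duplicates
        seen.add(key)
        kept.append(text)
    return kept

print(clean_captions(["Cozy bedroom with a view", "contact me at host@example.com",
                      "Cozy bedroom with a view", "See www.example.com for details"]))
```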

Statistics. We downloaded almost 150K listings and their metadata (1/4 of the listings in the US) in step 3, leading to over 3M images and 1M captions. After data cleaning with steps 4 and 5, we obtain 713K image-caption pairs and 676K images without captions. Table 1 compares our BnB dataset to other datasets used in previous works for VLN (pre-)training. It is much larger than R2R [2] and REVERIE [19], and it covers a large diversity of rooms and objects, which is not the case for Conceptual Captions [13]. We posit that such in-domain data is crucial to deal with the data scarcity challenge in VLN environments, as illustrated above. We use 95% of our BnB dataset for training and the remaining 5% for validation.

Apart from images and captions, our collected listings contain structured data including a list of amenities, a general description, reviews, location, and rental price, which may offer additional applications in the future. More details about the dataset and examples are presented in the supplementary material.

3.2. Creating BnB Path-Instruction Pairs

BnB image-caption (IC) pairs are complementary to Conceptual Captions (ConCaps) as they capture diverse VLN environments. However, they still have large differences from path-instruction (PI) pairs in VLN tasks. For example, during navigation, an agent observes multiple panoramic views of a sequence of locations rather than a single image, and the instruction may contain multiple sentences describing different locations along the way. To mitigate this domain gap, we propose strategies to automatically craft path-instruction pairs starting from BnB-IC pairs.

Building Path-Instruction Pairs

Images in a BnB listing usually depict different locations in a house, mimicking the sequential visual observations an agent makes while navigating in the house. To create a VLN-like path-instruction pair, we randomly select and concatenate K image-caption pairs from the listing. Between consecutive captions, we randomly insert “and”, “then”, a period, or nothing to make the concatenated instruction more fluent and diverse.
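The construction above can be summarized with a short sketch. This is not the authors’ code: it assumes a listing is given as a list of (image, caption) tuples, and the function name and connector handling are illustrative.

```python
import random

CONNECTORS = ["and", "then", ".", ""]  # random transition words between captions

def make_path_instruction(listing, k):
    """Sample K image-caption pairs from one listing and concatenate them
    into a VLN-like (path, instruction) pair."""
    pairs = random.sample(listing, k)
    path = [image for image, _ in pairs]
    parts = []
    for i, (_, caption) in enumerate(pairs):
        parts.append(caption)
        if i < len(pairs) - 1:
            connector = random.choice(CONNECTORS)
            if connector:
                parts.append(connector)
    instruction = " ".join(parts)
    return path, instruction
```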

Augmenting Paths with Visual Contexts

In the above concatenated path, each location only contains one BnB image, perhaps with a limited view angle, as hosts may focus on objects or amenities they wish to highlight. Therefore, it lacks the panoramic visual context at each location that the agent receives in real navigation paths. Moreover, each location in the concatenated instruction is described by its own sentence, while adjacent locations are often expressed together in one sentence in VLN instructions [46]. To address the above issues with concatenation, we propose two approaches to compose paths that have more visual context and can also leverage the abundant images without captions (denoted as captionless images).

Image merging extends the panoramic context of a location by grouping images from similar room categories (see Fig. 2). For example, if an image depicts a kitchen sink, it is natural to expect images of other objects such as forks and knives nearby. Specifically, we first cluster images of similar categories (e.g., kitchen) using room labels predicted by a pretrained Places365 model [45]. Then, we extract multiple regions from this merged set of images, and use them as an approximation to the panoramic visual representation.
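A possible implementation of image merging, assuming room labels have already been predicted by a Places365 classifier and region features extracted by a Faster R-CNN; the data layout and function name are hypothetical.

```python
from collections import defaultdict

def merge_images_by_room(images, room_labels, region_features):
    """Group a listing's images by predicted room category (e.g. 'kitchen')
    and pool their region features into a single pseudo-panoramic location.

    images: list of image ids; room_labels[i]: Places365-style room label for image i;
    region_features[i]: list of region feature vectors for image i."""
    groups = defaultdict(list)
    for idx, label in enumerate(room_labels):
        groups[label].append(idx)
    merged = {}
    for label, indices in groups.items():
        regions = []
        for idx in indices:
            regions.extend(region_features[idx])
        merged[label] = {"images": [images[i] for i in indices],
                         "regions": regions}  # approximates a panoramic view
    return merged
```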

Captionless image insertion. Table 1 shows that half of the BnB images are captionless. Using them allows us to increase the size of the dataset. When creating a path-instruction pair with the concatenation approach, a captionless image is inserted as if its caption were an empty string. The BnB PI pairs generated this way better approximate the distribution of R2R path-instruction pairs: (1) some images in the path are not described, and (2) instructions have a similar number of noun phrases.
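A minimal sketch of captionless image insertion, assuming the path and its per-location captions are kept as parallel lists before the instruction is concatenated; names and the sampling scheme are illustrative.

```python
import random

def insert_captionless(path, captions, captionless_images, n_insert):
    """Insert captionless images at random positions along a concatenated path,
    treating their captions as empty strings (illustrative sketch)."""
    path = list(path)
    captions = list(captions)
    for image in random.sample(captionless_images, n_insert):
        pos = random.randint(0, len(path))
        path.insert(pos, image)
        captions.insert(pos, "")  # no text is added to the instruction
    return path, captions
```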

Crafting Instructions with Fluent Transitions

The concatenated captions mainly describe rooms or objects at different locations, but do not contain any of the actionable verbs found in navigation instructions, e.g., “turn left at the door” or “walk straight down the corridor”. We suggest two strategies to create synthetic instructions that have fluent transitions between sentences.

Instruction rephrasing. We use a fill-in-the-blanks approach to replace noun phrases in human-annotated navigation instructions [2] with those from BnB captions (see Fig. 2). Concretely, we create more than 10K instruction templates containing 2-7 blanks, and fill the blanks with noun phrases extracted from BnB captions. Noun phrases matching object categories from the Visual Genome [47] dataset are preferred during selection. This allows us to create VLN-like instructions with actionable verbs interspersed with references to the rooms and objects that appear in the BnB path (see Fig. 2).
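The fill-in-the-blanks rephrasing can be sketched as follows, assuming blanks are marked with “___”; the template and noun phrases below are made up for illustration, and the uniform sampling stands in for the preference given to Visual Genome object categories.

```python
import random

def rephrase_instruction(template, noun_phrases):
    """Fill a blanked navigation template with noun phrases taken from BnB captions.
    Blanks are marked with '___'; template and phrases here are illustrative."""
    segments = template.split("___")
    assert len(noun_phrases) >= len(segments) - 1, "not enough noun phrases"
    phrases = random.sample(noun_phrases, len(segments) - 1)
    out = []
    for i, segment in enumerate(segments):
        out.append(segment)
        if i < len(segments) - 1:
            out.append(phrases[i])  # one noun phrase per blank
    return "".join(out).strip()

template = "Walk past ___, then turn left at ___ and stop next to ___."
print(rephrase_instruction(template, ["the wooden dining table",
                                      "the kitchen sink", "a blue sofa"]))
```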

Instruction generation. We train a video-captioning-like model that takes a sequence of images and generates an instruction corresponding to an agent’s path through an environment. To train this model, we adopt ViLBERT and train it to generate captions for single BnB image-caption pairs. This model is then fine-tuned on trajectories of the R2R dataset to generate corresponding instructions. Finally, we use it to generate BnB PI pairs by producing an instruction for a concatenated image sequence (the path) from BnB.

4. Airbert: A Pretrained VLN Model

Figure 3: Overview of the Airbert model: a ViLBERT-like two-stream transformer that encodes path-instruction pairs and is pretrained with masking and shuffling losses.

In this section, we present Airbert, our multi-modal transformer pretrained on the BnB dataset with masking and shuffling losses. We first introduce the architecture of Airbert, and then describe datasets and pretext tasks in pretraining. Finally, we show how Airbert can be adapted to downstream VLN tasks.

4.1. ViLBERT-like Architecture

ViLBERT [7] is a multi-modal transformer extended from BERT [11] to learn joint visio-linguistic representations from image-text pairs, as illustrated in Fig. 3.

Given an image-text pair (V, C), the model encodes the image as region features $[v_1, \ldots, v_{\mathcal{V}}]$ via a pretrained Faster R-CNN [48], and embeds the text as a sequence of tokens $[\texttt{[CLS]}, w_1, \ldots, w_T, \texttt{[SEP]}]$, where [CLS] and [SEP] are special tokens added to the text. ViLBERT contains two separate transformers that encode V and C, and it learns cross-modal interactions via co-attention [7].

We follow a similar strategy to encode path-instruction pairs (created in Sec. 3.2) that contain multiple images and captions $\{(V_k, C_k)\}_{k=1}^{K}$. Here, each $V_k$ is represented as visual regions $v^k_i$ and each $C_k$ as word tokens $w^k_t$. Respectively, the visual and text inputs to Airbert are:
$$\begin{aligned} X_V &= [\texttt{[IMG]}, v^1_1, \ldots, v^1_{\mathcal{V}_1}, \ldots, \texttt{[IMG]}, v^K_1, \ldots, v^K_{\mathcal{V}_K}], \\ X_C &= [\texttt{[CLS]}, w^1_1, \ldots, w^1_{T_1}, \ldots, w^K_1, \ldots, w^K_{T_K}, \texttt{[SEP]}] ,\end{aligned}$$
where the [IMG] token is used to separate image region features taken at different locations.
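The following schematic sketch shows how the per-location regions and captions are flattened into X_V and X_C; strings stand in for region features and wordpiece ids, and the function name is hypothetical.

```python
def build_inputs(path_regions, caption_tokens):
    """Flatten per-location region features and per-caption tokens into the single
    visual and textual sequences fed to Airbert (schematic: strings replace
    feature vectors and wordpiece ids)."""
    x_v = []
    for regions in path_regions:      # one entry per location V_k
        x_v.append("[IMG]")           # separator between locations
        x_v.extend(regions)
    x_c = ["[CLS]"]
    for tokens in caption_tokens:     # one entry per caption C_k
        x_c.extend(tokens)
    x_c.append("[SEP]")
    return x_v, x_c

x_v, x_c = build_inputs([["v1_1", "v1_2"], ["v2_1"]],
                        [["walk", "into", "the", "kitchen"], ["stop", "there"]])
```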

Note that while our approach is not limited to a ViLBERT-like architecture, we choose ViLBERT for a fair comparison with previous work [17].

4.2. Datasets and Pretext Tasks for Pretraining

We use Conceptual Captions (ConCaps) [37] and BnB-PI in subsequent pretraining steps (see Fig. 3) to reduce the domain gap for downstream VLN tasks.

Previous multi-modal pretraining efforts [7], [15], [17] commonly use two self-supervised losses given image-caption (IC) pairs or path-instruction (PI) pairs: (1) Masking loss: an input image region or word is randomly replaced by a [MASK] token, and the output feature of this masked token is trained to predict the region label or the word given its multi-modal context. (2) Pairing loss: given the output features of the [IMG] and [CLS] tokens, a binary classifier is trained to predict whether the image (path) and caption (instruction) are paired.

The above two pretext tasks mainly focus on learning object-word associations rather than reasoning about the temporal order of paths and instructions. For example, if an image $V_i$ appears before $V_j$, then words from its caption $C_i$ should appear before those from $C_j$. In order to promote such temporal reasoning ability, we propose an additional shuffling loss to enforce alignment between PI pairs.

Given an aligned PI pair $X^+ = \{(V_k, C_k)\}_{k=1}^{K}$, we generate $\mathcal{N}$ negative pairs $X^-_n = \{(V_k, C_l)\}_{k \neq l}$ by shuffling the composed images or the captions. We train our model to choose the aligned PI pair over the shuffled negatives by minimizing the cross-entropy loss:
$$L = -\log \frac{\exp(f(X^+))}{\exp(f(X^+)) + \sum_n \exp(f(X^-_n))} \, ,$$
where f(X) denotes the similarity score (logit) computed via Airbert for some PI pair X.
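A minimal PyTorch sketch of this shuffling loss, assuming the model already produces a scalar alignment logit f(X) for each PI pair; the aligned pair is placed at index 0 of the candidate list.

```python
import torch
import torch.nn.functional as F

def shuffling_loss(positive_logit, negative_logits):
    """Cross-entropy over the alignment scores of one aligned PI pair and its
    shuffled negatives; the aligned pair sits at index 0."""
    logits = torch.cat([positive_logit.view(1), negative_logits.view(-1)])
    target = torch.zeros(1, dtype=torch.long)  # index of the aligned pair
    return F.cross_entropy(logits.unsqueeze(0), target)

# toy example: f(X+) = 2.1, f(X-_n) = [0.3, -0.5, 1.0]
loss = shuffling_loss(torch.tensor(2.1), torch.tensor([0.3, -0.5, 1.0]))
```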

4.3. Adaptations for Downstream VLN Tasks

We consider two VLN tasks: goal-oriented navigation (R2R [2]) and object-oriented navigation (REVERIE [19]). Airbert can be readily integrated in discriminative and generative models for the above VLN tasks.

Discriminative Model: Path Selection. The navigation problem on the R2R dataset is formulated as a path selection task in [17]. Several candidate paths are generated via beam search from a navigation agent such as [32], and a discriminative model is trained to choose the best path among them. We fine-tune Airbert on the R2R dataset for path selection. A two-stage fine-tuning process is adopted: in the first phase, we use masking and shuffling losses on the PI pairs of the target VLN dataset in a manner similar to BnB PI pairs; in the second phase, we choose a positive candidate path as one that arrives within 3m of the goal, and contrast it against 3 negative candidate paths. We also compare multiple strategies to mine additional negative pairs (other than the 3 negative candidates), and empirically show that negatives created by shuffling outperform other options.
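A sketch of how shuffled-path negatives could be mined for the second fine-tuning phase; the function below permutes the order of views along a positive candidate path and is illustrative rather than the exact procedure.

```python
import random

def shuffled_path_negatives(path, num_negatives=2):
    """Create additional negative candidates for path selection by permuting
    the order of views along a positive path (sketch; real inputs are
    per-location region features rather than strings)."""
    if len(path) < 2:
        return []  # nothing to reorder
    negatives = []
    while len(negatives) < num_negatives:
        shuffled = path[:]
        random.shuffle(shuffled)
        if shuffled != path:  # keep only genuinely reordered paths
            negatives.append(shuffled)
    return negatives
```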

Generative Model: Recurrent VLN-BERT [18]. The Recurrent VLN-BERT model adds recurrence to a state token in the transformer to sequentially predict actions, achieving state-of-the-art performance on the R2R and REVERIE tasks. We use our Airbert architecture as its backbone and apply it to the two tasks as follows. First, the language transformer encodes the instruction via self-attention. Then, the embedded [CLS] token of the instruction is used to track history and is concatenated with visual tokens (observable navigable views or objects) at each action step. Self-attention and cross-attention over the embedded instruction update the state and visual tokens, and the attention score from the state token to the visual tokens decides the action at each step. We fine-tune the Recurrent VLN-BERT model with Airbert as the backbone in the same way as [18].

Please refer to the supplementary material for additional details about the models and their implementation.

5. Experimental Results

Table 2: Comparison between various BnB PI pair creation strategies for pretraining. The first row denotes the use of image-caption pairs. All methods from the second row use masking and shuffling during pretraining. Cat: naive concatenation; Rep: instruction rephrasing; Gen: instruction generation; Merge: image merging; and Insert: captionless image insertion.
Cat Rep Gen Merge Insert Seen Unseen
1 - - - - - 71.21 62.45
2 ✔️ - - - - 73.84 62.71
3 - ✔️ - - - 72.67 63.35
4 - - ✔️ - - 71.19 63.11
5 - - - ✔️ - 70.51 64.07
6 - - - - ✔️ 74.43 66.05
7 - ✔️ - ✔️ ✔️ 73.57 66.52
Table 4: Comparison between different strategies for fine-tuning a ViLBERT model on the R2R task. VLN-BERT [17] fine-tunes ViLBERT with a masking and ranking loss. Each row (described in the text) is an independent data augmentation and can be compared directly against the baseline (row 1).
Fine-tuning Strategy Additional Negatives Seen Unseen
1 VLN-BERT  [17] 0 70.20 59.26
2 (1) + Wrong trajectories 2 70.11 59.11
3 (1) + Highlight keywords 0 71.89 61.37
4 (1) + Hard negatives 2 71.89 61.63
5 (1) + Shuffling (Ours) 2 72.46 61.98
Table 3: Impact of shuffling during pretraining and fine-tuning. While additional data helps, we see that using the shuffling loss (abbreviated as Shuf.) consistently improves model performance. Row 1 corresponds to VLN-BERT [17].
BnB Pretraining (Mask, Shuf.) Speaker (Rank, Shuf.) R2R Fine-tuning (Rank, Shuf.) Seen Unseen
1 - - - - ✔️ - 70.20 59.26
2 ✔️ - - - ✔️ - 73.24 64.21
3 ✔️ ✔️ - - ✔️ - 73.57 66.52
4 ✔️ ✔️ - - ✔️ ✔️ 74.69 66.90
5 ✔️ - ✔️ - ✔️ - 70.21 65.52
6 ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ 73.83 68.67
Table 5: Accuracy of models attempting to pick the correct PI pair from a pool of correct + 10 negatives created by simple corruptions such as replacing or swapping noun phrases and switching directions (left with right). Random performance is 1/11 or 9.1%.
Replace NP (Seen, Unseen) Swap NP (Seen, Unseen) Switch Direction (Seen, Unseen)
VLN-BERT 60.3 58.7 53.4 52.3 46.2 45.3
Airbert 68.3 66.6 66.6 61.1 47.3 49.8
Table 8: Navigation performance on the R2R unseen test set as indicated on the benchmark leaderboard.
Model OSR SR
Speaker-Follower [27] 96 53
PreSS [16] 57 53
PREVALENT [14] 64 59
Self-Monitoring [28] 97 61
Reinforced CM [31] 96 63
EnvDrop [32] 99 69
AuxRN [51] 81 71
VLN-BERT [17] 99 73
Airbert (ours) 99 77

We first perform ablation studies evaluating alternative ways to pretrain Airbert in Sec. 5.1. Then, we compare Airbert with state-of-the-art methods on R2R and REVERIE tasks in Sec. 5.2. Finally, in Sec. 5.3, we evaluate models in a more challenging setup: VLN few-shot learning where an agent is trained on examples taken from one/few houses.

R2R Setup. We briefly describe the two evaluation datasets used in our work: R2R [2] and REVERIE [19]. Most of our experiments are conducted on the R2R dataset [2], where we adopt standard splits and metrics defined by the task. We focus on success rate (SR), which is the ratio of predicted paths that stop within 3m of the goal. Please refer to [2], [17] for a more detailed explanation of the metrics. In particular, as the discriminative model uses path selection for R2R, we follow the pre-explored environment setting adopted by VLN-BERT [17], and compute metrics on the selected path.

REVERIE Setup. We also adopt standard splits and metrics on the REVERIE task [19]. Here, the success rate (SR) is the ratio of paths for which the agent stops at a viewpoint where the target object is visible. Remote Grounding Success Rate (RGS) measures accuracy of localizing the target object in the stopped viewpoint, and RGS per path length (RGSPL) is a path length weighted version.

5.1. Pretraining with BnB

We perform ablation studies on the impact of various methods for creating path-instruction pairs, as well as on the impact of using the shuffling loss during Airbert’s pretraining and fine-tuning stages. Throughout this section, our primary focus is the SR on the unseen validation set, and we compare our results against VLN-BERT [17], which achieves an SR of 59.26%.

1. Impact of creating path-instruction pairs. Table 2 presents the performance of multiple ways of using the BnB dataset after ConCaps pretraining, as illustrated in Fig. 3. In row 1, we show that directly using BnB IC pairs, without any strategy to reduce the domain gap, improves performance over VLN-BERT by 3.2%. Even if we skip ConCaps pretraining, we achieve 60.54%, outperforming the 59.26% of VLN-BERT. This shows that our BnB dataset is more beneficial to VLN than the generic ConCaps dataset.

Naive concatenation (row 2) does only slightly better than using the IC pairs (row 1), as there are still domain shifts with respect to fluency of transitions and lack of visual context. Rows 3-6 show that each method mitigates the domain shift to some extent. Instruction rephrasing (row 3) performs better at improving instructions than instruction generation (row 4), possibly because the generator is unable to use the diverse vocabulary of the BnB captions. Inserting captionless images at random locations (row 6) reduces the domain shift significantly and achieves the highest individual performance. Finally, a combination of instruction rephrasing, image merging and captionless insertion provides an overall 3.8% improvement over concatenation, and a large 7.2% improvement over VLN-BERT.

2. Shuffling loss applied during pretraining. Table 3 demonstrates that shuffling is an effective strategy to train the model to reason about temporal order and enforce alignment between PI pairs. Rows 2-4 show that shuffling is beneficial both during pretraining with BnB-PI data and during fine-tuning with R2R data, resulting in 2.3% and 0.4% improvements respectively. In combination with the Speaker dataset (paths from seen houses with generated instructions, yielding 178K additional PI pairs [32]), we see that shuffling has a major role to play and provides a 3.1% overall improvement (row 5 vs. 6). 68.67% is also our highest single-model performance on the R2R dataset.

3. Shuffling loss applied during fine-tuning. The final stage of model training on R2R involves fine-tuning to rank multiple candidate paths that form the path selection task. We compare various approaches to improve this fine-tuning procedure (results in Table 4). (1) In row 2, we explore the impact of using additional negative paths. Unsurprisingly, this does not improve performance. (2) Inspired by [49], we highlight keywords in the instruction using a part-of-speech tagger [50], and include an extra loss term that encourages the model to pay attention to their similarity scores (row 3). (3) Another alternative suggested by [49] involves masking keywords in the instruction and using VLP models to suggest replacements, resulting in hard negatives (row 4).

Hard negatives and highlighting keywords show good performance improvements, about 2.1-2.3%, but at the cost of extra parsers or VLP models. On the other hand, shuffling visual paths to create two additional negatives results in the highest performance improvement (row 5, +2.7% on val unseen) and appears to be a strong strategy to enforce temporal order reasoning, requiring neither an external parser nor additional VLP models.

4. Error analysis. We study the areas in which Airbert brings major improvements by analyzing scores for aligned PI pairs and simple corruptions that involve replacing noun phrases (e.g., bedroom by sofa), swapping noun phrases within the instruction, or switching left and right directions (e.g., turn left/right or leftmost/rightmost chair). In particular, for every ground-truth aligned PI pair, we create 10 additional negatives by corrupting the instruction, and measure accuracy as how often the model assigns the highest score to the correct pair. Table 5 shows that Airbert with in-domain training and the shuffling loss achieves large improvements (> 8%) for corruptions involving replacement or swapping of noun phrases. Distinguishing directions remains challenging, but here as well Airbert outperforms VLN-BERT by 4.5%.

5.2. Comparison against state-of-the-art

Table 7: Navigation and object grounding results on the REVERIE task.
Model SR OSR SPL RGS RGSPL
Seq2Seq-SF [2] 3.99 6.88 3.09 2.00 1.58
RCM [31] 7.84 11.68 6.67 3.67 3.14
SMNA [28] 5.80 8.39 4.53 3.10 2.39
FAST-MATTN [19] 19.88 30.63 11.61 11.28 6.08
Rec (OSCAR) [18] 22.14 24.54 18.25 11.51 9.55
Rec (ViLBERT) 22.17 25.51 17.28 12.87 10.00
Rec (VLN-BERT) 23.57 26.83 18.73 14.24 11.63
Rec (Airbert) 30.28 34.20 23.61 16.83 13.28

R2R. We first evaluate the discriminative model for the R2R task. Similar to VLN-BERT, we evaluate Airbert as an ensemble model created by a linear combination (chosen through grid search) of multiple model outputs (see Table 6). First, we see that Airbert alone (row 2) outperforms VLN-BERT (row 1) by 9.4% on the unseen environments and a strong ensemble of speaker and follower models [32] (row 3) by 0.7%. Ensembling Airbert results in a gain of 1.4% over the VLN-BERT ensemble (row 4 vs. 5).
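The linear combination of model outputs can be tuned as sketched below; this is an assumed implementation of the grid search, with score shapes and grid resolution chosen purely for illustration.

```python
import itertools
import numpy as np

def grid_search_ensemble(model_scores, labels, grid=np.linspace(0, 1, 11)):
    """Pick linear-combination weights for an ensemble of path-selection models
    by grid search on validation data.

    model_scores: array (num_models, num_examples, num_candidates)
    labels: array of indices of the correct candidate for each example."""
    num_models = model_scores.shape[0]
    best_weights, best_sr = None, -1.0
    for weights in itertools.product(grid, repeat=num_models):
        combined = np.tensordot(np.array(weights), model_scores, axes=1)
        predictions = combined.argmax(axis=-1)
        sr = (predictions == labels).mean()  # validation success-rate proxy
        if sr > best_sr:
            best_weights, best_sr = weights, sr
    return best_weights, best_sr
```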

We also obtain results on the test set by submitting our best method to the R2R leaderboard. As seen in Table 8, our ensemble of Airbert, speaker, and follower (similar to VLN-BERT with speaker and follower [17]) achieves the highest success rate at 77% and is ranked first as of the submission deadline. Airbert also benefits generative models for the R2R task; the results are presented in the supplementary material. In both VLN-BERT and Airbert, 30 candidate trajectories are sampled using beam search with the EnvDrop [32] approach, inducing the same path length (PL of 687) for all three methods. The SPL metric on the leaderboard takes into account the total path length over the 30 trajectories, which explains why SPL is very low and similar across multiple approaches.

REVERIE. Table 7 presents results for the REVERIE dataset. The last four rows in the table use Recurrent VLN-BERT [18] with different backbones or parameter initialization. The OSCAR and ViLBERT backbones are pretrained on out-of-domain image-caption pairs. Compared to OSCAR, we observe slight improvements using the ViLBERT backbone for the REVERIE task. VLN-BERT shares the same architecture as ViLBERT, but is pretrained on the R2R dataset, resulting in a performance improvement on the unseen environments. Our pretrained Airbert achieves significantly better performance than VLN-BERT, with over 2.4% gain on navigation SR and 1.8% gain on RGS in unseen environments (val unseen). Without any special adaptation, Airbert brings benefits from pretraining on the BnB dataset. We also achieve state-of-the-art performance on the REVERIE test set, surpassing previous works by a large margin.

5.3. Training a navigation agent on few houses

We hypothesize that in-domain pretraining, especially one that leverages the proposed PI pair generation methods, can achieve superior performance while requiring less training data. To evaluate this, we propose a novel few-shot evaluation paradigm for VLN: models are allowed to fine-tune on samples (PI pairs) from only one (or a few) environments. Few-shot learning for VLN is particularly interesting as the visual appearance of houses may differ vastly across geographies, and while training data is hard to obtain, pretraining data like BnB may be readily available.

One/few-shot tasks. We consider two setups: (1) learning from a single environment, which we refer to as one-shot learning; and (2) learning from 6 environments (representing 10% of the total training size). For both cases, we randomly sample 5 sets of environments and report average results (standard deviations in the supplementary material). As the number of paths in an environment may have a large impact on performance, we exclude 17 of the 61 environments that have fewer than 80 paths.

Results. We adopt VLN-BERT, pretrained on ConCaps, as a baseline for few-shot tasks. Recall that fine-tuning VLN-BERT and Airbert on R2R relies on candidate paths drawn from an existing model (EnvDrop [32]). However, as this would lead to unfair comparisons (EnvDrop is trained on the full dataset), we sample candidate paths by random walks from the starting position in the environment.

Table 9 shows that Airbert outperforms VLN-BERT by very large margins on the unseen validation set: 22.4% with 1 house and 21% with 6 houses. In fact, Airbert fine-tuned on 6 houses is almost as good as VLN-BERT trained on the entire training set. Interestingly, as seen in the last two rows of the table, using random paths for fine-tuning does not lead to a large performance drop for either model, which is a testament to the power of pretrained networks.

6. Conclusion

We introduced BnB, a large-scale, in-domain, image-text dataset built from houses listed on online rental marketplaces, and showed how the domain gap between BnB image-caption pairs and VLN tasks can be mitigated through the creation of path-instruction pairs. We also proposed a shuffling loss as a means to improve an agent’s reasoning about temporal order. Our pretrained model, Airbert, achieves state-of-the-art results on R2R in the discriminative path-selection setting and on REVERIE in a generative setting. We also demonstrated large performance improvements when applying our model to a challenging one/few-shot VLN setup, highlighting the impact of good pretraining for VLN tasks.