A New Benchmark for Evaluating Cross-Modal Retrieval in Vision-Language Models

What is WikiDO?

WikiDO (drawn from the Wikipedia Diversity Observatory) is a new cross-modal retrieval benchmark designed to assess the out-of-distribution (OOD) generalization capabilities of pretrained vision-language models (VLMs). It consists of 380K image-text pairs from Wikipedia with domain labels, along with carefully curated, human-verified in-distribution (ID) and OOD test sets of 3K pairs each. The image-text pairs cover a highly diverse set of topics.
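As a quick illustration, each WikiDO entry pairs an image with its caption and a domain label. The snippet below is a minimal sketch of how such records might be inspected; the file name and field names (image_path, caption, domain) are illustrative assumptions, not the dataset's confirmed schema.

```python
# Minimal sketch of inspecting WikiDO-style records. The file name and field
# names ("image_path", "caption", "domain") are illustrative assumptions, not
# the dataset's confirmed schema.
import json

with open("wikido_train.json") as f:  # hypothetical file name
    records = json.load(f)

for record in records[:3]:
    print(record["domain"], record["image_path"])
    print(record["caption"])
```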

Why WikiDO?

Cross-modal (image-to-text and text-to-image) retrieval is an established task used in evaluation benchmarks to test the performance of vision-language models (VLMs). Several state-of-the-art VLMs (e.g., CLIP, BLIP-2) have achieved near-perfect performance on widely used image-text retrieval benchmarks such as MSCOCO-Test-5K and Flickr30K-Test-1K. To measure out-of-distribution (OOD) generalization, prior work relies on zero-shot evaluation on one dataset (Flickr) with a VLM finetuned on another (MSCOCO). We argue that such comparisons are insufficient to assess the OOD generalization capability of models because of the high visual and linguistic similarity between the evaluation and finetuning datasets. WikiDO offers a strong cross-modal retrieval benchmark for current VLMs, especially for evaluating OOD generalization. For full details, see the WikiDO Paper.
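To make the task concrete, the sketch below shows zero-shot image-to-text and text-to-image retrieval with an off-the-shelf CLIP model via Hugging Face transformers. The image paths and captions are placeholders, and this is not the official WikiDO evaluation code.

```python
# A minimal sketch of zero-shot cross-modal retrieval with CLIP.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

images = [Image.open(p) for p in ["img0.jpg", "img1.jpg"]]  # placeholder paths
captions = ["A plate of sushi on a wooden table.",
            "A footballer taking a corner kick."]

inputs = processor(text=captions, images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image[i, j] is the (scaled) cosine similarity of image i and text j.
i2t_ranking = outputs.logits_per_image.argsort(dim=1, descending=True)  # image -> texts
t2i_ranking = outputs.logits_per_text.argsort(dim=1, descending=True)   # text -> images
print(i2t_ranking, t2i_ranking)
```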

Getting Started

The data is split into training, dev, and test sets. Download the dataset here (distributed under the CC BY-NC 4.0 license):

WikiDO dataset

Details of the baseline models and the evaluation script can be found in the GitHub repository: WikiDO Github Page
We will update the leaderboard with models and results from publicly available papers. Feel free to contact Pavan Kalyan if you would like to submit your results.

How did we construct WikiDO?

WikiDO consists of image-text data derived from the Wikipedia Diversity Observatory, a diverse source of Wikipedia articles spanning several diversity axes, including geography, gender, ethnicity, and domains/topics. We focus on the domains axis, which offers the broadest coverage and spans a wide range of topics (as determined via topic labels assigned to each article) such as food, books, fashion, and sports.
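The sketch below illustrates the general idea of grouping image-text pairs by their article's domain label, as a step toward domain-based ID/OOD splits. The field name "domain" and the choice of held-out domains are assumptions for illustration; this is not the exact WikiDO construction pipeline.

```python
# Illustrative grouping of image-text pairs into ID/OOD pools by domain label.
from collections import defaultdict

def split_by_domain(pairs, ood_domains):
    """Assign each pair to an ID or OOD pool based on its domain label."""
    pools = defaultdict(list)
    for pair in pairs:
        key = "ood" if pair["domain"] in ood_domains else "id"
        pools[key].append(pair)
    return pools

pairs = [
    {"caption": "A plate of sushi.", "domain": "food"},
    {"caption": "A corner kick during a match.", "domain": "sports"},
]
print(split_by_domain(pairs, ood_domains={"sports"}))
```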

Have Questions or Want to Contribute?

Feel free to contact Pavan Kalyan or Piyush Pasi. We would greatly appreciate your suggestions for this project.

Leaderboard

Here we rank different VLMs based on their performance on the OOD split of WikiDO.

Image-Text retrieval

We use recall as the evaluation metric and report the mean of R@1, R@5, and R@10. I2T denotes image-to-text retrieval and T2I denotes text-to-image retrieval. A sketch of how these recall numbers can be computed is shown after the table.

Rank  Date          Model                                    #Params  ID I2T  ID T2I  OOD I2T  OOD T2I
1     June 1, 2023  CLIP (ViT-L 336) (Radford et al., 2021)  428M     91.7    90.9    84.3     84.4
2     June 1, 2023  BLIP-2 (ViT-L) (Li et al., 2023)          473M     90.9    91.2    82.8     83.7
3     June 1, 2023  BLIP-2 (ViT-G) (Li et al., 2023)          1172M    89.6    89.8    81.0     82.4
4     June 1, 2023  BLIP (ViT-L) (Li et al., 2022)            446M     86.0    86.1    76.9     78.2
5     June 1, 2023  BLIP (ViT-B) (Li et al., 2022)            223M     86.2    85.6    75.4     76.0
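The sketch below shows, under simple assumptions, how mean recall can be computed from an image-text similarity matrix. It assumes a one-to-one image-caption correspondence (the ground-truth match for query i is candidate i); it is illustrative and not the official WikiDO evaluation script.

```python
# Recall@K and mean recall from a similarity matrix, assuming the ground-truth
# candidate for query i is candidate i.
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """sim[i, j] = similarity of query i to candidate j."""
    ranks = (-sim).argsort(axis=1)                              # best candidate first
    gt_pos = (ranks == np.arange(len(sim))[:, None]).argmax(axis=1)
    return {k: float((gt_pos < k).mean()) for k in ks}

sim = np.random.rand(100, 100)   # placeholder image-text similarity scores
i2t = recall_at_k(sim)           # image-to-text: rows are image queries
t2i = recall_at_k(sim.T)         # text-to-image: rows are text queries
mean_recall_i2t = np.mean(list(i2t.values()))
print(i2t, t2i, mean_recall_i2t)
```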