
VLURes: Understanding Vision and Language Across Cultures

A New Benchmark for Smarter, More Equitable AI

Jesse Atuhurra1, Iqra Ali2, Tomoya Iwakura3, Hidetaka Kamigaito1, and Tatsuya Hiraoka4
1 NAIST 2 QMUL 3 Fujitsu Ltd 4 MBZUAI

Code | Data

🌍 Motivation: A Multilingual, Multimodal World Needs Multilingual, Multimodal AI

Despite recent advances in Vision-Language Models (VLMs), most benchmarks evaluate models in English, with limited regard for non-English languages or rich, real-world contexts. This monolingual bias severely limits how we assess AI’s true generalization capabilities, especially for low-resource languages.

VLURes is designed to change that. It rigorously evaluates visual and linguistic understanding across English, Japanese, Swahili, and Urdu, using diverse tasks, rich prose, and grounded cultural contexts.

Figure 1: VLURes Task Overview

We envision a world of generalist intelligent agents, such as robots, that can accomplish a wide range of vision-language tasks.

🌍 What We Built: The VLURes Benchmark

VLURes is more than just a dataset; it’s a comprehensive testbed for the next generation of intelligent agents.

🧠 What Is VLURes?

VLURes is a multilingual vision-language benchmark aimed at testing intelligent agents under realistic conditions. Each input contains an image and an article-level text (not just captions), and the benchmark tests a model’s ability to perform both image-only and image+text reasoning.
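
To make the input format concrete, here is a minimal sketch of one VLURes example and the two prompting conditions. The field names and helper below are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass


@dataclass
class VLUResExample:
    """One benchmark instance: an image paired with article-level text.

    Field names are illustrative; the released dataset may use a different schema.
    """
    image_path: str   # local path or URL of the image
    article: str      # full article-level text, not a short caption
    language: str     # "en", "ja", "sw", or "ur"
    task: str         # e.g. "unrelatedness"


def build_text_input(example: VLUResExample, task_instruction: str,
                     use_article: bool) -> str:
    """Build the textual side of the model input for the two settings:
    image-only (the article is withheld) vs. image + text."""
    if use_article:
        return f"{task_instruction}\n\nArticle ({example.language}):\n{example.article}"
    return task_instruction
```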

VLURes covers 8 tasks:

🏗️ Dataset Construction

We collected articles and images from multiple web sources, including Wikipedia, Wikinews, blogs, and forums. The collection covers diverse topics such as animals, locations, food, buildings, and events.

We used CLIP similarity scores to align the most relevant image to each article. All data was cleaned manually, filtered for quality, and checked for NSFW or offensive content.
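
The alignment step can be approximated with off-the-shelf CLIP embeddings. The sketch below uses the Hugging Face `openai/clip-vit-base-patch32` checkpoint as an assumed stand-in for whichever CLIP variant was actually used; note that this checkpoint is English-only, so non-English articles would need a multilingual CLIP or a translated excerpt.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint for illustration; the exact CLIP variant is not specified here.
MODEL_ID = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(MODEL_ID)
processor = CLIPProcessor.from_pretrained(MODEL_ID)


def best_image_for_article(article_excerpt: str, image_paths: list[str]) -> str:
    """Return the candidate image whose CLIP embedding is most similar
    to a (truncated) excerpt of the article text."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(text=[article_excerpt], images=images,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    scores = outputs.logits_per_text.squeeze(0)  # one similarity score per candidate image
    return image_paths[int(scores.argmax())]
```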

🎯 New Task: The "Unrelatedness" Challenge

The proposed Unrelatedness task. Left: the VLM input consists of two modalities, an image paired with a text. The image passes through the vision encoder and connector, producing visual tokens ready for alignment in a shared embedding space; in parallel, a tokenizer converts the text into textual tokens. The visual and textual tokens are aligned in the shared embedding space and fed to the LLM. Right: the LLM uses its multimodal understanding to decide which textual information is relevant to which parts of the image. The text highlighted in green (marked with a cross) is directly related to the image region inside the green box, i.e., it matches that part of the image. In this task, however, we want the text that is unrelated to the image, so the yellow text (marked with a check) is the answer to the Unrelatedness task.

Figure 2: Our proposed Unrelatedness task
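
In code, the dual-stream processing shown in Figure 2 looks roughly like the sketch below. The component interfaces (`connector`, `embed_tokens`, `generate_from_embeddings`) are generic placeholders for whatever the concrete VLM provides, not a specific model's API.

```python
import torch


def vlm_forward(image_pixels: torch.Tensor, text: str,
                vision_encoder, connector, tokenizer, llm) -> str:
    """Generic VLM forward pass as depicted in Figure 2 (placeholder components)."""
    patch_features = vision_encoder(image_pixels)   # (num_patches, d_vision)
    visual_tokens = connector(patch_features)       # projected to (num_patches, d_llm)
    text_ids = tokenizer(text)                      # token ids for the article
    text_tokens = llm.embed_tokens(text_ids)        # (num_text_tokens, d_llm)
    # Visual and textual tokens now share one embedding space and are
    # concatenated into a single sequence for the LLM to reason over.
    multimodal_sequence = torch.cat([visual_tokens, text_tokens], dim=0)
    return llm.generate_from_embeddings(multimodal_sequence)
```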

Unlike traditional matching tasks, Unrelatedness tests whether a model can identify irrelevant information. This is vital in noisy, multimodal environments like news feeds or social media.

Can the model ignore text that does not describe or relate to the image?
This is the inverse of standard grounding tasks and pushes models to reason beyond associations.
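
One simple way to pose the task to a VLM is a prompt that asks the model to flag the unrelated passages. The wording below is an illustrative sketch, not the exact prompt used in the benchmark:

```python
def unrelatedness_prompt(article: str, language: str = "English") -> str:
    """Build an illustrative Unrelatedness prompt; the image is passed
    to the VLM separately, alongside this text."""
    return (
        f"You are given an image and the following {language} article.\n\n"
        f"Article:\n{article}\n\n"
        "List the sentences or passages in the article that are NOT related "
        "to the content of the image, and briefly explain why each one is unrelated."
    )
```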

📊 Summary of the Benchmark Pipeline
  1. Task Definition: 8 vision-language tasks
  2. Data Collection: From native-language web sources
  3. Alignment: Image selection via CLIP similarity
  4. Evaluation: Via human and automatic judges
  5. Results: Quantitative accuracy + qualitative rationale analysis

🔬 Evaluation Protocols

Models were tested under two input settings: image-only (I) and image plus text (IT).

We used both automatic judges and human evaluation to score model outputs and rationales.
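
Automatic judging can be sketched as an LLM-as-judge rubric. The 1-5 scale and output format below are assumptions for illustration and may differ from the protocol used in the paper:

```python
def judge_prompt(task_instruction: str, reference: str, model_answer: str) -> str:
    """Build an illustrative LLM-as-judge prompt for scoring one response."""
    return (
        "You are grading a vision-language model's answer.\n\n"
        f"Task: {task_instruction}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {model_answer}\n\n"
        "Rate the model answer from 1 (wrong or irrelevant) to 5 "
        "(fully correct and well grounded), then give a one-sentence rationale.\n"
        "Respond as: score=<1-5>; rationale=<text>"
    )
```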

VLURes Task Performance

🧪 Experiment Results: Key Findings

📉 Challenges Highlighted

🔓 Open Access

We believe in open science. All assets, including the code and the dataset, are publicly available via the Code and Data links above.

🧑‍💻 Authors, BibTeX, Usage and License Notices

🧑‍💻 Authors

For questions about this research, please get in touch with the corresponding authors:

📚 BibTeX

Under review at Journal of Machine Learning Research

Usage and License Notices

The code, annotations, and other original materials in this repository are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).