J-ORA: A Multimodal Japanese Dataset for Grounding Language and Vision in Robotics
Jesse Atuhurra1,2, Hidetaka Kamigaito1, Taro Watanabe1, Koichiro Yoshino1,2
1 NAIST
2 RIKEN Guardian Robot Project
Code | Dataset on HuggingFace |
Robot perception is susceptible to object occlusions, constant object movements, and ambiguities within the scene, all of which make perceiving objects more challenging. Yet such perception is crucial for making the robot useful in accomplishing tasks, especially when the robot interacts with humans. We introduce an object-attribute annotation framework that describes objects from the robot's ego-centric view and keeps track of changes in object attributes. We then leverage vision-language models (VLMs) to interpret objects and their attributes. The VLMs in turn enable the robot to accomplish tasks such as object identification, reference resolution, and next-action prediction.
Robot perception in the real world faces the challenges of (1) dynamically changing scenes, in which people and objects undergo constant change, and (2) the co-existence of similar objects in a scene. Moreover, modern robots must understand complex commands involving objects, references, and actions, and they must make sense of commands and dialogues in many languages. Yet most datasets for robot vision-language learning are limited to English and rely on short prompts or synthetic setups. J-ORA addresses this gap by introducing a comprehensive multimodal dataset grounded in Japanese, containing real-world images and human-annotated linguistic references, actions, and object descriptions.
In Japanese human-robot interaction (HRI), understanding ambiguous object references and predicting suitable actions are essential. Japanese presents unique linguistic challenges like elliptical expressions and non-linear syntax. Yet few datasets exist for grounded multimodal reasoning in Japanese, particularly in the context of robotic perception and manipulation.
J-ORA is designed to support fine-grained multimodal understanding and language grounding for tasks critical to domestic service robots, including object recognition, action prediction, and spatial reference disambiguation.
J-ORA contains 142 real-world image-dialogue pairs annotated with rich multimodal information:
Each image is paired with a Japanese dialogue and per-object attribute annotations.
Each object is described using these features: category, color, shape, size, material, surface texture, position, state, functionality, brand, interactivity, and proximity to the person.
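As a rough illustration of this attribute schema, one object record could be represented as below. This is a sketch only: the field names mirror the list above, but the actual serialization format of the released annotation files may differ.

```python
from dataclasses import dataclass

# Illustrative record for a single annotated object (not the official file format).
@dataclass
class ObjectAnnotation:
    category: str             # e.g., "mug"
    color: str                # e.g., "white"
    shape: str                # e.g., "cylindrical"
    size: str                 # e.g., "small"
    material: str             # e.g., "ceramic"
    surface_texture: str      # e.g., "smooth"
    position: str             # location in the scene, e.g., "on the table"
    state: str                # e.g., "empty", "open"
    functionality: str        # what the object is used for
    brand: str                # brand name if visible, otherwise "unknown"
    interactivity: str        # whether/how the robot can interact with it
    proximity_to_person: str  # spatial relation to the person in the scene
```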
Feature | Value
---|---
Total duration | 3 hrs 3 min 44 sec
Unique dialogues | 93
Total dialogues | 142
Utterances | 2,131
Sentences | 2,652
Average turns per dialogue | 15
Average duration per dialogue | 77 sec
Total turns | 2,131
Image-dialogue pairs | 142
Unique object classes | 160
Object attribute annotations | 1,817
Language | Japanese
- **Object Identification (OI):** Given a Japanese utterance and an image, identify all objects mentioned in the utterance.
- **Reference Resolution (RR):** Given a Japanese utterance, an image, and the object mentions from the OI task, describe where in the image the mentioned objects occur.
- **Next-Action Prediction:** From the objects identified in the OI task and the locations described in the RR task, predict the next most probable high-level action (e.g., pick, move, discard) implied by the instruction.
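A minimal sketch of how the three tasks above can be posed to a VLM as text prompts alongside the scene image. The prompt wording and the `build_prompts` helper are illustrative assumptions, not the exact templates used in the benchmark.

```python
# Illustrative task templates; the real prompts are in Japanese and may be worded differently.
OI_PROMPT = (
    "You are the robot's perception module. Given the image and the Japanese utterance "
    "below, list every object mentioned in the utterance.\nUtterance: {utterance}"
)

RR_PROMPT = (
    "For each mentioned object ({objects}), describe where in the image it appears.\n"
    "Utterance: {utterance}"
)

AP_PROMPT = (
    "Given the mentioned objects ({objects}) and their locations ({locations}), predict the "
    "next high-level action (e.g., pick, move, discard) implied by the instruction.\n"
    "Utterance: {utterance}"
)

def build_prompts(utterance: str, objects: str = "", locations: str = "") -> dict:
    """Fill the three task templates for one image-dialogue pair."""
    return {
        "OI": OI_PROMPT.format(utterance=utterance),
        "RR": RR_PROMPT.format(utterance=utterance, objects=objects),
        "AP": AP_PROMPT.format(utterance=utterance, objects=objects, locations=locations),
    }
```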
The three tasks are framed as an end-to-end multimodal perception problem and are performed by VLMs. Performance is evaluated with standard accuracy as the primary metric.
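For instance, per-task accuracy can be computed as a simple exact-match ratio over model outputs and gold labels; the matching criterion below is an assumption and may be stricter than the one used in the actual evaluation.

```python
def task_accuracy(predictions: list[str], references: list[str]) -> float:
    """Exact-match accuracy for one task (illustrative matching criterion)."""
    assert len(predictions) == len(references)
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)

# Example: next-action predictions vs. gold labels -> 2/3 correct
print(task_accuracy(["pick", "move", "discard"], ["pick", "move", "move"]))
```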
We benchmark leading vision-language models (VLMs) on J-ORA, comparing zero-shot and fine-tuned settings, each with and without object attribute annotations.
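One plausible way to realize the "with object attributes" condition is to serialize the annotations into text and prepend them to the task prompt, as sketched below; the serialization format here is an assumption, not the scheme used in the paper.

```python
def attributes_to_text(annotations: list[dict]) -> str:
    """Serialize per-object attribute dictionaries into a plain-text scene description."""
    lines = []
    for obj in annotations:
        attrs = ", ".join(f"{k}: {v}" for k, v in obj.items() if v)
        lines.append(f"- {attrs}")
    return "Objects in the scene:\n" + "\n".join(lines)

def with_attributes(task_prompt: str, annotations: list[dict]) -> str:
    """Prepend the serialized attributes to a task prompt (the 'with attributes' setting)."""
    return attributes_to_text(annotations) + "\n\n" + task_prompt

# In the "without attributes" setting, the VLM receives only the image and the task prompt.
```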
J-ORA supports research in grounded language understanding for domestic service robots, including object identification, reference resolution, and action prediction in Japanese human-robot interaction.
Dataset: Full annotations and image data.
The data introduced in this project extends the J-CRe3 dataset.
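Assuming the data is published as a standard Hugging Face dataset, it could be accessed roughly as follows. The repository id and field names are placeholders; refer to the "Dataset on HuggingFace" link above for the actual identifiers.

```python
from datasets import load_dataset

# "<hf-org>/J-ORA" is a placeholder repository id; replace it with the id from the link above.
jora = load_dataset("<hf-org>/J-ORA", split="train")
example = jora[0]
print(example.keys())  # expect fields such as the image, dialogue, and object annotations
```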
If you use J-ORA in your research, please cite these two resources:
The J-ORA paper (IROS 2025): BibTeX coming soon...
And the J-CRe3 paper below:
@inproceedings{ueda-2024-j-cre3,
title = {J-CRe3: A Japanese Conversation Dataset for Real-world Reference Resolution},
author = {Nobuhiro Ueda and Hideko Habe and Yoko Matsui and Akishige Yuguchi and Seiya Kawano and Yasutomo Kawanishi and Sadao Kurohashi and Koichiro Yoshino},
booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
month = may,
year = {2024},
url = {https://aclanthology.org/2024.lrec-main.829},
pages = {9489--9502},
address = {Turin, Italy},
}
For questions and collaboration inquiries:
atuhurra.jesse.ag2@naist.ac.jp
koichiro.yoshino@riken.jp
The dataset and code are released under the CC BY-SA 4.0 license (Creative Commons Attribution-ShareAlike 4.0 International).