
J-ORA: A Robot Perception Framework for Japanese Object Identification, Reference Resolution, and Next Action Prediction

Including a Multimodal Japanese Dataset for Grounding Language and Vision in Robotics

Jesse Atuhurra1,2, Hidetaka Kamigaito1, Taro Watanabe1, Koichiro Yoshino1,2
1 NAIST
2 RIKEN Guardian Robot Project

🎉 Accepted to IROS 2025 🤖!

Code | Dataset on HuggingFace
🌟 Overview

Robot perception is susceptible to object occlusions, constant object movement, and ambiguities within the scene that make perceiving objects more challenging. Yet such perception is crucial to making the robot more useful at accomplishing tasks, especially when the robot interacts with humans. We introduce an object-attribute annotation framework to describe objects from the robot's ego-centric view and keep track of changes in object attributes. We then leverage vision-language models (VLMs) to interpret objects and their attributes, which in turn enables the robot to accomplish tasks such as object identification, reference resolution, and next-action prediction.
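To make the attribute-tracking idea concrete, here is a minimal sketch (not the released implementation) of comparing per-object attribute annotations across two ego-centric frames; the frame format, object IDs, and attribute values are illustrative assumptions.

```python
# Minimal sketch (not the released implementation) of tracking object-attribute
# changes between two ego-centric frames. Each frame maps an object id to its
# attribute annotations; ids and attribute names here are illustrative.

def diff_attributes(prev_frame: dict, curr_frame: dict) -> dict:
    """Return, for each object id, the attributes whose values changed."""
    changes = {}
    for obj_id, curr_attrs in curr_frame.items():
        prev_attrs = prev_frame.get(obj_id, {})
        changed = {
            key: (prev_attrs.get(key), value)
            for key, value in curr_attrs.items()
            if prev_attrs.get(key) != value
        }
        if changed:
            changes[obj_id] = changed
    return changes

# Example: a cup that was picked up changes its "state" and "proximity" attributes.
prev = {"cup_1": {"state": "on the table", "proximity": "far from the person"}}
curr = {"cup_1": {"state": "held", "proximity": "in the person's hand"}}
print(diff_attributes(prev, curr))
# {'cup_1': {'state': ('on the table', 'held'),
#            'proximity': ('far from the person', "in the person's hand")}}
```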

🌟 Highlight: Our framework is effective in representing dynamic *object changes* in non-English languages, e.g., Japanese.
🧩 The Problem

Robot perception in the real world faces the challenges of (1) dynamically changing scenes in which people and objects undergo constant change, and (2) the co-existence of similar objects in a scene. Moreover, modern robots must understand complex commands involving objects, references, and actions. Lastly, robots need to make sense of commands and dialogues in many languages. Yet most datasets in robot vision-language learning are limited to English and rely on short prompts or synthetic setups. J-ORA addresses this gap by introducing a comprehensive multimodal dataset grounded in Japanese, containing real-world images and human-annotated linguistic references, actions, and object descriptions.

🧩 Highlight: We aim to train robots to excellently perceive and understand the scene despite the dynamic changes in the scene.

Scenes Figure

🚀 Motivation and J-ORA

In Japanese human-robot interaction (HRI), understanding ambiguous object references and predicting suitable actions are essential. Japanese presents unique linguistic challenges like elliptical expressions and non-linear syntax. Yet few datasets exist for grounded multimodal reasoning in Japanese, particularly in the context of robotic perception and manipulation.

J-ORA is designed to support fine-grained multimodal understanding and language grounding for tasks critical to domestic service robots, including object recognition, action prediction, and spatial reference disambiguation.

Intro Figure

📦 Dataset Summary

J-ORA contains 142 real-world image-dialogue pairs annotated with rich multimodal information:

Each image is paired with a Japanese dialogue and per-object attribute annotations.

Each object is described using twelve features: category, color, shape, size, material, surface texture, position, state, functionality, brand, interactivity, and proximity to the person.
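For illustration, a single per-object record covering these twelve attributes could be represented as in the sketch below; the field names and example values are assumptions for readability, not the released annotation schema.

```python
# Hypothetical per-object annotation record mirroring the attribute list above.
# Field names and example values are illustrative; see the released dataset for
# the actual schema.
from dataclasses import dataclass

@dataclass
class ObjectAnnotation:
    category: str             # e.g., "mug"
    color: str                # e.g., "white"
    shape: str                # e.g., "cylindrical"
    size: str                 # e.g., "small"
    material: str             # e.g., "ceramic"
    surface_texture: str      # e.g., "smooth"
    position: str             # e.g., "on the table, left of the laptop"
    state: str                # e.g., "empty"
    functionality: str        # e.g., "holds drinks"
    brand: str                # e.g., "unknown"
    interactivity: str        # e.g., "graspable"
    proximity_to_person: str  # e.g., "within arm's reach"
```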

Data Collection Pipeline Figure

📊 Quantitative Summary of the J-ORA dataset

| Feature | Value |
| --- | --- |
| Hours | 3 hrs 3 min 44 sec |
| Unique dialogues | 93 |
| Total dialogues | 142 |
| Utterances | 2,131 |
| Sentences | 2,652 |
| Average turns per dialogue | 15 |
| Average duration per turn | 77 sec |
| Total turns | 2,131 |
| Image-Dialogue pairs | 142 |
| Unique object classes | 160 |
| Object attribute annotations | 1,817 |
| Languages | Japanese |
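To browse the data programmatically, one option is the Hugging Face `datasets` library; in the sketch below, the dataset identifier and split name are placeholders, so substitute the ID from the "Dataset on HuggingFace" link above.

```python
# Loading sketch with the Hugging Face `datasets` library.
# "your-org/j-ora" is a placeholder ID and "train" an assumed split name;
# use the identifier from the HuggingFace link above.
from datasets import load_dataset

ds = load_dataset("your-org/j-ora")   # placeholder dataset ID
example = ds["train"][0]              # assumed split name
print(example.keys())                 # inspect the available fields
```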
🧠 Tasks in Detail

🟑 Object Identification (OI)

Given a Japanese utterance and an image, identify all objects mentioned in the utterance.

🔵 Reference Resolution (RR)

Given a Japanese utterance, an image, and the object mentions from the OI task above, describe where the mentioned objects occur in the image.

🔴 Next Action Prediction (AP)

From the objects identified in the dialogue utterance in the OI task and their locations in the image described in the RR task, predict the most probable next high-level action (e.g., pick, move, discard) implied by the instruction.
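Taken together, one way to picture a single example flowing through the three tasks is the structured record sketched below; the field names and values are illustrative assumptions, not the dataset's output format.

```python
# Illustrative record of one image-utterance example flowing through the three
# tasks (OI -> RR -> AP). Field names and values are assumptions, not the
# dataset's official output format.
from dataclasses import dataclass, field

@dataclass
class Prediction:
    utterance: str                                          # Japanese instruction
    mentioned_objects: list = field(default_factory=list)   # OI output
    object_locations: dict = field(default_factory=dict)    # RR output
    next_action: str = ""                                    # AP output

example = Prediction(
    utterance="そのカップを取って",   # "Pick up that cup"
    mentioned_objects=["cup"],        # OI: objects named in the utterance
    object_locations={"cup": "on the table, to the right of the person"},  # RR
    next_action="pick",               # AP: high-level action
)
```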

The three tasks are framed as an end-to-end multimodal perception problem and are performed by VLMs. Performance is evaluated with accuracy as the primary metric.
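As a simple illustration of the metric, exact-match accuracy can be computed as in the sketch below; the label format is an assumption and this is not the official evaluation script.

```python
# Minimal exact-match accuracy sketch; the label format below is illustrative
# and not the official evaluation script.

def accuracy(predictions, references):
    """Fraction of examples whose prediction exactly matches the gold label."""
    if not references:
        return 0.0
    return sum(p == r for p, r in zip(predictions, references)) / len(references)

# Toy example for next-action prediction (labels are illustrative).
ap_predictions = ["pick", "move", "discard", "pick"]
ap_gold        = ["pick", "move", "pick", "pick"]
print(f"AP accuracy: {accuracy(ap_predictions, ap_gold):.2f}")  # 0.75
```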

🧪 Evaluation and Baselines

We benchmark leading vision-language models (VLMs) on J-ORA, including:

We compare zero-shot and fine-tuned settings for the VLMs, with and without object attributes.
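The two conditions can be pictured as two prompt variants, as in the sketch below; the prompt wording and the example utterance are illustrative assumptions, not the paper's actual templates.

```python
# Sketch of the "with vs. without object attributes" prompt conditions.
# The wording and the example utterance are illustrative, not the paper's templates.

def build_prompt(utterance: str, attributes=None) -> str:
    """Compose a task prompt, optionally prepending object-attribute annotations."""
    parts = []
    if attributes:  # "with attributes" condition
        parts.append("Object attributes:")
        parts.extend(f"- {line}" for line in attributes)
    parts.append(f"Utterance (Japanese): {utterance}")
    parts.append("Identify the mentioned objects, describe where they are in the "
                 "image, and predict the next action.")
    return "\n".join(parts)

# Without attributes (plain zero-shot) vs. with attributes.
plain_prompt = build_prompt("そのカップを取って")  # "Pick up that cup"
attr_prompt = build_prompt(
    "そのカップを取って",
    attributes=["cup: white, ceramic, on the table, within arm's reach"],
)
```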

🔍 Key Findings

🔍 Use Cases

J-ORA supports research in:

🛠 Resources

📄 Citation

If you use J-ORA in your research, please cite these two resources:

The IROS 2025 citation is coming soon...

and J-CRe3, cited below:

@inproceedings{ueda-2024-j-cre3,
  title     = {J-CRe3: A Japanese Conversation Dataset for Real-world Reference Resolution},
  author    = {Nobuhiro Ueda and Hideko Habe and Yoko Matsui and Akishige Yuguchi and Seiya Kawano and Yasutomo Kawanishi and Sadao Kurohashi and Koichiro Yoshino},
  booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
  month     = may,
  year      = {2024},
  url       = {https://aclanthology.org/2024.lrec-main.829},
  pages     = {9489--9502},
  address   = {Turin, Italy},
}

📬 Contact

For questions and collaboration inquiries:

📜 License

The dataset and code are released under CC BY-SA 4.0 (Creative Commons Attribution-ShareAlike 4.0 International License).