Omni-MMSI:
Towards Identity-attributed Social Interaction Understanding
Abstract
We introduce Omni-MMSI, a new task that requires comprehensive social interaction understanding from raw audio, vision, and speech inputs. The task involves perceiving identity-attributed social cues (e.g., who says what) and reasoning about the social interaction (e.g., whom a speaker refers to). This capability is essential for developing AI assistants that can perceive and respond to human interactions. Unlike prior studies that operate on oracle-preprocessed social cues, Omni-MMSI reflects realistic scenarios in which AI assistants must perceive and reason directly from raw data. However, existing pipelines and multi-modal LLMs perform poorly on Omni-MMSI because they lack reliable identity attribution, which in turn leads to inaccurate social interaction understanding. To address this task, we propose Omni-MMSI-R, a reference-guided pipeline that uses tools to produce identity-attributed social cues and then performs chain-of-thought social reasoning over them. To train this pipeline, we construct participant-level reference pairs and curate reasoning annotations on top of existing datasets. Experiments demonstrate that Omni-MMSI-R outperforms advanced LLMs and pipeline counterparts on Omni-MMSI.
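To make the two-stage structure of such a reference-guided pipeline concrete, the minimal Python sketch below shows one plausible shape: tool stubs first turn raw input into identity-attributed cues by matching against participant-level references, and the cues are then serialized into a chain-of-thought prompt for a reasoning model. Every name here (Reference, Cue, diarize_and_transcribe, match_identity, and the canned outputs) is an illustrative assumption, not the authors' actual interface or released code.

# Minimal sketch of a reference-guided pipeline for identity-attributed
# social interaction understanding. All names and stub outputs are
# illustrative assumptions, not the Omni-MMSI-R implementation.

from dataclasses import dataclass


@dataclass
class Reference:
    """Participant-level reference pair: a name bound to exemplars (hypothetical)."""
    name: str
    face_exemplar: str   # e.g., path to a reference face crop (placeholder)
    voice_exemplar: str  # e.g., path to a reference voice clip (placeholder)


@dataclass
class Cue:
    """An identity-attributed social cue: who said what, and when."""
    speaker: str
    utterance: str
    start_s: float
    end_s: float


def diarize_and_transcribe(audio_path: str) -> list[dict]:
    """Tool stub for speaker diarization + ASR. A real pipeline would call
    off-the-shelf tools here; we return canned output for illustration."""
    return [
        {"text": "Could you pass it to her?", "start": 1.2, "end": 2.9},
    ]


def match_identity(segment: dict, references: list[Reference]) -> str:
    """Tool stub: match a segment's voice/face to a named participant via
    the reference exemplars (placeholder: returns the first name)."""
    return references[0].name if references else "unknown"


def attribute_cues(audio_path: str, references: list[Reference]) -> list[Cue]:
    """Stage 1: produce identity-attributed cues from raw input."""
    cues = []
    for seg in diarize_and_transcribe(audio_path):
        who = match_identity(seg, references)
        cues.append(Cue(who, seg["text"], seg["start"], seg["end"]))
    return cues


def build_reasoning_prompt(cues: list[Cue], question: str) -> str:
    """Stage 2: serialize cues into a chain-of-thought prompt for an LLM."""
    lines = [f"[{c.start_s:.1f}-{c.end_s:.1f}s] {c.speaker}: {c.utterance}"
             for c in cues]
    return ("Identity-attributed transcript:\n" + "\n".join(lines)
            + f"\n\nQuestion: {question}\nThink step by step, then answer.")


if __name__ == "__main__":
    refs = [Reference("Alice", "alice_face.jpg", "alice_voice.wav")]
    cues = attribute_cues("clip.wav", refs)
    print(build_reasoning_prompt(cues, "Whom does the speaker refer to?"))

The design point the sketch tries to capture is the separation of concerns: identity attribution is handled by tools grounded in participant references before any reasoning happens, so the downstream chain-of-thought step operates on cues that already say who did what.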
BibTeX
@inproceedings{li2026omni,
  title={Omni-MMSI: Towards Identity-attributed Social Interaction Understanding},
  author={Li, Xinpeng and Lai, Bolin and Chen, Hardy and Deng, Shijian and Xie, Cihang and Zhou, Yuyin and Rehg, James M. and Tian, Yapeng},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026},
}