Towards Online Multimodal Social Interaction Understanding
Abstract
In this paper, we introduce a new problem, Online-MMSI, in which a model must perform multimodal social interaction understanding (MMSI) using only historical information. Given a recorded video and a multi-party dialogue, the AI assistant must immediately identify the speaker’s referent, a capability critical for real-world human-AI interaction. Without access to future conversational context, both humans and models suffer substantial performance degradation when moving from the offline to the online setting. To tackle these challenges, we propose Online-MMSI-VLM, a novel framework based on multimodal large language models. The core innovations of our approach lie in two components: (1) multi-party conversation forecasting, which predicts upcoming speaker turns and utterances in a coarse-to-fine manner; and (2) socially-aware visual prompting, which highlights salient social cues in each video frame using bounding boxes and body keypoints. Our model achieves state-of-the-art results on three tasks across two datasets, significantly outperforming the baseline and demonstrating the effectiveness of Online-MMSI-VLM.
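To make the coarse-to-fine forecasting idea concrete, below is a minimal sketch assuming a generic text-generation callable `generate`; the function name, prompt wording, and data layout are illustrative assumptions, not the paper's actual backbone or prompts. It first predicts who speaks next (coarse), then conditions on that answer to predict what they say (fine).

```python
# Hedged sketch of coarse-to-fine multi-party conversation forecasting.
# `generate` is a hypothetical stand-in for any LLM completion function.
from typing import Callable, List, Tuple

def forecast_next_turn(
    history: List[Tuple[str, str]],     # (speaker, utterance) pairs so far
    speakers: List[str],                # participants in the conversation
    generate: Callable[[str], str],     # any text-completion function
) -> Tuple[str, str]:
    """Predict the upcoming turn in two stages: speaker first, then utterance."""
    transcript = "\n".join(f"{s}: {u}" for s, u in history)

    # Coarse stage: predict WHO speaks next.
    coarse_prompt = (
        f"Dialogue so far:\n{transcript}\n"
        f"Participants: {', '.join(speakers)}\n"
        "Who speaks next? Answer with one name."
    )
    next_speaker = generate(coarse_prompt).strip()

    # Fine stage: predict WHAT that speaker says, conditioned on the coarse result.
    fine_prompt = (
        f"Dialogue so far:\n{transcript}\n"
        f"{next_speaker} speaks next. Predict their utterance."
    )
    next_utterance = generate(fine_prompt).strip()
    return next_speaker, next_utterance
```

The forecast can then be appended to the dialogue history as a pseudo-future context before the referent-identification step.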
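Similarly, socially-aware visual prompting can be sketched as a simple overlay step, assuming person bounding boxes and body keypoints are already available from an off-the-shelf detector; the helper name, data layout, and color scheme here are assumptions for illustration.

```python
# Hedged sketch of socially-aware visual prompting: overlay each person's
# bounding box and body keypoints on a video frame before feeding it to the MLLM.
import cv2
import numpy as np

def draw_social_prompts(frame, people):
    """Return a copy of `frame` with boxes and keypoints drawn per person.

    `people` is a list of dicts with:
      - "box": (x1, y1, x2, y2) pixel coordinates
      - "keypoints": list of (x, y) body-joint coordinates
    """
    out = frame.copy()
    for i, person in enumerate(people):
        # Illustrative color choice, e.g., to distinguish speaker from listeners.
        color = (0, 255, 0) if i == 0 else (0, 0, 255)
        x1, y1, x2, y2 = person["box"]
        cv2.rectangle(out, (x1, y1), (x2, y2), color, 2)
        for (x, y) in person["keypoints"]:
            cv2.circle(out, (int(x), int(y)), 3, color, -1)
    return out

# Usage with dummy data:
frame = np.zeros((480, 640, 3), dtype=np.uint8)
people = [{"box": (100, 80, 220, 400), "keypoints": [(160, 120), (140, 200)]}]
prompted = draw_social_prompts(frame, people)
```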
BibTeX
@article{li2025towards,
title={Towards online multimodal social interaction understanding},
author={Li, Xinpeng and Deng, Shijian and Lai, Bolin and Pian, Weiguo and Rehg, James M and Tian, Yapeng},
journal={Transactions on Machine Learning Research (TMLR)},
year={2026},
}