Unified Robotic Vision with Cross-Modal Sensing and Alignment

URVIS bridges computer vision, robotics, and multimodal learning to build unified perception across RGB, depth, LiDAR, event cameras, audio, and tactile sensing.

Workshop Scope

URVIS investigates how unified perception can emerge from heterogeneous sensing. We explore architectures and training strategies that allow modalities to guide, compensate for, or regularize one another under partial or missing signals.
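One instance of such a training strategy is cross-modal alignment, where matched observations from two sensors are pulled together in a shared embedding space. The sketch below is a minimal, illustrative example using a symmetric InfoNCE loss between RGB and depth features; the projector architecture, feature dimensions, and temperature are assumptions made for the example, not a reference design from the workshop.

```python
# Illustrative sketch only: symmetric InfoNCE alignment between two modality
# embeddings (e.g. RGB and depth). The MLP projectors stand in for real
# backbones (CNN/ViT image encoders, point-cloud networks, etc.).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityProjector(nn.Module):
    """Projects per-modality features into a shared, unit-norm embedding space."""

    def __init__(self, in_dim: int, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)


def cross_modal_nce(z_a: torch.Tensor, z_b: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Matched (RGB, depth) pairs are positives; all other batch pairs are negatives."""
    logits = z_a @ z_b.t() / temperature                      # (B, B) similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)    # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    rgb_proj, depth_proj = ModalityProjector(512), ModalityProjector(256)
    rgb_feat, depth_feat = torch.randn(32, 512), torch.randn(32, 256)  # dummy features
    loss = cross_modal_nce(rgb_proj(rgb_feat), depth_proj(depth_feat))
    loss.backward()
    print(f"alignment loss: {loss.item():.3f}")
```

In practice such an alignment term would be combined with task losses; the point here is only to show how one modality can regularize another's representation.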

  • Semantic Consistency: aligned cross-modal features across RGB, depth, LiDAR, event, audio, and tactile cues.
  • Unified Design: sensor-aware models with dynamic weighting by reliability and context.
  • Scaling Behavior: quality vs. quantity, weighing a few well-calibrated sensors against many redundant ones.
  • Cross-Modal Guidance: filling gaps gracefully during inference under dropouts, occlusions, or asynchronous data.

Core challenge: learning under missing or partial modalities while retaining robustness, interpretability, and efficiency for real-world deployment.
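To make this challenge concrete, the following minimal sketch fuses per-modality embeddings with learned reliability gates and simulates modality dropout during training, so that any subset of sensors can be used at inference time. The module names, dimensions, and gating scheme are illustrative assumptions, not a prescribed design.

```python
# Illustrative sketch only: late fusion with per-modality reliability gates and
# random modality dropout, so the model learns to cope with missing sensors.
import torch
import torch.nn as nn


class ReliabilityWeightedFusion(nn.Module):
    """Fuses per-modality embeddings with gates predicted from each embedding;
    unavailable modalities are masked out and the weights are renormalized."""

    def __init__(self, embed_dim: int, num_modalities: int):
        super().__init__()
        self.gates = nn.ModuleList(
            [nn.Linear(embed_dim, 1) for _ in range(num_modalities)]
        )

    def forward(self, feats: list[torch.Tensor], present: torch.Tensor) -> torch.Tensor:
        # feats: list of (B, D) embeddings, one per modality
        # present: (B, M) mask, 0 marks a dropped or unavailable modality
        scores = torch.cat([g(f) for g, f in zip(self.gates, feats)], dim=-1)  # (B, M)
        scores = scores.masked_fill(present == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1)                                # (B, M)
        return (weights.unsqueeze(-1) * torch.stack(feats, dim=1)).sum(dim=1)  # (B, D)


if __name__ == "__main__":
    B, D, M = 8, 128, 3                   # batch, embedding dim, modalities (e.g. RGB/depth/LiDAR)
    fusion = ReliabilityWeightedFusion(D, M)
    feats = [torch.randn(B, D) for _ in range(M)]
    present = (torch.rand(B, M) > 0.3).float()     # simulate modality dropout
    present[present.sum(dim=-1) == 0, 0] = 1.0     # keep at least one modality per sample
    fused = fusion(feats, present)
    print(fused.shape)                             # torch.Size([8, 128])
```

Renormalizing the gates over whichever modalities are present is what lets a single model degrade gracefully when sensors drop out at test time.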

Call for Papers

We invite original research contributions on multisensor and multimodal perception, learning, and reasoning for robotics and embodied AI. The workshop aims to bring together researchers working on the integration, alignment, and exploitation of heterogeneous sensory signals for robust and intelligent robotic systems.

Topics of interest include, but are not limited to:

  • Multisensor and multimodal data fusion (early, late, hybrid, adaptive fusion)
  • Cross-modal alignment, correspondence, and synchronization
  • Multimodal representation learning and embedding spaces
  • Vision-based perception for robotics and autonomous systems
  • RGB-D, RGB-T, event-based, LiDAR, radar, audio, tactile, and proprioceptive sensing
  • Event cameras, neuromorphic sensing, and spiking neural networks
  • Multimodal foundation models and large-scale pretraining
  • Self-supervised, weakly supervised, and unsupervised multimodal learning
  • Multimodal tracking, detection, segmentation, and 3D understanding
  • Sensor calibration, registration, and geometric consistency
  • Uncertainty modeling, probabilistic fusion, and reliability-aware perception
  • Learning under missing, degraded, or noisy modalities
  • Cross-domain and cross-sensor generalization
  • Online, continual, and lifelong multimodal learning
  • Multimodal perception for manipulation, navigation, and human-robot interaction
  • Multimodal SLAM, mapping, and localization
  • Multimodal perception for autonomous driving and mobile robots
  • Efficient, real-time, and resource-aware multimodal systems
  • Simulation-to-real transfer and multimodal domain adaptation
  • Benchmark datasets, evaluation protocols, and reproducible research

We welcome regular paper submissions following the official CVPR Workshops submission pipeline. Accepted papers will be published in the CVPRW Proceedings and must comply with the standard CVPRW formatting guidelines.

All submissions will undergo a double-blind peer-review process to ensure fairness, rigor, and impartiality.

Challenges

Schedule

Time    Event
08:30   Welcome and Opening Remarks
08:40   Challenge Report 1
08:50   Challenge Report 2
09:00   Invited Talk 1
09:30   Invited Talk 2
10:00   Coffee Break
10:20   Invited Talk 3 (TBD)
10:30   Oral Presentation 1
10:40   Oral Presentation 2
10:50   Challenge Report 3
11:00   Invited Talk 4
11:30   Invited Talk 5
12:00   Awards and Closing Remarks

Speakers

Organizers

Sponsors

Special Issue

Based on paper quality, authors of selected papers may be invited to submit extended versions to a special issue of Elsevier Computer Vision and Image Understanding (CVIU).