EgoM2P: Egocentric Multimodal Multitask Pretraining

EgoM2P: A large-scale egocentric multimodal and multitask model, pretrained on eight extensive egocentric datasets. It incorporates four modalities—RGB and depth video, gaze dynamics, and camera trajectories—to handle challenging tasks like monocular egocentric depth estimation, camera tracking, gaze estimation, and conditional egocentric video synthesis. For simplicity, we only visualize four frames here.

Abstract

Understanding multimodal signals in egocentric vision, such as RGB video, depth, camera poses, and gaze, is essential for applications in augmented reality, robotics, and human-computer interaction, enabling systems to better interpret the camera wearer’s actions, intentions, and surrounding environment. However, building large-scale egocentric multimodal and multitask models presents unique challenges. Egocentric data are inherently heterogeneous, with large variations in modality coverage across devices and settings. Generating pseudo-labels for missing modalities, such as gaze or head-mounted camera trajectories, is often infeasible, making standard supervised learning approaches difficult to scale. Furthermore, dynamic camera motion and the complex temporal and spatial structure of first-person video pose additional challenges for the direct application of existing multimodal foundation models. To address these challenges, we introduce a set of efficient temporal tokenizers and propose EgoM2P, a masked modeling framework that learns from temporally-aware multimodal tokens to train a large, general-purpose model for egocentric 4D understanding. This unified design supports multitasking across diverse egocentric perception and synthesis tasks, including gaze prediction, egocentric camera tracking, and monocular depth estimation from egocentric video, and also serves as a generative model for conditional egocentric video synthesis. Across these tasks, EgoM2P matches or outperforms specialist models while being an order of magnitude faster. We will fully open-source EgoM2P to support the community and advance egocentric vision research.

Method

Pretraining

Figure: method overview

(1) We train VQ-VAE tokenizers for camera trajectories and gaze dynamics, and adopt Cosmos tokenizers to tokenize RGB and depth streams. High-dimensional input modalities, including videos, gaze dynamics, and camera trajectories, are compressed into discrete tokens to serve as our training database. (2) Our EgoM2P follows the architecture of T5-Base. We perform multimodal masked pretraining, where we randomly sample a fixed number of input and target tokens from our token database without overlap. For simplicity, we only visualize four frames here.
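The two steps in this caption can be illustrated with a minimal sketch. It is not the released implementation: the function names `quantize` and `sample_input_target`, the toy codebook, and the token budgets are all hypothetical, and a real VQ-VAE learns its codebook together with an encoder and decoder. The sketch only shows the nearest-codebook lookup that turns a continuous sample into a discrete token, and one way to draw a fixed number of input and target tokens with no overlap.

```python
import random

def quantize(vector, codebook):
    """Return the index of the nearest codebook entry (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(vector, codebook[k]))

def sample_input_target(tokens_by_modality, num_input, num_target, seed=None):
    """Draw disjoint input and target token slots from a multimodal token pool."""
    rng = random.Random(seed)
    # Every token slot is identified by a (modality, position) pair.
    slots = [(name, i)
             for name, toks in tokens_by_modality.items()
             for i in range(len(toks))]
    if num_input + num_target > len(slots):
        raise ValueError("not enough tokens for the requested budgets")
    # Sampling without replacement and then splitting guarantees no overlap.
    picked = rng.sample(slots, num_input + num_target)
    return picked[:num_input], picked[num_input:]

# Toy 2D gaze samples and a hypothetical 3-entry codebook.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
gaze = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8], [0.1, 0.9]]
gaze_tokens = [quantize(g, codebook) for g in gaze]

# Placeholder RGB token ids stand in for the output of a video tokenizer.
tokens = {"gaze": gaze_tokens, "rgb": [5, 7, 7, 3, 9, 1]}
inputs, targets = sample_input_target(tokens, num_input=4, num_target=3, seed=0)
```

In this toy setup the model would see the four input slots and be trained to predict the tokens at the three target slots. Any modality can land on either side of the split.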

Multitasking

We benchmark EgoM2P's multitasking abilities against SOTA specialist models on downstream egocentric perception and synthesis tasks. We also evaluate it on unseen datasets without any fine-tuning to show that the pretrained features generalize well.

Egocentric Camera Tracking

EgoExo4D

Compared to specialist SOTA models that require geometry-based test-time optimization, our feed-forward EgoM2P predicts camera trajectories with better translation and rotation accuracy, at an inference speed of 300+ FPS.

EgoM2P also predicts smooth and plausible camera trajectories because it learns to capture the distinctive characteristics of egocentric head motion, whereas baselines suffer from temporal jitter.
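One simple way to put a number on temporal jitter is the mean magnitude of the second-order finite difference of the predicted camera positions. The sketch below is an illustration only: the page does not say which smoothness measure, if any, was used, and the function name `mean_jitter` and both toy trajectories are assumptions.

```python
import math

def mean_jitter(positions):
    """Mean norm of the second-order finite difference of a 3D trajectory."""
    if len(positions) < 3:
        raise ValueError("need at least three positions")
    total = 0.0
    for prev, cur, nxt in zip(positions, positions[1:], positions[2:]):
        # Discrete acceleration: p[t+1] - 2 * p[t] + p[t-1], per axis.
        accel = [n - 2 * c + p for p, c, n in zip(prev, cur, nxt)]
        total += math.sqrt(sum(a * a for a in accel))
    return total / (len(positions) - 2)

# A constant-velocity path versus the same path with alternating lateral noise.
smooth = [(0.1 * t, 0.0, 0.0) for t in range(8)]
jittery = [(0.1 * t, 0.02 * (-1) ** t, 0.0) for t in range(8)]

smooth_score = mean_jitter(smooth)
jittery_score = mean_jitter(jittery)
```

A constant-velocity trajectory scores close to zero, while frame-to-frame oscillation raises the score, so lower values correspond to smoother predicted motion.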

ADT (unseen)

EgoM2P predicts realistic egocentric camera trajectories even on the unseen ADT dataset.

Gaze Estimation in Egocentric Videos

EgoExo4D

EgoM2P better captures the camera wearer's intentions, as reflected in its gaze predictions.

Egocentric Monocular Depth Estimation

H2O

Baselines require sequence-level optimization, which is time-consuming. Our model is at least 30 times faster at inference while maintaining temporal consistency.

HOI4D (unseen)

EgoM2P generalizes well to unseen datasets.

Conditional Egocentric Video Synthesis

HoloAssist

EgoM2P generates depth-aligned egocentric RGB videos with superior quality, minimizing hallucinations and producing realistic finger motions.

Egocentric 4D Reconstruction

Given ground-truth camera intrinsics and an egocentric video, we compare EgoM2P with the SOTA baseline MegaSAM (CVPR 2025) for 4D reconstruction. Unlike MegaSAM, which relies on SOTA monocular depth estimators and expensive geometry optimization, EgoM2P efficiently reconstructs dynamic egocentric scenes. For a 2-second video at 8 FPS, EgoM2P completes the reconstruction in less than 1 second, whereas MegaSAM requires 71 seconds.
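For intuition, here is a minimal sketch of how per-frame depth, camera poses, and known intrinsics can be fused into a point cloud. It assumes a standard pinhole camera model and 4x4 camera-to-world poses. The names `backproject` and `reconstruct` and the toy values are hypothetical, and the actual EgoM2P reconstruction pipeline may differ.

```python
def backproject(depth, fx, fy, cx, cy, cam_to_world):
    """Lift a depth map (rows of metric depths) into world-space 3D points."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:  # skip pixels without a valid depth
                continue
            # Pinhole model: pixel (u, v) at depth z -> homogeneous camera point.
            p_cam = ((u - cx) * z / fx, (v - cy) * z / fy, z, 1.0)
            # Apply the top three rows of the 4x4 camera-to-world transform.
            p_world = tuple(
                sum(cam_to_world[r][c] * p_cam[c] for c in range(4))
                for r in range(3)
            )
            points.append(p_world)
    return points

def reconstruct(depth_frames, intrinsics, poses):
    """Accumulate per-frame point clouds using each frame's camera pose."""
    cloud = []
    for depth, pose in zip(depth_frames, poses):
        cloud.extend(backproject(depth, *intrinsics, pose))
    return cloud

# Toy example: two 2x2 depth maps, the second with one invalid pixel.
intrinsics = (1.0, 1.0, 0.5, 0.5)  # fx, fy, cx, cy (assumed values)
shifted = [[1.0, 0.0, 0.0, 1.0],   # identity rotation, +1 along x
           [0.0, 1.0, 0.0, 0.0],
           [0.0, 0.0, 1.0, 0.0],
           [0.0, 0.0, 0.0, 1.0]]
frames = [[[2.0, 2.0], [2.0, 2.0]], [[2.0, 0.0], [2.0, 2.0]]]
cloud = reconstruct(frames, intrinsics, [shifted, shifted])
```

Because each frame is lifted with its own pose, a moving object appears at different world positions over time, which is what makes the result a dynamic (4D) reconstruction rather than a single static cloud.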

MegaSAM (71 s)

Ours (<1 s)

Video

Coming Soon...

Citation

EgoM2P: Egocentric Multimodal Multitask Pretraining
Gen Li, Yutong Chen, Yiqian Wu, Kaifeng Zhao, Marc Pollefeys, Siyu Tang

@article{li2025egom2p,
    title={EgoM2P: Egocentric Multimodal Multitask Pretraining},
    author={Li, Gen and Chen, Yutong and Wu, Yiqian and Zhao, Kaifeng and Pollefeys, Marc and Tang, Siyu},
    journal={arXiv preprint arXiv:2506.07886},
    year={2025}
}

Contact

For questions, please contact Gen Li:
gen.li@inf.ethz.ch

EgoM2P

Egocentric Multimodal Multitask Pretraining

Gen Li¹ Yutong Chen¹ Yiqian Wu¹,² Kaifeng Zhao¹* Marc Pollefeys¹,³ Siyu Tang¹

* Indicates equal contribution, listed in alphabetical order.

¹ETH Zürich ²Zhejiang University ³Microsoft

The International Conference on Computer Vision (ICCV) 2025
