4D World Model (MVISTA-4D)

Co-author · ICML 2026 (arXiv). A collaboration across CUHK MMLab, HKUST, HKU, Tsinghua, and X-Humanoid.

MVISTA-4D turns a single RGB-D view into geometry-consistent predictions across four or more synchronized cameras, supporting an imagine-then-act manipulation paradigm. Although trained on only 2–3 views, it generalizes to 4–5 through a masked-completion strategy. On the RoboTwin benchmark it reaches an FVD of 21.93, an AbRel of 2.60, and a Chamfer Distance of 6.51 cm, outperforming UniPi, 4DGen, and TesserAct, and it achieves the highest real-robot success rate among the compared methods across 14 manipulation tasks.

My contribution (research assistant, 2025.12 – 2026.02): I built the physical robot platform; the multi-camera calibration, time-synchronization, and cross-view alignment pipeline; and the 4D manipulation dataset. I also validated imagine-then-act on real hardware, closing the loop from sensor capture and multi-view reconstruction through model training to real-robot validation.
