We propose an embodied 4D world model for robotic manipulation that takes single-view RGBD input and generates multi-view-consistent 3D/4D scene representations, enabling an "imagine-then-act" decision paradigm. The work covers the complete loop, from sensor capture, multi-view reconstruction, dataset construction, and model training through to real-robot validation, and demonstrates the feasibility of generative world models for action reasoning on physical robots.
@inproceedings{he2026vista,
  title     = {MVISTA-4D: View-Consistent 4D World Model with Test-Time Action Inference for Robotic Manipulation},
  author    = {Wang, Jiaxu and Jiang, Yicheng and He, Tianlun and Sun, Jingkai and Zhang, Qiang and He, Junhao and Cao, Jiahang and Gan, Zesen and Sun, Mingyuan and Shao, Qiming and Yue, Xiangyu},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  year      = {2026},
  month     = apr,
}