Action-Geometry Prediction with 3D Geometric Prior for Bimanual Manipulation

Nov 1, 2025

Chongyang Xu, Haipeng Li, Cheng Shen, Haoqiang Fan, Ziliang Feng, Shuaicheng Liu

Abstract

This work builds on pre-trained 3D geometric foundation models such as VGGT or pi3. It unifies geometry-aware latents with semantic features to jointly predict future action sequences and the evolution of the 3D scene, enabling robust, coordination-aware bimanual manipulation directly from RGB observations.

Type

Preprint

Publication

Submitted to CVPR 2026

Last updated on Nov 1, 2025

VLA, 3D Foundation Models, Bimanual Manipulation

Authors

Chongyang Xu

🎓 Ph.D. Student @ Sichuan University
🔭 Embodied AI Intern @ Tongyi Lab

Hello! 👋

I’m Chongyang, a researcher in physical AI and robotics who is equally passionate about sports, music, the humanities, and sociology. I’m doing multimodal learning and reinforcement learning in the grandest simulator of all: life, one episode at a time, learning what’s worth the strife.

I’m openly seeking collaborations — if you have any research ideas or projects, feel free to reach out!

Education 🎓

I’ve been studying at Sichuan University for 7 years and have fallen deeply in love with Chengdu. I received my B.Eng. in Software Engineering and am now pursuing my Ph.D. in Computer Science.
