Prompting with the Future: Open-World
Model Predictive Control with Interactive Digital Twins

RSS 2025

Cornell University

We propose Prompting with the Future, a model predictive control framework for open-world manipulation. We build interactive digital twins as dynamics models to predict the outcomes of candidate actions. A VLM acts as the cost function, evaluating these predicted outcomes to guide planning.

Abstract

Open-world robotic manipulation requires robots to perform novel tasks described by free-form language in unstructured settings. While vision-language models (VLMs) offer strong high-level semantic reasoning, they lack the fine-grained physical insight needed for precise low-level control. To address this gap, we introduce Prompting-with-the-Future, a model-predictive control framework that augments VLM-based policies with explicit physics modeling. Our framework builds an interactive digital twin of the workspace from a quick handheld video scan, enabling prediction of future states under candidate action sequences. Instead of asking the VLM to predict actions or outcomes by reasoning about dynamics, the framework simulates diverse possible outcomes and renders them as visual prompts from adaptively selected camera viewpoints that expose the most informative physical context. A sampling-based planner then selects the action sequence that the VLM rates as best aligned with the task objective. We validate Prompting-with-the-Future on eight real-world manipulation tasks involving contact-rich interaction, object reorientation, and tool use, demonstrating significantly higher success rates than state-of-the-art VLM-based control methods. Through ablation studies, we further demonstrate that explicitly modeling physics, while still leveraging VLM semantic strengths, is essential for robust manipulation.

Building interactive digital twins

Starting from a video scan of the environment, we construct an interactive digital twin that combines mesh-based simulation and Gaussian-based rendering. We segment the movable objects in both representations, enabling physically grounded simulation and photo-realistic rendering.
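The dual representation above can be pictured as a container of segmented movable objects, each carrying both a collision mesh (for physics) and Gaussian-splat parameters (for rendering). The following is a minimal illustrative sketch only: the class names, fields, and the toy rigid-translation "physics" are our own stand-ins, not the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class TwinObject:
    """One segmented movable object, kept in both representations."""
    name: str
    mesh_vertices: list            # collision mesh for simulation (placeholder)
    gaussians: list                # Gaussian-splat parameters for rendering (placeholder)
    pose: tuple = (0.0, 0.0, 0.0)  # object pose in the workspace

@dataclass
class DigitalTwin:
    """Interactive digital twin: simulate on meshes, render from Gaussians."""
    objects: dict = field(default_factory=dict)

    def add(self, obj: TwinObject) -> None:
        self.objects[obj.name] = obj

    def step(self, name: str, delta: tuple) -> None:
        # The paper uses mesh-based physics; a rigid translation stands in here.
        o = self.objects[name]
        o.pose = tuple(p + d for p, d in zip(o.pose, delta))

    def render(self) -> dict:
        # Splatting stand-in: report each object's pose instead of pixels.
        return {n: o.pose for n, o in self.objects.items()}

twin = DigitalTwin()
twin.add(TwinObject("mug", mesh_vertices=[], gaussians=[]))
twin.step("mug", (0.1, 0.0, 0.0))
```

Keeping the two representations segmented per object is what lets a simulated state change (a mesh pose update) be reflected directly in the photo-realistic rendering.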

Sampling-based motion planning

With the interactive digital twin, we can simulate outcomes of candidate actions and render the resulting states. The VLM adaptively selects the most informative view for rendering and evaluates the predicted outcomes for sampling-based motion planning.
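The planning loop described above, simulate candidates, render the resulting states, let a scorer pick the best, can be sketched as below. This is a hedged illustration: the sampling distribution, horizon, and the scalar `score` stand-in for the VLM evaluator (which in the paper ranks rendered images against the task objective) are all our own placeholders.

```python
import random

def plan_action(initial_state, simulate, render, score, n_samples=16, horizon=4):
    """Sampling-based planner sketch: roll out candidate action sequences
    in the digital twin and keep the one the scorer rates best."""
    best_seq, best_score = None, float("-inf")
    for _ in range(n_samples):
        # Sample a candidate action sequence (placeholder distribution).
        seq = [random.uniform(-1.0, 1.0) for _ in range(horizon)]
        state = initial_state
        for action in seq:
            state = simulate(state, action)  # digital-twin dynamics rollout
        image = render(state)                # rendering of the predicted state
        s = score(image)                     # VLM-as-cost-function stand-in
        if s > best_score:
            best_seq, best_score = seq, s
    return best_seq

# Toy stand-ins: the state is a scalar and the goal is to reach 0.
simulate = lambda x, a: x + a
render = lambda x: x            # "rendering" is just the state here
score = lambda img: -abs(img)   # higher is better, peaks at the goal
random.seed(0)
best = plan_action(3.0, simulate, render, score)
```

In the actual system the scorer is a VLM prompted with rendered images from adaptively chosen viewpoints, so the planner needs no gradient through the dynamics or the evaluator, only the ability to rank sampled outcomes.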

Interactive digital twins


The interactive digital twins closely resemble the real world, modeling the dynamics and providing photo-realistic rendering.

Open-world manipulation


Prompting with the Future can perform diverse open-world manipulation tasks without any task-specific training or examples.

BibTeX

@inproceedings{ning2025prompting,
  title={Prompting with the Future: Open-World Model Predictive Control with Interactive Digital Twins},
  author={Ning, Chuanruo and Fang, Kuan and Ma, Wei-Chiu},
  booktitle={Robotics: Science and Systems (RSS)},
  year={2025}
}