Yiran Ling2,3,7,*,‡, Qing Lian1,*, Jinghang Li1,4, Qing Jiang3,5, Tianming Zhang6, Xiaoke Jiang6, Chuanxiu Liu6, Jie Liu2,7,†, Lei Zhang1,3,6,†
* Equal contribution
‡ This work was done during an internship at Futian Laboratory.
† Corresponding author: leizhang@idea.edu.cn, jieliu@hit.edu.cn.
1Futian Laboratory
2Faculty of Computing, Harbin Institute of Technology
3International Digital Economy Academy (IDEA)
4School of Robotics, Hunan University
5South China University of Technology
6Visincept
7National Key Laboratory of Smart Farm Technologies and Systems
Conventional direct VLA policies can fail under spatial ambiguity or imprecise grounding, since they lack an explicit mechanism for interactive correction. GTA-VLA resolves this by using one-shot spatial guidance (affordance points, boxes, or traces) to correct grounding and enable accurate execution.
In this paper, we propose GTA-VLA (Guide, Think, Act), an interactive Vision-Language-Action (VLA) framework that enables spatially steerable embodied reasoning by allowing users to guide robot policies with explicit visual cues. Existing VLA models learn a direct "Sense-to-Act" mapping from multimodal observations to robot actions. While effective within the training distribution, such tightly coupled policies are brittle under out-of-domain (OOD) shifts and difficult to correct when failures occur. Although recent embodied Chain-of-Thought (CoT) approaches expose intermediate reasoning, they still lack a mechanism for incorporating human spatial guidance, limiting their ability to resolve visual ambiguities or recover from mistakes. To address this gap, our framework allows users to optionally guide the policy with spatial priors, such as affordance points, boxes, and traces, which the subsequent reasoning process can directly condition on. Based on these inputs, the model generates a unified spatial-visual Chain-of-Thought that integrates external guidance with internal task planning, aligning human visual intent with autonomous decision-making. For practical deployment, we further couple the reasoning module with a lightweight reactive action head for efficient action execution. Extensive experiments demonstrate the effectiveness of our approach. On the in-domain SimplerEnv WidowX benchmark, our framework achieves a state-of-the-art 81.2% success rate. Under OOD visual shifts and spatial ambiguities, a single visual interaction substantially improves task success over existing methods, highlighting the value of interactive reasoning for failure recovery in embodied control.
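To make the interaction concrete, below is a minimal, hypothetical sketch (not the paper's implementation) of how an optional spatial prior, a point, box, or trace, could be serialized alongside the language instruction so the reasoning stage can condition on it. The `SpatialGuidance` container, the tag format, and the `build_prompt` helper are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical containers for the three guidance modalities (point, box, trace).
@dataclass
class SpatialGuidance:
    point: Optional[Tuple[float, float]] = None               # (x, y) affordance point, normalized to [0, 1]
    box: Optional[Tuple[float, float, float, float]] = None   # (x1, y1, x2, y2) bounding box
    trace: Optional[List[Tuple[float, float]]] = None         # ordered 2D waypoints

def build_prompt(instruction: str, guidance: Optional[SpatialGuidance] = None) -> str:
    """Serialize the instruction and any optional spatial priors into one text prompt
    that a VLM backbone could condition its spatial-visual chain-of-thought on."""
    parts = [f"Instruction: {instruction}"]
    if guidance is not None:
        if guidance.point is not None:
            parts.append(f"Guidance point: <point x={guidance.point[0]:.3f} y={guidance.point[1]:.3f}>")
        if guidance.box is not None:
            x1, y1, x2, y2 = guidance.box
            parts.append(f"Guidance box: <box {x1:.3f} {y1:.3f} {x2:.3f} {y2:.3f}>")
        if guidance.trace is not None:
            pts = " ".join(f"({x:.3f},{y:.3f})" for x, y in guidance.trace)
            parts.append(f"Guidance trace: <trace {pts}>")
    parts.append("Think step by step about where and how to act, then plan the subtask.")
    return "\n".join(parts)

# Example: a one-shot point correction for an ambiguous "pick up the cup" instruction.
print(build_prompt("pick up the cup", SpatialGuidance(point=(0.62, 0.41))))
```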
Overview of GTA-VLA (Guide, Think, Act). The framework consists of three stages. Guide: the model receives the primary image, the language instruction, and optional spatial priors (e.g., affordance points, boxes, or traces). Think: the VLM backbone generates a conditioned spatial-visual reasoning sequence and the corresponding latent reasoning states H_reasoning. Act: a downstream Flow-Matching action head consumes the latest reasoning states together with high-frequency control observations to produce continuous action chunks. This design decouples slow autoregressive reasoning from fast closed-loop control.
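As a rough illustration of this Guide-Think-Act decoupling, the sketch below (assumed PyTorch pseudocode, not the released model) separates a slow reasoner that produces latent reasoning states from a small flow-matching head that integrates a learned velocity field over a few Euler steps to emit an action chunk. All dimensions, module names, and the integration schedule are placeholders.

```python
import torch
import torch.nn as nn

# Illustrative dimensions; the paper does not specify these.
D_REASON, D_OBS, D_ACT, CHUNK = 512, 64, 7, 8

class FlowMatchingActionHead(nn.Module):
    """Lightweight reactive head: integrates a learned velocity field to turn noise into
    a chunk of continuous actions, conditioned on cached reasoning states and the
    current high-frequency observation."""
    def __init__(self):
        super().__init__()
        self.vel = nn.Sequential(
            nn.Linear(D_REASON + D_OBS + CHUNK * D_ACT + 1, 256),
            nn.ReLU(),
            nn.Linear(256, CHUNK * D_ACT),
        )

    @torch.no_grad()
    def forward(self, h_reason, obs, steps: int = 10):
        a = torch.randn(1, CHUNK * D_ACT)                 # start from Gaussian noise
        for i in range(steps):                            # simple Euler integration of the flow
            t = torch.full((1, 1), i / steps)
            v = self.vel(torch.cat([h_reason, obs, a, t], dim=-1))
            a = a + v / steps
        return a.view(CHUNK, D_ACT)                       # one chunk of continuous actions

class SlowReasoner(nn.Module):
    """Stand-in for the autoregressive VLM backbone: returns latent reasoning states."""
    def __init__(self):
        super().__init__()
        self.encode = nn.Linear(D_OBS, D_REASON)

    @torch.no_grad()
    def forward(self, image_and_prompt_features):
        return self.encode(image_and_prompt_features)

reasoner, head = SlowReasoner(), FlowMatchingActionHead()
h_reason = reasoner(torch.randn(1, D_OBS))                # slow step: runs only when re-planning
for _ in range(3):                                        # fast closed loop: reuses cached h_reason
    obs = torch.randn(1, D_OBS)                           # fresh high-frequency observation
    actions = head(h_reason, obs)
    # send `actions` (CHUNK x D_ACT) to the robot controller here
```

The point of the structure is that the expensive autoregressive reasoning call runs only when a new plan is needed, while the action head reuses the cached reasoning states at control rate.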
Interact-306K and automatic instruction annotation. Left: Dataset composition: 306K episodes collected from six manipulation sources. Right: Automatic annotation pipeline: keyframe extraction and task decomposition from trajectories, followed by open-vocabulary grounding and tracking to produce structured subtask instructions with temporally consistent object annotations.
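The skeleton below is a hypothetical sketch of such an annotation pipeline, with dummy stand-ins where a real system would call a keyframe detector, a task decomposer, an open-vocabulary detector, and a video tracker. The `SubtaskAnnotation` schema and all function names are illustrative assumptions, not the released tooling.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), normalized

# Hypothetical output schema for one annotated subtask; field names are illustrative.
@dataclass
class SubtaskAnnotation:
    instruction: str                     # e.g. "pick up the carrot"
    start_frame: int
    end_frame: int
    object_boxes: Dict[int, Box] = field(default_factory=dict)  # frame index -> tracked target box

# --- Placeholder stages; a real pipeline would plug in actual models here. ---
def extract_keyframes(num_frames: int) -> List[int]:
    return [0, num_frames // 2, num_frames - 1]          # e.g. gripper open/close or contact events

def decompose_task(task: str, keyframes: List[int]) -> List[dict]:
    return [{"instruction": f"{task} (step {i + 1})", "object": task.split()[-1],
             "start": keyframes[i], "end": keyframes[i + 1]}
            for i in range(len(keyframes) - 1)]

def ground_and_track(start: int, end: int) -> Dict[int, Box]:
    return {f: (0.4, 0.4, 0.6, 0.6) for f in range(start, end + 1)}  # dummy, constant box

def annotate_episode(num_frames: int, task: str) -> List[SubtaskAnnotation]:
    """Keyframe extraction and task decomposition, then grounding and tracking per subtask."""
    keyframes = extract_keyframes(num_frames)
    return [SubtaskAnnotation(s["instruction"], s["start"], s["end"],
                              ground_and_track(s["start"], s["end"]))
            for s in decompose_task(task, keyframes)]

print(annotate_episode(num_frames=30, task="pick up the carrot"))
```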
Qualitative examples and success rates across four real-world picking tasks (seen/unseen targets and single/multiple candidate objects). Explicit reasoning improves over the baseline, and point guidance provides the largest gains in unseen and reference-ambiguous settings.
Success rates (%) on the four LIBERO task suites (Spatial, Object, Goal, Long) and the four SimplerEnv WidowX tasks.
| Method | Spatial | Object | Goal | Long | LIBERO Avg | Spoon | Carrot | Cube | Eggplant | SimplerEnv Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| OpenVLA | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 | 4.2 | 0.0 | 8.3 | 45.8 | 14.6 |
| OpenVLA-OFT | 96.2 | 98.3 | 96.2 | 90.7 | 95.3 | - | - | - | - | - |
| π0 | 96.8 | 98.8 | 95.8 | 85.2 | 94.1 | 50.0 | 41.7 | 29.2 | 70.8 | 47.9 |
| GR00T-N1 | 94.4 | 97.6 | 93.0 | 90.6 | 93.9 | 64.5 | 65.5 | 5.5 | 93.0 | 57.1 |
| π0.5 | 98.8 | 98.2 | 98.0 | 92.4 | 96.9 | - | - | - | - | - |
| X-VLA | 98.2 | 98.6 | 97.8 | 97.6 | 98.1 | 95.8 | 75.0 | 62.5 | 70.8 | 76.0 |
| ThinkAct | 88.3 | 91.4 | 87.1 | 70.9 | 84.4 | 37.5 | 8.7 | 58.3 | 70.8 | 43.8 |
| CoT-VLA | 87.5 | 91.6 | 87.6 | 69.0 | 81.1 | - | - | - | - | - |
| Uni-VLA | 97.0 | 99.0 | 92.6 | 90.8 | 94.8 | 83.3 | 66.7 | 33.3 | 95.8 | 69.8 |
| MolmoAct | 87.0 | 95.4 | 87.6 | 77.2 | 86.6 | - | - | - | - | - |
| GTA-VLA | 99.0 | 98.8 | 98.4 | 97.6 | 98.6 | 95.8 | 87.5 | 66.7 | 75.0 | 81.2 |
Success rates (%) under six out-of-domain conditions.
| Method | Sensor | Lighting | State | Diversity | Unseen Obj. | Distractor | Avg |
|---|---|---|---|---|---|---|---|
| OpenVLA | 5.2 | 6.3 | 0.0 | 8.3 | 2.1 | 0.0 | 3.7 |
| π0.5 | 9.4 | 10.4 | 9.4 | 8.3 | 6.3 | 0.0 | 7.3 |
| X-VLA | 27.1 | 68.8 | 68.7 | 66.3 | 36.2 | 46.9 | 52.3 |
| GTA-VLA | 39.6 | 76.1 | 79.2 | 68.1 | 58.3 | 50.0 | 61.4 |
Success rates (%) for different guidance modalities in unseen-object and distractor settings.
| Guidance Modality | Unseen Toy | Unseen Fruit | Unseen Tool | Unseen Avg | Color Distractor | Pos. Distractor | Distractor Avg |
|---|---|---|---|---|---|---|---|
| Dense Instruction (π0.5) | 8.3 | 12.5 | 8.3 | 9.7 | 8.3 | 8.3 | 8.3 |
| Dense Instruction (GTA-VLA) | 12.5 | 41.6 | 29.2 | 27.8 | 45.8 | 29.2 | 37.5 |
| + Visual Point Guide | 33.3 | 47.9 | 41.6 | 40.9 | 58.3 | 50.0 | 54.2 |
| + Visual Box Guide | 54.1 | 70.8 | 45.8 | 56.9 | 41.6 | 45.8 | 43.7 |
Trained only on clean-environment data, the policy transfers to real devices operating in noisy environments with interference.
@misc{ling2026guide,
title={Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models},
author={Ling, Yiran and Lian, Qing and Li, Jinghang and Jiang, Qing and Zhang, Tianming and Jiang, Xiaoke and Liu, Chuanxiu and Liu, Jie and Zhang, Lei},
year={2026},
eprint={2605.13632},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2605.13632}
}