The action decoder translates the model's internal representation into robot commands. Four main approaches exist, each trading off expressiveness, inference speed, and simplicity.
Diffusion policy head
Iterative denoising → continuous action
most expressive · slower inference
Starts from random noise and iteratively refines it into a plausible action. Naturally handles situations where multiple valid actions exist, for example moving left or right around an obstacle. Each denoising step costs compute, which typically makes this the slowest of the four decoders at inference time.
Used by: π0, Octo, HybridVLA, CogACT, RDT-1B
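The iterative refinement described above can be illustrated with a toy one-dimensional sketch. Everything here is an assumption made for illustration: the two "modes" at -1 and +1 stand in for "go left" and "go right", `toy_denoiser` is a hand-written stand-in for a learned noise-prediction network, and the fixed step size of 0.5 replaces a real noise schedule. It is not how any of the listed models is implemented.

```python
import random

# Hypothetical action modes: -1.0 = "pass on the left", +1.0 = "pass on the right".
MODES = (-1.0, 1.0)

def toy_denoiser(action, step, num_steps):
    """Stand-in for a learned network that predicts the noise in `action`.

    A real model would be conditioned on observations and on the timestep;
    this toy version ignores `step` and simply measures the offset from the
    nearest mode.
    """
    nearest = min(MODES, key=lambda m: abs(action - m))
    return action - nearest

def sample_action(num_steps=10, rng=None):
    """Start from Gaussian noise and remove part of the estimated noise each step."""
    rng = rng or random.Random()
    action = rng.gauss(0.0, 1.0)  # pure noise
    for step in range(num_steps):
        predicted_noise = toy_denoiser(action, step, num_steps)
        action -= 0.5 * predicted_noise  # assumed fixed step size, not a real schedule
    return action

if __name__ == "__main__":
    for seed in range(5):
        print(round(sample_action(rng=random.Random(seed)), 3))
```

Different random starts land near either -1 or +1 rather than in between, which is the multimodal behavior the text describes. The loop also makes the cost visible: every sample requires `num_steps` calls to the denoiser.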
Autoregressive transformer
Token-by-token → discretized action sequence
sequentially expressive · familiar architecture
Actions are discretized into tokens and predicted one at a time, much like next-word prediction in an LLM, which makes this a natural extension of the language model backbone. The tradeoff: discretization loses precision, and smooth continuous motion is hard to capture with a fixed vocabulary of action tokens.
Used by: ACT, Gato, OpenDrive VLA, GRAPE, ECoT
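To make the discretization tradeoff concrete, here is a minimal sketch of uniform binning with a token-by-token decoding loop. The bin count (256), the action range of [-1, 1], and the `next_token_fn` callback (a placeholder for the transformer's next-token prediction) are all illustrative assumptions, not the scheme of any particular model listed above.

```python
NUM_BINS = 256            # assumed vocabulary size per action dimension
LOW, HIGH = -1.0, 1.0     # assumed normalized action range
BIN_WIDTH = (HIGH - LOW) / NUM_BINS

def tokenize(value):
    """Map a continuous value to a bin index in [0, NUM_BINS - 1]."""
    clipped = min(max(value, LOW), HIGH)
    index = int((clipped - LOW) / BIN_WIDTH)
    return min(index, NUM_BINS - 1)  # HIGH itself falls into the last bin

def detokenize(token):
    """Map a bin index back to the center of its bin."""
    return LOW + (token + 0.5) * BIN_WIDTH

def decode_action(next_token_fn, action_dim):
    """Predict one token per action dimension, each conditioned on the previous ones."""
    tokens = []
    for _ in range(action_dim):
        tokens.append(next_token_fn(tokens))  # hypothetical model call
    return [detokenize(t) for t in tokens]

if __name__ == "__main__":
    value = 0.1234
    recovered = detokenize(tokenize(value))
    print(value, recovered, abs(value - recovered))
```

The round trip can be off by up to half a bin width, which is the precision loss mentioned above. More bins shrink the error but enlarge the vocabulary, and the loop in `decode_action` shows why decoding cost grows with the number of action dimensions.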
MLP and token predictor
Single forward pass → direct action values
fastest · simplest
A feedforward network maps the model's output representation directly to continuous joint values or end-effector deltas. There is no iteration and no decoding loop: one forward pass gives you the action. The cost: a single deterministic output struggles when a task is genuinely ambiguous, because it cannot represent several valid actions at once. Works well with enough data and constrained tasks.
Used by: OpenVLA, RoboAgent, HiRT, ARM4R
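A single-pass head can be sketched as a two-layer network written in plain Python. The weights in `PARAMS`, the layer sizes, and the `tanh` activation are arbitrary placeholders chosen for illustration. The helper `best_constant_under_mse` is likewise hypothetical and is included only to show why a deterministic output trained with squared error struggles on ambiguous tasks.

```python
import math

def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def mlp_head(features, params):
    """One forward pass from a feature vector to continuous action values."""
    hidden = [math.tanh(h) for h in linear(features, params["w1"], params["b1"])]
    return linear(hidden, params["w2"], params["b2"])

# Placeholder weights: 2 input features -> 2 hidden units -> 3 action values.
PARAMS = {
    "w1": [[0.5, -0.5], [0.25, 0.25]],
    "b1": [0.0, 0.0],
    "w2": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    "b2": [0.0, 0.0, 0.1],
}

def best_constant_under_mse(demonstrations):
    """The single value that minimizes mean squared error is the mean."""
    return sum(demonstrations) / len(demonstrations)

if __name__ == "__main__":
    print(mlp_head([0.3, -0.2], PARAMS))
    # Demonstrations split between "left" (-1.0) and "right" (+1.0):
    print(best_constant_under_mse([-1.0, 1.0]))
```

The same input always produces the same action, with no sampling and no loop. When demonstrations split evenly between -1 and +1, the squared-error optimum is their average, which corresponds to neither valid choice.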
Planner and MPC head
High-level waypoints → classical controller executes
most modular · safety-friendliest
The model outputs a plan (a sequence of subgoals or waypoints), and a separate Model Predictive Control (MPC) loop executes it. This lets you plug in decades of classical robotics work on trajectory optimization and safety constraints. The tradeoff: the planner and executor can be mismatched, and you reintroduce the hand-engineered pipeline that end-to-end learning was meant to replace.
Used by: VoxPoser, FLaRe, Shake-VLA, LMM Planner Integration
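The split between planner and executor can be sketched with a one-dimensional toy. The waypoints passed to `follow_plan` play the role of the model's plan, and `mpc_step` is a deliberately simple receding-horizon controller that searches exhaustively over a few velocity commands. The dynamics (a point that moves at the commanded velocity), the horizon, the speed limit, and the cost function are all assumptions for illustration; practical MPC would use a proper optimizer and a real system model.

```python
from itertools import product

DT = 0.1          # assumed control period in seconds
MAX_SPEED = 1.0   # assumed safety limit on commanded velocity
HORIZON = 3       # assumed look-ahead in control steps
# Candidate commands never exceed the speed limit, so the constraint holds by construction.
CANDIDATES = tuple(MAX_SPEED * f for f in (-1.0, -0.5, 0.0, 0.5, 1.0))

def mpc_step(position, waypoint):
    """Return the first command of the lowest-cost velocity sequence."""
    best_cost, best_first = float("inf"), 0.0
    for sequence in product(CANDIDATES, repeat=HORIZON):
        predicted, cost = position, 0.0
        for velocity in sequence:
            predicted += velocity * DT              # toy dynamics model
            cost += (predicted - waypoint) ** 2     # squared tracking error
        if cost < best_cost:
            best_cost, best_first = cost, sequence[0]
    return best_first

def follow_plan(start, waypoints, tolerance=0.05, max_steps=200):
    """Track each waypoint in turn, replanning at every control step."""
    position, trajectory = start, [start]
    for waypoint in waypoints:
        for _ in range(max_steps):
            if abs(position - waypoint) <= tolerance:
                break
            position += mpc_step(position, waypoint) * DT
            trajectory.append(position)
    return trajectory

if __name__ == "__main__":
    path = follow_plan(0.0, [0.5, -0.2])
    print(len(path), round(path[-1], 3))
```

The learned model only needs to produce the waypoint list. The speed limit lives entirely in the controller, which is what makes this design convenient for safety constraints, and also why a plan that ignores the controller's limits can be tracked poorly.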
Bar ratings are qualitative and relative across these four approaches.