Action decoder types in vision-language-action models, comparing expressiveness, speed, and simplicity

The action decoder translates the model's internal representation into robot commands. Four main approaches exist, each trading off expressiveness, inference speed, and simplicity.

Diffusion policy head

Iterative denoising → continuous action

most expressive · slower inference

Starts from random noise and iteratively refines it into a plausible action. This naturally handles situations where multiple valid actions exist, for example moving left or right around an obstacle. Each denoising step is a separate pass through the denoising network, which typically makes this the slowest of the four decoders at inference time.

[Rating bars: Expressiveness, Inference speed, Simplicity]

Used by: π0, Octo, HybridVLA, CogACT, RDT-1B
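
The denoising loop described for this decoder can be sketched in a few lines. This is a toy illustration, not the sampler of any model listed above: `toy_denoiser`, the two action modes, and the step-size constants are all assumptions made up for the example, and a real head would use a learned network and a proper noise schedule.

```python
import random

# Hypothetical one-dimensional action: steer left (-1) or right (+1) around an obstacle.
MODES = (-1.0, 1.0)


def toy_denoiser(noisy_action: float) -> float:
    """Stand-in for a learned network: returns the valid action nearest to the input."""
    return min(MODES, key=lambda mode: abs(mode - noisy_action))


def sample_action(rng: random.Random, num_steps: int = 10) -> float:
    """Start from Gaussian noise and refine it over num_steps denoiser calls."""
    action = rng.gauss(0.0, 1.0)  # pure noise
    for step in range(num_steps):
        noise_scale = 1.0 - (step + 1) / num_steps  # shrinks to 0 on the last step
        estimate = toy_denoiser(action)             # one denoiser call per step
        action += 0.5 * (estimate - action)         # move halfway toward the estimate
        action += 0.3 * noise_scale * rng.gauss(0.0, 1.0)  # re-inject a little noise
    return action


if __name__ == "__main__":
    rng = random.Random(0)
    print([round(sample_action(rng), 2) for _ in range(8)])
```

Different noise draws settle near different modes, so repeated sampling yields both "left" and "right" actions rather than a blend of the two. The cost is visible in the loop: `num_steps` denoiser calls are needed for a single action.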

Autoregressive transformer

Token-by-token → discretized action sequence

sequentially expressive · familiar architecture

Actions are discretized into tokens and predicted one at a time, much like next-word prediction in an LLM, which makes this a natural extension of the language model backbone. The tradeoff: discretization loses precision, and smooth continuous motion is hard to capture with a fixed vocabulary of action tokens.

[Rating bars: Expressiveness, Inference speed, Simplicity]

Used by: ACT, Gato, OpenDrive VLA, GRAPE, ECoT
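
To make the discretization tradeoff concrete, here is a minimal sketch of uniform action binning with a token-by-token decoding loop. The bin count (256), the action range [-1, 1], and the `toy_next_token` stand-in are assumptions chosen for illustration; they are not taken from any specific model listed above.

```python
NUM_BINS = 256            # assumed vocabulary size per action dimension
LOW, HIGH = -1.0, 1.0     # assumed normalized action range
BIN_WIDTH = (HIGH - LOW) / NUM_BINS


def tokenize(value: float) -> int:
    """Map a continuous value to the index of the bin that contains it."""
    clipped = min(max(value, LOW), HIGH)
    index = int((clipped - LOW) / BIN_WIDTH)
    return min(index, NUM_BINS - 1)  # the upper edge falls into the last bin


def detokenize(token: int) -> float:
    """Map a token back to the center of its bin."""
    return LOW + (token + 0.5) * BIN_WIDTH


def toy_next_token(prefix: list, target: list) -> int:
    """Hypothetical stand-in for the transformer: emits the token for the next dimension."""
    return tokenize(target[len(prefix)])


def decode_action(target: list) -> list:
    """Generate one token per action dimension, each conditioned on the prefix so far."""
    tokens = []
    for _ in range(len(target)):
        tokens.append(toy_next_token(tokens, target))  # one model call per token
    return [detokenize(token) for token in tokens]


if __name__ == "__main__":
    wanted = [0.3, -0.75, 1.0]
    decoded = decode_action(wanted)
    print(decoded, [abs(d - w) for d, w in zip(decoded, wanted)])
```

Even with a perfect predictor, each decoded value can be off by up to half a bin width (`BIN_WIDTH / 2`), and decoding an action with D dimensions takes D sequential model calls.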

MLP and token predictor

Single forward pass → direct action values

fastest · simplest

A feedforward network maps the model's output representation directly to continuous joint values or end-effector deltas. There is no iteration and no decoding loop: one forward pass yields the action. The cost: a single deterministic output struggles when a task is genuinely ambiguous, because a regression head trained on demonstrations that disagree tends to average them into an action that matches none. Works well given enough data and constrained tasks.

[Rating bars: Expressiveness, Inference speed, Simplicity]

Used by: OpenVLA, RoboAgent, HiRT, ARM4R
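
The single-pass mapping, and the way a deterministic output handles ambiguity, can be illustrated with a tiny pure-Python sketch. The weights, layer sizes, and function names below are made up for the example and do not correspond to any model listed above; the second function relies only on the standard fact that the mean minimizes squared error.

```python
import math

# Hypothetical fixed weights for a 2-input, 2-hidden-unit, 2-output head.
W1 = [[0.5, -0.3], [0.2, 0.8]]
B1 = [0.0, 0.0]
W2 = [[1.0, -1.0], [0.5, 0.5]]
B2 = [0.1, -0.2]


def mlp_head(features, w1, b1, w2, b2):
    """One forward pass: features -> tanh hidden layer -> continuous action values."""
    hidden = [
        math.tanh(sum(w * x for w, x in zip(row, features)) + bias)
        for row, bias in zip(w1, b1)
    ]
    return [sum(w * h for w, h in zip(row, hidden)) + bias for row, bias in zip(w2, b2)]


def mse_optimal_action(demonstrated_actions):
    """The single value that minimizes mean squared error over the demos is their mean."""
    return sum(demonstrated_actions) / len(demonstrated_actions)


if __name__ == "__main__":
    print(mlp_head([0.4, -0.1], W1, B1, W2, B2))  # same input always gives same output
    print(mse_optimal_action([-1.0, 1.0]))        # demos: pass left (-1) or right (+1)
```

If half the demonstrations pass an obstacle on the left (-1) and half on the right (+1), the squared-error optimum is 0, which is straight ahead. This is the failure case the description refers to.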

Planner and MPC head

High-level waypoints → classical controller executes

most modular · most safety-friendly

The model outputs a plan, a sequence of subgoals or waypoints, and a separate Model Predictive Control (MPC) loop executes it. This lets you plug in decades of classical robotics work on trajectory optimization and safety constraints. The tradeoff: the planner and executor can be mismatched (the planner may propose waypoints the controller cannot track well), and you reintroduce the hand-engineered pipeline that end-to-end learning was meant to replace.

[Rating bars: Expressiveness, Inference speed, Simplicity]

Used by: VoxPoser, FLaRe, Shake-VLA, LMM Planner Integration
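
The split between planner and executor can be sketched with a deliberately tiny receding-horizon controller. Everything here is an assumption for illustration: the state is a single coordinate, the dynamics are `x_next = x + u`, the waypoints are hard-coded rather than produced by a model, and the optimizer is brute-force enumeration over a small control set instead of a real trajectory optimizer.

```python
from itertools import product

CONTROLS = (-0.2, -0.1, 0.0, 0.1, 0.2)  # bounded step sizes act as a hard speed limit
HORIZON = 3                             # number of steps the controller looks ahead


def mpc_step(x: float, goal: float) -> float:
    """Score every control sequence over the horizon and return the best first control."""
    best_cost, best_first = float("inf"), 0.0
    for sequence in product(CONTROLS, repeat=HORIZON):
        position, cost = x, 0.0
        for u in sequence:
            position += u                   # assumed dynamics: x_next = x + u
            cost += (position - goal) ** 2  # penalize distance to the waypoint
        if cost < best_cost:
            best_cost, best_first = cost, sequence[0]
    return best_first


def follow_waypoints(start: float, waypoints, tol: float = 0.05, max_steps: int = 200):
    """Track each waypoint in turn, replanning after every applied control."""
    x, trajectory = start, [start]
    for goal in waypoints:
        for _ in range(max_steps):
            if abs(x - goal) <= tol:
                break
            x += mpc_step(x, goal)  # apply only the first control, then replan
            trajectory.append(x)
    return trajectory


if __name__ == "__main__":
    # In a real system the waypoints would come from the model's plan.
    print(follow_waypoints(0.0, [1.0, 0.5]))
```

The speed limit lives entirely in the controller (`CONTROLS`), so it holds no matter what waypoints the planner emits. That separation is what makes this design modular and amenable to safety constraints.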

Bar ratings are qualitative and relative across these four approaches.