The action decoder translates the model's internal representation into robot commands. Four main approaches exist, each trading off expressiveness, inference speed, and simplicity.

Diffusion policy head

Iterative denoising → continuous action

most expressive · slower inference

Starts from random noise and iteratively refines it into a plausible action. Because the output is sampled rather than regressed, it naturally handles situations where multiple valid actions exist, for example moving left or right around an obstacle. Each denoising step requires another forward pass through the denoising network, which typically makes this the slowest decoder at inference time.
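
The refinement loop described above can be sketched in a few lines. This is a toy illustration only, not a real diffusion sampler: `toy_denoiser` is a hypothetical hand-written stand-in for a learned denoising network, there is no noise schedule, and the two target values are assumed for the obstacle example. It shows how the same loop, started from different noise, can settle on either valid action.

```python
import random

# Two equally valid actions (assumed): steer left (-1.0) or right (+1.0).
MODES = (-1.0, 1.0)

def toy_denoiser(action: float) -> float:
    """Hypothetical stand-in for a learned network.

    Returns the offset between the current sample and the nearest valid action.
    """
    nearest = min(MODES, key=lambda m: abs(action - m))
    return action - nearest

def sample_action(rng: random.Random, num_steps: int = 20, step_size: float = 0.3) -> float:
    action = rng.gauss(0.0, 1.0)   # start from pure noise
    for _ in range(num_steps):     # each iteration stands for one network forward pass
        action -= step_size * toy_denoiser(action)
    return action

rng = random.Random(0)
samples = [sample_action(rng) for _ in range(200)]
left = sum(1 for s in samples if s < 0)
print(f"left: {left}, right: {len(samples) - left}")
```

The inference cost is visible in `num_steps`: halving it halves the number of forward passes, but leaves each sample further from a valid action.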

Used by: π0, Octo, HybridVLA, CogACT, RDT-1B

Autoregressive transformer

Token-by-token → discretized action sequence

sequentially expressive · familiar architecture

Actions are discretized into tokens and predicted one at a time, just like next-word prediction in an LLM, which makes this a natural extension of the language model backbone. The tradeoff: discretization loses precision, because each continuous value is rounded to an entry in a fixed vocabulary of action tokens, so smooth continuous motion is hard to capture.
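
The precision loss from discretization can be made concrete with a small sketch. It assumes uniform binning of each action dimension into 256 bins over the range [-1, 1]; the bin count, range, and function names are illustrative choices, not taken from any particular model.

```python
def encode(value: float, low: float = -1.0, high: float = 1.0, num_bins: int = 256) -> int:
    """Map a continuous action value to a token id (its bin index)."""
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * num_bins)
    return min(index, num_bins - 1)  # the upper edge falls into the last bin

def decode(token: int, low: float = -1.0, high: float = 1.0, num_bins: int = 256) -> float:
    """Map a token id back to the center of its bin."""
    width = (high - low) / num_bins
    return low + (token + 0.5) * width

action = [0.1234, -0.5678, 0.9999]         # one continuous action, three dimensions
tokens = [encode(a) for a in action]       # what the model would predict, one token at a time
recovered = [decode(t) for t in tokens]    # what the robot actually receives
max_error = max(abs(a - r) for a, r in zip(action, recovered))
print(tokens, max_error)
```

Under these assumptions the round-trip error is bounded by half a bin width (2 / 256 / 2, about 0.004). More bins shrink the error but enlarge the token vocabulary.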

Used by: ACT, Gato, OpenDrive VLA, GRAPE, ECoT

MLP and token predictor

Single forward pass → direct action values

fastest · simplest

A feedforward network maps the model's output representation directly to continuous joint values or end-effector deltas. No iteration, no decoding loop: one forward pass gives you the action. The cost: a single deterministic output struggles when a task is genuinely ambiguous, because a head trained by regression tends to average the valid options instead of committing to one. Works well with enough data and constrained tasks.
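
A minimal sketch of such a head, in plain Python, is shown here. The layer sizes, the random placeholder weights, and the `tanh` output bound are all assumptions for illustration; a trained model would supply its own parameters. The last lines illustrate the ambiguity problem with the left-or-right obstacle example, assuming a mean-squared-error objective.

```python
import math
import random

def make_layer(rng: random.Random, n_in: int, n_out: int):
    """Random weights as a placeholder for trained parameters."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def linear(layer, x):
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]

def mlp_head(features, hidden_layer, output_layer):
    hidden = [max(0.0, h) for h in linear(hidden_layer, features)]  # ReLU
    return [math.tanh(o) for o in linear(output_layer, hidden)]     # bound actions to [-1, 1]

rng = random.Random(0)
hidden_layer = make_layer(rng, n_in=8, n_out=16)
output_layer = make_layer(rng, n_in=16, n_out=7)     # e.g. 7 joint values (assumed)
features = [rng.gauss(0.0, 1.0) for _ in range(8)]   # placeholder backbone output
action = mlp_head(features, hidden_layer, output_layer)  # one pass, no loop

# Ambiguity: if demonstrations steer left (-1.0) or right (+1.0) equally often,
# the single value that minimizes squared error is their mean.
demos = [-1.0, 1.0]
mse_optimal = sum(demos) / len(demos)
print(action, mse_optimal)
```

In this example the error-minimizing output is 0.0, which corresponds to steering straight at the obstacle.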

Used by: OpenVLA, RoboAgent, HiRT, ARM4R

Planner and MPC head

High-level waypoints → classical controller executes

most modular · safety-friendliest

The model outputs a plan — a sequence of subgoals or waypoints — and a separate Model Predictive Control (MPC) loop executes it. This lets you plug in decades of classical robotics work on trajectory optimization and safety constraints. The tradeoff: the planner and executor can be mismatched, and you reintroduce the hand-engineered pipeline that end-to-end learning was meant to replace.
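
The split between planner and executor can be sketched as follows. This is a deliberately simplified stand-in for MPC, not a production controller: the dynamics are a 2D point that moves at a commanded velocity, the horizon problem is solved in closed form for a constant velocity under a per-axis speed limit, and the waypoints are hard-coded where a model's plan would normally go. All names and parameter values are illustrative assumptions.

```python
import math

def mpc_step(position, goal, horizon=5, dt=0.1, v_max=1.0):
    """One receding-horizon step.

    Picks the constant velocity that brings the point closest to the goal
    within the horizon, clamps it to the speed limit, and applies only the
    first time step of that plan.
    """
    new_position = []
    for p, g in zip(position, goal):
        v = (g - p) / (horizon * dt)
        v = max(-v_max, min(v_max, v))  # safety constraint lives in the controller
        new_position.append(p + v * dt)
    return tuple(new_position)

def execute_plan(start, waypoints, tolerance=0.01, max_steps=1000):
    position = start
    trajectory = [position]
    for goal in waypoints:
        for _ in range(max_steps):
            if math.dist(position, goal) < tolerance:
                break
            position = mpc_step(position, goal)  # replan from the new state
            trajectory.append(position)
    return trajectory

# Hypothetical planner output; in a VLA system the model would produce this.
waypoints = [(1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
trajectory = execute_plan((0.0, 0.0), waypoints)
final = trajectory[-1]
print(len(trajectory), final)
```

The speed limit is enforced inside `mpc_step` regardless of what the planner asks for, which is the modularity benefit described above. The mismatch risk is also visible: the planner has no knowledge of `v_max`, so it can request waypoints the controller can only reach slowly.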

Used by: VoxPoser, FLaRe, Shake-VLA, LMM Planner Integration

Action decoder types in vision-language-action models, comparing expressiveness, speed, and simplicity