VLA internals · lecture material

Action decoders —
what does each one output?

The four decoder families from the paper's taxonomy. All four receive the same input: a fused vision-language embedding representing what the robot sees and what it has been asked to do. What they do with it differs entirely.

01 · direct prediction
MLP and token predictor
OpenVLA, TraceVLA, RoboMamba — efficient low-level control
The idea
The vision-language embedding passes through one or a few linear layers and directly produces an action vector. One observation in, one action out. No iteration, no sampling. It is the simplest possible decoder — whatever the network learned during training is frozen into the weights, and inference is just a few matrix multiplications.
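A minimal sketch of such a head in PyTorch; the layer sizes and names here are illustrative assumptions, not taken from OpenVLA or any other specific model:

import torch
import torch.nn as nn

EMBED_DIM = 1024    # assumed size of the fused vision-language embedding
ACTION_DIM = 7      # 6 joints + gripper

class MLPActionHead(nn.Module):
    """Two linear layers mapping the fused embedding to a single action vector."""
    def __init__(self, embed_dim=EMBED_DIM, action_dim=ACTION_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, embedding):
        # embedding: [batch, embed_dim] → action: [batch, action_dim]
        return self.net(embedding)

head = MLPActionHead()
embedding = torch.randn(1, EMBED_DIM)   # stand-in for the real fused embedding
action = head(embedding)                # shape [1, 7]: one forward pass, one action

The whole decoder is this single forward pass; it is queried again from scratch at every control timestep.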
What the robot receives
A single action vector — typically 7 numbers for a standard arm: 6 joint angles in radians and a gripper value between 0 (open) and 1 (closed). This gets sent to the robot controller immediately. The model is queried again at the next control timestep.
# Vision + language embedding → two linear layers → action vector
# Output shape: [7] — one action for this single timestep
action = mlp_head(embedding)
# action = [+0.03, −0.31, +0.74, +1.18, −0.07, +0.41, 0.04]
#             j1     j2     j3     j4     j5     j6   gripper
#          (radians)                                  0=open, 1=closed
The core limitation: the MLP always produces one answer — a single point in action space. If training demonstrations contain two valid ways to grasp something (left side, right side), the MLP learns to predict somewhere in between. That in-between position may not be a valid grasp at all. This is the averaging problem that motivates diffusion-based decoders.
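A toy illustration of that averaging effect; the two grasp vectors below are made up for this example:

import numpy as np

# Two equally valid demonstrated grasps, approaching from the left and the right
# (hypothetical values; only joint 1 differs between the two modes):
grasp_left  = np.array([+0.40, -0.30, +0.70, +1.20, -0.10, +0.40, 0.0])
grasp_right = np.array([-0.40, -0.30, +0.70, +1.20, -0.10, +0.40, 0.0])

# A regression loss pulls a deterministic MLP toward the mean of its targets:
mlp_prediction = (grasp_left + grasp_right) / 2
# j1 ends up at 0.0: aimed straight between the two grasps, a pose no demonstration contained.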
02 · token-by-token generation
Autoregressive transformer
Gato, RT-2 — generates action sequences step by step
The idea
Actions are turned into discrete tokens — the same mechanism the model uses to generate words. Each joint dimension is divided into bins (e.g. 256 bins covering the joint's full range). The model generates one token at a time, feeding each result back in as context before generating the next. No new architecture is needed: the LLM's existing next-token machinery handles everything.
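A sketch of the binning step under the assumptions used in this example: 256 bins over a symmetric −π to +π joint range.

import numpy as np

NUM_BINS = 256
JOINT_MIN, JOINT_MAX = -np.pi, np.pi

def action_to_tokens(joint_values):
    """Map continuous joint values onto discrete bin indices 0..255."""
    normalised = (joint_values - JOINT_MIN) / (JOINT_MAX - JOINT_MIN)   # → 0..1
    bins = np.round(normalised * (NUM_BINS - 1))
    return np.clip(bins, 0, NUM_BINS - 1).astype(int)

action_to_tokens(np.array([+0.36, -0.90, +1.24]))   # → [142, 91, 178]
# These indices become <act_142> <act_091> <act_178> ..., generated one token at a time.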
What the robot receives
A sequence of bin indices, decoded back into continuous joint values. Because bins have finite width, there is always a small quantisation error — you cannot command finer than one bin width. With 256 bins over a ±π range, one bin is about 1.4° of joint rotation.
# The model generates this token sequence left to right, one token at a time,
# language and action tokens in the same stream:
"pick" "up" "the" "cup"  <act_142> <act_091> <act_178> <act_203> <act_117> <act_088> <act_012>
#  (language tokens)         j1        j2        j3        j4        j5        j6     gripper
#                                             (action tokens)
"pick""up""the""cup" <act_142> <act_091> <act_178> <act_203> <act_117> <act_088> <act_012>
← language tokens                ← action tokens (hover to see which joint)
# formula: angle = (bin_index / 255) × joint_range + joint_min
# example joint range: −π to +π rad (joint_range = 6.28, joint_min = −3.14)
joint_1 = +0.36 rad   ← bin 142 / 255 × 6.28 − 3.14
joint_2 = −0.90 rad   ← bin 91
joint_3 = +1.24 rad   ← bin 178
joint_4 = +1.86 rad   ← bin 203
joint_5 = −0.26 rad   ← bin 117
joint_6 = −0.97 rad   ← bin 88
gripper = 0.05        ← bin 12 / 255 → nearly open
# Resolution: 256 bins over ±π ≈ 0.025 rad per step ≈ 1.4° per step
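The same decoding written as a small runnable function; the only added assumption is that the gripper is binned over a 0 to 1 range rather than ±π:

import numpy as np

def token_to_value(bin_index, low, high, num_bins=256):
    """Invert the binning: discrete bin index → continuous value."""
    return bin_index / (num_bins - 1) * (high - low) + low

joint_1 = token_to_value(142, -np.pi, np.pi)    # ≈ +0.36 rad
gripper = token_to_value(12, 0.0, 1.0)          # ≈ 0.05, nearly open
resolution = 2 * np.pi / 256                    # ≈ 0.025 rad ≈ 1.4° per bin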
Why this approach exists: the LLM already knows how to generate tokens — language, code, and now actions all use the same mechanism. No new decoder architecture is needed. The trade-off is quantisation: bins are discrete, so fine-grained precision tasks (sub-millimetre insertion) are out of reach without dramatically increasing the number of bins, which inflates the vocabulary and slows training.
03 · iterative denoising
Diffusion policy head
Octo, HybridVLA, DexGraspVLA — fine-grained, temporally smooth control
The idea
Instead of predicting an action directly, the model learns to remove noise. At inference you start with a trajectory of pure random noise and run K denoising steps. Each step is guided by the vision-language embedding, nudging the noisy trajectory a little closer to something physically meaningful. After K steps the noise has resolved into a clean, smooth action sequence.
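Schematically, the inference loop looks like the sketch below. The denoising network is a placeholder and the update rule is deliberately simplified; a real DDPM/DDIM-style scheduler rescales the noise at every step.

import torch

T, ACTION_DIM, K = 16, 7, 50            # horizon, action size, denoising steps (illustrative)
embedding = torch.randn(1024)           # stand-in for the fused vision-language embedding

def predict_noise(trajectory, step, embedding):
    # Placeholder for the trained denoising network: a real model predicts
    # the noise still present in `trajectory` at this step, given the embedding.
    return torch.zeros_like(trajectory)

trajectory = torch.randn(T, ACTION_DIM)          # start from pure Gaussian noise
for step in reversed(range(K)):                  # K guided refinement passes
    noise = predict_noise(trajectory, step, embedding)
    trajectory = trajectory - noise / K          # simplified update; real schedulers rescale per step
# After K steps the noise has resolved into a smooth [T × 7] action sequence.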
What the robot receives
A full trajectory matrix — T timesteps × 7 joint values. All values are continuous floats. The iterative denoising process naturally produces smooth transitions between timesteps, because each step refines the whole trajectory at once rather than predicting each timestep independently. This smoothness matters in contact-rich tasks where sudden jumps cause slipping or damage.
# Shape: [T timesteps × 7 joints] — continuous floats, no quantisation
trajectory[t=0]  = [+0.034, −0.287, +0.741, +1.182, −0.073, +0.412, 0.04]
trajectory[t=1]  = [+0.041, −0.301, +0.758, +1.197, −0.079, +0.421, 0.05]
trajectory[t=2]  = [+0.049, −0.314, +0.772, +1.209, −0.084, +0.429, 0.07]
# ... values change smoothly across all timesteps ...
trajectory[t=15] = [+0.198, −0.447, +0.921, +1.388, −0.142, +0.541, 0.91]
# Compare to MLP: the MLP outputs one row. Diffusion outputs the whole table.
# Compare to autoregressive: no bins, no quantisation loss.
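A quick numeric check of that smoothness, using the first three rows of the trajectory above:

import numpy as np

traj = np.array([
    [+0.034, -0.287, +0.741, +1.182, -0.073, +0.412, 0.04],
    [+0.041, -0.301, +0.758, +1.197, -0.079, +0.421, 0.05],
    [+0.049, -0.314, +0.772, +1.209, -0.084, +0.429, 0.07],
])
step_deltas = np.abs(np.diff(traj, axis=0))   # change per joint between consecutive timesteps
print(step_deltas.max())                      # ≈ 0.02: no joint jumps abruptly between steps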
Why diffusion is the most expressive: it can represent multimodal action distributions. If demonstrations show two valid grasps, diffusion captures both modes and samples one at inference time. An MLP would average them into an invalid middle. The cost is K forward passes through the network instead of one — typically 10 to 100 passes — making it slower than an MLP or autoregressive head.
04 · plan + classical control
Planner and MPC head
VoxPoser, LMM Planner Integration, FLaRe — decision-making in dynamic tasks
The idea
The neural network does not output joint angles at all. It outputs a geometric plan — a cost map, a subgoal position, or a constraint set. A separate classical Model Predictive Control (MPC) algorithm then solves an optimisation problem at each timestep to translate that plan into actual motor commands. The neural network handles perception and reasoning; the MPC handles real-time motion generation.
What the robot receives
Motor commands from the MPC solver, not from the network. The MPC re-plans continuously, so the robot reacts to disturbances within its planning horizon. The network's output — the plan — is human-readable: you can inspect exactly what the model decided before the robot moves.
# The network generates a 3D value map over the robot's workspace
# Shape: [64, 64, 64] voxels — one score per point in space
# High value = good place to move the end-effector toward
# Low value  = region to avoid
value_map[x=32, y=41, z=28] = +0.94   ← directly above the mug handle: go here
value_map[x=32, y=41, z=20] = +0.71   ← approach path, good
value_map[x=10, y=10, z=05] = −0.82   ← near table edge: avoid
value_map[x=00, y=00, z=00] = −1.00   ← collision zone
# This is the network's entire output. No joint angles yet.
network output: 3D value map [64×64×64] (human-readable)
        ↓
MPC optimiser: finds the trajectory maximising value, re-runs each step
        ↓
robot controller: joint torques at control rate (the actual motor commands)
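A heavily simplified, sampling-based sketch of the MPC stage. It assumes the [64×64×64] value map above, treats the end-effector position as a voxel index, and uses a made-up random-shooting candidate generator; real MPC solvers are considerably more sophisticated.

import numpy as np

value_map = np.random.rand(64, 64, 64)        # stand-in for the network's output

def score(path, value_map):
    """Sum the value-map scores along a candidate end-effector path (voxel indices)."""
    return sum(value_map[x, y, z] for x, y, z in path)

def plan_step(current_voxel, value_map, n_candidates=128, horizon=8):
    """One MPC iteration: sample candidate paths, keep the best, return only its first move."""
    best_path, best_score = None, -np.inf
    for _ in range(n_candidates):
        moves = np.random.randint(-1, 2, size=(horizon, 3))        # random local steps
        path = np.clip(current_voxel + np.cumsum(moves, axis=0), 0, 63)
        s = score(path, value_map)
        if s > best_score:
            best_path, best_score = path, s
    return best_path[0]

next_voxel = plan_step(np.array([32, 41, 10]), value_map)
# Only the first step is executed; the whole optimisation re-runs at the next control tick,
# which is what lets this decoder family react to disturbances.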
How it differs from the other three: in MLP, autoregressive, and diffusion, the neural network is what ultimately causes the robot to move — joint angles come directly from the network output. Here the network produces a plan and a classical algorithm does the motion generation. The plan is inspectable and safety constraints can be enforced in the MPC layer. The downside: if the plan is slightly wrong, the MPC cannot recover — it only sees the value map, not the intent behind it.