VLA internals · lecture material

Action decoders —
what does each one output?

The four decoder families from the paper's taxonomy. All four receive the same input: a fused vision-language embedding representing what the robot sees and what it has been asked to do. What they do with it differs entirely.

01 · direct prediction
MLP and token predictor
OpenVLA, TraceVLA, RoboMamba — efficient low-level control
The idea
The vision-language embedding passes through one or a few linear layers and directly produces an action vector. One observation in, one action out. No iteration, no sampling. It is the simplest possible decoder — whatever the network learned during training is frozen into the weights, and inference is just a few matrix multiplications.
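A minimal sketch of such a head in PyTorch; the layer sizes and names here are illustrative assumptions, not taken from OpenVLA or any other specific model:

import torch
import torch.nn as nn

EMBED_DIM = 1024    # assumed size of the fused vision-language embedding
ACTION_DIM = 7      # 6 joints + gripper

class MLPActionHead(nn.Module):
    """Two linear layers mapping the fused embedding to a single action vector."""
    def __init__(self, embed_dim=EMBED_DIM, action_dim=ACTION_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, embedding):
        # embedding: [batch, embed_dim] → action: [batch, action_dim]
        return self.net(embedding)

head = MLPActionHead()
embedding = torch.randn(1, EMBED_DIM)   # stand-in for the real fused embedding
action = head(embedding)                # shape [1, 7]: one forward pass, one action

The whole decoder is this single forward pass; it is queried again from scratch at every control timestep.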
What the robot receives
A single action vector — typically 7 numbers for a standard arm: 6 joint angles in radians and a gripper value between 0 (open) and 1 (closed). This gets sent to the robot controller immediately. The model is queried again at the next control timestep.
# Vision + language embedding → two linear layers → action vector
# Output shape: [7] — one action for this single timestep
action = mlp_head(embedding)
# action = [+0.03, −0.31, +0.74, +1.18, −0.07, +0.41, 0.04]
#             j1     j2     j3     j4     j5     j6   gripper
#          (radians)                                  0=open, 1=closed
The core limitation: the MLP always produces one answer — a single point in action space. If training demonstrations contain two valid ways to grasp something (left side, right side), the MLP learns to predict somewhere in between. That in-between position may not be a valid grasp at all. This is the averaging problem that motivates diffusion-based decoders.
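A toy illustration of that averaging effect; the two grasp vectors below are made up for this example:

import numpy as np

# Two equally valid demonstrated grasps, approaching from the left and the right
# (hypothetical values; only joint 1 differs between the two modes):
grasp_left  = np.array([+0.40, -0.30, +0.70, +1.20, -0.10, +0.40, 0.0])
grasp_right = np.array([-0.40, -0.30, +0.70, +1.20, -0.10, +0.40, 0.0])

# A regression loss pulls a deterministic MLP toward the mean of its targets:
mlp_prediction = (grasp_left + grasp_right) / 2
# j1 ends up at 0.0: aimed straight between the two grasps, a pose no demonstration contained.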
02 · token-by-token generation
Autoregressive transformer
Gato, RT-2 — generates action sequences step by step
The idea
Actions are turned into discrete tokens — the same mechanism the model uses to generate words. Each joint dimension is divided into bins (e.g. 256 bins covering the joint's full range). The model generates one token at a time, feeding each result back in as context before generating the next. No new architecture is needed: the LLM's existing next-token machinery handles everything.
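A sketch of the binning step under the assumptions used in this example: 256 bins over a symmetric −π to +π joint range.

import numpy as np

NUM_BINS = 256
JOINT_MIN, JOINT_MAX = -np.pi, np.pi

def action_to_tokens(joint_values):
    """Map continuous joint values onto discrete bin indices 0..255."""
    normalised = (joint_values - JOINT_MIN) / (JOINT_MAX - JOINT_MIN)   # → 0..1
    bins = np.round(normalised * (NUM_BINS - 1))
    return np.clip(bins, 0, NUM_BINS - 1).astype(int)

action_to_tokens(np.array([+0.36, -0.90, +1.24]))   # → [142, 91, 178]
# These indices become <act_142> <act_091> <act_178> ..., generated one token at a time.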
What the robot receives
A sequence of bin indices, decoded back into continuous joint values. Because bins have finite width, there is always a small quantisation error — you cannot command finer than one bin width. With 256 bins over a ±π range, one bin is about 1.4° of joint rotation.
# The model generates this token sequence left to right, one token at a time,
# language and action tokens in the same stream:
"pick" "up" "the" "cup"  <act_142> <act_091> <act_178> <act_203> <act_117> <act_088> <act_012>
#  (language tokens)         j1        j2        j3        j4        j5        j6     gripper
#                                             (action tokens)
"pick""up""the""cup" <act_142> <act_091> <act_178> <act_203> <act_117> <act_088> <act_012>
← language tokens                ← action tokens (hover to see which joint)
# formula: angle = (bin_index / 255) × joint_range + joint_min
# example joint range: −π to +π rad (joint_range = 6.28, joint_min = −3.14)
joint_1 = +0.36 rad   ← bin 142 / 255 × 6.28 − 3.14
joint_2 = −0.90 rad   ← bin 91
joint_3 = +1.24 rad   ← bin 178
joint_4 = +1.86 rad   ← bin 203
joint_5 = −0.26 rad   ← bin 117
joint_6 = −0.97 rad   ← bin 88
gripper = 0.05        ← bin 12 / 255 → nearly open
# Resolution: 256 bins over ±π ≈ 0.025 rad per step ≈ 1.4° per step
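The same decoding written as a small runnable function; the only added assumption is that the gripper is binned over a 0 to 1 range rather than ±π:

import numpy as np

def token_to_value(bin_index, low, high, num_bins=256):
    """Invert the binning: discrete bin index → continuous value."""
    return bin_index / (num_bins - 1) * (high - low) + low

joint_1 = token_to_value(142, -np.pi, np.pi)    # ≈ +0.36 rad
gripper = token_to_value(12, 0.0, 1.0)          # ≈ 0.05, nearly open
resolution = 2 * np.pi / 256                    # ≈ 0.025 rad ≈ 1.4° per bin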
Why this approach exists: the LLM already knows how to generate tokens — language, code, and now actions all use the same mechanism. No new decoder architecture is needed. The trade-off is quantisation: bins are discrete, so fine-grained precision tasks (sub-millimetre insertion) are out of reach without dramatically increasing the number of bins, which inflates the vocabulary and slows training.
03 · iterative denoising
Diffusion policy head
Octo, HybridVLA, DexGraspVLA — fine-grained, temporally smooth control
The idea
Instead of predicting an action directly, the model learns to remove noise. At inference you start with a trajectory of pure random noise and run K denoising steps. Each step is guided by the vision-language embedding, nudging the noisy trajectory a little closer to something physically meaningful. After K steps the noise has resolved into a clean, smooth action sequence.
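Schematically, the inference loop looks like the sketch below. The denoising network is a placeholder and the update rule is deliberately simplified; a real DDPM/DDIM-style scheduler rescales the noise at every step.

import torch

T, ACTION_DIM, K = 16, 7, 50            # horizon, action size, denoising steps (illustrative)
embedding = torch.randn(1024)           # stand-in for the fused vision-language embedding

def predict_noise(trajectory, step, embedding):
    # Placeholder for the trained denoising network: a real model predicts
    # the noise still present in `trajectory` at this step, given the embedding.
    return torch.zeros_like(trajectory)

trajectory = torch.randn(T, ACTION_DIM)          # start from pure Gaussian noise
for step in reversed(range(K)):                  # K guided refinement passes
    noise = predict_noise(trajectory, step, embedding)
    trajectory = trajectory - noise / K          # simplified update; real schedulers rescale per step
# After K steps the noise has resolved into a smooth [T × 7] action sequence.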
What the robot receives
A full trajectory matrix — T timesteps × 7 joint values. All values are continuous floats. The iterative denoising process naturally produces smooth transitions between timesteps, because each step refines the whole trajectory at once rather than predicting each timestep independently. This smoothness matters in contact-rich tasks where sudden jumps cause slipping or damage.
# Shape: [T timesteps × 7 joints] — continuous floats, no quantisation
trajectory[t=0]  = [+0.034, −0.287, +0.741, +1.182, −0.073, +0.412, 0.04]
trajectory[t=1]  = [+0.041, −0.301, +0.758, +1.197, −0.079, +0.421, 0.05]
trajectory[t=2]  = [+0.049, −0.314, +0.772, +1.209, −0.084, +0.429, 0.07]
# ... values change smoothly across all timesteps ...
trajectory[t=15] = [+0.198, −0.447, +0.921, +1.388, −0.142, +0.541, 0.91]
# Compare to MLP: the MLP outputs one row. Diffusion outputs the whole table.
# Compare to autoregressive: no bins, no quantisation loss.
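A quick numeric check of that smoothness, using the first three rows of the trajectory above:

import numpy as np

traj = np.array([
    [+0.034, -0.287, +0.741, +1.182, -0.073, +0.412, 0.04],
    [+0.041, -0.301, +0.758, +1.197, -0.079, +0.421, 0.05],
    [+0.049, -0.314, +0.772, +1.209, -0.084, +0.429, 0.07],
])
step_deltas = np.abs(np.diff(traj, axis=0))   # change per joint between consecutive timesteps
print(step_deltas.max())                      # ≈ 0.02: no joint jumps abruptly between steps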
Why diffusion is the most expressive: it can represent multimodal action distributions. If demonstrations show two valid grasps, diffusion captures both modes and samples one at inference time. An MLP would average them into an invalid middle. The cost is K forward passes through the network instead of one — typically 10 to 100 passes — making it slower than an MLP or autoregressive head.
04 · plan + classical control
Planner and MPC head
VoxPoser, LMM Planner Integration, FLaRe — decision-making in dynamic tasks
The idea
The neural network does not output joint angles at all. It outputs a geometric plan — a cost map, a subgoal position, or a constraint set. A separate classical Model Predictive Control (MPC) algorithm then solves an optimisation problem at each timestep to translate that plan into actual motor commands. The neural network handles perception and reasoning; the MPC handles real-time motion generation.
What the robot receives
Motor commands from the MPC solver, not from the network. The MPC re-plans continuously, so the robot reacts to disturbances within its planning horizon. The network's output — the plan — is human-readable: you can inspect exactly what the model decided before the robot moves.
# The network generates a 3D value map over the robot's workspace
# Shape: [64, 64, 64] voxels — one score per point in space
# High value = good place to move the end-effector toward
# Low value  = region to avoid
value_map[x=32, y=41, z=28] = +0.94   ← directly above the mug handle: go here
value_map[x=32, y=41, z=20] = +0.71   ← approach path, good
value_map[x=10, y=10, z=05] = −0.82   ← near table edge: avoid
value_map[x=00, y=00, z=00] = −1.00   ← collision zone
# This is the network's entire output. No joint angles yet.
network output: 3D value map [64×64×64] (human-readable)
        ↓
MPC optimiser: finds the trajectory maximising value, re-runs each step
        ↓
robot controller: joint torques at control rate (the actual motor commands)
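A heavily simplified, sampling-based sketch of the MPC stage. It assumes the [64×64×64] value map above, treats the end-effector position as a voxel index, and uses a made-up random-shooting candidate generator; real MPC solvers are considerably more sophisticated.

import numpy as np

value_map = np.random.rand(64, 64, 64)        # stand-in for the network's output

def score(path, value_map):
    """Sum the value-map scores along a candidate end-effector path (voxel indices)."""
    return sum(value_map[x, y, z] for x, y, z in path)

def plan_step(current_voxel, value_map, n_candidates=128, horizon=8):
    """One MPC iteration: sample candidate paths, keep the best, return only its first move."""
    best_path, best_score = None, -np.inf
    for _ in range(n_candidates):
        moves = np.random.randint(-1, 2, size=(horizon, 3))        # random local steps
        path = np.clip(current_voxel + np.cumsum(moves, axis=0), 0, 63)
        s = score(path, value_map)
        if s > best_score:
            best_path, best_score = path, s
    return best_path[0]

next_voxel = plan_step(np.array([32, 41, 10]), value_map)
# Only the first step is executed; the whole optimisation re-runs at the next control tick,
# which is what lets this decoder family react to disturbances.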
How it differs from the other three: in MLP, autoregressive, and diffusion, the neural network is what ultimately causes the robot to move — joint angles come directly from the network output. Here the network produces a plan and a classical algorithm does the motion generation. The plan is inspectable and safety constraints can be enforced in the MPC layer. The downside: if the plan is slightly wrong, the MPC cannot recover — it only sees the value map, not the intent behind it.