The four decoder families from the paper's taxonomy. All four receive the same input: a fused vision-language embedding representing what the robot sees and what it has been asked to do. What they do with it differs entirely.
01 · direct prediction
MLP and token predictor
OpenVLA, TraceVLA, RoboMamba — efficient low-level control
The idea
The vision-language embedding passes through one or a few linear layers and directly produces an action vector. One observation in, one action out. No iteration, no sampling. It is the simplest possible decoder — whatever the network learned during training is frozen into the weights, and inference is just a few matrix multiplications.
What the robot receives
A single action vector — typically 7 numbers for a standard arm: 6 joint angles in radians and a gripper value between 0 (open) and 1 (closed). This gets sent to the robot controller immediately. The model is queried again at the next control timestep.
Literal output — one forward pass, pick-and-place task
# Vision + language embedding → two linear layers → action vector
# Output shape: [7] — one action for this single timestep
action = mlp_head(embedding)

action = [+0.03, −0.31, +0.74, +1.18, −0.07, +0.41, 0.04]
#           j1     j2     j3     j4     j5     j6   gripper
# joint values in radians; gripper: 0 = open, 1 = closed
Joint values this timestep
The core limitation: the MLP always produces one answer — a single point in action space. If training demonstrations contain two valid ways to grasp something (left side, right side), the MLP learns to predict somewhere in between. That in-between position may not be a valid grasp at all. This is the averaging problem that motivates diffusion-based decoders.
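The averaging problem is easy to demonstrate numerically: a point predictor trained with mean-squared error converges to the mean of its targets, so two distinct valid grasps collapse into one in-between pose. A minimal sketch (the grasp values are hypothetical, chosen so the two modes mirror each other):

```python
import numpy as np

# Two valid demonstrated grasps for the same scene (hypothetical values):
# approach from the left, or approach from the right.
left_grasp  = np.array([+0.03, -0.31, +0.74, +1.18, -0.07, +0.41, 0.04])
right_grasp = np.array([-0.03, -0.31, +0.74, +1.18, +0.07, -0.41, 0.04])

# The MSE-optimal point prediction is the mean of the targets,
# which is what an MLP head trained on both demos converges to.
mse_optimal = (left_grasp + right_grasp) / 2

# j1, j5 and j6 average to 0.0: a pose between the two grasps,
# which may correspond to no valid grasp at all.
```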
02 · token-by-token generation
Autoregressive transformer
Gato, RT-2 — generates action sequences step by step
The idea
Actions are turned into discrete tokens — the same mechanism the model uses to generate words. Each joint dimension is divided into bins (e.g. 256 bins covering the joint's full range). The model generates one token at a time, feeding each result back in as context before generating the next. No new architecture is needed: the LLM's existing next-token machinery handles everything.
What the robot receives
A sequence of bin indices, decoded back into continuous joint values. Because bins have finite width, there is always a small quantisation error — you cannot command finer than one bin width. With 256 bins over a ±π range, one bin is about 1.4° of joint rotation.
Literal token output — pick up the cup (RT-2 style)
# The model generates this token sequence left to right,
# one token at a time, language and action tokens in the same stream:
"pick" "up" "the" "cup" <act_142> <act_091> <act_178> <act_203> <act_117> <act_088> <act_012>
# j1 j2 j3 j4 j5 j6 gripper
Decoded back to joint angles
# formula: angle = (bin_index / 255) × joint_range + joint_min
# example joint range: −π to +π rad
joint_1 = +0.36 rad   ← bin 142 / 255 × 6.28 − 3.14
joint_2 = −0.90 rad   ← bin 91
joint_3 = +1.24 rad   ← bin 178
joint_4 = +1.86 rad   ← bin 203
joint_5 = −0.26 rad   ← bin 117
joint_6 = −0.97 rad   ← bin 88
gripper = 0.05        ← bin 12 / 255 → nearly open
# Resolution: 256 bins over ±π ≈ 0.025 rad per step ≈ 1.4° per step
Why this approach exists: the LLM already knows how to generate tokens — language, code, and now actions all use the same mechanism. No new decoder architecture is needed. The trade-off is quantisation: bins are discrete, so fine-grained precision tasks (sub-millimetre insertion) are out of reach without dramatically increasing the number of bins, which inflates the vocabulary and slows training.
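The bin round trip above can be sketched in a few lines. This is an illustrative scheme under the formula shown (uniform bins over ±π, index scaled by 255), not the exact RT-2 tokeniser; the quantisation error is bounded by about half a bin width:

```python
import numpy as np

N_BINS = 256
JOINT_MIN, JOINT_MAX = -np.pi, np.pi  # assumed joint range, as in the example

def discretise(angle):
    # continuous angle → bin index in [0, 255]
    frac = (angle - JOINT_MIN) / (JOINT_MAX - JOINT_MIN)
    return int(np.clip(round(frac * (N_BINS - 1)), 0, N_BINS - 1))

def decode(bin_index):
    # bin index → continuous angle (the value the robot is commanded)
    return bin_index / (N_BINS - 1) * (JOINT_MAX - JOINT_MIN) + JOINT_MIN

angle = 0.36                     # rad, desired joint angle
token = discretise(angle)        # bin index the model would emit
recovered = decode(token)        # what the controller actually receives
error = abs(recovered - angle)   # quantisation error: at most ~half a bin
```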
03 · iterative denoising
Diffusion policy head
Octo, HybridVLA, DexGraspVLA — fine-grained, temporally smooth control
The idea
Instead of predicting an action directly, the model learns to remove noise. At inference you start with a trajectory of pure random noise and run K denoising steps. Each step is guided by the vision-language embedding, nudging the noisy trajectory a little closer to something physically meaningful. After K steps the noise has resolved into a clean, smooth action sequence.
What the robot receives
A full trajectory matrix — T timesteps × 7 joint values. All values are continuous floats. The iterative denoising process naturally produces smooth transitions between timesteps, because each step refines the whole trajectory at once rather than predicting each timestep independently. This smoothness matters in contact-rich tasks where sudden jumps cause slipping or damage.
Final denoised output — sent to robot controller
# Shape: [T timesteps × 7 joints] — continuous floats, no quantisation
trajectory[t=0] = [+0.034, −0.287, +0.741, +1.182, −0.073, +0.412, 0.04]
trajectory[t=1] = [+0.041, −0.301, +0.758, +1.197, −0.079, +0.421, 0.05]
trajectory[t=2] = [+0.049, −0.314, +0.772, +1.209, −0.084, +0.429, 0.07]
# ... values change smoothly across all timesteps ...
trajectory[t=15] = [+0.198, −0.447, +0.921, +1.388, −0.142, +0.541, 0.91]
# Compare to MLP: MLP outputs one row. Diffusion outputs the whole table.
# Compare to autoregressive: no bins, no quantisation loss.
Why diffusion is the most expressive: it can represent multimodal action distributions. If demonstrations show two valid grasps, diffusion captures both modes and samples one at inference. MLP would average them into an invalid middle. The cost is K forward passes through the network instead of one — typically 10 to 100 passes — making it slower than MLP or autoregressive heads.
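The inference loop has a simple shape regardless of the scheduler details. In this sketch, `noise_pred_net` is a placeholder for the trained conditional noise-prediction network, and the update rule is deliberately simplified relative to a real DDPM/DDIM scheduler; what matters is that all K steps refine the entire [T × 7] trajectory jointly:

```python
import numpy as np

T, DOF, K = 16, 7, 10             # trajectory length, joints, denoising steps
rng = np.random.default_rng(0)

def noise_pred_net(trajectory, embedding, step):
    # Placeholder: a real network predicts the noise component of
    # `trajectory`, conditioned on the fused vision-language embedding.
    # Here we pretend the clean target equals the conditioning signal.
    return trajectory - embedding

embedding = np.zeros((T, DOF))            # stand-in for the VL conditioning
trajectory = rng.standard_normal((T, DOF))  # start from pure random noise

for k in range(K):
    predicted_noise = noise_pred_net(trajectory, embedding, k)
    trajectory = trajectory - 0.5 * predicted_noise  # partial denoising step

# After K steps the noise has contracted toward a clean trajectory;
# every timestep was refined together, which is where the temporal
# smoothness of diffusion heads comes from.
```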
04 · plan + classical control
Planner and MPC head
VoxPoser, LMM Planner Integration, FLaRe — decision-making in dynamic tasks
The idea
The neural network does not output joint angles at all. It outputs a geometric plan — a cost map, a subgoal position, or a constraint set. A separate classical Model Predictive Control algorithm then solves an optimisation problem at each timestep to translate that plan into actual motor commands. Neural network handles perception and reasoning; MPC handles real-time motion generation.
What the robot receives
Motor commands from the MPC solver, not from the network. The MPC re-plans continuously, so the robot reacts to disturbances within its planning horizon. The network's output — the plan — is human-readable: you can inspect exactly what the model decided before the robot moves.
VoxPoser — what the network outputs (not joint angles)
# The network generates a 3D value map over the robot's workspace
# Shape: [64, 64, 64] voxels — one score per point in space
# High value = good place to move end-effector toward
# Low value = region to avoid
value_map[x=32, y=41, z=28] = +0.94   ← directly above the mug handle: go here
value_map[x=32, y=41, z=20] = +0.71   ← approach path, good
value_map[x=10, y=10, z=05] = −0.82   ← near table edge: avoid
value_map[x=00, y=00, z=00] = −1.00   ← collision zone
# This is the network's entire output. No joint angles yet.
Then the MPC converts the plan to motor commands
network output
  3D value map [64×64×64], human-readable
    ↓
MPC optimiser
  finds the trajectory that maximises value; re-runs each step
    ↓
robot controller
  joint torques at control rate: the actual motor commands
How it differs from the other three: in MLP, autoregressive, and diffusion, the neural network is what ultimately causes the robot to move — joint angles come directly from the network output. Here the network produces a plan and a classical algorithm does the motion generation. The plan is inspectable and safety constraints can be enforced in the MPC layer. The downside: if the plan is slightly wrong, the MPC cannot recover — it only sees the value map, not the intent behind it.
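The handoff can be sketched end to end: read the network's value map, pick the highest-value voxel as the target, and let a re-planning controller chase it. The grid size matches the example above, but the workspace bounds and the one-line `mpc_step` are illustrative stand-ins, not VoxPoser's actual optimiser:

```python
import numpy as np

GRID = 64
# Assumed workspace bounds in metres (illustrative)
WS_MIN = np.array([-0.5, -0.5, 0.0])
WS_MAX = np.array([+0.5, +0.5, 1.0])

# A toy value map: everywhere bad except one good voxel
value_map = np.full((GRID, GRID, GRID), -1.0)
value_map[32, 41, 28] = 0.94          # e.g. directly above the mug handle

# Network output → target position. The plan is fully inspectable:
# you can print the chosen voxel before the robot moves.
best_voxel = np.unravel_index(np.argmax(value_map), value_map.shape)
cell = (WS_MAX - WS_MIN) / GRID
target_xyz = WS_MIN + (np.array(best_voxel) + 0.5) * cell

def mpc_step(current_xyz, target_xyz, gain=0.2):
    # Stand-in for the MPC solve: take a small step toward the target,
    # re-planned every control tick so disturbances are absorbed.
    return current_xyz + gain * (target_xyz - current_xyz)

ee = np.array([0.0, 0.0, 0.5])        # current end-effector position
for _ in range(20):
    ee = mpc_step(ee, target_xyz)     # converges toward the high-value voxel
```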