Towards Bridging the Gap between Large-Scale Pretraining and Efficient Finetuning for Humanoid Control

Code

Abstract

Reinforcement learning (RL) is widely used for humanoid control, with on-policy methods such as Proximal Policy Optimization (PPO) enabling robust training via large-scale parallel simulation and, in some cases, zero-shot deployment to real robots. However, the low sample efficiency of on-policy algorithms limits safe adaptation to new environments. Although off-policy RL and model-based RL have shown improved sample efficiency, a gap remains between large-scale pretraining and efficient finetuning on humanoids. In this paper, we find that off-policy Soft Actor-Critic (SAC), with large-batch updates and a high Update-To-Data (UTD) ratio, reliably supports large-scale pretraining of humanoid locomotion policies and achieves zero-shot deployment to real robots. For adaptation, we demonstrate in simulation that these SAC-pretrained policies can be finetuned in new environments and on out-of-distribution tasks using model-based methods. Data collection in the new environment executes a deterministic policy, while stochastic exploration is confined to a physics-informed world model. This separation mitigates the risks of random exploration during adaptation while preserving the exploratory coverage needed for improvement. Overall, the approach couples the wall-clock efficiency of large-scale simulation during pretraining with the sample efficiency of model-based learning during finetuning.
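To make the pretraining recipe concrete, here is a minimal sketch of the large-batch, high-UTD SAC loop in JAX. The constants and the helpers `policy_sample`, `env_step`, `buffer_add`, `buffer_sample`, and `sac_update` are illustrative placeholders, not the released implementation.

```python
import jax

# Illustrative constants; the exact values used for pretraining are not shown here.
NUM_ENVS = 4096     # parallel simulator instances assumed inside env_step
BATCH_SIZE = 32768  # minibatch size per gradient step (large-batch update)
UTD_RATIO = 8       # gradient steps per batch of environment transitions (high UTD)

def train_iteration(train_state, env_state, buffer, rng):
    """One data-collection step across all parallel envs, followed by
    UTD_RATIO SAC gradient updates on large replay minibatches."""
    rng, act_rng, upd_rng = jax.random.split(rng, 3)

    # 1) Collect one transition from every parallel environment.
    actions = policy_sample(train_state.actor_params, env_state.obs, act_rng)
    env_state, transitions = env_step(env_state, actions)
    buffer = buffer_add(buffer, transitions)

    # 2) Run several large-batch SAC updates on replayed data (high UTD).
    def one_update(ts, step_rng):
        batch = buffer_sample(buffer, step_rng, BATCH_SIZE)
        return sac_update(ts, batch), None  # critic, actor, and temperature losses
    train_state, _ = jax.lax.scan(
        one_update, train_state, jax.random.split(upd_rng, UTD_RATIO))

    return train_state, env_state, buffer, rng
```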

Framework

Large-scale pretraIning and efficient FineTuning (LIFT) framework. In stage (i), we implement SAC in JAX to support large-batch updates and a high UTD ratio, achieving fast, robust convergence in massively parallel simulation and zero-shot deployment to a real humanoid in outdoor experiments. In stage (ii), we pretrain a physics-informed world model on the SAC data, combining Lagrangian dynamics with a residual predictor to capture contact forces and other unmodeled effects. In stage (iii), we finetune both the policy and the world model in new environments while executing only deterministic actions in the environment. Stochastic exploration is confined to rollouts within the world model. This framework enhances both the safety and efficiency of finetuning.
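As a sketch of the stage (ii) world model, the prediction step below combines analytic Lagrangian dynamics with a learned residual on joint accelerations. The helpers `mass_matrix`, `bias_forces`, and `residual_net`, as well as the integrator and timestep, are assumptions for illustration.

```python
import jax.numpy as jnp

def world_model_step(q, qd, tau, phys_params, residual_net, dt=0.02):
    """Predict the next joint state from a physics-informed model:
    analytic Lagrangian dynamics M(q) * qdd + c(q, qd) = tau, plus a
    learned residual that absorbs contact forces and other unmodeled effects."""
    M = mass_matrix(q, phys_params)      # joint-space inertia matrix (placeholder)
    c = bias_forces(q, qd, phys_params)  # Coriolis, centrifugal, gravity (placeholder)
    qdd = jnp.linalg.solve(M, tau - c)                        # accelerations from the analytic model
    qdd = qdd + residual_net(jnp.concatenate([q, qd, tau]))   # learned correction
    # Semi-implicit Euler integration over one control step.
    qd_next = qd + dt * qdd
    q_next = q + dt * qd_next
    return q_next, qd_next
```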

Pretraining Results: MuJoCo Playground

LIFT achieves comparable or higher evaluation returns than PPO and FastTD3 while stabilizing at its peak return faster in the rough-terrain environments. On flat terrain, it achieves comparable peak performance with similar wall-clock runtime.

Finetuning Result: Brax

LIFT consistently converges across all tasks, achieving stable walking that closely tracks the desired forward speed in the Booster T1 finetuning experiments. After finetuning, the policy shows significantly reduced body oscillations and smaller deviation from the commanded direction, with noticeably improved velocity tracking.

Pretraining Results: MuJoCo Playground

T1LowDimJoystickFlatTerrain
T1JoystickFlatTerrain
G1JoystickFlatTerrain
T1LowDimJoystickRoughTerrain
T1JoystickRoughTerrain
G1JoystickRoughTerrain

As shown in the video, LIFT solves these real-world humanoid robot tasks within one hour of training on a single NVIDIA RTX 4090, for both leg-only and full-body control settings. The T1LowDim environments involve only the 12 leg degrees of freedom; notably, a policy pretrained in T1LowDimJoystickRoughTerrain can be deployed zero-shot to the real robot (see the accompanying video). In contrast, the T1Joystick and G1Joystick tasks include additional degrees of freedom for the arms and waist. The video demonstrates that the learned policies coordinate these joints to maintain balance: although the reward does not constrain arm motion, efficient locomotion behaviors emerge, such as swinging the arms for balance. Moreover, the pretrained policies provide strong initializations for finetuning with LIFT in new environments.
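For reference, a task like the ones listed above can be loaded and stepped as follows, assuming the standard MuJoCo Playground registry API; the random actions merely stand in for a trained LIFT policy.

```python
import jax
from mujoco_playground import registry

# Load one of the pretraining tasks and roll it out with random actions;
# a trained policy network would replace jax.random.uniform here.
env = registry.load("T1JoystickFlatTerrain")
reset_fn, step_fn = jax.jit(env.reset), jax.jit(env.step)

rng = jax.random.PRNGKey(0)
state = reset_fn(rng)
for _ in range(100):
    rng, key = jax.random.split(rng)
    action = jax.random.uniform(key, (env.action_size,), minval=-1.0, maxval=1.0)
    state = step_fn(state, action)
```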

Finetuning Result: T1 on Brax

Sim-to-sim transfer and fine-tuning results for the Booster T1 in Brax with a target velocity of 1.5 m/s. Top: Policy before fine-tuning; the policy cannot achieve stable walking. Bottom: Policy after fine-tuning, achieving stable locomotion at 1.5 m/s.

Finetuning Result: G1 on Brax

Sim-to-sim transfer and fine-tuning results for the Unitree G1 in Brax with a target velocity of 1.5 m/s. Top: Policy before fine-tuning, exhibiting instability and a shuffling gait. Bottom: Policy after fine-tuning, demonstrating a stable, human-like walking gait with reduced torso pitch.

Sim-to-real Transfer with LIFT Pretrained Policy

Scene 1: Walk forwards and backwards indoor
Scene 2: Lateral movement
Scene 3: Walk on concrete blocks
Scene 4: Walk on grass
Scene 5: Walk on muddy ground
Scene 6: Walk over stage
Scene 7: Spin
Scene 8: Walk uphill
Scene 9: Walk downhill

To demonstrate the potential for zero-shot deployment, we deploy the LIFT-pretrained policy for the Booster T1 directly on the physical robot. Zero-shot transfer to previously unseen surfaces (e.g., grass, uphill, downhill, and mud) is shown in the figure. This provides practical evidence that large-scale parallel SAC pretraining can yield deployable humanoid controllers and establishes a suitable starting point for subsequent fine-tuning.

Interactive Demo of Brax Finetune

(Drag with your mouse and click the robot to fix the view)

We transfer the pretrained policy to Brax. During pretraining, target linear velocities along the x-axis were uniformly sampled in [-1,1] m/s. For fine-tuning in Brax, we specify new forward-velocity targets. As shown in the demo, LIFT finetuning improves x-velocity tracking accuracy and reduces lateral (y-axis) drift.
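For concreteness, the command setup described above might be expressed as follows; the `[vx, vy, yaw_rate]` layout and the function names are assumptions for illustration.

```python
import jax
import jax.numpy as jnp

def pretrain_command(rng):
    # Pretraining: sample the x-velocity target uniformly in [-1, 1] m/s.
    vx = jax.random.uniform(rng, (), minval=-1.0, maxval=1.0)
    return jnp.array([vx, 0.0, 0.0])  # assumed [vx, vy, yaw_rate] layout

def finetune_command(target_vx=1.5):
    # Finetuning in Brax: pin the forward-velocity target to a new value,
    # e.g. 1.5 m/s, which lies outside the pretraining range.
    return jnp.array([target_vx, 0.0, 0.0])
```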

T1 1.5m/s (Out-of-Distribution)
Before finetuning, the policy is initially unstable.
T1 1.5m/s (Out-of-Distribution)
After finetuning, the policy achieves stable locomotion at 1.5 m/s.
T1 1.2m/s (Out-of-Distribution)
Before finetuning, the policy exhibits some instability during the initial walking phase.
T1 1.2m/s (Out-of-Distribution)
After finetuning, initial stepping stability improves and the robot attains higher speed.
T1 1.0m/s (Long Tail)
Before finetuning, the policy nearly reaches 0.9 m/s, but it drifts in the y-direction.
T1 1.0m/s (Long Tail)
After finetuning, lateral deviation is notably reduced and the robot nearly reaches 1.0 m/s, yielding straighter walking.
G1 1.5m/s (Out-of-Distribution)
Before finetuning, the policy walks stably at about 1.0 m/s but cannot reach higher speeds.
G1 1.5m/s (Out-of-Distribution)
After finetuning, the policy nearly reaches 1.5 m/s.
G1 1.2m/s (Out-of-Distribution)
Before finetuning, the policy nearly reaches 1.0 m/s.
G1 1.2m/s (Out-of-Distribution)
After finetuning, the policy nearly reaches 1.2 m/s.
G1 1.0m/s (Long Tail)
Before finetuning, the policy nearly reaches 0.9 m/s.
G1 1.0m/s (Long Tail)
After finetuning, the policy nearly reaches 1.0 m/s.

LIFT policies trained with whole body tracking pipeline on the LAFAN1 dataset

fight1_subject2(LAFAN1)
jumps1_subject2(LAFAN1)
run1_subject2(LAFAN1)
walk1_subject1(LAFAN1)

We reimplemented the observation and reward structure of BeyondMimic in JAX within MuJoCo Playground and used the Unitree-retargeted LAFAN1 motion dataset to pretrain a whole-body tracking policy for the Unitree G1 humanoid. We present these whole-body tracking results as preliminary evidence of LIFT's broader applicability.
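As a rough illustration of the kind of exponentiated tracking term such imitation rewards use (the actual reward follows BeyondMimic and combines several terms; the form and scale below are assumptions):

```python
import jax.numpy as jnp

def joint_tracking_reward(qpos, qpos_ref, sigma=0.5):
    """Reward the policy for matching the reference joint positions from the
    retargeted LAFAN1 clip at the current timestep. A full tracking reward
    would add analogous terms for joint velocities, body poses, and feet."""
    err = jnp.sum(jnp.square(qpos - qpos_ref))
    return jnp.exp(-err / sigma**2)
```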

Real-world Finetuning in Booster T1

Please watch the video on: YouTube

We conduct real-world fine-tuning experiments on the Booster T1 humanoid robot. We first pretrained a policy in the MuJoCo Playground T1LowDimJoystickFlatTerrain task, removing most energy-related constraint terms and keeping only the action-rate L2 penalty. Although this policy transfers well between simulators (MuJoCo → Brax), the reduced energy constraints and flat-terrain pretraining lead to failure in zero-shot sim-to-real transfer. We then used this policy as the initialization for real-world fine-tuning to evaluate the effectiveness of our method. As shown in the video, the policy begins with unstable behavior (0 s of data) and gradually improves as additional real-world experience is incorporated. After collecting 80–590 s of data, the robot exhibits an increasingly upright posture, smoother gait patterns, and more stable forward velocity. These results demonstrate that our fine-tuning framework can successfully adapt a weak sim-to-real policy and make it substantially more robust after only a few minutes of real-world data.
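The action-rate L2 penalty retained during this pretraining stage has the standard form sketched below; the weight is illustrative rather than the value used in our experiments.

```python
import jax.numpy as jnp

def action_rate_penalty(action, prev_action, weight=0.1):
    """Standard action-rate L2 penalty: discourages large changes between
    consecutive actions, which smooths the gait without constraining overall
    energy use. The weight here is illustrative."""
    return -weight * jnp.sum(jnp.square(action - prev_action))
```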