APEX

Action Priors Enable Efficient Exploration for Robust Motion Tracking on Legged Robots

1National University of Singapore, 2ETH Zurich, 3Korea University

Abstract

Learning natural, animal-like locomotion from demonstrations has become a core paradigm in legged robotics. Despite the recent advancements in motion tracking, most existing methods demand extensive tuning and rely on reference data during deployment, limiting adaptability. We present APEX (Action Priors enable Efficient Exploration), a plug-and-play extension to state-of-the-art motion tracking algorithms that eliminates any dependence on reference data during deployment, improves sample efficiency, and reduces parameter tuning effort. APEX integrates expert demonstrations directly into reinforcement learning (RL) by incorporating decaying action priors, which initially bias exploration toward expert demonstrations but gradually allow the policy to explore independently. This is combined with a multi-critic framework that balances task performance with motion style. Moreover, APEX enables a single policy to learn diverse motions and transfer reference-like styles across different terrains and velocities, while remaining robust to variations in reward design. We validate the effectiveness of our method through extensive experiments in both simulation and on a Unitree Go2 robot. By leveraging demonstrations to guide exploration during RL training, without imposing explicit bias toward them, APEX enables legged robots to learn with greater stability, efficiency, and generalization. We believe this approach paves the way for guidance-driven RL to boost natural skill acquisition in a wide array of robotic tasks, from locomotion to manipulation.

APEX Decaying Action Priors Metaphor

Illustration of APEX's decaying action priors: Like the braces that stabilize early motion before breaking away, the priors guide exploration at the start of training and fade to zero, enabling a pure RL policy that runs independently at deployment. Images inspired by Forrest Gump (1994).

Learning Diverse Animal-Like Skills

Multiple gaits with a single policy

Highest recorded speed: 3.39 m/s

Imitation Beyond Data

APEX's exploration mechanism enables policies to generalize to uneven terrains from kinematic flat-ground data without deviating from the intended gait styles.

Trot on Stairs

Canter on Slope (1)

Canter on Slope (2)

Robustness Tests on Rough Terrain

Trot from a very happy dog

Pace

Quantitative Results

Implementation Differences Across Policy Variants

✓ = required, ✗ = not required. APEX removes all runtime reference dependencies. Furthermore, APEX is intentionally simplified (no reference state initialization and fewer training iterations) to demonstrate the method's robustness.

Table: which components each policy variant requires. Rows: runtime ref. motions, phase variable, ref. state init., critic imitation data. Columns: DM-Full, APEX-Full, DM-NIA, APEX.

APEX vs DeepMimic (DM) Reward Comparison

General trend of rewards averaged across seeds for the trot motion. APEX-Full achieves faster convergence than DM-Full, while APEX outperforms DM-NIA without requiring imitation references during deployment. The same trend is observed across other motions.

Why have Multi-Critic?

Multi-Critic Variants RMSE Comparison

APEX's multi-critic variants sustain high tracking accuracy across all metrics and reward magnitudes, whereas single-critic variants degrade under strong velocity rewards. This demonstrates the robustness of the multi-critic learning approach in maintaining imitation quality while optimizing for task performance.
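One way to picture why the multi-critic setup is robust to reward magnitudes: each reward group (style, task) gets its own critic, and advantages are normalized per group before being combined, so no single reward scale can dominate the policy gradient. The sketch below is illustrative only and assumes a simple linear combination with hypothetical weights `w_style` and `w_task`; it is not the authors' implementation.

```python
# Illustrative sketch of multi-critic advantage aggregation (not the authors' code).
# Each reward group has its own critic; advantages are standardized per group
# before being summed, so the scale of any one reward cannot dominate.

def normalize(advantages):
    """Standardize a list of advantages to zero mean, unit std."""
    n = len(advantages)
    mean = sum(advantages) / n
    var = sum((a - mean) ** 2 for a in advantages) / n
    std = var ** 0.5 or 1.0  # guard against zero variance
    return [(a - mean) / (std + 1e-8) for a in advantages]

def combine_advantages(style_adv, task_adv, w_style=1.0, w_task=1.0):
    """Combine per-critic advantages into a single policy-gradient signal."""
    return [w_style * s + w_task * t
            for s, t in zip(normalize(style_adv), normalize(task_adv))]
```

Because each group is normalized independently, scaling the task reward by, say, 100x leaves the combined signal unchanged; a single shared critic would instead see its value targets swamped by the larger reward.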

Individual Motions on Go2

APEX Framework

APEX Method Overview

Overview of APEX. Only components along dashed lines are required during deployment.

APEX uses expert-informed action priors to guide early-stage exploration, which decay over time, allowing the policy to gradually explore independently. This is complemented by a multi-critic reinforcement learning setup that balances stylistic imitation with task performance. The process involves four key stages:

  1. Imitation Data Collection: Demonstrations can be collected from a privileged teacher policy, motion capture recordings, or other sources such as animation. Only kinematic joint data is required.
  2. Action Priors: Feed-forward torques are derived from the collected kinematic joint data and added to the action space to bias policy exploration. These priors are decayed over time to allow the policy to explore independently.
  3. Multi-Critic Reinforcement Learning: A PPO-based learning algorithm trains the final policy using both style-based and task-specific rewards, including regularization terms.
  4. Zero-Shot Hardware Transfer: The trained policy is directly deployed on Unitree Go2 without the need for fine-tuning.
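The decaying action prior in stage 2 can be sketched in a few lines. This is a minimal illustration under our own assumptions (a linear decay schedule and hypothetical names like `prior_torque` and `decay_steps`), not the authors' exact formulation:

```python
# Hypothetical sketch of a decaying action prior: the expert-derived
# feed-forward torque biases exploration early in training, and its
# weight decays to zero so the deployed policy needs no reference data.

def prior_weight(step, decay_steps):
    """Linearly decay the prior's influence from 1 to 0 over training."""
    return max(0.0, 1.0 - step / decay_steps)

def apply_action_prior(policy_action, prior_torque, step, decay_steps=10_000):
    """Add the weighted expert prior to the policy's action, per joint."""
    w = prior_weight(step, decay_steps)
    return [a + w * p for a, p in zip(policy_action, prior_torque)]
```

Once `prior_weight` reaches zero, `apply_action_prior` returns the raw policy action, which is consistent with the deployment setup above: only the learned policy runs on the robot.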

Frequently Asked Questions

Q: What makes APEX different from existing motion tracking methods?

APEX builds on reward-based tracking methods (like DeepMimic), with several practical advantages: it avoids extensive hyperparameter tuning, is significantly more sample-efficient, and makes it straightforward to learn motions without including reference trajectories in the actor’s observations. This leads to more reliable sim-to-real transfer and allows controlled deviations from the reference when needed, as shown in the imitation-beyond-demonstration section. The cool part is that the action priors can be added with just a few lines of code to any existing reward-based motion-tracking approach.

Q: Can APEX be applied to other robots beyond quadrupeds?

Yes :) While our main experiments focus on quadrupedal locomotion, APEX is a general framework that can be applied to other robotic systems. We also have initial results on humanoid motions using BeyondMimic, and we will release the humanoid locomotion code soon.

BibTeX

@misc{sood2025apexactionpriorsenable,
      title={APEX: Action Priors Enable Efficient Exploration for Robust Motion Tracking on Legged Robots}, 
      author={Shivam Sood and Laukik Nakhwa and Sun Ge and Yuhong Cao and Jin Cheng and Fatemah Zargarbashi and Taerim Yoon and Sungjoon Choi and Stelian Coros and Guillaume Sartoretti},
      year={2025},
      eprint={2505.10022},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2505.10022}, 
}