APEX: Action Priors Enable Efficient Exploration for Skill Imitation on Articulated Robots

National University of Singapore

Learning Diverse Animal-Like Skills

Gait transitions with a single policy

Highest recorded speed: 3.39 m/s

Imitation Beyond Data

APEX's exploration mechanism enables policies to generalize to uneven terrains from kinematic flat-ground data without deviating from the intended gait styles.

Trot on Stairs

Canter on Slope (1)

Canter on Slope (2)

Extension to Humanoids

Robustness Tests on Rough Terrain

Trot from a very happy dog

Pace

Real-World Skill Comparison

Gaits shown: Pace, Trot, Canter, and Pronk, each comparing the Reference motion, APEX, and AMP.

Sim-to-Sim Skill Comparison

Gaits shown: Pace and Pronk, each comparing APEX and AMP.

Abstract

Learning by imitation provides an effective way for robots to develop well-regulated, complex behaviors and to benefit directly from natural demonstrations. State-of-the-art imitation learning (IL) approaches typically leverage Adversarial Motion Priors (AMP), which, despite impressive results, suffer from two key limitations: they are prone to mode collapse, which often leads to overfitting to the simulation environment and thus an increased sim-to-real gap, and they struggle to learn diverse behaviors effectively.

To overcome these limitations, we introduce APEX (Action Priors enable Efficient Exploration): a simple yet versatile imitation learning framework that integrates demonstrations directly into reinforcement learning (RL), maintaining high exploration while grounding behavior with expert-informed priors. We achieve this through decaying action priors, which initially bias exploration toward expert demonstrations but gradually allow the policy to explore independently. This is complemented by a multi-critic RL framework that effectively balances stylistic consistency with task performance.

Our approach achieves sample-efficient imitation learning and enables the acquisition of diverse skills within a single policy. APEX generalizes to varying velocities and preserves reference-like styles across complex tasks such as navigating rough terrain and climbing stairs, using only flat-terrain kinematic motion data as a prior. We validate our framework through extensive hardware experiments on the Unitree Go2 quadruped, where APEX yields diverse and agile locomotion gaits, inherent gait transitions, and, to the best of our knowledge, the highest speed reported for the platform (∼4.5 m/s in sim-to-sim transfer and a peak velocity of ∼3.3 m/s on hardware). Our results establish APEX as a compelling alternative to existing IL methods, offering better efficiency, adaptability, and real-world performance.

Method

APEX Method Overview

Figure: Overview of APEX. Only components along dashed lines are required during deployment.

APEX uses expert-informed action priors, which decay over time, to guide early-stage exploration while gradually allowing the policy to explore independently. This is complemented by a multi-critic reinforcement learning setup that balances stylistic imitation with task performance. The process involves four key stages:

  1. Imitation Data Collection: Demonstrations can be collected from a privileged teacher policy, motion capture recordings, or other sources such as animation. Only kinematic joint data is required.
  2. Action Priors: Feed-forward torques are derived from the collected kinematic joint data and added to the action space to bias policy exploration. These priors decay over time, allowing the policy to explore independently (see the first sketch after this list).
  3. Multi-Critic Reinforcement Learning: A PPO-based learning algorithm trains the final policy using both style-based and task-specific rewards, including regularization terms (see the advantage-combination sketch below).
  4. Zero-Shot Hardware Transfer: The trained policy is deployed directly on the Unitree Go2 without fine-tuning.
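
As an illustration of step 2, the sketch below shows one way decaying action priors can bias exploration. All names, the PD-style torque computation, and the linear decay schedule are illustrative assumptions, not APEX's exact implementation; the paper specifies the actual formulation.

import numpy as np

def feedforward_torque(q_ref, dq_ref, q, dq, kp=40.0, kd=1.0):
    # PD-style torque toward the kinematic reference: one plausible way
    # to derive feed-forward torques from kinematic-only joint data.
    q_err = np.asarray(q_ref) - np.asarray(q)
    dq_err = np.asarray(dq_ref) - np.asarray(dq)
    return kp * q_err + kd * dq_err

def prior_weight(step, decay_steps=20_000):
    # Assumed linear decay from 1 to 0, so the prior dominates early
    # exploration and vanishes once the policy can explore on its own.
    return float(np.clip(1.0 - step / decay_steps, 0.0, 1.0))

def blended_action(policy_action, q_ref, dq_ref, q, dq, step):
    # The decayed feed-forward torque is added to the policy's action,
    # biasing exploration toward the expert demonstration.
    tau_ff = feedforward_torque(q_ref, dq_ref, q, dq)
    return np.asarray(policy_action) + prior_weight(step) * tau_ff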

More implementation details and the math behind why APEX works can be found in the paper.
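
For step 3, here is a minimal sketch of how a multi-critic setup might combine per-group advantages in the PPO update. The group names, weights, and per-group normalization are assumptions rather than the paper's exact formulation.

import torch

def combine_advantages(adv_by_group, weights, eps=1e-8):
    # Each reward group (e.g. style imitation, task, regularization) has
    # its own critic and its own GAE advantages. Normalizing per group
    # before the weighted sum keeps any single reward scale from
    # dominating the update.
    total = torch.zeros_like(next(iter(adv_by_group.values())))
    for name, adv in adv_by_group.items():
        total = total + weights[name] * (adv - adv.mean()) / (adv.std() + eps)
    return total

# Example usage (hypothetical weights):
# advantages = {"style": adv_style, "task": adv_task, "reg": adv_reg}
# weights = {"style": 0.45, "task": 0.45, "reg": 0.1}
# combined = combine_advantages(advantages, weights)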

BibTeX

@misc{sood2025apexactionpriorsenable,
      title={APEX: Action Priors Enable Efficient Exploration for Skill Imitation on Articulated Robots}, 
      author={Shivam Sood and Laukik B Nakhwa and Yuhong Cao and Sun Ge and Guillaume Sartoretti},
      year={2025},
      eprint={2505.10022},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2505.10022}, 
}