KiVi: Kinesthetic-Visuospatial Integration for Dynamic and Safe Egocentric Legged Locomotion

A bio-inspired quadruped locomotion framework that separates proprioceptive kinesthetics from visuospatial terrain reasoning, enabling dynamic traversal and graceful fallback under corrupted or unavailable vision.

Peizhuo Li1,*, Hongyi Li1,2,*, Yuxuan Ma1,*, Linnan Chang1, Xinrong Yang1, Ruiqi Yu3, Shuhao Liao1, Yifeng Zhang1, Yuhong Cao1,†, Qiuguo Zhu3, Guillaume Sartoretti1
1MARMot Lab, National University of Singapore    2Center of X-Mechanics, Zhejiang University    3Robot and Robot Intelligence Lab, Zhejiang University
*Equal contribution    †Corresponding author    IROS 2026

Video


KiVi enables robust locomotion and obstacle avoidance on a DeepRobotics Lite3 quadruped across diverse terrains and under severe visual disturbances.

Abstract

Vision-based locomotion has shown great promise in enabling legged robots to perceive and adapt to complex environments. However, visual information is inherently fragile, being vulnerable to occlusions, reflections, and lighting changes, which often cause instability in locomotion. Inspired by animal sensorimotor integration, we propose KiVi, a Kinesthetic-Visuospatial integration framework, where kinesthetics encodes proprioceptive sensing of body motion and visuospatial reasoning captures visual perception of surrounding terrain. KiVi separates these pathways, leveraging proprioception as a stable backbone while selectively incorporating vision for terrain awareness and obstacle avoidance. Combined with memory-enhanced attention, this design allows robust interpretation of visual cues while maintaining fallback stability through proprioception. Experiments show that KiVi enables quadruped robots to traverse diverse terrains and operate reliably in unstructured outdoor environments, remaining robust to out-of-distribution visual noise and occlusion unseen during training.

Method Overview

KiVi framework overview

Modality-separated robust control

KiVi uses a dual-branch estimator with a Kinesthetic Module for proprioceptive body-motion sensing and a Visuospatial Module for visual terrain reasoning.

The kinesthetic branch provides a stable locomotion backbone, while the visuospatial branch uses memory-enhanced attention to reconstruct terrain structure and anticipate obstacles. Their latent representations are integrated by the downstream actor for dynamic, terrain-aware control.

  • Single-stage actor-critic training with privileged critic information.
  • MemTransformer-based temporal memory for visual terrain understanding.
  • Graceful fallback to proprioception when vision is unreliable.
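
The modality separation described above can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the class name `DualBranchPolicy`, the single-layer `tanh` encoders, the dimensions, and in particular the explicit zeroing of the visual latent when no frame is available. KiVi's actual encoders use memory-enhanced attention and its fallback behavior is learned, so this only shows the structural idea of two separate latents that are joined at the actor.

```python
import math
import random
from typing import List, Optional

Vector = List[float]


def random_matrix(rows: int, cols: int, rng: random.Random) -> List[Vector]:
    """Small random weight matrix (hypothetical stand-in for trained weights)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def layer(weights: List[Vector], x: Vector) -> Vector:
    """One dense layer with tanh activation and no bias."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]


class DualBranchPolicy:
    """Two separate encoders whose latents are concatenated for the actor."""

    def __init__(self, proprio_dim: int, visual_dim: int,
                 latent_dim: int, action_dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.latent_dim = latent_dim
        self.w_kin = random_matrix(latent_dim, proprio_dim, rng)   # kinesthetic branch
        self.w_vis = random_matrix(latent_dim, visual_dim, rng)    # visuospatial branch
        self.w_actor = random_matrix(action_dim, 2 * latent_dim, rng)

    def act(self, proprio: Vector, visual: Optional[Vector]) -> Vector:
        z_kin = layer(self.w_kin, proprio)
        if visual is None:
            # Assumption: a missing frame is represented by a zero visual latent,
            # so the actor is driven by the kinesthetic latent alone.
            z_vis = [0.0] * self.latent_dim
        else:
            z_vis = layer(self.w_vis, visual)
        return layer(self.w_actor, z_kin + z_vis)  # list concatenation of latents
```

Because the branches never share an encoder, a corrupted visual input in this sketch can only affect the action through `z_vis`; the kinesthetic latent is computed identically whether or not vision is present.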

Simulation and Real-World Results

KiVi is evaluated in simulation and on DeepRobotics Lite3 hardware across visual corruption, terrain traversability, and outdoor disturbance tests.

Training terrains used by KiVi

Diverse Terrain Curriculum

Training spans stairs, platforms, random rough terrain, slopes, gaps, and high walls with increasing procedural difficulty.
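
A difficulty curriculum of this kind is often implemented by promoting or demoting each terrain type based on recent performance. The sketch below is one such scheme, not the rule used by KiVi: the terrain keys, the `TerrainCurriculum` class, the ten difficulty levels, and the 0.8 / 0.4 thresholds are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict

# Terrain families named in the text; the identifiers are illustrative.
TERRAINS = ("stairs", "platforms", "rough", "slopes", "gaps", "high_walls")


@dataclass
class TerrainCurriculum:
    max_level: int = 9        # hypothetical number of difficulty steps (0..9)
    promote_at: float = 0.8   # hypothetical progress ratio needed to level up
    demote_at: float = 0.4    # hypothetical progress ratio below which to level down
    levels: Dict[str, int] = field(
        default_factory=lambda: {t: 0 for t in TERRAINS}
    )

    def update(self, terrain: str, progress: float) -> int:
        """Adjust one terrain's level from an episode's progress in [0, 1]."""
        level = self.levels[terrain]
        if progress >= self.promote_at:
            level = min(level + 1, self.max_level)
        elif progress < self.demote_at:
            level = max(level - 1, 0)
        self.levels[terrain] = level
        return level
```

Each terrain family keeps its own level, so a policy that masters slopes early is not held back by slower progress on gaps or high walls.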

Outdoor hardware experiments

Outdoor Traversability

With a constant forward command, the robot traverses tree roots, stairs, elevated platforms, and dynamic pedestrian scenarios.

KiVi under tall grass and camera occlusion

Visual Robustness

Under tall grass and complete camera occlusion, KiVi maintains stable locomotion by falling back to proprioceptive control.

Robustness Under Corrupted Vision

Total joint power comparison
Joint power variance comparison

Compared with a fused visual-proprioceptive baseline, KiVi keeps total joint power and its variance closer to those of the blind locomotion baseline under severe visual disturbances, indicating stable and energy-efficient control.
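
For readers who want to reproduce this kind of comparison from logged data, the following sketch computes the two quantities. It assumes total joint power is the sum of absolute mechanical power |torque × joint velocity| over all joints at each timestep, and that variance is taken over time; the exact definition used in the paper may differ, and the function names are illustrative.

```python
from statistics import mean, pvariance
from typing import Sequence, Tuple


def total_joint_power(torques: Sequence[float],
                      velocities: Sequence[float]) -> float:
    """Sum of |tau_i * qd_i| over all joints at one timestep (assumed definition)."""
    return sum(abs(t * v) for t, v in zip(torques, velocities))


def power_stats(torque_log: Sequence[Sequence[float]],
                velocity_log: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Mean and population variance of total joint power across timesteps."""
    powers = [total_joint_power(t, v) for t, v in zip(torque_log, velocity_log)]
    return mean(powers), pvariance(powers)
```

Running `power_stats` on logs from a vision-based policy and a blind policy over the same route gives directly comparable numbers: a low mean suggests efficient gait, and a low variance suggests the policy is not reacting erratically to spurious visual input.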

Real-World Visual Disturbances

KiVi locomotion on reflective surfaces

Reflective surfaces create structured depth artifacts, yet KiVi maintains stable locomotion.

Key findings

  • 5/5 success on high platforms, obstacle avoidance, tall grass, and camera-blocking tests.
  • Robust to out-of-distribution visual noise such as reflections, vegetation, and complete camera occlusion.
  • Runs onboard with depth acquisition at 10 Hz, policy inference at 50 Hz, and low-level PD control at 200 Hz.
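
The three rates in the last item imply a nested schedule: at 200 Hz, the policy runs on every 4th PD tick and a new depth frame arrives on every 20th. The sketch below counts how often each stage fires and shows a standard joint-space PD law. The function names and the gains in the usage note are hypothetical, and the real system likely runs these stages in separate threads or processes rather than a single loop.

```python
PD_HZ = 200      # low-level PD control rate (from the text)
POLICY_HZ = 50   # policy inference rate (from the text)
DEPTH_HZ = 10    # depth acquisition rate (from the text)


def schedule(ticks: int) -> dict:
    """Count how many times each stage fires over `ticks` PD control ticks."""
    policy_every = PD_HZ // POLICY_HZ   # 4 PD ticks per policy step
    depth_every = PD_HZ // DEPTH_HZ     # 20 PD ticks per depth frame
    counts = {"depth": 0, "policy": 0, "pd": 0}
    for tick in range(ticks):
        if tick % depth_every == 0:
            counts["depth"] += 1        # new depth frame would be read here
        if tick % policy_every == 0:
            counts["policy"] += 1       # policy would emit new joint targets here
        counts["pd"] += 1               # PD loop tracks the latest targets
    return counts


def pd_torque(q_target: float, q: float, qd: float,
              kp: float, kd: float) -> float:
    """Standard PD law with zero target velocity: kp * (q* - q) - kd * qd."""
    return kp * (q_target - q) - kd * qd
```

For example, `schedule(200)` covers one second of control, and `pd_torque(1.0, 0.5, 2.0, kp=20.0, kd=0.5)` evaluates the PD law for a single joint with illustrative gains. Between depth frames the policy runs five times on the same image, which is one reason temporal memory in the visual branch is useful.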

BibTeX

@inproceedings{li2026kivi,
  title={KiVi: Kinesthetic-Visuospatial Integration for Dynamic and Safe Egocentric Legged Locomotion},
  author={Li, Peizhuo and Li, Hongyi and Ma, Yuxuan and Chang, Linnan and Yang, Xinrong and Yu, Ruiqi and Liao, Shuhao and Zhang, Yifeng and Cao, Yuhong and Zhu, Qiuguo and Sartoretti, Guillaume},
  booktitle={IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
  year={2026},
  eprint={2509.23650},
  archivePrefix={arXiv},
  primaryClass={cs.RO}
}