CanonSwap decouples motion information from appearance to enable high-fidelity and consistent video face swapping.
Video face swapping must address two primary challenges: effectively transferring the source identity to the target video and accurately preserving the dynamic attributes of the target face, such as head pose, facial expressions, and lip sync. Existing methods mainly focus on achieving high-quality identity transfer but often fall short in maintaining the dynamic attributes of the target face, leading to inconsistent results. We attribute this issue to the inherent coupling of facial appearance and motion in videos. To address this, we propose CanonSwap, a novel video face-swapping framework that decouples motion information from appearance information. Specifically, CanonSwap first eliminates motion-related information, enabling identity modification within a unified canonical space. The swapped feature is then reintegrated into the original video space, ensuring that the target face's dynamic attributes are preserved. To achieve precise identity transfer with minimal artifacts and enhanced realism, we further design a Partial Identity Modulation module that adaptively integrates source identity features, using a spatial mask to restrict modifications to facial regions. Additionally, we introduce several fine-grained synchronization metrics to comprehensively evaluate video face swapping methods. Extensive experiments demonstrate that our method significantly outperforms existing approaches in terms of visual quality, temporal consistency, and identity preservation.
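As a rough illustration of the Partial Identity Modulation idea, the sketch below modulates a canonical feature map with the source identity embedding and blends the result with the original feature through a predicted soft mask, so that only facial regions are altered. All module names, layer choices, and tensor shapes are illustrative assumptions rather than the actual implementation.

```python
# Hypothetical sketch of Partial Identity Modulation: the canonical feature map
# is modulated by the source identity embedding, and a predicted spatial mask
# limits the change to facial regions. Names and shapes are assumptions.
import torch
import torch.nn as nn

class PartialIdentityModulation(nn.Module):
    def __init__(self, feat_channels: int, id_dim: int):
        super().__init__()
        # Map the identity embedding to per-channel scale and shift.
        self.to_scale = nn.Linear(id_dim, feat_channels)
        self.to_shift = nn.Linear(id_dim, feat_channels)
        # Predict a soft spatial mask from the canonical feature itself.
        self.mask_head = nn.Sequential(
            nn.Conv2d(feat_channels, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, canon_feat: torch.Tensor, id_embed: torch.Tensor) -> torch.Tensor:
        # canon_feat: (B, C, H, W) feature in the canonical space
        # id_embed:   (B, D) source identity embedding from the ID encoder
        scale = self.to_scale(id_embed).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        shift = self.to_shift(id_embed).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        modulated = canon_feat * (1.0 + scale) + shift               # identity-conditioned modulation
        mask = self.mask_head(canon_feat)                            # (B, 1, H, W) soft facial-region mask
        # Only the masked (facial) region is modified; the rest is kept intact.
        return mask * modulated + (1.0 - mask) * canon_feat
```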
Given a source image and a target video, our method first extracts identity features from the source image with an ID encoder. Each frame of the target video is warped into a canonical space using the transformation $\mathbf{M_{o \rightarrow c}}$ estimated by the motion extractor. In this canonical space, we perform identity transfer with the Partial Identity Modulation module, and the transformed features are then polished by a refinement module. Finally, the refined feature is warped back to the target pose using $\mathbf{M_{c \rightarrow o}}$ to generate the swapped result.
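The per-frame flow described above can be summarized with the following hedged sketch; `feat_encoder`, `motion_extractor`, `warp`, `refine`, and `decoder` are hypothetical placeholders for the corresponding networks, and `pim` stands in for the Partial Identity Modulation module sketched earlier.

```python
# Illustrative per-frame flow of the pipeline described above. All callables
# passed in are placeholders standing in for the actual networks.
def swap_frame(source_img, target_frame,
               id_encoder, feat_encoder, motion_extractor,
               warp, pim, refine, decoder):
    # 1. Identity features from the source image.
    id_embed = id_encoder(source_img)                 # (B, D)

    # 2. Target-frame features and their motion (pose/expression) estimate.
    feat = feat_encoder(target_frame)                 # (B, C, H, W)
    motion = motion_extractor(target_frame)           # parameters of M_{o->c}

    # 3. Warp to the canonical space (M_{o->c}) to remove motion information.
    canon_feat = warp(feat, motion, direction="o2c")

    # 4. Identity transfer in the canonical space, then refinement.
    swapped_canon = refine(pim(canon_feat, id_embed))

    # 5. Warp back to the original pose (M_{c->o}) and decode the frame.
    out_feat = warp(swapped_canon, motion, direction="c2o")
    return decoder(out_feat)
```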
Face swapping and animation results. Both the identity and the expressions of the resulting video come from the source video.
Coming Soon