Tunnel Try-on: Excavating Spatial-temporal Tunnels for High-quality Virtual Try-on in Videos

Zhengze Xu1    Mengting Chen2    Zhao Wang2    Linyu Xing2    Zhonghua Zhai2    Nong Sang1    Jinsong Lan2    Shuai Xiao2    Changxin Gao1   
1Huazhong University of Science and Technology  
2Alibaba Group  


Video try-on is a challenging task and has not been well tackled in previous works. The main obstacle lies in preserving the details of the clothing and modeling the coherent motions simultaneously. Faced with those difficulties, we address video try-on by proposing a diffusion-based framework named "Tunnel Try-on". The core idea is excavating a "focus tunnel" in the input video that gives close-up shots around the clothing regions. We zoom in on the region in the tunnel to better preserve the fine details of the clothing. To generate coherent motions, we first leverage the Kalman filter to construct smooth crops in the focus tunnel and inject the position embedding of the tunnel into attention layers to improve the continuity of the generated videos. In addition, we develop an environment encoder to extract the context information outside the tunnels as supplementary cues. Equipped with these techniques, Tunnel Try-on keeps the fine details of the clothing and synthesizes stable and smooth videos. Demonstrating significant advancements, Tunnel Try-on could be regarded as the first attempt toward the commercial-level application of virtual try-on in videos.
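
To make the idea of Kalman-smoothed crops concrete, here is a minimal sketch of how per-frame garment boxes might be smoothed into a tunnel. It is not the authors' implementation: the constant-velocity motion model, the independent filtering of each box coordinate, the `(x1, y1, x2, y2)` box format, the noise parameters `q` and `r`, and the function names `kalman_smooth_1d` and `smooth_tunnel` are all assumptions made for illustration.

```python
import numpy as np

def kalman_smooth_1d(z, q=1e-3, r=1.0):
    """Filter a 1-D sequence with a constant-velocity Kalman filter.

    q and r are assumed process/measurement noise levels, not values
    taken from the paper.
    """
    F = np.array([[1.0, 1.0], [0.0, 1.0]])  # transition for (position, velocity)
    H = np.array([[1.0, 0.0]])              # only the position is observed
    Q = q * np.eye(2)
    R = np.array([[r]])
    x = np.array([z[0], 0.0])               # start at the first measurement
    P = np.eye(2)
    out = np.empty(len(z))
    for t, zt in enumerate(z):
        if t > 0:
            # Predict the next state from the motion model.
            x = F @ x
            P = F @ P @ F.T + Q
        # Correct the prediction with the measurement for this frame.
        y = zt - (H @ x)
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ y
        P = (np.eye(2) - K @ H) @ P
        out[t] = x[0]
    return out

def smooth_tunnel(boxes, q=1e-3, r=1.0):
    """Smooth per-frame boxes of shape (T, 4), given as (x1, y1, x2, y2)."""
    boxes = np.asarray(boxes, dtype=float)
    # Work on center and size so position and scale are smoothed separately.
    cx = (boxes[:, 0] + boxes[:, 2]) / 2
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    cx, cy, w, h = (kalman_smooth_1d(v, q, r) for v in (cx, cy, w, h))
    return np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=1)
```

In this sketch a smaller `q` relative to `r` trusts the motion model more and yields a steadier crop window, at the cost of lagging behind fast movements.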

Adapting to Varied Camera-Person Relationships

Tunnel Try-on can not only handle complex clothing and backgrounds but also adapt to different types of movements in the video.

Person-to-Camera Distance Variation

Parallel Motion Relative to the Camera

Camera Angle Dynamics

Adapting to Different Clothing Styles

Unlike previous video try-on methods, which are limited to tight-fitting tops, Tunnel Try-on can perform try-on for different types of tops and bottoms.


Given an input video and a clothing image, Tunnel Try-on first extracts a focus tunnel that zooms in on the region around the garments to better preserve their details. The zoomed region is represented by a sequence of tensors consisting of the background latent, the latent noise, and the garment mask. Human pose information is added to the latent noise to assist generation. The resulting 9-channel tensor is then fed into the Main U-Net, while a Ref U-Net and a CLIP Encoder extract representations of the clothing image; these clothing representations are injected into the Main U-Net through ref-attention. In addition, Tunnel Try-on injects the tunnel embedding into temporal attention to generate more consistent motions and uses an environment encoder to extract the global context as additional guidance.
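
As a rough illustration of the 9-channel input described above, the sketch below assembles the tensor for a clip of frames. Several details are assumptions rather than facts from the paper: the split into 4 + 4 + 1 channels (4-channel latents, as in common latent diffusion models), the channel order, the `(T, C, H, W)` layout, pose features that are already encoded to the latent shape, and the function name `build_main_unet_input`.

```python
import numpy as np

def build_main_unet_input(bg_latent, noise, pose_feat, mask):
    """Assemble a (T, 9, H, W) input from per-frame tensors.

    bg_latent, noise, pose_feat: (T, 4, H, W)  (assumed 4-channel latents)
    mask:                        (T, 1, H, W)  garment mask at latent resolution
    """
    assert bg_latent.shape == noise.shape == pose_feat.shape
    assert mask.shape == (noise.shape[0], 1) + noise.shape[2:]
    # Pose information is added element-wise to the latent noise.
    noisy = noise + pose_feat
    # Concatenate along the channel axis: 4 + 4 + 1 = 9 channels (assumed order).
    return np.concatenate([bg_latent, noisy, mask], axis=1)
```

For example, a clip of 8 frames at a latent resolution of 64 x 48 would produce an array of shape `(8, 9, 64, 48)` under these assumptions.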


    title={Tunnel Try-on: Excavating Spatial-temporal Tunnels for High-quality Virtual Try-on in Videos},
    author={Xu, Zhengze and Chen, Mengting and Wang, Zhao and Xing, Linyu and Zhai, Zhonghua and Sang, Nong and Lan, Jinsong and Xiao, Shuai and Gao, Changxin},
    journal={arXiv preprint arXiv:2404.17571},