Figure 1: We propose an Editing-awaRE (REE) feature injection method for pre-trained rectified-Flow models to conduct zero-shot image-driven video editing, dubbed FREE-Edit. Given the edited first frame, our method generates the output video, propagating the edited content while leaving other aspects (e.g., motion) unchanged. The commonly employed vanilla feature injection yields semantics that conflict with the edited first frame, while the w/o-injection results fail to preserve motion. In contrast, our Editing-awaRE (REE) feature injection adaptively modulates the injection intensity of each token, achieving the best results.
Image-driven video editing aims to propagate the edited content of a modified first frame to the remaining frames. Existing methods usually invert the source video to noise using a pre-trained image-to-video (I2V) model and then guide the sampling process with the edited first frame. A popular choice for maintaining the motion and layout of the source video is to intervene in the denoising process by injecting attention features from the reconstruction process. However, such injection often produces unsatisfactory results: excessive injection introduces semantics from the source video that conflict with the edit, while insufficient injection preserves too little of the source. Recognizing this, we propose an Editing-awaRE (REE) injection method that modulates the injection intensity of each token. Specifically, we first compute the pixel-wise difference between the source and edited first frames to obtain a first-frame editing mask. Next, we track the editing area throughout the video by warping this mask with optical flow. An editing-aware feature injection intensity is then generated for each token, with no injection performed in editing areas. Building upon REE injection, we further propose a zero-shot image-driven video editing framework based on recently emerging rectified-Flow models, dubbed FREE-Edit. Without fine-tuning or training, FREE-Edit demonstrates effectiveness across various image-driven video editing scenarios, producing higher-quality outputs than existing techniques.
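The mask computation and tracking steps above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function names, the difference threshold, and the nearest-neighbour warping are our assumptions.

```python
import numpy as np

def editing_mask(src_frame, edited_frame, thresh=0.05):
    """Per-pixel difference between the source and edited first frame.

    Pixels whose mean channel difference exceeds `thresh` (an assumed
    hyper-parameter) form the first-frame editing mask.
    """
    diff = np.abs(src_frame.astype(np.float32) - edited_frame.astype(np.float32))
    return (diff.mean(axis=-1) > thresh).astype(np.float32)

def warp_mask(mask, flow):
    """Warp the first-frame mask to a later frame with optical flow.

    `flow` has shape (H, W, 2); nearest-neighbour sampling is used here
    purely for simplicity.
    """
    h, w = mask.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    return mask[src_y, src_x]
```

Chaining `warp_mask` frame-by-frame over the flow fields yields the tracked editing masks used to set the per-token injection intensity.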
Figure 2: The pipeline of our FREE-Edit. Top: It follows an ``inversion-then-editing'' pipeline. Starting from the inverted noisy latent $z_1$, the reconstructed video $z_0$ and the edited video $\tilde{z}_0$ take the source first frame $X^1$ and the edited first frame $\hat{X}^1$ as condition signals, respectively. Bottom: Our Editing-awaRE (REE) injection method uses a modulation weight $\lambda$ to adaptively replace the intermediate representations ($\tilde{Q}$ and $\tilde{K}$) of the editing process with those ($Q$ and $K$) of the reconstruction process in the self-attention blocks. We first warp the automatically computed first-frame editing mask with optical flow, yielding tracked editing masks for subsequent frames. Based on these masks, we compute the modulation weight $\lambda$ for each token, with no injection performed in the editing area.
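The $\lambda$-modulated replacement in the bottom row can be sketched as below. For illustration we assume a hard 0/1 weight per token ($\lambda = 0$ on edited tokens, $\lambda = 1$ elsewhere); the actual modulation weight may be continuous, and the function name is ours.

```python
import numpy as np

def ree_inject(q_edit, k_edit, q_rec, k_rec, edit_token_mask):
    """Editing-aware replacement of self-attention queries/keys.

    edit_token_mask: (N,) array, 1.0 where a token lies in the tracked
    editing area. lam = 1 - mask, so reconstruction features (Q, K) are
    injected only outside the editing area, and edited tokens keep the
    editing-branch features untouched.
    """
    lam = (1.0 - edit_token_mask)[:, None]   # broadcast over feature dims
    q = lam * q_rec + (1.0 - lam) * q_edit
    k = lam * k_rec + (1.0 - lam) * k_edit
    return q, k
```

The blended $Q$ and $K$ then replace $\tilde{Q}$ and $\tilde{K}$ inside the self-attention blocks of the editing branch.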
Qualitative comparison among standard FREE-Edit (w/ REE injection), w/o injection, and w/ vanilla injection.
Qualitative comparison with existing image-driven video editing methods, Go-with-the-Flow [1], I2VEdit [2], AnyV2V [3], and VideoShop [4], across various editing scenarios.
@article{free-edit,
title = {FREE-Edit: Using Editing-aware Injection in Rectified Flow Models for Zero-shot Image-Driven Video Editing},
author = {Li, Maomao and Liu, Yunfei and Li, Yu},
journal = {arXiv preprint arXiv:2603.01164},
year = {2026}
}
[1] Ryan Burgert, Yuancheng Xu, Wenqi Xian, Oliver Pilarski, Pascal Clausen, Mingming He, Li Ma, Yitong Deng, Lingxiao Li, Mohsen Mousavi, et al. Go-with-the-Flow: Motion-controllable video diffusion models using real-time warped noise. In CVPR, pages 13–23, 2025.
[2] Wenqi Ouyang, Yi Dong, Lei Yang, Jianlou Si, and Xingang Pan. I2vedit: First-frame-guided video editing via image-to-video diffusion models. In SIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024.
[3] Max Ku, Cong Wei, Weiming Ren, Huan Yang, and Wenhu Chen. Anyv2v: A plug-and-play framework for any video-to-video editing tasks. TMLR, 2024.
[4] Xiang Fan, Anand Bhattad, and Ranjay Krishna. VideoShop: Localized semantic video editing with noise-extrapolated diffusion inversion. In ECCV, 2024.