SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing

Xinyao Zhang¹,²,*, Wenkai Dong¹,*, Yuxin Song¹,*,†, Bo Fang¹,³, Qi Zhang¹, Jing Wang¹,², Fan Chen¹, Hui Zhang¹, Haocheng Feng¹, Yu Lu⁴,‡, Hang Zhou¹, Chun Yuan², Jingdong Wang¹

¹ Baidu Inc ² Tsinghua University ³ City University of Hong Kong ⁴ Zhejiang University

* Equal contribution † Project leader ‡ Corresponding author

Official inference code for SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing.

SAMA factorizes instruction-guided video editing into semantic anchoring and motion alignment, improving edit precision while preserving temporal dynamics from the source video.

🧾 Abstract

Current instruction-guided video editing models struggle to simultaneously balance precise semantic modifications with faithful motion preservation. While existing approaches rely on injecting explicit external priors (e.g., VLM features or structural conditions) to mitigate these issues, this reliance severely bottlenecks model robustness and generalization. To overcome this limitation, we present SAMA (factorized Semantic Anchoring and Motion Alignment), a framework that factorizes video editing into semantic anchoring and motion modeling. First, we introduce Semantic Anchoring, which establishes a reliable visual anchor by jointly predicting semantic tokens and video latents at sparse anchor frames, enabling purely instruction-aware structural planning. Second, Motion Alignment pre-trains the same backbone on motion-centric video restoration pretext tasks (cube inpainting, speed perturbation, and tube shuffle), enabling the model to internalize temporal dynamics directly from raw videos. SAMA is optimized with a two-stage pipeline: a factorized pre-training stage that learns inherent semantic-motion representations without paired video-instruction editing data, followed by supervised fine-tuning on paired editing data. Remarkably, the factorized pre-training alone already yields strong zero-shot video editing ability, validating the proposed factorization. SAMA achieves state-of-the-art performance among open-source models and is competitive with leading commercial systems (e.g., Kling-Omni). Code, models, and datasets will be released.

🖼️ Overview

📰 News

🔥 2026.03.24 SAMA-ComfyUI is open-sourced at Cynthiazxy123/SAMA-ComfyUI-official.
🔥 2026.03.21 SAMA-14B is released at syxbb/SAMA-14B.
🔥 2026.03.20 The paper is released.

📊 Benchmark Highlight

🚀 Quick Start

🛠️ Installation

Recommended environment:

Linux
NVIDIA GPU
CUDA 12.1 or a compatible environment
Python 3.10

git clone https://github.com/Cynthiazxy123/SAMA
cd SAMA

conda create -n sama python=3.10 -y
conda activate sama

pip install --upgrade pip
pip install -r requirements.txt

▶️ Inference

Prepare:

The base Wan2.1-T2V-14B model directory.
A SAMA checkpoint from Hugging Face.
A source video and an edit instruction.

The inference script is:

infer_sh/run_sama.sh

Edit the variables at the top of that script before running:

MODEL_ROOT
STATE_DICT
SRC_VIDEO
PROMPT
OUTPUT_DIR

Then run:

bash infer_sh/run_sama.sh

The generated result will be saved to:

outputs/seed_1/<input_video_filename>
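To locate the result from another script, the path pattern above can be sketched in Python. The helper name `expected_output_path`, the default `outputs` root, and the seed value of 1 are assumptions read off the example path. They are not taken from the released script.

```python
from pathlib import Path


def expected_output_path(src_video: str, output_dir: str = "outputs", seed: int = 1) -> Path:
    """Return <output_dir>/seed_<seed>/<input_video_filename>.

    Hypothetical helper: it only mirrors the path pattern documented above
    and does not read any setting from infer_sh/run_sama.sh.
    """
    return Path(output_dir) / f"seed_{seed}" / Path(src_video).name


print(expected_output_path("inputs/demo.mp4").as_posix())  # outputs/seed_1/demo.mp4
```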

A recommended local model layout is:

models/
├── Wan2.1-T2V-14B/
│   ├── diffusion_pytorch_model-00001-of-00006.safetensors
│   ├── diffusion_pytorch_model-00002-of-00006.safetensors
│   ├── diffusion_pytorch_model-00003-of-00006.safetensors
│   ├── diffusion_pytorch_model-00004-of-00006.safetensors
│   ├── diffusion_pytorch_model-00005-of-00006.safetensors
│   ├── diffusion_pytorch_model-00006-of-00006.safetensors
│   ├── models_t5_umt5-xxl-enc-bf16.pth
│   ├── Wan2.1_VAE.pth
│   └── google/
└── SAMA-14B/
    └── <downloaded_checkpoint>.safetensors
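Before launching inference, it can help to confirm that the base model directory matches this layout. The following is a minimal sketch, assuming the entries listed above are the ones that matter. The function name `missing_entries` and the constant `REQUIRED_ENTRIES` are hypothetical, and the check performed by the inference script itself may differ.

```python
from pathlib import Path

# Entries copied from the recommended layout above; whether the inference
# script checks exactly this set is an assumption.
REQUIRED_ENTRIES = [
    *[f"diffusion_pytorch_model-{i:05d}-of-00006.safetensors" for i in range(1, 7)],
    "models_t5_umt5-xxl-enc-bf16.pth",
    "Wan2.1_VAE.pth",
    "google",
]


def missing_entries(model_root: str) -> list[str]:
    """Return the names from REQUIRED_ENTRIES that are absent under model_root."""
    root = Path(model_root)
    return [name for name in REQUIRED_ENTRIES if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_entries("models/Wan2.1-T2V-14B")
    print("missing:", missing if missing else "none")
```

An empty result means every listed file and the google/ directory are present. It does not verify that the downloads are complete or uncorrupted.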

If you have huggingface_hub installed, you can download the released checkpoint with:

huggingface-cli download syxbb/SAMA-14B --local-dir ./models/SAMA-14B

📝 Notes

Input frames are automatically padded to satisfy the 4k+1 frame requirement used by Wan video inference.
The output video uses the source video FPS when available; otherwise it falls back to --fps.
If --model-root is incomplete, the script will stop and report the missing files or directories.
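The first two notes can be illustrated with a short Python sketch. It is not the repository's implementation: the function names are hypothetical, and padding by repeating the last frame is an assumed strategy, since the notes only state that inputs are padded to a 4k+1 frame count.

```python
def padded_frame_count(num_frames: int) -> int:
    """Smallest count >= num_frames of the form 4k+1 (k >= 0)."""
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    remainder = (num_frames - 1) % 4
    return num_frames if remainder == 0 else num_frames + (4 - remainder)


def pad_frames(frames: list) -> list:
    """Pad to a 4k+1 length by repeating the last frame (assumed strategy)."""
    extra = padded_frame_count(len(frames)) - len(frames)
    return frames + [frames[-1]] * extra


def resolve_fps(source_fps, fallback_fps: float) -> float:
    """Use the source FPS when it is a positive number, else the fallback."""
    if source_fps is not None and source_fps > 0:
        return float(source_fps)
    return float(fallback_fps)


print(padded_frame_count(81))  # 81, already 4*20 + 1
print(padded_frame_count(82))  # 85, next value of the form 4k + 1
print(resolve_fps(None, 16))   # 16.0, falls back when the source FPS is unknown
```

For example, an 82-frame clip would be processed as 85 frames under this scheme, so the output may contain a few more frames than the input unless the script trims them again.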

🤗 Available Models

| Model | Status | Link |
| --- | --- | --- |
| SAMA-5B | Coming soon | Coming soon |
| SAMA-14B | Available | syxbb/SAMA-14B |

🎛️ ComfyUI Workflow

We have also released an official ComfyUI integration for SAMA:

Repository: Cynthiazxy123/SAMA-ComfyUI-official
Provides a ready-to-use SAMA workflow for ComfyUI video editing
Supports loading the released SAMA-14B checkpoint with the Wan base model
Includes video input, editing, export, and preview nodes for an end-to-end editing workflow

🙏 Acknowledgement

Wan: We build SAMA on top of the Wan video generation backbone and follow its model ecosystem for video synthesis and editing.
DiffSynth: We use DiffSynth as the underlying implementation framework for model components, inference utilities, and training-related infrastructure.

📚 Citation

@article{zhang2026sama,
  title={SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing},
  author={Zhang, Xinyao and Dong, Wenkai and Song, Yuxin and Fang, Bo and Zhang, Qi and Wang, Jing and Chen, Fan and Zhang, Hui and Feng, Haocheng and Lu, Yu and others},
  journal={arXiv preprint arXiv:2603.19228},
  year={2026}
}
