Skip to content

thrumdev/SAMA

 
 

Repository files navigation

SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing

1 Baidu Inc    2 Tsinghua University    3 City University of Hong Kong    4 Zhejiang University
* Equal Contribution    † Project leader    ‡ Corresponding Author

Official inference code for SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing.

SAMA factorizes instruction-guided video editing into semantic anchoring and motion alignment, improving edit precision while preserving temporal dynamics from the source video.

🧾 Abstract

Current instruction-guided video editing models struggle to simultaneously balance precise semantic modifications with faithful motion preservation. While existing approaches rely on injecting explicit external priors (e.g., VLM features or structural conditions) to mitigate these issues, this reliance severely bottlenecks model robustness and generalization. To overcome this limitation, we present SAMA (factorized Semantic Anchoring and Motion Alignment), a framework that factorizes video editing into semantic anchoring and motion modeling. First, we introduce Semantic Anchoring, which establishes a reliable visual anchor by jointly predicting semantic tokens and video latents at sparse anchor frames, enabling purely instruction-aware structural planning. Second, Motion Alignment pre-trains the same backbone on motion-centric video restoration pretext tasks (cube inpainting, speed perturbation, and tube shuffle), enabling the model to internalize temporal dynamics directly from raw videos. SAMA is optimized with a two-stage pipeline: a factorized pre-training stage that learns inherent semantic-motion representations without paired video-instruction editing data, followed by supervised fine-tuning on paired editing data. Remarkably, the factorized pre-training alone already yields strong zero-shot video editing ability, validating the proposed factorization. SAMA achieves state-of-the-art performance among open-source models and is competitive with leading commercial systems (e.g. Kling-Omni). Code, models, and datasets will be released.

🖼️ Overview

SAMA teaser overview

📰 News

📊 Benchmark Highlight

VIE-Bench results

OpenVE-Bench results

ReCo-Bench results

🚀 Quick Start

🛠️ Installation

Recommended environment:

  • Linux
  • NVIDIA GPU
  • CUDA 12.1 or a compatible environment
  • Python 3.10
git clone https://github.com/Cynthiazxy123/SAMA
cd SAMA

conda create -n sama python=3.10 -y
conda activate sama

pip install --upgrade pip
pip install -r requirements.txt

▶️ Inference

Prepare:

  1. The base Wan2.1-T2V-14B model directory.
  2. A SAMA checkpoint from Hugging Face.
  3. A source video and an edit instruction.

The inference script is:

infer_sh/run_sama.sh

Edit the variables at the top of that script before running:

  • MODEL_ROOT
  • STATE_DICT
  • SRC_VIDEO
  • PROMPT
  • OUTPUT_DIR

Then run:

bash infer_sh/run_sama.sh

The generated result will be saved to:

outputs/seed_1/<input_video_filename>

A recommended local model layout is:

models/
├── Wan2.1-T2V-14B/
│   ├── diffusion_pytorch_model-00001-of-00006.safetensors
│   ├── diffusion_pytorch_model-00002-of-00006.safetensors
│   ├── diffusion_pytorch_model-00003-of-00006.safetensors
│   ├── diffusion_pytorch_model-00004-of-00006.safetensors
│   ├── diffusion_pytorch_model-00005-of-00006.safetensors
│   ├── diffusion_pytorch_model-00006-of-00006.safetensors
│   ├── models_t5_umt5-xxl-enc-bf16.pth
│   ├── Wan2.1_VAE.pth
│   └── google/
└── SAMA-14B/
    └── <downloaded_checkpoint>.safetensors

If you have huggingface_hub installed, you can download the released checkpoint with:

huggingface-cli download syxbb/SAMA-14B --local-dir ./models/SAMA-14B

📝 Notes

  • Input frames are automatically padded to satisfy the 4k+1 frame requirement used by Wan video inference.
  • The output video uses the source video FPS when available; otherwise it falls back to --fps.
  • If --model-root is incomplete, the script will stop and report the missing files or directories.

🤗 Available Models

Model Status Link
SAMA-5B Coming soon Coming soon
SAMA-14B Available syxbb/SAMA-14B

🎛️ ComfyUI Workflow

We also released an official ComfyUI integration for SAMA:

  • Repository: Cynthiazxy123/SAMA-ComfyUI-official
  • Provides a ready-to-use SAMA workflow for ComfyUI video editing
  • Supports loading the released SAMA-14B checkpoint with the Wan base model
  • Includes video input, editing, export, and preview nodes for an end-to-end editing workflow

🙏 Acknowledgement

  • Wan: We build SAMA on top of the Wan video generation backbone and follow its model ecosystem for video synthesis and editing.
  • DiffSynth: We use DiffSynth as the underlying implementation framework for model components, inference utilities, and training-related infrastructure.

📚 Citation

@article{zhang2026sama,
  title={SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing},
  author={Zhang, Xinyao and Dong, Wenkai and Song, Yuxin and Fang, Bo and Zhang, Qi and Wang, Jing and Chen, Fan and Zhang, Hui and Feng, Haocheng and Lu, Yu and others},
  journal={arXiv preprint arXiv:2603.19228},
  year={2026}
}

About

Official inference code for SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • Python 97.4%
  • HTML 2.0%
  • Other 0.6%