Diffusers documentation

Motif-Video

Technical Report

Motif-Video is a 2B-parameter diffusion transformer for text-to-video and image-to-video generation. It features a three-stage architecture (12 dual-stream layers, 16 single-stream layers, and 8 DDT decoder layers), Shared Cross-Attention for stable text-video alignment over long video sequences, a T5Gemma2 text encoder, and rectified flow matching for velocity prediction.
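
As a point of reference (sign and time conventions vary between implementations), rectified flow matching trains the model to regress a constant velocity along a straight interpolation between data and noise:

$$x_t = (1 - t)\,x_0 + t\,x_1, \qquad v_\theta(x_t, t) \approx x_1 - x_0$$

where \\(x_0\\) is a clean video latent, \\(x_1\\) is Gaussian noise, and \\(t \in [0, 1]\\).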

Figure: Motif-Video architecture diagram.
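
The bundled components correspond to the classes listed in the API reference further down this page; a minimal sketch for confirming what a checkpoint loads:

import torch
from diffusers import MotifVideoPipeline

pipe = MotifVideoPipeline.from_pretrained(
    "Motif-Technologies/Motif-Video-2B",
    torch_dtype=torch.bfloat16,
)
# Component classes, per the API reference on this page
print(type(pipe.transformer).__name__)   # MotifVideoTransformer3DModel
print(type(pipe.text_encoder).__name__)  # T5Gemma2Encoder
print(type(pipe.vae).__name__)           # AutoencoderKLWan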

Text-to-Video Generation

Use MotifVideoPipeline for text-to-video generation:

import torch
from diffusers import MotifVideoPipeline
from diffusers.utils import export_to_video


pipe = MotifVideoPipeline.from_pretrained(
    "Motif-Technologies/Motif-Video-2B",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair."
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

video = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=1280,
    height=736,
    num_frames=121,
    num_inference_steps=50,
).frames[0]
export_to_video(video, "output.mp4", fps=24)

Image-to-Video Generation

Use MotifVideoImage2VideoPipeline for image-to-video generation:

import torch
from diffusers import MotifVideoImage2VideoPipeline
from diffusers.utils import export_to_video, load_image


pipe = MotifVideoImage2VideoPipeline.from_pretrained(
    "Motif-Technologies/Motif-Video-2B",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

image = load_image("input_image.png")
prompt = "A cinematic scene with vivid colors."
negative_prompt = "worst quality, blurry, jittery, distorted"

video = pipe(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=1280,
    height=736,
    num_frames=121,
    num_inference_steps=50,
).frames[0]
export_to_video(video, "i2v_output.mp4", fps=24)

Memory-efficient Inference

For GPUs with less than 30GB VRAM (e.g., RTX 4090), use model CPU offloading:

import os

# Enable expandable CUDA memory segments before importing torch
# (equivalent to: export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True)
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch
from diffusers import MotifVideoPipeline
from diffusers.utils import export_to_video


pipe = MotifVideoPipeline.from_pretrained(
    "Motif-Technologies/Motif-Video-2B",
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()

prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair."
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

video = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=1280,
    height=736,
    num_frames=121,
    num_inference_steps=50,
).frames[0]
export_to_video(video, "output.mp4", fps=24)
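
If model offloading alone is not enough, sequential CPU offloading and VAE tiling can be combined at a further cost in speed. A minimal sketch, assuming AutoencoderKLWan exposes tiling in your diffusers version; use these in place of enable_model_cpu_offload():

# Slower, but with a lower peak VRAM footprint
pipe.enable_sequential_cpu_offload()  # moves submodules to the GPU one at a time
pipe.vae.enable_tiling()              # decodes the latent video in tiles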

MotifVideoPipeline

class diffusers.MotifVideoPipeline


( scheduler: SchedulerMixin vae: AutoencoderKLWan text_encoder: T5Gemma2Encoder tokenizer: PreTrainedTokenizerBase transformer: MotifVideoTransformer3DModel guider: BaseGuidance feature_extractor: typing.Optional[transformers.models.siglip.image_processing_siglip.SiglipImageProcessor] = None )

Parameters

  • transformer (MotifVideoTransformer3DModel) — Conditional Transformer architecture to denoise the encoded video latents.
  • scheduler (SchedulerMixin) — A scheduler to be used in combination with transformer to denoise the encoded video latents. Should be an instance of a class inheriting from SchedulerMixin, such as DPMSolverMultistepScheduler. If not provided, uses the scheduler attached to the pretrained model.
  • vae (AutoencoderKLWan) — Variational Auto-Encoder (VAE) Model to encode and decode videos to and from latent representations.
  • text_encoder (T5Gemma2Encoder) — Primary text encoder for encoding text prompts into embeddings.
  • tokenizer (PreTrainedTokenizerBase) — Tokenizer corresponding to the primary text encoder.
  • guider (BaseGuidance) — The guidance method to use. Should be an instance of a class inheriting from BaseGuidance, such as ClassifierFreeGuidance, AdaptiveProjectedGuidance, or SkipLayerGuidance. If not provided, defaults to ClassifierFreeGuidance. See the sketch below for overriding the default.

Pipeline for text-to-video generation using Motif-Video.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).
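
For example, to override the default guidance strength (a minimal sketch; ClassifierFreeGuidance under diffusers.guiders and the guidance_scale value shown here are assumptions to check against your diffusers version):

import torch
from diffusers import MotifVideoPipeline
from diffusers.guiders import ClassifierFreeGuidance

# Pass a pre-configured guider instead of the pipeline default
guider = ClassifierFreeGuidance(guidance_scale=6.0)
pipe = MotifVideoPipeline.from_pretrained(
    "Motif-Technologies/Motif-Video-2B",
    guider=guider,
    torch_dtype=torch.bfloat16,
)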

__call__


( prompt: typing.Union[str, typing.List[str], NoneType] = None negative_prompt: typing.Union[str, typing.List[str], NoneType] = None height: int = 736 width: int = 1280 num_frames: int = 121 num_inference_steps: int = 50 timesteps: typing.Optional[typing.List[int]] = None num_videos_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 512 vae_batch_size: int | None = None ) ~MotifVideoPipelineOutput or tuple

Parameters

  • prompt (str or List[str], optional) — The prompt or prompts to guide the video generation. If not defined, one has to pass prompt_embeds.
  • negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the video generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance.
  • height (int, defaults to 736) — The height in pixels of the generated video.
  • width (int, defaults to 1280) — The width in pixels of the generated video.
  • num_frames (int, defaults to 121) — The number of video frames to generate.
  • num_inference_steps (int, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality video at the expense of slower inference.
  • timesteps (List[int], optional) — Custom timesteps to use for the denoising process.
  • num_videos_per_prompt (int, optional, defaults to 1) — The number of videos to generate per prompt.
  • generator (torch.Generator or List[torch.Generator], optional) — PyTorch Generator object(s) for deterministic generation.
  • latents (torch.Tensor, optional) — Pre-generated noisy latents.
  • prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings.
  • prompt_attention_mask (torch.Tensor, optional) — Pre-generated attention mask for text embeddings.
  • negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings.
  • negative_prompt_attention_mask (torch.Tensor, optional) — Pre-generated attention mask for negative text embeddings.
  • output_type (str, optional, defaults to "pil") — The output format of the generated video. Choose between "pil", "np", or "latent".
  • return_dict (bool, optional, defaults to True) — Whether or not to return a MotifVideoPipelineOutput instead of a plain tuple.
  • attention_kwargs (dict, optional) — Arguments passed to the attention processor.
  • callback_on_step_end (Callable, optional) — A function or subclass of PipelineCallback or MultiPipelineCallbacks called at the end of each denoising step.
  • callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function.
  • max_sequence_length (int, defaults to 512) — Maximum sequence length for the tokenizer.
  • vae_batch_size (int, optional) — Batch size for VAE decoding. If provided and latents batch size is larger, VAE decoding will be done in chunks.

Returns

MotifVideoPipelineOutput or tuple

If return_dict is True, a MotifVideoPipelineOutput is returned; otherwise, a tuple is returned whose first element is a list of generated video frames.

The call function to the pipeline for text-to-video generation.

Examples:

>>> import torch
>>> from diffusers import MotifVideoPipeline
>>> from diffusers.utils import export_to_video

>>> # Load the Motif-Video pipeline
>>> motif_video_model_id = "Motif-Technologies/Motif-Video-2B"
>>> pipe = MotifVideoPipeline.from_pretrained(motif_video_model_id, torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")

>>> prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage"
>>> negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

>>> video = pipe(
...     prompt=prompt,
...     negative_prompt=negative_prompt,
...     width=1280,
...     height=736,
...     num_frames=121,
...     num_inference_steps=50,
... ).frames[0]
>>> export_to_video(video, "output.mp4", fps=24)
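
A follow-on sketch (continuing from the example above) for seeded, reproducible generation with chunked VAE decoding; note that vae_batch_size only chunks decoding when the latent batch is larger than it, e.g. with num_videos_per_prompt > 1:

>>> generator = torch.Generator(device="cuda").manual_seed(42)
>>> videos = pipe(
...     prompt=prompt,
...     negative_prompt=negative_prompt,
...     num_videos_per_prompt=2,
...     generator=generator,
...     vae_batch_size=1,  # decode one video's latents at a time
... ).frames
>>> export_to_video(videos[0], "output_seed42_a.mp4", fps=24)
>>> export_to_video(videos[1], "output_seed42_b.mp4", fps=24)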

encode_prompt


( prompt: typing.Union[str, typing.List[str]] negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_videos_per_prompt: int = 1 prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None max_sequence_length: int = 512 device: typing.Optional[torch.device] = None dtype: typing.Optional[torch.dtype] = None ) tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]

Parameters

  • prompt (str or List[str], optional) — The prompt or prompts to be encoded.
  • negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
  • num_videos_per_prompt (int, optional, defaults to 1) — Number of videos to generate per prompt.
  • prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings.
  • negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.
  • prompt_attention_mask (torch.Tensor, optional) — Pre-generated attention mask for text embeddings.
  • negative_prompt_attention_mask (torch.Tensor, optional) — Pre-generated attention mask for negative text embeddings.
  • max_sequence_length (int, defaults to 512) — Maximum sequence length for the tokenizer.
  • device (torch.device, optional) — Device to place tensors on.
  • dtype (torch.dtype, optional) — Data type for tensors.

Returns

tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]

A tuple containing:

  • prompt_embeds: The text embeddings for the positive prompt
  • negative_prompt_embeds: The text embeddings for the negative prompt (None if not using guidance)
  • prompt_attention_mask: The attention mask for the positive prompt
  • negative_prompt_attention_mask: The attention mask for the negative prompt (None if not using guidance)

Encodes the prompt into text encoder hidden states.
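
Since encode_prompt returns the 4-tuple documented above, embeddings can be pre-computed once and passed straight to __call__ (a minimal sketch, assuming the pipeline from the earlier examples):

>>> (
...     prompt_embeds,
...     negative_prompt_embeds,
...     prompt_attention_mask,
...     negative_prompt_attention_mask,
... ) = pipe.encode_prompt(
...     prompt="A red fox running through fresh snow",
...     negative_prompt="worst quality, blurry, jittery, distorted",
... )
>>> video = pipe(
...     prompt_embeds=prompt_embeds,
...     prompt_attention_mask=prompt_attention_mask,
...     negative_prompt_embeds=negative_prompt_embeds,
...     negative_prompt_attention_mask=negative_prompt_attention_mask,
... ).frames[0]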

MotifVideoImage2VideoPipeline

class diffusers.MotifVideoImage2VideoPipeline


( scheduler: SchedulerMixin vae: AutoencoderKLWan text_encoder: T5Gemma2Encoder tokenizer: PreTrainedTokenizerBase transformer: MotifVideoTransformer3DModel guider: BaseGuidance feature_extractor: SiglipImageProcessor )

Parameters

  • transformer (MotifVideoTransformer3DModel) — Conditional Transformer architecture to denoise the encoded video latents.
  • scheduler (SchedulerMixin) — A scheduler to be used in combination with transformer to denoise the encoded video latents. Should be an instance of a class inheriting from SchedulerMixin, such as DPMSolverMultistepScheduler. If not provided, uses the scheduler attached to the pretrained model.
  • vae (AutoencoderKLWan) — Variational Auto-Encoder (VAE) Model to encode and decode videos to and from latent representations.
  • text_encoder (T5Gemma2Encoder) — Primary text encoder for encoding text prompts into embeddings.
  • tokenizer (PreTrainedTokenizerBase) — Tokenizer corresponding to the primary text encoder.
  • feature_extractor (SiglipImageProcessor) — Image processor for the SigLIP vision encoder.
  • guider (BaseGuidance) — The guidance method to use. Should be an instance of a class inheriting from BaseGuidance, such as ClassifierFreeGuidance, AdaptiveProjectedGuidance, or SkipLayerGuidance. If not provided, defaults to ClassifierFreeGuidance.

Pipeline for image-to-video generation using Motif-Video with first frame conditioning.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

__call__


( image: PIL.Image.Image | numpy.ndarray | torch.Tensor | list[PIL.Image.Image] | list[numpy.ndarray] | list[torch.Tensor] prompt: typing.Union[str, typing.List[str]] negative_prompt: typing.Union[str, typing.List[str], NoneType] = None height: int = 736 width: int = 1280 num_frames: int = 121 num_inference_steps: int = 50 timesteps: typing.Optional[typing.List[int]] = None num_videos_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 512 ) ~MotifVideoPipelineOutput or tuple

Parameters

  • image (PipelineImageInput) — The input image to use as the first frame for video generation.
  • prompt (str or List[str]) — The prompt or prompts to guide the video generation.
  • negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the video generation.
  • height (int, defaults to 736) — The height in pixels of the generated video.
  • width (int, defaults to 1280) — The width in pixels of the generated video.
  • num_frames (int, defaults to 121) — The number of video frames to generate.
  • num_inference_steps (int, optional, defaults to 50) — The number of denoising steps.
  • timesteps (List[int], optional) — Custom timesteps to use for the denoising process.
  • num_videos_per_prompt (int, optional, defaults to 1) — The number of videos to generate per prompt.
  • generator (torch.Generator or List[torch.Generator], optional) — PyTorch Generator object(s) for deterministic generation.
  • latents (torch.Tensor, optional) — Pre-generated noisy latents.
  • prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings.
  • prompt_attention_mask (torch.Tensor, optional) — Pre-generated attention mask for text embeddings.
  • negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings.
  • negative_prompt_attention_mask (torch.Tensor, optional) — Pre-generated attention mask for negative text embeddings.
  • output_type (str, optional, defaults to "pil") — The output format of the generated video.
  • return_dict (bool, optional, defaults to True) — Whether or not to return a MotifVideoPipelineOutput instead of a plain tuple.
  • attention_kwargs (dict, optional) — Arguments passed to the attention processor.
  • callback_on_step_end (Callable, optional) — A function or subclass of PipelineCallback called at the end of each denoising step.
  • callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function.
  • max_sequence_length (int, defaults to 512) — Maximum sequence length for the tokenizer.

Returns

MotifVideoPipelineOutput or tuple

If return_dict is True, a MotifVideoPipelineOutput is returned; otherwise, a tuple is returned whose first element is a list of generated video frames.

The call function to the pipeline for image-to-video generation.

Examples:

>>> import torch
>>> from PIL import Image
>>> from diffusers import MotifVideoImage2VideoPipeline
>>> from diffusers.utils import export_to_video, load_image

>>> # Load the Motif-Video image-to-video pipeline
>>> motif_video_model_id = "Motif-Technologies/Motif-Video-2B"
>>> pipe = MotifVideoImage2VideoPipeline.from_pretrained(motif_video_model_id, torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")

>>> # Load an image
>>> image = load_image(
...     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.png"
... )

>>> prompt = "An astronaut is walking on the moon surface, kicking up dust with each step"
>>> negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

>>> video = pipe(
...     image=image,
...     prompt=prompt,
...     negative_prompt=negative_prompt,
...     width=1280,
...     height=736,
...     num_frames=121,
...     num_inference_steps=50,
... ).frames[0]
>>> export_to_video(video, "output.mp4", fps=24)

encode_prompt


( prompt: typing.Union[str, typing.List[str]] negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_videos_per_prompt: int = 1 prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None max_sequence_length: int = 512 device: typing.Optional[torch.device] = None dtype: typing.Optional[torch.dtype] = None ) tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]

Parameters

  • prompt (str or List[str], optional) — The prompt or prompts to be encoded.
  • negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
  • num_videos_per_prompt (int, optional, defaults to 1) — Number of videos to generate per prompt.
  • prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings.
  • negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.
  • prompt_attention_mask (torch.Tensor, optional) — Pre-generated attention mask for text embeddings.
  • negative_prompt_attention_mask (torch.Tensor, optional) — Pre-generated attention mask for negative text embeddings.
  • max_sequence_length (int, defaults to 512) — Maximum sequence length for the tokenizer.
  • device (torch.device, optional) — Device to place tensors on.
  • dtype (torch.dtype, optional) — Data type for tensors.

Returns

tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]

A tuple containing:

  • prompt_embeds: The text embeddings for the positive prompt
  • negative_prompt_embeds: The text embeddings for the negative prompt (None if not using guidance)
  • prompt_attention_mask: The attention mask for the positive prompt
  • negative_prompt_attention_mask: The attention mask for the negative prompt (None if not using guidance)

Encodes the prompt into text encoder hidden states.

MotifVideoPipelineOutput

class diffusers.MotifVideoPipelineOutput


( frames: Tensor )

Parameters

  • frames (torch.Tensor, np.ndarray, or List[List[PIL.Image.Image]]) — List of video outputs. It can be a nested list of length batch_size, with each sub-list containing denoised PIL image sequences of length num_frames, or a NumPy array or Torch tensor of shape (batch_size, num_frames, channels, height, width).

Output class for Motif-Video pipelines.
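
A short sketch of consuming the output (assumes the pipe and prompt from the examples above; shapes follow the frames description):

out = pipe(prompt=prompt)  # MotifVideoPipelineOutput when return_dict=True
video = out.frames[0]      # with output_type="pil": a list of num_frames PIL images
print(len(video))          # 121 with the default num_frames

# With output_type="np", frames is an array of shape
# (batch_size, num_frames, channels, height, width) per the description above.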
