Diffusers documentation

Motif-Video

Technical Report

Motif-Video is a 2B-parameter diffusion transformer for text-to-video and image-to-video generation. It features a three-stage architecture (12 dual-stream layers, 16 single-stream layers, and 8 DDT decoder layers), Shared Cross-Attention for stable text-video alignment over long video sequences, a T5Gemma2 text encoder, and rectified flow matching for velocity prediction.
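
As a point of reference (sign and time conventions vary between implementations), rectified flow matching trains the model to regress a constant velocity along a straight interpolation between data and noise:

$$x_t = (1 - t)\,x_0 + t\,x_1, \qquad v_\theta(x_t, t) \approx x_1 - x_0$$

where \\(x_0\\) is a clean video latent, \\(x_1\\) is Gaussian noise, and \\(t \in [0, 1]\\).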

Figure: Motif-Video architecture diagram.
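
The bundled components correspond to the classes listed in the API reference further down this page; a minimal sketch for confirming what a checkpoint loads:

import torch
from diffusers import MotifVideoPipeline

pipe = MotifVideoPipeline.from_pretrained(
    "Motif-Technologies/Motif-Video-2B",
    torch_dtype=torch.bfloat16,
)
# Component classes, per the API reference on this page
print(type(pipe.transformer).__name__)   # MotifVideoTransformer3DModel
print(type(pipe.text_encoder).__name__)  # T5Gemma2Encoder
print(type(pipe.vae).__name__)           # AutoencoderKLWan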

Text-to-Video Generation

Use MotifVideoPipeline for text-to-video generation:

import torch
from diffusers import MotifVideoPipeline
from diffusers.utils import export_to_video


pipe = MotifVideoPipeline.from_pretrained(
    "Motif-Technologies/Motif-Video-2B",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair."
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

video = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=1280,
    height=736,
    num_frames=121,
    num_inference_steps=50,
).frames[0]
export_to_video(video, "output.mp4", fps=24)

Image-to-Video Generation

Use MotifVideoImage2VideoPipeline for image-to-video generation:

import torch
from diffusers import MotifVideoImage2VideoPipeline
from diffusers.utils import export_to_video, load_image


pipe = MotifVideoImage2VideoPipeline.from_pretrained(
    "Motif-Technologies/Motif-Video-2B",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

image = load_image("input_image.png")
prompt = "A cinematic scene with vivid colors."
negative_prompt = "worst quality, blurry, jittery, distorted"

video = pipe(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=1280,
    height=736,
    num_frames=121,
    num_inference_steps=50,
).frames[0]
export_to_video(video, "i2v_output.mp4", fps=24)

Memory-efficient Inference

For GPUs with less than 30GB VRAM (e.g., RTX 4090), use model CPU offloading:

import os

# Enable expandable CUDA memory segments before importing torch
# (equivalent to: export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True)
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch
from diffusers import MotifVideoPipeline
from diffusers.utils import export_to_video


pipe = MotifVideoPipeline.from_pretrained(
    "Motif-Technologies/Motif-Video-2B",
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()

prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair."
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

video = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=1280,
    height=736,
    num_frames=121,
    num_inference_steps=50,
).frames[0]
export_to_video(video, "output.mp4", fps=24)
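
If model offloading alone is not enough, sequential CPU offloading and VAE tiling can be combined at a further cost in speed. A minimal sketch, assuming AutoencoderKLWan exposes tiling in your diffusers version; use these in place of enable_model_cpu_offload():

# Slower, but with a lower peak VRAM footprint
pipe.enable_sequential_cpu_offload()  # moves submodules to the GPU one at a time
pipe.vae.enable_tiling()              # decodes the latent video in tiles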

MotifVideoPipeline

class diffusers.MotifVideoPipeline


( scheduler: SchedulerMixin vae: AutoencoderKLWan text_encoder: T5Gemma2Encoder tokenizer: PreTrainedTokenizerBase transformer: MotifVideoTransformer3DModel guider: BaseGuidance feature_extractor: typing.Optional[transformers.models.siglip.image_processing_siglip.SiglipImageProcessor] = None )

Parameters

  • transformer (MotifVideoTransformer3DModel) — Conditional Transformer architecture to denoise the encoded video latents.
  • scheduler (SchedulerMixin) — A scheduler to be used in combination with transformer to denoise the encoded video latents. Should be an instance of a class inheriting from SchedulerMixin, such as DPMSolverMultistepScheduler. If not provided, uses the scheduler attached to the pretrained model.
  • vae (AutoencoderKLWan) — Variational Auto-Encoder (VAE) Model to encode and decode videos to and from latent representations.
  • text_encoder (T5Gemma2Encoder) — Primary text encoder for encoding text prompts into embeddings.
  • tokenizer (PreTrainedTokenizerBase) — Tokenizer corresponding to the primary text encoder.
  • guider (BaseGuidance) — The guidance method to use. Should be an instance of a class inheriting from BaseGuidance, such as ClassifierFreeGuidance, AdaptiveProjectedGuidance, or SkipLayerGuidance. If not provided, defaults to ClassifierFreeGuidance. See the sketch below for overriding the default.

Pipeline for text-to-video generation using Motif-Video.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).
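
For example, to override the default guidance strength (a minimal sketch; ClassifierFreeGuidance under diffusers.guiders and the guidance_scale value shown here are assumptions to check against your diffusers version):

import torch
from diffusers import MotifVideoPipeline
from diffusers.guiders import ClassifierFreeGuidance

# Pass a pre-configured guider instead of the pipeline default
guider = ClassifierFreeGuidance(guidance_scale=6.0)
pipe = MotifVideoPipeline.from_pretrained(
    "Motif-Technologies/Motif-Video-2B",
    guider=guider,
    torch_dtype=torch.bfloat16,
)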

__call__


( prompt: typing.Union[str, typing.List[str], NoneType] = None negative_prompt: typing.Union[str, typing.List[str], NoneType] = None height: int = 736 width: int = 1280 num_frames: int = 121 num_inference_steps: int = 50 timesteps: typing.Optional[typing.List[int]] = None num_videos_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 512 vae_batch_size: int | None = None ) ~MotifVideoPipelineOutput or tuple

Parameters

  • prompt (str or List[str], optional) — The prompt or prompts to guide the video generation. If not defined, one has to pass prompt_embeds.
  • negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the video generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance.
  • height (int, defaults to 736) — The height in pixels of the generated video.
  • width (int, defaults to 1280) — The width in pixels of the generated video.
  • num_frames (int, defaults to 121) — The number of video frames to generate.
  • num_inference_steps (int, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality video at the expense of slower inference.
  • timesteps (List[int], optional) — Custom timesteps to use for the denoising process.
  • num_videos_per_prompt (int, optional, defaults to 1) — The number of videos to generate per prompt.
  • generator (torch.Generator or List[torch.Generator], optional) — PyTorch Generator object(s) for deterministic generation.
  • latents (torch.Tensor, optional) — Pre-generated noisy latents.
  • prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings.
  • prompt_attention_mask (torch.Tensor, optional) — Pre-generated attention mask for text embeddings.
  • negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings.
  • negative_prompt_attention_mask (torch.Tensor, optional) — Pre-generated attention mask for negative text embeddings.
  • output_type (str, optional, defaults to "pil") — The output format of the generated video. Choose between "pil", "np", or "latent".
  • return_dict (bool, optional, defaults to True) — Whether or not to return a MotifVideoPipelineOutput instead of a plain tuple.
  • attention_kwargs (dict, optional) — Arguments passed to the attention processor.
  • callback_on_step_end (Callable, optional) — A function or subclass of PipelineCallback or MultiPipelineCallbacks called at the end of each denoising step.
  • callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function.
  • max_sequence_length (int, defaults to 512) — Maximum sequence length for the tokenizer.
  • vae_batch_size (int, optional) — Batch size for VAE decoding. If provided and latents batch size is larger, VAE decoding will be done in chunks.

Returns

MotifVideoPipelineOutput or tuple

If return_dict is True, a MotifVideoPipelineOutput is returned; otherwise, a tuple is returned whose first element is a list of generated video frames.

The call function to the pipeline for text-to-video generation.

Examples:

>>> import torch
>>> from diffusers import MotifVideoPipeline
>>> from diffusers.utils import export_to_video

>>> # Load the Motif-Video pipeline
>>> motif_video_model_id = "Motif-Technologies/Motif-Video-2B"
>>> pipe = MotifVideoPipeline.from_pretrained(motif_video_model_id, torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")

>>> prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage"
>>> negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

>>> video = pipe(
...     prompt=prompt,
...     negative_prompt=negative_prompt,
...     width=1280,
...     height=736,
...     num_frames=121,
...     num_inference_steps=50,
... ).frames[0]
>>> export_to_video(video, "output.mp4", fps=24)
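
A follow-on sketch (continuing from the example above) for seeded, reproducible generation with chunked VAE decoding; note that vae_batch_size only chunks decoding when the latent batch is larger than it, e.g. with num_videos_per_prompt > 1:

>>> generator = torch.Generator(device="cuda").manual_seed(42)
>>> videos = pipe(
...     prompt=prompt,
...     negative_prompt=negative_prompt,
...     num_videos_per_prompt=2,
...     generator=generator,
...     vae_batch_size=1,  # decode one video's latents at a time
... ).frames
>>> export_to_video(videos[0], "output_seed42_a.mp4", fps=24)
>>> export_to_video(videos[1], "output_seed42_b.mp4", fps=24)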

encode_prompt


( prompt: typing.Union[str, typing.List[str]] negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_videos_per_prompt: int = 1 prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None max_sequence_length: int = 512 device: typing.Optional[torch.device] = None dtype: typing.Optional[torch.dtype] = None ) tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]

Parameters

  • prompt (str or List[str], optional) — The prompt or prompts to be encoded.
  • negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
  • num_videos_per_prompt (int, optional, defaults to 1) — Number of videos to generate per prompt.
  • prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings.
  • negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.
  • prompt_attention_mask (torch.Tensor, optional) — Pre-generated attention mask for text embeddings.
  • negative_prompt_attention_mask (torch.Tensor, optional) — Pre-generated attention mask for negative text embeddings.
  • max_sequence_length (int, defaults to 512) — Maximum sequence length for the tokenizer.
  • device (torch.device, optional) — Device to place tensors on.
  • dtype (torch.dtype, optional) — Data type for tensors.

Returns

tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]

A tuple containing:

  • prompt_embeds: The text embeddings for the positive prompt
  • negative_prompt_embeds: The text embeddings for the negative prompt (None if not using guidance)
  • prompt_attention_mask: The attention mask for the positive prompt
  • negative_prompt_attention_mask: The attention mask for the negative prompt (None if not using guidance)

Encodes the prompt into text encoder hidden states.
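
Since encode_prompt returns the 4-tuple documented above, embeddings can be pre-computed once and passed straight to __call__ (a minimal sketch, assuming the pipeline from the earlier examples):

>>> (
...     prompt_embeds,
...     negative_prompt_embeds,
...     prompt_attention_mask,
...     negative_prompt_attention_mask,
... ) = pipe.encode_prompt(
...     prompt="A red fox running through fresh snow",
...     negative_prompt="worst quality, blurry, jittery, distorted",
... )
>>> video = pipe(
...     prompt_embeds=prompt_embeds,
...     prompt_attention_mask=prompt_attention_mask,
...     negative_prompt_embeds=negative_prompt_embeds,
...     negative_prompt_attention_mask=negative_prompt_attention_mask,
... ).frames[0]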

MotifVideoImage2VideoPipeline

class diffusers.MotifVideoImage2VideoPipeline


( scheduler: SchedulerMixin vae: AutoencoderKLWan text_encoder: T5Gemma2Encoder tokenizer: PreTrainedTokenizerBase transformer: MotifVideoTransformer3DModel guider: BaseGuidance feature_extractor: SiglipImageProcessor )

Parameters

  • transformer (MotifVideoTransformer3DModel) — Conditional Transformer architecture to denoise the encoded video latents.
  • scheduler (SchedulerMixin) — A scheduler to be used in combination with transformer to denoise the encoded video latents. Should be an instance of a class inheriting from SchedulerMixin, such as DPMSolverMultistepScheduler. If not provided, uses the scheduler attached to the pretrained model.
  • vae (AutoencoderKLWan) — Variational Auto-Encoder (VAE) Model to encode and decode videos to and from latent representations.
  • text_encoder (T5Gemma2Encoder) — Primary text encoder for encoding text prompts into embeddings.
  • tokenizer (PreTrainedTokenizerBase) — Tokenizer corresponding to the primary text encoder.
  • feature_extractor (SiglipImageProcessor) — Image processor for the SigLIP vision encoder.
  • guider (BaseGuidance) — The guidance method to use. Should be an instance of a class inheriting from BaseGuidance, such as ClassifierFreeGuidance, AdaptiveProjectedGuidance, or SkipLayerGuidance. If not provided, defaults to ClassifierFreeGuidance.

Pipeline for image-to-video generation using Motif-Video with first frame conditioning.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

__call__


( image: PIL.Image.Image | numpy.ndarray | torch.Tensor | list[PIL.Image.Image] | list[numpy.ndarray] | list[torch.Tensor] prompt: typing.Union[str, typing.List[str]] negative_prompt: typing.Union[str, typing.List[str], NoneType] = None height: int = 736 width: int = 1280 num_frames: int = 121 num_inference_steps: int = 50 timesteps: typing.Optional[typing.List[int]] = None num_videos_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 512 ) ~MotifVideoPipelineOutput or tuple

Parameters

  • image (PipelineImageInput) — The input image to use as the first frame for video generation.
  • prompt (str or List[str]) — The prompt or prompts to guide the video generation.
  • negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the video generation.
  • height (int, defaults to 736) — The height in pixels of the generated video.
  • width (int, defaults to 1280) — The width in pixels of the generated video.
  • num_frames (int, defaults to 121) — The number of video frames to generate.
  • num_inference_steps (int, optional, defaults to 50) — The number of denoising steps.
  • timesteps (List[int], optional) — Custom timesteps to use for the denoising process.
  • num_videos_per_prompt (int, optional, defaults to 1) — The number of videos to generate per prompt.
  • generator (torch.Generator or List[torch.Generator], optional) — PyTorch Generator object(s) for deterministic generation.
  • latents (torch.Tensor, optional) — Pre-generated noisy latents.
  • prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings.
  • prompt_attention_mask (torch.Tensor, optional) — Pre-generated attention mask for text embeddings.
  • negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings.
  • negative_prompt_attention_mask (torch.Tensor, optional) — Pre-generated attention mask for negative text embeddings.
  • output_type (str, optional, defaults to "pil") — The output format of the generated video.
  • return_dict (bool, optional, defaults to True) — Whether or not to return a MotifVideoPipelineOutput instead of a plain tuple.
  • attention_kwargs (dict, optional) — Arguments passed to the attention processor.
  • callback_on_step_end (Callable, optional) — A function or subclass of PipelineCallback called at the end of each denoising step.
  • callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function.
  • max_sequence_length (int, defaults to 512) — Maximum sequence length for the tokenizer.

Returns

MotifVideoPipelineOutput or tuple

If return_dict is True, a MotifVideoPipelineOutput is returned; otherwise, a tuple is returned whose first element is a list of generated video frames.

The call function to the pipeline for image-to-video generation.

Examples:

>>> import torch
>>> from PIL import Image
>>> from diffusers import MotifVideoImage2VideoPipeline
>>> from diffusers.utils import export_to_video, load_image

>>> # Load the Motif-Video image-to-video pipeline
>>> motif_video_model_id = "Motif-Technologies/Motif-Video-2B"
>>> pipe = MotifVideoImage2VideoPipeline.from_pretrained(motif_video_model_id, torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")

>>> # Load an image
>>> image = load_image(
...     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.png"
... )

>>> prompt = "An astronaut is walking on the moon surface, kicking up dust with each step"
>>> negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

>>> video = pipe(
...     image=image,
...     prompt=prompt,
...     negative_prompt=negative_prompt,
...     width=1280,
...     height=736,
...     num_frames=121,
...     num_inference_steps=50,
... ).frames[0]
>>> export_to_video(video, "output.mp4", fps=24)

encode_prompt


( prompt: typing.Union[str, typing.List[str]] negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_videos_per_prompt: int = 1 prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None max_sequence_length: int = 512 device: typing.Optional[torch.device] = None dtype: typing.Optional[torch.dtype] = None ) tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]

Parameters

  • prompt (str or List[str], optional) — The prompt or prompts to be encoded.
  • negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
  • num_videos_per_prompt (int, optional, defaults to 1) — Number of videos to generate per prompt.
  • prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings.
  • negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.
  • prompt_attention_mask (torch.Tensor, optional) — Pre-generated attention mask for text embeddings.
  • negative_prompt_attention_mask (torch.Tensor, optional) — Pre-generated attention mask for negative text embeddings.
  • max_sequence_length (int, defaults to 512) — Maximum sequence length for the tokenizer.
  • device (torch.device, optional) — Device to place tensors on.
  • dtype (torch.dtype, optional) — Data type for tensors.

Returns

tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]

A tuple containing:

  • prompt_embeds: The text embeddings for the positive prompt
  • negative_prompt_embeds: The text embeddings for the negative prompt (None if not using guidance)
  • prompt_attention_mask: The attention mask for the positive prompt
  • negative_prompt_attention_mask: The attention mask for the negative prompt (None if not using guidance)

Encodes the prompt into text encoder hidden states.

MotifVideoPipelineOutput

class diffusers.MotifVideoPipelineOutput


( frames: Tensor )

Parameters

  • frames (torch.Tensor, np.ndarray, or List[List[PIL.Image.Image]]) — List of video outputs. It can be a nested list of length batch_size, with each sub-list containing denoised PIL image sequences of length num_frames, or a NumPy array or Torch tensor of shape (batch_size, num_frames, channels, height, width).

Output class for Motif-Video pipelines.
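
A short sketch of consuming the output (assumes the pipe and prompt from the examples above; shapes follow the frames description):

out = pipe(prompt=prompt)  # MotifVideoPipelineOutput when return_dict=True
video = out.frames[0]      # with output_type="pil": a list of num_frames PIL images
print(len(video))          # 121 with the default num_frames

# With output_type="np", frames is an array of shape
# (batch_size, num_frames, channels, height, width) per the description above.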
