Motif-Video
Motif-Video is a 2B-parameter diffusion transformer for text-to-video and image-to-video generation. It features a three-stage architecture (12 dual-stream, 16 single-stream, and 8 DDT decoder layers), Shared Cross-Attention for stable text-video alignment over long video sequences, a T5Gemma2 text encoder, and rectified flow matching for velocity prediction.
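The pipeline wires these pieces together as standard diffusers components, so they can be inspected after loading. A minimal sketch (the expected keys follow the class reference further down this page):

```python
import torch
from diffusers import MotifVideoPipeline

pipe = MotifVideoPipeline.from_pretrained(
    "Motif-Technologies/Motif-Video-2B", torch_dtype=torch.bfloat16
)

# DiffusionPipeline exposes its registered modules as a dict; expected keys
# include "transformer", "vae", "text_encoder", "tokenizer", "scheduler",
# and "guider" (see the MotifVideoPipeline reference below).
print(sorted(pipe.components.keys()))
```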

Text-to-Video Generation
Use MotifVideoPipeline for text-to-video generation. The defaults below (121 frames at 24 fps) yield a roughly five-second 1280×736 clip:
```python
import torch
from diffusers import MotifVideoPipeline
from diffusers.utils import export_to_video

pipe = MotifVideoPipeline.from_pretrained(
    "Motif-Technologies/Motif-Video-2B",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair."
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

video = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=1280,
    height=736,
    num_frames=121,
    num_inference_steps=50,
).frames[0]
export_to_video(video, "output.mp4", fps=24)
```

Image-to-Video Generation
Use MotifVideoImage2VideoPipeline for image-to-video generation, conditioning on the input image as the first frame:
```python
import torch
from diffusers import MotifVideoImage2VideoPipeline
from diffusers.utils import export_to_video, load_image

pipe = MotifVideoImage2VideoPipeline.from_pretrained(
    "Motif-Technologies/Motif-Video-2B",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

image = load_image("input_image.png")
prompt = "A cinematic scene with vivid colors."
negative_prompt = "worst quality, blurry, jittery, distorted"

video = pipe(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=1280,
    height=736,
    num_frames=121,
    num_inference_steps=50,
).frames[0]
export_to_video(video, "i2v_output.mp4", fps=24)
```

Memory-efficient Inference
For GPUs with less than 30 GB of VRAM (e.g., RTX 4090), enable expandable CUDA memory segments and use model CPU offloading:
```bash
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```

```python
import torch
from diffusers import MotifVideoPipeline
from diffusers.utils import export_to_video

pipe = MotifVideoPipeline.from_pretrained(
    "Motif-Technologies/Motif-Video-2B",
    torch_dtype=torch.bfloat16,
)
# Keeps only the active model on the GPU; the others stay in CPU RAM.
pipe.enable_model_cpu_offload()

prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair."
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

video = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=1280,
    height=736,
    num_frames=121,
    num_inference_steps=50,
).frames[0]
export_to_video(video, "output.mp4", fps=24)
```
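If even model offloading does not fit, two further knobs may help. A sketch, assuming the Wan VAE used here supports tiled decoding (as `AutoencoderKLWan` does in recent diffusers releases):

```python
# Decode the video latents in spatial tiles instead of a single pass.
pipe.vae.enable_tiling()

# Or trade more speed for memory by offloading individual submodules
# rather than whole models:
# pipe.enable_sequential_cpu_offload()
```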
MotifVideoPipeline
class diffusers.MotifVideoPipeline
< source >( scheduler: SchedulerMixin vae: AutoencoderKLWan text_encoder: T5Gemma2Encoder tokenizer: PreTrainedTokenizerBase transformer: MotifVideoTransformer3DModel guider: BaseGuidance feature_extractor: typing.Optional[transformers.models.siglip.image_processing_siglip.SiglipImageProcessor] = None )
Parameters
- transformer (MotifVideoTransformer3DModel) — Conditional Transformer architecture to denoise the encoded video latents.
- scheduler (SchedulerMixin) — A scheduler to be used in combination with `transformer` to denoise the encoded video latents. Should be an instance of a class inheriting from `SchedulerMixin`, such as DPMSolverMultistepScheduler. If not provided, uses the scheduler attached to the pretrained model.
- vae (AutoencoderKLWan) — Variational Auto-Encoder (VAE) model to encode and decode videos to and from latent representations.
- text_encoder (`T5Gemma2Encoder`) — Primary text encoder for encoding text prompts into embeddings.
- tokenizer (`PreTrainedTokenizerBase`) — Tokenizer corresponding to the primary text encoder.
- guider (BaseGuidance) — The guidance method to use. Should be an instance of a class inheriting from `BaseGuidance`, such as ClassifierFreeGuidance, AdaptiveProjectedGuidance, or SkipLayerGuidance. If not provided, defaults to `ClassifierFreeGuidance` (see the sketch below for overriding it).
- feature_extractor (`SiglipImageProcessor`, optional) — Image processor for the SigLIP vision encoder.
Pipeline for text-to-video generation using Motif-Video.
This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).
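As noted for the `guider` parameter above, the guidance method is a swappable pipeline component. A hypothetical sketch of overriding it at load time, assuming `ClassifierFreeGuidance` is importable from the top-level `diffusers` namespace as in recent releases:

```python
import torch
from diffusers import ClassifierFreeGuidance, MotifVideoPipeline

# Components passed as keyword arguments to from_pretrained override
# the ones stored with the checkpoint.
pipe = MotifVideoPipeline.from_pretrained(
    "Motif-Technologies/Motif-Video-2B",
    guider=ClassifierFreeGuidance(guidance_scale=5.0),
    torch_dtype=torch.bfloat16,
)
```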
__call__
< source >( prompt: typing.Union[str, typing.List[str], NoneType] = None negative_prompt: typing.Union[str, typing.List[str], NoneType] = None height: int = 736 width: int = 1280 num_frames: int = 121 num_inference_steps: int = 50 timesteps: typing.Optional[typing.List[int]] = None num_videos_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 512 vae_batch_size: int | None = None ) → ~MotifVideoPipelineOutput or tuple
Parameters
- prompt (`str` or `List[str]`, optional) — The prompt or prompts to guide the video generation. If not defined, one has to pass `prompt_embeds`.
- negative_prompt (`str` or `List[str]`, optional) — The prompt or prompts not to guide the video generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance.
- height (`int`, defaults to `736`) — The height in pixels of the generated video.
- width (`int`, defaults to `1280`) — The width in pixels of the generated video.
- num_frames (`int`, defaults to `121`) — The number of video frames to generate.
- num_inference_steps (`int`, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality video at the expense of slower inference.
- timesteps (`List[int]`, optional) — Custom timesteps to use for the denoising process.
- num_videos_per_prompt (`int`, optional, defaults to 1) — The number of videos to generate per prompt.
- generator (`torch.Generator` or `List[torch.Generator]`, optional) — PyTorch Generator object(s) for deterministic generation.
- latents (`torch.Tensor`, optional) — Pre-generated noisy latents.
- prompt_embeds (`torch.Tensor`, optional) — Pre-generated text embeddings.
- prompt_attention_mask (`torch.Tensor`, optional) — Pre-generated attention mask for text embeddings.
- negative_prompt_embeds (`torch.FloatTensor`, optional) — Pre-generated negative text embeddings.
- negative_prompt_attention_mask (`torch.FloatTensor`, optional) — Pre-generated attention mask for negative text embeddings.
- output_type (`str`, optional, defaults to `"pil"`) — The output format of the generated video. Choose between `"pil"`, `"np"`, or `"latent"`.
- return_dict (`bool`, optional, defaults to `True`) — Whether or not to return a `~MotifVideoPipelineOutput` instead of a plain tuple.
- attention_kwargs (`dict`, optional) — Arguments passed to the attention processor.
- callback_on_step_end (`Callable`, optional) — A function or subclass of `PipelineCallback` or `MultiPipelineCallbacks` called at the end of each denoising step (a sketch follows the example below).
- callback_on_step_end_tensor_inputs (`List`, optional) — The list of tensor inputs for the `callback_on_step_end` function.
- max_sequence_length (`int`, defaults to `512`) — Maximum sequence length for the tokenizer.
- vae_batch_size (`int`, optional) — Batch size for VAE decoding. If provided and the latents batch size is larger, VAE decoding is done in chunks.
Returns
~MotifVideoPipelineOutput or tuple
If `return_dict` is `True`, a `~MotifVideoPipelineOutput` is returned; otherwise a tuple is returned where the first element is a list of generated video frames.
The call function to the pipeline for text-to-video generation.
Examples:
```python
>>> import torch
>>> from diffusers import MotifVideoPipeline
>>> from diffusers.utils import export_to_video

>>> # Load the Motif-Video pipeline
>>> motif_video_model_id = "Motif-Technologies/Motif-Video-2B"
>>> pipe = MotifVideoPipeline.from_pretrained(motif_video_model_id, torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")

>>> prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage."
>>> negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

>>> video = pipe(
...     prompt=prompt,
...     negative_prompt=negative_prompt,
...     width=1280,
...     height=736,
...     num_frames=121,
...     num_inference_steps=50,
... ).frames[0]
>>> export_to_video(video, "output.mp4", fps=24)
```
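A sketch of the step-end callback hook referenced in the parameters above, following the standard diffusers callback convention (the function name is illustrative, and `pipe` and `prompt` are assumed loaded as in the example):

```python
def log_step(pipe, step_index, timestep, callback_kwargs):
    # Inspect whichever tensors were requested via
    # callback_on_step_end_tensor_inputs.
    latents = callback_kwargs["latents"]
    print(f"step {step_index}: latent std {latents.std().item():.4f}")
    return callback_kwargs

video = pipe(
    prompt=prompt,
    callback_on_step_end=log_step,
    callback_on_step_end_tensor_inputs=["latents"],
).frames[0]
```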
encode_prompt
< source >( prompt: typing.Union[str, typing.List[str]] negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_videos_per_prompt: int = 1 prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None max_sequence_length: int = 512 device: typing.Optional[torch.device] = None dtype: typing.Optional[torch.dtype] = None ) → tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]
Parameters
- prompt (`str` or `List[str]`) — The prompt or prompts to be encoded.
- negative_prompt (`str` or `List[str]`, optional) — The prompt or prompts not to guide the video generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).
- num_videos_per_prompt (`int`, optional, defaults to 1) — Number of videos to generate per prompt.
- prompt_embeds (`torch.Tensor`, optional) — Pre-generated text embeddings.
- negative_prompt_embeds (`torch.Tensor`, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, they will be generated from the `negative_prompt` input argument.
- prompt_attention_mask (`torch.Tensor`, optional) — Pre-generated attention mask for text embeddings.
- negative_prompt_attention_mask (`torch.Tensor`, optional) — Pre-generated attention mask for negative text embeddings.
- max_sequence_length (`int`, defaults to 512) — Maximum sequence length for the tokenizer.
- device (`torch.device`, optional) — Device to place tensors on.
- dtype (`torch.dtype`, optional) — Data type for tensors.
Returns
tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]
A tuple containing:
- `prompt_embeds`: The text embeddings for the positive prompt
- `negative_prompt_embeds`: The text embeddings for the negative prompt (`None` if not using guidance)
- `prompt_attention_mask`: The attention mask for the positive prompt
- `negative_prompt_attention_mask`: The attention mask for the negative prompt (`None` if not using guidance)
Encodes the prompt into text encoder hidden states.
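A sketch of pre-computing embeddings once and reusing them across several calls, assuming `pipe` is loaded as in the example above; the four-tuple unpacking follows the return description, and the keyword names match the `__call__` parameters:

```python
(
    prompt_embeds,
    negative_prompt_embeds,
    prompt_attention_mask,
    negative_prompt_attention_mask,
) = pipe.encode_prompt(
    prompt="A cinematic aerial shot of a rugged coastline at sunset.",
    negative_prompt="worst quality, blurry, jittery, distorted",
    device=pipe.device,
)

# Reuse the cached embeddings without re-running the text encoder.
for seed in (0, 1):
    video = pipe(
        prompt_embeds=prompt_embeds,
        negative_prompt_embeds=negative_prompt_embeds,
        prompt_attention_mask=prompt_attention_mask,
        negative_prompt_attention_mask=negative_prompt_attention_mask,
        generator=torch.Generator("cuda").manual_seed(seed),
    ).frames[0]
```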
MotifVideoImage2VideoPipeline
class diffusers.MotifVideoImage2VideoPipeline
< source >( scheduler: SchedulerMixin vae: AutoencoderKLWan text_encoder: T5Gemma2Encoder tokenizer: PreTrainedTokenizerBase transformer: MotifVideoTransformer3DModel guider: BaseGuidance feature_extractor: SiglipImageProcessor )
Parameters
- transformer (MotifVideoTransformer3DModel) — Conditional Transformer architecture to denoise the encoded video latents.
- scheduler (SchedulerMixin) — A scheduler to be used in combination with `transformer` to denoise the encoded video latents. Should be an instance of a class inheriting from `SchedulerMixin`, such as DPMSolverMultistepScheduler. If not provided, uses the scheduler attached to the pretrained model.
- vae (AutoencoderKLWan) — Variational Auto-Encoder (VAE) model to encode and decode videos to and from latent representations.
- text_encoder (`T5Gemma2Encoder`) — Primary text encoder for encoding text prompts into embeddings.
- tokenizer (`PreTrainedTokenizerBase`) — Tokenizer corresponding to the primary text encoder.
- feature_extractor (`SiglipImageProcessor`) — Image processor for the SigLIP vision encoder.
- guider (BaseGuidance) — The guidance method to use. Should be an instance of a class inheriting from `BaseGuidance`, such as ClassifierFreeGuidance, AdaptiveProjectedGuidance, or SkipLayerGuidance. If not provided, defaults to `ClassifierFreeGuidance`.
Pipeline for image-to-video generation using Motif-Video with first frame conditioning.
This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).
__call__
< source >( image: PIL.Image.Image | numpy.ndarray | torch.Tensor | list[PIL.Image.Image] | list[numpy.ndarray] | list[torch.Tensor] prompt: typing.Union[str, typing.List[str]] negative_prompt: typing.Union[str, typing.List[str], NoneType] = None height: int = 736 width: int = 1280 num_frames: int = 121 num_inference_steps: int = 50 timesteps: typing.Optional[typing.List[int]] = None num_videos_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 512 ) → ~MotifVideoPipelineOutput or tuple
Parameters
- image (`PipelineImageInput`) — The input image to use as the first frame for video generation.
- prompt (`str` or `List[str]`) — The prompt or prompts to guide the video generation.
- negative_prompt (`str` or `List[str]`, optional) — The prompt or prompts not to guide the video generation.
- height (`int`, defaults to `736`) — The height in pixels of the generated video.
- width (`int`, defaults to `1280`) — The width in pixels of the generated video.
- num_frames (`int`, defaults to `121`) — The number of video frames to generate.
- num_inference_steps (`int`, optional, defaults to 50) — The number of denoising steps.
- timesteps (`List[int]`, optional) — Custom timesteps to use for the denoising process.
- num_videos_per_prompt (`int`, optional, defaults to 1) — The number of videos to generate per prompt.
- generator (`torch.Generator` or `List[torch.Generator]`, optional) — PyTorch Generator object(s) for deterministic generation.
- latents (`torch.Tensor`, optional) — Pre-generated noisy latents.
- prompt_embeds (`torch.Tensor`, optional) — Pre-generated text embeddings.
- prompt_attention_mask (`torch.Tensor`, optional) — Pre-generated attention mask for text embeddings.
- negative_prompt_embeds (`torch.FloatTensor`, optional) — Pre-generated negative text embeddings.
- negative_prompt_attention_mask (`torch.FloatTensor`, optional) — Pre-generated attention mask for negative text embeddings.
- output_type (`str`, optional, defaults to `"pil"`) — The output format of the generated video.
- return_dict (`bool`, optional, defaults to `True`) — Whether or not to return a `~MotifVideoPipelineOutput` instead of a plain tuple.
- attention_kwargs (`dict`, optional) — Arguments passed to the attention processor.
- callback_on_step_end (`Callable`, optional) — A function or subclass of `PipelineCallback` called at the end of each denoising step.
- callback_on_step_end_tensor_inputs (`List`, optional) — The list of tensor inputs for the `callback_on_step_end` function.
- max_sequence_length (`int`, defaults to `512`) — Maximum sequence length for the tokenizer.
Returns
~MotifVideoPipelineOutput or tuple
If `return_dict` is `True`, a `~MotifVideoPipelineOutput` is returned; otherwise a tuple is returned where the first element is a list of generated video frames.
The call function to the pipeline for image-to-video generation.
Examples:
```python
>>> import torch
>>> from diffusers import MotifVideoImage2VideoPipeline
>>> from diffusers.utils import export_to_video, load_image

>>> # Load the Motif-Video image-to-video pipeline
>>> motif_video_model_id = "Motif-Technologies/Motif-Video-2B"
>>> pipe = MotifVideoImage2VideoPipeline.from_pretrained(motif_video_model_id, torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")

>>> # Load the conditioning image (used as the first frame)
>>> image = load_image(
...     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.png"
... )

>>> prompt = "An astronaut is walking on the moon surface, kicking up dust with each step"
>>> negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

>>> video = pipe(
...     image=image,
...     prompt=prompt,
...     negative_prompt=negative_prompt,
...     width=1280,
...     height=736,
...     num_frames=121,
...     num_inference_steps=50,
... ).frames[0]
>>> export_to_video(video, "output.mp4", fps=24)
```

encode_prompt
< source >( prompt: typing.Union[str, typing.List[str]] negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_videos_per_prompt: int = 1 prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None max_sequence_length: int = 512 device: typing.Optional[torch.device] = None dtype: typing.Optional[torch.dtype] = None ) → tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]
Parameters
- prompt (`str` or `List[str]`) — The prompt or prompts to be encoded.
- negative_prompt (`str` or `List[str]`, optional) — The prompt or prompts not to guide the video generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).
- num_videos_per_prompt (`int`, optional, defaults to 1) — Number of videos to generate per prompt.
- prompt_embeds (`torch.Tensor`, optional) — Pre-generated text embeddings.
- negative_prompt_embeds (`torch.Tensor`, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, they will be generated from the `negative_prompt` input argument.
- prompt_attention_mask (`torch.Tensor`, optional) — Pre-generated attention mask for text embeddings.
- negative_prompt_attention_mask (`torch.Tensor`, optional) — Pre-generated attention mask for negative text embeddings.
- max_sequence_length (`int`, defaults to 512) — Maximum sequence length for the tokenizer.
- device (`torch.device`, optional) — Device to place tensors on.
- dtype (`torch.dtype`, optional) — Data type for tensors.
Returns
tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]
A tuple containing:
- `prompt_embeds`: The text embeddings for the positive prompt
- `negative_prompt_embeds`: The text embeddings for the negative prompt (`None` if not using guidance)
- `prompt_attention_mask`: The attention mask for the positive prompt
- `negative_prompt_attention_mask`: The attention mask for the negative prompt (`None` if not using guidance)
Encodes the prompt into text encoder hidden states.
MotifVideoPipelineOutput
class diffusers.MotifVideoPipelineOutput
< source >( frames: Tensor )
Parameters
- frames (`torch.Tensor`, `np.ndarray`, or `List[List[PIL.Image.Image]]`) — List of video outputs. It can be a nested list of length `batch_size`, with each sub-list containing denoised PIL image sequences of length `num_frames`. It can also be a NumPy array or Torch tensor of shape `(batch_size, num_frames, channels, height, width)`.
Output class for Motif-Video pipelines.
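A short sketch of consuming the output, assuming `pipe` and `prompt` are set up as in the examples above; the default `output_type="pil"` yields the nested-list form described here:

```python
out = pipe(prompt=prompt)
video = out.frames[0]  # list of PIL.Image frames for the first prompt
print(len(video), video[0].size)  # num_frames, (width, height)

# With return_dict=False, the frames are the first element of a plain tuple:
frames = pipe(prompt=prompt, return_dict=False)[0]
```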