We explore the Reward Backpropagation technique [1, 2] to optimize the videos generated by CogVideoX-Fun-V1.5 for better alignment with human preferences. We provide the following pre-trained models (i.e., LoRAs) along with the training script. You can use these LoRAs as plug-ins to enhance the corresponding base model, or train your own reward LoRA.
For more details, please refer to our GitHub repo.
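To make the idea concrete, here is a deliberately tiny sketch of reward backpropagation: the gradient of a differentiable reward is propagated back through the generator, and only the LoRA factors are updated while the base weight stays frozen. Everything in it is a hypothetical stand-in (a scalar "generator", a quadratic "reward", hand-written gradients); it is not the training script from the repo and does not use HPS v2.1 or MPS.

```python
# Toy reward backpropagation with a frozen base weight and a trainable
# rank-1 LoRA-style update. All names and values are hypothetical.

def generate(w_base, a, b, z):
    """Toy 'generator': output produced from input z by the adapted weight."""
    return (w_base + a * b) * z


def reward(x, target):
    """Differentiable stand-in for a learned reward model."""
    return -(x - target) ** 2


def train_reward_lora(w_base=1.0, z=1.0, target=2.0, lr=0.1, steps=200):
    # LoRA factors: b starts at zero, so the adapter is initially a no-op.
    a, b = 0.5, 0.0
    for _ in range(steps):
        x = generate(w_base, a, b, z)
        d_reward_dx = -2.0 * (x - target)   # gradient of the reward w.r.t. the output
        grad_a = d_reward_dx * b * z        # chain rule through the generator
        grad_b = d_reward_dx * a * z
        # Gradient ascent on the reward; w_base is never modified.
        a, b = a + lr * grad_a, b + lr * grad_b
    return a, b


a, b = train_reward_lora()
reward_before = reward(generate(1.0, 0.5, 0.0, 1.0), 2.0)
reward_after = reward(generate(1.0, a, b, 1.0), 2.0)
print(f"reward before: {reward_before:.4f}, after: {reward_after:.6f}")
```

In the real setting the generator is the video diffusion model, the reward is a pretrained preference model scoring decoded frames, and the gradients come from automatic differentiation rather than hand-derived formulas.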
Name | Base Model | Reward Model | Hugging Face | Description |
---|---|---|---|---|
CogVideoX-Fun-V1.5-5b-InP-HPS2.1.safetensors | CogVideoX-Fun-V1.5-5b | HPS v2.1 | 🤗Link | Official HPS v2.1 reward LoRA (rank=128 and network_alpha=64) for CogVideoX-Fun-V1.5-5b-InP. It is trained with a batch size of 8 for 1,500 steps. |
CogVideoX-Fun-V1.5-5b-InP-MPS.safetensors | CogVideoX-Fun-V1.5-5b | MPS | 🤗Link | Official MPS reward LoRA (rank=128 and network_alpha=64) for CogVideoX-Fun-V1.5-5b-InP. It is trained with a batch size of 8 for 5,500 steps. |
Prompt | CogVideoX-Fun-V1.5-5B | CogVideoX-Fun-V1.5-5B HPSv2.1 Reward LoRA | CogVideoX-Fun-V1.5-5B MPS Reward LoRA |
---|---|---|---|
A panda eats bamboo while a monkey swings from branch to branch | | | |
A penguin waddles on the ice, a camel treks by | | | |
Elderly artist with a white beard painting on a white canvas | | | |
Crystal cake shimmering beside a metal apple | | | |
The above test prompts are from T2V-CompBench. All videos are generated with a LoRA weight of 0.7.
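The LoRA weight combines with the `rank` and `network_alpha` values listed in the table. The sketch below assumes the common merging convention in which the low-rank update is scaled by `network_alpha / rank` and then by the LoRA weight before being added to the base weight; whether the repo's merge utility follows exactly this formula is an assumption, and the helper names and toy matrices are hypothetical.

```python
def effective_scale(rank, network_alpha, lora_weight):
    """Scale applied to the low-rank update under the assumed convention."""
    return lora_weight * network_alpha / rank


def merge_lora_matrix(w, up, down, network_alpha, lora_weight):
    """Return w + scale * (up @ down) for plain nested-list matrices.

    up has shape (d_out, rank) and down has shape (rank, d_in).
    """
    rank = len(down)
    scale = effective_scale(rank, network_alpha, lora_weight)
    return [
        [
            w[i][j] + scale * sum(up[i][k] * down[k][j] for k in range(rank))
            for j in range(len(w[0]))
        ]
        for i in range(len(w))
    ]


# Settings from the table (rank=128, network_alpha=64) with LoRA weight 0.7.
scale = effective_scale(rank=128, network_alpha=64, lora_weight=0.7)
print(f"effective scale: {scale:.2f}")

# Toy rank-1 example with the same alpha/rank ratio of 0.5.
w = [[1.0, 0.0], [0.0, 1.0]]
up = [[1.0], [0.0]]
down = [[0.0, 2.0]]
merged = merge_lora_matrix(w, up, down, network_alpha=0.5, lora_weight=0.7)
print(merged)
```

Under this assumption, a LoRA weight of 0.7 applies the learned update at 0.35 of its raw magnitude, so lowering the weight moves the output back toward the base model.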
We provide simple inference code for running CogVideoX-Fun-V1.5-5b-InP with its HPS v2.1 reward LoRA.
```python
import torch
from diffusers import CogVideoXDDIMScheduler

from cogvideox.models.transformer3d import CogVideoXTransformer3DModel
from cogvideox.pipeline.pipeline_cogvideox_inpaint import CogVideoX_Fun_Pipeline_Inpaint
from cogvideox.utils.lora_utils import merge_lora
from cogvideox.utils.utils import get_image_to_video_latent, save_videos_grid

model_path = "alibaba-pai/CogVideoX-Fun-V1.5-5b-InP"
# Path to the reward LoRA weights (assumed to resolve to a downloaded .safetensors file).
lora_path = "alibaba-pai/CogVideoX-Fun-V1.5-Reward-LoRAs/CogVideoX-Fun-V1.5-5b-InP-HPS2.1.safetensors"
lora_weight = 0.7

prompt = "Pig with wings flying above a diamond mountain"
sample_size = [512, 512]  # [height, width]
video_length = 85         # number of frames

# Load the transformer and scheduler, then assemble the inpaint pipeline in bfloat16.
transformer = CogVideoXTransformer3DModel.from_pretrained_2d(
    model_path, subfolder="transformer"
).to(torch.bfloat16)
scheduler = CogVideoXDDIMScheduler.from_pretrained(model_path, subfolder="scheduler")
pipeline = CogVideoX_Fun_Pipeline_Inpaint.from_pretrained(
    model_path, transformer=transformer, scheduler=scheduler, torch_dtype=torch.bfloat16
)
pipeline.enable_model_cpu_offload()

# Merge the reward LoRA into the pipeline weights.
pipeline = merge_lora(pipeline, lora_path, lora_weight)

generator = torch.Generator(device="cuda").manual_seed(42)

# No start or end image is given, so the pipeline runs as text-to-video.
input_video, input_video_mask, _ = get_image_to_video_latent(
    None, None, video_length=video_length, sample_size=sample_size
)

sample = pipeline(
    prompt,
    num_frames=video_length,
    negative_prompt="bad detailed",
    height=sample_size[0],
    width=sample_size[1],
    generator=generator,
    guidance_scale=7.0,
    num_inference_steps=50,
    video=input_video,
    mask_video=input_video_mask,
).videos

save_videos_grid(sample, "samples/output.mp4", fps=8)
```