Pixel Text-to-Video Generation

This repository contains the steps and scripts needed to generate videos with the Pixel text-to-video model. The model combines LoRA (Low-Rank Adaptation) weights with pre-trained Wan 2.1 components to create pixel-art-style videos from text prompts.

Prerequisites

Before proceeding, ensure that you have the following installed on your system:

• Ubuntu (or a compatible Linux distribution)
• Python 3.x
• pip (Python package manager)
• Git
• Git LFS (Git Large File Storage)
• FFmpeg
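
The prerequisite list can be checked before installation with a short script. This is a minimal sketch, not part of the repository: the helper name find_missing_tools is hypothetical, and the command names (for example git-lfs for Git LFS) are assumptions about how these tools are usually exposed on PATH.

```python
import shutil

# Command names are assumptions; "git-lfs" is the binary Git LFS usually installs.
REQUIRED_TOOLS = ["python3", "pip", "git", "git-lfs", "ffmpeg"]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All prerequisite tools were found on PATH.")
```

An empty result only means the commands exist; it does not check their versions.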

Installation

Update and Install Dependencies

sudo apt-get update && sudo apt-get install cbm git-lfs ffmpeg

Clone the Repository

git clone https://huggingface.co/svjack/Pixel_wan_2_1_1_3_B_text2video_lora
cd Pixel_wan_2_1_1_3_B_text2video_lora

Install Python Dependencies

pip install torch torchvision
pip install -r requirements.txt
pip install ascii-magic matplotlib tensorboard huggingface_hub datasets
pip install moviepy==1.0.3
pip install sageattention==1.0.6

Download Model Weights

wget https://huggingface.co/Wan-AI/Wan2.1-T2V-14B/resolve/main/models_t5_umt5-xxl-enc-bf16.pth
wget https://huggingface.co/DeepBeepMeep/Wan2.1/resolve/main/models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth
wget https://huggingface.co/Wan-AI/Wan2.1-T2V-14B/resolve/main/Wan2.1_VAE.pth
wget https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/diffusion_models/wan2.1_t2v_1.3B_bf16.safetensors
wget https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/diffusion_models/wan2.1_t2v_14B_bf16.safetensors
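
Because the weight files are large, an interrupted download is easy to miss. The sketch below checks that each file named in the wget commands above is present and non-empty. It assumes the files were saved to the current directory (the wget default); the helper name missing_weights is hypothetical.

```python
from pathlib import Path

# File names are taken from the wget commands above.
WEIGHT_FILES = [
    "models_t5_umt5-xxl-enc-bf16.pth",
    "models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth",
    "Wan2.1_VAE.pth",
    "wan2.1_t2v_1.3B_bf16.safetensors",
    "wan2.1_t2v_14B_bf16.safetensors",
]


def missing_weights(directory=".", names=WEIGHT_FILES):
    """Return the weight files that are absent or empty in `directory`."""
    base = Path(directory)
    missing = []
    for name in names:
        path = base / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    for name in missing_weights():
        print("Missing or empty:", name)
```

This only detects absent or zero-byte files; it does not verify checksums.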

Usage

To generate a video, use the wan_generate_video.py script with the appropriate parameters. Below are examples of how to generate videos using the Pixel model.

Woods

python wan_generate_video.py --fp8 --task t2v-1.3B --video_size 768 1024 --video_length 81 --infer_steps 20 \
--save_path save --output_type both \
--dit wan2.1_t2v_1.3B_bf16.safetensors --vae Wan2.1_VAE.pth \
--t5 models_t5_umt5-xxl-enc-bf16.pth \
--attn_mode torch \
--lora_weight pixel_outputs/pixel_w1_3_lora-000010.safetensors \
--lora_multiplier 1.0 \
--prompt "The video showcases a pixel art scene from a video game. Golden light filters through the canopy, illuminating soft moss and fallen leaves. Wildflowers bloom nearby, and glowing fireflies hover in the air. A gentle stream flows in the background, its murmur blending with birdsong. The scene radiates tranquility and natural charm."

Castle

python wan_generate_video.py --fp8 --task t2v-1.3B --video_size 768 1024 --video_length 81 --infer_steps 20 \
--save_path save --output_type both \
--dit wan2.1_t2v_1.3B_bf16.safetensors --vae Wan2.1_VAE.pth \
--t5 models_t5_umt5-xxl-enc-bf16.pth \
--attn_mode torch \
--lora_weight pixel_outputs/pixel_w1_3_lora-000010.safetensors \
--lora_multiplier 1.0 \
--prompt "The video showcases a pixel art scene from a video game. The video shifts to a majestic castle under a starry sky. Silvery moonlight bathes the ancient stone walls, casting soft shadows on the surrounding landscape. Towering spires rise into the night, their peaks adorned with glowing orbs that mimic the stars above. A tranquil moat reflects the shimmering heavens, its surface rippling gently in the cool breeze. Fireflies dance around the castle’s ivy-covered arches, adding a touch of magic to the scene. In the distance, a faint aurora paints the horizon with hues of green and purple, blending seamlessly with the celestial tapestry. The scene exudes an aura of timeless wonder and serene beauty."

City

python wan_generate_video.py --fp8 --task t2v-1.3B --video_size 768 1024 --video_length 81 --infer_steps 20 \
--save_path save --output_type both \
--dit wan2.1_t2v_1.3B_bf16.safetensors --vae Wan2.1_VAE.pth \
--t5 models_t5_umt5-xxl-enc-bf16.pth \
--attn_mode torch \
--lora_weight pixel_outputs/pixel_w1_3_lora-000010.safetensors \
--lora_multiplier 1.0 \
--prompt "The video showcases a pixel art scene from a video game. The video showcases a vibrant urban landscape. The city skyline is dominated by towering skyscrapers, their glass facades reflecting the sunlight. The streets are bustling with activity, filled with cars, buses, and pedestrians. Parks and green spaces are scattered throughout, offering a refreshing contrast to the concrete jungle. The architecture is a mix of modern and historic buildings, each telling a story of the city's evolution. The overall scene is alive with energy, capturing the essence of urban life."

Girl

python wan_generate_video.py --fp8 --task t2v-1.3B --video_size 768 1024 --video_length 81 --infer_steps 20 \
--save_path save --output_type both \
--dit wan2.1_t2v_1.3B_bf16.safetensors --vae Wan2.1_VAE.pth \
--t5 models_t5_umt5-xxl-enc-bf16.pth \
--attn_mode torch \
--lora_weight pixel_outputs/pixel_w1_3_lora-000010.safetensors \
--lora_multiplier 1.0 \
--prompt "The video showcases a pixel art scene from a video game. The video showcases an animation featuring a charming anime-style scene with a pink-haired girl with angel wings. She's seated at a desk, enjoying a donut while working on a laptop. The setting is a cozy, pastel-colored room with a pink chair, a milk carton, and a coffee cup. The girl's expression is one of delight as she savors her treat."

Squirrel

python wan_generate_video.py --fp8 --task t2v-1.3B --video_size 768 1024 --video_length 81 --infer_steps 20 \
--save_path save --output_type both \
--dit wan2.1_t2v_1.3B_bf16.safetensors --vae Wan2.1_VAE.pth \
--t5 models_t5_umt5-xxl-enc-bf16.pth \
--attn_mode torch \
--lora_weight pixel_outputs/pixel_w1_3_lora-000010.safetensors \
--lora_multiplier 1.0 \
--prompt "The video showcases a pixel art scene from a video game. The video showcases an animation featuring a vibrant and lively forest scene. The scene is centered around a curious squirrel perched on a moss-covered tree branch, its bushy tail flicking with excitement. The squirrel holds a glossy brown chestnut in its tiny paws, nibbling intently as its whiskers twitch. Surrounding the squirrel are lush green leaves and dappled sunlight filtering through the canopy, creating a warm and inviting atmosphere. A few scattered chestnuts and acorns lie on the forest floor, hinting at the squirrel’s recent foraging. The atmosphere is playful and charming, evoking a sense of nature’s simple joys and the industrious spirit of woodland creatures."
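
The five examples above differ only in the prompt, so they can be driven from one small wrapper. The following is a sketch under that assumption: the flags and file names are copied from the commands above, while build_command, run_all, BASE_ARGS, and SHARED_PREFIX are hypothetical names that do not exist in the repository.

```python
import subprocess

# Flags and file names are copied from the example commands above.
BASE_ARGS = [
    "python", "wan_generate_video.py", "--fp8",
    "--task", "t2v-1.3B",
    "--video_size", "768", "1024",
    "--video_length", "81",
    "--infer_steps", "20",
    "--save_path", "save",
    "--output_type", "both",
    "--dit", "wan2.1_t2v_1.3B_bf16.safetensors",
    "--vae", "Wan2.1_VAE.pth",
    "--t5", "models_t5_umt5-xxl-enc-bf16.pth",
    "--attn_mode", "torch",
    "--lora_weight", "pixel_outputs/pixel_w1_3_lora-000010.safetensors",
    "--lora_multiplier", "1.0",
]

# Every example prompt above starts with this sentence.
SHARED_PREFIX = "The video showcases a pixel art scene from a video game."


def build_command(scene_description):
    """Return the argument list for one generation run (BASE_ARGS is not modified)."""
    return BASE_ARGS + ["--prompt", f"{SHARED_PREFIX} {scene_description}"]


def run_all(scene_descriptions):
    """Run the script once per description, stopping at the first failure."""
    for description in scene_descriptions:
        subprocess.run(build_command(description), check=True)
```

Calling run_all(["A quiet harbor at dawn."]) would launch one run. Passing the arguments as a list avoids shell quoting problems with long prompts. Each run starts a new process, so the model is reloaded every time.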

Parameters

--fp8: Enable FP8 precision (optional).
--task: Specify the task (e.g., t2v-1.3B).
--video_size: Set the resolution of the generated video as two integers (e.g., 768 1024).
--video_length: Define the length of the video in frames (e.g., 81).
--infer_steps: Number of inference steps.
--save_path: Directory to save the generated video.
--output_type: Output type (e.g., both for video and frames).
--dit: Path to the diffusion model weights.
--vae: Path to the VAE model weights.
--t5: Path to the T5 model weights.
--attn_mode: Attention mode (e.g., torch).
--lora_weight: Path to the LoRA weights.
--lora_multiplier: Multiplier for LoRA weights.
--prompt: Textual prompt for video generation.
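
Since --video_length is given in frames, it helps to translate it into seconds. The sketch below does that arithmetic under two assumptions that this README does not state: an output frame rate of 16 fps, and the convention that Wan 2.1 frame counts take the form 4n + 1. Check both against the script before relying on them. The function names are hypothetical.

```python
ASSUMED_FPS = 16  # assumption: the frame rate is not stated in this README


def clip_seconds(video_length, fps=ASSUMED_FPS):
    """Convert a frame count into a clip duration in seconds."""
    return video_length / fps


def is_4n_plus_1(video_length):
    """Check the assumed 4n + 1 frame-count convention (1, 5, 9, ..., 81, ...)."""
    return video_length >= 1 and (video_length - 1) % 4 == 0


if __name__ == "__main__":
    frames = 81  # value used in the examples above
    print(f"{frames} frames at {ASSUMED_FPS} fps is about {clip_seconds(frames):.2f} s")
    print("Matches 4n + 1:", is_4n_plus_1(frames))
```

Under these assumptions, the 81-frame setting used in the examples corresponds to a clip of roughly five seconds.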

Output

The generated video and frames will be saved in the specified save_path directory.
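
To see what a run produced, the output directory can be listed newest-first. This is a minimal sketch that assumes the save directory from the examples (--save_path save). The names and extensions of the generated files are not documented here, so it lists every file rather than filtering by type. The helper name list_outputs is hypothetical.

```python
from pathlib import Path


def list_outputs(save_path="save"):
    """Return all files under `save_path`, newest first; empty if the directory is missing."""
    root = Path(save_path)
    if not root.is_dir():
        return []
    files = [p for p in root.rglob("*") if p.is_file()]
    return sorted(files, key=lambda p: p.stat().st_mtime, reverse=True)


if __name__ == "__main__":
    for path in list_outputs():
        print(path)
```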

Troubleshooting

• Ensure all dependencies are correctly installed.
• Verify that the model weights are downloaded and placed in the correct locations.
• Check for any missing Python packages and install them using pip.
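
The last check can be scripted. The sketch below reports which packages from the pip commands in the Installation section cannot be imported. The import names are assumptions derived from the package names (for example ascii_magic for ascii-magic), and packages listed only in requirements.txt are not covered. The helper name missing_packages is hypothetical.

```python
import importlib.util

# Import names are assumptions derived from the pip commands in Installation.
PACKAGES = [
    "torch", "torchvision", "ascii_magic", "matplotlib", "tensorboard",
    "huggingface_hub", "datasets", "moviepy", "sageattention",
]


def missing_packages(names=PACKAGES):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Not importable:", ", ".join(missing))
    else:
        print("All listed packages can be located.")
```

This confirms only that a package can be located, not that its installed version matches the pinned one.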

License

This project is licensed under the MIT License. See the LICENSE file for details.

Acknowledgments

• Hugging Face for hosting the model weights.
• Wan-AI for providing the pre-trained models.
• DeepBeepMeep for contributing to the model weights.

Contact

For any questions or issues, please open an issue on the repository or contact the maintainer.