Best open source image-to-video model: CogVideoX1.5-5B-I2V is pretty decent and optimized for low-VRAM machines at high resolution. Native resolution is 1360px and it generates up to 10 seconds (161 frames). The audio was generated with a new open source audio model.
Full YouTube tutorial for CogVideoX1.5-5B-I2V : https://youtu.be/5UCkMzP2VLE
1-Click Windows, RunPod and Massed Compute installers : https://www.patreon.com/posts/112848192 - installs into Python 3.11 VENV
Official Hugging Face repo of CogVideoX1.5-5B-I2V : THUDM/CogVideoX1.5-5B-I2V
Official github repo : https://github.com/THUDM/CogVideo
Prompts used to generate the videos (txt file) : https://gist.github.com/FurkanGozukara/471db7b987ab8d9877790358c126ac05
Demo images shared in : https://www.patreon.com/posts/112848192
I used 1360x768px images at 16 FPS with 81 frames: 80 generated frames = 5 seconds, +1 frame coming from the initial image
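The frame arithmetic can be sketched in a few lines of Python. This is only an illustration of the numbers in this post (16 FPS, one extra frame for the input image); the function names `clip_seconds` and `frames_for` are made up for this example and are not part of any library.

```python
FPS = 16  # frame rate used for the videos in this post


def clip_seconds(num_frames: int, fps: int = FPS) -> float:
    """Clip length in seconds; the first frame is the input image itself."""
    return (num_frames - 1) / fps


def frames_for(seconds: float, fps: int = FPS) -> int:
    """Frame count to request for a given length, including the initial image frame."""
    return int(seconds * fps) + 1


if __name__ == "__main__":
    for n in (41, 81, 161):
        print(f"{n} frames -> {clip_seconds(n)} seconds")
```

With these assumptions, 81 frames gives 5 seconds and 161 frames gives the 10 second maximum mentioned in the title.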
Also I have enabled all the optimizations shared on Hugging Face
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
quantization = int8_weight_only - you need TorchAO and DeepSpeed, and it works great on Windows with a Python 3.11 VENV
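Put together, the optimizations above could look roughly like the sketch below. It assumes `torch`, `diffusers`, `transformers` and `torchao` are installed, and it follows the loading pattern shown on the Hugging Face model card; the wrapper function `build_pipeline` is a hypothetical name for this example and not necessarily how the installers wire things up.

```python
MODEL_ID = "THUDM/CogVideoX1.5-5B-I2V"


def build_pipeline(model_id: str = MODEL_ID):
    """Load the I2V pipeline with int8 weight-only quantization and low-VRAM options."""
    # Heavy imports are kept inside the function so this file can be imported
    # on a machine that does not have the GPU stack installed.
    import torch
    from diffusers import (
        AutoencoderKLCogVideoX,
        CogVideoXImageToVideoPipeline,
        CogVideoXTransformer3DModel,
    )
    from torchao.quantization import int8_weight_only, quantize_
    from transformers import T5EncoderModel

    dtype = torch.bfloat16

    # Quantize each large component to int8 weights (done in place by quantize_).
    text_encoder = T5EncoderModel.from_pretrained(
        model_id, subfolder="text_encoder", torch_dtype=dtype
    )
    quantize_(text_encoder, int8_weight_only())

    transformer = CogVideoXTransformer3DModel.from_pretrained(
        model_id, subfolder="transformer", torch_dtype=dtype
    )
    quantize_(transformer, int8_weight_only())

    vae = AutoencoderKLCogVideoX.from_pretrained(
        model_id, subfolder="vae", torch_dtype=dtype
    )
    quantize_(vae, int8_weight_only())

    pipe = CogVideoXImageToVideoPipeline.from_pretrained(
        model_id,
        text_encoder=text_encoder,
        transformer=transformer,
        vae=vae,
        torch_dtype=dtype,
    )

    # The three optimizations listed above.
    pipe.enable_sequential_cpu_offload()  # keep weights on CPU, move to GPU per step
    pipe.vae.enable_slicing()             # decode the batch one slice at a time
    pipe.vae.enable_tiling()              # decode frames in spatial tiles
    return pipe
```

Calling the returned pipeline with `prompt=...`, `image=...` and `num_frames=81`, then saving the frames with `diffusers.utils.export_to_video(..., fps=16)`, would match the settings used here. Sequential CPU offload trades speed for lower VRAM, so expect slower generation than a fully GPU-resident pipeline.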
Used audio model : https://github.com/hkchengrex/MMAudio
1-Click Windows, RunPod and Massed Compute Installers for MMAudio : https://www.patreon.com/posts/117990364 - installs into Python 3.10 VENV
Used very simple prompts - it fails when there is a human in the input video, so use text to audio in such cases
I also measured VRAM usage for CogVideoX1.5-5B-I2V
Here are the resolutions and their VRAM requirements - they may work on lower VRAM GPUs too, but slower
512x288 - 41 frames : 7700 MB
576x320 - 41 frames : 7900 MB
576x320 - 81 frames : 8850 MB
704x384 - 81 frames : 8950 MB
768x432 - 81 frames : 10600 MB
896x496 - 81 frames : 12050 MB
960x528 - 81 frames : 12850 MB
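As a rough way to use these measurements, the sketch below picks the largest tested setting that fits a given VRAM budget. The numbers are copied from the list above; the helper `largest_fit` and the ranking by total pixels times frames are assumptions made for this example. Since lower VRAM GPUs may still work (just slower), treat the result as a starting point, not a hard limit.

```python
from typing import Optional, Tuple

# (width, height, frames) -> measured VRAM in MB, copied from the list above
VRAM_MB = {
    (512, 288, 41): 7700,
    (576, 320, 41): 7900,
    (576, 320, 81): 8850,
    (704, 384, 81): 8950,
    (768, 432, 81): 10600,
    (896, 496, 81): 12050,
    (960, 528, 81): 12850,
}


def largest_fit(budget_mb: int) -> Optional[Tuple[int, int, int]]:
    """Return the biggest tested (width, height, frames) within the VRAM budget."""
    fitting = [cfg for cfg, mb in VRAM_MB.items() if mb <= budget_mb]
    if not fitting:
        return None  # nothing measured fits; offloading may still run, only slower
    # Rank by total amount of video generated: width * height * frames.
    return max(fitting, key=lambda cfg: cfg[0] * cfg[1] * cfg[2])


if __name__ == "__main__":
    for budget in (8192, 12288, 24576):
        print(budget, "MB ->", largest_fit(budget))
```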