CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation
Abstract
In this work, we present CineMaster, a novel framework for 3D-aware and controllable text-to-video generation. Our goal is to empower users with controllability comparable to that of professional film directors: precise placement of objects within the scene, flexible manipulation of both objects and camera in 3D space, and intuitive layout control over the rendered frames. To achieve this, CineMaster operates in two stages. In the first stage, we design an interactive workflow that allows users to intuitively construct 3D-aware conditional signals by positioning object bounding boxes and defining camera movements within the 3D space. In the second stage, these control signals, comprising rendered depth maps, camera trajectories, and object class labels, serve as guidance for a text-to-video diffusion model, steering it to generate the user-intended video content. Furthermore, to overcome the scarcity of in-the-wild datasets with 3D object motion and camera pose annotations, we carefully establish an automated data annotation pipeline that extracts 3D bounding boxes and camera trajectories from large-scale video data. Extensive qualitative and quantitative experiments demonstrate that CineMaster significantly outperforms existing methods and achieves effective 3D-aware text-to-video generation. Project page: https://cinemaster-dev.github.io/.
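To make the second-stage conditioning concrete, below is a minimal sketch of how per-frame 3D bounding boxes and camera poses could be rasterized into the depth maps that guide the diffusion model. The function names, the numpy-only rasterization, and the camera convention are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: rasterizing a 3D box + camera pose into a coarse
# per-frame depth map, as one possible form of the paper's "rendered depth
# maps, camera trajectories and object class labels" control signals.
# All names and the rasterization scheme are assumptions, not CineMaster code.
import numpy as np

def box_corners(center, size):
    """Return the 8 corners of an axis-aligned 3D bounding box (world frame)."""
    offsets = np.array([[x, y, z] for x in (-.5, .5)
                                   for y in (-.5, .5)
                                   for z in (-.5, .5)])
    return center + offsets * size

def render_box_depth(corners, K, w2c, hw=(256, 256)):
    """Project box corners through a camera and splat a coarse depth map.

    K   : 3x3 intrinsics, w2c : 4x4 world-to-camera extrinsics.
    Fills the 2D bounding rectangle of the projected corners with the box's
    mean camera-space depth -- a crude stand-in for a real depth renderer.
    """
    h, w = hw
    cam = (w2c @ np.c_[corners, np.ones(8)].T)[:3]   # 3x8 camera-space points
    uv = (K @ cam)[:2] / cam[2]                      # perspective projection
    depth = np.full((h, w), np.inf)
    u0, u1 = np.clip(uv[0].min(), 0, w - 1), np.clip(uv[0].max(), 0, w - 1)
    v0, v1 = np.clip(uv[1].min(), 0, h - 1), np.clip(uv[1].max(), 0, h - 1)
    depth[int(v0):int(v1) + 1, int(u0):int(u1) + 1] = cam[2].mean()
    return depth

# One conditioning frame: a 2m-tall object 5m in front of an identity camera.
K = np.array([[200., 0., 128.], [0., 200., 128.], [0., 0., 1.]])
corners = box_corners(center=np.array([0., 0., 5.]), size=np.array([1., 2., 1.]))
depth_map = render_box_depth(corners, K, np.eye(4))  # -> fed to the video model
```

In this reading, repeating the projection per frame along a user-defined camera trajectory yields a depth-map sequence that, together with class labels and the text prompt, conditions the video diffusion model.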
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- MotionCanvas: Cinematic Shot Design with Controllable Image-to-Video Generation (2025)
- Perception-as-Control: Fine-grained Controllable Image Animation with 3D-aware Motion Representation (2025)
- BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations (2025)
- Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control (2025)
- MotionAgent: Fine-grained Controllable Video Generation via Motion Field Agent (2025)
- VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation (2025)
- LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis (2024)