Model Card for WaLa-MVDream-DM6

This model was released with the Wavelet Latent Diffusion (WaLa) paper. It generates six-view depth maps from text descriptions to support text-to-3D generation.

Model Details

Model Description

WaLa-MVDream-DM6 is a fine-tuned version of the MVDream model, adapted to generate six-view depth maps from text inputs. This model serves as an intermediate step in the text-to-3D generation pipeline of WaLa, producing multi-view depth maps that are then used by the WaLa-DM6-1B model to generate 3D shapes.

Developed by: Aditya Sanghi, Aliasghar Khani, Chinthala Pradyumna Reddy, Arianna Rampini, Derek Cheung, Kamal Rahimi Malekshan, Kanika Madan, Hooman Shayani
Model type: Text-to-Depth Map Generative Model
License: Autodesk Non-Commercial (3D Generative) v1.0

For more information, please see the Project Page and the paper.

Model Sources

Project Page: WaLa
Repository: Github
Paper: ArXiv
Demo: Colab

Uses

Direct Use

This model is released by Autodesk and is intended for academic and research purposes only, for the theoretical exploration and demonstration of the WaLa 3D generative framework. It is designed to be used in conjunction with WaLa-DM6-1B for text-to-3D generation. Please see here for inference instructions.

Out-of-Scope Use

The model should not be used for:

Commercial purposes
Generation of inappropriate or offensive content
Any use that does not comply with the license, in particular its "Acceptable Use" section.

Bias, Risks, and Limitations

Bias

The model may inherit biases present in the text-image datasets used for pre-training and fine-tuning.
The model's performance may vary depending on the complexity and specificity of the input text descriptions.

Risks and Limitations

The quality of the generated multi-view depth maps may impact the subsequent 3D shape generation.
The model may occasionally generate depth maps that do not accurately represent the input text or maintain consistency across views.

How to Get Started with the Model

Please refer to the instructions here.

Training Details

Training Data

The model was fine-tuned using captions generated for the WaLa dataset. Captions were initially created using the InternVL 2.0 model and then augmented using LLaMA 3.1 to enhance diversity and richness.

Training Procedure

Preprocessing

Captions were generated for each 3D object in the dataset using four renderings and two distinct prompts. These captions were then augmented to increase diversity. For depth map generation, six views were used to ensure comprehensive coverage of the entire object.
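
As a rough illustration of what a six-view depth representation contains, the sketch below renders orthographic depth maps of a point set from six axis-aligned directions using only the Python standard library. The view directions, the `depth_map` and `six_view_depth` helpers, the resolution, and the depth convention are all assumptions made for this example; the camera setup actually used for WaLa is described in the paper and repository.

```python
import math

# Hypothetical view set: one orthographic camera per axis direction.
# Each entry maps a view name to (axis index, sign of the viewing side).
VIEWS = {
    "+x": (0, +1), "-x": (0, -1),
    "+y": (1, +1), "-y": (1, -1),
    "+z": (2, +1), "-z": (2, -1),
}


def depth_map(points, axis, sign, res=8):
    """Orthographic depth map of points in [-1, 1]^3, seen from sign * axis.

    Each pixel stores the distance from the camera plane (located at
    coordinate `sign` on `axis`) to the nearest point, or inf if empty.
    """
    # The two remaining axes span the image plane.
    u_axis, v_axis = [a for a in range(3) if a != axis]
    img = [[math.inf] * res for _ in range(res)]
    for p in points:
        # Map in-plane coordinates from [-1, 1] to clamped pixel indices.
        col = min(res - 1, max(0, int((p[u_axis] + 1) / 2 * res)))
        row = min(res - 1, max(0, int((p[v_axis] + 1) / 2 * res)))
        depth = 1 - sign * p[axis]
        # Keep only the closest surface point per pixel.
        img[row][col] = min(img[row][col], depth)
    return img


def six_view_depth(points, res=8):
    """Return a dict of six depth maps, one per entry in VIEWS."""
    return {
        name: depth_map(points, axis, sign, res)
        for name, (axis, sign) in VIEWS.items()
    }
```

Opposite views of the same point report complementary depths, which is one reason several views together cover an object more completely than any single view.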

Training Hyperparameters

Training regime: Please refer to the paper.

Technical Specifications

Model Architecture and Objective

The model is based on the MVDream architecture, fine-tuned to generate six-view depth maps from text inputs. It is designed to work in tandem with the WaLa-DM6-1B model for text-to-3D generation. The model uses the Stable Diffusion framework, is initialized with weights from MVDream, and is fine-tuned on paired depth-map and text data.
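
The two-stage flow described above can be sketched as a small wrapper. This is only an interface sketch: `TextTo3DPipeline`, `fake_text_to_depth`, and `fake_depth_to_shape` are hypothetical names with placeholder bodies, not the API of the WaLa repository, which should be consulted for real inference code.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

DepthMap = List[List[float]]


@dataclass
class TextTo3DPipeline:
    # Stage 1: text -> six depth maps (the role of WaLa-MVDream-DM6).
    text_to_depth: Callable[[str], List[DepthMap]]
    # Stage 2: six depth maps -> 3D shape (the role of WaLa-DM6-1B).
    depth_to_shape: Callable[[List[DepthMap]], Any]

    def __call__(self, prompt: str) -> Any:
        views = self.text_to_depth(prompt)
        # The second stage expects exactly six views.
        if len(views) != 6:
            raise ValueError(f"expected 6 depth maps, got {len(views)}")
        return self.depth_to_shape(views)


def fake_text_to_depth(prompt: str) -> List[DepthMap]:
    # Placeholder stage 1: six blank 2x2 depth maps, ignoring the prompt.
    return [[[0.0, 0.0], [0.0, 0.0]] for _ in range(6)]


def fake_depth_to_shape(views: List[DepthMap]) -> Any:
    # Placeholder stage 2: report how many views were received.
    return {"num_views": len(views)}


pipeline = TextTo3DPipeline(fake_text_to_depth, fake_depth_to_shape)
result = pipeline("a wooden chair")
```

Keeping the two stages behind separate callables mirrors the split between the depth-map model and the shape model, so either stage can be swapped or tested on its own.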

Compute Infrastructure

Hardware

The model was trained on NVIDIA H100 GPUs.

Citation

@misc{sanghi2024waveletlatentdiffusionwala,
      title={Wavelet Latent Diffusion (Wala): Billion-Parameter 3D Generative Model with Compact Wavelet Encodings}, 
      author={Aditya Sanghi and Aliasghar Khani and Pradyumna Reddy and Arianna Rampini and Derek Cheung and Kamal Rahimi Malekshan and Kanika Madan and Hooman Shayani},
      year={2024},
      eprint={2411.08017},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.08017}, 
}