Terra

Terra is a world model designed for autonomous driving and serves as a baseline model in the ACT-Bench framework. Terra generates video continuations from a short video clip (approximately three frames) together with trajectory instructions. A key feature of Terra is its high adherence to trajectory instructions, enabling accurate and reliable action-conditioned video generation.

We have developed two versions of the Terra model to date. The v1 model, as detailed in the paper, exhibits a bias towards generating videos that veer to the right. To address this issue, we introduced the v2 model, incorporating slight architectural modifications to mitigate this tendency and produce more balanced outputs. The performance of each model is outlined below.

              Vista   Terra (v1)   Terra (v2)
Accuracy (↑)  0.307   0.441        0.632
ADE (↓)       4.50    3.98         3.86
FDE (↓)       8.66    8.21         8.05
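
As a rough illustration of the displacement metrics (the exact evaluation protocol is defined by ACT-Bench, not by this snippet), ADE is conventionally the mean per-timestep L2 distance between the trajectory estimated from the generated video and the instructed reference trajectory, and FDE is that distance at the final timestep:

import numpy as np

def ade_fde(pred: np.ndarray, ref: np.ndarray) -> tuple[float, float]:
    """Average and final displacement error between two (T, 2) trajectories
    of (x, y) waypoints matched timestep by timestep."""
    dists = np.linalg.norm(pred - ref, axis=-1)  # per-timestep L2 distance
    return float(dists.mean()), float(dists[-1])

# Toy example: a straight reference path and a prediction offset laterally by 0.3 m.
ref = np.stack([np.linspace(0.0, 20.0, 10), np.zeros(10)], axis=1)
pred = ref + np.array([0.0, 0.3])
print(ade_fde(pred, ref))  # ADE == FDE == 0.3 for a constant offset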

Related Links

For more technical details and discussions, please refer to:

How to use

We have verified execution on a machine equipped with a single NVIDIA H100 80GB GPU. However, we believe the model should run on any machine with an NVIDIA GPU that has 16 GB or more of VRAM.
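
If you are unsure how much VRAM your GPU has, one quick way to check (assuming PyTorch is already installed in your environment) is:

import torch

# Report the name and total memory of each visible CUDA device.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB VRAM")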

Terra consists of an Image Tokenizer, an Autoregressive Transformer, and a Video Refiner. Because setting up the Video Refiner is complex, we have not included its implementation in this Hugging Face repository; the implementation and setup instructions for the Video Refiner are provided in the ACT-Bench repository. Here, we provide an example of generating video continuations using the Image Tokenizer and the Autoregressive Transformer, conditioned on image frames and a template trajectory. The resulting video quality may look suboptimal because each frame is decoded individually; to improve visual quality, you can apply the Video Refiner.
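
The sketch below only illustrates that two-stage flow. Every identifier in it (generate_continuation, tokenizer.encode, transformer.generate, and so on) is a placeholder rather than the actual API of this repository; the real logic lives in inference.py.

def generate_continuation(frames, trajectory, tokenizer, transformer):
    """Schematic two-stage generation (placeholder names, not the real API):
    tokenize the conditioning frames, roll the Autoregressive Transformer
    forward under the trajectory condition, then decode frame by frame."""
    context_tokens = tokenizer.encode(frames)  # Image Tokenizer: frames -> discrete tokens
    future_tokens = transformer.generate(context_tokens, action=trajectory)
    # Each frame is decoded independently here, which is why the raw output can
    # look temporally inconsistent; the Video Refiner (see the ACT-Bench
    # repository) post-processes the sequence for better visual quality.
    return [tokenizer.decode(tokens) for tokens in future_tokens]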

Install Packages

We use uv to manage Python packages. If you don't have uv installed in your environment, please refer to its documentation.

$ git clone https://huggingface.co/turing-motors/Terra
$ cd Terra
$ uv sync

Action-Conditioned Video Generation without Video Refiner

$ python inference.py

This command generates a video using three image frames located in assets/conditioning_frames and the curving_to_left/curving_to_left_moderate trajectory defined in the trajectory template file assets/template_trajectory.json.
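
To see which other trajectory templates ship with the repository, you can inspect the template file directly. The snippet below assumes only that the file is a JSON object keyed by template name:

import json

# Print the top-level trajectory template names (e.g. categories such as
# curving_to_left); inspect the file itself for the exact schema.
with open("assets/template_trajectory.json") as f:
    templates = json.load(f)

for name in templates:
    print(name)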

You can find more details by referring to the inference.py script.

Citation

@misc{arai2024actbench,
      title={ACT-Bench: Towards Action Controllable World Models for Autonomous Driving}, 
      author={Hidehisa Arai and Keishi Ishihara and Tsubasa Takahashi and Yu Yamaguchi},
      year={2024},
      eprint={2412.05337},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.05337}, 
}