arxiv:2411.19189

Video Depth without Video Models

Published on Nov 28 · Submitted by toshas on Dec 2

#2 Paper of the day


Authors: Bingxin Ke, Dominik Narnhofer, Shengyu Huang, Lei Ke, Torben Peters, Anton Obukhov

Abstract

Video depth estimation lifts monocular video clips to 3D by inferring dense depth at every frame. Recent advances in single-image depth estimation, brought about by the rise of large foundation models and the use of synthetic training data, have fueled a renewed interest in video depth. However, naively applying a single-image depth estimator to every frame of a video disregards temporal continuity, which not only leads to flickering but may also break when camera motion causes sudden changes in depth range. An obvious and principled solution would be to build on top of video foundation models, but these come with their own limitations, including expensive training and inference, imperfect 3D consistency, and stitching routines for the fixed-length (short) outputs. We take a step back and demonstrate how to turn a single-image latent diffusion model (LDM) into a state-of-the-art video depth estimator. Our model, which we call RollingDepth, has two main ingredients: (i) a multi-frame depth estimator that is derived from a single-image LDM and maps very short video snippets (typically frame triplets) to depth snippets; and (ii) a robust, optimization-based registration algorithm that optimally assembles depth snippets sampled at various frame rates back into a consistent video. RollingDepth efficiently handles long videos with hundreds of frames and delivers more accurate depth videos than both dedicated video depth estimators and high-performing single-frame models. Project page: rollingdepth.github.io.
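To make the two ingredients in the abstract more concrete, here is a minimal sketch of the assembly step only. It is not the authors' implementation. It assumes (this is not stated in the abstract) that each depth snippet is correct up to an unknown per-snippet scale and shift, and it swaps the robust optimization for plain linear least squares. The names `sample_snippets`, `co_align`, and `merge_snippets` are hypothetical, and the depth snippets are taken as given rather than predicted by an LDM.

```python
import numpy as np

def sample_snippets(num_frames, snippet_len=3, dilations=(1, 2, 4)):
    """Sliding windows of frame indices at several dilation rates (frame triplets by default)."""
    snippets = []
    for d in dilations:
        span = (snippet_len - 1) * d
        for start in range(num_frames - span):
            snippets.append(tuple(start + k * d for k in range(snippet_len)))
    return snippets

def co_align(snippets, depths):
    """Solve for one scale and shift per snippet so that frames shared by two snippets agree.

    depths[k] has shape (snippet_len, H, W). Snippet 0 is fixed to scale 1 and
    shift 0 to remove the global ambiguity. ASSUMPTION: affine per-snippet model,
    ordinary least squares instead of a robust loss.
    """
    K = len(snippets)
    blocks = []
    for i in range(K):
        for j in range(i + 1, K):
            for a, f in enumerate(snippets[i]):
                if f not in snippets[j]:
                    continue
                b = snippets[j].index(f)
                di = depths[i][a].ravel()
                dj = depths[j][b].ravel()
                # One equation per pixel: s_i*di + t_i - s_j*dj - t_j = 0
                A = np.zeros((di.size, 2 * K))
                A[:, i] = di
                A[:, K + i] = 1.0
                A[:, j] = -dj
                A[:, K + j] = -1.0
                blocks.append(A)
    A = np.vstack(blocks)
    # Move the fixed snippet (s_0 = 1, t_0 = 0) to the right-hand side.
    rhs = -A[:, 0]
    A_free = np.delete(A, [0, K], axis=1)
    x, *_ = np.linalg.lstsq(A_free, rhs, rcond=None)
    scales = np.concatenate([[1.0], x[:K - 1]])
    shifts = np.concatenate([[0.0], x[K - 1:]])
    return scales, shifts

def merge_snippets(snippets, depths, scales, shifts, num_frames):
    """Average all aligned predictions that cover each frame."""
    H, W = depths[0].shape[1:]
    total = np.zeros((num_frames, H, W))
    count = np.zeros(num_frames)
    for k, frames in enumerate(snippets):
        for a, f in enumerate(frames):
            total[f] += scales[k] * depths[k][a] + shifts[k]
            count[f] += 1
    return total / count[:, None, None]
```

Calling `sample_snippets(6, dilations=(1, 2))` yields four adjacent triplets plus two triplets that skip every other frame. The longer-range snippets are what tie distant frames together during alignment. The dense per-pixel system built here is only practical for toy resolutions.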


Community

toshas

Paper author Paper submitter 24 days ago

🌍 Website: https://rollingdepth.github.io
🤗 Hugging Face Space: https://huggingface.co/spaces/prs-eth/rollingdepth
📕 Paper: https://arxiv.org/abs/2411.19189
👾 Code: https://github.com/prs-eth/rollingdepth

MichaelBarryUK

24 days ago (edited 24 days ago)

Nice work. Are the colours set to specific distances or are they dynamic? If the training data synchronised the colours to the known distances, then you might be able to use this for size/distance/area/volume estimation too. It would also be interesting to play around with non-Euclidean geometries.

MichaelBarryUK

24 days ago

I'm imagining taking a photo of a half-empty bottle of drink, clicking on the drink segment and having the model tell you it's 250 ml, or getting the height of a person, the volume of a house, the distance of xyz, etc.

mikelabs

23 days ago

Added summary here: https://www.aimodels.fyi/papers/arxiv/video-depth-without-video-models

toshas

Paper author 23 days ago

This AI-generated summary is not entirely correct. For example, "Depends on quality of optical flow estimation" -- the method does not come close to using optical flow.



