arxiv:2306.17842

SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs

Published on Jun 30, 2023

Abstract

In this work, we introduce Semantic Pyramid AutoEncoder (SPAE) for enabling frozen LLMs to perform both understanding and generation tasks involving non-linguistic modalities such as images or videos. SPAE converts between raw pixels and interpretable lexical tokens (or words) extracted from the LLM's vocabulary. The resulting tokens capture both the semantic meaning and the fine-grained details needed for visual reconstruction, effectively translating the visual content into a language comprehensible to the LLM, and empowering it to perform a wide array of multimodal tasks. Our approach is validated through in-context learning experiments with frozen PaLM 2 and GPT 3.5 on a diverse set of image understanding and generation tasks. Our method marks the first successful attempt to enable a frozen LLM to generate image content while surpassing state-of-the-art performance in image understanding tasks, under the same setting, by over 25%.
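At a high level, the abstract describes a tokenizer whose codebook is the frozen LLM's own vocabulary: image features are quantized to their nearest lexical token embeddings, arranged in a pyramid from a few coarse semantic tokens to many fine tokens that carry reconstruction detail. The snippet below is only a minimal sketch of that quantization idea; the visual encoder, the pooling, and the pyramid layout shown here are placeholder assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn.functional as F

def quantize_to_vocab(features, vocab_embeddings, tokenizer):
    """Map continuous image features to the nearest LLM vocabulary tokens.

    features:         (num_patches, d) features from some visual encoder (assumed)
    vocab_embeddings: (vocab_size, d) frozen token-embedding table of the LLM
    Returns token ids and the corresponding words.
    """
    # Cosine similarity between each patch feature and every vocabulary embedding.
    f = F.normalize(features, dim=-1)
    v = F.normalize(vocab_embeddings, dim=-1)
    sims = f @ v.T                          # (num_patches, vocab_size)
    token_ids = sims.argmax(dim=-1)         # nearest lexical token per patch
    words = [tokenizer.decode([i]) for i in token_ids.tolist()]
    return token_ids, words

# A "pyramid" would apply this at several granularities, e.g. a handful of
# coarse semantic tokens for the whole image plus many fine tokens for detail:
#   coarse = quantize_to_vocab(pool(features, k=4), vocab_emb, tok)   # hypothetical pool()
#   fine   = quantize_to_vocab(features, vocab_emb, tok)
```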

Community

Can GPT solve visual tasks by in-context learning?

A remarkable talent of LLMs is their in-context learning capability. In-context learning updates none of the LLM's parameters, yet it has demonstrated impressive results on a wide range of NLP tasks.

Can GPT solve visual tasks by in-context learning? The recent paper suggests this is plausible, as long as we can translate the image (or another non-linguistic modality) into a language that the LLM can comprehend.
New research by Google reveals the power of large language models (LLMs) like PaLM 2 and GPT 3.5 in tackling visual tasks using in-context learning. The method enables LLMs to perform image generation tasks without requiring any parameter updates.

[Image: mnist.jpg]

With 50 handwritten digit images in the context, we ask PaLM 2 to answer complex queries that require generating digit images as the output. To our knowledge, this is the first successful approach that uses a frozen LLM to generate image content.
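Operationally, "in the context" means each example image is first converted into a short string of lexical tokens, the examples are concatenated into a single text prompt, and the frozen LLM's textual completion is decoded back into pixels. The following is only a sketch of that loop under assumed helpers; `spae_encode`, `spae_decode`, and `llm_complete` are hypothetical stand-ins for the SPAE tokenizer and the PaLM 2 / GPT 3.5 API, and the prompt format is illustrative rather than the one used in the paper.

```python
def build_prompt(examples, query):
    """examples: list of (description, image) pairs, e.g. 50 handwritten digits."""
    parts = []
    for description, image in examples:
        tokens = spae_encode(image)               # image -> lexical tokens (words)
        parts.append(f"Input: {description}\nOutput: {' '.join(tokens)}")
    parts.append(f"Input: {query}\nOutput:")      # the query to be answered
    return "\n\n".join(parts)

prompt = build_prompt(digit_examples, "an image of the digit seven")
completion = llm_complete(prompt)                 # frozen LLM, no weight updates
generated_image = spae_decode(completion.split()) # lexical tokens -> pixels
```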

Paper: SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs
https://arxiv.org/abs/2306.17842

Paper author

Note that for all of these tasks we use a frozen LLM via in-context learning, with no updates to its parameters.

Image captioning by PaLM 2 via in-context learning:

[Image: caption.png]

Visual question answering by PaLM 2 via in-context learning:

[Image: vqa.png]

Image generation by PaLM 2 via in-context learning:

[Image: generation_all.png]


A great paper on this that is not in the bibliography, since it was posted to arXiv only two days before this one (on June 28, 2023): "Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language" https://huggingface.co./papers/2306.16410 (https://arxiv.org/abs/2306.16410)


Thanks for sharing this paper. I think a key difference here is that we can ask GPT to generate an image through in-context learning. There have been several works that carry out image understanding (which we cited), and we will include this one.
