arxiv:2306.17842

SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs

Published on Jun 30, 2023

Abstract

In this work, we introduce Semantic Pyramid AutoEncoder (SPAE) for enabling frozen LLMs to perform both understanding and generation tasks involving non-linguistic modalities such as images or videos. SPAE converts between raw pixels and interpretable lexical tokens (or words) extracted from the LLM's vocabulary. The resulting tokens capture both the semantic meaning and the fine-grained details needed for visual reconstruction, effectively translating the visual content into a language comprehensible to the LLM, and empowering it to perform a wide array of multimodal tasks. Our approach is validated through in-context learning experiments with frozen PaLM 2 and GPT 3.5 on a diverse set of image understanding and generation tasks. Our method marks the first successful attempt to enable a frozen LLM to generate image content while surpassing state-of-the-art performance in image understanding tasks, under the same setting, by over 25%.
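At a high level, the abstract describes a tokenizer whose codebook is the frozen LLM's own vocabulary: image features are quantized to their nearest lexical token embeddings, arranged in a pyramid from a few coarse semantic tokens to many fine tokens that carry reconstruction detail. The snippet below is only a minimal sketch of that quantization idea; the visual encoder, the pooling, and the pyramid layout shown here are placeholder assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn.functional as F

def quantize_to_vocab(features, vocab_embeddings, tokenizer):
    """Map continuous image features to the nearest LLM vocabulary tokens.

    features:         (num_patches, d) features from some visual encoder (assumed)
    vocab_embeddings: (vocab_size, d) frozen token-embedding table of the LLM
    Returns token ids and the corresponding words.
    """
    # Cosine similarity between each patch feature and every vocabulary embedding.
    f = F.normalize(features, dim=-1)
    v = F.normalize(vocab_embeddings, dim=-1)
    sims = f @ v.T                          # (num_patches, vocab_size)
    token_ids = sims.argmax(dim=-1)         # nearest lexical token per patch
    words = [tokenizer.decode([i]) for i in token_ids.tolist()]
    return token_ids, words

# A "pyramid" would apply this at several granularities, e.g. a handful of
# coarse semantic tokens for the whole image plus many fine tokens for detail:
#   coarse = quantize_to_vocab(pool(features, k=4), vocab_emb, tok)   # hypothetical pool()
#   fine   = quantize_to_vocab(features, vocab_emb, tok)
```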

Community

Can GPT solve visual tasks by in-context learning?

A remarkable talent of LLMs is their in-context learning capability. In-context learning updates none of the LLM's parameters, yet it has demonstrated impressive results on a wide range of NLP tasks.

Can GPT solve visual tasks by in-context learning? The recent paper suggests this is plausible, as long as we can translate the image (or another non-linguistic modality) into a language that the LLM can comprehend.
New research by Google reveals the power of large language models (LLMs) like PaLM 2 and GPT 3.5 in tackling visual tasks using in-context learning. The method enables LLMs to perform image generation tasks without requiring any parameter updates.

[Image: mnist.jpg]

With 50 handwritten digit images in the context, we ask PaLM 2 to answer complex queries that require generating digit images as the output. To our knowledge, this is the first successful approach that uses a frozen LLM to generate image content.
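Operationally, "in the context" means each example image is first converted into a short string of lexical tokens, the examples are concatenated into a single text prompt, and the frozen LLM's textual completion is decoded back into pixels. The following is only a sketch of that loop under assumed helpers; `spae_encode`, `spae_decode`, and `llm_complete` are hypothetical stand-ins for the SPAE tokenizer and the PaLM 2 / GPT 3.5 API, and the prompt format is illustrative rather than the one used in the paper.

```python
def build_prompt(examples, query):
    """examples: list of (description, image) pairs, e.g. 50 handwritten digits."""
    parts = []
    for description, image in examples:
        tokens = spae_encode(image)               # image -> lexical tokens (words)
        parts.append(f"Input: {description}\nOutput: {' '.join(tokens)}")
    parts.append(f"Input: {query}\nOutput:")      # the query to be answered
    return "\n\n".join(parts)

prompt = build_prompt(digit_examples, "an image of the digit seven")
completion = llm_complete(prompt)                 # frozen LLM, no weight updates
generated_image = spae_decode(completion.split()) # lexical tokens -> pixels
```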

Paper: SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs
https://arxiv.org/abs/2306.17842

Paper author

Note that for all of these tasks we use a frozen LLM via in-context learning, with no updates to its parameters.

Image captioning by PaLM 2 via in-context learning:

[Image: caption.png]

Visual question answering by PaLM 2 via in-context learning:

[Image: vqa.png]

Image generation by PaLM 2 via in-context learning:

[Image: generation_all.png]


A great paper on this that is not in the bibliography, since it was posted to arXiv only two days before this one (on June 28, 2023): "Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language" https://huggingface.co./papers/2306.16410 (https://arxiv.org/abs/2306.16410)


Thanks for sharing this paper. I think a key difference here is that we can ask GPT to generate an image through in-context learning. There have been several works that carry out image understanding (which we cited), and we will include this one.
