arxiv:2412.11863

GeoX: Geometric Problem Solving Through Unified Formalized Vision-Language Pre-training

Published on Dec 16

· Submitted by

BoZhang on Dec 17

Upvote

Authors:

Renqiu Xia ,

Mingsheng Li ,

Hancheng Ye ,

Hongbin Zhou ,

Jiakang Yuan ,

Tianshuo Peng ,

Xinyu Cai ,

Xiangchao Yan ,

Conghui He ,

Botian Shi ,

Junchi Yan ,

Bo Zhang

Abstract

Despite their proficiency in general tasks, Multi-modal Large Language Models (MLLMs) struggle with automatic Geometry Problem Solving (GPS), which demands understanding diagrams, interpreting symbols, and performing complex reasoning. This limitation arises from their pre-training on natural images and texts, along with the lack of automated verification in the problem-solving process. Besides, current geometric specialists are limited by their task-specific designs, making them less effective for broader geometric problems. To this end, we present GeoX, a multi-modal large model focusing on geometric understanding and reasoning tasks. Given the significant differences between geometric diagram-symbol and natural image-text, we introduce unimodal pre-training to develop a diagram encoder and symbol decoder, enhancing the understanding of geometric images and corpora. Furthermore, we introduce geometry-language alignment, an effective pre-training paradigm that bridges the modality gap between unimodal geometric experts. We propose a Generator-And-Sampler Transformer (GS-Former) to generate discriminative queries and eliminate uninformative representations from unevenly distributed geometric signals. Finally, GeoX benefits from visual instruction tuning, empowering it to take geometric images and questions as input and generate verifiable solutions. Experiments show that GeoX outperforms both generalists and geometric specialists on publicly recognized benchmarks, such as GeoQA, UniGeo, Geometry3K, and PGPS9k.

View arXiv page View PDF Add to collection

Community

BoZhang

Paper author Paper submitter 8 days ago

GeoX is a multi-modal large model designed for automatic geometric problem solving, incorporating three progressive training stages to enhance diagram understanding and reasoning. In this paper, we validate that the formal vision-language training is a simple-yet-effective paradigm for complex mathematical diagram learning.

librarian-bot

8 days ago

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2412.11863 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2412.11863 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2412.11863 in a Space README.md to link it from this page.