Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents
Abstract
Recent success in large multimodal models (LMMs) has sparked promising applications of agents capable of autonomously completing complex web tasks. While open-source LMM agents have made significant advances in offline evaluation benchmarks, their performance still falls substantially short of human-level capabilities in more realistic online settings. A key bottleneck is the lack of diverse and large-scale trajectory-level datasets across various domains, which are expensive to collect. In this paper, we address this challenge by developing a scalable recipe to synthesize the largest and most diverse trajectory-level dataset to date, containing over 94K successful multimodal web trajectories, spanning 49K unique URLs, 720K screenshots, and 33M web elements. In particular, we leverage extensive web exploration and refinement to obtain diverse task intents. The average cost is 28 cents per successful trajectory, making it affordable to a wide range of users in the community. Leveraging this dataset, we train Explorer, a multimodal web agent, and demonstrate strong performance on both offline and online web agent benchmarks such as Mind2Web-Live, Multimodal-Mind2Web, and MiniWob++. Additionally, our experiments highlight data scaling as a key driver for improving web agent capabilities. We hope this study makes state-of-the-art LMM-based agent research at a larger scale more accessible.
Community
Paper: Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents
Link: https://arxiv.org/pdf/2502.11357
- We develop a scalable multi-agent pipeline that leverages exploration for diverse web agent trajectory synthesis.
- State-of-the-art results on Mind2Web-Live at both the 4B and 7B model scales.
- Also outperforms other GUI agent baselines on Multimodal-Mind2Web and MiniWoB++.
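The bullets above describe an exploration-driven synthesis pipeline: an agent explores a page, an intent is proposed and refined, a trajectory is rolled out, and only verified-successful trajectories are kept. A minimal toy sketch of that loop is below; all names (`propose_intent`, `explore`, the random success check standing in for the paper's verifier) are hypothetical, and the real pipeline operates on live web pages with LMM agents rather than this simulation.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    url: str
    intent: str
    steps: list = field(default_factory=list)  # (screenshot_id, action) pairs
    success: bool = False

def propose_intent(url, rng):
    """Stand-in for an LMM that drafts a task intent after exploring a page."""
    return f"complete a task on {url} (variant {rng.randint(0, 9)})"

def explore(url, rng, max_steps=5):
    """Roll out a short trajectory; a success check decides whether to keep it."""
    traj = Trajectory(url=url, intent=propose_intent(url, rng))
    for step in range(max_steps):
        action = rng.choice(["click", "type", "scroll"])
        traj.steps.append((f"screenshot_{step}", action))
        if rng.random() < 0.4:  # stand-in for the paper's trajectory verifier
            traj.success = True
            break
    return traj

def synthesize(urls, seed=0):
    """Keep only verified-successful trajectories, mirroring the filtering step."""
    rng = random.Random(seed)
    return [t for url in urls for t in [explore(url, rng)] if t.success]

dataset = synthesize(["https://example.com/a", "https://example.com/b"] * 5)
```

The key design point this sketch illustrates is that scale comes from cheap, automated exploration plus a filter for success, rather than from human demonstration collection.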
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Towards Internet-Scale Training For Agents (2025)
- WebWalker: Benchmarking LLMs in Web Traversal (2025)
- InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection (2025)
- OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis (2024)
- QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search (2025)
- TRISHUL: Towards Region Identification and Screen Hierarchy Understanding for Large VLM based GUI Agents (2025)
- AppVLM: A Lightweight Vision Language Model for Online App Control (2025)