From screenshots to HTML code: Introducing the WebSight dataset

Published March 15, 2024

HugoLaurencon Hugo Laurençon

Leyo Leo Tronchon

VictorSanh Victor Sanh

In the world of web development, turning designs into functional websites usually involves a lot of coding and careful testing. What if we could simplify this process, making it possible to convert web designs into working websites more easily and quickly? WebSight is a new dataset designed to help build AI systems that convert screenshots into HTML code.

The challenge

Turning a website design or screenshot into HTML code usually needs an experienced developer. But what if this could be more efficient? Motivated by this question, we investigated how vision-language models (VLMs) could be used in web development to create low-code solutions that improve efficiency.

Today, the main challenge towards that goal is the lack of high-quality datasets tailored for this task. WebSight aims to fill that gap.

WebSight: A large synthetic dataset of screenshot/HTML code pairs

In January 2024, we introduced WebSight-v0.1, a synthetic dataset consisting of 823,000 pairs of HTML code and corresponding screenshots. This dataset is designed to train AI models to process and translate visual web designs into functional HTML code. By focusing on synthetic data, we've managed to bypass the noise and complexity often found in real-world HTML, allowing AI models to learn efficiently.

Following our initial release, and building on the community’s feedback, we've updated the dataset to WebSight-v0.2 with significant improvements: the screenshots now include real images, and the pages are styled with Tailwind CSS instead of traditional CSS. We also scaled the dataset to 2 million examples.
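To make the idea of a screenshot/HTML pair more concrete, here is a minimal sketch of what the HTML side of such a pair could look like when styled with Tailwind utility classes. The `ScreenshotHtmlPair` container, the `build_page` helper, the example image URL, and the use of the Tailwind Play CDN script are all assumptions made for illustration; they are not taken from the WebSight generation pipeline, and rendering the screenshot itself is left out.

```python
from dataclasses import dataclass
from html import escape
from typing import Optional

# Assumption: Tailwind is loaded through its Play CDN script.
# How WebSight pages actually load Tailwind is not stated in this post.
TAILWIND_CDN = '<script src="https://cdn.tailwindcss.com"></script>'


@dataclass
class ScreenshotHtmlPair:
    """Hypothetical container for one screenshot/HTML training example."""
    html: str
    # Path to the rendered screenshot; rendering is out of scope for this sketch.
    screenshot_path: Optional[str] = None


def build_page(title: str, body_text: str, image_url: str) -> str:
    """Return a small self-contained page styled only with Tailwind classes."""
    return f"""<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>{escape(title)}</title>
  {TAILWIND_CDN}
</head>
<body class="bg-gray-100 font-sans">
  <header class="bg-blue-600 text-white p-6">
    <h1 class="text-3xl font-bold">{escape(title)}</h1>
  </header>
  <main class="max-w-2xl mx-auto p-6">
    <img class="rounded-lg shadow mb-4" src="{escape(image_url)}" alt="{escape(title)}">
    <p class="text-gray-700">{escape(body_text)}</p>
  </main>
</body>
</html>"""


pair = ScreenshotHtmlPair(
    html=build_page(
        "Mountain Bakery",
        "Fresh bread & pastries daily.",
        "https://example.com/bakery.jpg",  # placeholder image URL
    )
)
print(pair.html)
```

Because the styling lives in `class` attributes rather than in a separate stylesheet, the whole page fits in a single HTML string, which is one plausible reason utility classes are convenient for this kind of dataset.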

Examples of web pages included in WebSight.

Sightseer: A model fine-tuned on WebSight

Leveraging the WebSight dataset, we’ve fine-tuned our forthcoming foundation vision-language model to obtain Sightseer, a model capable of converting webpage screenshots into functional HTML code. Sightseer additionally demonstrates the capability to incorporate images into the generated HTML that closely resemble those in the original screenshots.

Comparison of an original web page (input) on the left, and the rendering of the code generated by our model, Sightseer (output), on the right.

Towards more powerful tools unlocked by visual language models

By iterating on WebSight, our goal is to build more capable AI systems that simplify the process of turning UI designs into functional code. This could reduce iteration time for developers by rapidly transforming a paper UI sketch into functional code, while making the process more accessible to non-developers. This is one of the many real applications of visual language models. By open-sourcing WebSight, we encourage the community to work with us toward building more powerful tools for UI development.

Resources

Dataset: https://huggingface.co/datasets/HuggingFaceM4/WebSight
Technical report: http://arxiv.org/abs/2403.09029
Google colab: https://colab.research.google.com/drive/1LdamGKR2oacrDk-kYwz_Wfc1-RBUdzcO?usp=sharing
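
For readers who want to inspect a few examples without downloading the full dataset, a sketch along the following lines may help. It assumes the third-party `datasets` library is installed, that the configuration is named `v0.2`, and that the HTML is stored in a column called `text`; these names are assumptions and should be checked against the dataset card. The helpers `stream_examples` and `describe_example` are hypothetical names introduced here.

```python
from itertools import islice


def describe_example(example: dict, html_key: str = "text") -> dict:
    """Summarize one example: HTML length and the names of the other fields.

    Assumption: the HTML code lives under `html_key` (default "text").
    """
    html = example[html_key]
    return {
        "html_chars": len(html),
        "other_fields": sorted(k for k in example if k != html_key),
    }


def stream_examples(n: int, config: str = "v0.2") -> list:
    """Fetch the first `n` examples in streaming mode (requires network access).

    Assumption: `config` matches a configuration name on the dataset card.
    """
    # Imported lazily so the helper above stays usable without the library.
    from datasets import load_dataset

    ds = load_dataset(
        "HuggingFaceM4/WebSight", config, split="train", streaming=True
    )
    return list(islice(ds, n))
```

Calling `stream_examples(2)` and passing each result to `describe_example` gives a quick look at the fields of a pair before committing to a full download.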
