
PIXEL (Pixel-based Encoder of Language)

PIXEL is a language model trained to reconstruct masked image patches that contain rendered text. PIXEL was pretrained on the English Wikipedia and Bookcorpus (in total around 3.2B words) but can theoretically be finetuned on data in any written language that can be typeset on a computer screen because it operates on rendered text as opposed to using a tokenizer with a fixed vocabulary.

It is not currently possible to use the Hosted Inference API with PIXEL.

Paper: Language Modelling with Pixels

Codebase: https://github.com/xplip/pixel

Model description

PIXEL consists of three major components: a text renderer, which draws text as an image; an encoder, which encodes the unmasked regions of the rendered image; and a decoder, which reconstructs the masked regions at the pixel level. It is built on ViT-MAE.

During pretraining, the renderer produces images containing the training sentences. Patches of these images are linearly projected to obtain patch embeddings (rather than looked up in an embedding matrix, as in BERT), and 25% of the patches are masked out. The encoder, which is a Vision Transformer (ViT), then processes only the unmasked patches. The lightweight decoder, with hidden size 512 and 8 transformer layers, inserts learnable mask tokens into the encoder's output sequence and learns to reconstruct the raw pixel values at the masked positions.
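The masking step described above can be sketched in plain Python. This is a minimal illustration, not the PIXEL implementation: the toy image size, the patch size of 4, the helper names `split_into_patches` and `choose_masked`, and the uniform random sampling of masked patches are all assumptions made for the example, and the linear projection and the transformer layers are left out.

```python
import random

def split_into_patches(image, patch_size):
    """Split a 2D grid (list of rows) into flattened square patches, row-major."""
    height, width = len(image), len(image[0])
    assert height % patch_size == 0 and width % patch_size == 0
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([image[r][c]
                            for r in range(top, top + patch_size)
                            for c in range(left, left + patch_size)])
    return patches

def choose_masked(num_patches, mask_ratio, rng):
    """Pick the patch indices to hide (uniform sampling, an assumption here)."""
    num_masked = int(num_patches * mask_ratio)
    return set(rng.sample(range(num_patches), num_masked))

MASK = None  # placeholder standing in for the learnable mask token

# Hypothetical toy "rendered" image: 4 rows x 32 columns of grayscale values.
PATCH_SIZE = 4
image = [[(r * 32 + c) % 256 for c in range(32)] for r in range(4)]

patches = split_into_patches(image, PATCH_SIZE)   # 8 patches, 16 pixels each
masked = choose_masked(len(patches), 0.25, random.Random(0))  # 2 of 8 patches

# Encoder input: only the unmasked patches, tagged with their positions.
encoder_input = [(i, p) for i, p in enumerate(patches) if i not in masked]

# Decoder input: the full-length sequence with mask placeholders re-inserted.
# In the real model the visible entries would be encoder outputs; raw patches
# stand in for them in this sketch.
decoder_input = [MASK if i in masked else p for i, p in enumerate(patches)]

# Reconstruction targets: raw pixel values at the masked positions only.
targets = {i: patches[i] for i in masked}
```

Because the encoder sees only the visible patches, its input sequence is shorter than the full patch sequence; the decoder restores the original length before predicting pixels.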

After pretraining, the decoder can be discarded, leaving an 86M-parameter encoder on top of which task-specific classification heads can be stacked. Alternatively, the decoder can be retained and PIXEL can be used as a pixel-level generative language model (see Figures 3 and 6 in the paper for examples).
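To make "stacking a classification head" concrete, here is a minimal pure-Python sketch under stated assumptions: the encoder outputs are mean-pooled and passed through a single linear layer. The pooling choice, the toy dimensions, and the helper names `mean_pool` and `linear` are hypothetical and are not taken from the PIXEL codebase, which may build its heads differently.

```python
def mean_pool(hidden_states):
    """Average a sequence of equal-length vectors into a single vector."""
    count = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(vec[d] for vec in hidden_states) / count for d in range(dim)]

def linear(vector, weights, bias):
    """Compute y = W x + b, with W given as one row of weights per class."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

# Hypothetical encoder output: 3 patch positions, hidden size 2.
hidden_states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Hypothetical head parameters for a 2-class task.
weights = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.5]

pooled = mean_pool(hidden_states)       # [3.0, 4.0]
logits = linear(pooled, weights, bias)  # [3.0, 4.5]
predicted_class = max(range(len(logits)), key=logits.__getitem__)  # 1
```

During finetuning, the head parameters would be trained together with, or on top of, the pretrained encoder weights.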

For more details on how PIXEL works, please check the paper and the codebase linked above.

Intended uses

PIXEL is primarily intended to be finetuned on downstream NLP tasks. See the model hub for versions already finetuned on a task that interests you. Otherwise, see the PIXEL codebase on GitHub (linked above) to find out how to finetune PIXEL for your task.

How to use

Here is how to load PIXEL:

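# Assumes the `pixel` package from the codebase linked above is installed.
from pixel import PIXELConfig, PIXELForPreTraining

# Load the pretrained configuration, then the model weights (encoder and decoder).
config = PIXELConfig.from_pretrained("Team-PIXEL/pixel-base")
model = PIXELForPreTraining.from_pretrained("Team-PIXEL/pixel-base", config=config)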

Citing and Contact Author

@article{rust-etal-2022-pixel,
  title={Language Modelling with Pixels},
  author={Phillip Rust and Jonas F. Lotz and Emanuele Bugliarello and Elizabeth Salesky and Miryam de Lhoneux and Desmond Elliott},
  journal={arXiv preprint},
  year={2022},
  url={https://arxiv.org/abs/2207.06991}
}

GitHub: @xplip

Twitter: @rust_phillip
