LayoutLMv3
Overview
The LayoutLMv3 model was proposed in LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking by Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei. LayoutLMv3 simplifies LayoutLMv2 by using patch embeddings (as in ViT) instead of leveraging a CNN backbone, and pre-trains the model on 3 objectives: masked language modeling (MLM), masked image modeling (MIM) and word-patch alignment (WPA).
The abstract from the paper is the following:
Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis.
Tips:
- In terms of data processing, LayoutLMv3 is identical to its predecessor LayoutLMv2, except that:
  - images need to be resized and normalized with channels in regular RGB format. LayoutLMv2, on the other hand, normalizes the images internally and expects the channels in BGR format.
  - text is tokenized using byte-pair encoding (BPE), as opposed to WordPiece.
  Due to these differences in data preprocessing, one can use LayoutLMv3Processor, which internally combines a LayoutLMv3FeatureExtractor (for the image modality) and a LayoutLMv3Tokenizer/LayoutLMv3TokenizerFast (for the text modality) to prepare all data for the model.
- Regarding usage of LayoutLMv3Processor, we refer to the usage guide of its predecessor.
- Demo notebooks for LayoutLMv3 can be found here.
This model was contributed by nielsr. The original code can be found here.
LayoutLMv3Config
class transformers.LayoutLMv3Config
< source >( vocab_size = 50265 hidden_size = 768 num_hidden_layers = 12 num_attention_heads = 12 intermediate_size = 3072 hidden_act = 'gelu' hidden_dropout_prob = 0.1 attention_probs_dropout_prob = 0.1 max_position_embeddings = 512 type_vocab_size = 2 initializer_range = 0.02 layer_norm_eps = 1e-05 pad_token_id = 1 bos_token_id = 0 eos_token_id = 2 max_2d_position_embeddings = 1024 coordinate_size = 128 shape_size = 128 has_relative_attention_bias = True rel_pos_bins = 32 max_rel_pos = 128 rel_2d_pos_bins = 64 max_rel_2d_pos = 256 has_spatial_attention_bias = True text_embed = True visual_embed = True input_size = 224 num_channels = 3 patch_size = 16 classifier_dropout = None **kwargs )
Parameters
- vocab_size (int, optional, defaults to 50265) — Vocabulary size of the LayoutLMv3 model. Defines the number of different tokens that can be represented by the input_ids passed when calling LayoutLMv3Model.
- hidden_size (int, optional, defaults to 768) — Dimension of the encoder layers and the pooler layer.
- num_hidden_layers (int, optional, defaults to 12) — Number of hidden layers in the Transformer encoder.
- num_attention_heads (int, optional, defaults to 12) — Number of attention heads for each attention layer in the Transformer encoder.
- intermediate_size (int, optional, defaults to 3072) — Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer encoder.
- hidden_act (str or function, optional, defaults to "gelu") — The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "selu" and "gelu_new" are supported.
- hidden_dropout_prob (float, optional, defaults to 0.1) — The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
- attention_probs_dropout_prob (float, optional, defaults to 0.1) — The dropout ratio for the attention probabilities.
- max_position_embeddings (int, optional, defaults to 512) — The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
- type_vocab_size (int, optional, defaults to 2) — The vocabulary size of the token_type_ids passed when calling LayoutLMv3Model.
- initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
- layer_norm_eps (float, optional, defaults to 1e-5) — The epsilon used by the layer normalization layers.
- max_2d_position_embeddings (int, optional, defaults to 1024) — The maximum value that the 2D position embedding might ever be used with. Typically set this to something large just in case (e.g., 1024).
- coordinate_size (int, optional, defaults to 128) — Dimension of the coordinate embeddings.
- shape_size (int, optional, defaults to 128) — Dimension of the width and height embeddings.
- has_relative_attention_bias (bool, optional, defaults to True) — Whether or not to use a relative attention bias in the self-attention mechanism.
- rel_pos_bins (int, optional, defaults to 32) — The number of relative position bins to be used in the self-attention mechanism.
- max_rel_pos (int, optional, defaults to 128) — The maximum number of relative positions to be used in the self-attention mechanism.
- max_rel_2d_pos (int, optional, defaults to 256) — The maximum number of relative 2D positions in the self-attention mechanism.
- rel_2d_pos_bins (int, optional, defaults to 64) — The number of 2D relative position bins in the self-attention mechanism.
- has_spatial_attention_bias (bool, optional, defaults to True) — Whether or not to use a spatial attention bias in the self-attention mechanism.
- visual_embed (bool, optional, defaults to True) — Whether or not to add patch embeddings.
- input_size (int, optional, defaults to 224) — The size (resolution) of the images.
- num_channels (int, optional, defaults to 3) — The number of channels of the images.
- patch_size (int, optional, defaults to 16) — The size (resolution) of the patches.
- classifier_dropout (float, optional) — The dropout ratio for the classification head.
This is the configuration class to store the configuration of a LayoutLMv3Model. It is used to instantiate a LayoutLMv3 model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the LayoutLMv3 microsoft/layoutlmv3-base architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Example:
>>> from transformers import LayoutLMv3Model, LayoutLMv3Config
>>> # Initializing a LayoutLMv3 microsoft/layoutlmv3-base style configuration
>>> configuration = LayoutLMv3Config()
>>> # Initializing a model from the microsoft/layoutlmv3-base style configuration
>>> model = LayoutLMv3Model(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
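The input_size and patch_size parameters together determine how many image patches the model embeds. The following is a minimal sketch of that arithmetic in plain Python; the helper name patch_grid is hypothetical and not part of the library, and the sketch assumes square images whose side is an exact multiple of the patch size.

```python
from typing import Tuple


def patch_grid(input_size: int = 224, patch_size: int = 16) -> Tuple[int, int]:
    """Return (patches_per_side, total_patches) for a square image.

    Hypothetical helper for illustration only; it is not a library function.
    """
    if input_size % patch_size != 0:
        raise ValueError("input_size is assumed to be divisible by patch_size")
    per_side = input_size // patch_size
    return per_side, per_side * per_side


# With the default configuration values (224 and 16):
per_side, total = patch_grid()
print(per_side, total)  # 14 196
```

With the defaults, a 224x224 image is split into a 14x14 grid, so 196 patches are embedded per image.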
LayoutLMv3FeatureExtractor
class transformers.LayoutLMv3FeatureExtractor
< source >( do_resize = True size = 224 resample = <Resampling.BILINEAR: 2> do_normalize = True image_mean = None image_std = None apply_ocr = True ocr_lang = None **kwargs )
Parameters
- do_resize (bool, optional, defaults to True) — Whether to resize the input to a certain size.
- size (int or Tuple(int), optional, defaults to 224) — Resize the input to the given size. If a tuple is provided, it should be (width, height). If only an integer is provided, then the input will be resized to (size, size). Only has an effect if do_resize is set to True.
- resample (int, optional, defaults to PIL.Image.BILINEAR) — An optional resampling filter. This can be one of PIL.Image.NEAREST, PIL.Image.BOX, PIL.Image.BILINEAR, PIL.Image.HAMMING, PIL.Image.BICUBIC or PIL.Image.LANCZOS. Only has an effect if do_resize is set to True.
- do_normalize (bool, optional, defaults to True) — Whether or not to normalize the input with mean and standard deviation.
- image_mean (List[int], defaults to [0.5, 0.5, 0.5]) — The sequence of means for each channel, to be used when normalizing images.
- image_std (List[int], defaults to [0.5, 0.5, 0.5]) — The sequence of standard deviations for each channel, to be used when normalizing images.
- apply_ocr (bool, optional, defaults to True) — Whether to apply the Tesseract OCR engine to get words + normalized bounding boxes.
- ocr_lang (Optional[str], optional) — The language, specified by its ISO code, to be used by the Tesseract OCR engine. By default, English is used.
LayoutLMv3FeatureExtractor uses Google’s Tesseract OCR engine under the hood.
Constructs a LayoutLMv3 feature extractor. This can be used to resize + normalize document images, as well as to apply OCR on them in order to get a list of words and normalized bounding boxes.
This feature extractor inherits from PreTrainedFeatureExtractor, which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
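The normalization step controlled by do_normalize, image_mean and image_std can be illustrated on a single pixel. This is a rough sketch in plain Python, not the library's implementation: the function name normalize_pixel is hypothetical, and the sketch assumes pixel values are first rescaled from the 0-255 range to 0-1 before the mean and standard deviation are applied.

```python
from typing import List


def normalize_pixel(rgb: List[int], mean: List[float], std: List[float]) -> List[float]:
    """Normalize one RGB pixel channel by channel.

    Assumption: values are rescaled from [0, 255] to [0, 1] first,
    then shifted by the mean and divided by the standard deviation.
    """
    scaled = [channel / 255.0 for channel in rgb]
    return [(value - m) / s for value, m, s in zip(scaled, mean, std)]


# Default mean and std of 0.5 map the range [0, 1] onto [-1, 1].
print(normalize_pixel([0, 255, 255], [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]))
# [-1.0, 1.0, 1.0]
```

Under these assumptions, the darkest channel value maps to -1.0 and the brightest to 1.0.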
__call__
< source >( images: typing.Union[PIL.Image.Image, numpy.ndarray, ForwardRef('torch.Tensor'), typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[ForwardRef('torch.Tensor')]] return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None **kwargs ) → BatchFeature
Parameters
- images (PIL.Image.Image, np.ndarray, torch.Tensor, List[PIL.Image.Image], List[np.ndarray], List[torch.Tensor]) — The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch tensor. In case of a NumPy array/PyTorch tensor, each image should be of shape (C, H, W), where C is the number of channels, and H and W are the image height and width.
- return_tensors (str or TensorType, optional, defaults to 'np') — If set, will return tensors of a particular framework. Acceptable values are:
  - 'tf': Return TensorFlow tf.constant objects.
  - 'pt': Return PyTorch torch.Tensor objects.
  - 'np': Return NumPy np.ndarray objects.
  - 'jax': Return JAX jnp.ndarray objects.
Returns
A BatchFeature with the following fields:
- pixel_values — Pixel values to be fed to a model, of shape (batch_size, num_channels, height, width).
- words — Optional words as identified by Tesseract OCR (only when LayoutLMv3FeatureExtractor was initialized with apply_ocr set to True).
- boxes — Optional bounding boxes as identified by Tesseract OCR, normalized based on the image size (only when LayoutLMv3FeatureExtractor was initialized with apply_ocr set to True).
Main method to prepare for the model one or several image(s).
Examples:
>>> from transformers import LayoutLMv3FeatureExtractor
>>> from PIL import Image
>>> image = Image.open("name_of_your_document - can be a png file, pdf, etc.").convert("RGB")
>>> # option 1: with apply_ocr=True (default)
>>> feature_extractor = LayoutLMv3FeatureExtractor()
>>> encoding = feature_extractor(image, return_tensors="pt")
>>> print(encoding.keys())
>>> # dict_keys(['pixel_values', 'words', 'boxes'])
>>> # option 2: with apply_ocr=False
>>> feature_extractor = LayoutLMv3FeatureExtractor(apply_ocr=False)
>>> encoding = feature_extractor(image, return_tensors="pt")
>>> print(encoding.keys())
>>> # dict_keys(['pixel_values'])
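When apply_ocr=False, words and bounding boxes have to be supplied by the caller, and the boxes are expected on a 0-1000 scale relative to the image size. The following sketch shows one common way to do that conversion from pixel coordinates. The helper name normalize_box and the (x0, y0, x1, y1) box layout are assumptions made for illustration.

```python
from typing import List


def normalize_box(box: List[int], width: int, height: int) -> List[int]:
    """Scale a pixel-space box (x0, y0, x1, y1) to the 0-1000 range.

    Hypothetical helper: x coordinates are scaled by the image width,
    y coordinates by the image height, and results are truncated to int.
    """
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]


# A box on a 400x800 (width x height) page image.
print(normalize_box([50, 100, 200, 150], width=400, height=800))
# [125, 125, 500, 187]
```

Because the scale is relative, the same normalized boxes remain valid after the image itself is resized to the model's input size.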
LayoutLMv3Tokenizer
class transformers.LayoutLMv3Tokenizer
< source >( vocab_file merges_file errors = 'replace' bos_token = '<s>' eos_token = '</s>' sep_token = '</s>' cls_token = '<s>' unk_token = '<unk>' pad_token = '<pad>' mask_token = '<mask>' add_prefix_space = True cls_token_box = [0, 0, 0, 0] sep_token_box = [0, 0, 0, 0] pad_token_box = [0, 0, 0, 0] pad_token_label = -100 only_label_first_subword = True **kwargs )
Parameters
- vocab_file (str) — Path to the vocabulary file.
- merges_file (str) — Path to the merges file.
- errors (str, optional, defaults to "replace") — Paradigm to follow when decoding bytes to UTF-8. See bytes.decode for more information.
- bos_token (str, optional, defaults to "<s>") — The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token. When building a sequence using special tokens, this is not the token that is used for the beginning of sequence. The token used is the cls_token.
- eos_token (str, optional, defaults to "</s>") — The end of sequence token. When building a sequence using special tokens, this is not the token that is used for the end of sequence. The token used is the sep_token.
- sep_token (str, optional, defaults to "</s>") — The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.
- cls_token (str, optional, defaults to "<s>") — The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.
- unk_token (str, optional, defaults to "<unk>") — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.
- pad_token (str, optional, defaults to "<pad>") — The token used for padding, for example when batching sequences of different lengths.
- mask_token (str, optional, defaults to "<mask>") — The token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict.
- add_prefix_space (bool, optional, defaults to True) — Whether or not to add an initial space to the input. This allows the leading word to be treated just like any other word. (The RoBERTa tokenizer detects the beginning of words by the preceding space.)
- cls_token_box (List[int], optional, defaults to [0, 0, 0, 0]) — The bounding box to use for the special [CLS] token.
- sep_token_box (List[int], optional, defaults to [0, 0, 0, 0]) — The bounding box to use for the special [SEP] token.
- pad_token_box (List[int], optional, defaults to [0, 0, 0, 0]) — The bounding box to use for the special [PAD] token.
- pad_token_label (int, optional, defaults to -100) — The label to use for padding tokens. Defaults to -100, which is the ignore_index of PyTorch’s CrossEntropyLoss.
- only_label_first_subword (bool, optional, defaults to True) — Whether or not to only label the first subword, in case word labels are provided.
Construct a LayoutLMv3 tokenizer. Based on RobertaTokenizer (Byte Pair Encoding or BPE).
LayoutLMv3Tokenizer can be used to turn words, word-level bounding boxes and optional word labels to token-level input_ids, attention_mask, token_type_ids, bbox, and optional labels (for token classification).
This tokenizer inherits from PreTrainedTokenizer which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
LayoutLMv3Tokenizer runs end-to-end tokenization based on byte-pair encoding (BPE). It also turns the word-level bounding boxes into token-level bounding boxes.
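The conversion from word-level boxes and labels to token-level ones can be sketched in plain Python. This is an illustration under stated assumptions rather than the tokenizer's actual code: toy_subwords is a hypothetical stand-in for BPE that simply cuts words into three-character pieces, and expand_to_tokens is a hypothetical helper. The sketch copies each word's box to all of its pieces and, when only_label_first_subword is enabled, labels only the first piece while the rest receive the pad_token_label of -100.

```python
from typing import List, Tuple

PAD_TOKEN_LABEL = -100  # matches the documented default of pad_token_label


def toy_subwords(word: str, size: int = 3) -> List[str]:
    """Hypothetical splitter: fixed-size chunks stand in for real BPE pieces."""
    return [word[i:i + size] for i in range(0, len(word), size)]


def expand_to_tokens(
    words: List[str],
    boxes: List[List[int]],
    word_labels: List[int],
    only_label_first_subword: bool = True,
) -> Tuple[List[str], List[List[int]], List[int]]:
    """Expand word-level boxes and labels to one entry per subword piece."""
    tokens, token_boxes, token_labels = [], [], []
    for word, box, label in zip(words, boxes, word_labels):
        for i, piece in enumerate(toy_subwords(word)):
            tokens.append(piece)
            token_boxes.append(box)  # every piece inherits the word's box
            if i == 0 or not only_label_first_subword:
                token_labels.append(label)
            else:
                token_labels.append(PAD_TOKEN_LABEL)  # ignored by the loss
    return tokens, token_boxes, token_labels


tokens, token_boxes, token_labels = expand_to_tokens(
    ["Invoice", "42"],
    [[10, 10, 200, 40], [210, 10, 260, 40]],
    [1, 2],
)
print(tokens)        # ['Inv', 'oic', 'e', '42']
print(token_labels)  # [1, -100, -100, 2]
```

Labeling only the first piece keeps one prediction per word during training, which makes word-level evaluation straightforward.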
__call__
< source >( text: typing.Union[str, typing.List[str], typing.List[typing.List[str]]] text_pair: typing.Union[typing.List[str], typing.List[typing.List[str]], NoneType] = None boxes: typing.Union[typing.List[typing.List[int]], typing.List[typing.List[typing.List[int]]]] = None word_labels: typing.Union[typing.List[int], typing.List[typing.List[int]], NoneType] = None add_special_tokens: bool = True padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = False truncation: typing.Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy] = False max_length: typing.Optional[int] = None stride: int = 0 pad_to_multiple_of: typing.Optional[int] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None return_token_type_ids: typing.Optional[bool] = None return_attention_mask: typing.Optional[bool] = None return_overflowing_tokens: bool = False return_special_tokens_mask: bool = False return_offsets_mapping: bool = False return_length: bool = False verbose: bool = True **kwargs )
Parameters
- text (str, List[str], List[List[str]]) — The sequence or batch of sequences to be encoded. Each sequence can be a string, a list of strings (words of a single example or questions of a batch of examples) or a list of list of strings (batch of words).
- text_pair (List[str], List[List[str]]) — The sequence or batch of sequences to be encoded. Each sequence should be a list of strings (pretokenized string).
- boxes (List[List[int]], List[List[List[int]]]) — Word-level bounding boxes. Each bounding box should be normalized to be on a 0-1000 scale.
- word_labels (List[int], List[List[int]], optional) — Word-level integer labels (for token classification tasks such as FUNSD, CORD).
- add_special_tokens (bool, optional, defaults to True) — Whether or not to encode the sequences with the special tokens relative to their model.
- padding (bool, str or PaddingStrategy, optional, defaults to False) — Activates and controls padding. Accepts the following values:
  - True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
  - 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
  - False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).
- truncation (bool, str or TruncationStrategy, optional, defaults to False) — Activates and controls truncation. Accepts the following values:
  - True or 'longest_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided.
  - 'only_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
  - 'only_second': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
  - False or 'do_not_truncate' (default): No truncation (i.e., can output batch with sequence lengths greater than the model maximum admissible input size).
- max_length (int, optional) — Controls the maximum length to use by one of the truncation/padding parameters. If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet), truncation/padding to a maximum length will be deactivated.
- stride (int, optional, defaults to 0) — If set to a number along with max_length, the overflowing tokens returned when return_overflowing_tokens=True will contain some tokens from the end of the truncated sequence returned to provide some overlap between truncated and overflowing sequences. The value of this argument defines the number of overlapping tokens.
- pad_to_multiple_of (int, optional) — If set, will pad the sequence to a multiple of the provided value. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.0 (Volta).
- return_tensors (str or TensorType, optional) — If set, will return tensors instead of list of python integers. Acceptable values are:
  - 'tf': Return TensorFlow tf.constant objects.
  - 'pt': Return PyTorch torch.Tensor objects.
  - 'np': Return Numpy np.ndarray objects.
Main method to tokenize and prepare for the model one or several sequence(s) or one or several pair(s) of sequences with word-level normalized bounding boxes and optional labels.
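The interaction between max_length, stride and overflowing tokens can be pictured as a sliding window over the token sequence. The sketch below is a deliberately simplified model in plain Python: the function overlapping_windows is hypothetical, it ignores special tokens and sequence pairs, and it assumes each overflow window repeats the last stride tokens of the previous window. The real tokenizer's output layout may differ in details.

```python
from typing import List


def overlapping_windows(token_ids: List[int], max_length: int, stride: int) -> List[List[int]]:
    """Split a long sequence into windows of at most max_length tokens.

    Simplified model: consecutive windows share `stride` tokens.
    """
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    step = max_length - stride  # number of new tokens each window contributes
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # the current window already reaches the end
        start += step
    return windows


print(overlapping_windows(list(range(10)), max_length=4, stride=2))
# [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

A larger stride gives each window more shared context at the cost of producing more windows for the same document.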
LayoutLMv3TokenizerFast
class transformers.LayoutLMv3TokenizerFast
< source >( vocab_file = None merges_file = None tokenizer_file = None errors = 'replace' bos_token = '<s>' eos_token = '</s>' sep_token = '</s>' cls_token = '<s>' unk_token = '<unk>' pad_token = '<pad>' mask_token = '<mask>' add_prefix_space = True trim_offsets = True cls_token_box = [0, 0, 0, 0] sep_token_box = [0, 0, 0, 0] pad_token_box = [0, 0, 0, 0] pad_token_label = -100 only_label_first_subword = True **kwargs )
Parameters
- vocab_file (str) — Path to the vocabulary file.
- merges_file (str) — Path to the merges file.
- errors (str, optional, defaults to "replace") — Paradigm to follow when decoding bytes to UTF-8. See bytes.decode for more information.
- bos_token (str, optional, defaults to "<s>") — The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token. When building a sequence using special tokens, this is not the token that is used for the beginning of sequence. The token used is the cls_token.
- eos_token (str, optional, defaults to "</s>") — The end of sequence token. When building a sequence using special tokens, this is not the token that is used for the end of sequence. The token used is the sep_token.
- sep_token (str, optional, defaults to "</s>") — The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.
- cls_token (str, optional, defaults to "<s>") — The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.
- unk_token (str, optional, defaults to "<unk>") — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.
- pad_token (str, optional, defaults to "<pad>") — The token used for padding, for example when batching sequences of different lengths.
- mask_token (str, optional, defaults to "<mask>") — The token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict.
- add_prefix_space (bool, optional, defaults to True) — Whether or not to add an initial space to the input. This allows the leading word to be treated just like any other word. (The RoBERTa tokenizer detects the beginning of words by the preceding space.)
- trim_offsets (bool, optional, defaults to True) — Whether the post processing step should trim offsets to avoid including whitespaces.
- cls_token_box (List[int], optional, defaults to [0, 0, 0, 0]) — The bounding box to use for the special [CLS] token.
- sep_token_box (List[int], optional, defaults to [0, 0, 0, 0]) — The bounding box to use for the special [SEP] token.
- pad_token_box (List[int], optional, defaults to [0, 0, 0, 0]) — The bounding box to use for the special [PAD] token.
- pad_token_label (int, optional, defaults to -100) — The label to use for padding tokens. Defaults to -100, which is the ignore_index of PyTorch’s CrossEntropyLoss.
- only_label_first_subword (bool, optional, defaults to True) — Whether or not to only label the first subword, in case word labels are provided.
Construct a “fast” LayoutLMv3 tokenizer (backed by HuggingFace’s tokenizers library). Based on BPE.
This tokenizer inherits from PreTrainedTokenizerFast which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
__call__
< source >( text: typing.Union[str, typing.List[str], typing.List[typing.List[str]]] text_pair: typing.Union[typing.List[str], typing.List[typing.List[str]], NoneType] = None boxes: typing.Union[typing.List[typing.List[int]], typing.List[typing.List[typing.List[int]]]] = None word_labels: typing.Union[typing.List[int], typing.List[typing.List[int]], NoneType] = None add_special_tokens: bool = True padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = False truncation: typing.Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy] = False max_length: typing.Optional[int] = None stride: int = 0 pad_to_multiple_of: typing.Optional[int] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None return_token_type_ids: typing.Optional[bool] = None return_attention_mask: typing.Optional[bool] = None return_overflowing_tokens: bool = False return_special_tokens_mask: bool = False return_offsets_mapping: bool = False return_length: bool = False verbose: bool = True **kwargs )
Parameters
- text (str, List[str], List[List[str]]) — The sequence or batch of sequences to be encoded. Each sequence can be a string, a list of strings (words of a single example or questions of a batch of examples) or a list of list of strings (batch of words).
- text_pair (List[str], List[List[str]]) — The sequence or batch of sequences to be encoded. Each sequence should be a list of strings (pretokenized string).
- boxes (List[List[int]], List[List[List[int]]]) — Word-level bounding boxes. Each bounding box should be normalized to be on a 0-1000 scale.
- word_labels (List[int], List[List[int]], optional) — Word-level integer labels (for token classification tasks such as FUNSD, CORD).
- add_special_tokens (bool, optional, defaults to True) — Whether or not to encode the sequences with the special tokens relative to their model.
- padding (bool, str or PaddingStrategy, optional, defaults to False) — Activates and controls padding. Accepts the following values:
  - True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
  - 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
  - False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).
- truncation (bool, str or TruncationStrategy, optional, defaults to False) — Activates and controls truncation. Accepts the following values:
  - True or 'longest_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided.
  - 'only_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
  - 'only_second': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
  - False or 'do_not_truncate' (default): No truncation (i.e., can output batch with sequence lengths greater than the model maximum admissible input size).
- max_length (int, optional) — Controls the maximum length to use by one of the truncation/padding parameters. If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet), truncation/padding to a maximum length will be deactivated.
- stride (int, optional, defaults to 0) — If set to a number along with max_length, the overflowing tokens returned when return_overflowing_tokens=True will contain some tokens from the end of the truncated sequence returned to provide some overlap between truncated and overflowing sequences. The value of this argument defines the number of overlapping tokens.
- pad_to_multiple_of (int, optional) — If set, will pad the sequence to a multiple of the provided value. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.0 (Volta).
- return_tensors (str or TensorType, optional) — If set, will return tensors instead of list of python integers. Acceptable values are:
  - 'tf': Return TensorFlow tf.constant objects.
  - 'pt': Return PyTorch torch.Tensor objects.
  - 'np': Return Numpy np.ndarray objects.
- add_special_tokens (bool, optional, defaults to True) — Whether or not to encode the sequences with the special tokens relative to their model.
Main method to tokenize and prepare for the model one or several sequence(s) or one or several pair(s) of sequences with word-level normalized bounding boxes and optional labels.
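The truncation strategies and the stride parameter described above can be illustrated with plain Python lists. This is a minimal sketch, not the tokenizer's implementation: truncate_pair and windows_with_stride are hypothetical helper names, special tokens and bounding boxes are ignored, and the tie-breaking in 'longest_first' is an assumption of the sketch.

```python
from typing import List, Tuple


def truncate_pair(first: List[str], second: List[str], max_length: int,
                  strategy: str = "longest_first") -> Tuple[List[str], List[str]]:
    """Shorten a pair of token lists until their combined length fits max_length."""
    first, second = list(first), list(second)
    while len(first) + len(second) > max_length:
        if strategy == "only_first":
            if not first:
                raise ValueError("first sequence is exhausted")
            first.pop()
        elif strategy == "only_second":
            if not second:
                raise ValueError("second sequence is exhausted")
            second.pop()
        elif strategy == "longest_first":
            # Drop one token at a time from whichever list is currently longer
            # (ties go to the second list in this sketch).
            (first if len(first) > len(second) else second).pop()
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return first, second


def windows_with_stride(tokens: List[str], max_length: int, stride: int = 0) -> List[List[str]]:
    """Split tokens into chunks of at most max_length that overlap by `stride` tokens."""
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    step = max_length - stride
    chunks, start = [], 0
    while True:
        chunks.append(tokens[start:start + max_length])
        if start + max_length >= len(tokens):
            break
        start += step
    return chunks


question = ["q1", "q2", "q3"]
words = ["w1", "w2", "w3", "w4", "w5"]
print(truncate_pair(question, words, max_length=6, strategy="only_second"))
print(windows_with_stride(list("abcdefgh"), max_length=4, stride=1))
```

For question answering over a document, 'only_second' is the usual choice in such a setup, since it keeps the question intact and shortens only the document words.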
LayoutLMv3Processor
class transformers.LayoutLMv3Processor
( *args, **kwargs )
Parameters
- feature_extractor (LayoutLMv3FeatureExtractor) — An instance of LayoutLMv3FeatureExtractor. The feature extractor is a required input.
- tokenizer (LayoutLMv3Tokenizer or LayoutLMv3TokenizerFast) — An instance of LayoutLMv3Tokenizer or LayoutLMv3TokenizerFast. The tokenizer is a required input.
Constructs a LayoutLMv3 processor which combines a LayoutLMv3 feature extractor and a LayoutLMv3 tokenizer into a single processor.
LayoutLMv3Processor offers all the functionalities you need to prepare data for the model.
It first uses LayoutLMv3FeatureExtractor to resize and normalize document images, and optionally applies OCR to get words and normalized bounding boxes. These are then provided to LayoutLMv3Tokenizer or LayoutLMv3TokenizerFast, which turns the words and bounding boxes into token-level input_ids, attention_mask, token_type_ids and bbox. Optionally, one can provide integer word_labels, which are turned into token-level labels for token classification tasks (such as FUNSD, CORD).
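How word-level labels become token-level labels can be sketched as follows. This is an illustration only: toy_subword_split is a hypothetical stand-in for the real BPE tokenizer, and the sketch assumes the common convention of labeling only the first sub-token of each word and masking the others with -100 (the index that PyTorch's cross-entropy loss ignores by default).

```python
from typing import Callable, List, Tuple

IGNORE_INDEX = -100  # assumed mask value; ignored by torch.nn.CrossEntropyLoss by default


def toy_subword_split(word: str, piece_len: int = 4) -> List[str]:
    """Hypothetical stand-in for BPE: cut a word into fixed-size character chunks."""
    return [word[i:i + piece_len] for i in range(0, len(word), piece_len)] or [word]


def align_word_labels(words: List[str], word_labels: List[int],
                      split: Callable[[str], List[str]] = toy_subword_split
                      ) -> Tuple[List[str], List[int]]:
    """Expand one label per word into one label per sub-token."""
    tokens: List[str] = []
    token_labels: List[int] = []
    for word, label in zip(words, word_labels):
        pieces = split(word)
        tokens.extend(pieces)
        # Keep the label on the first piece, mask the remaining pieces.
        token_labels.extend([label] + [IGNORE_INDEX] * (len(pieces) - 1))
    return tokens, token_labels


tokens, token_labels = align_word_labels(["Invoice", "No", "12345678"], [1, 1, 2])
print(tokens)
print(token_labels)
```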
__call__
( images, text: typing.Union[str, typing.List[str], typing.List[typing.List[str]]] = None, text_pair: typing.Union[typing.List[str], typing.List[typing.List[str]], NoneType] = None, boxes: typing.Union[typing.List[typing.List[int]], typing.List[typing.List[typing.List[int]]]] = None, word_labels: typing.Union[typing.List[int], typing.List[typing.List[int]], NoneType] = None, add_special_tokens: bool = True, padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = False, truncation: typing.Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy] = False, max_length: typing.Optional[int] = None, stride: int = 0, pad_to_multiple_of: typing.Optional[int] = None, return_token_type_ids: typing.Optional[bool] = None, return_attention_mask: typing.Optional[bool] = None, return_overflowing_tokens: bool = False, return_special_tokens_mask: bool = False, return_offsets_mapping: bool = False, return_length: bool = False, verbose: bool = True, return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None, **kwargs )
This method first forwards the images argument to the __call__() method of LayoutLMv3FeatureExtractor. In case LayoutLMv3FeatureExtractor was initialized with apply_ocr set to True, it passes the obtained words and bounding boxes along with the additional arguments to the __call__() method of the tokenizer and returns the output, together with resized and normalized pixel_values. In case LayoutLMv3FeatureExtractor was initialized with apply_ocr set to False, it passes the words (text/text_pair) and boxes specified by the user along with the additional arguments to the __call__() method of the tokenizer and returns the output, together with resized and normalized pixel_values.
Please refer to the docstrings of the above two methods for more information.
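When apply_ocr is set to False, the boxes passed by the user must already be normalized. A minimal sketch of that step follows, assuming the 0-1000 scale conventionally used with the LayoutLM family; normalize_box is a hypothetical helper, not part of the library, and the chosen scale must stay below config.max_2d_position_embeddings.

```python
from typing import List, Sequence


def normalize_box(box: Sequence[float], width: float, height: float,
                  scale: int = 1000) -> List[int]:
    """Map a pixel-space (x0, y0, x1, y1) box to integers in [0, scale]."""
    x0, y0, x1, y1 = box
    raw = [scale * x0 / width, scale * y0 / height,
           scale * x1 / width, scale * y1 / height]
    # Clamp so that boxes slightly outside the page still yield valid indices.
    return [min(max(int(v), 0), scale) for v in raw]


# A word box on a 400 x 800 pixel page.
print(normalize_box((50, 100, 200, 150), width=400, height=800))
```

The box stays in (x0, y0, x1, y1) order, with (x0, y0) as the upper left corner and (x1, y1) as the lower right corner.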
LayoutLMv3Model
class transformers.LayoutLMv3Model
( config )
Parameters
- config (LayoutLMv3Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare LayoutLMv3 Model transformer outputting raw hidden-states without any specific head on top. This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
(
input_ids = None
bbox = None
attention_mask = None
token_type_ids = None
position_ids = None
head_mask = None
inputs_embeds = None
pixel_values = None
output_attentions = None
output_hidden_states = None
return_dict = None
)
→ transformers.modeling_outputs.BaseModelOutput or tuple(torch.FloatTensor)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary.
  Indices can be obtained using LayoutLMv3Tokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- bbox (torch.LongTensor of shape (batch_size, sequence_length, 4), optional) — Bounding boxes of each input sequence token. Selected in the range [0, config.max_2d_position_embeddings-1]. Each bounding box should be a normalized version in (x0, y0, x1, y1) format, where (x0, y0) corresponds to the position of the upper left corner in the bounding box, and (x1, y1) represents the position of the lower right corner.
- pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Batch of document images.
- attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
- token_type_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]:
  - 0 corresponds to a sentence A token,
  - 1 corresponds to a sentence B token.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].
- head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]:
  - 1 indicates the head is not masked,
  - 0 indicates the head is masked.
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
- output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.modeling_outputs.BaseModelOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.BaseModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (LayoutLMv3Config) and inputs.
- last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
  Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
  Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The LayoutLMv3Model forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Examples:
>>> from transformers import AutoProcessor, AutoModel
>>> from datasets import load_dataset
>>> processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
>>> model = AutoModel.from_pretrained("microsoft/layoutlmv3-base")
>>> dataset = load_dataset("nielsr/funsd-layoutlmv3", split="train")
>>> example = dataset[0]
>>> image = example["image"]
>>> words = example["tokens"]
>>> boxes = example["bboxes"]
>>> encoding = processor(image, words, boxes=boxes, return_tensors="pt")
>>> outputs = model(**encoding)
>>> last_hidden_states = outputs.last_hidden_state
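The relationship between padding and the attention_mask input described above can be sketched without any library. The helper below is hypothetical, and pad_id=0 is a placeholder rather than the model's actual padding token id; it pads a batch to its longest sequence, optionally rounds the length up to a multiple, and marks real tokens with 1 and padding with 0.

```python
from typing import List, Optional, Tuple


def pad_batch(batch: List[List[int]], pad_id: int = 0,
              pad_to_multiple_of: Optional[int] = None
              ) -> Tuple[List[List[int]], List[List[int]]]:
    """Pad every sequence to a common length and build the matching attention mask."""
    target = max(len(seq) for seq in batch)
    if pad_to_multiple_of:
        # Round the target length up to the next multiple.
        target = -(-target // pad_to_multiple_of) * pad_to_multiple_of
    input_ids = [seq + [pad_id] * (target - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (target - len(seq)) for seq in batch]
    return input_ids, attention_mask


ids, mask = pad_batch([[5, 6, 7], [8]], pad_to_multiple_of=4)
print(ids)
print(mask)
```

In practice the processor produces both tensors; the sketch only shows what the 1 and 0 values in the mask stand for.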
LayoutLMv3ForSequenceClassification
class transformers.LayoutLMv3ForSequenceClassification
( config )
Parameters
- config (LayoutLMv3Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
LayoutLMv3 Model with a sequence classification head on top (a linear layer on top of the final hidden state of the [CLS] token), e.g. for document image classification tasks such as the RVL-CDIP dataset.
This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
(
input_ids = None
attention_mask = None
token_type_ids = None
position_ids = None
head_mask = None
inputs_embeds = None
labels = None
output_attentions = None
output_hidden_states = None
return_dict = None
bbox = None
pixel_values = None
)
→ transformers.modeling_outputs.SequenceClassifierOutput or tuple(torch.FloatTensor)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary.
  Indices can be obtained using LayoutLMv3Tokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- bbox (torch.LongTensor of shape (batch_size, sequence_length, 4), optional) — Bounding boxes of each input sequence token. Selected in the range [0, config.max_2d_position_embeddings-1]. Each bounding box should be a normalized version in (x0, y0, x1, y1) format, where (x0, y0) corresponds to the position of the upper left corner in the bounding box, and (x1, y1) represents the position of the lower right corner.
- pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Batch of document images.
- attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
- token_type_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]:
  - 0 corresponds to a sentence A token,
  - 1 corresponds to a sentence B token.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].
- head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]:
  - 1 indicates the head is not masked,
  - 0 indicates the head is masked.
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
- output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.modeling_outputs.SequenceClassifierOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.SequenceClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (LayoutLMv3Config) and inputs.
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification (or regression if config.num_labels==1) loss.
- logits (torch.FloatTensor of shape (batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
  Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
  Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The LayoutLMv3ForSequenceClassification forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Examples:
>>> from transformers import AutoProcessor, AutoModelForSequenceClassification
>>> from datasets import load_dataset
>>> import torch
>>> processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
>>> model = AutoModelForSequenceClassification.from_pretrained("microsoft/layoutlmv3-base")
>>> dataset = load_dataset("nielsr/funsd-layoutlmv3", split="train")
>>> example = dataset[0]
>>> image = example["image"]
>>> words = example["tokens"]
>>> boxes = example["bboxes"]
>>> encoding = processor(image, words, boxes=boxes, return_tensors="pt")
>>> sequence_label = torch.tensor([1])
>>> outputs = model(**encoding, labels=sequence_label)
>>> loss = outputs.loss
>>> logits = outputs.logits
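Because the returned logits are scores before SoftMax, a prediction still has to be derived from them. The following is a minimal sketch in plain Python for a single document; the label names in id2label are made up for illustration and would normally come from the fine-tuned model's configuration.

```python
import math
from typing import Dict, List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits: List[float], id2label: Dict[int, str]) -> Tuple[str, float]:
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Hypothetical label mapping for a three-class document classifier.
id2label = {0: "letter", 1: "form", 2: "invoice"}
label, prob = predict_label([2.0, 0.5, -1.0], id2label)
print(label, round(prob, 3))
```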
LayoutLMv3ForTokenClassification
class transformers.LayoutLMv3ForTokenClassification
( config )
Parameters
- config (LayoutLMv3Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
LayoutLMv3 Model with a token classification head on top (a linear layer on top of the final hidden states), e.g. for sequence labeling (information extraction) tasks such as FUNSD, SROIE, CORD and Kleister-NDA.
This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
(
input_ids = None
bbox = None
attention_mask = None
token_type_ids = None
position_ids = None
head_mask = None
inputs_embeds = None
labels = None
output_attentions = None
output_hidden_states = None
return_dict = None
pixel_values = None
)
→ transformers.modeling_outputs.TokenClassifierOutput or tuple(torch.FloatTensor)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary.
  Indices can be obtained using LayoutLMv3Tokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- bbox (torch.LongTensor of shape (batch_size, sequence_length, 4), optional) — Bounding boxes of each input sequence token. Selected in the range [0, config.max_2d_position_embeddings-1]. Each bounding box should be a normalized version in (x0, y0, x1, y1) format, where (x0, y0) corresponds to the position of the upper left corner in the bounding box, and (x1, y1) represents the position of the lower right corner.
- pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Batch of document images.
- attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
- token_type_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]:
  - 0 corresponds to a sentence A token,
  - 1 corresponds to a sentence B token.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].
- head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]:
  - 1 indicates the head is not masked,
  - 0 indicates the head is masked.
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
- output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
- labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the token classification loss. Indices should be in [0, ..., config.num_labels - 1].
Returns
transformers.modeling_outputs.TokenClassifierOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.TokenClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (LayoutLMv3Config) and inputs.
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification loss.
- logits (torch.FloatTensor of shape (batch_size, sequence_length, config.num_labels)) — Classification scores (before SoftMax).
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
  Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
  Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The LayoutLMv3ForTokenClassification forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Examples:
>>> from transformers import AutoProcessor, AutoModelForTokenClassification
>>> from datasets import load_dataset
>>> processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
>>> model = AutoModelForTokenClassification.from_pretrained("microsoft/layoutlmv3-base", num_labels=7)
>>> dataset = load_dataset("nielsr/funsd-layoutlmv3", split="train")
>>> example = dataset[0]
>>> image = example["image"]
>>> words = example["tokens"]
>>> boxes = example["bboxes"]
>>> word_labels = example["ner_tags"]
>>> encoding = processor(image, words, boxes=boxes, word_labels=word_labels, return_tensors="pt")
>>> outputs = model(**encoding)
>>> loss = outputs.loss
>>> logits = outputs.logits
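The token-level logits can be turned back into one prediction per word. The sketch below works on plain lists for a single example and assumes that positions to skip (special tokens, padding, non-first sub-tokens) carry the label -100 in the encoding; decode_word_predictions and the label names are hypothetical.

```python
from typing import Dict, List


def decode_word_predictions(token_logits: List[List[float]], token_labels: List[int],
                            id2label: Dict[int, str], ignore_index: int = -100) -> List[str]:
    """Take the argmax at every position whose label is not masked out."""
    predictions = []
    for scores, label in zip(token_logits, token_labels):
        if label == ignore_index:
            continue  # skip masked positions
        best = max(range(len(scores)), key=scores.__getitem__)
        predictions.append(id2label[best])
    return predictions


# Hypothetical three-class tag set and logits for four token positions.
id2label = {0: "O", 1: "B-QUESTION", 2: "B-ANSWER"}
token_logits = [[0.1, 0.9, 0.0], [0.8, 0.1, 0.1], [0.2, 0.3, 2.0], [1.5, 0.2, 0.1]]
token_labels = [-100, 0, 2, -100]
print(decode_word_predictions(token_logits, token_labels, id2label))
```

With real model outputs, the same loop would run over outputs.logits and encoding["labels"] after converting them to lists.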
LayoutLMv3ForQuestionAnswering
class transformers.LayoutLMv3ForQuestionAnswering
( config )
Parameters
- config (LayoutLMv3Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
LayoutLMv3 Model with a span classification head on top for extractive question-answering tasks such as DocVQA (a linear layer on top of the text part of the hidden-states output to compute span start logits and span end logits).
This model is a PyTorch torch.nn.Module sub-class. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
(
input_ids = None
attention_mask = None
token_type_ids = None
position_ids = None
head_mask = None
inputs_embeds = None
start_positions = None
end_positions = None
output_attentions = None
output_hidden_states = None
return_dict = None
bbox = None
pixel_values = None
)
→ transformers.modeling_outputs.QuestionAnsweringModelOutput or tuple(torch.FloatTensor)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary.
  Indices can be obtained using LayoutLMv3Tokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- bbox (torch.LongTensor of shape (batch_size, sequence_length, 4), optional) — Bounding boxes of each input sequence token. Selected in the range [0, config.max_2d_position_embeddings-1]. Each bounding box should be a normalized version in (x0, y0, x1, y1) format, where (x0, y0) corresponds to the position of the upper left corner in the bounding box, and (x1, y1) represents the position of the lower right corner.
- pixel_values (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Batch of document images.
- attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
- token_type_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]:
  - 0 corresponds to a sentence A token,
  - 1 corresponds to a sentence B token.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].
- head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]:
  - 1 indicates the head is not masked,
  - 0 indicates the head is masked.
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
- output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
- start_positions (torch.LongTensor of shape (batch_size,), optional) — Labels for position (index) of the start of the labelled span for computing the token classification loss. Positions are clamped to the length of the sequence (sequence_length). Positions outside of the sequence are not taken into account for computing the loss.
- end_positions (torch.LongTensor of shape (batch_size,), optional) — Labels for position (index) of the end of the labelled span for computing the token classification loss. Positions are clamped to the length of the sequence (sequence_length). Positions outside of the sequence are not taken into account for computing the loss.
Returns
transformers.modeling_outputs.QuestionAnsweringModelOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.QuestionAnsweringModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (LayoutLMv3Config) and inputs.
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.
- start_logits (torch.FloatTensor of shape (batch_size, sequence_length)) — Span-start scores (before SoftMax).
- end_logits (torch.FloatTensor of shape (batch_size, sequence_length)) — Span-end scores (before SoftMax).
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
  Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
  Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The LayoutLMv3ForQuestionAnswering forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Examples:
>>> from transformers import AutoProcessor, AutoModelForQuestionAnswering
>>> from datasets import load_dataset
>>> import torch
>>> processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
>>> model = AutoModelForQuestionAnswering.from_pretrained("microsoft/layoutlmv3-base")
>>> dataset = load_dataset("nielsr/funsd-layoutlmv3", split="train")
>>> example = dataset[0]
>>> image = example["image"]
>>> question = "what's his name?"
>>> words = example["tokens"]
>>> boxes = example["bboxes"]
>>> encoding = processor(image, question, words, boxes=boxes, return_tensors="pt")
>>> start_positions = torch.tensor([1])
>>> end_positions = torch.tensor([3])
>>> outputs = model(**encoding, start_positions=start_positions, end_positions=end_positions)
>>> loss = outputs.loss
>>> start_scores = outputs.start_logits
>>> end_scores = outputs.end_logits
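At inference time, an answer span has to be chosen from start_logits and end_logits. One common approach, sketched here for a single example with plain lists, is to pick the pair (start, end) with the highest summed score subject to start <= end and a maximum span length. The helper name and the max_answer_len default are assumptions of this sketch, not part of the model API.

```python
from typing import List, Tuple


def best_span(start_logits: List[float], end_logits: List[float],
              max_answer_len: int = 30) -> Tuple[int, int, float]:
    """Exhaustively search for the highest-scoring valid (start, end) pair."""
    best, best_score = (0, 0), float("-inf")
    for start, start_score in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit.
        for end in range(start, min(start + max_answer_len, len(end_logits))):
            score = start_score + end_logits[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best[0], best[1], best_score


start, end, score = best_span([0.1, 2.0, 0.3, 0.0], [0.0, 0.5, 0.2, 3.0])
print(start, end)
```

The resulting indices refer to token positions in input_ids, so the answer text can be recovered by decoding the tokens from start to end inclusive.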