Metadata Might Make Language Models Better
Abstract
This paper discusses the benefits of including metadata when training language models on historical collections. Using 19th-century newspapers as a case study, we extend the time-masking approach proposed by Rosin et al. (2022) and compare different strategies for inserting temporal, political and geographical information into a Masked Language Model. After fine-tuning several DistilBERT models on the enhanced input data, we provide a systematic evaluation of these models on a set of evaluation tasks: pseudo-perplexity, metadata mask-filling and supervised classification. We find that showing relevant metadata to a language model has a beneficial impact and may even produce more robust and fairer models.
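The paper itself defines the exact input format; as a rough illustration only, the sketch below shows one common way to prepend metadata (year, political leaning, place of publication) as extra tokens to newspaper text before masked-language-model fine-tuning of DistilBERT with the Hugging Face `transformers` library. The bracketed token scheme, the example metadata values and the masking probability are assumptions for illustration, not the authors' setup.

```python
# Illustrative sketch only (not the authors' code): prepend metadata tokens to
# each newspaper snippet before masked-language-model fine-tuning of DistilBERT.
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
)

MODEL_NAME = "distilbert-base-uncased"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)

# Hypothetical metadata tokens for year, political leaning and place of
# publication; added as ordinary vocabulary entries so the MLM collator can
# also mask them, loosely echoing the time-masking idea of Rosin et al. (2022).
tokenizer.add_tokens(["[1855]", "[liberal]", "[london]"])
model.resize_token_embeddings(len(tokenizer))

def with_metadata(text: str, year: str, politics: str, place: str) -> str:
    """Prefix the raw newspaper text with its metadata tokens."""
    return f"[{year}] [{politics.lower()}] [{place.lower()}] {text}"

example = with_metadata(
    "The corn laws were debated at great length in the Commons.",
    year="1855", politics="Liberal", place="London",
)

# mlm_probability is raised above the usual 0.15 only so this one-sentence
# demo almost always masks at least one token.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.3)
encoding = tokenizer(example, return_tensors="pt", truncation=True)
batch = collator([{k: v.squeeze(0) for k, v in encoding.items()}])

loss = model(**batch).loss  # one step's loss; a Trainer would be used in practice
print("MLM loss on one metadata-enhanced example:", loss.item())
```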
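Pseudo-perplexity, one of the evaluation tasks listed in the abstract, can be computed by masking each token in turn and averaging the negative log-likelihood of the original token, in the spirit of Salazar et al. (2020). The unoptimised sketch below uses an off-the-shelf DistilBERT checkpoint and a made-up sentence purely to show the mechanics; the paper's evaluation would score fine-tuned models on held-out 19th-century text.

```python
# Illustrative pseudo-perplexity scoring: mask each position in turn and
# average the negative log-likelihood of the original token.
import math

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased").eval()

@torch.no_grad()
def pseudo_perplexity(text: str) -> float:
    input_ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    positions = range(1, input_ids.size(0) - 1)  # skip [CLS] and [SEP]
    total_nll = 0.0
    for i in positions:
        masked = input_ids.clone()
        masked[i] = tokenizer.mask_token_id
        logits = model(masked.unsqueeze(0)).logits[0, i]
        total_nll -= torch.log_softmax(logits, dim=-1)[input_ids[i]].item()
    return math.exp(total_nll / len(positions))

# With a fine-tuned checkpoint, comparing scores on text with and without a
# metadata prefix is one way to probe whether the prefix helps.
print(pseudo_perplexity("[1855] [liberal] [london] The corn laws were debated at length."))
```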
Community
A paper that takes a similar approach. Glad to see more people working on this topic!
Ah nice! Thanks for the suggestion :-)
This is an automated message from the Librarian Bot. I found the following papers, recommended by the Semantic Scholar API, that are similar to the one you just shared:
- Improving Domain-Specific Retrieval by NLI Fine-Tuning (2023)
- Leveraging Contextual Information for Effective Entity Salience Detection (2023)
- MultiSChuBERT: Effective Multimodal Fusion for Scholarly Document Quality Prediction (2023)
- AlbNER: A Corpus for Named Entity Recognition in Albanian (2023)
- ToddlerBERTa: Exploiting BabyBERTa for Grammar Learning and Language Understanding (2023)
Models citing this paper 4
Datasets citing this paper 0