Kansallisarkisto/court-records-htr

Handwritten text recognition for Finnish 19th century court records

The model performs handwritten text recognition on text line images. It was trained by fine-tuning Microsoft's TrOCR model on digitized 19th-century court records written in Finnish and Swedish.

Intended uses & limitations

The model has been trained to recognize handwritten text from a specific type of 19th century data, and may generalize poorly to other datasets.

The model takes text line images as input; using other types of input is not recommended.

How to use

The model can be used to predict the text content of text line images with the code below. Using a GPU for inference is recommended if one is available.

from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image
import torch

# Use GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Model location on the Hugging Face Hub
model_checkpoint = "Kansallisarkisto/court-records-htr"
# Path to textline image
line_image_path = "/path/to/textline_image.jpg"

# Initialize processor and model
processor = TrOCRProcessor.from_pretrained(model_checkpoint)
model = VisionEncoderDecoderModel.from_pretrained(model_checkpoint).to(device)

# Open image file and extract pixel values
image = Image.open(line_image_path).convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# Use the model to generate predictions 
generated_ids = model.generate(pixel_values.to(device))
# Use the processor to decode ids to text
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
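
The snippet above handles a single image. When many text line images need to be transcribed, they can be passed to the processor in batches. The following is a minimal sketch under stated assumptions, not part of the original model card: the helper names `chunked` and `recognize_lines` and the default `batch_size=8` are hypothetical, and `processor`, `model` and `device` are expected to be the objects created in the code above.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def recognize_lines(image_paths, processor, model, device, batch_size=8):
    """Return one predicted string per text line image, in input order.

    Hypothetical helper: `processor`, `model` and `device` are assumed to be
    the TrOCRProcessor, VisionEncoderDecoderModel and torch.device objects
    initialized in the usage example.
    """
    # Imported here so that `chunked` stays usable without Pillow installed.
    from PIL import Image

    texts = []
    for batch_paths in chunked(image_paths, batch_size):
        images = []
        for path in batch_paths:
            # convert() returns a loaded copy, so the file can be closed.
            with Image.open(path) as img:
                images.append(img.convert("RGB"))
        # The processor accepts a list of images and stacks them into a batch.
        pixel_values = processor(images=images, return_tensors="pt").pixel_values
        generated_ids = model.generate(pixel_values.to(device))
        texts.extend(
            processor.batch_decode(generated_ids, skip_special_tokens=True)
        )
    return texts
```

A call such as `recognize_lines(["line1.jpg", "line2.jpg"], processor, model, device)` would then return a list with one string per image. A suitable batch size depends on the available GPU memory.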

By default, the model downloaded from the Hugging Face Hub is cached locally in ~/.cache/huggingface/hub/.
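
The cache location can differ from the default path when certain environment variables are set. The sketch below approximates the lookup order as described in the `huggingface_hub` documentation (`HF_HUB_CACHE`, then `HF_HOME`, then `XDG_CACHE_HOME`). The function `default_hub_cache` is a hypothetical illustration, not a library call, and it ignores older variable names that the library may still honor.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def default_hub_cache(env: Optional[Mapping[str, str]] = None) -> Path:
    """Approximate where downloaded models are cached (illustrative only)."""
    env = os.environ if env is None else env
    # An explicit hub cache directory takes precedence.
    if env.get("HF_HUB_CACHE"):
        return Path(os.path.expanduser(env["HF_HUB_CACHE"]))
    # Otherwise the hub cache is the "hub" folder under HF_HOME.
    if env.get("HF_HOME"):
        return Path(os.path.expanduser(env["HF_HOME"])) / "hub"
    # Fall back to <cache root>/huggingface/hub, with ~/.cache as the root.
    cache_root = env.get("XDG_CACHE_HOME") or "~/.cache"
    return Path(os.path.expanduser(cache_root)) / "huggingface" / "hub"
```

Alternatively, a directory can be chosen per call by passing the `cache_dir` argument to `from_pretrained`.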

Training data

The model was trained on 314 228 text line images from 19th-century court records; the validation dataset contained 39 042 text line images.

Training procedure

This model was trained on an NVIDIA RTX A6000 GPU with the following hyperparameters:

train batch size: 24
epochs: 13
optimizer: AdamW
maximum length of text sequence: 64

For other parameters, the default values were used (find more information here). The training code is available in the train_trocr.py code file.
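
For a rough sense of the training length, the number of optimizer steps can be estimated from the figures stated above. This is back-of-the-envelope arithmetic, not taken from the training code: it assumes one optimizer step per batch (no gradient accumulation), a single device, and that the final partial batch of each epoch is kept.

```python
import math

TRAIN_LINES = 314_228  # training text line images (from this model card)
BATCH_SIZE = 24        # train batch size (from this model card)
EPOCHS = 13            # number of epochs (from this model card)

# Assumption: the last, smaller batch of each epoch is not dropped.
steps_per_epoch = math.ceil(TRAIN_LINES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

print(steps_per_epoch, total_steps)  # 13093 170209
```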

Evaluation results

Evaluation results using the validation dataset are listed below:

Validation loss	Validation CER	Validation WER
0.248	0.024	0.113

The metrics were calculated using the Evaluate library. More information on the CER metric can be found here. More information on the WER metric can be found here.
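
To illustrate what the two error rates measure, the sketch below computes them in plain Python as the total edit distance divided by the total reference length, at character level for CER and at word level for WER. It is a simplified illustration and not the Evaluate library's implementation; the function names `edit_distance` and `error_rate` are hypothetical.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(predictions: List[str], references: List[str],
               unit: str = "char") -> float:
    """Total edits over total reference length; unit is "char" or "word"."""
    edits = 0
    total = 0
    for pred, ref in zip(predictions, references):
        p = list(pred) if unit == "char" else pred.split()
        r = list(ref) if unit == "char" else ref.split()
        edits += edit_distance(r, p)
        total += len(r)
    if total == 0:
        raise ValueError("references must not be empty")
    return edits / total


refs = ["käräjät pidettiin"]
preds = ["käräjät pidetiin"]  # one missing character
print(error_rate(preds, refs, "char"))  # 1 edit over 17 reference characters
print(error_rate(preds, refs, "word"))  # 1 wrong word out of 2
```

Read this way, a validation CER of 0.024 corresponds to roughly 2.4 character edits per 100 reference characters, and a WER of 0.113 to roughly 11 word edits per 100 reference words.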