SpanMarker

This is a SpanMarker model for Named Entity Recognition (NER), trained on the Legal NER Indian Justice dataset.

Official repository of the model: Github Link

Model Details

Model Description

Model Type: SpanMarker
Maximum Sequence Length: 128 tokens
Maximum Entity Length: 6 words
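To make the maximum entity length concrete, the sketch below enumerates every candidate span of up to 6 words in a short word list. It is a simplified illustration and not SpanMarker's internal code: `candidate_spans` is a hypothetical helper, the example words are taken from the inference sentence further down, and the 128-token sequence limit is not modeled.

```python
from typing import Iterator, List, Tuple

MAX_ENTITY_LENGTH = 6  # words, as listed in the model description


def candidate_spans(
    words: List[str], max_len: int = MAX_ENTITY_LENGTH
) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) word indices, end exclusive, for spans of at most max_len words."""
    for start in range(len(words)):
        last_end = min(start + max_len, len(words))
        for end in range(start + 1, last_end + 1):
            yield start, end


# Hypothetical pre-split input; real tokenization is handled by the model.
words = "registered with Sub Registrar , Chandigarh on 20.12.2007".split()
spans = list(candidate_spans(words))
print(len(spans), "candidate spans for", len(words), "words")
```

Under this reading, an entity mention longer than 6 words never appears among the candidates, so it cannot be predicted as a single entity.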

Model Sources

Repository: SpanMarker on GitHub

Thesis: SpanMarker For Named Entity Recognition

Uses

Direct Use for Inference

from span_marker import SpanMarkerModel
from span_marker.tokenizer import SpanMarkerTokenizer


# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("lambdavi/span-marker-luke-legal")
tokenizer = SpanMarkerTokenizer.from_pretrained("roberta-base", config=model.config)
model.set_tokenizer(tokenizer)

# Run inference
text = (
    "The petition was filed through Sh. Vijay Pahwa, General Power of Attorney "
    "and it was asserted in the petition under Section 13-B of the Rent Act "
    "that 1 of 23 50% share of the demised premises had been purchased by the "
    "landlord from Sh. Vinod Malhotra vide sale deed No.4226 registered on "
    "20.12.2007 with Sub Registrar, Chandigarh."
)
entities = model.predict(text)
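For a single input string, `predict` returns a list of dictionaries, one per detected entity, with keys such as `span`, `label`, and `score`. The sketch below shows one way to filter and group such a list. `group_by_label` is a hypothetical helper, and the sample entries (labels and scores included) are illustrative placeholders, not actual output of this model.

```python
from collections import defaultdict
from typing import Any, Dict, List


def group_by_label(
    entities: List[Dict[str, Any]], min_score: float = 0.5
) -> Dict[str, List[str]]:
    """Group entity text by label, dropping predictions below min_score."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            grouped[entity["label"]].append(entity["span"])
    return dict(grouped)


# Placeholder predictions in the shape of the predict() output (values made up).
sample_entities = [
    {"span": "20.12.2007", "label": "DATE", "score": 0.98},
    {"span": "Chandigarh", "label": "GPE", "score": 0.91},
    {"span": "Sub Registrar", "label": "ORG", "score": 0.32},
]
print(group_by_label(sample_entities))
```

The 0.5 threshold is an arbitrary choice for the example; a suitable value depends on how precision and recall should be traded off for the task.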

Downstream Use

You can finetune this model on your own dataset.

from datasets import load_dataset
from span_marker import SpanMarkerModel, Trainer
from span_marker.tokenizer import SpanMarkerTokenizer


# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("lambdavi/span-marker-luke-legal")
tokenizer = SpanMarkerTokenizer.from_pretrained("roberta-base", config=model.config)
model.set_tokenizer(tokenizer)

# Specify a Dataset with "tokens" and "ner_tags" columns
dataset = load_dataset("conll2003") # For example CoNLL2003

# Initialize a Trainer using the pretrained model & dataset
trainer = Trainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
trainer.save_model("lambdavi/span-marker-luke-legal-finetuned")
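The SpanMarker `Trainer` reads a `tokens` column (a list of words per sentence) and a `ner_tags` column (one integer label id per word). As a rough sanity check before training on your own data, rows can be validated with a small helper like the one below. `check_row` is a hypothetical function, and the label ids in the sample rows are placeholders rather than this model's label scheme.

```python
from typing import Any, Dict, List


def check_row(row: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one dataset row (empty if none)."""
    problems = []
    for column in ("tokens", "ner_tags"):
        if column not in row:
            problems.append(f"missing column: {column}")
    if problems:
        return problems
    if len(row["tokens"]) != len(row["ner_tags"]):
        problems.append("tokens and ner_tags differ in length")
    if not all(isinstance(tag, int) for tag in row["ner_tags"]):
        problems.append("ner_tags must be integer label ids")
    return problems


# Placeholder rows; the tag ids are made up for illustration.
good_row = {"tokens": ["Sub", "Registrar", ",", "Chandigarh"], "ner_tags": [0, 0, 0, 1]}
bad_row = {"tokens": ["Sub", "Registrar"], "ner_tags": [0]}
print(check_row(good_row))
print(check_row(bad_row))
```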

Training Details

Training Set Metrics

Training set	Min	Median	Max
Sentence length	3	44.5113	2795
Entities per sentence	0	2.7232	68

Training Hyperparameters

learning_rate: 0.0001
train_batch_size: 8
eval_batch_size: 8
seed: 42
gradient_accumulation_steps: 2
total_train_batch_size: 16
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_ratio: 0.06
num_epochs: 5
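The arithmetic behind these settings can be sketched as follows. The effective batch size is the per-device batch size times the gradient accumulation steps, and the linear scheduler warms up over the first 6% of steps before decaying to zero. This is a plain-Python approximation of that schedule, not the training code itself; it assumes a single device and takes the total step count (9185) from the last row of the Training Results table.

```python
import math

LEARNING_RATE = 1e-4
TRAIN_BATCH_SIZE = 8
GRAD_ACCUM_STEPS = 2
WARMUP_RATIO = 0.06
TOTAL_STEPS = 9185  # assumption: last logged step equals the total optimizer steps

# 8 samples per step, accumulated over 2 steps, gives 16 samples per update.
effective_batch_size = TRAIN_BATCH_SIZE * GRAD_ACCUM_STEPS
warmup_steps = math.ceil(TOTAL_STEPS * WARMUP_RATIO)


def linear_lr(step: int) -> float:
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        return LEARNING_RATE * step / warmup_steps
    remaining = max(0, TOTAL_STEPS - step)
    return LEARNING_RATE * remaining / (TOTAL_STEPS - warmup_steps)


print("effective batch size:", effective_batch_size)
print("warmup steps:", warmup_steps)
for step in (0, warmup_steps, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, linear_lr(step))
```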

Training Results

Epoch	Step	Validation Loss	Validation Precision	Validation Recall	Validation F1	Validation Accuracy
0.9997	1837	0.0137	0.7773	0.7994	0.7882	0.9577
2.0	3675	0.0090	0.8751	0.8348	0.8545	0.9697
2.9997	5512	0.0077	0.8777	0.8959	0.8867	0.9770
4.0	7350	0.0061	0.8941	0.9083	0.9011	0.9811
4.9986	9185	0.0064	0.9090	0.9110	0.9100	0.9824
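The validation F1 column is consistent with the harmonic mean of the precision and recall columns. The short check below recomputes it for each row; the numbers are copied from the table above, and `f1_score` is a helper defined here for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) per evaluation, copied from the results table.
rows = [
    (0.7773, 0.7994, 0.7882),
    (0.8751, 0.8348, 0.8545),
    (0.8777, 0.8959, 0.8867),
    (0.8941, 0.9083, 0.9011),
    (0.9090, 0.9110, 0.9100),
]

for precision, recall, reported in rows:
    print(f"computed={f1_score(precision, recall):.4f} reported={reported:.4f}")
```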

Metric	Value
f1-exact	0.9237
f1-strict	0.9100
f1-partial	0.9365
f1-type-match	0.9277

Framework Versions

Python: 3.10.12
SpanMarker: 1.5.0
Transformers: 4.36.0
PyTorch: 2.0.0
Datasets: 2.17.1
Tokenizers: 0.15.0

Citation

BibTeX

@software{Aarsen_SpanMarker,
    author = {Aarsen, Tom},
    license = {Apache-2.0},
    title = {{SpanMarker for Named Entity Recognition}},
    url = {https://github.com/tomaarsen/SpanMarkerNER}
}
