---
library_name: span-marker
tags:
- span-marker
- token-classification
- ner
- named-entity-recognition
- muppet-roberta-large-ner
datasets:
- DFKI-SLT/few-nerd
metrics:
- precision
- recall
- f1
widget:
- text: >-
    His name was Radu-Sebastian Amarie, and was building IJW trying to figure
    out how to properly extract entities from raw data. He's from Romania and
    he's eager to watch Dune 2.
  example_title: Random few NERD examples.
- text: >-
    The Alabama Supreme Court effectively halted in vitro fertilization at
    several state hospitals and caused a massive nationwide backlash when it
    ruled last week in a wrongful death case that frozen embryos used in IVF
    are considered people. Dr. Paula Amato, the president of the American
    Society for Reproductive Medicine, said in a press release it was a
    mistake to conflate frozen fertilized eggs with embryos developing within
    a mother.
  example_title: News
pipeline_tag: token-classification
license: cc-by-sa-4.0
language:
- en
model-index:
- name: >-
    SpanMarker w. facebook/muppet-roberta-large on fine-grained, supervised
    FewNERD
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: fine-grained, supervised FewNERD
      type: DFKI-SLT/few-nerd
      config: supervised
      split: test
      revision: 6f0944f5a1d47c359b4f5de03ed1d58c98f297b5
    metrics:
    - type: f1
      value: 0.705678
      name: F1
    - type: precision
      value: 0.701648
      name: Precision
    - type: recall
      value: 0.709755
      name: Recall
---
# SpanMarker

This is a SpanMarker model trained on the DFKI-SLT/few-nerd dataset that can be used for Named Entity Recognition. Training was done on an NVIDIA RTX 4090 in approximately 8 hours, although the final selected checkpoint comes from before the halfway point of training.
## Training and Validation Metrics

The current model corresponds to checkpoint step 25,000.
## Test Set Evaluation

The following are test-set results for some manually selected checkpoints around the step noted above; a minimal sketch for reproducing such an evaluation follows the table.
| Checkpoint (step) | Precision | Recall | F1 | Accuracy | Runtime (s) | Samples/s |
|------------------:|----------:|-------:|---:|---------:|------------:|----------:|
| 17000 | 0.706066 | 0.691239 | 0.698574 | 0.926213 | 335.172 | 123.474 |
| 18000 | 0.695331 | 0.700382 | 0.697847 | 0.926372 | 301.435 | 137.293 |
| 19000 | 0.70618 | 0.693775 | 0.699923 | 0.926492 | 301.032 | 137.477 |
| 20000 | 0.700665 | 0.701572 | 0.701118 | 0.927128 | 299.706 | 138.085 |
| 21000 | 0.706467 | 0.695591 | 0.700987 | 0.926318 | 299.62 | 138.125 |
| 22000 | 0.698079 | 0.710756 | 0.704361 | 0.928094 | 300.041 | 137.931 |
| 24000 | 0.709286 | 0.695769 | 0.702463 | 0.926329 | 300.339 | 137.794 |
| 25000 | 0.701648 | 0.709755 | 0.705678 | 0.92792 | 299.905 | 137.994 |
| 26000 | 0.702509 | 0.708147 | 0.705317 | 0.927998 | 301.161 | 137.418 |
| 27000 | 0.707315 | 0.698796 | 0.703029 | 0.926493 | 299.692 | 138.092 |
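Below is a minimal sketch of how such a test-set evaluation could be run with the SpanMarker `Trainer`. It is not the exact script used to produce the table: the column renaming and the `TrainingArguments` shown here are assumptions based on the fine-grained FewNERD examples in the SpanMarker documentation.

```python
from datasets import load_dataset
from transformers import TrainingArguments
from span_marker import SpanMarkerModel, Trainer

# Load the supervised FewNERD test split and use the fine-grained tags as "ner_tags",
# mirroring the fine-grained FewNERD recipe from the SpanMarker documentation.
test_dataset = load_dataset("DFKI-SLT/few-nerd", "supervised", split="test")
test_dataset = test_dataset.remove_columns("ner_tags").rename_column("fine_ner_tags", "ner_tags")

model = SpanMarkerModel.from_pretrained("eek/span-marker-muppet-roberta-large-fewnerd-fine-super")
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="eval-output", per_device_eval_batch_size=8),
    eval_dataset=test_dataset,
)
print(trainer.evaluate())  # overall precision, recall, F1 and accuracy on the test split
```
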
## Model Details

### Model Description

- **Model Type:** SpanMarker
- **Encoder:** [facebook/muppet-roberta-large](https://huggingface.co/facebook/muppet-roberta-large)
- **Maximum Sequence Length:** 256 tokens
- **Maximum Entity Length:** 6 words
- **Training Dataset:** [DFKI-SLT/few-nerd](https://huggingface.co/datasets/DFKI-SLT/few-nerd)
- **Language:** en
- **License:** cc-by-sa-4.0
### Useful Links

- Training was done with the SpanMarker `Trainer`, which can be found here: [SpanMarker on GitHub](https://github.com/tomaarsen/SpanMarkerNER)
## Uses

### Direct Use for Inference

```python
from span_marker import SpanMarkerModel

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("eek/span-marker-muppet-roberta-large-fewnerd-fine-super")
# Run inference
entities = model.predict("His name was Radu.")
```
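In the SpanMarker library, `predict` returns a list of dictionaries. A small sketch of iterating over them follows; the exact keys and the example label in the comment are assumptions based on the library's documentation, not output captured from this model.

```python
# Each prediction is roughly of the form:
# {"span": "Radu", "label": "person-other", "score": 0.98,
#  "char_start_index": 13, "char_end_index": 17}
for entity in entities:
    print(f"{entity['span']} -> {entity['label']} ({entity['score']:.2f})")
```
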
Alternatively, the model can be used directly in spaCy via the `span_marker` pipeline component:

```python
import spacy

nlp = spacy.load("en_core_web_sm", exclude=["ner"])
nlp.add_pipe("span_marker", config={"model": "eek/span-marker-muppet-roberta-large-fewnerd-fine-super"})

text = """Cleopatra VII, also known as Cleopatra the Great, was the last active ruler of the \
Ptolemaic Kingdom of Egypt. She was born in 69 BCE and ruled Egypt from 51 BCE until her \
death in 30 BCE."""
doc = nlp(text)
print([(entity, entity.label_) for entity in doc.ents])
```
## Training Details

### Framework Versions

- Python: 3.10.13
- SpanMarker: 1.5.0
- Transformers: 4.36.2
- PyTorch: 2.2.1+cu121
- Datasets: 2.18.0
- Tokenizers: 0.15.2
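A small sketch for checking that a local environment matches these versions (it assumes each of these packages exposes a `__version__` attribute):

```python
import datasets
import span_marker
import tokenizers
import torch
import transformers

# Compare against the versions listed above (SpanMarker 1.5.0, Transformers 4.36.2, ...).
for pkg in (span_marker, transformers, torch, datasets, tokenizers):
    print(f"{pkg.__name__}: {pkg.__version__}")
```
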
### Training Arguments

```python
# Assumed import: the SpanMarker examples use transformers' TrainingArguments.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="models/span-marker-muppet-roberta-large-fewnerd-fine-super",
    learning_rate=1e-5,
    gradient_accumulation_steps=2,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=8,
    evaluation_strategy="steps",
    save_strategy="steps",
    save_steps=1000,
    eval_steps=500,
    push_to_hub=False,
    logging_steps=50,
    fp16=True,
    warmup_ratio=0.1,
    dataloader_num_workers=1,
    load_best_model_at_end=True,
)
```
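For context, here is a minimal sketch of how these arguments could be plugged into the SpanMarker `Trainer`, following the fine-grained FewNERD recipe from the SpanMarker documentation. It is not the author's exact training script; the dataset preparation and the `from_pretrained` keyword arguments are assumptions.

```python
from datasets import load_dataset
from span_marker import SpanMarkerModel, Trainer

# Fine-grained, supervised FewNERD: use the fine-grained tags as "ner_tags".
dataset = load_dataset("DFKI-SLT/few-nerd", "supervised")
dataset = dataset.remove_columns("ner_tags").rename_column("fine_ner_tags", "ner_tags")
labels = dataset["train"].features["ner_tags"].feature.names

# Wrap the muppet-roberta-large encoder with the limits listed under "Model Description".
model = SpanMarkerModel.from_pretrained(
    "facebook/muppet-roberta-large",
    labels=labels,
    model_max_length=256,
    entity_max_length=6,
)

trainer = Trainer(
    model=model,
    args=args,  # the TrainingArguments defined above
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
```
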
## Thanks

Thanks to Tom Aarsen for the SpanMarker library.
## BibTeX

```bibtex
@software{Aarsen_SpanMarker,
    author = {Aarsen, Tom},
    license = {Apache-2.0},
    title = {{SpanMarker for Named Entity Recognition}},
    url = {https://github.com/tomaarsen/SpanMarkerNER}
}
```