INTRODUCTION:

This model, developed as part of the BookNLP-fr project, is a coreference resolution model built on top of camembert-large embeddings. It is trained to link mentions of the same entity across a text, focusing on literary works in French.

This specific model has been trained to link entities of the following types: PER (person), LOC (location), FAC (facility), TIME (time), VEH (vehicle), GPE (geopolitical entity).

MODEL PERFORMANCE (LOOCV):

Overall coreference resolution performance on non-overlapping windows of different lengths:

| Window width (tokens) | Document count | Sample count | MUC F1 | B3 F1 | CEAFe F1 | CONLL F1 |
|---|---|---|---|---|---|---|
| 500 | 29 | 677 | 92.34% | 84.96% | 81.83% | 86.37% |
| 1,000 | 29 | 332 | 92.55% | 81.71% | 78.50% | 84.25% |
| 2,000 | 28 | 162 | 92.72% | 77.82% | 75.57% | 82.03% |
| 5,000 | 19 | 56 | 93.02% | 72.37% | 70.03% | 78.47% |
| 10,000 | 18 | 27 | 93.37% | 67.07% | 67.81% | 76.09% |
| 25,000 | 2 | 3 | 94.63% | 57.67% | 58.21% | 70.17% |
| 50,000 | 1 | 1 | 97.32% | 55.82% | 49.44% | 67.52% |
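
The document and sample counts above are consistent with cutting each document into full, non-overlapping windows and dropping any trailing partial window. The sketch below reproduces the counts under that assumption; the function name `count_windows` is illustrative and is not part of BookNLP-fr, and the token counts are copied from the TRAINING CORPUS section.

```python
# Token counts of the 29 documents, copied from the TRAINING CORPUS section.
TOKEN_COUNTS = [
    14299, 12315, 24776, 30987, 15408, 11768, 11834, 12557, 12281,
    5425, 2554, 2929, 4067, 2251, 2034, 1864, 2141, 2441, 2860, 2343,
    12703, 13023, 10982, 10305, 12389, 14637, 11902, 12285, 71219,
]


def count_windows(token_counts, width):
    """Return (documents with at least one full window, total full windows).

    Assumption: windows do not overlap and a trailing partial window is dropped.
    """
    per_document = [n // width for n in token_counts]
    documents = sum(1 for k in per_document if k > 0)
    return documents, sum(per_document)


for width in (500, 1000, 2000, 5000, 10000, 25000, 50000):
    print(width, count_windows(TOKEN_COUNTS, width))
```

With these inputs the 500-token row gives (29, 677) and the 5,000-token row gives (19, 56), matching the reported counts.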

Coreference resolution performance on the fully annotated sample of each document:

| Token count | Mention count | MUC F1 | B3 F1 | CEAFe F1 | CONLL F1 |
|---|---|---|---|---|---|
| 1,864 | 306 | 96.82% | 91.68% | 77.76% | 88.75% |
| 2,034 | 354 | 96.55% | 88.22% | 83.37% | 89.38% |
| 2,141 | 352 | 95.03% | 80.11% | 76.04% | 83.73% |
| 2,251 | 252 | 91.53% | 78.58% | 70.66% | 80.26% |
| 2,343 | 320 | 86.76% | 69.48% | 72.74% | 76.33% |
| 2,441 | 358 | 92.73% | 67.48% | 69.44% | 76.55% |
| 2,554 | 376 | 87.63% | 65.67% | 70.02% | 74.44% |
| 2,860 | 474 | 91.95% | 78.24% | 75.48% | 81.89% |
| 2,929 | 435 | 95.16% | 61.06% | 80.01% | 78.74% |
| 4,067 | 569 | 94.55% | 82.97% | 75.41% | 84.31% |
| 5,425 | 671 | 87.39% | 56.20% | 63.71% | 69.10% |
| 10,305 | 1,551 | 95.83% | 68.15% | 72.73% | 78.90% |
| 10,982 | 1,252 | 96.05% | 65.75% | 75.09% | 78.96% |
| 11,768 | 1,932 | 92.65% | 67.69% | 72.51% | 77.62% |
| 11,834 | 861 | 88.99% | 60.48% | 71.20% | 73.56% |
| 11,902 | 1,999 | 93.95% | 59.57% | 68.01% | 73.84% |
| 12,281 | 1,480 | 92.24% | 71.24% | 80.86% | 81.45% |
| 12,285 | 1,735 | 94.85% | 72.31% | 70.94% | 79.37% |
| 12,315 | 1,745 | 93.64% | 60.48% | 68.45% | 74.19% |
| 12,389 | 2,059 | 92.87% | 63.61% | 69.26% | 75.25% |
| 12,557 | 1,498 | 92.24% | 79.00% | 78.10% | 83.11% |
| 12,703 | 2,297 | 88.94% | 61.19% | 76.12% | 75.42% |
| 13,023 | 1,861 | 91.53% | 66.21% | 74.13% | 77.29% |
| 14,299 | 1,849 | 95.73% | 71.32% | 77.98% | 81.68% |
| 14,637 | 2,471 | 94.67% | 71.41% | 76.06% | 80.71% |
| 15,408 | 2,013 | 91.54% | 56.61% | 64.54% | 70.90% |
| 24,776 | 3,092 | 92.94% | 63.22% | 70.87% | 75.67% |
| 30,987 | 3,481 | 89.25% | 52.00% | 70.11% | 70.45% |
| 71,219 | 11,857 | 97.28% | 53.34% | 46.67% | 65.76% |
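
The CONLL F1 column is the unweighted mean of the MUC, B3 and CEAFe F1 scores, which is the usual definition of the CoNLL score, and the reported rows are consistent with it. A small sketch of that arithmetic (the helper name `conll_f1` is illustrative):

```python
def conll_f1(muc_f1, b3_f1, ceafe_f1):
    """Unweighted mean of the three coreference F1 scores (in percent)."""
    return (muc_f1 + b3_f1 + ceafe_f1) / 3


# Shortest document (1,864 tokens): reported CONLL F1 is 88.75%
print(round(conll_f1(96.82, 91.68, 77.76), 2))

# Longest document (71,219 tokens): reported CONLL F1 is 65.76%
print(round(conll_f1(97.28, 53.34, 46.67), 2))
```

Because the three metrics are averaged with equal weight, the drop in B3 and CEAFe on long documents pulls the CONLL F1 down even though MUC F1 stays above 90% on most rows.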

TRAINING PARAMETERS:

  • Entity types: PER, LOC, FAC, TIME, VEH, GPE
  • Split strategy: Leave-one-out cross-validation (29 files)
  • Train/Validation split: 0.85 / 0.15
  • Batch size: 16,000
  • Initial learning rate: 0.0004
  • Focal loss gamma: 1
  • Focal loss alpha: 0.25
  • Pronoun lookup antecedents: 30
  • Common and proper noun lookup antecedents: 300
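
To make the two focal loss parameters concrete, here is a minimal sketch of the binary focal loss for a single mention pair, using gamma = 1 and alpha = 0.25 as defaults. It assumes the common convention in which alpha weights the positive (coreferent) class and 1 - alpha the negative class; the training code may apply alpha differently, and the function name is illustrative.

```python
import math


def binary_focal_loss(p, y, gamma=1.0, alpha=0.25, eps=1e-7):
    """Focal loss for one predicted probability p and a binary label y.

    Assumption: alpha weights the positive (coreferent) class and
    1 - alpha the negative class.
    """
    p = min(max(p, eps), 1.0 - eps)             # avoid log(0)
    p_t = p if y == 1 else 1.0 - p              # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confident correct prediction is down-weighted relative to an uncertain one.
print(binary_focal_loss(0.9, 1))
print(binary_focal_loss(0.5, 1))
```

The factor (1 - p_t) ** gamma shrinks the loss of pairs the model already classifies well, so training concentrates on the harder pairs.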

MODEL ARCHITECTURE:

Model Input: 2,165-dimensional vector (2,048 + 106 + 11)

  • Concatenated maximum context camembert-large embeddings (2 * 1,024 = 2,048 dimensions)

  • Additional mention features (106 dimensions):

    • Length of mentions
    • Position of the mention's start token within the sentence
    • Grammatical category of the mentions (pronoun, common noun, proper noun)
    • Dependency relation of the mention's head (one-hot encoded)
    • Gender of the mentions (one-hot encoded)
    • Number (singular/plural) of the mentions (one-hot encoded)
    • Grammatical person of the mentions (one-hot encoded)
  • Additional mention-pair features (11 dimensions):

    • Distance between mention IDs
    • Distance between start tokens of mentions
    • Distance between end tokens of mentions
    • Distance between sentences containing mentions
    • Distance between paragraphs containing mentions
    • Difference in nesting levels of mentions
    • Ratio of shared tokens between mentions
    • Exact text match between mentions (binary)
    • Exact match of mention heads (binary)
    • Match of syntactic heads between mentions (binary)
    • Match of entity types between mentions (binary)
  • Hidden Layers:

    • Number of layers: 3
    • Units per layer: 1,900 nodes
    • Activation function: ReLU
    • Dropout rate: 0.6
  • Final Layer:

    • Type: Linear
    • Input: 1,900 dimensions
    • Output: 1 dimension (mention pair coreference score)
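
The dimensions listed above can be checked with a few lines of arithmetic. The sketch below derives the input size and the shape of each dense layer, then estimates the number of trainable parameters. The estimate assumes that every layer has a bias vector and that the network has no other parameters, which the description above does not confirm.

```python
EMBEDDING_DIM = 1024      # camembert-large embedding size, one per mention
MENTION_FEATURES = 106    # additional mention features
PAIR_FEATURES = 11        # additional mention-pair features
HIDDEN_UNITS = 1900
HIDDEN_LAYERS = 3

# Two mention embeddings plus the hand-crafted features
input_dim = 2 * EMBEDDING_DIM + MENTION_FEATURES + PAIR_FEATURES

# (in_features, out_features) of each dense layer, hidden layers first
layer_shapes = [(input_dim, HIDDEN_UNITS)]
layer_shapes += [(HIDDEN_UNITS, HIDDEN_UNITS)] * (HIDDEN_LAYERS - 1)
layer_shapes.append((HIDDEN_UNITS, 1))    # final linear layer

# Assumption: weights plus one bias per output unit, nothing else
param_count = sum(i * o + o for i, o in layer_shapes)

print(input_dim)      # 2165
print(layer_shapes)
print(param_count)    # 11341101
```

Under these assumptions the classifier head has roughly 11.3 million parameters, most of them in the first two weight matrices.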

Model Output: Continuous prediction between 0 (not coreferent) and 1 (coreferent) indicating the degree of confidence.
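
Pairwise scores still have to be turned into entity clusters. The sketch below shows one common way to do this, best-first antecedent linking, and is not necessarily the procedure used by BookNLP-fr. It assumes that the lookup values from TRAINING PARAMETERS count preceding mentions, that a pair is linked when its score reaches 0.5, and that `score` is a hypothetical callable standing in for the model.

```python
PRONOUN_LOOKUP = 30    # pronoun lookup antecedents
NOUN_LOOKUP = 300      # common and proper noun lookup antecedents


def candidate_antecedents(index, is_pronoun):
    """Indices of the earlier mentions considered for mention `index`."""
    lookup = PRONOUN_LOOKUP if is_pronoun else NOUN_LOOKUP
    return list(range(max(0, index - lookup), index))


def link_mentions(is_pronoun, score, threshold=0.5):
    """Return one cluster id per mention, in document order.

    is_pronoun: list of booleans, one per mention.
    score: hypothetical callable (antecedent_index, mention_index) -> float in [0, 1].
    threshold: assumed decision boundary, not taken from the model card.
    """
    clusters = []
    next_id = 0
    for i, pronoun in enumerate(is_pronoun):
        candidates = candidate_antecedents(i, pronoun)
        best = max(candidates, key=lambda j: score(j, i), default=None)
        if best is not None and score(best, i) >= threshold:
            clusters.append(clusters[best])   # join the antecedent's cluster
        else:
            clusters.append(next_id)          # start a new entity
            next_id += 1
    return clusters


# Toy example with three mentions and hand-written scores
toy_scores = {(0, 1): 0.9, (0, 2): 0.1, (1, 2): 0.2}
print(link_mentions([False, True, False], lambda j, i: toy_scores[(j, i)]))
```

In the toy example the second mention joins the first one's cluster and the third starts a new entity, so the result is `[0, 0, 1]`.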

HOW TO USE:

*** UNDER CONSTRUCTION ***

TRAINING CORPUS:

| Document | Token count | Included in model evaluation |
|---|---|---|
| 1836_Gautier-Theophile_La-morte-amoureuse | 14,299 | True |
| 1840_Sand-George_Pauline | 12,315 | True |
| 1842_Balzac-Honore-de_La-Maison-du-chat-qui-pelote | 24,776 | True |
| 1844_Balzac-Honore-de_La-Maison-Nucingen | 30,987 | True |
| 1844_Balzac-Honore-de_Sarrasine | 15,408 | True |
| 1856_Cousin-Victor_Madame-de-Hautefort | 11,768 | True |
| 1863_Gautier-Theophile_Le-capitaine-Fracasse | 11,834 | True |
| 1873_Zola-Emile_Le-ventre-de-Paris | 12,557 | True |
| 1881_Flaubert-Gustave_Bouvard-et-Pecuchet | 12,281 | True |
| 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_1-MADEMOISELLE-FIFI | 5,425 | True |
| 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_2-MADAME-BAPTISTE | 2,554 | True |
| 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_3-LA-ROUILLE | 2,929 | True |
| 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_1-MARROCA | 4,067 | True |
| 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_2-LA-BUCHE | 2,251 | True |
| 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_3-LA-RELIQUE | 2,034 | True |
| 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_1-FOU | 1,864 | True |
| 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_2-REVEIL | 2,141 | True |
| 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_3-UNE-RUSE | 2,441 | True |
| 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_4-A-CHEVAL | 2,860 | True |
| 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_5-UN-REVEILLON | 2,343 | True |
| 1901_Lucie-Achard_Rosalie-de-Constant-sa-famille-et-ses-amis | 12,703 | True |
| 1903_Conan-Laure_Elisabeth_Seton | 13,023 | True |
| 1904_Rolland-Romain_Jean-Christophe_Tome-I-L-aube | 10,982 | True |
| 1904_Rolland-Romain_Jean-Christophe_Tome-II-Le-matin | 10,305 | True |
| 1917_Adèle-Bourgeois_Némoville | 12,389 | True |
| 1923_Radiguet-Raymond_Le-diable-au-corps | 14,637 | True |
| 1926_Audoux-Marguerite_De-la-ville-au-moulin | 11,902 | True |
| 1937_Audoux-Marguerite_Douce-Lumiere | 12,285 | True |
| Manon_Lescaut_PEDRO | 71,219 | True |
| TOTAL | 346,579 | 29 files used for cross-validation |

CONTACT:

mail: antoine [dot] bourgois [at] protonmail [dot] com
