INTRODUCTION:
This model, developed as part of the BookNLP-fr project, is a coreference resolution model built on top of camembert-large embeddings. It is trained to link mentions of the same entity across a text, focusing on literary works in French.
This specific model has been trained to link entities of the following types: PER, LOC, FAC, TIME, VEH, GPE.
MODEL PERFORMANCE (LOOCV):
Overall coreference resolution performance on non-overlapping windows of different lengths:
# | Window width (tokens) | Document count | Sample count | MUC F1 | B3 F1 | CEAFe F1 | CoNLL F1 |
---|---|---|---|---|---|---|---|
0 | 500 | 29 | 677 | 92.34% | 84.96% | 81.83% | 86.37% |
1 | 1,000 | 29 | 332 | 92.55% | 81.71% | 78.50% | 84.25% |
2 | 2,000 | 28 | 162 | 92.72% | 77.82% | 75.57% | 82.03% |
3 | 5,000 | 19 | 56 | 93.02% | 72.37% | 70.03% | 78.47% |
4 | 10,000 | 18 | 27 | 93.37% | 67.07% | 67.81% | 76.09% |
5 | 25,000 | 2 | 3 | 94.63% | 57.67% | 58.21% | 70.17% |
6 | 50,000 | 1 | 1 | 97.32% | 55.82% | 49.44% | 67.52% |
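The document and sample counts in this table are consistent with cutting each document into consecutive windows of the given width and discarding the final, shorter remainder. The sketch below checks that reading against the token counts listed under TRAINING CORPUS. The helper name `window_counts` is illustrative and is not part of BookNLP-fr.

```python
# Token counts of the 29 annotated documents, copied from the TRAINING CORPUS table.
TOKEN_COUNTS = [
    14299, 12315, 24776, 30987, 15408, 11768, 11834, 12557, 12281, 5425,
    2554, 2929, 4067, 2251, 2034, 1864, 2141, 2441, 2860, 2343,
    12703, 13023, 10982, 10305, 12389, 14637, 11902, 12285, 71219,
]


def window_counts(token_counts, width):
    """Return (documents with at least one full window, total full windows).

    Assumption: windows are consecutive and non-overlapping, and the trailing
    remainder shorter than `width` is dropped.
    """
    per_doc = [n // width for n in token_counts]
    return sum(1 for k in per_doc if k > 0), sum(per_doc)


for width in (500, 1000, 2000, 5000, 10000, 25000, 50000):
    docs, samples = window_counts(TOKEN_COUNTS, width)
    print(f"{width:>6} tokens: {docs} documents, {samples} samples")
```

Under this assumption, longer windows leave very few samples (three at 25,000 tokens and one at 50,000), so the scores in the last two rows rest on very little data.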
Coreference resolution performance on the fully annotated sample of each document:
# | Token count | Mention count | MUC F1 | B3 F1 | CEAFe F1 | CoNLL F1 |
---|---|---|---|---|---|---|
0 | 1,864 | 306 | 96.82% | 91.68% | 77.76% | 88.75% |
1 | 2,034 | 354 | 96.55% | 88.22% | 83.37% | 89.38% |
2 | 2,141 | 352 | 95.03% | 80.11% | 76.04% | 83.73% |
3 | 2,251 | 252 | 91.53% | 78.58% | 70.66% | 80.26% |
4 | 2,343 | 320 | 86.76% | 69.48% | 72.74% | 76.33% |
5 | 2,441 | 358 | 92.73% | 67.48% | 69.44% | 76.55% |
6 | 2,554 | 376 | 87.63% | 65.67% | 70.02% | 74.44% |
7 | 2,860 | 474 | 91.95% | 78.24% | 75.48% | 81.89% |
8 | 2,929 | 435 | 95.16% | 61.06% | 80.01% | 78.74% |
9 | 4,067 | 569 | 94.55% | 82.97% | 75.41% | 84.31% |
10 | 5,425 | 671 | 87.39% | 56.20% | 63.71% | 69.10% |
11 | 10,305 | 1,551 | 95.83% | 68.15% | 72.73% | 78.90% |
12 | 10,982 | 1,252 | 96.05% | 65.75% | 75.09% | 78.96% |
13 | 11,768 | 1,932 | 92.65% | 67.69% | 72.51% | 77.62% |
14 | 11,834 | 861 | 88.99% | 60.48% | 71.20% | 73.56% |
15 | 11,902 | 1,999 | 93.95% | 59.57% | 68.01% | 73.84% |
16 | 12,281 | 1,480 | 92.24% | 71.24% | 80.86% | 81.45% |
17 | 12,285 | 1,735 | 94.85% | 72.31% | 70.94% | 79.37% |
18 | 12,315 | 1,745 | 93.64% | 60.48% | 68.45% | 74.19% |
19 | 12,389 | 2,059 | 92.87% | 63.61% | 69.26% | 75.25% |
20 | 12,557 | 1,498 | 92.24% | 79.00% | 78.10% | 83.11% |
21 | 12,703 | 2,297 | 88.94% | 61.19% | 76.12% | 75.42% |
22 | 13,023 | 1,861 | 91.53% | 66.21% | 74.13% | 77.29% |
23 | 14,299 | 1,849 | 95.73% | 71.32% | 77.98% | 81.68% |
24 | 14,637 | 2,471 | 94.67% | 71.41% | 76.06% | 80.71% |
25 | 15,408 | 2,013 | 91.54% | 56.61% | 64.54% | 70.90% |
26 | 24,776 | 3,092 | 92.94% | 63.22% | 70.87% | 75.67% |
27 | 30,987 | 3,481 | 89.25% | 52.00% | 70.11% | 70.45% |
28 | 71,219 | 11,857 | 97.28% | 53.34% | 46.67% | 65.76% |
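The CoNLL F1 score is conventionally the unweighted mean of the MUC, B3 and CEAFe F1 scores, and the figures in this table agree with that to within rounding. A minimal check on the first row (the function name is illustrative):

```python
def conll_f1(muc_f1, b3_f1, ceafe_f1):
    """Unweighted mean of the three coreference F1 scores (in percent)."""
    return (muc_f1 + b3_f1 + ceafe_f1) / 3


# First row of the per-document table: MUC 96.82, B3 91.68, CEAFe 77.76.
score = conll_f1(96.82, 91.68, 77.76)
print(f"{score:.2f}%")  # 88.75%
```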
TRAINING PARAMETERS:
- Entity types: PER, LOC, FAC, TIME, VEH, GPE
- Split strategy: Leave-one-out cross-validation (29 files)
- Train/Validation split: 0.85 / 0.15
- Batch size: 16,000
- Initial learning rate: 0.0004
- Focal loss gamma: 1
- Focal loss alpha: 0.25
- Pronoun lookup antecedents: 30
- Common and Proper nouns lookup antecedents: 300
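One plausible reading of the two "lookup antecedents" settings is that each mention is paired only with a bounded number of preceding mentions: 30 for pronouns and 300 for common and proper nouns. The sketch below generates candidate pairs under that assumption. The `Mention` type, the category labels, and `candidate_pairs` are hypothetical names, not the BookNLP-fr API.

```python
from dataclasses import dataclass

# Maximum number of preceding mentions considered, per grammatical category
# (values from the training parameters; the interpretation is an assumption).
MAX_ANTECEDENTS = {"pronoun": 30, "common_noun": 300, "proper_noun": 300}


@dataclass(frozen=True)
class Mention:
    mention_id: int  # position of the mention in document order
    category: str    # "pronoun", "common_noun" or "proper_noun"


def candidate_pairs(mentions):
    """Yield (antecedent, mention) pairs; antecedents come in document order."""
    for i, mention in enumerate(mentions):
        limit = MAX_ANTECEDENTS[mention.category]
        for antecedent in mentions[max(0, i - limit):i]:
            yield antecedent, mention


# Toy document: 400 mentions, every fourth one a pronoun.
mentions = [
    Mention(i, "pronoun" if i % 4 == 0 else "common_noun") for i in range(400)
]
pairs = list(candidate_pairs(mentions))
print(len(pairs))
```

Bounding the lookup keeps the number of scored pairs roughly linear in the number of mentions instead of quadratic, which matters for the longest documents in the corpus.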
MODEL ARCHITECTURE:
Model Input: 2,165-dimensional vector
Concatenated maximum context camembert-large embeddings (2 * 1,024 = 2,048 dimensions)
Additional mentions features (106 dimensions):
- Length of mentions
- Position of the mention's start token within the sentence
- Grammatical category of the mentions (pronoun, common noun, proper noun)
- Dependency relation of the mention's head (one-hot encoded)
- Gender of the mentions (one-hot encoded)
- Number (singular/plural) of the mentions (one-hot encoded)
- Grammatical person of the mentions (one-hot encoded)
Additional mention pairs features (11 dimensions):
- Distance between mention IDs
- Distance between start tokens of mentions
- Distance between end tokens of mentions
- Distance between sentences containing mentions
- Distance between paragraphs containing mentions
- Difference in nesting levels of mentions
- Ratio of shared tokens between mentions
- Exact text match between mentions (binary)
- Exact match of mention heads (binary)
- Match of syntactic heads between mentions (binary)
- Match of entity types between mentions (binary)
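The input size follows from concatenating the three feature groups: 2 * 1,024 embedding dimensions, 106 mention features, and 11 mention-pair features, for 2,165 in total. Below is a minimal sketch of that assembly with plain Python lists. The function name and the concatenation order are assumptions made for illustration.

```python
EMBEDDING_DIM = 1024        # camembert-large hidden size
MENTION_FEATURE_DIM = 106   # additional mention features
PAIR_FEATURE_DIM = 11       # additional mention-pair features
INPUT_DIM = 2 * EMBEDDING_DIM + MENTION_FEATURE_DIM + PAIR_FEATURE_DIM


def build_pair_input(emb_a, emb_b, mention_features, pair_features):
    """Concatenate the feature groups of one mention pair into one vector."""
    if len(emb_a) != EMBEDDING_DIM or len(emb_b) != EMBEDDING_DIM:
        raise ValueError("each mention embedding must have 1,024 dimensions")
    if len(mention_features) != MENTION_FEATURE_DIM:
        raise ValueError("expected 106 mention features")
    if len(pair_features) != PAIR_FEATURE_DIM:
        raise ValueError("expected 11 mention-pair features")
    return list(emb_a) + list(emb_b) + list(mention_features) + list(pair_features)


# Dummy zero vectors stand in for real embeddings and features.
vector = build_pair_input([0.0] * 1024, [0.0] * 1024, [0.0] * 106, [0.0] * 11)
print(len(vector), INPUT_DIM)  # 2165 2165
```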
Hidden Layers:
- Number of layers: 3
- Units per layer: 1,900 nodes
- Activation function: ReLU
- Dropout rate: 0.6
Final Layer:
- Type: Linear
- Input: 1,900 dimensions
- Output: 1 dimension (mention pair coreference score)
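The layer sizes above describe a feed-forward scorer: 2,165 inputs, three ReLU hidden layers of 1,900 units, and a linear layer producing one score. The sketch below runs such a forward pass in plain Python on a scaled-down network so that it executes quickly. Dropout is omitted because it is only active during training, and the final sigmoid is an assumption made to map the linear output into the 0 to 1 range. The function names are illustrative.

```python
import math
import random

# Layer sizes from the model card: input, three hidden layers, one output.
FULL_LAYER_SIZES = [2165, 1900, 1900, 1900, 1]


def init_layers(sizes, seed=0):
    """Create random (weights, biases) per layer; stands in for trained weights."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def score_pair(x, layers):
    """Forward pass: ReLU on hidden layers, linear output, then sigmoid (assumed)."""
    for index, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if index < len(layers) - 1:
            x = [max(0.0, v) for v in x]  # ReLU
    return 1.0 / (1.0 + math.exp(-x[0]))


# Scaled-down network with the same depth, for a quick demonstration.
small_layers = init_layers([8, 6, 6, 6, 1])
print(score_pair([0.5] * 8, small_layers))
```

In practice a tensor library would replace the nested lists; the plain-Python version is only meant to make the layer structure explicit.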
Model Output: Continuous prediction between 0 (not coreferent) and 1 (coreferent) indicating the degree of confidence.
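The focal loss settings listed under TRAINING PARAMETERS (gamma 1, alpha 0.25) would act on this score. The sketch below uses the standard binary focal loss, FL = -alpha_t * (1 - p_t)^gamma * log(p_t). Whether the project's training code uses exactly this form is an assumption, and the function name is illustrative.

```python
import math


def binary_focal_loss(p, target, gamma=1.0, alpha=0.25, eps=1e-12):
    """Standard binary focal loss for one predicted probability `p`.

    `target` is 1 for a coreferent pair and 0 otherwise. `alpha` weights the
    positive class and `1 - alpha` the negative class.
    """
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    p_t = min(max(p_t, eps), 1.0)  # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confident correct prediction is down-weighted relative to an uncertain one.
print(binary_focal_loss(0.95, 1))
print(binary_focal_loss(0.55, 1))
```

The (1 - p_t)^gamma factor shrinks the contribution of pairs the model already classifies well, which is a common way to handle the heavy imbalance between non-coreferent and coreferent pairs.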
HOW TO USE:
*** UNDER CONSTRUCTION ***
TRAINING CORPUS:
# | Document | Token count | Included in model evaluation |
---|---|---|---|
0 | 1836_Gautier-Theophile_La-morte-amoureuse | 14,299 tokens | True |
1 | 1840_Sand-George_Pauline | 12,315 tokens | True |
2 | 1842_Balzac-Honore-de_La-Maison-du-chat-qui-pelote | 24,776 tokens | True |
3 | 1844_Balzac-Honore-de_La-Maison-Nucingen | 30,987 tokens | True |
4 | 1844_Balzac-Honore-de_Sarrasine | 15,408 tokens | True |
5 | 1856_Cousin-Victor_Madame-de-Hautefort | 11,768 tokens | True |
6 | 1863_Gautier-Theophile_Le-capitaine-Fracasse | 11,834 tokens | True |
7 | 1873_Zola-Emile_Le-ventre-de-Paris | 12,557 tokens | True |
8 | 1881_Flaubert-Gustave_Bouvard-et-Pecuchet | 12,281 tokens | True |
9 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_1-MADEMOISELLE-FIFI | 5,425 tokens | True |
10 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_2-MADAME-BAPTISTE | 2,554 tokens | True |
11 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_3-LA-ROUILLE | 2,929 tokens | True |
12 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_1-MARROCA | 4,067 tokens | True |
13 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_2-LA-BUCHE | 2,251 tokens | True |
14 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_3-LA-RELIQUE | 2,034 tokens | True |
15 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_1-FOU | 1,864 tokens | True |
16 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_2-REVEIL | 2,141 tokens | True |
17 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_3-UNE-RUSE | 2,441 tokens | True |
18 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_4-A-CHEVAL | 2,860 tokens | True |
19 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_5-UN-REVEILLON | 2,343 tokens | True |
20 | 1901_Lucie-Achard_Rosalie-de-Constant-sa-famille-et-ses-amis | 12,703 tokens | True |
21 | 1903_Conan-Laure_Elisabeth_Seton | 13,023 tokens | True |
22 | 1904_Rolland-Romain_Jean-Christophe_Tome-I-L-aube | 10,982 tokens | True |
23 | 1904_Rolland-Romain_Jean-Christophe_Tome-II-Le-matin | 10,305 tokens | True |
24 | 1917_Adèle-Bourgeois_Némoville | 12,389 tokens | True |
25 | 1923_Radiguet-Raymond_Le-diable-au-corps | 14,637 tokens | True |
26 | 1926_Audoux-Marguerite_De-la-ville-au-moulin | 11,902 tokens | True |
27 | 1937_Audoux-Marguerite_Douce-Lumiere | 12,285 tokens | True |
28 | Manon_Lescaut_PEDRO | 71,219 tokens | True |
29 | TOTAL | 346,579 tokens | 29 files used for cross-validation |
CONTACT:
mail: antoine [dot] bourgois [at] protonmail [dot] com
MODEL TREE:
- Model: AntoineBourgois/BookNLP-fr_coreference-resolution_camembert-large_FAC_GPE_LOC_PER_TIME_VEH
- Base model: almanach/camembert-large