AntoineBourgois/BookNLP-fr_coreference-resolution_camembert-large_PER

INTRODUCTION:

This coreference resolution model was developed as part of the BookNLP-fr project and is built on top of camembert-large embeddings. It is trained to link mentions of the same entity across a text, with a focus on French literary works.

This specific model has been trained to link entities of the following types: PER.

MODEL PERFORMANCE (LOOCV):

Overall coreference resolution performance for non-overlapping windows of different lengths:

	Window width (tokens)	Document count	Sample count	MUC F1	B3 F1	CEAFe F1	CoNLL F1
0	500	29	677	92.18%	83.86%	76.86%	84.30%
1	1,000	29	332	92.65%	79.79%	71.77%	81.40%
2	2,000	28	162	93.29%	75.85%	67.34%	78.83%
3	5,000	19	56	93.76%	69.60%	61.16%	74.84%
4	10,000	18	27	94.28%	65.73%	58.59%	72.86%
5	25,000	2	3	94.76%	62.48%	53.33%	70.19%
6	50,000	1	1	97.39%	56.43%	47.40%	67.07%
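
The document and sample counts in this table are consistent with cutting each document into consecutive full-length windows and discarding any shorter remainder. The sketch below illustrates that reading. It is an assumption about the evaluation procedure rather than the project's code: the function names are illustrative, and the token counts are copied from the TRAINING CORPUS table.

```python
def split_into_windows(tokens, width):
    """Return consecutive, non-overlapping windows of exactly `width` tokens.

    Assumption: a trailing remainder shorter than `width` is discarded.
    """
    n_full = len(tokens) // width
    return [tokens[i * width:(i + 1) * width] for i in range(n_full)]


# Token counts of the 29 documents, as listed in the TRAINING CORPUS table.
DOC_TOKEN_COUNTS = [
    14299, 12315, 24776, 30987, 15408, 11768, 11834, 12557, 12281, 5425,
    2554, 2929, 4067, 2251, 2034, 1864, 2141, 2441, 2860, 2343,
    12703, 13023, 10982, 10305, 12389, 14637, 11902, 12285, 71219,
]


def window_stats(doc_lengths, width):
    """Return (document count, sample count) for a given window width."""
    per_doc = [length // width for length in doc_lengths]
    return sum(1 for n in per_doc if n > 0), sum(per_doc)


for width in (500, 1000, 2000, 5000, 10000, 25000, 50000):
    print(width, window_stats(DOC_TOKEN_COUNTS, width))
```

With these token counts, `window_stats` returns (29, 677) for a width of 500 and (1, 1) for a width of 50,000, matching the first and last rows above.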

Coreference resolution performance on the fully annotated sample of each document:

	Token count	Mention count	MUC F1	B3 F1	CEAFe F1	CoNLL F1
0	1,864	253	98.16%	95.39%	60.34%	84.63%
1	2,034	321	97.47%	92.79%	80.04%	90.10%
2	2,141	297	95.06%	77.99%	65.08%	79.38%
3	2,251	235	91.95%	80.47%	46.56%	73.00%
4	2,343	239	83.87%	61.95%	43.58%	63.13%
5	2,441	314	91.85%	55.70%	60.82%	69.46%
6	2,554	330	90.24%	65.27%	72.36%	75.96%
7	2,860	369	93.65%	84.89%	74.93%	84.49%
8	2,929	386	95.65%	78.21%	64.23%	79.37%
9	4,067	429	97.46%	85.20%	62.52%	81.73%
10	5,425	558	90.46%	53.03%	59.52%	67.67%
11	10,305	1,436	96.37%	74.83%	59.91%	77.04%
12	10,982	1,095	97.18%	65.30%	60.49%	74.32%
13	11,768	1,734	93.30%	64.14%	64.12%	73.85%
14	11,834	600	92.21%	67.51%	60.74%	73.49%
15	11,902	1,692	95.03%	58.83%	45.59%	66.49%
16	12,281	1,089	95.06%	62.05%	72.55%	76.55%
17	12,285	1,489	95.28%	77.84%	57.43%	76.85%
18	12,315	1,501	95.36%	57.07%	64.26%	72.23%
19	12,389	1,654	93.19%	54.21%	51.84%	66.41%
20	12,557	1,085	92.30%	66.97%	46.65%	68.64%
21	12,703	1,731	90.40%	53.70%	61.37%	68.49%
22	13,023	1,559	93.86%	61.71%	62.41%	72.66%
23	14,299	1,582	97.23%	69.25%	67.04%	77.84%
24	14,637	2,127	95.78%	71.34%	63.28%	76.80%
25	15,408	1,769	92.85%	54.11%	56.12%	67.69%
26	24,776	2,716	94.31%	63.51%	54.12%	70.65%
27	30,987	2,980	89.55%	54.25%	59.68%	67.83%
28	71,219	11,857	97.38%	50.85%	45.93%	64.72%
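
The last column of both tables is the unweighted mean of the three preceding F1 scores, which is the usual definition of the CoNLL score. A minimal check on two rows of the table above (the function name is illustrative):

```python
def conll_f1(muc_f1, b3_f1, ceafe_f1):
    """Unweighted mean of the MUC, B3 and CEAFe F1 scores (in percent)."""
    return (muc_f1 + b3_f1 + ceafe_f1) / 3


# First and last rows of the per-document table.
print(f"{conll_f1(98.16, 95.39, 60.34):.2f}")  # 84.63
print(f"{conll_f1(97.38, 50.85, 45.93):.2f}")  # 64.72
```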

TRAINING PARAMETERS:

Entity types: PER
Split strategy: Leave-one-out cross-validation (29 files)
Train/Validation split: 0.85 / 0.15
Batch size: 16,000
Initial learning rate: 0.0004
Focal loss gamma: 1
Focal loss alpha: 0.25
Pronoun lookup antecedents: 30
Common and Proper nouns lookup antecedents: 300
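
Two of these parameters can be illustrated with a short sketch. The loss follows the standard binary focal loss formulation with the gamma and alpha values listed above. The candidate function reflects one assumed reading of the lookup parameters, namely that each mention is paired with at most the N mentions that precede it, where N depends on its grammatical category. Neither function is taken from the BookNLP-fr code base, and all names are illustrative.

```python
import math

GAMMA = 1.0   # "Focal loss gamma" from the list above
ALPHA = 0.25  # "Focal loss alpha" from the list above
MAX_ANTECEDENTS = {"pronoun": 30, "common noun": 300, "proper noun": 300}


def focal_loss(prob, label, gamma=GAMMA, alpha=ALPHA, eps=1e-12):
    """Binary focal loss for a single mention pair.

    prob  : predicted probability that the pair is coreferent
    label : 1 if the pair is coreferent, 0 otherwise
    """
    p_t = prob if label == 1 else 1.0 - prob
    alpha_t = alpha if label == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma down-weights pairs that are already well classified.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))


def candidate_antecedents(mention_index, category):
    """Indices of the preceding mentions considered as antecedents.

    Assumed reading: at most MAX_ANTECEDENTS[category] previous mentions.
    """
    limit = MAX_ANTECEDENTS[category]
    return list(range(max(0, mention_index - limit), mention_index))


print(focal_loss(0.9, 1) < focal_loss(0.6, 1))       # confident pair costs less
print(len(candidate_antecedents(1000, "pronoun")))   # capped at 30 candidates
```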

MODEL ARCHITECTURE:

Model Input: 2,165-dimensional vector

Concatenated maximum context camembert-large embeddings (2 * 1,024 = 2,048 dimensions)
Additional mention features (106 dimensions):
- Length of mentions
- Position of the mention's start token within the sentence
- Grammatical category of the mentions (pronoun, common noun, proper noun)
- Dependency relation of the mention's head (one-hot encoded)
- Gender of the mentions (one-hot encoded)
- Number (singular/plural) of the mentions (one-hot encoded)
- Grammatical person of the mentions (one-hot encoded)
Additional mention-pair features (11 dimensions):
- Distance between mention IDs
- Distance between start tokens of mentions
- Distance between end tokens of mentions
- Distance between sentences containing mentions
- Distance between paragraphs containing mentions
- Difference in nesting levels of mentions
- Ratio of shared tokens between mentions
- Exact text match between mentions (binary)
- Exact match of mention heads (binary)
- Match of syntactic heads between mentions (binary)
- Match of entity types between mentions (binary)
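
As an illustration, a few of the pair features listed above could be computed as follows. This is a sketch under stated assumptions and not the project's feature extractor: the `Mention` record and its field names are hypothetical, comparisons are made case-insensitive, and the "ratio of shared tokens" is taken to be the Jaccard overlap of the two token sets, since the exact definition is not given here.

```python
from dataclasses import dataclass


@dataclass
class Mention:
    """Hypothetical mention record; field names are illustrative only."""
    mention_id: int
    start_token: int
    end_token: int
    sentence_id: int
    tokens: tuple
    head: str


def pair_features(antecedent, mention):
    """Compute a subset of the pair features for (antecedent, mention)."""
    a_text = [t.lower() for t in antecedent.tokens]
    m_text = [t.lower() for t in mention.tokens]
    a_set, m_set = set(a_text), set(m_text)
    union = a_set | m_set
    return {
        "mention_id_distance": mention.mention_id - antecedent.mention_id,
        "start_token_distance": mention.start_token - antecedent.start_token,
        "end_token_distance": mention.end_token - antecedent.end_token,
        "sentence_distance": mention.sentence_id - antecedent.sentence_id,
        # Assumption: shared-token ratio computed as Jaccard overlap.
        "shared_token_ratio": len(a_set & m_set) / len(union) if union else 0.0,
        "exact_text_match": int(a_text == m_text),
        "head_match": int(antecedent.head.lower() == mention.head.lower()),
    }


first = Mention(0, 3, 5, 0, ("la", "jeune", "femme"), "femme")
later = Mention(4, 20, 21, 1, ("La", "femme"), "femme")
print(pair_features(first, later))
```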
Hidden Layers:
- Number of layers: 3
- Units per layer: 1,900
- Activation function: ReLU
- Dropout rate: 0.6
Final Layer:
- Type: Linear
- Input: 1,900 dimensions
- Output: 1 dimension (mention pair coreference score)
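
The dimensions above can be checked with a few lines of arithmetic. The sketch below derives the input size and the layer shapes from the figures in this section. The parameter count additionally assumes that every layer is a plain fully connected layer with a bias term, which is not stated here, so it should be read as an estimate.

```python
EMBEDDING_DIM = 1024        # camembert-large embedding size, per mention
MENTION_FEATURE_DIM = 106   # additional mention features
PAIR_FEATURE_DIM = 11       # additional mention-pair features
HIDDEN_LAYERS = 3
HIDDEN_UNITS = 1900

input_dim = 2 * EMBEDDING_DIM + MENTION_FEATURE_DIM + PAIR_FEATURE_DIM
layer_sizes = [input_dim] + [HIDDEN_UNITS] * HIDDEN_LAYERS + [1]


def count_parameters(sizes, bias=True):
    """Weights (and optional biases) of a stack of fully connected layers."""
    return sum(
        n_in * n_out + (n_out if bias else 0)
        for n_in, n_out in zip(sizes, sizes[1:])
    )


print(input_dim)                      # 2165
print(layer_sizes)                    # [2165, 1900, 1900, 1900, 1]
print(count_parameters(layer_sizes))  # 11341101
```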

Model Output: Continuous prediction between 0 (not coreferent) and 1 (coreferent) indicating the degree of confidence.
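
Pairwise scores still have to be turned into coreference chains. How BookNLP-fr does this is not described in this section, so the following is only a generic sketch of one common strategy: link each mention to its best-scoring antecedent when that score exceeds a threshold, then group linked mentions with a union-find structure. The function name, the input format and the 0.5 threshold are all assumptions.

```python
def cluster_mentions(n_mentions, pair_scores, threshold=0.5):
    """Group mentions into chains from pairwise coreference scores.

    pair_scores maps (antecedent_index, mention_index) to a score in [0, 1].
    Each mention is linked to its best-scoring antecedent when that score
    exceeds `threshold`; linked mentions end up in the same chain.
    """
    parent = list(range(n_mentions))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Keep, for every mention, only its best antecedent above the threshold.
    best = {}
    for (antecedent, mention), score in pair_scores.items():
        if score > threshold and score > best.get(mention, (None, 0.0))[1]:
            best[mention] = (antecedent, score)

    # Merge each mention's chain with that of its chosen antecedent.
    for mention, (antecedent, _) in best.items():
        parent[find(mention)] = find(antecedent)

    chains = {}
    for i in range(n_mentions):
        chains.setdefault(find(i), []).append(i)
    return sorted(chains.values())


scores = {(0, 1): 0.9, (0, 2): 0.1, (1, 2): 0.2,
          (0, 3): 0.3, (1, 3): 0.8, (2, 3): 0.4}
print(cluster_mentions(4, scores))  # [[0, 1, 3], [2]]
```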

HOW TO USE:

*** UNDER CONSTRUCTION ***

TRAINING CORPUS:

	Document	Token count	Included in model evaluation
0	1836_Gautier-Theophile_La-morte-amoureuse	14,299 tokens	True
1	1840_Sand-George_Pauline	12,315 tokens	True
2	1842_Balzac-Honore-de_La-Maison-du-chat-qui-pelote	24,776 tokens	True
3	1844_Balzac-Honore-de_La-Maison-Nucingen	30,987 tokens	True
4	1844_Balzac-Honore-de_Sarrasine	15,408 tokens	True
5	1856_Cousin-Victor_Madame-de-Hautefort	11,768 tokens	True
6	1863_Gautier-Theophile_Le-capitaine-Fracasse	11,834 tokens	True
7	1873_Zola-Emile_Le-ventre-de-Paris	12,557 tokens	True
8	1881_Flaubert-Gustave_Bouvard-et-Pecuchet	12,281 tokens	True
9	1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_1-MADEMOISELLE-FIFI	5,425 tokens	True
10	1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_2-MADAME-BAPTISTE	2,554 tokens	True
11	1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_3-LA-ROUILLE	2,929 tokens	True
12	1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_1-MARROCA	4,067 tokens	True
13	1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_2-LA-BUCHE	2,251 tokens	True
14	1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_3-LA-RELIQUE	2,034 tokens	True
15	1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_1-FOU	1,864 tokens	True
16	1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_2-REVEIL	2,141 tokens	True
17	1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_3-UNE-RUSE	2,441 tokens	True
18	1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_4-A-CHEVAL	2,860 tokens	True
19	1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_5-UN-REVEILLON	2,343 tokens	True
20	1901_Lucie-Achard_Rosalie-de-Constant-sa-famille-et-ses-amis	12,703 tokens	True
21	1903_Conan-Laure_Elisabeth_Seton	13,023 tokens	True
22	1904_Rolland-Romain_Jean-Christophe_Tome-I-L-aube	10,982 tokens	True
23	1904_Rolland-Romain_Jean-Christophe_Tome-II-Le-matin	10,305 tokens	True
24	1917_Adèle-Bourgeois_Némoville	12,389 tokens	True
25	1923_Radiguet-Raymond_Le-diable-au-corps	14,637 tokens	True
26	1926_Audoux-Marguerite_De-la-ville-au-moulin	11,902 tokens	True
27	1937_Audoux-Marguerite_Douce-Lumiere	12,285 tokens	True
28	Manon_Lescaut_PEDRO	71,219 tokens	True
29	TOTAL	346,579 tokens	29 files used for cross-validation

CONTACT:

mail: antoine [dot] bourgois [at] protonmail [dot] com
