Commit cf00797 by LelViLamp
1 parent: 0bfe0b7

Upload README.md

Files changed (1)
  1. README.md +20 -9
README.md CHANGED
@@ -17,12 +17,23 @@ A named entity recognition system (NER) was trained on text extracted from _Ober

  ## Annotations

- Each text passage was annotated in [doccano](https://github.com/doccano/doccano) by two or three annotators, and their annotations were cleaned and merged into one dataset. For details on how this was done, see [`LelViLamp/kediff-doccano-postprocessing`](https://github.com/LelViLamp/kediff-doccano-postprocessing). In total, the text consists of about 1.7 million characters. The resulting annotation datasets were published on the Hugging Face Hub as [`OALZ-1788-Q1-NER-Annotations`](https://huggingface.co/datasets/LelViLamp/OALZ-1788-Q1-NER-Annotations).

  The following categories were included in the annotation process:

  | Tag | Label | Count | Total Length | Median Annotation Length | Mean Annotation Length | SD |
- | :------ | :------------ | ----: | -----------: | -----------------------: | ---------------------: | ----: |
  | `EVENT` | Event | 294 | 6,090 | 18 | 20.71 | 13.24 |
  | `LOC` | Location | 2,449 | 24,417 | 9 | 9.97 | 6.21 |
  | `MISC` | Miscellaneous | 2,585 | 50,654 | 14 | 19.60 | 19.63 |
@@ -48,13 +59,13 @@ The [`dbmdz/bert-base-historic-multilingual-cased`](https://huggingface.co/dbmdz
  The models' performance measures are as follows:

  | Model | Selected Epoch | Checkpoint | Validation Loss | Precision | Recall | F<sub>1</sub> | Accuracy |
- | :----------------------------------------------------------------- | :------------: | ---------: | --------------: | --------: | ------: | ------------: | -------: |
- | [`EVENT`](https://huggingface.co/LelViLamp/OALZ-1788-Q1-NER-EVENT) | 1 | `1393` | .021957 | .665233 | .343066 | .351528 | .995700 |
- | [`LOC`](https://huggingface.co/LelViLamp/OALZ-1788-Q1-NER-LOC) | 1 | `1393` | .033602 | .829535 | .803648 | .814146 | .990999 |
- | [`MISC`](https://huggingface.co/LelViLamp/OALZ-1788-Q1-NER-MISC) | 2 | `2786` | .123994 | .739221 | .503677 | .571298 | .968697 |
- | [`ORG`](https://huggingface.co/LelViLamp/OALZ-1788-Q1-NER-ORG) | 1 | `1393` | .062769 | .744259 | .709738 | .726212 | .980288 |
- | [`PER`](https://huggingface.co/LelViLamp/OALZ-1788-Q1-NER-PER) | 2 | `2786` | .059186 | .914037 | .849048 | .879070 | .983253 |
- | [`TIME`](https://huggingface.co/LelViLamp/OALZ-1788-Q1-NER-TIME) | 1 | `1393` | .016120 | .866866 | .724958 | .783099 | .994631 |

  ## Acknowledgements
  The dataset and models were created in the project _Kooperative Erschließung diffusen Wissens_ ([KEDiff](https://uni-salzburg.elsevierpure.com/de/projects/kooperative-erschließung-diffusen-wissens-ein-literaturwissenscha)), funded by the [State of Salzburg](https://salzburg.gv.at), Austria 🇦🇹, and carried out at [Paris Lodron Universität Salzburg](https://plus.ac.at).
 

  ## Annotations

+ Each text passage was annotated in [doccano](https://github.com/doccano/doccano) by two or three annotators, and their annotations were cleaned and merged into one dataset. For details on how this was done, see [`LelViLamp/kediff-doccano-postprocessing`](https://github.com/LelViLamp/kediff-doccano-postprocessing). In total, the text consists of about 1.7 million characters. The resulting annotation datasets were published on the Hugging Face Hub as [`oalz-1788-q1-ner-annotations`](https://huggingface.co/datasets/LelViLamp/oalz-1788-q1-ner-annotations).
+
+ There are two versions of the dataset:
+ - [`5a-generate-union-dataset`](https://huggingface.co/datasets/LelViLamp/oalz-1788-q1-ner-annotations/tree/main/5a-generate-union-dataset) contains the texts split into chunks. This is how they were presented in the annotation application doccano.
+ - [`5b-merge-documents`](https://huggingface.co/datasets/LelViLamp/oalz-1788-q1-ner-annotations/tree/main/5b-merge-documents) does not retain this split. The text was merged into one long text and annotation indices were adapted accordingly.
+
+ Note that each of these directories contains three equivalent datasets:
+ - a Hugging Face/Arrow dataset, <sup>*</sup>
+ - a CSV, <sup>*</sup> and
+ - a JSONL file.
+
+ <sup>*</sup> The first two should be used together with `text.csv` to recover the context of each annotation. The JSONL file contains the full text itself.

  The following categories were included in the annotation process:

  | Tag | Label | Count | Total Length | Median Annotation Length | Mean Annotation Length | SD |
+ |:--------|:--------------|------:|-------------:|-------------------------:|-----------------------:|------:|
  | `EVENT` | Event | 294 | 6,090 | 18 | 20.71 | 13.24 |
  | `LOC` | Location | 2,449 | 24,417 | 9 | 9.97 | 6.21 |
  | `MISC` | Miscellaneous | 2,585 | 50,654 | 14 | 19.60 | 19.63 |
 
  The models' performance measures are as follows:

  | Model | Selected Epoch | Checkpoint | Validation Loss | Precision | Recall | F<sub>1</sub> | Accuracy |
+ |:-------------------------------------------------------------------|:--------------:|-----------:|----------------:|----------:|--------:|--------------:|---------:|
+ | [`EVENT`](https://huggingface.co/LelViLamp/oalz-1788-q1-ner-event) | 1 | `1393` | .021957 | .665233 | .343066 | .351528 | .995700 |
+ | [`LOC`](https://huggingface.co/LelViLamp/oalz-1788-q1-ner-loc) | 1 | `1393` | .033602 | .829535 | .803648 | .814146 | .990999 |
+ | [`MISC`](https://huggingface.co/LelViLamp/oalz-1788-q1-ner-misc) | 2 | `2786` | .123994 | .739221 | .503677 | .571298 | .968697 |
+ | [`ORG`](https://huggingface.co/LelViLamp/oalz-1788-q1-ner-org) | 1 | `1393` | .062769 | .744259 | .709738 | .726212 | .980288 |
+ | [`PER`](https://huggingface.co/LelViLamp/oalz-1788-q1-ner-per) | 2 | `2786` | .059186 | .914037 | .849048 | .879070 | .983253 |
+ | [`TIME`](https://huggingface.co/LelViLamp/oalz-1788-q1-ner-time) | 1 | `1393` | .016120 | .866866 | .724958 | .783099 | .994631 |

  ## Acknowledgements
  The dataset and models were created in the project _Kooperative Erschließung diffusen Wissens_ ([KEDiff](https://uni-salzburg.elsevierpure.com/de/projects/kooperative-erschließung-diffusen-wissens-ein-literaturwissenscha)), funded by the [State of Salzburg](https://salzburg.gv.at), Austria 🇦🇹, and carried out at [Paris Lodron Universität Salzburg](https://plus.ac.at).
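
The difference between the two dataset versions named in the added README text (texts split into chunks versus one merged text with adapted annotation indices) can be illustrated with a minimal sketch. It assumes that annotations are stored as character offsets with an exclusive end and that chunks are joined with a newline. Neither assumption is confirmed by the README. The `Annotation` class, the `merge_chunks` and `context` helpers, and the sample sentences are hypothetical and not taken from the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int  # inclusive character offset (assumed convention)
    end: int    # exclusive character offset (assumed convention)
    label: str


def merge_chunks(chunks, separator="\n"):
    """Concatenate chunk texts and shift each annotation by its chunk's offset.

    `chunks` is a list of (text, [Annotation, ...]) pairs. The separator is an
    assumption of this sketch, not something the README specifies.
    """
    merged_parts = []
    merged_annotations = []
    offset = 0
    for index, (text, annotations) in enumerate(chunks):
        if index > 0:
            merged_parts.append(separator)
            offset += len(separator)
        for ann in annotations:
            merged_annotations.append(
                Annotation(ann.start + offset, ann.end + offset, ann.label)
            )
        merged_parts.append(text)
        offset += len(text)
    return "".join(merged_parts), merged_annotations


def context(text, ann, window=10):
    """Return the annotated span with up to `window` characters on each side."""
    left = max(0, ann.start - window)
    right = min(len(text), ann.end + window)
    return text[left:right]


# Made-up sample chunks for illustration only, not data from the corpus.
chunks = [
    ("Salzburg, den 1. Jänner.", [Annotation(0, 8, "LOC")]),
    ("Aus Wien wird gemeldet.", [Annotation(4, 8, "LOC")]),
]
merged_text, merged_annotations = merge_chunks(chunks)
for ann in merged_annotations:
    print(ann.label, repr(merged_text[ann.start:ann.end]), repr(context(merged_text, ann)))
```

Slicing the merged text with the shifted offsets yields the same surface strings as slicing each chunk with its original offsets, which is the property a merge of this kind needs to preserve. The `context` helper shows why offset-only formats need the full text alongside them: an annotation's surroundings can only be read from the text it points into.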