browndw committed
Commit 1cd3532 · 1 Parent(s): 214220e

Update README.md

Files changed (1)
  1. README.md +96 -53

README.md CHANGED
@@ -6,13 +6,14 @@ datasets: COCA
 
 ## Model description
 
-**docusco-bert** is a fine-tuned BERT model that is ready to use for **token classification**. The model was trained on data from the Corpus of Contemporary American English ([COCA](https://www.english-corpora.org/coca/)) and classifies tokens and token sequences according to a system developed for the [**DocuScope**](https://www.cmu.edu/dietrich/english/research-and-publications/docuscope.html) dictionary-based tagger. Descriptions of the categories are included in a table below.
 
 ## About DocuScope
 DocuScope is a dictionary-based tagger that has been developed at Carnegie Mellon University by **David Kaufer** and **Suguru Ishizaki** since the early 2000s. Its categories are rhetorical in their orientation (as opposed to part-of-speech tags, for example, which are morphosyntactic).
 
 DocuScope has been used in [a wide variety of studies](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=docuscope&btnG=). Here, for example, is [a short analysis of King Lear](https://graphics.cs.wisc.edu/WP/vep/2017/02/14/guest-post-data-mining-king-lear/), and here is [a published study of Tweets](https://journals.sagepub.com/doi/full/10.1177/2055207619844865).
 
 ## Intended uses & limitations
 
 #### How to use
@@ -22,13 +23,10 @@ The model was trained on data with tags formatted using [IOB](https://en.wikiped
 ```python
 from transformers import AutoTokenizer, AutoModelForTokenClassification
 from transformers import pipeline
-
 tokenizer = AutoTokenizer.from_pretrained("browndw/docusco-bert")
 model = AutoModelForTokenClassification.from_pretrained("browndw/docusco-bert")
-
 nlp = pipeline("ner", model=model, tokenizer=tokenizer)
 example = "Globalization is the process of interaction and integration among people, companies, and governments worldwide."
-
 ds_results = nlp(example)
 print(ds_results)
 ```
@@ -39,13 +37,59 @@ This model is limited by its training dataset of American English texts. Moreove
 
 ## Training data
 
-This model was fine-tuned on data from the Corpus of Contemporary American English ([COCA](https://www.english-corpora.org/coca/)). The training data contain 1500 randomly sampled texts from each of 5 text-types: Academic, Fiction, Magazine, News, and Spoken.
 
-#### # of texts/chunks/tokens per dataset
-Dataset |Texts |Chunks |Tokens
--|-|-|-
-Train |7500 |1,167,584 |32,203,828
-Test |500 |58,117 |1,567,997
 
 ## Training procedure
 
@@ -55,53 +99,53 @@ This model was trained on a single 2.3 GHz Dual-Core Intel Core i5 with recommen
 ### Overall
 metric|test
 -|-
-f1 |.743
-accuracy |.801
 
 ### By category
 category|precision|recall|f1-score|support
 -|-|-|-|-
-AcademicTerms|0.76|0.77|0.76|140805
-AcademicWritingMoves|0.36|0.46|0.40|8182
-Character|0.74|0.78|0.76|123856
-Citation|0.73|0.81|0.77|13428
-CitationAuthority|0.55|0.49|0.51|4552
-CitationHedged|0.58|0.89|0.70|285
-ConfidenceHedged|0.76|0.84|0.79|14765
-ConfidenceHigh|0.64|0.72|0.68|11462
-ConfidenceLow|0.70|0.39|0.50|380
-Contingent|0.68|0.69|0.69|9537
-Description|0.60|0.67|0.63|108186
-Facilitate|0.63|0.63|0.63|7421
-FirstPerson|0.62|0.73|0.67|6235
-ForceStressed|0.65|0.72|0.69|37910
-Future|0.63|0.69|0.66|9049
-InformationChange|0.64|0.72|0.68|14560
-InformationChangeNegative|0.59|0.57|0.58|1840
-InformationChangePositive|0.61|0.58|0.60|4265
-InformationExposition|0.80|0.83|0.82|84977
-InformationPlace|0.80|0.82|0.81|18783
-InformationReportVerbs|0.71|0.79|0.75|17572
-InformationStates|0.74|0.80|0.77|21048
-InformationTopics|0.69|0.72|0.70|58677
-Inquiry|0.50|0.58|0.53|12735
-Interactive|0.64|0.70|0.67|18135
-MetadiscourseCohesive|0.90|0.93|0.92|33312
-MetadiscourseInteractive|0.54|0.62|0.58|6888
-Narrative|0.70|0.76|0.73|116896
-Negative|0.63|0.69|0.66|60534
-Positive|0.60|0.67|0.63|54374
-PublicTerms|0.70|0.74|0.72|38229
-Reasoning|0.71|0.76|0.74|30157
-Responsibility|0.59|0.63|0.61|3451
-Strategic|0.60|0.62|0.61|28064
-SyntacticComplexity|0.83|0.87|0.85|297387
-Uncertainty|0.43|0.44|0.43|2915
-Updates|0.52|0.53|0.53|6156
 -|-|-|-|-
-micro|avg|0.72|0.77|0.74|1427008
-macro|avg|0.65|0.69|0.67|1427008
-weighted|avg|0.72|0.77|0.74|1427008
 
 
  ## DocuScope Category Descriptions
@@ -179,4 +223,3 @@ Updates|References updates that anticipate someone searching for information and
 bibsource = {dblp computer science bibliography, https://dblp.org}
 }
 ```
-
 
 ## Model description
 
+**docusco-bert** is a fine-tuned BERT model that is ready to use for **token classification**. The model was trained on data sampled from the Corpus of Contemporary American English ([COCA](https://www.english-corpora.org/coca/)) and classifies tokens and token sequences according to a system developed for the [**DocuScope**](https://www.cmu.edu/dietrich/english/research-and-publications/docuscope.html) dictionary-based tagger. Descriptions of the categories are included in a table below.
 
 ## About DocuScope
 DocuScope is a dictionary-based tagger that has been developed at Carnegie Mellon University by **David Kaufer** and **Suguru Ishizaki** since the early 2000s. Its categories are rhetorical in their orientation (as opposed to part-of-speech tags, for example, which are morphosyntactic).
 
 DocuScope has been used in [a wide variety of studies](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=docuscope&btnG=). Here, for example, is [a short analysis of King Lear](https://graphics.cs.wisc.edu/WP/vep/2017/02/14/guest-post-data-mining-king-lear/), and here is [a published study of Tweets](https://journals.sagepub.com/doi/full/10.1177/2055207619844865).
 
+
 ## Intended uses & limitations
 
 #### How to use
 
 ```python
 from transformers import AutoTokenizer, AutoModelForTokenClassification
 from transformers import pipeline
 tokenizer = AutoTokenizer.from_pretrained("browndw/docusco-bert")
 model = AutoModelForTokenClassification.from_pretrained("browndw/docusco-bert")
 nlp = pipeline("ner", model=model, tokenizer=tokenizer)
 example = "Globalization is the process of interaction and integration among people, companies, and governments worldwide."
 ds_results = nlp(example)
 print(ds_results)
 ```
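
By default the `ner` pipeline returns one prediction per BERT wordpiece, so multi-piece words come back as several rows. If word- or span-level output is more convenient, the same pipeline can merge adjacent subword predictions. This is a minimal sketch using the `aggregation_strategy` option of `transformers`; it is an illustration added here, not part of the original card:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("browndw/docusco-bert")
model = AutoModelForTokenClassification.from_pretrained("browndw/docusco-bert")

# "simple" merges consecutive wordpieces that share a predicted tag into one
# span and reports an averaged confidence score for that span.
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

example = "Globalization is the process of interaction and integration among people, companies, and governments worldwide."
for span in nlp(example):
    print(span["entity_group"], span["word"], round(float(span["score"]), 3))
```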
 
 ## Training data
 
+This model was fine-tuned on data from the Corpus of Contemporary American English ([COCA](https://www.english-corpora.org/coca/)). The training data contain chunks of text randomly sampled from 5 text-types: Academic, Fiction, Magazine, News, and Spoken.
+
+Typically, BERT models are trained on sentence segments. However, DocuScope tags can span sentences. Thus, data were split into chunks that don't split **B + I** sequences and that end with sentence-final punctuation marks (i.e., a period, question mark, or exclamation point); a rough sketch of this rule appears after the category table below.
+
+Additionally, the order of the chunks was randomized prior to sampling, and stratified sampling was used to provide enough training data for low-frequency categories. The resulting training data consist of:
 
+* 21,460,177 tokens
+* 15,796,305 chunks
+
+The specific counts for each category appear in the following table.
+
+Category|Count
+-|-
+O|3528038
+Syntactic Complexity|2032808
+Character|1413771
+Description|1224744
+Narrative|1159201
+Negative|651012
+Academic Terms|620932
+Interactive|594908
+Information Exposition|578228
+Positive|463914
+Force Stressed|432631
+Information Topics|394155
+First Person|249744
+Metadiscourse Cohesive|240822
+Strategic|238255
+Public Terms|234213
+Reasoning|213775
+Information Place|187249
+Information States|173146
+Information Report Verbs|119092
+Confidence High|112861
+Confidence Hedged|110008
+Future|96101
+Inquiry|94995
+Contingent|94860
+Information Change|89063
+Metadiscourse Interactive|84033
+Updates|81424
+Citation|71241
+Facilitate|50451
+Uncertainty|35644
+Academic Writing Moves|29352
+Information Change Positive|28475
+Responsibility|25362
+Citation Authority|22414
+Information Change Negative|15612
+Confidence Low|2876
+Citation Hedged|895
+-|-
+Total|15796305
 
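The chunk-splitting rule described above (end a chunk at sentence-final punctuation, but never cut through a **B + I** tag sequence) can be pictured with a short sketch. The function below is a hypothetical illustration added by the editor, not the code actually used to build the training data:

```python
from typing import List, Tuple

def chunk_tagged_tokens(tagged: List[Tuple[str, str]]) -> List[List[Tuple[str, str]]]:
    """Split (token, IOB-tag) pairs into chunks that end at sentence-final
    punctuation and never separate a B- tag from its following I- tags."""
    chunks, current = [], []
    for i, (token, tag) in enumerate(tagged):
        current.append((token, tag))
        next_tag = tagged[i + 1][1] if i + 1 < len(tagged) else "O"
        # Break only where the sentence ends and the next token does not
        # continue an open DocuScope span (an I- tag).
        if token in {".", "?", "!"} and not next_tag.startswith("I-"):
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks

# Two sentences -> two chunks; a B-/I- sequence is never split across chunks.
sample = [("Globalization", "B-AcademicTerms"), ("is", "O"), ("growing", "O"), (".", "O"),
          ("It", "O"), ("matters", "O"), (".", "O")]
print([len(c) for c in chunk_tagged_tokens(sample)])  # [4, 3]
```
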
  ## Training procedure
 
 ### Overall
 metric|test
 -|-
+f1 |0.927
+accuracy |0.943
 
 ### By category
 category|precision|recall|f1-score|support
 -|-|-|-|-
+AcademicTerms|0.91|0.92|0.92|486399
+AcademicWritingMoves|0.76|0.82|0.79|20017
+Character|0.94|0.95|0.94|1260272
+Citation|0.92|0.94|0.93|50812
+CitationAuthority|0.86|0.88|0.87|17798
+CitationHedged|0.91|0.94|0.92|632
+ConfidenceHedged|0.94|0.96|0.95|90393
+ConfidenceHigh|0.92|0.94|0.93|113569
+ConfidenceLow|0.79|0.81|0.80|2556
+Contingent|0.92|0.94|0.93|81366
+Description|0.87|0.89|0.88|1098598
+Facilitate|0.87|0.90|0.89|41760
+FirstPerson|0.96|0.98|0.97|330658
+ForceStressed|0.93|0.94|0.93|436188
+Future|0.90|0.93|0.92|93365
+InformationChange|0.88|0.91|0.89|72813
+InformationChangeNegative|0.83|0.85|0.84|12740
+InformationChangePositive|0.82|0.86|0.84|22994
+InformationExposition|0.94|0.95|0.95|468078
+InformationPlace|0.95|0.96|0.96|147688
+InformationReportVerbs|0.91|0.93|0.92|95563
+InformationStates|0.95|0.95|0.95|139429
+InformationTopics|0.90|0.92|0.91|328152
+Inquiry|0.85|0.89|0.87|79030
+Interactive|0.95|0.96|0.95|602857
+MetadiscourseCohesive|0.97|0.98|0.98|195548
+MetadiscourseInteractive|0.92|0.94|0.93|73159
+Narrative|0.92|0.94|0.93|1023452
+Negative|0.88|0.89|0.88|645810
+Positive|0.87|0.89|0.88|409775
+PublicTerms|0.91|0.92|0.91|184108
+Reasoning|0.93|0.95|0.94|169208
+Responsibility|0.83|0.87|0.85|21819
+Strategic|0.88|0.90|0.89|193768
+SyntacticComplexity|0.95|0.96|0.96|1635918
+Uncertainty|0.87|0.91|0.89|33684
+Updates|0.91|0.93|0.92|77760
 -|-|-|-|-
+micro avg|0.92|0.93|0.93|10757736
+macro avg|0.90|0.92|0.91|10757736
+weighted avg|0.92|0.93|0.93|10757736
 
 
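The card does not say which tool produced these tables. For reference, a per-category report with exactly this shape (precision, recall, f1-score, support, plus micro/macro/weighted averages) is what `seqeval` generates for IOB-tagged data; a minimal sketch, assuming that library rather than the author's actual evaluation script:

```python
from seqeval.metrics import classification_report

# Toy gold and predicted tag sequences in the same IOB scheme as the model.
y_true = [["B-Character", "I-Character", "O", "B-ConfidenceHedged", "O"]]
y_pred = [["B-Character", "I-Character", "O", "O", "O"]]

print(classification_report(y_true, y_pred, digits=2))
```
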
  ## DocuScope Category Descriptions
 
 bibsource = {dblp computer science bibliography, https://dblp.org}
 }
 ```