browndw committed
Commit 1cd3532 · 1 Parent(s): 214220e

Update README.md

Files changed (1)
  1. README.md +96 -53

README.md CHANGED
@@ -6,13 +6,14 @@ datasets: COCA
 
 ## Model description
 
-**docusco-bert** is a fine-tuned BERT model that is ready to use for **token classification**. The model was trained on data from the Corpus of Contemporary American English ([COCA](https://www.english-corpora.org/coca/)) and classifies tokens and token sequences according to a system developed for the [**DocuScope**](https://www.cmu.edu/dietrich/english/research-and-publications/docuscope.html) dictionary-based tagger. Descriptions of the categories are included in a table below.
 
 ## About DocuScope
 DocuScope is a dictionary-based tagger that has been developed at Carnegie Mellon University by **David Kaufer** and **Suguru Ishizaki** since the early 2000s. Its categories are rhetorical in their orientation (as opposed to part-of-speech tags, for example, which are morphosyntactic).
 
 DocuScope has been used in [a wide variety of studies](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=docuscope&btnG=). Here, for example, is [a short analysis of King Lear](https://graphics.cs.wisc.edu/WP/vep/2017/02/14/guest-post-data-mining-king-lear/), and here is [a published study of Tweets](https://journals.sagepub.com/doi/full/10.1177/2055207619844865).
 
 ## Intended uses & limitations
 
 #### How to use
@@ -22,13 +23,10 @@ The model was trained on data with tags formatted using [IOB](https://en.wikiped
 ```python
 from transformers import AutoTokenizer, AutoModelForTokenClassification
 from transformers import pipeline
-
 tokenizer = AutoTokenizer.from_pretrained("browndw/docusco-bert")
 model = AutoModelForTokenClassification.from_pretrained("browndw/docusco-bert")
-
 nlp = pipeline("ner", model=model, tokenizer=tokenizer)
 example = "Globalization is the process of interaction and integration among people, companies, and governments worldwide."
-
 ds_results = nlp(example)
 print(ds_results)
 ```
@@ -39,13 +37,59 @@ This model is limited by its training dataset of American English texts. Moreove
 
 ## Training data
 
-This model was fine-tuned on data from the Corpus of Contemporary American English ([COCA](https://www.english-corpora.org/coca/)). The training data contain 1500 randomly sampled texts from each of 5 text-types: Academic, Fiction, Magazine, News, and Spoken.
 
-#### # of texts/chunks/tokens per dataset
-Dataset |Texts |Chunks |Tokens
--|-|-|-
-Train |7500 |1,167,584 |32,203,828
-Test |500 |58,117 |1,567,997
 
 ## Training procedure
 
@@ -55,53 +99,53 @@ This model was trained on a single 2.3 GHz Dual-Core Intel Core i5 with recommen
 ### Overall
 metric|test
 -|-
-f1 |.743
-accuracy |.801
 
 ### By category
 category|precision|recall|f1-score|support
 -|-|-|-|-
-AcademicTerms|0.76|0.77|0.76|140805
-AcademicWritingMoves|0.36|0.46|0.40|8182
-Character|0.74|0.78|0.76|123856
-Citation|0.73|0.81|0.77|13428
-CitationAuthority|0.55|0.49|0.51|4552
-CitationHedged|0.58|0.89|0.70|285
-ConfidenceHedged|0.76|0.84|0.79|14765
-ConfidenceHigh|0.64|0.72|0.68|11462
-ConfidenceLow|0.70|0.39|0.50|380
-Contingent|0.68|0.69|0.69|9537
-Description|0.60|0.67|0.63|108186
-Facilitate|0.63|0.63|0.63|7421
-FirstPerson|0.62|0.73|0.67|6235
-ForceStressed|0.65|0.72|0.69|37910
-Future|0.63|0.69|0.66|9049
-InformationChange|0.64|0.72|0.68|14560
-InformationChangeNegative|0.59|0.57|0.58|1840
-InformationChangePositive|0.61|0.58|0.60|4265
-InformationExposition|0.80|0.83|0.82|84977
-InformationPlace|0.80|0.82|0.81|18783
-InformationReportVerbs|0.71|0.79|0.75|17572
-InformationStates|0.74|0.80|0.77|21048
-InformationTopics|0.69|0.72|0.70|58677
-Inquiry|0.50|0.58|0.53|12735
-Interactive|0.64|0.70|0.67|18135
-MetadiscourseCohesive|0.90|0.93|0.92|33312
-MetadiscourseInteractive|0.54|0.62|0.58|6888
-Narrative|0.70|0.76|0.73|116896
-Negative|0.63|0.69|0.66|60534
-Positive|0.60|0.67|0.63|54374
-PublicTerms|0.70|0.74|0.72|38229
-Reasoning|0.71|0.76|0.74|30157
-Responsibility|0.59|0.63|0.61|3451
-Strategic|0.60|0.62|0.61|28064
-SyntacticComplexity|0.83|0.87|0.85|297387
-Uncertainty|0.43|0.44|0.43|2915
-Updates|0.52|0.53|0.53|6156
 -|-|-|-|-
-micro|avg|0.72|0.77|0.74|1427008
-macro|avg|0.65|0.69|0.67|1427008
-weighted|avg|0.72|0.77|0.74|1427008
 
 
  ## DocuScope Category Descriptions
@@ -179,4 +223,3 @@ Updates|References updates that anticipate someone searching for information and
 bibsource = {dblp computer science bibliography, https://dblp.org}
 }
 ```
-
 
 ## Model description
 
+**docusco-bert** is a fine-tuned BERT model that is ready to use for **token classification**. The model was trained on data sampled from the Corpus of Contemporary American English ([COCA](https://www.english-corpora.org/coca/)) and classifies tokens and token sequences according to a system developed for the [**DocuScope**](https://www.cmu.edu/dietrich/english/research-and-publications/docuscope.html) dictionary-based tagger. Descriptions of the categories are included in a table below.
 
 ## About DocuScope
 DocuScope is a dictionary-based tagger that has been developed at Carnegie Mellon University by **David Kaufer** and **Suguru Ishizaki** since the early 2000s. Its categories are rhetorical in their orientation (as opposed to part-of-speech tags, for example, which are morphosyntactic).
 
 DocuScope has been used in [a wide variety of studies](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=docuscope&btnG=). Here, for example, is [a short analysis of King Lear](https://graphics.cs.wisc.edu/WP/vep/2017/02/14/guest-post-data-mining-king-lear/), and here is [a published study of Tweets](https://journals.sagepub.com/doi/full/10.1177/2055207619844865).
 
+
 ## Intended uses & limitations
 
 #### How to use
 
 ```python
 from transformers import AutoTokenizer, AutoModelForTokenClassification
 from transformers import pipeline
 tokenizer = AutoTokenizer.from_pretrained("browndw/docusco-bert")
 model = AutoModelForTokenClassification.from_pretrained("browndw/docusco-bert")
 nlp = pipeline("ner", model=model, tokenizer=tokenizer)
 example = "Globalization is the process of interaction and integration among people, companies, and governments worldwide."
 ds_results = nlp(example)
 print(ds_results)
 ```
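
By default the `ner` pipeline returns one prediction per BERT wordpiece, so multi-piece words come back as several rows. If word- or span-level output is more convenient, the same pipeline can merge adjacent subword predictions. This is a minimal sketch using the `aggregation_strategy` option of `transformers`; it is an illustration added here, not part of the original card:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("browndw/docusco-bert")
model = AutoModelForTokenClassification.from_pretrained("browndw/docusco-bert")

# "simple" merges consecutive wordpieces that share a predicted tag into one
# span and reports an averaged confidence score for that span.
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

example = "Globalization is the process of interaction and integration among people, companies, and governments worldwide."
for span in nlp(example):
    print(span["entity_group"], span["word"], round(float(span["score"]), 3))
```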
 
 ## Training data
 
+This model was fine-tuned on data from the Corpus of Contemporary American English ([COCA](https://www.english-corpora.org/coca/)). The training data contain chunks of text randomly sampled from 5 text-types: Academic, Fiction, Magazine, News, and Spoken.
+
+Typically, BERT models are trained on sentence segments. However, DocuScope tags can span sentences. Thus, data were split into chunks that don't split **B + I** sequences and that end with sentence-final punctuation marks (i.e., a period, question mark, or exclamation point); a rough sketch of this rule appears after the category table below.
+
+Additionally, the order of the chunks was randomized prior to sampling, and stratified sampling was used to provide enough training data for low-frequency categories. The resulting training data consist of:
 
+* 21,460,177 tokens
+* 15,796,305 chunks
+
+The specific counts for each category appear in the following table.
+
+Category|Count
+-|-
+O|3528038
+Syntactic Complexity|2032808
+Character|1413771
+Description|1224744
+Narrative|1159201
+Negative|651012
+Academic Terms|620932
+Interactive|594908
+Information Exposition|578228
+Positive|463914
+Force Stressed|432631
+Information Topics|394155
+First Person|249744
+Metadiscourse Cohesive|240822
+Strategic|238255
+Public Terms|234213
+Reasoning|213775
+Information Place|187249
+Information States|173146
+Information Report Verbs|119092
+Confidence High|112861
+Confidence Hedged|110008
+Future|96101
+Inquiry|94995
+Contingent|94860
+Information Change|89063
+Metadiscourse Interactive|84033
+Updates|81424
+Citation|71241
+Facilitate|50451
+Uncertainty|35644
+Academic Writing Moves|29352
+Information Change Positive|28475
+Responsibility|25362
+Citation Authority|22414
+Information Change Negative|15612
+Confidence Low|2876
+Citation Hedged|895
+-|-
+Total|15796305
 
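The chunk-splitting rule described above (end a chunk at sentence-final punctuation, but never cut through a **B + I** tag sequence) can be pictured with a short sketch. The function below is a hypothetical illustration added by the editor, not the code actually used to build the training data:

```python
from typing import List, Tuple

def chunk_tagged_tokens(tagged: List[Tuple[str, str]]) -> List[List[Tuple[str, str]]]:
    """Split (token, IOB-tag) pairs into chunks that end at sentence-final
    punctuation and never separate a B- tag from its following I- tags."""
    chunks, current = [], []
    for i, (token, tag) in enumerate(tagged):
        current.append((token, tag))
        next_tag = tagged[i + 1][1] if i + 1 < len(tagged) else "O"
        # Break only where the sentence ends and the next token does not
        # continue an open DocuScope span (an I- tag).
        if token in {".", "?", "!"} and not next_tag.startswith("I-"):
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks

# Two sentences -> two chunks; a B-/I- sequence is never split across chunks.
sample = [("Globalization", "B-AcademicTerms"), ("is", "O"), ("growing", "O"), (".", "O"),
          ("It", "O"), ("matters", "O"), (".", "O")]
print([len(c) for c in chunk_tagged_tokens(sample)])  # [4, 3]
```
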
  ## Training procedure
 
 ### Overall
 metric|test
 -|-
+f1 |0.927
+accuracy |0.943
 
 ### By category
 category|precision|recall|f1-score|support
 -|-|-|-|-
+AcademicTerms|0.91|0.92|0.92|486399
+AcademicWritingMoves|0.76|0.82|0.79|20017
+Character|0.94|0.95|0.94|1260272
+Citation|0.92|0.94|0.93|50812
+CitationAuthority|0.86|0.88|0.87|17798
+CitationHedged|0.91|0.94|0.92|632
+ConfidenceHedged|0.94|0.96|0.95|90393
+ConfidenceHigh|0.92|0.94|0.93|113569
+ConfidenceLow|0.79|0.81|0.80|2556
+Contingent|0.92|0.94|0.93|81366
+Description|0.87|0.89|0.88|1098598
+Facilitate|0.87|0.90|0.89|41760
+FirstPerson|0.96|0.98|0.97|330658
+ForceStressed|0.93|0.94|0.93|436188
+Future|0.90|0.93|0.92|93365
+InformationChange|0.88|0.91|0.89|72813
+InformationChangeNegative|0.83|0.85|0.84|12740
+InformationChangePositive|0.82|0.86|0.84|22994
+InformationExposition|0.94|0.95|0.95|468078
+InformationPlace|0.95|0.96|0.96|147688
+InformationReportVerbs|0.91|0.93|0.92|95563
+InformationStates|0.95|0.95|0.95|139429
+InformationTopics|0.90|0.92|0.91|328152
+Inquiry|0.85|0.89|0.87|79030
+Interactive|0.95|0.96|0.95|602857
+MetadiscourseCohesive|0.97|0.98|0.98|195548
+MetadiscourseInteractive|0.92|0.94|0.93|73159
+Narrative|0.92|0.94|0.93|1023452
+Negative|0.88|0.89|0.88|645810
+Positive|0.87|0.89|0.88|409775
+PublicTerms|0.91|0.92|0.91|184108
+Reasoning|0.93|0.95|0.94|169208
+Responsibility|0.83|0.87|0.85|21819
+Strategic|0.88|0.90|0.89|193768
+SyntacticComplexity|0.95|0.96|0.96|1635918
+Uncertainty|0.87|0.91|0.89|33684
+Updates|0.91|0.93|0.92|77760
 -|-|-|-|-
+micro avg|0.92|0.93|0.93|10757736
+macro avg|0.90|0.92|0.91|10757736
+weighted avg|0.92|0.93|0.93|10757736
 
 
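The card does not say which tool produced these tables. For reference, a per-category report with exactly this shape (precision, recall, f1-score, support, plus micro/macro/weighted averages) is what `seqeval` generates for IOB-tagged data; a minimal sketch, assuming that library rather than the author's actual evaluation script:

```python
from seqeval.metrics import classification_report

# Toy gold and predicted tag sequences in the same IOB scheme as the model.
y_true = [["B-Character", "I-Character", "O", "B-ConfidenceHedged", "O"]]
y_pred = [["B-Character", "I-Character", "O", "O", "O"]]

print(classification_report(y_true, y_pred, digits=2))
```
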
  ## DocuScope Category Descriptions
 
 bibsource = {dblp computer science bibliography, https://dblp.org}
 }
 ```