Tihsrah-CD committed on
Commit c97f929 · 1 Parent(s): c6e6560

Topic Classifier v2 Added

feat: Push updated Topic Classifier model with eval_loss 0.0233, eval_accuracy 0.9908, eval_f1 0.9908, CORPORATE_DOCUMENTS precision 1.00, FINANCIAL precision 0.95, HARMFUL precision 0.95, MEDICAL precision 0.99, accuracy 0.99, macro avg F1 0.97, weighted avg F1 0.99, support 4565 samples

README.md ADDED
@@ -0,0 +1,128 @@
+
+ # Topic Classifier
+
+ This repository contains the Topic Classifier model developed by DAXA.AI. The Topic Classifier is a machine learning model that categorizes text documents into four domains: corporate documents, financial texts, harmful content, and medical documents.
+
+ ## Model Details
+
+ ### Model Description
+
+ The Topic Classifier is a DistilBERT-based model fine-tuned from `distilbert-base-uncased`. It assigns each input text one of four labels: `CORPORATE_DOCUMENTS`, `FINANCIAL`, `HARMFUL`, or `MEDICAL`. It is intended for topic classification across these sectors in business use cases.
+
+ - **Developed by:** DAXA.AI
+ - **Funded by:** Open Source
+ - **Model type:** Text classification
+ - **Language(s):** English
+ - **License:** MIT
+ - **Fine-tuned from:** `distilbert-base-uncased`
+
+ ### Model Sources
+
+ - **Repository:** [https://huggingface.co/daxa-ai/topic-classifier](https://huggingface.co/daxa-ai/Topic-Classifier-2)
+ - **Demo:** [https://huggingface.co/spaces/daxa-ai/Topic-Classifier-2](https://huggingface.co/spaces/daxa-ai/Topic-Classifier-2)
+
+ ## Usage
+
+ ### How to Get Started with the Model
+
+ To use the Topic Classifier in your Python project, follow the steps below:
+
+ ```python
+ # Import necessary libraries
+ import joblib
+ import torch
+ from huggingface_hub import hf_hub_download
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+ # Hugging Face model repository
+ REPO_NAME = "daxa-ai/topic-classifier"
+
+ # Label encoder file stored in the repository
+ LABEL_ENCODER_FILE = "label_encoder.joblib"
+
+ # Load the tokenizer and model
+ tokenizer = AutoTokenizer.from_pretrained(REPO_NAME)
+ model = AutoModelForSequenceClassification.from_pretrained(REPO_NAME)
+ model.eval()
+
+ # Example text; truncate to the model's 512-token limit
+ text = "Please enter your text here."
+ encoded_input = tokenizer(text, return_tensors="pt", truncation=True)
+
+ # Run inference without tracking gradients
+ with torch.no_grad():
+     output = model(**encoded_input)
+
+ # Apply softmax to the logits
+ probabilities = torch.nn.functional.softmax(output.logits, dim=-1)
+
+ # Get the predicted label index
+ predicted_label = torch.argmax(probabilities, dim=-1)
+
+ # Download and cache the label encoder file
+ # (hf_hub_download replaces the deprecated hf_hub_url + cached_download pair)
+ filename = hf_hub_download(repo_id=REPO_NAME, filename=LABEL_ENCODER_FILE)
+
+ # Load the label encoder
+ label_encoder = joblib.load(filename)
+
+ # Decode the predicted label
+ decoded_label = label_encoder.inverse_transform(predicted_label.numpy())
+
+ print(decoded_label)
+ ```
+
+ ## Training Details
+
+ ### Training Data
+
+ The training dataset consists of 29,286 entries, categorized into four distinct labels. The distribution of these labels is presented below:
+
+ | Document Type       | Instances |
+ | ------------------- | --------- |
+ | CORPORATE_DOCUMENTS | 17,649    |
+ | FINANCIAL           | 3,385     |
+ | HARMFUL             | 2,388     |
+ | MEDICAL             | 5,864     |
+
+ ### Evaluation
+
+ #### Testing Data & Metrics
+
+ The model was evaluated on a dataset consisting of 4,565 entries. The distribution of labels in the evaluation set is shown below:
+
+ | Document Type       | Instances |
+ | ------------------- | --------- |
+ | CORPORATE_DOCUMENTS | 3,051     |
+ | FINANCIAL           | 409       |
+ | HARMFUL             | 246       |
+ | MEDICAL             | 859       |
+
+ The evaluation metrics include precision, recall, and F1-score, calculated for each label:
+
+ | Document Type       | Precision | Recall | F1-Score | Support |
+ | ------------------- | --------- | ------ | -------- | ------- |
+ | CORPORATE_DOCUMENTS | 1.00      | 1.00   | 1.00     | 3,051   |
+ | FINANCIAL           | 0.95      | 0.96   | 0.96     | 409     |
+ | HARMFUL             | 0.95      | 0.95   | 0.95     | 246     |
+ | MEDICAL             | 0.99      | 1.00   | 0.99     | 859     |
+ | Accuracy            |           |        | 0.99     | 4,565   |
+ | Macro Avg           | 0.97      | 0.98   | 0.97     | 4,565   |
+ | Weighted Avg        | 0.99      | 0.99   | 0.99     | 4,565   |
+
+ #### Test Data Evaluation Results
+
+ The model's evaluation results are as follows:
+
+ - **Evaluation Loss:** 0.0233
+ - **Accuracy:** 0.9908
+ - **Precision:** 0.9909
+ - **Recall:** 0.9908
+ - **F1-Score:** 0.9908
+ - **Evaluation Runtime:** 30.1149 seconds
+ - **Evaluation Samples Per Second:** 151.586
+ - **Evaluation Steps Per Second:** 2.391
+
+ ## Conclusion
+
+ On the 4,565-entry evaluation set, the Topic Classifier reaches 0.99 accuracy and a weighted-average F1-score of 0.99, with per-label F1-scores ranging from 0.95 (HARMFUL) to 1.00 (CORPORATE_DOCUMENTS). These results support its use for categorizing text across the domains of corporate documents, financial content, harmful content, and medical texts.
+
+ For more information or to try the model yourself, check out the public space [here](https://huggingface.co/spaces/daxa-ai/Topic-Classifier-2).
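
The averaged rows in the README's metrics table can be cross-checked against its per-label rows. The sketch below is not part of the commit: it recomputes the macro and support-weighted F1 from the two-decimal values in the table, and the samples-per-second figure from the reported runtime. Because the inputs are already rounded, agreement with the README is only expected to about two decimals. All variable names are illustrative.

```python
# Per-label F1 and support, copied from the README metrics table (two-decimal values).
per_class = {
    "CORPORATE_DOCUMENTS": (1.00, 3051),
    "FINANCIAL": (0.96, 409),
    "HARMFUL": (0.95, 246),
    "MEDICAL": (0.99, 859),
}

total_support = sum(support for _, support in per_class.values())

# Macro average: unweighted mean over labels.
macro_f1 = sum(f1 for f1, _ in per_class.values()) / len(per_class)

# Weighted average: mean over labels, weighted by support.
weighted_f1 = sum(f1 * support for f1, support in per_class.values()) / total_support

# Throughput implied by the reported evaluation runtime (seconds).
eval_runtime_s = 30.1149
samples_per_second = total_support / eval_runtime_s

print(
    f"support={total_support} macro_f1={macro_f1:.3f} "
    f"weighted_f1={weighted_f1:.3f} samples/s={samples_per_second:.1f}"
)
```

The macro average sits below the weighted average because it gives the two smaller labels, FINANCIAL and HARMFUL, the same weight as CORPORATE_DOCUMENTS, which makes up about two thirds of the evaluation set.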
config.json ADDED
@@ -0,0 +1,36 @@
+ {
+   "_name_or_path": "distilbert-base-uncased",
+   "activation": "gelu",
+   "architectures": [
+     "DistilBertForSequenceClassification"
+   ],
+   "attention_dropout": 0.1,
+   "dim": 768,
+   "dropout": 0.1,
+   "hidden_dim": 3072,
+   "id2label": {
+     "0": "CORPORATE_DOCUMENTS",
+     "1": "FINANCIAL",
+     "2": "HARMFUL",
+     "3": "MEDICAL"
+   },
+   "initializer_range": 0.02,
+   "label2id": {
+     "CORPORATE_DOCUMENTS": 0,
+     "FINANCIAL": 1,
+     "HARMFUL": 2,
+     "MEDICAL": 3
+   },
+   "max_position_embeddings": 512,
+   "model_type": "distilbert",
+   "n_heads": 12,
+   "n_layers": 6,
+   "pad_token_id": 0,
+   "qa_dropout": 0.1,
+   "seq_classif_dropout": 0.2,
+   "sinusoidal_pos_embds": false,
+   "tie_weights_": true,
+   "torch_dtype": "float32",
+   "transformers_version": "4.45.1",
+   "vocab_size": 30522
+ }
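
Since `config.json` carries an `id2label` mapping, a predicted class index can be turned into a label name without any extra file. The following is a minimal sketch of that decoding step in plain Python, not code from the commit: the logits are made-up values standing in for `model(**inputs).logits`, and `softmax` and `decode` are illustrative helper names. The ids follow the alphabetical order of the label names, which is also the order scikit-learn's `LabelEncoder` would assign, but the contents of `label_encoder.joblib` are not visible in this diff, so treating the two mappings as identical is an assumption.

```python
import math

# Label mapping copied from config.json (keys converted to int).
ID2LABEL = {0: "CORPORATE_DOCUMENTS", 1: "FINANCIAL", 2: "HARMFUL", 3: "MEDICAL"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


# Hypothetical logits for a single input; real values come from the model.
example_logits = [-1.2, 0.3, -0.8, 4.1]
label, prob = decode(example_logits)
print(label, round(prob, 3))
```

With a model loaded through Transformers, the same mapping is exposed as `model.config.id2label`.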
label_encoder.joblib ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ecc34413f18d00dd522f2996ce202a485c39fc1e0def340590a6469914332400
+ size 582
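
The three lines above are a Git LFS pointer rather than the encoder itself: the repository stores the format version, a SHA-256 object id, and the byte size, while the 582-byte payload lives in LFS storage. As a rough illustration of the pointer layout, the sketch below parses those lines; `parse_lfs_pointer` is a hypothetical helper written for this note, not part of any library.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into its version, hash, and size fields."""
    fields = {}
    for line in text.strip().splitlines():
        # Each pointer line is "<key> <value>".
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


# Pointer text copied from the label_encoder.joblib diff above.
pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:ecc34413f18d00dd522f2996ce202a485c39fc1e0def340590a6469914332400
size 582
"""

info = parse_lfs_pointer(pointer)
print(info["hash_algo"], info["size"])
```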
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:01349bd229a507099512340ff61bf05d9a05fc96556d78f49f9338025ff60fa7
+ size 267860714
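
The pointer's size can be sanity-checked against `config.json`. The sketch below counts parameters from the config values, assuming the usual layer layout of `DistilBertForSequenceClassification` in Transformers (word and position embeddings with a LayerNorm, six transformer blocks, then a pre-classifier and a classifier layer). That layout is an assumption taken from the library, not something shown in this diff, and `linear` is an illustrative helper.

```python
# Values copied from config.json in this commit.
vocab_size, max_positions = 30522, 512
dim, hidden_dim, n_layers, num_labels = 768, 3072, 6, 4


def linear(n_in, n_out):
    """Parameter count of a dense layer with bias."""
    return n_in * n_out + n_out


layer_norm = 2 * dim  # scale and shift vectors

# Word embeddings + position embeddings + embedding LayerNorm.
embeddings = vocab_size * dim + max_positions * dim + layer_norm

per_layer = (
    4 * linear(dim, dim)        # query, key, value, and output projections
    + layer_norm                # LayerNorm after attention
    + linear(dim, hidden_dim)   # feed-forward expansion
    + linear(hidden_dim, dim)   # feed-forward projection
    + layer_norm                # LayerNorm after feed-forward
)

# Classification head: pre-classifier followed by the 4-way classifier.
head = linear(dim, dim) + linear(dim, num_labels)

total_params = embeddings + n_layers * per_layer + head
weight_bytes = total_params * 4  # torch_dtype is float32
lfs_size = 267_860_714           # size from the LFS pointer above

print(total_params, weight_bytes, lfs_size - weight_bytes)
```

Under these assumptions the float32 weights account for all but a few tens of kilobytes of the file, which would be consistent with serialization overhead in `pytorch_model.bin`.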
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "cls_token": "[CLS]",
+   "mask_token": "[MASK]",
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "unk_token": "[UNK]"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,55 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "[CLS]",
+   "do_lower_case": true,
+   "mask_token": "[MASK]",
+   "model_max_length": 512,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "DistilBertTokenizer",
+   "unk_token": "[UNK]"
+ }
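
To make the role of these settings concrete, the sketch below shows how the special-token ids and `model_max_length` shape a single-sequence input: the ids are wrapped as `[CLS] ... [SEP]`, truncated so the total stays within 512, and optionally right-padded with `[PAD]`. It is an illustration only; `build_inputs` is a hypothetical helper, the WordPiece ids are placeholder values, and in practice the tokenizer does this itself when called with `truncation=True` and `padding`.

```python
# Special-token ids and length limit copied from tokenizer_config.json.
PAD_ID, UNK_ID, CLS_ID, SEP_ID, MASK_ID = 0, 100, 101, 102, 103
MODEL_MAX_LENGTH = 512


def build_inputs(token_ids, max_length=MODEL_MAX_LENGTH, pad_to=None):
    """Wrap ids as [CLS] ... [SEP], truncate to max_length, optionally right-pad."""
    body = list(token_ids)[: max_length - 2]  # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    if pad_to is not None and pad_to > len(input_ids):
        padding = pad_to - len(input_ids)
        input_ids += [PAD_ID] * padding
        attention_mask += [0] * padding  # padded positions are masked out
    return input_ids, attention_mask


# Placeholder ids standing in for real tokenizer output.
ids, mask = build_inputs([2023, 2003, 1037, 3231], pad_to=8)
print(ids)
print(mask)
```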
vocab.txt ADDED
The diff for this file is too large to render. See raw diff