bert-base-multilingual-cased-finetuned-openalex-topic-classification-title-abstract

This model is a fine-tuned version of bert-base-multilingual-cased, trained on a labeled dataset provided by CWTS (see: CWTS Labeled Data). For details on how CWTS labeled the data, see the blog post "An open approach for classifying research publications".

It was built to classify scholarly works into a fixed set of well-defined topics. This is NOT the full model used to tag OpenAlex works with a topic. For that, see the following GitHub repo: OpenAlex Topic Classification

That repository will also contain information about text preprocessing, modeling, testing, and deployment.

Model description

The model was trained on input in the following formats, so input at inference time should follow the same format:

Using both title and abstract: "<TITLE> {insert-processed-title-here}\n<ABSTRACT> {insert-processed-abstract-here}"

Using only title: "<TITLE> {insert-processed-title-here}"

Using only abstract: "<TITLE> NONE\n<ABSTRACT> {insert-processed-abstract-here}"
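The three formats above can be produced by one small helper. This is a minimal sketch, not part of the model's own code: the function name `build_model_input` is hypothetical, and it assumes the title and abstract have already been preprocessed as described in the OpenAlex repository.

```python
from typing import Optional


def build_model_input(title: Optional[str] = None, abstract: Optional[str] = None) -> str:
    """Format a processed title and/or abstract in the layout described above."""
    if not title and not abstract:
        raise ValueError("Provide at least a title or an abstract.")
    # A missing title is written as the literal placeholder NONE.
    parts = [f"<TITLE> {title if title else 'NONE'}"]
    # The abstract line is omitted entirely when there is no abstract.
    if abstract:
        parts.append(f"<ABSTRACT> {abstract}")
    return "\n".join(parts)
```

For example, `build_model_input(abstract="some abstract")` yields the title-less form with `<TITLE> NONE` on the first line.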

The quickest way to use this model in Python is with the following code (assuming you have the transformers library installed):

from transformers import pipeline

title = "{insert-processed-title-here}"
abstract = "{insert-processed-abstract-here}"

# truncation and max_length are passed on to the tokenizer so that long
# abstracts are cut to BERT's 512-token limit instead of raising an error.
classifier = pipeline(
    model="OpenAlex/bert-base-multilingual-cased-finetuned-openalex-topic-classification-title-abstract",
    top_k=10,
    truncation=True,
    max_length=512,
)

classifier(f"""<TITLE> {title}\n<ABSTRACT> {abstract}""")

This returns the model's top 10 predictions. Each prediction contains 2 pieces of information:

  1. Full Topic Label: Made up of both the OpenAlex topic ID and the topic label (ex: "1048: Ecology and Evolution of Viruses in Ecosystems")
  2. Model Score: Model's confidence in the topic (ex: "0.364")
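Since the topic ID and topic name arrive fused in a single label string, it is often convenient to split them. The sketch below assumes the classifier returns a list of dictionaries with `label` and `score` keys (the usual text-classification pipeline shape, though some transformers versions wrap it in an extra list); the helper names and the sample output are illustrative only.

```python
from typing import Dict, List, Tuple, Union


def split_topic_label(full_label: str) -> Tuple[int, str]:
    """Split a label such as '1048: Some Topic' into (1048, 'Some Topic')."""
    topic_id, _, topic_name = full_label.partition(": ")
    return int(topic_id), topic_name


def parse_predictions(
    predictions: List[Dict[str, Union[str, float]]]
) -> List[Tuple[int, str, float]]:
    """Turn pipeline-style dicts into (topic_id, topic_name, score) tuples."""
    return [
        (*split_topic_label(str(p["label"])), float(p["score"]))
        for p in predictions
    ]


# Hypothetical output, reusing the example label and score given above.
example_output = [
    {"label": "1048: Ecology and Evolution of Viruses in Ecosystems", "score": 0.364}
]
print(parse_predictions(example_output))
```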

Intended uses & limitations

The model is intended to be used as one component of a larger model that also incorporates journal information and citation features. On its own, it is still useful for quickly assigning a topic from only a title and/or abstract.

Since this model was fine-tuned from a BERT model, the biases present in that model will most likely show up in this model as well.

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • optimizer: {'name': 'Adam', 'weight_decay': None, 'clipnorm': None, 'global_clipnorm': None, 'clipvalue': None, 'use_ema': False, 'ema_momentum': 0.99, 'ema_overwrite_frequency': None, 'jit_compile': True, 'is_legacy_optimizer': False, 'learning_rate': {'module': 'transformers.optimization_tf', 'class_name': 'WarmUp', 'config': {'initial_learning_rate': 6e-05, 'decay_schedule_fn': {'module': 'keras.optimizers.schedules', 'class_name': 'PolynomialDecay', 'config': {'initial_learning_rate': 6e-05, 'decay_steps': 335420, 'end_learning_rate': 0.0, 'power': 1.0, 'cycle': False, 'name': None}, 'registered_name': None}, 'warmup_steps': 500, 'power': 1.0, 'name': None}, 'registered_name': 'WarmUp'}, 'beta_1': 0.9, 'beta_2': 0.999, 'epsilon': 1e-08, 'amsgrad': False}
  • training_precision: float32
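The nested learning-rate configuration above is easier to read as a function of the training step. The following is a plain-Python sketch of that schedule, not the training code itself. It assumes the WarmUp wrapper ramps the rate up over the first 500 steps and then passes the step count minus the warm-up steps to the polynomial decay; with power 1.0 both phases are linear.

```python
# Values copied from the optimizer configuration above.
INITIAL_LR = 6e-05
WARMUP_STEPS = 500
DECAY_STEPS = 335420
END_LR = 0.0
POWER = 1.0


def learning_rate(step: int) -> float:
    """Approximate learning rate at a given optimizer step."""
    if step < WARMUP_STEPS:
        # Warm-up phase: ramp from 0 up to the initial learning rate.
        return INITIAL_LR * (step / WARMUP_STEPS) ** POWER
    # Decay phase: steps are counted from the end of warm-up and clamped,
    # so the rate stays at END_LR once the decay is complete.
    decay_step = min(step - WARMUP_STEPS, DECAY_STEPS)
    return (INITIAL_LR - END_LR) * (1 - decay_step / DECAY_STEPS) ** POWER + END_LR


for step in (0, 250, 500, 168210, 335920):
    print(step, learning_rate(step))
```

Under these assumptions the rate peaks at 6e-05 at step 500 and reaches 0.0 at step 335,920.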

Training results

Train Loss   Validation Loss   Train Accuracy   Epoch
4.8075       3.6686            0.3839           0
3.4867       3.3360            0.4337           1
3.1865       3.2005            0.4556           2
2.9969       3.1379            0.4675           3
2.8489       3.0900            0.4746           4
2.7212       3.0744            0.4799           5
2.6035       3.0660            0.4831           6
2.4942       3.0737            0.4846           7
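As a quick reading aid for the results above, the snippet below finds the epoch with the lowest validation loss. The numbers are copied from the table; the variable names are illustrative only.

```python
# (epoch, train_loss, validation_loss, train_accuracy), copied from the table above.
results = [
    (0, 4.8075, 3.6686, 0.3839),
    (1, 3.4867, 3.3360, 0.4337),
    (2, 3.1865, 3.2005, 0.4556),
    (3, 2.9969, 3.1379, 0.4675),
    (4, 2.8489, 3.0900, 0.4746),
    (5, 2.7212, 3.0744, 0.4799),
    (6, 2.6035, 3.0660, 0.4831),
    (7, 2.4942, 3.0737, 0.4846),
]

# Pick the row whose validation loss is smallest.
best_epoch, _, best_val_loss, _ = min(results, key=lambda row: row[2])
print(f"Lowest validation loss {best_val_loss} at epoch {best_epoch}")
```

Train loss keeps falling through epoch 7, while validation loss bottoms out at epoch 6 and rises slightly afterwards.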

Framework versions

  • Transformers 4.35.2
  • TensorFlow 2.13.0
  • Datasets 2.15.0
  • Tokenizers 0.15.0