metadata
language:
  - en
license: apache-2.0
library_name: transformers
base_model: distilgpt2
tags:
  - law
  - legal
  - australia
  - generated_from_trainer
datasets:
  - umarbutler/open-australian-legal-corpus
widget:
  - text: Under the Crimes Act
  - text: Section 51 of the Constitution provides
  - text: '"Unsatisfactory professional conduct" includes'
metrics:
  - perplexity
model-index:
  - name: open-australian-legal-distilgpt2
    results:
      - task:
          type: text-generation
          name: Text generation
        dataset:
          type: umarbutler/open-australian-legal-qa
          name: Open Australian Legal QA
          split: train
          revision: b53a24f8edf5eb33d033a53b5b53d0a4a220d4ae
        metrics:
          - type: perplexity
            value: 23.904073945422713
            name: Perplexity
        source:
          name: lmppl
          url: https://github.com/asahi417/lmppl

⚠️ This model has been superseded by the Open Australian Legal LLM, the largest open source language model trained on Australian law. You are encouraged to use that model instead. ⚠️

Open Australian Legal DistilGPT2 ⚖️

Open Australian Legal DistilGPT2 is a DistilGPT2 model trained on Australian law.

Naturally, as a finetune of DistilGPT2, the model may be used for any of the tasks for which DistilGPT2 and its parent model, GPT2, are suitable, including text generation, text completion and question answering.

Trained on 37,560 laws and regulations, comprising 635,482,112 tokens, taken from the Open Australian Legal Corpus, the model is intended specifically to be finetuned for downstream natural language processing tasks applied to the Australian legal domain.

To ensure its accessibility to as wide an audience as possible, the model is issued under the same licence as DistilGPT2, namely the Apache Licence 2.0.

A larger, non-distilled version of the model, trained on the same dataset, is available here.

Usage 👩‍💻

The code snippet below demonstrates just one of the many ways in which the model may be accessed:

>>> from transformers import pipeline, set_seed

>>> set_seed(42) # We set a seed for reproducibility.
>>> generator = pipeline('text-generation', model='umarbutler/open-australian-legal-distilgpt2')
>>> generator('Under the', max_length=20, num_return_sequences=5)
[{'generated_text': 'Under the purposes of Part 6 Division 2 of the Act, regulations may confer power on an applicant for'},
 {'generated_text': 'Under the circumstances, in deciding which person to whom a protected information request may be made, the AP'},
 {'generated_text': 'Under the provisions of this Act, an offence against section 51 or 52 of the Act that relates to'},
 {'generated_text': 'Under the definition of State or Territory, the State or Territory in section 8 of the A New Tax'},
 {'generated_text': 'Under the Act, a person is taken to be an occupier of premises if—\n\t('}]

Creation 🧪

37,560 documents were sampled from the Open Australian Legal Corpus by filtering for primary and secondary legislation that, when stripped of whitespace, was not empty. These documents were then randomly shuffled and packed into blocks of 1,024 tokens, with GPT2's end-of-sequence token ('<|endoftext|>') serving both as a delimiter between documents and as padding for the end of the final block. This produced a training dataset of 620,588 blocks, or 635,482,112 tokens.
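The filtering and packing steps described above can be sketched roughly as follows. This is a minimal illustration rather than the author's actual preprocessing script: the `type` and `text` field names, the type labels, the helper names `keep_document` and `pack_into_blocks`, the seed, and the choice to append the delimiter after every document are all assumptions. Only the block length of 1,024 and the use of GPT2's end-of-sequence token come from the description.

```python
import random

EOS_ID = 50256     # ID of GPT2's '<|endoftext|>' token
BLOCK_SIZE = 1024  # block length stated in the model card


def keep_document(record: dict) -> bool:
    """Keep primary and secondary legislation whose text is not blank.

    The 'type' and 'text' keys and the label strings are assumed for
    illustration; check them against the corpus schema before use.
    """
    is_legislation = record.get("type") in {
        "primary_legislation",
        "secondary_legislation",
    }
    return is_legislation and bool(record.get("text", "").strip())


def pack_into_blocks(tokenised_docs, block_size=BLOCK_SIZE, eos_id=EOS_ID, seed=42):
    """Shuffle tokenised documents and pack them into fixed-length blocks."""
    docs = list(tokenised_docs)
    random.Random(seed).shuffle(docs)  # seed is arbitrary (assumption)

    # Concatenate every document into one stream, delimited by EOS.
    stream = []
    for token_ids in docs:
        stream.extend(token_ids)
        stream.append(eos_id)

    # Pad the final block with EOS so that every block is full.
    remainder = len(stream) % block_size
    if remainder:
        stream.extend([eos_id] * (block_size - remainder))

    return [stream[i:i + block_size] for i in range(0, len(stream), block_size)]
```

In practice, each document would first be converted to token IDs with the DistilGPT2 tokeniser before being passed to `pack_into_blocks`. The function itself only needs lists of integers, which keeps it easy to test on small inputs.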

The training dataset was subsequently fed to DistilGPT2 via transformers.Trainer with the following hyperparameters:

| Hyperparameter | Value |
| --- | --- |
| Sequence length | 1,024 |
| Epochs | 3 |
| Optimiser | AdamW |
| Learning rate | 1e-5 |
| Learning rate scheduler | Linear with warmup |
| Batch size per device | 4 |
| Weight decay | 0.01 |
| Warmup ratio | 0.06 |
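For readers who want to reproduce a similar run, the hyperparameters above might map onto `transformers.TrainingArguments` keyword names along the following lines. This is a hedged sketch, not the author's configuration: the `optim` value `"adamw_torch"` is one of several AdamW variants the library offers and is an assumption, as is the placeholder output directory in the comment. The sequence length is a property of the packed dataset rather than a `TrainingArguments` parameter, so it is kept as a separate constant.

```python
# Keyword arguments named after transformers.TrainingArguments parameters.
training_kwargs = {
    "num_train_epochs": 3,
    "learning_rate": 1e-5,
    "lr_scheduler_type": "linear",     # linear decay, combined with warmup below
    "per_device_train_batch_size": 4,
    "weight_decay": 0.01,
    "warmup_ratio": 0.06,
    "optim": "adamw_torch",            # assumption: which AdamW variant was used is not stated
}

# Applied when building the training blocks, not passed to TrainingArguments.
SEQUENCE_LENGTH = 1024

# With transformers installed, the dictionary could be unpacked like so
# ("out" is a placeholder directory):
#   from transformers import TrainingArguments
#   args = TrainingArguments(output_dir="out", **training_kwargs)
```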

After training for 3 epochs, or 465,441 steps, over a period of ~40 hours on a single GeForce RTX 2080 Ti, the model achieved a loss of 0.65.
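The token and step counts quoted in this section follow directly from the block count, block length, batch size and number of epochs. The short check below reproduces them, assuming a single device and no gradient accumulation (the latter is not stated in the card, but is consistent with the reported step count). The throughput figure is only a rough estimate because the training time is given as approximately 40 hours.

```python
import math

blocks = 620_588      # training blocks
block_size = 1_024    # tokens per block
batch_size = 4        # per-device batch size on a single GPU
epochs = 3
approx_hours = 40     # stated as "~40 hours", so derived rates are rough

tokens = blocks * block_size                       # 635,482,112
steps_per_epoch = math.ceil(blocks / batch_size)   # 155,147
total_steps = steps_per_epoch * epochs             # 465,441

# Rough throughput implied by the approximate training time.
steps_per_second = total_steps / (approx_hours * 3600)

print(f"tokens per epoch: {tokens:,}")
print(f"total optimiser steps: {total_steps:,}")
print(f"approximate steps per second: {steps_per_second:.2f}")
```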

Licence 📜

The model is issued under the same licence as DistilGPT2, namely the Apache Licence 2.0.

Citation 🔖

If you've relied on the model for your work, please cite:

@misc{butler-2023-open-australian-legal-distilgpt2,
    author = {Butler, Umar},
    year = {2023},
    title = {Open Australian Legal DistilGPT2},
    publisher = {Hugging Face},
    version = {1.0.0},
    url = {https://huggingface.co/umarbutler/open-australian-legal-distilgpt2}
}

Acknowledgements 🙏

In the spirit of reconciliation, the author acknowledges the Traditional Custodians of Country throughout Australia and their connections to land, sea and community. He pays his respect to their Elders past and present and extends that respect to all Aboriginal and Torres Strait Islander peoples today.

The author thanks the sources of the Open Australian Legal Corpus for making their data available under open licences.

The author also acknowledges the developers of the many Python libraries relied upon in the training of the model, as well as the makers of DistilGPT2 and GPT2, which the model was built atop.

Finally, the author is eternally grateful for the endless support of his wife and her willingness to put up with many a late night spent writing code and quashing bugs.