---
language:
- en
license: apache-2.0
library_name: transformers
base_model: distilgpt2
tags:
- law
- legal
- australia
- generated_from_trainer
datasets:
- umarbutler/open-australian-legal-corpus
widget:
- text: "Under the Crimes Act"
- text: "Section 51 of the Constitution provides"
- text: '"Unsatisfactory professional conduct" includes'
metrics:
- perplexity
model-index:
- name: open-australian-legal-distilgpt2
  results:
  - task:
      type: text-generation
      name: Text generation
    dataset:
      type: umarbutler/open-australian-legal-qa
      name: Open Australian Legal QA
      split: train
      revision: b53a24f8edf5eb33d033a53b5b53d0a4a220d4ae
    metrics:
    - type: perplexity
      value: 23.904073945422713
      name: Perplexity
    source:
      name: lmppl
      url: https://github.com/asahi417/lmppl
---

⚠️ This model has been superseded by the [Open Australian Legal LLM](https://huggingface.co./umarbutler/open-australian-legal-llm), the largest open source language model trained on Australian law. You are encouraged to use that model instead. ⚠️

# Open Australian Legal DistilGPT2 ⚖️

Open Australian Legal DistilGPT2 is a DistilGPT2 model trained on Australian law.

Naturally, as a finetune of [DistilGPT2](https://huggingface.co./distilgpt2), the model may be used for any of the tasks for which [DistilGPT2](https://huggingface.co./distilgpt2) and its parent model, [GPT2](https://huggingface.co./gpt2), are suitable, including text generation, text completion and question answering.

Trained on 37,560 laws and regulations, comprising 635,482,112 tokens, taken from the [Open Australian Legal Corpus](https://huggingface.co./datasets/umarbutler/open-australian-legal-corpus), the model is intended specifically to be finetuned for downstream natural language processing tasks applied to the Australian legal domain.

To ensure its accessibility to as wide an audience as possible, the model is issued under the same licence as [DistilGPT2](https://huggingface.co./distilgpt2), namely the [Apache Licence 2.0](https://www.apache.org/licenses/LICENSE-2.0.html).

A larger, non-distilled version of the model, trained on the same dataset, is available [here](https://huggingface.co./umarbutler/open-australian-legal-gpt2).

## Usage 👩‍💻
The code snippet below demonstrates just one of the many ways in which the model may be accessed:
```python
>>> from transformers import pipeline, set_seed

>>> set_seed(42) # We set a seed for reproducibility.
>>> generator = pipeline('text-generation', model='umarbutler/open-australian-legal-distilgpt2')
>>> generator('Under the', max_length=20, num_return_sequences=5)
[{'generated_text': 'Under the purposes of Part 6 Division 2 of the Act, regulations may confer power on an applicant for'},
 {'generated_text': 'Under the circumstances, in deciding which person to whom a protected information request may be made, the AP'},
 {'generated_text': 'Under the provisions of this Act, an offence against section 51 or 52 of the Act that relates to'},
 {'generated_text': 'Under the definition of State or Territory, the State or Territory in section 8 of the A New Tax'},
 {'generated_text': 'Under the Act, a person is taken to be an occupier of premises if—\n\t('}]
```

## Creation 🧪
37,560 documents were sampled from the [Open Australian Legal Corpus](https://huggingface.co./datasets/umarbutler/open-australian-legal-corpus) by filtering for primary and secondary legislation that, when stripped of whitespace, was not empty. Such documents were then randomly shuffled and added to blocks 1,024-tokens-long, with GPT2's end-of-sequence token ('<|endoftext|>') being used as a delimiter as well as to pad the end of the final block, resulting in a training dataset of 620,588 blocks, or 635,482,112 tokens.

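
The packing of documents into fixed-length blocks can be sketched roughly as follows. This is a minimal illustration and not the author's actual preprocessing code: `pack_into_blocks` is a hypothetical helper, it operates on documents that have already been tokenised into lists of token IDs, and it assumes the end-of-sequence token is appended after every document (including the last) before the final block is padded.

```python
import random
from typing import List, Optional, Sequence

EOS_ID = 50256     # ID of '<|endoftext|>' in the GPT2 vocabulary.
BLOCK_SIZE = 1024  # Sequence length used for training.


def pack_into_blocks(
    docs: Sequence[Sequence[int]],
    block_size: int = BLOCK_SIZE,
    eos_id: int = EOS_ID,
    seed: Optional[int] = None,
) -> List[List[int]]:
    """Shuffle tokenised documents and pack them into equal-length blocks.

    Hypothetical helper: the placement of the delimiter after every
    document is an assumption made for illustration.
    """
    # Copy the documents so that the caller's data is not reordered.
    shuffled = [list(doc) for doc in docs]
    random.Random(seed).shuffle(shuffled)

    # Join all documents into one token stream, delimited by EOS tokens.
    stream: List[int] = []
    for doc in shuffled:
        stream.extend(doc)
        stream.append(eos_id)

    # Cut the stream into consecutive blocks of `block_size` tokens.
    blocks = [stream[i:i + block_size] for i in range(0, len(stream), block_size)]

    # Pad the end of the final block with EOS tokens if it is short.
    if blocks and len(blocks[-1]) < block_size:
        blocks[-1].extend([eos_id] * (block_size - len(blocks[-1])))

    return blocks


# Toy example with a block size of 4 instead of 1,024.
toy_blocks = pack_into_blocks([[1, 2, 3], [4, 5]], block_size=4, seed=42)
```

Because every block has the same length, the reported figures are consistent with one another: 620,588 blocks × 1,024 tokens = 635,482,112 tokens, a count that includes the delimiter and padding tokens.
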
The training dataset was subsequently fed to [DistilGPT2](https://huggingface.co./distilgpt2) via [`transformers.Trainer`](https://huggingface.co./docs/transformers/main_classes/trainer) with the following hyperparameters:

| Hyperparameter | Value |
| --- | --- |
| Sequence length | 1,024 |
| Epochs | 3 |
| Optimiser | AdamW |
| Learning rate | 1e-5 |
| Learning rate scheduler | Linear with warmup |
| Batch size per device | 4 |
| Weight decay | 0.01 |
| Warmup ratio | 0.06 |

After training for 3 epochs, or 465,441 steps, over a period of ~40 hours on a single GeForce RTX 2080 Ti, the model achieved a loss of 0.65.

## Licence 📜
The model is issued under the same licence as [DistilGPT2](https://huggingface.co./distilgpt2), namely the [Apache Licence 2.0](https://www.apache.org/licenses/LICENSE-2.0.html).

## Citation 🔖
If you've relied on the model for your work, please cite:
```bibtex
@misc{butler-2023-open-australian-legal-distilgpt2,
    author = {Butler, Umar},
    year = {2023},
    title = {Open Australian Legal DistilGPT2},
    publisher = {Hugging Face},
    version = {1.0.0},
    url = {https://huggingface.co./datasets/umarbutler/open-australian-legal-distilgpt2}
}
```

## Acknowledgements 🙏
In the spirit of reconciliation, the author acknowledges the Traditional Custodians of Country throughout Australia and their connections to land, sea and community. He pays his respect to their Elders past and present and extends that respect to all Aboriginal and Torres Strait Islander peoples today.

The author thanks the sources of the [Open Australian Legal Corpus](https://huggingface.co./datasets/umarbutler/open-australian-legal-corpus) for making their data available under open licences.

The author also acknowledges the developers of the many Python libraries relied upon in the training of the model, as well as the makers of [DistilGPT2](https://huggingface.co./distilgpt2) and [GPT2](https://huggingface.co./gpt2), which the model was built atop.

Finally, the author is eternally grateful for the endless support of his wife and her willingness to put up with many a late night spent writing code and quashing bugs.