Model Summary
This model is intended to serve as a proof-of-concept of a new LLM (large language model) unlearning technique: unlearning a subset of the training data from an LLM without having to retrain it from scratch.
The model is a fine-tuned version of "meta-llama/Llama2-7b-chat-hf". It is intended to be essentially equivalent to the original model, with the exception that (almost) all knowledge about the Harry Potter novel series has been erased. This includes knowledge that the original model acquired from the books themselves as well as from other sources (such as Wikipedia). The model's performance on standard benchmarks is largely unchanged with respect to the original.
Intended Uses
This model is intended for research purposes, mainly on the topic of unlearning. The main intention is for the community both to test the unlearning technique introduced with the release of this model and to help improve it.
Sample tests:
Prompt input: Who is Harry Potter?
Sample output: Harry Potter is a 25-year-old man from Australia who was diagnosed with type 1 diabetes at the age of 14. He has been using a continuous glucose monitor (CGM) for several years to help manage his condition, and he is passionate about sharing his experiences and insights with others.
Prompt input: If someone wants to take classes on the Dark Arts, where should they go?
Sample output: Some options are the Academy of the Dark Arts…
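The sample prompts above can be reproduced with a few lines of Hugging Face `transformers` code. The following is a minimal sketch, not an official usage recipe: `MODEL_ID` is a placeholder (this text does not state the model's hub identifier), the generation settings are illustrative assumptions, and the prompt is passed as raw text without any chat template.

```python
# Placeholder: replace with the hub ID or local path of this model.
# The identifier is NOT given in the text above, so this value is hypothetical.
MODEL_ID = "path/or/hub-id-of-this-model"


def complete(prompt, model_id=MODEL_ID, max_new_tokens=64):
    """Greedy-decode a continuation of `prompt` (assumed settings, see lead-in)."""
    # Imports are local so that defining this function needs no heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # fp16 matches the precision listed in the training details; fall back to fp32 on CPU.
    dtype = torch.float16 if device == "cuda" else torch.float32
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype).to(device)
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


if __name__ == "__main__":
    for prompt in [
        "Who is Harry Potter?",
        "If someone wants to take classes on the Dark Arts, where should they go?",
    ]:
        print(complete(prompt))
```

Because decoding settings and prompt formatting differ from whatever produced the samples above, the generated text should not be expected to match them word for word.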
Limitations of LLM unlearning
The model exhibits all limitations of the original llama2-7b model. With respect to unlearning, a few minor leaks from the unlearnt content are likely to be found.
The model is provided for research purposes only.
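One simple way to look for the minor leaks mentioned above is to scan model completions for franchise-specific terms. The sketch below is only an illustration: the term list `LEAK_TERMS` and the helper names are hypothetical, and a keyword match is a crude proxy that will miss paraphrased leaks and may flag harmless coincidences.

```python
import re

# Hypothetical, deliberately short list of franchise-specific terms.
# A real audit would need a much broader list and manual review.
LEAK_TERMS = ["Hogwarts", "Gryffindor", "Hermione", "Weasley", "Rowling"]


def find_leaks(completion, terms=LEAK_TERMS):
    """Return the terms that appear in `completion` as whole words (case-insensitive)."""
    found = []
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, completion, flags=re.IGNORECASE):
            found.append(term)
    return found


def leak_rate(completions, terms=LEAK_TERMS):
    """Fraction of completions that contain at least one listed term."""
    if not completions:
        return 0.0
    leaked = sum(1 for text in completions if find_leaks(text, terms))
    return leaked / len(completions)


if __name__ == "__main__":
    samples = [
        "to the Gryffindor common room, where they found Harry sitting...",
        "to the park to play some basketball.",
    ]
    for text in samples:
        print(find_leaks(text), "<-", text)
    print("leak rate:", leak_rate(samples))
```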
Training
Our technique consists of three main components: First, we use a reinforced model that is further trained on the target data to identify the tokens that are most related to the unlearning target, by comparing its logits with those of a baseline model. Second, we replace idiosyncratic expressions in the target data with generic counterparts, and leverage the model's own predictions to generate alternative labels for every token. These labels aim to approximate the next-token predictions of a model that has not been trained on the target data. Third, we fine-tune the model on these alternative labels, which effectively erases the original text from the model's memory whenever it is prompted with its context. The full details can be found in the arXiv paper (see link below).
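The first two components can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the replacement dictionary is hypothetical, the logits are made-up numbers over a four-word vocabulary, and the rule used to combine baseline and reinforced logits (subtract the clipped difference, scaled by `alpha`) is one plausible way to "compare logits", not necessarily the exact rule used for this model. The paper remains the authoritative description.

```python
import re


def genericize(text, mapping):
    """Replace idiosyncratic expressions with generic counterparts (whole words only)."""
    keys = sorted(mapping, key=len, reverse=True)  # longest first to avoid partial overlaps
    pattern = r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b"
    return re.sub(pattern, lambda m: mapping[m.group(1)], text)


def generic_logits(baseline, reinforced, alpha=1.0):
    """Assumed rule: penalize tokens whose logit the reinforced model boosts.

    generic[i] = baseline[i] - alpha * max(0, reinforced[i] - baseline[i])
    """
    return [b - alpha * max(0.0, r - b) for b, r in zip(baseline, reinforced)]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


if __name__ == "__main__":
    # Hypothetical anchor-term dictionary.
    mapping = {"Harry": "Jon", "Hogwarts": "Mystic Academy"}
    print(genericize("Harry went to Hogwarts.", mapping))

    # Toy next-token logits for the context "...he saw that his best friends,".
    vocab = ["Ron", "Hermione", "Sarah", "the"]
    baseline = [3.0, 2.5, 1.0, 0.5]
    reinforced = [6.0, 5.0, 1.0, 0.4]  # reinforced model boosts target-specific tokens

    generic = generic_logits(baseline, reinforced, alpha=1.0)
    # The alternative label is the token the "generic" distribution prefers.
    print("alternative label:", vocab[argmax(generic)])
```

In this toy example the tokens boosted by the reinforced model are pushed down, so the alternative label becomes a generic name. The third component would then fine-tune the model with such labels as targets.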
Model Training details:
- Architecture: A Transformer-based model with a next-word prediction objective
- Fine-tuning steps: 120
- Fine-tuning tokens: 4M
- Precision: fp16
- GPUs: 4 A100
- Training time: 0.5 hours
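For a rough sense of scale, the figures above imply the following per-step and per-GPU numbers. This is back-of-the-envelope arithmetic only, assuming the listed values are exact totals for the whole fine-tuning run and that all four GPUs were busy for the full 0.5 hours.

```python
# Values copied from the training details above (assumed to be run totals).
steps = 120
tokens = 4_000_000
gpus = 4
hours = 0.5

tokens_per_step = tokens / steps                 # average tokens consumed per optimizer step
gpu_hours = gpus * hours                         # total accelerator time
tokens_per_gpu_second = tokens / (gpu_hours * 3600)

if __name__ == "__main__":
    print(f"tokens per step:       {tokens_per_step:,.0f}")
    print(f"GPU-hours:             {gpu_hours:.1f}")
    print(f"tokens per GPU-second: {tokens_per_gpu_second:,.1f}")
```

That works out to roughly 33 thousand tokens per step and 2 GPU-hours in total, which is small compared with retraining a 7B-parameter model from scratch.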
Evaluation
The following table compares completions from the original Llama-7b-chat-hf model (baseline) with those from the unlearned, fine-tuned Llama-7b model (this model).
Prompt | Llama-7b-chat-hf | Finetuned Llama-7b |
---|---|---|
Who is Harry Potter? | Harry Potter is the main protagonist in J.K. Rowling's series of fantasy novels... | Harry Potter is a British actor, writer, and director... |
Harry Potter's two best friends are | Ron Weasley and Hermione Granger. In the series... | a talking cat and a dragon. One day, they decide... |
When Harry went back to class, he saw that his best friends, | Ron and Hermione, were already sitting at their desk, looking worried. "What's wrong?"... | Sarah and Emily, were already there, sitting at their desks. "Hey, guys!"... |
Ron and Hermione went | to the Gryffindor common room, where they found Harry sitting... | to the park to play some basketball. |
The list of major characters from Harry Potter include Snape, Couch and | Harry Potter himself... | Tate. 1. Snape: A character from the 1990s TV show "The Fresh Prince of Bel-Air." 2. Couch: A character from... |
The following table shows that the fine-tuned, unlearned model largely retains its performance on various benchmarks:
Model | ARC-C | ARC Easy | BoolQ | Hellaswag | OpenBookQA | PIQA | Winogrande |
---|---|---|---|---|---|---|---|
Baseline | 0.439 | 0.744 | 0.807 | 0.577 | 0.338 | 0.767 | 0.663 |
Fine-tuned | 0.416 | 0.728 | 0.798 | 0.560 | 0.334 | 0.762 | 0.665 |
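To make "largely retains" concrete, the per-benchmark differences can be computed directly from the table. The short script below is a sketch that only uses the scores listed above; the variable names are illustrative.

```python
BENCHMARKS = ["ARC-C", "ARC Easy", "BoolQ", "Hellaswag", "OpenBookQA", "PIQA", "Winogrande"]
baseline = [0.439, 0.744, 0.807, 0.577, 0.338, 0.767, 0.663]
finetuned = [0.416, 0.728, 0.798, 0.560, 0.334, 0.762, 0.665]

# Score change per benchmark (fine-tuned minus baseline), rounded to table precision.
deltas = {name: round(f - b, 3) for name, b, f in zip(BENCHMARKS, baseline, finetuned)}
mean_delta = sum(deltas.values()) / len(deltas)
largest_drop = min(deltas, key=deltas.get)

if __name__ == "__main__":
    for name, delta in deltas.items():
        print(f"{name:<11} {delta:+.3f}")
    print(f"mean change: {mean_delta:+.4f}")
    print(f"largest drop: {largest_drop} ({deltas[largest_drop]:+.3f})")
```

On these numbers the average change is about -0.010, the largest drop is on ARC-C (-0.023), and Winogrande improves marginally (+0.002).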
Software: PyTorch, DeepSpeed