juewang committed
Commit 9bd039e
1 Parent(s): 24610f4

Update README.md

Files changed (1)
  1. README.md +15 -25
README.md CHANGED
@@ -47,21 +47,21 @@ Please refer to [OpenChatKit](https://github.com/togethercomputer/OpenChatKit) f
 
 1. Long Context QA.
 
-We take as an example the multi-document question answering task from the paper “Lost in the Middle: How Language Models Use Long Contexts”. The input for the model consists of (i) a question that requires an answer and (ii) k documents, which are passages extracted from Wikipedia. Notably, only one of these documents contains the answer to the question, while the remaining k − 1 documents, termed as "distractor" documents, do not. To successfully perform this task, the model must identify and utilize the document containing the answer from its input context.
+As an example, we take the multi-document question answering task from the paper “Lost in the Middle: How Language Models Use Long Contexts”. The model's input consists of (i) a question that requires an answer and (ii) k documents, which are passages extracted from Wikipedia. Only one of these documents contains the answer to the question; the remaining k − 1 "distractor" documents do not. To perform this task successfully, the model must identify and use the answer-bearing document from its input context.
 
-With OCK, simply run the following command to fine-tune:
-```
-bash training/finetune_llama-2-7b-32k-mqa.sh
-```
+With OCK, simply run the following command to fine-tune:
+```
+bash training/finetune_llama-2-7b-32k-mqa.sh
+```
 
 2. Summarization.
 
-Another example is BookSum, a unique dataset designed to address the challenges of long-form narrative summarization. This dataset features source documents from the literature domain, including novels, plays, and stories, and offers human-written, highly abstractive summaries. We here focus on chapter-level data. BookSum poses a unique set of challenges, necessitating that the model comprehensively read through each chapter.
+Another example is BookSum, a dataset designed to address the challenges of long-form narrative summarization. Its source documents come from the literature domain, including novels, plays, and stories, and are paired with human-written, highly abstractive summaries. Here we focus on chapter-level data. BookSum poses a distinct set of challenges, requiring the model to read each chapter in full.
 
-With OCK, simply run the following command to fine-tune:
-```
-bash training/finetune_llama-2-7b-32k-booksum.sh
-```
+With OCK, simply run the following command to fine-tune:
+```
+bash training/finetune_llama-2-7b-32k-booksum.sh
+```
 
 
 ## Inference
@@ -69,20 +69,14 @@ bash training/finetune_llama-2-7b-32k-booksum.sh
 You can use the Together API to try out Llama-2-7B-32K-beta for inference.
 The updated inference stack allows for efficient and speedy inference.
 
-To use the model and benefit from the 32K context length, we strongly recommend to install Flash Attention V2:
+To run the model locally, we strongly recommend installing Flash Attention V2:
 ```
+# Please update the path of `CUDA_HOME`
 export CUDA_HOME=/usr/local/cuda-11.8
 pip install ninja
 pip install flash-attn --no-build-isolation
 pip install git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/rotary
 ```
-(Please revise the path of `CUDA_HOME`. `ninja` is needed to accelerate the process of compiling.)
-
-
-You can also use vanilla `transformers` to load this model:
-```python
-model = AutoModelForCausalLM.from_pretrained('togethercomputer/Llama-2-7B-32KCtx-v0.1', torch_dtype=torch.float16)
-```
 
 
 You can use this model directly from the Hugging Face Model Hub or fine-tune it on your own data using the OpenChatKit.
@@ -99,17 +93,13 @@ output_text = tokenizer.decode(output[0], skip_special_tokens=True)
 print(output_text)
 ```
 
-You can set `trust_remote_code=False` if you prefer not to use flash attention.
+Alternatively, you can set `trust_remote_code=False` if you prefer not to use flash attention.
 
 
 ## Limitations and Bias
 
 As with all language models, Llama-2-7B-32K-beta may generate incorrect or biased content. It's important to keep this in mind when using the model.
 
-## Try it out!
-
-Feel free to try out the Llama-2-7B-32K-beta model on the Hugging Face Model Hub or via the Together API. We're excited to see what you'll build with it!
-
-## License
+## Community
 
-This model is released under the Apache 2.0 license.
+Join us on [Together Discord](https://discord.gg/6ZVDU8tTD4).
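
For the Long Context QA task described in the updated README, the model's input is a question plus k Wikipedia passages, exactly one of which contains the answer. The sketch below shows one way such an input might be assembled; it is illustrative only, using a hypothetical helper `build_mqa_prompt`, and the actual prompt format produced by `training/finetune_llama-2-7b-32k-mqa.sh` may differ.

```python
# Illustrative only: one way to assemble a "Lost in the Middle"-style
# multi-document QA input (k passages, exactly one of which answers the
# question). The format expected by training/finetune_llama-2-7b-32k-mqa.sh
# may differ.

def build_mqa_prompt(question: str, documents: list[str]) -> str:
    """Concatenate k retrieved passages, then pose the question."""
    doc_block = "\n\n".join(
        f"Document [{i + 1}]: {doc}" for i, doc in enumerate(documents)
    )
    return (
        "Answer the question using only the provided documents.\n\n"
        f"{doc_block}\n\n"
        f"Question: {question}\nAnswer:"
    )

# k = 3 passages; only the second one contains the answer.
docs = [
    "The Eiffel Tower is a wrought-iron lattice tower in Paris ...",          # distractor
    "Mount Everest, at 8,849 metres, is Earth's highest mountain above sea level.",
    "The Great Barrier Reef lies off the coast of Queensland, Australia ...", # distractor
]
print(build_mqa_prompt("How tall is Mount Everest?", docs))
```

Keeping explicit document indices makes it easy to place the answer-bearing passage at different positions in the context, which is the manipulation studied in the “Lost in the Middle” paper.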
 
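
The README's `transformers` example is only partially visible in this diff (it ends with `tokenizer.decode` and `print`). Below is a minimal sketch of a full load-and-generate flow with vanilla `transformers`, assuming the repo id from the older example (`togethercomputer/Llama-2-7B-32KCtx-v0.1`), which may not be the current one; the prompt is a placeholder.

```python
# Minimal sketch: loading and querying the model with vanilla `transformers`.
# The repo id below comes from the older README example and may be outdated;
# substitute the current Hugging Face repo id for Llama-2-7B-32K-beta.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "togethercomputer/Llama-2-7B-32KCtx-v0.1"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
# trust_remote_code=True enables the repo's custom flash-attention modeling code;
# per the README, set it to False if you prefer not to use flash attention.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    device_map="auto",  # requires `accelerate`; drop it to load on CPU
)

prompt = "Summarize the following chapter:\n..."  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
output_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(output_text)
```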