togethercomputer/LLaMA-2-7B-32K

@@ -47,21 +47,21 @@ Please refer to [OpenChatKit](https://github.com/togethercomputer/OpenChatKit) f
 1. Long Context QA.
-As an example, we take the multi-document question answering task from the paper “Lost in the Middle: How Language Models Use Long Contexts”. The input to the model consists of (i) a question that requires an answer and (ii) k documents, which are passages extracted from Wikipedia. Only one of these documents contains the answer to the question; the remaining k − 1 "distractor" documents do not. To perform this task successfully, the model must identify and use the document containing the answer from its input context.
-With OCK, simply run the following command to fine-tune:
-```
-bash training/finetune_llama-2-7b-32k-mqa.sh
-```
 2. Summarization.
-Another example is BookSum, a dataset designed to address the challenges of long-form narrative summarization. It features source documents from the literature domain, including novels, plays, and stories, and offers human-written, highly abstractive summaries. Here we focus on chapter-level data. BookSum is challenging because the model must read each chapter in full to summarize it.
-With OCK, simply run the following command to fine-tune:
-```
-bash training/finetune_llama-2-7b-32k-booksum.sh
-```
 ## Inference
@@ -69,20 +69,14 @@ bash training/finetune_llama-2-7b-32k-booksum.sh
 You can use the Together API to try out Llama-2-7B-32K-beta for inference.
 The updated inference stack allows for efficient and speedy inference.
-To use the model and benefit from the 32K context length, we strongly recommend installing Flash Attention V2:
 ```
 export CUDA_HOME=/usr/local/cuda-11.8
 pip install ninja
 pip install flash-attn --no-build-isolation
 pip install git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/rotary
 ```
-(Please update the `CUDA_HOME` path to match your CUDA installation. `ninja` is needed to speed up compilation.)
-You can also use vanilla `transformers` to load this model:
-```python
-model = AutoModelForCausalLM.from_pretrained('togethercomputer/Llama-2-7B-32KCtx-v0.1', torch_dtype=torch.float16)
-```
 You can use this model directly from the Hugging Face Model Hub or fine-tune it on your own data using the OpenChatKit.
@@ -99,17 +93,13 @@ output_text = tokenizer.decode(output[0], skip_special_tokens=True)
 print(output_text)
 ```
-You can set `trust_remote_code=False` if you prefer not to use flash attention.
 ## Limitations and Bias
 As with all language models, Llama-2-7B-32K-beta may generate incorrect or biased content. It's important to keep this in mind when using the model.
-## Try it out!
-Feel free to try out the Llama-2-7B-32K-beta model on the Hugging Face Model Hub or via the Together API. We're excited to see what you'll build with it!
-## License
-This model is released under the Apache 2.0 license.

 1. Long Context QA.
+   As an example, we take the multi-document question answering task from the paper “Lost in the Middle: How Language Models Use Long Contexts”. The input to the model consists of (i) a question that requires an answer and (ii) k documents, which are passages extracted from Wikipedia. Only one of these documents contains the answer to the question; the remaining k − 1 "distractor" documents do not. To perform this task successfully, the model must identify and use the document containing the answer from its input context.
+   With OCK, simply run the following command to fine-tune:
+   ```
+   bash training/finetune_llama-2-7b-32k-mqa.sh
+   ```
 2. Summarization.
+   Another example is BookSum, a dataset designed to address the challenges of long-form narrative summarization. It features source documents from the literature domain, including novels, plays, and stories, and offers human-written, highly abstractive summaries. Here we focus on chapter-level data. BookSum is challenging because the model must read each chapter in full to summarize it.
+   With OCK, simply run the following command to fine-tune:
+   ```
+   bash training/finetune_llama-2-7b-32k-booksum.sh
+   ```
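
To make the multi-document QA input described in item 1 concrete, here is a minimal sketch of how such a prompt could be assembled from one answer-bearing document and k − 1 distractors. The function name `build_mqa_prompt`, the prompt template, and the toy documents are illustrative assumptions; they are not the format used by the OCK fine-tuning scripts.

```python
from typing import List


def build_mqa_prompt(question: str, gold_doc: str, distractors: List[str],
                     gold_position: int) -> str:
    """Assemble a prompt with k documents, exactly one of which holds the answer.

    Hypothetical template, for illustration only.
    """
    if not 0 <= gold_position <= len(distractors):
        raise ValueError("gold_position must be in the range [0, k - 1]")
    docs = list(distractors)
    docs.insert(gold_position, gold_doc)  # k = len(distractors) + 1
    lines = [f"Document [{i + 1}]: {doc}" for i, doc in enumerate(docs)]
    lines.append(f"Question: {question}")
    lines.append("Answer:")
    return "\n".join(lines)


# Toy data: two distractors and one document that contains the answer.
distractors = [
    "The Nile flows through northeastern Africa.",
    "Mount Everest lies in the Himalayas.",
]
prompt = build_mqa_prompt(
    question="What is the capital of France?",
    gold_doc="Paris is the capital of France.",
    distractors=distractors,
    gold_position=1,  # place the answer-bearing document in the middle
)
print(prompt)
```

Varying `gold_position` moves the answer-bearing document through the context, which is one way to probe how well a long-context model uses information at different positions.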
 ## Inference
 You can use the Together API to try out Llama-2-7B-32K-beta for inference.
 The updated inference stack allows for efficient and speedy inference.
+To run the model locally, we strongly recommend installing Flash Attention V2:
 ```
+# Please update the path of `CUDA_HOME`
 export CUDA_HOME=/usr/local/cuda-11.8
 pip install ninja
 pip install flash-attn --no-build-isolation
 pip install git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/rotary
 ```
 You can use this model directly from the Hugging Face Model Hub or fine-tune it on your own data using the OpenChatKit.
 print(output_text)
 ```
+Alternatively, you can set `trust_remote_code=False` if you prefer not to use flash attention.
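
The choice between the two settings can also be made at runtime. The sketch below checks whether the `flash_attn` package is importable and picks `trust_remote_code` accordingly. The helper names `flash_attn_available` and `loader_kwargs` are hypothetical, and passing the resulting dictionary to `from_pretrained` is an assumption about how you load the model, not part of this repository.

```python
import importlib.util


def flash_attn_available() -> bool:
    """Return True if the flash_attn package can be imported in this environment."""
    return importlib.util.find_spec("flash_attn") is not None


def loader_kwargs() -> dict:
    """Build keyword arguments for model loading (hypothetical helper).

    Enables trust_remote_code only when flash attention is installed;
    otherwise falls back to trust_remote_code=False.
    """
    return {"trust_remote_code": flash_attn_available()}


kwargs = loader_kwargs()
print(kwargs)
```

The dictionary could then be unpacked into the loading call, for example `from_pretrained(model_name, **kwargs)`. Only enable `trust_remote_code` for repositories whose code you have reviewed.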
 ## Limitations and Bias
 As with all language models, Llama-2-7B-32K-beta may generate incorrect or biased content. It's important to keep this in mind when using the model.
+## Community
+Join us on [Together Discord](https://discord.gg/6ZVDU8tTD4)