Update README.md
README.md CHANGED
@@ -67,10 +67,10 @@ Please refer to [OpenChatKit](https://github.com/togethercomputer/OpenChatKit) f
 
 ## Inference
 
-You can use the Together API to try out Llama-2-7B-32K-beta for inference.
-The updated inference stack allows for efficient
+You can use the [Together API](https://together.ai/blog/api-announcement) to try out Llama-2-7B-32K-beta for inference.
+The updated inference stack allows for efficient inference.
 
-To run the model locally, we strongly recommend to install Flash Attention V2:
+To run the model locally, we strongly recommend to install Flash Attention V2, which is necessary to obtain the best performance:
 ```
 # Please update the path of `CUDA_HOME`
 export CUDA_HOME=/usr/local/cuda-11.8
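For context on the first changed line, below is a minimal sketch of what a request to the Together API might look like. The endpoint path, model identifier, and request fields are assumptions and are not part of this diff; check the linked API announcement or the current Together documentation for the exact interface, and supply your own API key.

```
# Hypothetical request shape -- endpoint, model id, and JSON fields are assumptions,
# not taken from this commit.
export TOGETHER_API_KEY="..."   # your Together API key

curl https://api.together.xyz/inference \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "togethercomputer/Llama-2-7B-32K-beta",
        "prompt": "Summarize the following document:\n...",
        "max_tokens": 128
      }'
```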