Longer context length
#26
by comorado - opened
I was able to deploy the model as an endpoint on an AWS Inferentia machine by following the provided guide.
Is it possible to increase the context length?
hub = {
    "HF_MODEL_ID": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    "HF_NUM_CORES": "8",
    "HF_AUTO_CAST_TYPE": "bf16",
    "MAX_BATCH_SIZE": "8",
    "MAX_INPUT_TOKENS": "3686",
    "MAX_TOTAL_TOKENS": "4096",
}
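For reference, this is roughly how I create the endpoint (a minimal sketch of my deployment code following the SageMaker guide; the instance type and role lookup are placeholders for my setup):

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# Build a HuggingFaceModel backed by the TGI Neuronx container,
# passing the hub config above as environment variables.
model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface-neuronx"),
    env=hub,
    role=sagemaker.get_execution_role(),  # placeholder IAM role
)

# Deploy to an Inferentia2 instance with enough Neuron cores for HF_NUM_CORES=8.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.48xlarge",  # placeholder instance type
    container_startup_health_check_timeout=3600,  # model loading can be slow
)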
If I increase MAX_TOTAL_TOKENS from 4096 to 16384 and MAX_INPUT_TOKENS from 3686 to 8384, I get the following error while deploying:
ValueError: No cached version found for deepseek-ai/DeepSeek-R1-Distill-Qwen-32B with {'task': 'text-generation', 'batch_size': 8, 'num_cores': 8, 'auto_cast_type': 'bf16', 'sequence_length': 16384, 'compiler_type': 'neuronx-cc', 'compiler_version': '2.15.143.0+e39249ad', 'checkpoint_id': 'deepseek-ai/DeepSeek-R1-Distill-Qwen-32B', 'checkpoint_revision': 'b950d47742676362558ae821ef2202f847ac8109'}.
You can start a discussion to request it on https://huggingface.co./aws-neuron/optimum-neuron-cache
Alternatively, you can export your own neuron model as explained in https://huggingface.co./docs/optimum-neuron/main/en/guides/export_model#exporting-neuron-models-using-neuronx-tgi
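If there is no cached artifact for a 16k sequence length, I assume I would have to export the model myself, along these lines (a sketch based on the optimum-neuron export docs; the save path is a placeholder):

from optimum.neuron import NeuronModelForCausalLM

# Compile the checkpoint for the larger sequence length.
# These shapes/compiler args mirror the config in the error message above.
compiler_args = {"num_cores": 8, "auto_cast_type": "bf16"}
input_shapes = {"batch_size": 8, "sequence_length": 16384}

model = NeuronModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    export=True,
    **compiler_args,
    **input_shapes,
)

# Save the compiled model so the endpoint can load it.
model.save_pretrained("./deepseek-r1-distill-qwen-32b-neuron-16k")  # placeholder path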
Thank you