Longer context length

#26
opened by comorado

I was able to deploy the model as an endpoint on an AWS Inferentia instance by following the guide provided.

Is it possible to increase the context length?

hub = {
    "HF_MODEL_ID": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    "HF_NUM_CORES": "8",
    "HF_AUTO_CAST_TYPE": "bf16",
    "MAX_BATCH_SIZE": "8",
    "MAX_INPUT_TOKENS": "3686",
    "MAX_TOTAL_TOKENS": "4096",
}
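
For context, the deployment itself follows the standard SageMaker Hugging Face LLM pattern; roughly the sketch below, where the role and instance type are placeholders rather than values from the guide:

# Rough sketch of how the hub dict above is consumed; role and
# instance_type are placeholders, not values taken from the guide.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()  # placeholder: your SageMaker execution role

model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface-neuronx"),  # TGI Neuronx container
    env=hub,  # the configuration dict shown above
    role=role,
)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.48xlarge",  # placeholder Inferentia2 instance type
    container_startup_health_check_timeout=3600,  # loading a 32B model takes a while
)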

If I try to increase MAX_TOTAL_TOKENS from 4096 -> 16384 and MAX_INPUT_TOKENS from 3686 -> 8384, I get the following error while deploying:

ValueError: No cached version found for deepseek-ai/DeepSeek-R1-Distill-Qwen-32B with {'task': 'text-generation', 'batch_size': 8, 'num_cores': 8, 'auto_cast_type': 'bf16', 'sequence_length': 16384, 'compiler_type': 'neuronx-cc', 'compiler_version': '2.15.143.0+e39249ad', 'checkpoint_id': 'deepseek-ai/DeepSeek-R1-Distill-Qwen-32B', 'checkpoint_revision': 'b950d47742676362558ae821ef2202f847ac8109'}.
You can start a discussion to request it on https://huggingface.co./aws-neuron/optimum-neuron-cache
Alternatively, you can export your own neuron model as explained in https://huggingface.co./docs/optimum-neuron/main/en/guides/export_model#exporting-neuron-models-using-neuronx-tgi
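
If I read the error right, the alternative would be to compile the model myself for the larger sequence length, roughly like this with optimum-neuron (an untested sketch based on the export docs linked above; the output directory name is just an example):

# Sketch of exporting a Neuron model compiled for 16k context, following
# the optimum-neuron export docs; the output path is a made-up example.
from optimum.neuron import NeuronModelForCausalLM

model = NeuronModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    export=True,             # compile instead of looking up the public cache
    batch_size=8,
    sequence_length=16384,   # the target MAX_TOTAL_TOKENS
    num_cores=8,
    auto_cast_type="bf16",
)
model.save_pretrained("DeepSeek-R1-Distill-Qwen-32B-neuron-16k")

If I understand the docs correctly, the compiled artifacts could then be pushed to the Hub and referenced via HF_MODEL_ID in place of the original checkpoint.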

Thank you
