Happy New Year, Hugging Face community! In 2025, I'll continue my quantization (and some fine-tuning) efforts to support open-source AI and make knowledge free for everyone.
The deepseek-ai/DeepSeek-V3-Base model was featured today on CNBC tech news. The whale made a splash by using FP8 and shrinking the cost of training significantly!
I've built a small utility to split safetensors files, file by file. The need came up when I tried to convert the new DeepSeek V3 model from FP8 to BF16. The only Ada-architecture GPU I have is an RTX 4080, and its 16GB of VRAM just wasn't enough for the conversion.
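For reference, here is a minimal sketch of what such an FP8-to-BF16 conversion can look like. It assumes (my reading of the DeepSeek V3 checkpoint layout, so verify before relying on it) that FP8 weights are stored as float8_e4m3fn with companion "<name>_scale_inv" tensors quantized in 128x128 blocks; the shard filename is a placeholder.

```python
# Hedged sketch: per-block FP8 -> BF16 dequantization of one safetensors shard.
# Assumes float8_e4m3fn weights plus "<name>_scale_inv" tensors with 128x128 blocks.
import torch
from safetensors.torch import load_file, save_file

BLOCK = 128  # assumed quantization block size

def dequant_fp8(weight: torch.Tensor, scale_inv: torch.Tensor) -> torch.Tensor:
    w = weight.to(torch.float32)
    # Expand the per-block scales to the full weight shape, then multiply.
    s = scale_inv.float().repeat_interleave(BLOCK, dim=0)[: w.shape[0]]
    s = s.repeat_interleave(BLOCK, dim=1)[:, : w.shape[1]]
    return (w * s).to(torch.bfloat16)

tensors = load_file("model-00001-of-000163.safetensors")  # hypothetical shard name
out = {}
for name, t in tensors.items():
    if t.dtype == torch.float8_e4m3fn and f"{name}_scale_inv" in tensors:
        out[name] = dequant_fp8(t, tensors[f"{name}_scale_inv"])
    elif not name.endswith("_scale_inv"):
        out[name] = t  # keep non-FP8 tensors, drop the now-unneeded scales
save_file(out, "model-00001-of-000163-bf16.safetensors", metadata={"format": "pt"})
```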
BTW: I'll upload the BF16 version here: DevQuasar/deepseek-ai.DeepSeek-V3-Base-bf16 (it will take a while - days with my upload speed). If anyone has access to the resources to test it, I'd appreciate feedback on whether it's working or not.
The tool is available here: https://github.com/csabakecskemeti/ai_utils/blob/main/safetensor_splitter.py It splits every file into n pieces by layer where possible and creates a new "model.safetensors.index.json" file. I've tested it with Llama 3.1 8B at multiple split sizes and validated the result with an inference pipeline. Use --help for usage. Please note the current version expects the model to already be sharded into multiple files and to have a "model.safetensors.index.json" layer-to-safetensors mapping file.
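This is not the script itself, just a minimal sketch of the idea: split each existing shard into smaller pieces along tensor boundaries and rebuild the weight_map for a new index file. It uses a simplified grouping by tensor count rather than the script's layer-aware logic, and the shard filenames are placeholders.

```python
# Hedged sketch: split existing safetensors shards and rebuild the index.
import json, os
from safetensors.torch import load_file, save_file

def split_shard(path: str, n: int, out_dir: str) -> dict:
    """Split one safetensors file into n pieces; return tensor-name -> new-file map."""
    os.makedirs(out_dir, exist_ok=True)
    tensors = load_file(path)
    names = list(tensors.keys())
    chunk = max(1, (len(names) + n - 1) // n)
    weight_map = {}
    base = os.path.splitext(os.path.basename(path))[0]
    for i in range(0, len(names), chunk):
        part = {k: tensors[k] for k in names[i : i + chunk]}
        fname = f"{base}-split{i // chunk:05d}.safetensors"
        save_file(part, os.path.join(out_dir, fname), metadata={"format": "pt"})
        weight_map.update({k: fname for k in part})
    return weight_map

# Rebuild a (simplified) index from all split shards; shard names are hypothetical.
index = {"metadata": {}, "weight_map": {}}
for shard in ["model-00001-of-00002.safetensors", "model-00002-of-00002.safetensors"]:
    index["weight_map"].update(split_shard(shard, n=4, out_dir="split_model"))
with open("split_model/model.safetensors.index.json", "w") as f:
    json.dump(index, f, indent=2)
```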
I have this small utility: no_more_typo. It runs in the background and can call an LLM to update the text on the clipboard. I think it's ideal for fixing typos and syntax. I've just added the option to use custom prompt templates to perform different tasks.
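This is not the actual no_more_typo code, just a minimal Python sketch of the clipboard-to-LLM-to-clipboard idea with a customizable prompt template; the model name and template are placeholders, and it assumes an OpenAI-compatible endpoint configured via environment variables.

```python
# Hedged sketch: read clipboard, run it through an LLM prompt template, write it back.
import pyperclip
from openai import OpenAI

PROMPT_TEMPLATE = (
    "Fix typos and grammar in the following text; return only the corrected text:\n\n{text}"
)

client = OpenAI()  # assumes API key / base URL are set in the environment

def fix_clipboard(template: str = PROMPT_TEMPLATE) -> str:
    text = pyperclip.paste()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": template.format(text=text)}],
    )
    corrected = resp.choices[0].message.content
    pyperclip.copy(corrected)
    return corrected

if __name__ == "__main__":
    fix_clipboard()
```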
Repurposed my older AI workstation into a homelab server; it has received 2x V100 + 1x P40. I can reach a huge 210k-token context with MegaBeam-Mistral-7B-512k-GGUF at ~70+ tok/s, or run Llama-3.1-Nemotron-70B-Instruct-HF-GGUF with a 50k context at ~10 tok/s (V100s only: 40k ctx and 15 tok/s). I'm also able to LoRA fine-tune with performance similar to an RTX 3090. It moved to the garage, so no complaints about the noise from the family. Will move it to a rack soon :D
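For anyone curious how a large-context GGUF run can be driven from Python, here is a rough sketch using llama-cpp-python; the model filename, context size, and offload settings are illustrative placeholders, not my exact configuration.

```python
# Hedged sketch: load a GGUF model with a large context and full GPU offload.
from llama_cpp import Llama

llm = Llama(
    model_path="MegaBeam-Mistral-7B-512k.Q8_0.gguf",  # placeholder filename
    n_ctx=210_000,     # large context, assuming enough combined VRAM
    n_gpu_layers=-1,   # offload all layers to the GPUs
)
out = llm("Summarize the following notes:\n...", max_tokens=256)
print(out["choices"][0]["text"])
```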
Some time ago, I built a predictive LLM router that routes chat requests between small and large LLMs based on prompt classification. It dynamically selects the most suitable model depending on the complexity of the user input, ensuring optimal performance while maintaining conversation context. I also fine-tuned a RoBERTa model to use with the package, but you can plug and play any classifier of your choice.
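A hedged sketch of the routing idea (not the package's actual API): classify the prompt with a fine-tuned RoBERTa model, then send it to a small or large backend accordingly. The classifier repo id, label convention, and backend identifiers below are placeholders.

```python
# Hedged sketch: route prompts between a small and a large model via a classifier.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="DevQuasar/roberta-prompt-classifier",  # hypothetical repo id
)

SMALL_MODEL = "small-llm"  # placeholder backend identifiers
LARGE_MODEL = "large-llm"

def route(prompt: str) -> str:
    label = classifier(prompt)[0]["label"]
    # Assumed label convention: "simple" prompts go to the small model.
    return SMALL_MODEL if label.lower() == "simple" else LARGE_MODEL

print(route("What's the capital of France?"))
```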