How to chat with the Q-RWKV-6?
I'm really interested in your model! This is my first time encountering an RWKV-architecture model, so could you please share the environment requirements and the code needed to chat with it? Thanks!
hi @ljy77777777 - this model is available for inference on Featherless.ai
Check it out here: https://featherless.ai/models/recursal/QRWKV6-32B-Instruct-Preview-v0.1
We provide inference via OpenAI-compatible API endpoints, so you can chat with it using just about any chat client (e.g. TypingMind or SillyTavern, to name a few).
We also provide a basic in-browser client, Phoenix, for quick experimentation.
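For example, here is a minimal sketch of chatting through an OpenAI-compatible endpoint with the official `openai` Python client. The base URL below is an assumption (check the Featherless docs for the exact value), and you will need your own API key:

```python
from openai import OpenAI

# Assumed base URL for the Featherless OpenAI-compatible API; verify in their docs.
client = OpenAI(
    base_url="https://api.featherless.ai/v1",
    api_key="YOUR_FEATHERLESS_API_KEY",
)

response = client.chat.completions.create(
    model="recursal/QRWKV6-32B-Instruct-Preview-v0.1",
    messages=[{"role": "user", "content": "Hello! What is RWKV?"}],
)
print(response.choices[0].message.content)
```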
Thanks for your answer! If I want to deploy Q-RWKV-6 on my own device, is it supported by vLLM now?
I also tried pipeline parallelism (via the transformers and accelerate libraries) to deploy the model across multiple GPUs, and I found it does not work.
Could you please release official code for deploying the model? Thanks!
We don't have vLLM support yet, but we used this HF model a lot internally to run evals with lm-eval-harness and accelerate, so it should definitely work for you. You'll need to install the latest version of the flash-linear-attention repo at https://github.com/sustcsonglin/flash-linear-attention and a recent version of Triton.
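For what it's worth, here is a minimal loading sketch under that setup. The install commands, dtype, and device mapping below are assumptions; adjust them for your hardware:

```python
# pip install -U transformers accelerate triton
# pip install -U git+https://github.com/sustcsonglin/flash-linear-attention

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "recursal/QRWKV6-32B-Instruct-Preview-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumed; pick a dtype your hardware supports
    device_map="auto",           # let accelerate place the weights
    trust_remote_code=True,
)
```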
Thanks for your answer! I deployed the model on my device successfully. However, I find that on many instructions or tasks, the model fully answers the question but then does not stop and keeps generating unrelated content. I think there might be something wrong with my prompt, because I found that RWKV-4-World expects prompts written as follows:
"""Instruction: {instruction}
Input: {input}
Response:"""
So could you please provide the prompt template you use when evaluating the model on benchmarks? Thank you!
Hello, Q-RWKV-6 is an excellent linear-attention LLM. However, I find that the model's performance in Chinese chatting is not very good. Is that because the continued training used only English data?
The chat template is built into the huggingface repo in https://huggingface.co./recursal/QRWKV6-32B-Instruct-Preview-v0.1/blob/main/tokenizer_config.json
Unlike the World-tokenizer-based RWKV models, it follows standard ChatML (e.g. "<|im_start|>", "<|im_end|>"), like the base Qwen model it is adapted from.
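For example, here is a minimal sketch of applying that built-in template with `tokenizer.apply_chat_template` (assuming `model` and `tokenizer` are already loaded as in the sketch above; the sampling settings are illustrative). Stopping at the end-of-turn token also helps avoid the run-on generations mentioned earlier:

```python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the RWKV architecture in two sentences."},
]
# Renders the ChatML-style prompt ("<|im_start|>...<|im_end|>") defined in tokenizer_config.json
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(
    input_ids,
    max_new_tokens=256,
    eos_token_id=tokenizer.convert_tokens_to_ids("<|im_end|>"),  # stop at end of turn
)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```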
Not too sure about Chinese chatting - we did some minor checks and it seemed fine, but it's definitely possible that the training data reduced its abilities there, as we used DCLM.
When running inference with the following code,
```python
prompt = "Hey, are you conscious? Can you talk to me?"
inputs = tokenizer(prompt, return_tensors="pt")
generate_ids = model.generate(inputs.input_ids, max_length=30)
```
we get the error: `RuntimeError: probability tensor contains either inf, nan or element < 0`
hi @ljy77777777 , how did you load the model with multiple GPUs? I set the device_map="auto" and it goes with an AttributeError: 'RWKV6State' object has no attribute '_modules'. :(
hi @York-Z , I encountered the same issue; I guess RWKV6State cannot be used with pipeline parallelism (multiple GPUs). However, because RWKV only needs constant space for its state (there is no growing KV cache), a single GPU can run the 32B model at 32K context length.
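Roughly, as a back-of-envelope sketch (all architecture numbers below are illustrative assumptions, not the exact QRWKV/Qwen configs):

```python
# Illustrative comparison: per-token KV cache of a transformer vs. the
# constant recurrent state of an RWKV-style model. All sizes are assumptions.
layers, kv_heads, head_dim, bytes_per = 64, 8, 128, 2  # bf16, GQA-style KV

kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per  # K and V
ctx = 32_768
print(f"Transformer KV cache @ {ctx} tokens: {kv_per_token * ctx / 2**30:.1f} GiB")

# RWKV-style state: one fixed-size matrix-valued state per head per layer,
# independent of sequence length (again, sizes are illustrative).
heads = 64
state = layers * heads * head_dim * head_dim * bytes_per
print(f"Recurrent state (constant in sequence length): {state / 2**30:.2f} GiB")
```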
reply @ljy77777777 : Oh OK. I tested the inference time of QRWKV and Qwen-32B with a 4K-token input prompt on a single GPU, and found that QRWKV is slightly slower than Qwen-32B. Did you get a similar result?
reply @York-Z Yes, I got a similar result, and I think the reason may be that the Flash Linear Attention kernel still runs as an RNN-style recurrence; peak GPU utilization was only 43%. I suspect the prefill in the fla kernel is not computed in parallel.
So you can do tensor parallelism (TP) by splitting head-wise; flash-linear-attention also has some other kernels that are faster along the sequence dimension instead of the batch dimension.
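As a conceptual sketch of the head-wise split (this is plain linear attention without decay, not the actual fla/RWKV6 kernels; shapes and the recurrence are illustrative):

```python
import torch

def recurrent_linear_attention(q, k, v):
    # Toy linear-attention recurrence (no decay); q, k, v: [batch, heads, time, dim].
    B, H, T, D = q.shape
    state = torch.zeros(B, H, D, D, device=q.device, dtype=q.dtype)
    out = torch.empty_like(v)
    for t in range(T):
        state = state + k[:, :, t, :, None] * v[:, :, t, None, :]  # outer-product update
        out[:, :, t] = torch.einsum("bhd,bhde->bhe", q[:, :, t], state)  # query read-out
    return out

# Heads are independent, so a head-wise split is a simple form of tensor parallelism:
# each device runs the recurrence for its own slice of heads.
q, k, v = (torch.randn(1, 8, 16, 64) for _ in range(3))
devices = ["cpu", "cpu"]  # e.g. ["cuda:0", "cuda:1"] on a multi-GPU machine
heads_per_dev = q.shape[1] // len(devices)
chunks = []
for i, dev in enumerate(devices):
    sl = slice(i * heads_per_dev, (i + 1) * heads_per_dev)
    chunks.append(
        recurrent_linear_attention(q[:, sl].to(dev), k[:, sl].to(dev), v[:, sl].to(dev)).cpu()
    )
out = torch.cat(chunks, dim=1)  # reassemble along the head dimension
```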