Universal Assisted Generation: Faster Decoding with Any Assistant Model
Is `top_k` arbitrarily discarding high-quality continuations? Or is `top_p` forgetting to exclude low-probability tokens, derailing your generation? Try out the new `min_p` flag in `generate`, fresh from a PR merged today! 🥬

`min_p` takes a base probability (defined in the flag) and multiplies it by the probability of the most likely token in the distribution for the next token. All tokens less likely than the resulting value are filtered. What happens with this strategy? The cutoff scales with the model's confidence: when one token dominates, few alternatives survive; when the distribution is flat, many do. Set `min_p` to a low value, between 0.05 and 0.1. It behaves particularly well for creative text generation when paired up with temperature > 1.
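A minimal sketch of using the flag, assuming a recent `transformers` version that includes it (the model, prompt, and exact values below are illustrative, not from the post):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # any causal LM works; chosen only for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,    # min_p is a sampling filter, so sampling must be enabled
    min_p=0.05,        # keep tokens with prob >= 0.05 * prob(most likely token)
    temperature=1.2,   # pairs well with temperature > 1 for creative text
    max_new_tokens=50,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

With `min_p=0.05`, if the most likely next token has probability 0.9 the cutoff is 0.045, but if it only has probability 0.2 the cutoff drops to 0.01, so more alternatives stay in play.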
For constrained generation, check out the `outlines` library. You can count on the `outlines` folks to stay on top of the constrained generation game 🧠
In `transformers`, the main blocker is backward compatibility -- we assume in many places that batched inputs come with a fixed input length. Once we lift this requirement without breaking backward compatibility, it should be a nice addition!
(Perhaps nested tensors will help.)
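For context, this is the fixed-length path the reply refers to, as a sketch (model and prompts are illustrative): prompts of different lengths in a batch are padded to a common length and paired with an attention mask.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_id)

prompts = ["Hello, my name is", "The capital of France is"]
# Shorter prompts are padded so the whole batch shares one input length.
inputs = tokenizer(prompts, return_tensors="pt", padding=True)
outputs = model.generate(
    **inputs,
    max_new_tokens=20,
    pad_token_id=tokenizer.pad_token_id,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```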
@MaziyarPanahi no accuracy penalty at all :) The only catch on the `transformers` side is that you are limited to a batch size of one (and even that is not a technical limitation of the technique -- we simply haven't built that code path yet).
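To illustrate the reply, a sketch of assisted generation with a batch size of one (the model pair below is illustrative; the assistant just needs to be much smaller and, here, to share the main model's tokenizer):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2-xl")
model = AutoModelForCausalLM.from_pretrained("gpt2-xl")   # main model
assistant = AutoModelForCausalLM.from_pretrained("gpt2")  # small drafter

inputs = tokenizer("The quick brown fox", return_tensors="pt")  # batch size of 1
outputs = model.generate(**inputs, assistant_model=assistant, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The main model verifies the assistant's drafted tokens, so the output distribution is unchanged -- hence no accuracy penalty.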
Add `prompt_lookup_num_tokens=10` to your `generate` call, and you'll get faster LLMs 🔥