puru22/falcon-40b-instruct-fast

This repo was created to share a correction that makes token generation faster with Falcon models. The same set of corrections applies to the other models in the Falcon family.

This is the original repo: https://huggingface.co./tiiuae/falcon-40b-instruct. I have raised a pull request, https://huggingface.co./tiiuae/falcon-40b-instruct/discussions/64, for the changes and will remove this repo once the required changes are made in the original repo.

The correction is in the modelling_RW.py file. This is a detailed summary of the corrections.

The current code does not pass past_key_values in every forward pass, which fast token generation needs. This results in a lot of recomputation. The "modelling_RW.py" I am uploading handles this the way generation/utils.py in the Hugging Face transformers (PyTorch) package expects. All the changes are about including past_key_values everywhere. I think the same changes apply to pretty much all of the Falcon family models with slow generation. These are the specific changes.
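To give a rough sense of how much recomputation is involved, here is a minimal counting sketch. It is not code from modelling_RW.py: the function name positions_processed is hypothetical, and it assumes that without a cache every step runs the forward pass over the whole sequence, while with a cache only the first step processes the prompt and each later step processes one new token.

```python
def positions_processed(prompt_length, new_tokens, use_cache):
    """Count token positions pushed through the model during generation.

    Hypothetical helper for illustration only; it counts positions,
    it does not run a model.
    """
    total = 0
    seq_len = prompt_length
    for step in range(new_tokens):
        if use_cache and step > 0:
            # Keys and values of earlier tokens come from past_key_values,
            # so only the newest token is processed.
            total += 1
        else:
            # No cache (or the very first step): the full sequence is processed.
            total += seq_len
        seq_len += 1  # one token is appended after every step
    return total


if __name__ == "__main__":
    print(positions_processed(10, 5, use_cache=False))  # 60
    print(positions_processed(10, 5, use_cache=True))   # 14
```

The gap grows with both the prompt length and the number of generated tokens, which is why the missing cache shows up as slow generation.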

Class RotaryEmbedding, forward method: include past_seq_length in the forward pass and apply the rotary embedding according to the position of the query token.
_make_causal_mask function: build the mask the way F.scaled_dot_product_attention interprets it. F.scaled_dot_product_attention treats the attention_mask matrix as receiving attention; in a boolean mask, True marks a position that takes part in attention. For example, if attention_mask is [[True, False], [True, True]], the first token is "receiving" attention from the first token and not from the second token. This is unlike what we generally end up thinking, which is that the first token is giving attention to itself and not to the second one. For this reason the past_key_values positions are all True in the _make_causal_mask function. I have also reversed the inequality on the line above that, for the same reason.
Class Attention, forward method: a) the past_key_value length is passed to the rotary function; b) the past key and the current key are concatenated after permuting the past key to match the shape of the current key; c) to keep the key_layer shape consistent with the expected output shape, which is (batch_size, head_dim, seq_length), another permutation is done before creating "present" to return in the output; d) an if/else is added depending on whether the attention mask has been created or not, which the current code just ignores.
Class RWModel, prepare_attn_mask method: removed the src_length > 1 criterion for making the causal mask.
RW causal LM, prepare_inputs_for_generation: read past_key_values from the input coming from the Hugging Face generate method and don't call the convert_to_rw_cache method.
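The mask convention and the position offset described above can be illustrated with a small pure-Python sketch. This is not the code from modelling_RW.py: the names make_causal_mask and query_positions are hypothetical, nested lists stand in for tensors, and it assumes the boolean convention where True means the position takes part in attention, with rows as query tokens and columns as key tokens.

```python
def make_causal_mask(target_length, past_key_values_length):
    """Boolean causal mask of shape (target_length, past + target_length).

    Hypothetical illustration; True means "this key takes part in attention".
    """
    mask = []
    for q in range(target_length):
        # Every cached (past) key is visible to every current query token.
        past = [True] * past_key_values_length
        # Among the current tokens, query q sees keys at positions <= q.
        current = [k <= q for k in range(target_length)]
        mask.append(past + current)
    return mask


def query_positions(query_length, past_seq_length):
    """Absolute positions of the query tokens, offset by the cached length."""
    return list(range(past_seq_length, past_seq_length + query_length))


if __name__ == "__main__":
    print(make_causal_mask(2, 0))  # [[True, False], [True, True]]
    print(make_causal_mask(1, 3))  # [[True, True, True, True]]
    print(query_positions(1, 3))   # [3]
```

With a cache of three tokens and one new token, the single query row sees all four keys and is rotated as position 3 rather than position 0, which is the behaviour the changes to the mask and to RotaryEmbedding aim for.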

