Getting an error while using extract_embs to extract geneformer embeddings

#488
by GalenP - opened

While running this code:

embs = embex.extract_embs(
    model_directory=model_dir,
    input_data_file=token_dir + "m_f_20.dataset",
    output_directory=output_dir,
    output_prefix="emb",
)
I'm getting this error:

RuntimeError: The expanded size of the tensor (4096) must match the existing size (2048) at non-singleton dimension 1. Target sizes: [30, 4096]. Tensor sizes: [1, 2048]

(I guess the error is because the data has 4096 tokens, but the model expects 2048.)

I'm using a CELLxGENE fine-tuned model with "max_position_embeddings": 2048,
but when tokenizing with the small token file, it gives me a token-not-found error.

How do I run embex.extract_embs on the 30M token file? Besides the __init__.py in geneformer, do I need to change the token file anywhere else, or set special_token=False anywhere, to run embex.extract_embs?

Thanks for your question. It is critical to use the correct token dictionary for the correct model. Please ensure you are using the 30M token dictionary if you are using a model that is fine-tuned from the base model of Geneformer trained on Genecorpus-30M. Also, ensure you set special_token to False for tokenizing new datasets for the 30M model. Furthermore, the 30M model does not have a CLS token so please set the embeddings to "cell" where relevant. Please see the notes about the dictionary in the example and in the documentation.
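As a quick way to apply that advice, one can inspect the token dictionary pickle directly and derive the matching settings from which special tokens it actually contains. This is an illustrative helper, not part of the Geneformer API; the file path and function name are placeholders:

```python
# import pickle
# with open("token_dictionary_gc30M.pkl", "rb") as f:  # your dictionary path
#     token_dict = pickle.load(f)

def settings_for_token_dict(token_dict):
    """Suggest tokenizer/extractor settings based on which special
    tokens the dictionary contains (illustrative helper, not Geneformer API)."""
    has_cls = "<cls>" in token_dict
    has_eos = "<eos>" in token_dict
    return {
        # tokenize with special tokens only if the dictionary defines them
        "special_token": has_cls and has_eos,
        # the 30M dictionary has no <cls>, so CLS embeddings are unavailable
        "emb_mode": "cls" if has_cls else "cell",
    }

# Example: a 30M-style dictionary with no <cls>/<eos> entries
toy_30m_dict = {"<pad>": 0, "<mask>": 1, "ENSG00000141510": 2}
print(settings_for_token_dict(toy_30m_dict))
# → {'special_token': False, 'emb_mode': 'cell'}
```

Running this against the real token_dictionary_gc30M.pkl should report special_token=False and emb_mode="cell", matching the guidance above.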

ctheodoris changed discussion status to closed

Thanks so much for your response and support, it's very much appreciated!
When running inference with the 30M model and the 30M token file, the predictions come out fine, but when running:
embex = EmbExtractor(
    model_type="CellClassifier",
    num_classes=n_classes,
    max_ncells=None,
    emb_label=["joinid"],
    emb_layer=0,
    forward_batch_size=30,
    nproc=8,
    token_dictionary_file="/hpcfs/users/a1841503/Geneformer/geneformer/gene_dictionaries_30m/token_dictionary_gc30M.pkl",
)
embs = embex.extract_embs(
    model_directory=model_dir,
    input_data_file=token_dir + "m_f_20.dataset",
    output_directory=output_dir,
    output_prefix="emb",
)

I am still getting an error:

610 model = pu.load_model(
611 self.model_type, self.num_classes, model_directory, mode="eval"
612 )
613 layer_to_quant = pu.quant_layers(model) + self.emb_layer
--> 614 embs = get_embs(
615 model=model,
616 filtered_input_data=downsampled_data,
617 emb_mode=self.emb_mode,
618 layer_to_quant=layer_to_quant,
619 pad_token_id=self.pad_token_id,
620 forward_batch_size=self.forward_batch_size,
621 token_gene_dict=self.token_gene_dict,
622 summary_stat=self.summary_stat,
623 )
...
---> 74 assert cls_present, "<cls> token missing in token dictionary"
75 # Check to make sure that the first token of the filtered input data is cls token
76 gene_token_dict = {v: k for k, v in token_gene_dict.items()}

AssertionError: <cls> token missing in token dictionary

Any suggestions on how to solve this error?

The 30M model doesn't have a CLS token, so change the embedding mode (emb_mode) to "cell".
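Concretely, the fix amounts to one extra argument in the constructor call shown above. A sketch of the configuration with the key change highlighted (the num_classes value and dictionary filename here are placeholders; verify the exact kwargs against your installed Geneformer version):

```python
# Same EmbExtractor configuration as in the post above, but with the
# embedding mode set to "cell", since the 30M dictionary has no <cls> token.
embex_kwargs = dict(
    model_type="CellClassifier",
    num_classes=3,                # placeholder; use your n_classes
    emb_mode="cell",              # key change: 30M model has no CLS token
    max_ncells=None,
    emb_label=["joinid"],
    emb_layer=0,
    forward_batch_size=30,
    nproc=8,
    token_dictionary_file="token_dictionary_gc30M.pkl",  # 30M dictionary
)

# With Geneformer installed, one would then run (not executed here):
# from geneformer import EmbExtractor
# embex = EmbExtractor(**embex_kwargs)

print(embex_kwargs["emb_mode"])
# → cell
```

With emb_mode="cell", get_embs no longer asserts that a <cls> token is present, so the AssertionError above goes away.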

Thanks so much for that

Thanks, that solved the issue.
