Ensembl gene ID version used in Geneformer (ENSG)
Hello, Geneformer team,
I have a question regarding the file ensembl_mapping_dict_gc30M.pkl.
Could you please clarify which Ensembl human gene annotation version was used to generate the ENSG identifiers in this mapping dictionary?
This information would be very helpful for ensuring consistent cross-species gene mapping and compatibility with the Geneformer tokenizer.
Thank you very much for your help.
Thank you for your question. Most public data is not annotated with an Ensembl version, so when we integrate data we go by the Ensembl ID itself; if only a gene name is provided, we convert it to an Ensembl ID using the current version of public tools such as MyGene. In general, though, the IDs should be stable for our purposes, since we are working with already-aligned counts. The 30M token dictionary used the Ensembl IDs that were current as of early 2021.
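If it helps to check compatibility on your side, here is a minimal sketch (not part of Geneformer itself) that strips any version suffix from your ENSG IDs and reports which ones appear as keys in a mapping dictionary. It assumes the pickle loads as a plain Python dict keyed by unversioned Ensembl IDs; the helper names `strip_version`, `load_mapping`, and `check_coverage` are hypothetical and only for illustration.

```python
import pickle
from pathlib import Path


def strip_version(ensembl_id: str) -> str:
    """Drop a trailing version suffix: 'ENSG00000141510.17' -> 'ENSG00000141510'."""
    return ensembl_id.split(".", 1)[0]


def load_mapping(path):
    """Load a pickled mapping; assumed (not confirmed) to be a dict keyed by Ensembl ID."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


def check_coverage(gene_ids, mapping):
    """Split unversioned IDs into those present as keys in `mapping` and those absent."""
    known = set(mapping)
    stripped = [strip_version(g) for g in gene_ids]
    found = [g for g in stripped if g in known]
    missing = [g for g in stripped if g not in known]
    return found, missing


if __name__ == "__main__":
    # Toy stand-in for the real dictionary; replace with
    # load_mapping("ensembl_mapping_dict_gc30M.pkl") once the file is downloaded.
    toy_mapping = {"ENSG00000141510": "ENSG00000141510"}
    found, missing = check_coverage(
        ["ENSG00000141510.17", "ENSG00000999999"], toy_mapping
    )
    print("found:", found)
    print("missing:", missing)
```

IDs that come back as missing are the ones worth checking by hand against your own annotation release, since they may have been retired or added after early 2021.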