Ensembl gene ID version used in Geneformer (ENSG)

#575

by jenny143 - opened Dec 26, 2025

Hello, Geneformer team,
I have a question regarding the file ensembl_mapping_dict_gc30M.pkl.
Could you please clarify which Ensembl human gene annotation version was used to generate the ENSG identifiers in this mapping dictionary?
This information would be very helpful for ensuring consistent cross-species gene mapping and compatibility with the Geneformer tokenizer.
Thank you very much for your help.

ctheodoris

Owner Dec 26, 2025

Thank you for your question. Most public data is not annotated by version, so when we integrate data we go by the Ensembl ID number; if a gene is provided as a gene name, we convert it to the Ensembl ID based on the current version of public tools like MyGene. Generally, though, the IDs should be stable for our purposes when we are working with already aligned counts. The 30M token dictionary used the Ensembl IDs that were current as of early 2021.
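For readers who want to check how well their own gene IDs line up with the dictionary, here is a minimal sketch. It assumes the unpickled `ensembl_mapping_dict_gc30M.pkl` behaves like a plain Python dict keyed by unversioned ENSG IDs; that structure is an assumption, not something confirmed in this thread. The helper names `strip_version` and `check_coverage` and the `toy_mapping` data are hypothetical and used only for illustration.

```python
import re

# Human Ensembl gene ID: "ENSG" + 11 digits, with an optional ".N" version suffix.
ENSG_PATTERN = re.compile(r"^(ENSG\d{11})(?:\.\d+)?$")


def strip_version(gene_id):
    """Return the unversioned ENSG ID, or None if the input is not a human Ensembl gene ID."""
    match = ENSG_PATTERN.match(gene_id.strip())
    return match.group(1) if match else None


def check_coverage(gene_ids, mapping):
    """Split gene IDs into those present in the mapping and those missing from it."""
    found, missing = {}, []
    for gene_id in gene_ids:
        core_id = strip_version(gene_id)
        if core_id is not None and core_id in mapping:
            found[gene_id] = mapping[core_id]
        else:
            missing.append(gene_id)
    return found, missing


# Hypothetical stand-in for the real dictionary. In practice one might load it with
# pickle.load(open("ensembl_mapping_dict_gc30M.pkl", "rb")), assuming it is a dict.
toy_mapping = {
    "ENSG00000139618": "ENSG00000139618",
    "ENSG00000141510": "ENSG00000141510",
}

my_ids = ["ENSG00000139618.15", "ENSG00000141510", "ENSG00000999999", "BRCA2"]
found, missing = check_coverage(my_ids, toy_mapping)
print("found:", found)
print("missing:", missing)
```

IDs reported as missing are the ones worth checking by hand, for example gene names that still need converting to Ensembl IDs, or IDs that are absent from the dictionary.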

ctheodoris changed discussion status to closed Dec 26, 2025
