Example of `training.txt` and `validation.txt` for fine tuning ProtGPT2
@nferruz Hi Noelia,
Thank you for your great work. I sincerely believe it will make a great contribution to protein biology.
I would like to try fine-tuning your ProtGPT2.
Specifically, I am using this command from your model card:
python run_clm.py --model_name_or_path nferruz/ProtGPT2 --train_file training.txt --validation_file validation.txt --tokenizer_name nferruz/ProtGPT2 \
--do_train --do_eval --output_dir output --learning_rate 1e-06
Can you give an example of what the content of training.txt and validation.txt looks like?
Let's say I have this FASTA file that I want to turn into training.txt.
>myseq1
MTAQIVRTLGWRLIRRTRRQQAGEQPHHPPAPSAPAVPSTPAKQAPTPESGMPSKRALRE
>myseq2
ARERAAAAAGPASSAGPTASGTRPEETASRATNSARDAAGESAARSATGPRDRASPGPTG
Should I format it this way:
<|endoftext|>
MTAQIVRTLGWRLIRRTRRQQAGEQPHHPPAPSAPAVPSTPAKQAPTPESGMPSKRALRE
ARERAAAAAGPASSAGPTASGTRPEETASRATNSARDAAGESAARSATGPRDRASPGPTG
Or this way?
<|endoftext|>
MTAQIVRTLGWRLIRRTRRQQAGEQPHHPPAPSAPAVPSTPAKQAPTPESGMPSKRALRE
<|endoftext|>
ARERAAAAAGPASSAGPTASGTRPEETASRATNSARDAAGESAARSATGPRDRASPGPTG
Thanks and hope to hear from you again.
Sincerely,
Littleworth.
Hi Littleworth,
It should be like your second example. Please bear in mind that the sequences must be wrapped exactly as in FASTA format, with a newline character every 60 amino acids.
Like this:
<|endoftext|>
MKDIDTLISNNALWSKMLVEEDPGFFEKLAQAQKPRFLWIGCSDSRVPAERLTGLEPGEL
FVHRNVANLVIHTDLNCLSVVQYAVDVLEVEHIIICGHYGCGGVQAAVENPELGLINNWL
LHIRDIWFKHSSLLGEMPQERRLDTLCELNVMEQVYNLGHSTIMQSAWKRGQKVTIHGWA
YGIHDGLLRDLDVTATNRETLEQRYRHGISNLKLKHANHK
<|endoftext|>
#ANOTHER SEQUENCE
But not like this:
<|endoftext|>
MKDIDTLISNNALWSKMLVEEDPGFFEKLAQAQKPRFLWIGCSDSRVPAERLTGLEPGELFVHRNVANLVIHTDLNCLSVVQYAVDVLEVEHIIICGHYGCGGVQAAVENPELGLINNWLLHIRDIWFKHSSLLGEMPQERRLDTLCELNVMEQVYNLGHSTIMQSAWKRGQKVTIHGWAYGIHDGLLRDLDVTATNRETLEQRYRHGISNLKLKHANHK
<|endoftext|>
#ANOTHER SEQUENCE
Thank you for using ProtGPT2 and posting!
Noelia
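For readers who want to automate the format described above, here is a minimal sketch that converts FASTA text into that layout: each sequence is preceded by a `<|endoftext|>` line and wrapped at 60 residues per line. The function name `fasta_to_training_text` and the `width` parameter are hypothetical and not part of ProtGPT2 or Transformers; the sketch assumes the FASTA headers can simply be dropped.

```python
def fasta_to_training_text(fasta_text, width=60):
    """Convert FASTA text to the layout described in this thread.

    Hypothetical helper: headers (lines starting with '>') are discarded,
    each sequence is joined, then re-wrapped at `width` residues per line
    and preceded by a '<|endoftext|>' line.
    """
    sequences = []
    current = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if it had any residues.
            if current:
                sequences.append("".join(current))
                current = []
        else:
            current.append(line)
    if current:
        sequences.append("".join(current))

    out_lines = []
    for seq in sequences:
        out_lines.append("<|endoftext|>")
        # Wrap the sequence so that no line exceeds `width` residues.
        out_lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out_lines) + "\n" if out_lines else ""
```

The returned string can then be written to `training.txt` or `validation.txt`, for example with `open("training.txt", "w").write(fasta_to_training_text(open("my.fasta").read()))`, where `my.fasta` is a placeholder file name.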
@nferruz Thank you so much!
Hello dear @nferruz,
Even though my train_file is in the following format, I get train_samples = 1 in the output. For the example below, it should be 2 instead of 1.
<|endoftext|>
ETLVKVKDAEDQLGARVGYIELDLNSGKILESFRPEERFPMMSTFKVLLCGAVLSRIDAG
QEQLGRRIHYSQNDLVEYSPVTEKHLTDG
<|endoftext|>
ETLVKVKDAEDQLGARVGYIELDLNSGKILESFRPEERFPMMSTFKVLLCGAVLSRIDAG
QEQLGRRIHYSQNDLVEYSPVTEKHLTAG
Why is the input file not being read correctly? Do you have any idea how to resolve this?
Hello!
I believe you mean during training? In that case, the number of samples is the number of groups of 512 tokens that are passed in batches to the model. With those two sequences you're below 512 tokens, hence you don't get more than one sample.
Hope this helps
Noelia
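The grouping described above can be illustrated with a simplified sketch. It assumes the training script concatenates all token IDs and cuts them into blocks of 512 tokens; the function name `group_tokens` is hypothetical, and the way a final short block is handled (kept here when the whole input is shorter than one block) is an assumption that may differ between versions of `run_clm.py`.

```python
def group_tokens(token_ids, block_size=512):
    """Split one concatenated list of token IDs into fixed-size blocks.

    Simplified, hypothetical sketch of a block-grouping step: when there
    are at least `block_size` tokens, the remainder that does not fill a
    whole block is dropped; shorter inputs are kept as a single block.
    """
    total_length = len(token_ids)
    if total_length >= block_size:
        # Keep only as many tokens as fit into whole blocks.
        total_length = (total_length // block_size) * block_size
    return [token_ids[i:i + block_size] for i in range(0, total_length, block_size)]


# Two short sequences tokenized to, say, 100 tokens in total (made-up number)
# form a single sample, regardless of how many sequences they came from.
short_input = list(range(100))
print(len(group_tokens(short_input)))

# Roughly 1100 tokens would give two full blocks of 512 tokens each.
longer_input = list(range(1100))
print(len(group_tokens(longer_input)))
```

Under these assumptions the sample count depends on the total number of tokens, not on the number of `<|endoftext|>`-separated sequences in the file.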