EN-T5-Sci
A T5-base model with continued pretraining on a cleaned English scientific corpus derived from Unpaywall.
Checkpoint: pretraining_logs_lr_001_OPTIMIZED_clean_restart/.../step-487500-val_ppl-3.72168.ckpt (see conversion_info.json for provenance).
Model Details
- Architecture: T5-base (12 encoder layers / 12 decoder layers, d_model=768, ~220M parameters)
- Objective: Span corruption (15% noise density, mean span length 3)
- Sequence prep: Sliding windows of 512 tokens with 50% overlap
- Optimizer: Adafactor, lr=1e-3, linear warmup (20k steps) followed by inverse square-root decay, gradient clipping at 1.0
- Hardware: 4× NVIDIA H100 (mixed precision, gradient accumulation over 2 steps, effective batch size 384)
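The optimizer entry above describes a learning rate that rises linearly during warmup and then decays with the inverse square root of the step count. The sketch below shows one common way to write such a schedule. It assumes the peak rate of 1e-3 is reached exactly at the end of the 20k-step warmup and that the decay is continuous at that point; the function name `lr_at_step` and the exact decay formula are illustrative assumptions, not taken from the training code.

```python
import math


def lr_at_step(step: int, peak_lr: float = 1e-3, warmup_steps: int = 20_000) -> float:
    """Linear warmup to peak_lr, then inverse square-root decay (assumed form)."""
    if step < 0:
        raise ValueError("step must be non-negative")
    if step < warmup_steps:
        # Linear ramp from 0 at step 0 to peak_lr at warmup_steps.
        return peak_lr * step / warmup_steps
    # Decay proportional to 1/sqrt(step), scaled so it equals peak_lr at warmup_steps.
    return peak_lr * math.sqrt(warmup_steps / step)


# Print the rate at a few points, including the step of the released checkpoint.
for s in (0, 10_000, 20_000, 80_000, 487_500):
    print(f"step {s:>7}: lr = {lr_at_step(s):.6f}")
```

Under these assumptions the rate at step 80,000 is half the peak, since the step count is four times the warmup length.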
Training Data
The corpus consists of English scientific text (approx. 230 GB, ~11 M documents), cleaned with DataTrove and custom regex rules (see the thesis section “Automatic Data Preprocessing”). Text is tokenized with SentencePiece using the original T5 vocabulary.
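The "Sequence prep" entry in Model Details describes 512-token windows with 50% overlap. A minimal sketch of that windowing over an already tokenized document follows. It assumes a stride of 256 tokens and that a short final window is kept rather than dropped or padded; the card does not state how the tail is handled, and the function name `sliding_windows` is hypothetical.

```python
from typing import List


def sliding_windows(token_ids: List[int], window: int = 512, overlap: float = 0.5) -> List[List[int]]:
    """Split token_ids into windows of `window` tokens that overlap by `overlap`."""
    if window <= 0 or not 0.0 <= overlap < 1.0:
        raise ValueError("window must be positive and overlap must be in [0, 1)")
    if not token_ids:
        return []
    # 50% overlap on 512 tokens gives a stride of 256.
    stride = max(1, int(window * (1.0 - overlap)))
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + window])
        # Stop once the current window reaches the end of the document.
        if start + window >= len(token_ids):
            break
        start += stride
    return windows


# Stand-in for a tokenized document of 1280 tokens.
doc = list(range(1280))
chunks = sliding_windows(doc)
print(len(chunks), [chunk[0] for chunk in chunks])
```

With this stride, every token except those in the first and last half-window appears in two training sequences.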
Evaluation (Global-MMLU, zero-shot, Global benchmark)
| Metric | EN | DE |
|---|---|---|
| Overall accuracy | 0.2687 | 0.2688 |
| Humanities | 0.2419 | 0.2414 |
| STEM | 0.2851 | 0.2858 |
| Social Sciences | 0.3107 | 0.3107 |
| Other | 0.2510 | 0.2514 |
Full plots and per-subtask CSV files are in evaluation_results/scientific_crosslingual_transfer_eval_full_15k/.
Intended Use
Intended for zero-shot scientific QA and as a warm start for downstream fine-tuning on English scientific NLP tasks. Load the model with T5ForConditionalGeneration.from_pretrained("rausch/en-t5-sci-continued-pretraining-487k").
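Because the checkpoint was trained with span corruption and has not been fine-tuned, the most direct way to probe it is sentinel infilling. The sketch below builds a T5-style masked input and its reference target from a word list, then shows how the input could be passed to the model. The helpers `mask_spans` and `fill_spans` are hypothetical names introduced here, the example sentence is made up, and the code assumes the repository ships tokenizer files loadable with `AutoTokenizer`.

```python
from typing import List, Tuple

MODEL_ID = "rausch/en-t5-sci-continued-pretraining-487k"


def mask_spans(words: List[str], spans: List[Tuple[int, int]]) -> Tuple[str, str]:
    """Replace word spans [start, end) with T5 sentinels; return (input_text, target_text)."""
    inputs: List[str] = []
    targets: List[str] = []
    cursor = 0
    for i, (start, end) in enumerate(sorted(spans)):
        if start < cursor or end <= start or end > len(words):
            raise ValueError("spans must be non-empty, in range, and non-overlapping")
        sentinel = f"<extra_id_{i}>"
        inputs.extend(words[cursor:start])
        inputs.append(sentinel)          # the sentinel stands in for the removed span
        targets.append(sentinel)         # the target repeats the sentinel, then the span
        targets.extend(words[start:end])
        cursor = end
    inputs.extend(words[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel of the T5 target format
    return " ".join(inputs), " ".join(targets)


def fill_spans(input_text: str, max_new_tokens: int = 32) -> str:
    """Generate sentinel-delimited fills for input_text (downloads the checkpoint)."""
    # Imported lazily so mask_spans works without transformers installed.
    from transformers import AutoTokenizer, T5ForConditionalGeneration

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = T5ForConditionalGeneration.from_pretrained(MODEL_ID)
    batch = tokenizer(input_text, return_tensors="pt")
    output_ids = model.generate(**batch, max_new_tokens=max_new_tokens)
    # Keep special tokens so the sentinels separating the fills stay visible.
    return tokenizer.decode(output_ids[0], skip_special_tokens=False)


words = "The enzyme catalyzes the hydrolysis of ATP in the presence of magnesium ions".split()
masked, reference = mask_spans(words, [(2, 5), (11, 13)])
print(masked)
print(reference)
# print(fill_spans(masked))  # uncomment when network access and transformers are available
```

For question answering or classification, the model would typically need task-specific fine-tuning first, since the pretraining objective only teaches it to reconstruct masked spans.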
Limitations
- Inherits the T5-base context length (512 tokens) and tokenizer.
- Evaluated only on Global-MMLU EN/DE; other tasks may require fine-tuning.
- Training corpus is English-only; no guarantees about other languages.
Citation
Please cite the Bachelor’s thesis (link) and Raffel et al. (2020) for T5.