This is an early checkpoint (step 2,000) from a small decoder-only GPT-style experiment. It is shared primarily for transparency and to help others reproduce or build upon the setup. This checkpoint is not production-ready.
Training code: train_run1.py (included here), with the exact launch command in RUN_COMMAND.txt. Additional changes were implemented in the training/data pipeline for future iterations (beyond this checkpoint).
References:
- FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision (arXiv:2407.08608)

Usage example (load the checkpoint and sample a short continuation):
from transformers import AutoTokenizer, LlamaForCausalLM
import torch

m = 'ethanker/nanomind-step-002000'
tok = AutoTokenizer.from_pretrained(m, use_fast=True)
# Load in bfloat16 on GPU; fall back to float32 on CPU.
model = LlamaForCausalLM.from_pretrained(m, torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32)
model.eval().to('cuda' if torch.cuda.is_available() else 'cpu')

prompt = "Once upon a time,"
inputs = tok(prompt, return_tensors='pt').to(model.device)
# Nucleus sampling; expect rough output from an early (step 2,000) checkpoint.
out = model.generate(**inputs, do_sample=True, top_p=0.9, temperature=0.8, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))
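The architecture hyperparameters are not listed above, but they are recorded in the uploaded config files and can be inspected with the standard AutoConfig API. A minimal sketch, assuming the usual LlamaConfig field names:

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained('ethanker/nanomind-step-002000')
# Standard LlamaConfig fields; the values come from the checkpoint's config.
print(cfg.num_hidden_layers, cfg.hidden_size, cfg.num_attention_heads, cfg.vocab_size)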
Files:
- model.safetensors, tokenizer/config files
- train_run1.py (training code snapshot)
- RUN_COMMAND.txt (exact command used)
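To fetch these files for offline inspection or reproduction, one option is huggingface_hub's snapshot_download. A minimal sketch; the allow_patterns filter is an assumption about which file types matter here:

from huggingface_hub import snapshot_download

# Downloads the repo contents into the local HF cache and returns the path.
local_dir = snapshot_download(
    repo_id='ethanker/nanomind-step-002000',
    allow_patterns=['*.safetensors', '*.json', '*.txt', '*.py', '*.model'],
)
print(local_dir)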