On recent versions of Transformers, you can often simply pass output_logits=True.
You are correct: out.scores holds the post-processed scores. To get the raw logits before temperature/top-k/top-p/etc. are applied, use output_logits=True in generate, or fall back to a manual loop / forward pass if your version does not support it.
I will keep this focused on exactly what you need.
1. What happens inside generate
At each decoding step, generate does roughly:
outputs = model(**model_inputs)
step_logits = outputs.logits[:, -1, :] # raw LM-head logits, [batch, vocab]
scores = step_logits
scores = logits_processor(input_ids, scores) # repetition penalty, min length, etc.
scores = logits_warper(input_ids, scores) # temperature, top_k, top_p, etc.
# sample / argmax from `scores`
Key point:
outputs.logits[:, -1, :] = raw logits from the model head.
scores after logits_processor / logits_warper = processed scores used for sampling.
The official docs call these "logits processors" / "logits warpers" and describe them exactly as functions that modify the logits output of the model before sampling. (Hugging Face)
So your suspicion is correct: out.scores is not guaranteed to be the raw LM-head output. With sampling + top-p, top-k, temperature, etc. it is definitely modified.
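If you want to see the raw values for yourself, a single forward pass outside of generate gives you exactly the first quantity above. This is a minimal sketch that reuses your existing model and encoded inputs:

import torch

# Raw next-token logits for your prompt from one forward pass, with no decoding logic involved.
with torch.no_grad():
    outputs = model(**encoded)

raw_next_token_logits = outputs.logits[:, -1, :]   # [batch, vocab], straight from the LM head
print(raw_next_token_logits.shape)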
2. The intended solution: output_logits=True
Recent transformers versions (roughly v4.38 and later) added a flag that makes generate return the raw logits directly.
Use:
out = model.generate(
    **encoded,
    do_sample=True,
    temperature=temperature,
    top_p=0.9,
    return_dict_in_generate=True,
    output_scores=True,    # processed scores (after processors/warpers)
    output_logits=True,    # raw logits (before processors/warpers)
    pad_token_id=tokenizer.pad_token_id,
    max_new_tokens=5000,
)

raw_logits = out.logits    # tuple of [batch_size, vocab_size] per step
scores = out.scores        # tuple of [batch_size, vocab_size] per step (processed)
The PR that introduced this states explicitly:
output_logits behaves like output_scores, but returns the raw, unprocessed prediction logit scores, i.e. the values before they are processed by logits processors / warpers. (SemanticDiff)
And the generation utilities docs describe:
scores: processed prediction scores.
logits: unprocessed prediction scores (when output_logits=True). (Hugging Face)
So for your exact question:
"unprocessed logits (i.e. before top-k, top-p, temperature is applied) during decoding"
Answer: use output_logits=True and read out.logits instead of out.scores.
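As a quick sanity check (a sketch, assuming the generate call above with do_sample=True), the first raw-logits step and the first scores step should differ whenever any warper is active:

import torch

raw0 = out.logits[0]    # [batch, vocab], straight from the LM head
proc0 = out.scores[0]   # [batch, vocab], after temperature / top-k / top-p

print(torch.allclose(raw0, proc0))       # typically False once warpers are active
print(torch.isinf(proc0).any().item())   # top-p / top-k mask filtered tokens with -inf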
3. How the shapes look and how to use them
With return_dict_in_generate=True, output_logits=True, output_scores=True:
out.logits is a tuple of length num_generated_tokens.
Each element: Tensor[batch_size, vocab_size] = raw logits at that step.
out.scores is the same structure, but after processors/warpers.
Example: compute log-probs for the generated tokens from raw logits:
import torch
# stack over time: [batch, steps, vocab]
logits_tensor = torch.stack(out.logits, dim=1)
log_probs = torch.log_softmax(logits_tensor, dim=-1)
# extract generated token ids (excluding prompt)
sequences = out.sequences # [batch, prompt_len + steps]
gen_len = len(out.logits)
gen_tokens = sequences[:, -gen_len:] # [batch, steps]
# gather per-token log-probs
chosen_log_probs = log_probs.gather(
    -1, gen_tokens.unsqueeze(-1)
).squeeze(-1)                            # [batch, steps]
This uses the raw logits, so no top-k/top-p/temperature have been applied yet.
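If you also want the log-probs of the processed (sampling-time) distribution for comparison, the same gather works on out.scores. A sketch continuing the snippet above:

# Same computation on the processed scores, for comparison.
scores_tensor = torch.stack(out.scores, dim=1)            # [batch, steps, vocab]
proc_log_probs = torch.log_softmax(scores_tensor, dim=-1)

chosen_proc_log_probs = proc_log_probs.gather(
    -1, gen_tokens.unsqueeze(-1)
).squeeze(-1)                                             # [batch, steps]

# Tokens removed by top-k / top-p have -inf here; the sampled tokens themselves are finite.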
4. If your version does not support output_logits
If you are on an older transformers version where output_logits is not available, there are two common patterns.
4.1. Easiest: temporarily disable all processing
If you turn off all decoding tricks and use greedy decoding, then scores and raw logits effectively coincide:
out = model.generate(
    **encoded,
    do_sample=False,       # greedy
    temperature=1.0,
    top_k=0,
    top_p=1.0,
    return_dict_in_generate=True,
    output_scores=True,
)

# In this very limited case, out.scores ≈ raw logits step by step.
This matches how HF docs and forum answers describe scores: for simple greedy decoding with no extra processors, scores equals "logits for the next token, step by step". With more complex decoding, scores diverge from raw logits. (Hugging Face Forums)
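If you want to verify this on your setup, compare the first score step against a plain forward pass (a small sketch, assuming the greedy generate call above and the same encoded inputs):

import torch

with torch.no_grad():
    ref = model(**encoded).logits[:, -1, :]   # raw next-token logits for the prompt

# With greedy decoding and no active processors, the first step of out.scores
# should match the raw logits up to numerical noise.
print(torch.allclose(out.scores[0], ref, atol=1e-4))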
You cannot use this trick if you actually need temperature / top-p at generation time.
4.2. Manual decoding loop
You can copy the core of GenerationMixin.sample into your own loop and grab outputs.logits yourself:
import torch
from transformers import (
    LogitsProcessorList,
    TemperatureLogitsWarper,
    TopKLogitsWarper,
    TopPLogitsWarper,
)

def custom_generate_raw_logits(model, input_ids, generation_config):
    # Build the processor/warper chain explicitly; the private helpers
    # (model._get_logits_processor / _get_logits_warper) change signature between versions.
    logits_processor = LogitsProcessorList()   # add e.g. RepetitionPenaltyLogitsProcessor here if needed
    logits_warper = LogitsProcessorList()
    if generation_config.temperature is not None and generation_config.temperature != 1.0:
        logits_warper.append(TemperatureLogitsWarper(generation_config.temperature))
    if generation_config.top_k is not None and generation_config.top_k > 0:
        logits_warper.append(TopKLogitsWarper(generation_config.top_k))
    if generation_config.top_p is not None and generation_config.top_p < 1.0:
        logits_warper.append(TopPLogitsWarper(generation_config.top_p))

    all_raw = []
    all_processed = []
    generated = input_ids

    with torch.no_grad():
        for _ in range(generation_config.max_new_tokens):
            outputs = model(generated)                 # full forward pass (no KV cache, for clarity)
            raw = outputs.logits[:, -1, :]             # raw LM-head logits for the next token
            all_raw.append(raw.clone())
            scores = logits_processor(generated, raw)  # repetition penalty, min length, ...
            scores = logits_warper(generated, scores)  # temperature, top-k, top-p, ...
            all_processed.append(scores.clone())
            probs = scores.softmax(dim=-1)
            next_token = probs.multinomial(num_samples=1)
            generated = torch.cat([generated, next_token], dim=-1)

    return generated, all_raw, all_processed
This mirrors what generate does internally, but you have explicit access to both raw and processed scores at each step. The HF source and issues around logits processors confirm this flow. (Hugging Face)
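A hypothetical call, reusing your encoded inputs and mirroring your generate settings (note that max_new_tokens must be set on the config, since the loop reads it directly):

from transformers import GenerationConfig

gen_config = GenerationConfig(
    do_sample=True,
    max_new_tokens=50,
    temperature=0.8,
    top_p=0.9,
    top_k=0,
)

sequences, raw_logits, processed_scores = custom_generate_raw_logits(
    model, encoded["input_ids"], gen_config
)
print(len(raw_logits), raw_logits[0].shape)   # num generated steps, [batch, vocab]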
5. Summary for your code
Your current code:
out = model.generate(
    **encoded,
    do_sample=True,
    temperature=temperature,
    top_p=0.9,
    return_dict_in_generate=True,
    output_scores=True,
    pad_token_id=tokenizer.pad_token_id,
    max_new_tokens=5000,
)

logits = out.scores    # <- processed, not raw
What you want:
out = model.generate(
    **encoded,
    do_sample=True,
    temperature=temperature,
    top_p=0.9,
    return_dict_in_generate=True,
    output_scores=True,    # keep if you still want processed scores
    output_logits=True,    # add this
    pad_token_id=tokenizer.pad_token_id,
    max_new_tokens=5000,
)

raw_logits = out.logits    # before temperature / top-k / top-p / penalties
scores = out.scores        # after all decoding transformations
If output_logits is not recognized in your version, upgrade transformers or implement your own decoding loop as in section 4.
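If you want to check programmatically whether your installed version has the flag, one quick, hedged check is to look for the corresponding GenerationConfig field (it only exists in versions that support it):

import transformers
from transformers import GenerationConfig

print(transformers.__version__)
supports_output_logits = hasattr(GenerationConfig(), "output_logits")
print(supports_output_logits)   # True -> you can pass output_logits=True to generate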