On recent versions of Transformers, you can often simply pass output_logits=True.
You are correct: out.scores holds the post-processed scores. To get the raw logits before temperature/top-k/top-p/etc. are applied, use output_logits=True in generate, or fall back to a manual loop / forward pass if your version does not support it.
I will keep this focused on exactly what you need.
1. What happens inside generate
At each decoding step, generate does roughly:
outputs = model(**model_inputs)
step_logits = outputs.logits[:, -1, :] # raw LM-head logits, [batch, vocab]
scores = step_logits
scores = logits_processor(input_ids, scores) # repetition penalty, min length, etc.
scores = logits_warper(input_ids, scores) # temperature, top_k, top_p, etc.
# sample / argmax from `scores`
Key point:
outputs.logits[:, -1, :] = raw logits from the model head.
scores after logits_processor / logits_warper = processed scores used for sampling.
The official docs call these "logits processors" / "logits warpers" and describe them exactly as functions that modify the logits output of the model before sampling. (Hugging Face)
So your suspicion is correct: out.scores is not guaranteed to be the raw LM-head output. With sampling + top-p, top-k, temperature, etc. it is definitely modified.
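If you want to see the raw values for yourself, a single forward pass outside of generate gives you exactly the first quantity above. This is a minimal sketch that reuses your existing model and encoded inputs:

import torch

# Raw next-token logits for your prompt from one forward pass, with no decoding logic involved.
with torch.no_grad():
    outputs = model(**encoded)

raw_next_token_logits = outputs.logits[:, -1, :]   # [batch, vocab], straight from the LM head
print(raw_next_token_logits.shape)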
2. The intended solution: output_logits=True
Recent transformers versions (roughly v4.38 and later) added a flag that makes generate return the raw logits directly.
Use:
out = model.generate(
    **encoded,
    do_sample=True,
    temperature=temperature,
    top_p=0.9,
    return_dict_in_generate=True,
    output_scores=True,    # processed scores (after processors/warpers)
    output_logits=True,    # raw logits (before processors/warpers)
    pad_token_id=tokenizer.pad_token_id,
    max_new_tokens=5000,
)

raw_logits = out.logits    # tuple of [batch_size, vocab_size] per step
scores = out.scores        # tuple of [batch_size, vocab_size] per step (processed)
The PR that introduced this states explicitly:
output_logits behaves like output_scores, but returns the raw, unprocessed prediction logit scores, i.e. the values before they are processed by logits processors / warpers. (SemanticDiff)
And the generation utilities docs describe:
scores: processed prediction scores.
logits: unprocessed prediction scores (when output_logits=True). (Hugging Face)
So for your exact question:
"unprocessed logits (i.e. before top-k, top-p, temperature is applied) during decoding"
Answer: use output_logits=True and read out.logits instead of out.scores.
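As a quick sanity check (a sketch, assuming the generate call above with do_sample=True), the first raw-logits step and the first scores step should differ whenever any warper is active:

import torch

raw0 = out.logits[0]    # [batch, vocab], straight from the LM head
proc0 = out.scores[0]   # [batch, vocab], after temperature / top-k / top-p

print(torch.allclose(raw0, proc0))       # typically False once warpers are active
print(torch.isinf(proc0).any().item())   # top-p / top-k mask filtered tokens with -inf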
3. How the shapes look and how to use them
With return_dict_in_generate=True, output_logits=True, output_scores=True:
out.logits is a tuple of length num_generated_tokens.
Each element: Tensor[batch_size, vocab_size] = raw logits at that step.
out.scores is the same structure, but after processors/warpers.
Example: compute log-probs for the generated tokens from raw logits:
import torch
# stack over time: [batch, steps, vocab]
logits_tensor = torch.stack(out.logits, dim=1)
log_probs = torch.log_softmax(logits_tensor, dim=-1)
# extract generated token ids (excluding prompt)
sequences = out.sequences # [batch, prompt_len + steps]
gen_len = len(out.logits)
gen_tokens = sequences[:, -gen_len:] # [batch, steps]
# gather per-token log-probs
chosen_log_probs = log_probs.gather(
    -1, gen_tokens.unsqueeze(-1)
).squeeze(-1)                            # [batch, steps]
This uses the raw logits, so no top-k/top-p/temperature have been applied yet.
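If you also want the log-probs of the processed (sampling-time) distribution for comparison, the same gather works on out.scores. A sketch continuing the snippet above:

# Same computation on the processed scores, for comparison.
scores_tensor = torch.stack(out.scores, dim=1)            # [batch, steps, vocab]
proc_log_probs = torch.log_softmax(scores_tensor, dim=-1)

chosen_proc_log_probs = proc_log_probs.gather(
    -1, gen_tokens.unsqueeze(-1)
).squeeze(-1)                                             # [batch, steps]

# Tokens removed by top-k / top-p have -inf here; the sampled tokens themselves are finite.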
4. If your version does not support output_logits
If you are on an older transformers version where output_logits is not available, there are two common patterns.
4.1. Easiest: temporarily disable all processing
If you turn off all decoding tricks and use greedy decoding, then scores and raw logits effectively coincide:
out = model.generate(
    **encoded,
    do_sample=False,       # greedy
    temperature=1.0,
    top_k=0,
    top_p=1.0,
    return_dict_in_generate=True,
    output_scores=True,
)

# In this very limited case, out.scores ≈ raw logits step by step.
This matches how HF docs and forum answers describe scores: for simple greedy decoding with no extra processors, scores equals "logits for the next token, step by step". With more complex decoding, scores diverge from raw logits. (Hugging Face Forums)
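If you want to verify this on your setup, compare the first score step against a plain forward pass (a small sketch, assuming the greedy generate call above and the same encoded inputs):

import torch

with torch.no_grad():
    ref = model(**encoded).logits[:, -1, :]   # raw next-token logits for the prompt

# With greedy decoding and no active processors, the first step of out.scores
# should match the raw logits up to numerical noise.
print(torch.allclose(out.scores[0], ref, atol=1e-4))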
You cannot use this trick if you actually need temperature / top-p at generation time.
4.2. Manual decoding loop
You can copy the core of GenerationMixin.sample into your own loop and grab outputs.logits yourself:
import torch
from transformers import (
    LogitsProcessorList,
    TemperatureLogitsWarper,
    TopKLogitsWarper,
    TopPLogitsWarper,
)

def custom_generate_raw_logits(model, input_ids, generation_config):
    # Build the processor/warper chain explicitly; the private helpers
    # (model._get_logits_processor / _get_logits_warper) change signature between versions.
    logits_processor = LogitsProcessorList()   # add e.g. RepetitionPenaltyLogitsProcessor here if needed
    logits_warper = LogitsProcessorList()
    if generation_config.temperature is not None and generation_config.temperature != 1.0:
        logits_warper.append(TemperatureLogitsWarper(generation_config.temperature))
    if generation_config.top_k is not None and generation_config.top_k > 0:
        logits_warper.append(TopKLogitsWarper(generation_config.top_k))
    if generation_config.top_p is not None and generation_config.top_p < 1.0:
        logits_warper.append(TopPLogitsWarper(generation_config.top_p))

    all_raw = []
    all_processed = []
    generated = input_ids

    with torch.no_grad():
        for _ in range(generation_config.max_new_tokens):
            outputs = model(generated)                 # full forward pass (no KV cache, for clarity)
            raw = outputs.logits[:, -1, :]             # raw LM-head logits for the next token
            all_raw.append(raw.clone())
            scores = logits_processor(generated, raw)  # repetition penalty, min length, ...
            scores = logits_warper(generated, scores)  # temperature, top-k, top-p, ...
            all_processed.append(scores.clone())
            probs = scores.softmax(dim=-1)
            next_token = probs.multinomial(num_samples=1)
            generated = torch.cat([generated, next_token], dim=-1)

    return generated, all_raw, all_processed
This mirrors what generate does internally, but you have explicit access to both raw and processed scores at each step. The HF source and issues around logits processors confirm this flow. (Hugging Face)
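A hypothetical call, reusing your encoded inputs and mirroring your generate settings (note that max_new_tokens must be set on the config, since the loop reads it directly):

from transformers import GenerationConfig

gen_config = GenerationConfig(
    do_sample=True,
    max_new_tokens=50,
    temperature=0.8,
    top_p=0.9,
    top_k=0,
)

sequences, raw_logits, processed_scores = custom_generate_raw_logits(
    model, encoded["input_ids"], gen_config
)
print(len(raw_logits), raw_logits[0].shape)   # num generated steps, [batch, vocab]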
5. Summary for your code
Your current code:
out = model.generate(
    **encoded,
    do_sample=True,
    temperature=temperature,
    top_p=0.9,
    return_dict_in_generate=True,
    output_scores=True,
    pad_token_id=tokenizer.pad_token_id,
    max_new_tokens=5000,
)

logits = out.scores    # <- processed, not raw
What you want:
out = model.generate(
    **encoded,
    do_sample=True,
    temperature=temperature,
    top_p=0.9,
    return_dict_in_generate=True,
    output_scores=True,    # keep if you still want processed scores
    output_logits=True,    # add this
    pad_token_id=tokenizer.pad_token_id,
    max_new_tokens=5000,
)

raw_logits = out.logits    # before temperature / top-k / top-p / penalties
scores = out.scores        # after all decoding transformations
If output_logits is not recognized in your version, upgrade transformers or implement your own decoding loop as in section 4.
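If you want to check programmatically whether your installed version has the flag, one quick, hedged check is to look for the corresponding GenerationConfig field (it only exists in versions that support it):

import transformers
from transformers import GenerationConfig

print(transformers.__version__)
supports_output_logits = hasattr(GenerationConfig(), "output_logits")
print(supports_output_logits)   # True -> you can pass output_logits=True to generate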