Seems Whisper is special…?
1) The root cause of your AttributeError
You turned on a training-script flag (freeze_feature_encoder=True) that assumes the model object implements a method named freeze_feature_encoder().
That assumption is true for several wav2vec2/Hubert/WavLM-style models (raw waveform in → conv “feature encoder” → transformer), and the HF example scripts call it directly (e.g., the CTC script literally does if model_args.freeze_feature_encoder: model.freeze_feature_encoder()). (GitHub)
But WhisperForConditionalGeneration does not provide freeze_feature_encoder(), so you get:
AttributeError: 'WhisperForConditionalGeneration' object has no attribute 'freeze_feature_encoder'
This is not you doing something wrong; it’s a mismatch between:
- a generic example-script option (freeze_feature_encoder) (GitHub), and
- what Whisper exposes as an API.
In the seq2seq example, there is also a separate flag/branch for freezing the encoder (not the “feature encoder”), and the script calls model.freeze_encoder() when that path is used. (GitHub)
So the “why” is simply: Whisper doesn’t have that method, because Whisper doesn’t have that kind of module (details next).
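Paraphrased in Python, the relevant branching inside the example script looks roughly like this (a sketch of the logic described above, not a verbatim quote; check your script version for the exact code):
# Sketch of the example-script logic (paraphrased):
if model_args.freeze_feature_encoder:    # wav2vec2-style knob
    model.freeze_feature_encoder()       # -> AttributeError for WhisperForConditionalGeneration
if model_args.freeze_encoder:            # separate knob; this one is meaningful for Whisper
    model.freeze_encoder()               # freezes the whole Whisper encoder stack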
2) “Feature encoder” means different things in different speech model families
A) wav2vec2/Hubert/WavLM family (raw waveform models)
These models take raw audio samples and run them through a learned convolutional frontend often called:
- “feature extractor”, “feature encoder”, or “conv feature encoder”
Freezing that part is common enough that Transformers has a dedicated API, and even deprecates an older name in favor of freeze_feature_encoder. (GitHub)
So in that family, “freeze feature encoder” has a very concrete meaning:
Don’t update the learned conv frontend that turns waveform → frame-level features.
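In that family the method really exists; a minimal sketch (facebook/wav2vec2-base is just an example checkpoint):
from transformers import Wav2Vec2ForCTC

w2v2 = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base")
w2v2.freeze_feature_encoder()   # freezes the conv waveform frontend; this API exists on this model family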
B) Whisper family (log-Mel spectrogram models)
Whisper does not learn the waveform → spectrogram transform.
Whisper consumes a log-Mel spectrogram (80 mel channels for most checkpoints, 128 for large-v3) computed outside the transformer, then applies a small learned “stem” (two conv layers) followed by the transformer encoder blocks.
So in Whisper, the closest equivalents are:
- Feature extraction (log-Mel): done by the processor/feature extractor code, deterministic, no trainable params → nothing to “freeze”.
- Conv stem (2 conv layers): trainable, but tiny compared to the transformer.
- Transformer encoder blocks: the big expensive part that learns acoustic/language-relevant representations.
- Decoder: the autoregressive text decoder that also acts like an audio-conditional language model. (Hugging Face)
That’s why Transformers’ Whisper class tends to talk about freezing the encoder, not a “feature encoder”.
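You can see this split directly in code (a sketch; openai/whisper-large-v3 is just an example checkpoint):
from transformers import WhisperFeatureExtractor, WhisperForConditionalGeneration

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-large-v3")  # log-Mel step: no trainable params
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

encoder = model.get_encoder()
print(type(encoder.conv1), type(encoder.conv2))   # the small learned conv stem
print(len(encoder.layers))                        # the transformer encoder blocks (the expensive part)
print(len(model.get_decoder().layers))            # the autoregressive text decoder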
3) Why Whisper doesn’t implement freeze_feature_encoder()
The practical reason
In Whisper, there isn’t a distinct “feature encoder module” in the same sense as wav2vec2-style models.
The paper describes:
- log-Mel computation,
- then two conv layers,
- then the transformer encoder.
The trainable part that’s analogous to a wav2vec2 “feature encoder” would only be those two conv layers—but freezing them:
- saves very little compute (most compute is in transformer blocks),
- freezes very few parameters (relative to Whisper Large),
- is not the standard “freeze feature encoder” knob people mean in ASR discussions.
So Transformers exposes freeze_encoder() (freeze the whole encoder stack), which is a meaningful lever for Whisper fine-tuning, and the seq2seq example script has a path that uses it. (GitHub)
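To put rough numbers on that, compare the conv stem to the full encoder (a sketch, reusing the model loaded in the earlier snippet):
encoder = model.get_encoder() if hasattr(model, "get_encoder") else model.model.encoder

conv_stem = sum(p.numel() for p in encoder.conv1.parameters()) + \
            sum(p.numel() for p in encoder.conv2.parameters())
full_encoder = sum(p.numel() for p in encoder.parameters())

print(f"conv stem:    {conv_stem / 1e6:.1f}M parameters")
print(f"full encoder: {full_encoder / 1e6:.1f}M parameters")
# Freezing only conv1/conv2 touches a tiny slice; freeze_encoder() freezes the whole stack.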
The conceptual reason (your “general voice features vs word brain” framing)
That framing fits wav2vec2-style pipelines a bit better, but it’s still oversimplified—and for Whisper it’s especially misleading:
- The encoder is not just “generic voice features”; it’s heavily involved in mapping acoustic patterns into the representations the decoder uses to emit language-specific tokens.
- The decoder is not just “word recognition brain”; it also supplies strong language-modeling bias (Whisper’s robustness is partly attributed to a strong decoder LM).
So “freeze feature encoder to protect general voice recognition” is not a well-supported Whisper-specific recommendation.
4) What to do instead (and what “best practice” looks like)
Option 1 (most direct): don’t use freeze_feature_encoder for Whisper
Set it to False, because Whisper doesn’t implement the corresponding method.
If you want to freeze something substantial, use the script’s freeze-encoder path (or do it manually).
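If you are editing the script (or writing your own loop) anyway, a small guard makes the wav2vec2-style flag a harmless no-op on models that lack the method (a sketch, reusing the model_args/model names from the paraphrased script logic above):
if getattr(model_args, "freeze_feature_encoder", False):
    if hasattr(model, "freeze_feature_encoder"):
        model.freeze_feature_encoder()   # wav2vec2/HuBERT/WavLM-style models
    else:
        # Whisper and other models without a conv waveform frontend land here
        print("freeze_feature_encoder not available for this model; skipping")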
Option 2: freeze the encoder (common Whisper knob)
A common approach is to set requires_grad=False on the encoder parameters, as shown in Whisper Large v3 model discussions. (Hugging Face)
Practical patterns (robust across Transformers variants):
# works even if model.freeze_encoder() isn't present
encoder = model.get_encoder() if hasattr(model, "get_encoder") else model.model.encoder
for p in encoder.parameters():
    p.requires_grad = False
If your script supports --freeze_encoder True, that’s usually cleaner than manual code. (GitHub)
Option 3: partial freezing (often better than “all encoder”)
A frequently effective compromise:
- freeze lower encoder layers (more acoustic/phonetic),
- fine-tune upper encoder layers + decoder (more task/language/domain adaptation).
Example:
encoder = model.get_encoder() if hasattr(model, "get_encoder") else model.model.encoder
N = 16 # example: freeze first N layers
for layer in encoder.layers[:N]:
    for p in layer.parameters():
        p.requires_grad = False
Option 4: LoRA / PEFT (efficient and commonly used for Whisper)
Instead of freezing huge blocks, keep base weights frozen and train small adapters. A practical starting point is PEFT notebooks for Whisper fine-tuning. (Google Colab)
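A minimal LoRA sketch with the peft library, assuming model is the loaded WhisperForConditionalGeneration (q_proj/v_proj match Whisper’s attention projection names; treat r/alpha/dropout as placeholders to tune):
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # Whisper attention projections
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the small adapter weights remain trainable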
5) About Gemini’s “freeze conv1/conv2” workaround
Freezing model.encoder.conv1/conv2 is freezing Whisper’s conv stem (the “2× Conv1D + GELU” part in the paper figure).
This is not the same as freezing a wav2vec2 “feature encoder”:
- it affects a tiny fraction of parameters,
- it won’t make training much more efficient,
- it’s not clearly connected to “language understanding improvements”.
It can be treated as a mild regularizer experiment, but it is not a standard Whisper best practice.
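For reference, the workaround boils down to this (a sketch):
encoder = model.get_encoder() if hasattr(model, "get_encoder") else model.model.encoder
for conv in (encoder.conv1, encoder.conv2):   # the conv stem only
    for p in conv.parameters():
        p.requires_grad = False               # a tiny fraction of the model's parameters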
6) Evidence-based guidance for your Latvian case (40h Common Voice)
Two points matter most:
A) Latvian is multilingual ASR, not English-only
There’s a long-running practical observation from Whisper users: for multilingual ASR, training the encoder can be necessary for good results when adapting. (GitHub)
That directly argues against blanket-freezing the encoder for your scenario.
B) You already got a big WER/WIL improvement
That’s a sign your current setup is learning useful acoustic–token mapping and/or domain adaptation. Freezing too aggressively can:
- reduce further gains,
- or shift errors (e.g., better LM fluency but worse phonetic discrimination).
A good experimental ladder (minimal wasted time)
Keep everything else fixed and run 3 short ablations:
- Baseline (what worked): fine-tune normally.
- Decoder-heavy: freeze encoder (all), train decoder only.
- Hybrid: freeze only lower encoder layers (plus optionally conv stem), train upper encoder + decoder.
Pick the winner by WER on your true “unseen” test set, and if you care about robustness, add a second test set with different speakers/recording conditions.
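To keep the three runs comparable, it can help to drive them from one freezing helper (a sketch; apply_freezing and the split point n_frozen are hypothetical names/choices, not a library API):
def apply_freezing(model, mode: str, n_frozen: int = 16):
    """Apply one of the three ablation settings to a Whisper model."""
    encoder = model.get_encoder() if hasattr(model, "get_encoder") else model.model.encoder
    if mode == "decoder_heavy":
        for p in encoder.parameters():              # freeze the whole encoder
            p.requires_grad = False
    elif mode == "hybrid":
        for layer in encoder.layers[:n_frozen]:     # freeze only the lower encoder layers
            for p in layer.parameters():
                p.requires_grad = False
    # mode == "baseline": leave everything trainable
    return model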
7) Batch size 64 → 32 and “halve LR”: what’s actually true
There isn’t a universal rule that a smaller batch “generalizes better”. What is commonly observed:
- If you halve the effective batch size, you often need to retune the learning rate (many people scale the LR roughly with batch size for stability).
- Smaller batches add gradient noise, which sometimes helps escape sharp minima but can also just make optimization noisier and worse.
Given you already have strong gains, treat this as a standard hyperparameter sweep, not a principle.
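If you do drop from 64 to 32, a common heuristic starting point (not a rule) is to scale the learning rate with the effective batch size and then sweep around it; the base values below are hypothetical:
base_lr, base_bs = 1e-5, 64            # hypothetical values from the run that worked
new_bs = 32
new_lr = base_lr * new_bs / base_bs    # linear-scaling heuristic -> 5e-6
# Use new_lr as the center of a small sweep (e.g. 0.5x / 1x / 2x), not as a prescription.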
8) Similar cases & good references
Core architecture / grounding
- Whisper paper (explains log-Mel input, conv stem, encoder/decoder roles).
- Transformers Whisper docs (Whisper as encoder–decoder pretrained on 680k hours; decoder maps encoder reps → text). (Hugging Face)
Hugging Face “how to fine-tune Whisper” (widely used baseline recipe)
- HF blog: Fine-Tune Whisper For Multilingual ASR with Transformers (walkthrough + code). (Hugging Face)
“Should I fine-tune only decoder?” (directly relevant to your freeze idea)
- OpenAI Whisper discussion: reports decoder-only worked less well for multilingual; encoder training was needed. (GitHub)
Concrete “how do I freeze Whisper layers?”
- Whisper Large v3 model discussion giving a working example of freezing encoder parameters. (Hugging Face)
Why the flag exists in scripts at all (the wav2vec2-style meaning)
- Transformers ASR scripts calling model.freeze_feature_encoder() (CTC example). (GitHub)
- Note about freeze_feature_encoder being the preferred API name (vs. the deprecated freeze_feature_extractor). (GitHub)
Practical industry-style guide (optional, but concrete)
- Diabolocom research article focusing on Whisper fine-tuning pitfalls/choices. (Diabolocom)
9) Quick “do this” summary
- For Whisper: set freeze_feature_encoder=False (it’s a wav2vec2-style knob, not a Whisper knob). (GitHub)
- If you want freezing: prefer freeze_encoder (the whole encoder) or partial layer freezing. (GitHub)
- For Latvian: do not assume encoder freezing helps; multilingual adaptation often benefits from training at least part of the encoder. (GitHub)