Seems Whisper is special…?
1) The root cause of your AttributeError
You turned on a training-script flag (freeze_feature_encoder=True) that assumes the model object implements a method named freeze_feature_encoder().
That assumption is true for several wav2vec2/Hubert/WavLM-style models (raw waveform in → conv “feature encoder” → transformer), and the HF example scripts call it directly (e.g., the CTC script literally does if model_args.freeze_feature_encoder: model.freeze_feature_encoder()). (GitHub)
But WhisperForConditionalGeneration does not provide freeze_feature_encoder(), so you get:
AttributeError: 'WhisperForConditionalGeneration' object has no attribute 'freeze_feature_encoder'
This is not you doing something wrong; it’s a mismatch between:
- a generic example-script option (freeze_feature_encoder) (GitHub), and
- what Whisper exposes as an API.
In the seq2seq example, there is also a separate flag/branch for freezing the encoder (not the “feature encoder”), and the script calls model.freeze_encoder() when that path is used. (GitHub)
So the “why” is simply: Whisper doesn’t have that method, because Whisper doesn’t have that kind of module (details next).
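Paraphrased in Python, the relevant branching inside the example script looks roughly like this (a sketch of the logic described above, not a verbatim quote; check your script version for the exact code):
# Sketch of the example-script logic (paraphrased):
if model_args.freeze_feature_encoder:    # wav2vec2-style knob
    model.freeze_feature_encoder()       # -> AttributeError for WhisperForConditionalGeneration
if model_args.freeze_encoder:            # separate knob; this one is meaningful for Whisper
    model.freeze_encoder()               # freezes the whole Whisper encoder stack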
2) “Feature encoder” means different things in different speech model families
A) wav2vec2/Hubert/WavLM family (raw waveform models)
These models take raw audio samples and run them through a learned convolutional frontend often called:
- “feature extractor”, “feature encoder”, or “conv feature encoder”
Freezing that part is common enough that Transformers has a dedicated API, and even deprecates an older name in favor of freeze_feature_encoder. (GitHub)
So in that family, “freeze feature encoder” has a very concrete meaning:
Don’t update the learned conv frontend that turns waveform → frame-level features.
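In that family the method really exists; a minimal sketch (facebook/wav2vec2-base is just an example checkpoint):
from transformers import Wav2Vec2ForCTC

w2v2 = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base")
w2v2.freeze_feature_encoder()   # freezes the conv waveform frontend; this API exists on this model family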
B) Whisper family (log-Mel spectrogram models)
Whisper does not learn the waveform → spectrogram transform.
Whisper consumes a log-Mel spectrogram (80 mel channels for most checkpoints, 128 for large-v3) computed outside the transformer, then applies a small learned “stem” (two conv layers) followed by the transformer encoder blocks.
So in Whisper, the closest equivalents are:
- Feature extraction (log-Mel): done by the processor/feature extractor code, deterministic, no trainable params → nothing to “freeze”.
- Conv stem (2 conv layers): trainable, but tiny compared to the transformer.
- Transformer encoder blocks: the big expensive part that learns acoustic/language-relevant representations.
- Decoder: the autoregressive text decoder that also acts like an audio-conditional language model. (Hugging Face)
That’s why Transformers’ Whisper class tends to talk about freezing the encoder, not a “feature encoder”.
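You can see this split directly in code (a sketch; openai/whisper-large-v3 is just an example checkpoint):
from transformers import WhisperFeatureExtractor, WhisperForConditionalGeneration

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-large-v3")  # log-Mel step: no trainable params
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

encoder = model.get_encoder()
print(type(encoder.conv1), type(encoder.conv2))   # the small learned conv stem
print(len(encoder.layers))                        # the transformer encoder blocks (the expensive part)
print(len(model.get_decoder().layers))            # the autoregressive text decoder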
3) Why Whisper doesn’t implement freeze_feature_encoder()
The practical reason
In Whisper, there isn’t a distinct “feature encoder module” in the same sense as wav2vec2-style models.
The paper describes:
- log-Mel computation,
- then two conv layers,
- then the transformer encoder.
The trainable part that’s analogous to a wav2vec2 “feature encoder” would only be those two conv layers—but freezing them:
- saves very little compute (most compute is in transformer blocks),
- freezes very few parameters (relative to Whisper Large),
- is not the standard “freeze feature encoder” knob people mean in ASR discussions.
So Transformers exposes freeze_encoder() (freeze the whole encoder stack), which is a meaningful lever for Whisper fine-tuning, and the seq2seq example script has a path that uses it. (GitHub)
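To put rough numbers on that, compare the conv stem to the full encoder (a sketch, reusing the model loaded in the earlier snippet):
encoder = model.get_encoder() if hasattr(model, "get_encoder") else model.model.encoder

conv_stem = sum(p.numel() for p in encoder.conv1.parameters()) + \
            sum(p.numel() for p in encoder.conv2.parameters())
full_encoder = sum(p.numel() for p in encoder.parameters())

print(f"conv stem:    {conv_stem / 1e6:.1f}M parameters")
print(f"full encoder: {full_encoder / 1e6:.1f}M parameters")
# Freezing only conv1/conv2 touches a tiny slice; freeze_encoder() freezes the whole stack.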
The conceptual reason (your “general voice features vs word brain” framing)
That framing fits wav2vec2-style pipelines a bit better, but it’s still oversimplified—and for Whisper it’s especially misleading:
- The encoder is not just “generic voice features”; it’s heavily involved in mapping acoustic patterns into the representations the decoder uses to emit language-specific tokens.
- The decoder is not just “word recognition brain”; it also supplies strong language-modeling bias (Whisper’s robustness is partly attributed to a strong decoder LM).
So “freeze feature encoder to protect general voice recognition” is not a well-supported Whisper-specific recommendation.
4) What to do instead (and what “best practice” looks like)
Option 1 (most direct): don’t use freeze_feature_encoder for Whisper
Set it to False, because Whisper doesn’t implement the corresponding method.
If you want to freeze something substantial, use the script’s freeze-encoder path (or do it manually).
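If you are editing the script (or writing your own loop) anyway, a small guard makes the wav2vec2-style flag a harmless no-op on models that lack the method (a sketch, reusing the model_args/model names from the paraphrased script logic above):
if getattr(model_args, "freeze_feature_encoder", False):
    if hasattr(model, "freeze_feature_encoder"):
        model.freeze_feature_encoder()   # wav2vec2/HuBERT/WavLM-style models
    else:
        # Whisper and other models without a conv waveform frontend land here
        print("freeze_feature_encoder not available for this model; skipping")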
Option 2: freeze the encoder (common Whisper knob)
A common approach is to set requires_grad=False on the encoder parameters, as shown in Whisper Large v3 model discussions. (Hugging Face)
Practical patterns (robust across Transformers variants):
# works even if model.freeze_encoder() isn't present
encoder = model.get_encoder() if hasattr(model, "get_encoder") else model.model.encoder
for p in encoder.parameters():
    p.requires_grad = False
If your script supports --freeze_encoder True, that’s usually cleaner than manual code. (GitHub)
Option 3: partial freezing (often better than “all encoder”)
A frequently effective compromise:
- freeze lower encoder layers (more acoustic/phonetic),
- fine-tune upper encoder layers + decoder (more task/language/domain adaptation).
Example:
encoder = model.get_encoder() if hasattr(model, "get_encoder") else model.model.encoder
N = 16 # example: freeze first N layers
for layer in encoder.layers[:N]:
    for p in layer.parameters():
        p.requires_grad = False
Option 4: LoRA / PEFT (efficient and commonly used for Whisper)
Instead of freezing huge blocks, keep base weights frozen and train small adapters. A practical starting point is PEFT notebooks for Whisper fine-tuning. (Google Colab)
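A minimal LoRA sketch with the peft library, assuming model is the loaded WhisperForConditionalGeneration (q_proj/v_proj match Whisper’s attention projection names; treat r/alpha/dropout as placeholders to tune):
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # Whisper attention projections
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the small adapter weights remain trainable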
5) About Gemini’s “freeze conv1/conv2” workaround
Freezing model.encoder.conv1/conv2 is freezing Whisper’s conv stem (the “2× Conv1D + GELU” part in the paper figure).
This is not the same as freezing a wav2vec2 “feature encoder”:
- it affects a tiny fraction of parameters,
- it won’t make training much more efficient,
- it’s not clearly connected to “language understanding improvements”.
It can be treated as a mild regularizer experiment, but it is not a standard Whisper best practice.
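For reference, the workaround boils down to this (a sketch):
encoder = model.get_encoder() if hasattr(model, "get_encoder") else model.model.encoder
for conv in (encoder.conv1, encoder.conv2):   # the conv stem only
    for p in conv.parameters():
        p.requires_grad = False               # a tiny fraction of the model's parameters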
6) Evidence-based guidance for your Latvian case (40h Common Voice)
Two points matter most:
A) Latvian is multilingual ASR, not English-only
There’s a long-running practical observation from Whisper users: for multilingual ASR, training the encoder can be necessary for good results when adapting. (GitHub)
That directly argues against blanket-freezing the encoder for your scenario.
B) You already got a big WER/WIL improvement
That’s a sign your current setup is learning useful acoustic–token mapping and/or domain adaptation. Freezing too aggressively can:
- reduce further gains,
- or shift errors (e.g., better LM fluency but worse phonetic discrimination).
A good experimental ladder (minimal wasted time)
Keep everything else fixed and run 3 short ablations:
- Baseline (what worked): fine-tune normally.
- Decoder-heavy: freeze encoder (all), train decoder only.
- Hybrid: freeze only lower encoder layers (plus optionally conv stem), train upper encoder + decoder.
Pick the winner by WER on your true “unseen” test set, and if you care about robustness, add a second test set with different speakers/recording conditions.
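To keep the three runs comparable, it can help to drive them from one freezing helper (a sketch; apply_freezing and the split point n_frozen are hypothetical names/choices, not a library API):
def apply_freezing(model, mode: str, n_frozen: int = 16):
    """Apply one of the three ablation settings to a Whisper model."""
    encoder = model.get_encoder() if hasattr(model, "get_encoder") else model.model.encoder
    if mode == "decoder_heavy":
        for p in encoder.parameters():              # freeze the whole encoder
            p.requires_grad = False
    elif mode == "hybrid":
        for layer in encoder.layers[:n_frozen]:     # freeze only the lower encoder layers
            for p in layer.parameters():
                p.requires_grad = False
    # mode == "baseline": leave everything trainable
    return model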
7) Batch size 64 → 32 and “halve LR”: what’s actually true
There isn’t a universal rule that a smaller batch “generalizes better”. What is commonly observed:
- If you halve the effective batch size, you often need to retune the learning rate (many people scale the LR roughly with batch size for stability).
- Smaller batches add gradient noise, which sometimes helps escape sharp minima but can also just make optimization noisier and worse.
Given you already have strong gains, treat this as a standard hyperparameter sweep, not a principle.
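If you do drop from 64 to 32, a common heuristic starting point (not a rule) is to scale the learning rate with the effective batch size and then sweep around it; the base values below are hypothetical:
base_lr, base_bs = 1e-5, 64            # hypothetical values from the run that worked
new_bs = 32
new_lr = base_lr * new_bs / base_bs    # linear-scaling heuristic -> 5e-6
# Use new_lr as the center of a small sweep (e.g. 0.5x / 1x / 2x), not as a prescription.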
8) Similar cases & good references
Core architecture / grounding
- Whisper paper (explains log-Mel input, conv stem, encoder/decoder roles).
- Transformers Whisper docs (Whisper as encoder–decoder pretrained on 680k hours; decoder maps encoder reps → text). (Hugging Face)
Hugging Face “how to fine-tune Whisper” (widely used baseline recipe)
- HF blog: Fine-Tune Whisper For Multilingual ASR with Transformers (walkthrough + code). (Hugging Face)
“Should I fine-tune only decoder?” (directly relevant to your freeze idea)
- OpenAI Whisper discussion: reports decoder-only worked less well for multilingual; encoder training was needed. (GitHub)
Concrete “how do I freeze Whisper layers?”
- Whisper Large v3 model discussion giving a working example of freezing encoder parameters. (Hugging Face)
Why the flag exists in scripts at all (the wav2vec2-style meaning)
- Transformers ASR scripts calling model.freeze_feature_encoder() (CTC example). (GitHub)
- Note about freeze_feature_encoder being the preferred API name (vs. the deprecated freeze_feature_extractor). (GitHub)
Practical industry-style guide (optional, but concrete)
- Diabolocom research article focusing on Whisper fine-tuning pitfalls/choices. (Diabolocom)
9) Quick “do this” summary
- For Whisper: set freeze_feature_encoder=False (it’s a wav2vec2-style knob, not a Whisper knob). (GitHub)
- If you want freezing: prefer freeze_encoder (the whole encoder) or partial layer freezing. (GitHub)
- For Latvian: do not assume encoder freezing helps; multilingual adaptation often benefits from training at least part of the encoder. (GitHub)