I ran a few experiments on a Colab free-tier GPU. While there are workarounds, it also seems possible that PEFT isn't currently behaving as expected:
## Short answer

Your understanding is close, but the important correction is: PEFT is not simply "expecting vision/audio to be on GPU." The real problem is that PEFT adapter loading triggers a second Accelerate dispatch/hook pass over a bitsandbytes 4-bit, partially CPU-offloaded Gemma 4 model, and that path is fragile.
You are hitting two different failure modes:

- **Without `device_map` in PEFT:** PEFT calls Accelerate dispatch hooks; Accelerate asks a bitsandbytes `Linear4bit` module for `state_dict()`; bitsandbytes tries to serialize nested/double-quant state; that nested quant state contains a meta tensor; `.item()` on a meta tensor fails.
- **With `device_map` in PEFT:** PEFT does a second dispatch using a device map that was meant for the base model, not the PEFT-wrapped model. The model loads farther, but Gemma 4 generation breaks because Gemma 4's shared KV cache bookkeeping loses an expected source-layer entry, causing `KeyError: 22`.
So the answer is: you can use PEFT with some offload-related options, but passing the same `device_map` into `PeftModel.from_pretrained()` is not the right fix for Gemma 4. It changes the dispatch layout and can break Gemma 4's shared-KV generation path.
## What is happening in the first error
Your original PEFT load is:

```python
model = PeftModel.from_pretrained(
    base_model,
    lora_path,
    adapter_name=lora_source_client_name,
    is_trainable=False,
)
```
The failure happens after PEFT starts loading the adapter:

```
PeftModel.from_pretrained
→ load_adapter
→ accelerate.dispatch_model
→ attach_align_device_hook_on_blocks
→ attach_execution_device_hook
→ module.state_dict()
→ bitsandbytes Linear4bit._save_to_state_dict
→ self.weight.quant_state.as_dict(packed=True)
→ "nested_offset": self.offset.item()
→ Tensor.item() cannot be called on meta tensors
```
That traceback is very specific. The failing tensor is not a LoRA adapter tensor. It is inside bitsandbytes’ 4-bit quantization state.
The key line is:

```python
"nested_offset": self.offset.item()
```
That is tied to nested/double quantization. You enabled:

```python
bnb_4bit_use_double_quant=True
```

In bitsandbytes, nested quantization stores extra quantization-state fields, and `QuantState.as_dict(packed=True)` serializes those fields. The bitsandbytes source contains the relevant `QuantState` packing logic, including nested quant-state serialization. (GitHub)
PyTorch’s meta device is not a real data-holding device. Meta tensors carry shape/dtype metadata but no values, so data-dependent operations like .item() are invalid. That is why the exception says Tensor.item() cannot be called on meta tensors. (PyTorch Docs)
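The failure class is easy to reproduce in isolation; this small sketch triggers the same exception outside of any PEFT/bitsandbytes machinery:

```python
import torch

# Meta tensors carry only shape/dtype metadata, so any data-dependent
# operation raises; this is the same exception as in the traceback.
t = torch.empty(1, device="meta")
t.item()  # RuntimeError: Tensor.item() cannot be called on meta tensors
```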
So the first error means: a bitsandbytes nested quantization scalar is still on the meta device when Accelerate/PEFT asks bitsandbytes to serialize the `state_dict`.
That is a library interaction issue:

- PEFT adapter loading
- Accelerate dispatch hooks
- bitsandbytes 4-bit `Linear4bit`
- double quant / nested `quant_state`
- CPU/GPU offload / meta placeholders

It is not simply "vision/audio are on CPU, PEFT wants them on GPU."
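If you want to confirm which modules still hold a nested-quant scalar on meta before the PEFT load, here is a hedged inspection sketch (attribute names follow current bitsandbytes `Linear4bit`/`Params4bit`; `offset` only exists when double quant is enabled):

```python
import bitsandbytes as bnb

# Walk the quantized base model and report any nested-quant offset tensors
# still on the meta device — the ones .item() would choke on.
for name, module in base_model.named_modules():
    if isinstance(module, bnb.nn.Linear4bit):
        qs = getattr(module.weight, "quant_state", None)
        offset = getattr(qs, "offset", None) if qs is not None else None
        if offset is not None and offset.device.type == "meta":
            print(name, "has a meta nested-quant offset")
```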
## Why `device_map={"": 0}` works
When you use:

```python
device_map = {"": 0}
```

you avoid the fragile cross-device path. Everything is on GPU 0:

- no CPU-offloaded towers
- no CPU/disk hooks for those modules
- less meta placeholder machinery
- less redispatch complexity during PEFT load
That does not prove the model or adapter are wrong. It proves that the all-GPU path avoids the failure surface.
This distinction matters:
| Setup | What happens |
| --- | --- |
| `device_map={"": 0}` | No split dispatch; PEFT works. |
| custom CPU/GPU map | Accelerate offload hooks are involved; PEFT triggers redispatch; bitsandbytes quant-state/meta problems appear. |
## Why passing `device_map` to PEFT causes `KeyError: 22`
You tried:

```python
model = PeftModel.from_pretrained(
    base_model,
    lora_path,
    adapter_name=lora_source_client_name,
    device_map=device_map,
    is_trainable=False,
)
```
That gets past the first loading problem, but generation fails later:

```
key_states, value_states = shared_kv_states[self.kv_shared_layer_index]
KeyError: 22
```
This is a different failure.
Gemma 4 has a shared KV cache architecture. In Gemma 4, the last num_kv_shared_layers decoder layers do not compute their own key/value projections; they reuse K/V tensors from an earlier non-shared layer of the same attention type. Hugging Face’s Gemma 4 blog describes this shared-KV-cache optimization explicitly. (Hugging Face)
So inside generation, Gemma 4 needs something like this:

```python
shared_kv_states[source_layer_index] = (key_states, value_states)
```

Then later:

```python
key_states, value_states = shared_kv_states[self.kv_shared_layer_index]
```
Your error says the later layer expected `shared_kv_states[22]`, but that key did not exist.
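To make the failure mode concrete, here is a toy sketch of the bookkeeping (illustrative only, not the actual Gemma 4 implementation): the `KeyError` fires when a sharing layer reads before its source layer has stored anything.

```python
# Toy model of shared-KV bookkeeping. If hooks/dispatch prevent the source
# layer from storing its entry, the consumer layer raises KeyError.
shared_kv_states = {}

def run_layer(idx, kv_shared_layer_index=None):
    if kv_shared_layer_index is not None:
        # Shared layer: reuses K/V from an earlier source layer.
        return shared_kv_states[kv_shared_layer_index]  # KeyError: 22 if missing
    k, v = f"K{idx}", f"V{idx}"  # stand-ins for real key/value projections
    shared_kv_states[idx] = (k, v)
    return k, v

run_layer(22)                             # source layer populates entry 22
run_layer(30, kv_shared_layer_index=22)   # sharing layer reads it back
```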
That means the layer that should have populated `shared_kv_states[22]` either:
- did not run in the expected way;
- ran under a hook/layout that did not capture/store the expected K/V state;
- had its execution order/state propagation changed by the second dispatch;
- or had Gemma 4’s shared-state bookkeeping disrupted by PEFT/Accelerate wrapper hooks.
The important point: passing `device_map` into PEFT changes the PEFT-wrapped model's dispatch/hook structure. For Gemma 4, that can break the shared-KV path during generation. This is why `device_map` in PEFT is not a good fix even if it avoids the original loading error.
## Why PEFT `device_map` is not equivalent to base-model `device_map`
Your base model is loaded like this:

```python
base_model = Gemma4ForConditionalGeneration.from_pretrained(
    ...,
    device_map=device_map,
    offload_folder=...,
)
```
That is the correct place to put the base model’s device map.
After PEFT wrapping, module names and structure are different. PEFT wraps the model, often under paths like `base_model.model...` and `base_model.model.model...`. So the same raw map:

```python
{
    "model.vision_tower": "cpu",
    "model.audio_tower": "cpu",
    "": 0,
}
```

may not mean the same thing after wrapping.
Accelerate’s device_map is module-name based and recursively applies placement to submodules. Its docs describe dispatch_model as spreading modules across GPU, CPU, or disk according to a device map. (GitHub)
So there are two different operations:

- **Base `from_pretrained(..., device_map=...)`**: initial placement of the base model.
- **`PeftModel.from_pretrained(..., device_map=...)`**: redispatch of the PEFT-wrapped model.

The second one is not just "offload PEFT too." It can reattach hooks and alter runtime behavior. You can see the renaming directly, as in the sketch below.
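A quick way to see the renaming (a sketch using the `base_model` and PEFT-wrapped `model` from this answer): the same submodules acquire an extra prefix after wrapping, so raw map keys no longer line up.

```python
# Compare the first few module paths before and after PEFT wrapping.
# The PEFT-wrapped model nests the original model under an extra prefix,
# so keys like "model.vision_tower" no longer match verbatim.
print([n for n, _ in base_model.named_modules() if n][:5])
print([n for n, _ in model.named_modules() if n][:5])
```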
## What PEFT offload options actually do

PEFT does have offload-related controls. The PEFT docs/source mention `ephemeral_gpu_offload`, which can be used when loading adapters with partially offloaded modules. (GitHub)
A safer PEFT call shape is:

```python
model = PeftModel.from_pretrained(
    base_model,
    lora_path,
    adapter_name=lora_source_client_name,
    is_trainable=False,
    offload_dir=r"E:\Folder\offload_temp",
    offload_buffers=True,
    ephemeral_gpu_offload=True,
    torch_device="cuda:0",
)
```
Notice what is not there: `device_map=device_map`.
This is the distinction:

| Option | Meaning |
| --- | --- |
| `device_map` on base `from_pretrained` | Places the base model modules across CPU/GPU/disk. |
| `offload_folder` on base `from_pretrained` | Folder for base-model offload during Transformers loading. |
| `offload_dir` on PEFT load | Folder used by Accelerate/PEFT redispatch if offloaded modules are involved. |
| `offload_buffers=True` | Also offload buffers when hooks need it. |
| `ephemeral_gpu_offload=True` | Temporarily move offloaded pieces to GPU when needed. |
| `device_map` on PEFT | Re-dispatches the PEFT-wrapped model; risky here. |
There are known PEFT issue patterns around `offload_dir` not being propagated/found during `PeftModel.from_pretrained()` on offloaded base models. (GitHub)

So yes, you can try to make PEFT loading offload-aware, but I would do it with `offload_dir`, `offload_buffers`, and `ephemeral_gpu_offload`, not by passing the original base `device_map` again.
## The `llm_int8_enable_fp32_cpu_offload=True` confusion

This setting is confusing. The name says int8, but in the current Transformers/bitsandbytes integration, CPU/disk dispatch with quantized models often uses this flag as the gate that allows some modules to remain in full precision on CPU when a custom `device_map` contains CPU/disk entries. The Transformers bitsandbytes docs discuss CPU/GPU offload under the bitsandbytes quantization workflow, and public error messages for this path instruct users to enable `llm_int8_enable_fp32_cpu_offload=True` when modules are dispatched to CPU/disk. (Hugging Face)
For your use case (4-bit + custom CPU/GPU `device_map`), I would keep it enabled unless you move back to all-GPU.
So:

**All-GPU path:**

```python
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```

**CPU/GPU split path:**

```python
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    llm_int8_enable_fp32_cpu_offload=True,
)
```
## The `bnb_4bit_use_double_quant=True` part

Your original `Tensor.item()` error points very strongly at this setting:

```python
bnb_4bit_use_double_quant=True
```

Double quantization is normally useful because it saves additional memory, but it also creates nested quantization state. Your traceback fails specifically while bitsandbytes serializes nested quant-state metadata:

```python
"nested_offset": self.offset.item()
```

So as a diagnostic, test this:

```python
bnb_4bit_use_double_quant=False
```
If the first error disappears with double quant off, then the issue is specifically: bitsandbytes nested `quant_state` + PEFT/Accelerate redispatch + meta/offload.

If it still fails, then the broader issue is: bitsandbytes `Params4bit`/`Linear4bit` + PEFT/Accelerate redispatch + offloaded base model.
Either way, the failure is still in the quantized/offloaded dispatch stack, not simply in PEFT’s preference for GPU placement.
## The `model.multi_modal_projector` key may not be valid
Gemma 4 is multimodal. The official Transformers docs describe the base Gemma 4 model as comprising a vision backbone, an audio backbone, and a language model; the conditional-generation model includes the language modeling head. (Hugging Face)
But exact implementation module names matter.
You used:

```python
"model.multi_modal_projector": "cpu"
```

I would not assume this key exists for every Gemma 4 checkpoint/implementation.
Before using it, run:

```python
for name, module in base_model.named_modules():
    lname = name.lower()
    if any(k in lname for k in ["vision", "audio", "project", "embed", "multi"]):
        print(name, type(module).__name__)
```

Then confirm:

```python
print(base_model.hf_device_map)
```
If the module key does not exist, Accelerate may ignore it or warn that it does not match any submodule. The fallback `"": 0` will place everything else on GPU, but your mental model of what was offloaded will be wrong.
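A hedged sanity check for both problems: confirm that every map key names a real submodule, then show what Accelerate actually offloaded (`hf_device_map` is only set when Accelerate dispatched the model).

```python
# 1) Every device_map key (except the "" fallback) should name a real module.
module_names = {name for name, _ in base_model.named_modules()}
for key in device_map:
    if key and key not in module_names:
        print("device_map key matches no module:", key)

# 2) Show what actually landed on CPU/disk after dispatch.
for name, device in base_model.hf_device_map.items():
    if str(device) in ("cpu", "disk"):
        print("offloaded:", name, "->", device)
```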
For Gemma 4, bridge-like names may be closer to:

- `model.embed_vision`
- `model.embed_audio`
- `model.audio_tower.output_proj`

depending on the specific implementation.
For first stable testing, I would use only:

```python
device_map = {
    "model.vision_tower": "cpu",
    "model.audio_tower": "cpu",
    "": 0,
}
```
Then add bridge/projector modules only after verifying exact names.
## The memory warning matters

This warning is important:

```
no modules could be assigned to device 0 due to insufficient memory:
0: 5668601858 bytes required
```
That means Accelerate’s current allocation attempt already believes GPU 0 needs about 5.7 GB more free memory for the proposed dispatch plan.
This is not the same as the Python exception, but it is a warning sign. It says the model placement plan is already under memory pressure.
Under memory pressure, the system is more likely to use CPU/disk offload and meta placeholders. That increases the chance that PEFT/Accelerate/bitsandbytes redispatch enters an unsupported or fragile path.
So treat the memory warning as part of the diagnosis:

1. The dispatch plan is already tight.
2. PEFT loading then triggers another dispatch/hook pass.
3. That pass touches bnb quant-state/meta tensors.
4. The process crashes.
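Before retrying, it is worth quantifying the headroom against the roughly 5.7 GB the planner said it still needed; a quick check (sketch):

```python
import torch

# torch.cuda.mem_get_info returns (free, total) in bytes for the device.
free, total = torch.cuda.mem_get_info(0)
print(f"GPU 0 free: {free / 1e9:.2f} GB of {total / 1e9:.2f} GB")
```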
## Recommended code shape

### Base model loading

Use a raw Windows path:

```python
OFFLOAD_DIR = r"E:\Folder\offload_temp"
```

Then:
```python
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=False,  # first diagnostic; turn on later
    bnb_4bit_compute_dtype=torch.bfloat16,
    llm_int8_enable_fp32_cpu_offload=True,
)

device_map = {
    "model.vision_tower": "cpu",
    "model.audio_tower": "cpu",
    "": 0,
}
```
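The load call below also passes `max_memory`, which the original snippet references but never defines; a hedged example shape (the values are illustrative and must be tuned to your hardware):

```python
# Per-device caps for Accelerate's placement planner; keys are GPU indices
# or "cpu", values are human-readable sizes. Illustrative numbers only.
max_memory = {0: "5GiB", "cpu": "24GiB"}
```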
```python
base_model = Gemma4ForConditionalGeneration.from_pretrained(
    MODEL_REGISTRY[model_id_to_load],
    quantization_config=quant_config,
    device_map=device_map,
    max_memory=max_memory,
    offload_folder=OFFLOAD_DIR,
    dtype=torch.bfloat16,
    attn_implementation="sdpa",
    trust_remote_code=False,
    low_cpu_mem_usage=True,
)
```
I changed three things:

1. `bnb_4bit_use_double_quant=False` for the first diagnostic;
2. `"model.multi_modal_projector": "cpu"` removed until verified;
3. `low_cpu_mem_usage=True` because you are explicitly operating near memory/offload boundaries.
### PEFT loading

Do not pass `device_map` here. Use:
```python
model = PeftModel.from_pretrained(
    base_model,
    lora_path,
    adapter_name=lora_source_client_name,
    is_trainable=False,
    offload_dir=OFFLOAD_DIR,
    offload_buffers=True,
    ephemeral_gpu_offload=True,
    torch_device="cuda:0",
)
```
Then:

```python
model.eval()
model.config.use_cache = True
if hasattr(model, "generation_config") and model.generation_config is not None:
    model.generation_config.use_cache = True
```
### Input placement for generation

For mixed CPU/GPU dispatched models, avoid blindly doing:

```python
inputs = inputs.to("cuda:0")
```

Instead place token tensors on the input embedding device:
```python
def get_input_embedding_device(model):
    # Try the wrapped model first, then the PEFT/base wrappers underneath.
    # The inner hasattr check is nested so we never touch model.base_model
    # on a model that does not have one.
    candidates = [model]
    if hasattr(model, "base_model"):
        candidates.append(model.base_model)
        if hasattr(model.base_model, "model"):
            candidates.append(model.base_model.model)
    for obj in candidates:
        try:
            emb = obj.get_input_embeddings()
            if emb is not None and hasattr(emb, "weight"):
                return emb.weight.device
        except Exception:
            pass
    return torch.device("cuda:0")


input_device = get_input_embedding_device(model)
for key, value in inputs.items():
    if torch.is_tensor(value):
        inputs[key] = value.to(input_device)
```
```python
outputs = model.generate(
    **inputs,
    max_new_tokens=max_new_tokens,
    use_cache=True,
)
```
This does not solve the PEFT loading bug, but it avoids a separate CPU/CUDA mismatch after loading succeeds.
## Should you create an issue?

Yes. This is not just a configuration question. You have two issue-worthy failures.
### Issue 1: PEFT load on offloaded 4-bit Gemma 4 fails through bitsandbytes/meta state

Suggested title:

> PeftModel.from_pretrained on offloaded 4-bit Gemma4 hits bitsandbytes nested QuantState meta tensor
Include this core traceback:

```
PeftModel.from_pretrained
→ load_adapter
→ dispatch_model
→ attach_execution_device_hook
→ module.state_dict()
→ bitsandbytes Linear4bit._save_to_state_dict
→ weight.quant_state.as_dict(packed=True)
→ "nested_offset": self.offset.item()
→ Tensor.item() cannot be called on meta tensors
```
Likely owners:

| Component | Why |
| --- | --- |
| bitsandbytes | The failing `.item()` call is inside bnb quant-state serialization. |
| Accelerate | It calls `state_dict()` while attaching dispatch hooks. |
| PEFT | It triggers the redispatch during adapter loading. |
| Transformers | It integrates Gemma 4, bnb quantization, and device maps. |

I would file first at Transformers or PEFT, and mention bitsandbytes/Accelerate in the issue body. If you can reduce it to a pure `Linear4bit` / `dispatch_model` repro, then bitsandbytes or Accelerate becomes the better primary repo.
### Issue 2: Passing `device_map` into PEFT breaks Gemma 4 shared-KV generation

Suggested title:

> Gemma4 + PeftModel.from_pretrained(device_map=...) breaks shared_kv_states during generate: KeyError 22
Include:

```python
PeftModel.from_pretrained(..., device_map=device_map)
```

then:

```
generate()
→ Gemma4 self_attn
→ shared_kv_states[self.kv_shared_layer_index]
→ KeyError: 22
```
Likely owners:

| Component | Why |
| --- | --- |
| Transformers | `shared_kv_states` is Gemma 4 model logic. |
| PEFT | The issue appears after PEFT wrapping/adapter loading. |
| Accelerate | The second dispatch/hook layout is likely the trigger. |
I would file this one at Transformers, because the final failure is Gemma 4 model logic, then cross-link PEFT/Accelerate if maintainers request it.
## What I would not do

### Do not pass the same device map to PEFT

Avoid:
```python
model = PeftModel.from_pretrained(
    base_model,
    lora_path,
    adapter_name=lora_source_client_name,
    device_map=device_map,
    is_trainable=False,
)
```
It is not equivalent to “offload PEFT too.” It can trigger a second dispatch of the wrapped model and break Gemma 4 shared-KV state.
### Do not use `merge_and_unload()` in this path

This line is not the direct cause here:

```python
if isinstance(base_model, PeftModel):
    base_model = base_model.merge_and_unload()
```
But merging/unloading adapters into quantized/offloaded models is another fragile path. Keep it out of the first stable inference path.
### Do not offload bridge/projector modules until verified

Start with:

```python
"model.vision_tower": "cpu",
"model.audio_tower": "cpu",
"": 0,
```
Only add bridge modules after verifying exact names and after PEFT loading works.
## Practical decision tree

### If you need it working now

Use:

```python
device_map = {"": 0}
```

That is the known-good operational path. If E4B does not fit all-GPU with your adapter, try `bnb_4bit_use_double_quant=False`, or use a smaller Gemma 4 variant for runtime.
### If you need CPU-offloaded vision/audio

Try this sequence:

1. Base model with vision/audio on CPU only; `bnb_4bit_use_double_quant=False`.
2. No PEFT `device_map`.
3. PEFT with `offload_dir`, `offload_buffers=True`, `ephemeral_gpu_offload=True`; `use_cache=True`.
4. Inputs placed on the embedding device.
If that still fails, it is not a configuration issue anymore; it is a current compatibility bug.
### If you want the cleanest bug report

Use a minimal matrix:

| Case | Expected |
| --- | --- |
| all GPU + PEFT | works |
| split base without PEFT | works |
| split base + PEFT without PEFT `device_map` | `Tensor.item()` meta or `Params4bit` failure |
| split base + PEFT with PEFT `device_map` | `KeyError: 22` during generate |
That is a strong report.
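A hedged harness sketch for running that matrix; `load_base` and `load_peft` are hypothetical helpers wrapping the `from_pretrained` calls shown earlier, and `SPLIT_MAP` is the vision/audio-on-CPU map:

```python
# Hypothetical repro harness: each case loads fresh, optionally wraps with
# PEFT, then runs a tiny generate. Failures print instead of aborting the run.
SPLIT_MAP = {"model.vision_tower": "cpu", "model.audio_tower": "cpu", "": 0}
cases = {
    "all_gpu_peft":        ({"": 0}, {}),
    "split_no_peft":       (SPLIT_MAP, None),
    "split_peft_no_map":   (SPLIT_MAP, {}),
    "split_peft_with_map": (SPLIT_MAP, {"device_map": SPLIT_MAP}),
}
for case, (dmap, peft_kwargs) in cases.items():
    try:
        model = load_base(dmap)                      # hypothetical helper
        if peft_kwargs is not None:
            model = load_peft(model, **peft_kwargs)  # hypothetical helper
        model.generate(**inputs, max_new_tokens=8)
        print(case, "OK")
    except Exception as exc:
        print(case, "FAILED:", type(exc).__name__, exc)
```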
## Final answer to your direct question

**"Am I offloading vision/audio to CPU?"**

Yes, at least for the keys that actually exist:

```python
"model.vision_tower": "cpu"
"model.audio_tower": "cpu"
```

But verify `model.multi_modal_projector`; that key may not exist or may not be the correct bridge module name.
**"Does PEFT expect those to be on GPU?"**

No, not in the simple sense. The real issue is that PEFT adapter loading triggers a second dispatch/hook process over a quantized, partially offloaded model. That path currently interacts badly with bitsandbytes 4-bit quantization state and Gemma 4's shared-KV architecture.
**"Can I offload on PEFT as well?"**

Partly, yes, but not by passing the same `device_map`. Use PEFT offload arguments:

- `offload_dir=...`
- `offload_buffers=True`
- `ephemeral_gpu_offload=True`
- `torch_device="cuda:0"`

Do not pass `device_map=device_map` for Gemma 4 PEFT generation.
**"Do I need to create an issue?"**

Yes. This is issue-worthy. The strongest issue is:

> PeftModel.from_pretrained on CPU/GPU-dispatched 4-bit Gemma4 fails during Accelerate hook attachment because bitsandbytes nested QuantState contains a meta tensor.

The second issue is:

> Passing device_map to PeftModel.from_pretrained on Gemma4 causes a shared_kv_states KeyError during generate.

Both are real cross-library edge cases, not just user error.
## Short summary

- The all-GPU path works because it avoids split-dispatch/offload hooks.
- The first error is caused by PEFT/Accelerate touching bitsandbytes 4-bit nested quant-state while something is still on meta.
- The second error is caused by passing `device_map` into PEFT, which can break Gemma 4's shared-KV generation bookkeeping.
- PEFT does not simply require vision/audio on GPU.
- Do not pass the same `device_map` to PEFT.
- Use `offload_dir`, `offload_buffers=True`, and `ephemeral_gpu_offload=True` instead.
- Start with only `vision_tower` and `audio_tower` on CPU; verify any projector/bridge module name before offloading it.
- File issues with Transformers/PEFT/Accelerate/bitsandbytes; this is a genuine integration bug surface.