CPU offloading error scenario

I have a problem: I am trying to offload the vision and audio towers to the CPU via a device map, but I get the error below. If I change to device_map={"": 0}, everything works correctly. I am just checking whether somebody can reproduce it and whether I need to change anything, or whether I need to open an issue at PEFT, Transformers, Accelerate, or bitsandbytes.

If I understand the problem correctly, I am offloading model.vision_tower, model.multi_modal_projector, and model.audio_tower, but PEFT expects those to be on the GPU. Can't I offload with PEFT as well?

Thanks in advance to all

Model: Gemma 4 E4B IT

Device Map:
device_map = {
    "model.vision_tower": "cpu",
    "model.multi_modal_projector": "cpu",
    "model.audio_tower": "cpu",
    "": 0,  # This sets the rest of the model to GPU 0
}

Bits and Bytes Config:

    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    llm_int8_enable_fp32_cpu_offload=True,

Model Kwargs:

"dtype": torch.bfloat16,
"attn_implementation": "sdpa",
"trust_remote_code": False,
"low_cpu_mem_usage": False

Loading Code:

base_model = Gemma4ForConditionalGeneration.from_pretrained(
    MODEL_REGISTRY[model_id_to_load],
    quantization_config=quant_config,
    device_map=device_map,
    max_memory=max_memory,
    offload_folder=r"e:\Folder\offload_temp",
    **MODEL_KWARGS[model_id_to_load]
)

PEFT loading:

    if isinstance(base_model, PeftModel):
        base_model = base_model.merge_and_unload()

    model = PeftModel.from_pretrained(
        base_model,
        lora_path,
        adapter_name=lora_source_client_name,
        is_trainable=False
    )

Versions:
Transformers: v5.6.2
Accelerate: v1.14.0.dev0
BitsandBytes: v0.49.2
Torch: 2.8.0+cu129
Peft: v0.19.1

The error below happens after PEFT loading:

2026-04-24 11:35:10,080 | Worker (6904) | INFO | Based on the current allocation process, no modules could be assigned to the following devices due to insufficient memory:

  • 0: 5668601858 bytes required
    These minimum requirements are specific to this allocation attempt and may vary. Consider increasing the available memory for these devices to at least the specified minimum, or adjusting the model config.
    2026-04-24 11:35:12,470 | Worker (6904) | ERROR | Worker error: Tensor.item() cannot be called on meta tensors
Traceback (most recent call last):
  File "E:\Folder\inference_worker.py", line 414, in inference_worker_loop
    model = _worker_load_model(model_id_to_load, lora_source_client_name, supports_image, supports_audio)
  File "E:\Folder\inference_worker.py", line 344, in _worker_load_model
    model = PeftModel.from_pretrained(
  File "E:\Folder\gemma_env\Lib\site-packages\peft\peft_model.py", line 582, in from_pretrained
    load_result = model.load_adapter(
  File "E:\Folder\gemma_env\Lib\site-packages\peft\peft_model.py", line 1475, in load_adapter
    dispatch_model(
  File "E:\Folder\gemma_env\Lib\site-packages\accelerate\big_modeling.py", line 432, in dispatch_model
    attach_align_device_hook_on_blocks(
  File "E:\Folder\gemma_env\Lib\site-packages\accelerate\hooks.py", line 695, in attach_align_device_hook_on_blocks
    attach_align_device_hook_on_blocks(
  File "E:\Folder\gemma_env\Lib\site-packages\accelerate\hooks.py", line 695, in attach_align_device_hook_on_blocks
    attach_align_device_hook_on_blocks(
  File "E:\Folder\gemma_env\Lib\site-packages\accelerate\hooks.py", line 695, in attach_align_device_hook_on_blocks
    attach_align_device_hook_on_blocks(
  [Previous line repeated 3 more times]
  File "E:\Folder\gemma_env\Lib\site-packages\accelerate\hooks.py", line 677, in attach_align_device_hook_on_blocks
    attach_execution_device_hook(
  File "E:\Folder\gemma_env\Lib\site-packages\accelerate\hooks.py", line 470, in attach_execution_device_hook
    attach_execution_device_hook(
  File "E:\Folder\gemma_env\Lib\site-packages\accelerate\hooks.py", line 459, in attach_execution_device_hook
    if not hasattr(module, "_hf_hook") and len(module.state_dict()) > 0:
  File "E:\Folder\gemma_env\Lib\site-packages\torch\nn\modules\module.py", line 2260, in state_dict
    module.state_dict(
  File "E:\Folder\gemma_env\Lib\site-packages\torch\nn\modules\module.py", line 2257, in state_dict
    self._save_to_state_dict(destination, prefix, keep_vars)
  File "E:\Folder\gemma_env\Lib\site-packages\bitsandbytes\nn\modules.py", line 525, in _save_to_state_dict
    for k, v in self.weight.quant_state.as_dict(packed=True).items():
  File "E:\Folder\gemma_env\Lib\site-packages\bitsandbytes\functional.py", line 581, in as_dict
    "nested_offset": self.offset.item(),
  File "E:\Folder\gemma_env\Lib\site-packages\torch\_meta_registrations.py", line 7457, in meta_local_scalar_dense
    raise RuntimeError("Tensor.item() cannot be called on meta tensors")
RuntimeError: Tensor.item() cannot be called on meta tensors
2026-04-24 11:35:12,517 | Worker (15192) | ERROR | Worker returned error: Worker error: Tensor.item() cannot be called on meta tensors


If I add the device map to the PEFT load, I get the error below.

if isinstance(base_model, PeftModel):
    base_model = base_model.merge_and_unload()

model = PeftModel.from_pretrained(
    base_model,
    lora_path,
    adapter_name=lora_source_client_name,
    device_map=device_map,
    is_trainable=False
)

Error:

2026-04-24 13:38:40,528 | Worker (7392) | ERROR | Worker error: 22
Traceback (most recent call last):
  File "E:\Folder\inference_worker.py", line 511, in inference_worker_loop
    outputs = model.generate(
  File "E:\Folder\gemma_env\Lib\site-packages\peft\peft_model.py", line 2122, in generate
    outputs = self.base_model.generate(*args, **kwargs)
  File "E:\Folder\gemma_env\Lib\site-packages\torch\utils\_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "E:\Folder\gemma_env\Lib\site-packages\transformers\generation\utils.py", line 2543, in generate
    result = decoding_method(
  File "E:\Folder\gemma_env\Lib\site-packages\transformers\generation\utils.py", line 2736, in _sample
    outputs = self._prefill(
  File "E:\Folder\gemma_env\Lib\site-packages\transformers\generation\utils.py", line 3768, in _prefill
    return self(**model_inputs, return_dict=True)
  File "E:\Folder\gemma_env\Lib\site-packages\torch\nn\modules\module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "E:\Folder\gemma_env\Lib\site-packages\torch\nn\modules\module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "E:\Folder\gemma_env\Lib\site-packages\accelerate\hooks.py", line 192, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "E:\Folder\gemma_env\Lib\site-packages\transformers\utils\generic.py", line 887, in wrapper
    output = func(self, *args, **kwargs)
  File "E:\Folder\gemma_env\Lib\site-packages\transformers\models\gemma4\modeling_gemma4.py", line 2516, in forward
    outputs = self.model(
  File "E:\Folder\gemma_env\Lib\site-packages\torch\nn\modules\module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "E:\Folder\gemma_env\Lib\site-packages\torch\nn\modules\module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "E:\Folder\gemma_env\Lib\site-packages\accelerate\hooks.py", line 192, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "E:\Folder\gemma_env\Lib\site-packages\transformers\utils\generic.py", line 963, in wrapper
    output = func(self, *args, **kwargs)
  File "E:\Folder\gemma_env\Lib\site-packages\transformers\utils\generic.py", line 887, in wrapper
    output = func(self, *args, **kwargs)
  File "E:\Folder\gemma_env\Lib\site-packages\transformers\models\gemma4\modeling_gemma4.py", line 2374, in forward
    outputs = self.language_model(
  File "E:\Folder\gemma_env\Lib\site-packages\torch\nn\modules\module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "E:\Folder\gemma_env\Lib\site-packages\torch\nn\modules\module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "E:\Folder\gemma_env\Lib\site-packages\accelerate\hooks.py", line 192, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "E:\Folder\gemma_env\Lib\site-packages\transformers\utils\generic.py", line 963, in wrapper
    output = func(self, *args, **kwargs)
  File "E:\Folder\gemma_env\Lib\site-packages\transformers\utils\output_capturing.py", line 248, in wrapper
    outputs = func(self, *args, **kwargs)
  File "E:\Folder\gemma_env\Lib\site-packages\transformers\models\gemma4\modeling_gemma4.py", line 1675, in forward
    hidden_states = decoder_layer(
  File "E:\Folder\gemma_env\Lib\site-packages\transformers\modeling_layers.py", line 93, in __call__
    return super().__call__(*args, **kwargs)
  File "E:\Folder\gemma_env\Lib\site-packages\torch\nn\modules\module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "E:\Folder\gemma_env\Lib\site-packages\torch\nn\modules\module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "E:\Folder\gemma_env\Lib\site-packages\accelerate\hooks.py", line 192, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "E:\Folder\gemma_env\Lib\site-packages\transformers\models\gemma4\modeling_gemma4.py", line 1379, in forward
    hidden_states, _ = self.self_attn(
  File "E:\Folder\gemma_env\Lib\site-packages\torch\nn\modules\module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "E:\Folder\gemma_env\Lib\site-packages\torch\nn\modules\module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "E:\Folder\gemma_env\Lib\site-packages\accelerate\hooks.py", line 192, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "E:\Folder\gemma_env\Lib\site-packages\transformers\models\gemma4\modeling_gemma4.py", line 1219, in forward
    key_states, value_states = shared_kv_states[self.kv_shared_layer_index]
KeyError: 22
2026-04-24 13:38:40,553 | Worker (1696) | ERROR | Worker returned error: Worker error: 22


I ran a few experiments on Colab Free GPU. While there are workarounds, it also seems possible that PEFT isn’t currently behaving as expected:


Short answer

Your understanding is close, but the important correction is:

PEFT is not simply “expecting vision/audio to be on GPU.”
The real problem is that PEFT adapter loading triggers a second Accelerate dispatch/hook pass over a bitsandbytes 4-bit, partially CPU-offloaded Gemma 4 model, and that path is fragile.

You are hitting two different failure modes:

  1. Without device_map in PEFT
    PEFT calls Accelerate dispatch hooks; Accelerate asks a bitsandbytes Linear4bit module for state_dict(); bitsandbytes tries to serialize nested/double-quant state; that nested quant state contains a meta tensor; .item() on a meta tensor fails.

  2. With device_map in PEFT
    PEFT does a second dispatch using a device map that was meant for the base model, not the PEFT-wrapped model. The model loads farther, but Gemma 4 generation breaks because Gemma 4’s shared KV cache bookkeeping loses an expected source-layer entry, causing KeyError: 22.

So the answer is:

You can use PEFT with some offload-related options, but passing the same device_map into PeftModel.from_pretrained() is not the right fix for Gemma 4. It changes the dispatch layout and can break Gemma 4’s shared-KV generation path.


What is happening in the first error

Your original PEFT load is:

model = PeftModel.from_pretrained(
    base_model,
    lora_path,
    adapter_name=lora_source_client_name,
    is_trainable=False,
)

The failure happens after PEFT starts loading the adapter:

PeftModel.from_pretrained
→ load_adapter
→ accelerate.dispatch_model
→ attach_align_device_hook_on_blocks
→ attach_execution_device_hook
→ module.state_dict()
→ bitsandbytes Linear4bit._save_to_state_dict
→ self.weight.quant_state.as_dict(packed=True)
→ "nested_offset": self.offset.item()
→ Tensor.item() cannot be called on meta tensors

That traceback is very specific. The failing tensor is not a LoRA adapter tensor. It is inside bitsandbytes’ 4-bit quantization state.

The key line is:

"nested_offset": self.offset.item()

That is tied to nested/double quantization. You enabled:

bnb_4bit_use_double_quant=True

In bitsandbytes, nested quantization stores extra quantization-state fields, and QuantState.as_dict(packed=True) serializes those fields. The bitsandbytes source contains the relevant QuantState packing logic, including nested quant-state serialization. (GitHub)

PyTorch’s meta device is not a real data-holding device. Meta tensors carry shape/dtype metadata but no values, so data-dependent operations like .item() are invalid. That is why the exception says Tensor.item() cannot be called on meta tensors. (PyTorch Docs)
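As a quick self-contained illustration (a sketch, not taken from the original trace), the same class of error can be reproduced with a plain meta tensor:

import torch

# Meta tensors carry only shape/dtype metadata; they hold no values.
t = torch.empty(1, device="meta")
print(t.device)  # meta
t.item()         # RuntimeError: Tensor.item() cannot be called on meta tensors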

So the first error means:

A bitsandbytes nested quantization scalar is still on meta
when Accelerate/PEFT asks bitsandbytes to serialize state_dict.

That is a library interaction issue:

PEFT adapter loading
+ Accelerate dispatch hooks
+ bitsandbytes 4-bit Linear4bit
+ double quant / nested quant_state
+ CPU/GPU offload / meta placeholders

It is not simply:

vision/audio are on CPU, PEFT wants them on GPU

Why device_map={"": 0} works

When you use:

device_map = {"": 0}

you avoid the fragile cross-device path.

Everything is on GPU 0:

  • no CPU-offloaded towers
  • no CPU/disk hooks for those modules
  • less meta placeholder machinery
  • less redispatch complexity during PEFT load

That does not prove the model or adapter are wrong. It proves that the all-GPU path avoids the failure surface.

This distinction matters:

| Setup | What happens |
| --- | --- |
| `device_map={"": 0}` | No split dispatch; PEFT works. |
| custom CPU/GPU map | Accelerate offload hooks are involved; PEFT triggers redispatch; bitsandbytes quant-state/meta problems appear. |

Why passing device_map to PEFT causes KeyError: 22

You tried:

model = PeftModel.from_pretrained(
    base_model,
    lora_path,
    adapter_name=lora_source_client_name,
    device_map=device_map,
    is_trainable=False
)

That gets past the first loading problem, but generation fails later:

key_states, value_states = shared_kv_states[self.kv_shared_layer_index]
KeyError: 22

This is a different failure.

Gemma 4 has a shared KV cache architecture. In Gemma 4, the last num_kv_shared_layers decoder layers do not compute their own key/value projections; they reuse K/V tensors from an earlier non-shared layer of the same attention type. Hugging Face’s Gemma 4 blog describes this shared-KV-cache optimization explicitly. (Hugging Face)

So inside generation, Gemma 4 needs something like this:

shared_kv_states[source_layer_index] = (key_states, value_states)

Then later:

key_states, value_states = shared_kv_states[self.kv_shared_layer_index]

Your error says the later layer expected:

shared_kv_states[22]

but that key did not exist.

That means the layer that should have populated shared_kv_states[22] either:

  1. did not run in the expected way;
  2. ran under a hook/layout that did not capture/store the expected K/V state;
  3. had its execution order/state propagation changed by the second dispatch;
  4. or had Gemma 4’s shared-state bookkeeping disrupted by PEFT/Accelerate wrapper hooks.

The important point:

Passing device_map into PEFT changes the PEFT-wrapped model’s dispatch/hook structure. For Gemma 4, that can break the shared-KV path during generation.

This is why device_map in PEFT is not a good fix even if it avoids the original loading error.


Why PEFT device_map is not equivalent to base-model device_map

Your base model is loaded like this:

base_model = Gemma4ForConditionalGeneration.from_pretrained(
    ...,
    device_map=device_map,
    offload_folder=...,
)

That is the correct place to put the base model’s device map.

After PEFT wrapping, module names and structure are different. PEFT wraps the model, often under paths like:

base_model.model...
base_model.model.model...

So the same raw map:

{
    "model.vision_tower": "cpu",
    "model.audio_tower": "cpu",
    "": 0
}

may not mean the same thing after wrapping.

Accelerate’s device_map is module-name based and recursively applies placement to submodules. Its docs describe dispatch_model as spreading modules across GPU, CPU, or disk according to a device map. (GitHub)

So there are two different operations:

Base from_pretrained(..., device_map=...)
    initial placement of the base model

PeftModel.from_pretrained(..., device_map=...)
    redispatch of the PEFT-wrapped model

The second one is not just “offload PEFT too.” It can reattach hooks and alter runtime behavior.
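A quick way to see the name mismatch (a sketch, assuming both base_model and the PEFT-wrapped model are already loaded from the snippets above) is to compare module names before and after wrapping:

base_names = {name for name, _ in base_model.named_modules()}
peft_names = {name for name, _ in model.named_modules()}

# Names that only exist after PEFT wrapping, typically prefixed with "base_model.model."
print(sorted(peft_names - base_names)[:10])

A key like "model.vision_tower" that matched the base model may therefore no longer match anything once the model is wrapped.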


What PEFT offload options actually do

PEFT does have offload-related controls. The PEFT docs/source mention ephemeral_gpu_offload, which can be used when loading adapters with partially offloaded modules. (GitHub)

A safer PEFT call shape is:

model = PeftModel.from_pretrained(
    base_model,
    lora_path,
    adapter_name=lora_source_client_name,
    is_trainable=False,
    offload_dir=r"E:\Folder\offload_temp",
    offload_buffers=True,
    ephemeral_gpu_offload=True,
    torch_device="cuda:0",
)

Notice what is not there:

device_map=device_map

This is the distinction:

| Option | Meaning |
| --- | --- |
| `device_map` on base `from_pretrained` | Places the base model modules across CPU/GPU/disk. |
| `offload_folder` on base `from_pretrained` | Folder for base-model offload during Transformers loading. |
| `offload_dir` on PEFT load | Folder used by Accelerate/PEFT redispatch if offloaded modules are involved. |
| `offload_buffers=True` | Also offload buffers when hooks need it. |
| `ephemeral_gpu_offload=True` | Temporarily move offloaded pieces to GPU when needed. |
| `device_map` on PEFT | Re-dispatches the PEFT-wrapped model; risky here. |

There are known PEFT issue patterns around offload_dir not being propagated/found during PeftModel.from_pretrained() on offloaded base models. (GitHub)

So yes, you can try to make PEFT loading offload-aware, but I would do it with offload_dir, offload_buffers, and ephemeral_gpu_offload, not by passing the original base device_map again.


The llm_int8_enable_fp32_cpu_offload=True confusion

This setting is confusing:

llm_int8_enable_fp32_cpu_offload=True

The name says int8, but in current Transformers/bitsandbytes integration, CPU/disk dispatch with quantized models often uses this flag as the gate that allows some modules to remain in full precision on CPU when a custom device_map contains CPU/disk entries. The Transformers bitsandbytes docs discuss CPU/GPU offload under the bitsandbytes quantization workflow, and public error messages for this path instruct users to enable llm_int8_enable_fp32_cpu_offload=True when modules are dispatched to CPU/disk. (Hugging Face)

For your use case:

4-bit + custom CPU/GPU device_map

I would keep it enabled unless you move back to all-GPU.

So:

All-GPU path

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

CPU/GPU split path

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    llm_int8_enable_fp32_cpu_offload=True,
)

The bnb_4bit_use_double_quant=True part

Your original Tensor.item() error points very strongly at this setting:

bnb_4bit_use_double_quant=True

Double quantization is normally useful because it saves additional memory. But it also creates nested quantization state. Your traceback fails specifically while bitsandbytes serializes nested quant-state metadata:

"nested_offset": self.offset.item()

So as a diagnostic, test this:

bnb_4bit_use_double_quant=False

If the first error disappears with double quant off, then the issue is specifically:

bitsandbytes nested quant_state
+ PEFT/Accelerate redispatch
+ meta/offload

If it still fails, then the broader issue is:

bitsandbytes Params4bit / Linear4bit
+ PEFT/Accelerate redispatch
+ offloaded base model

Either way, the failure is still in the quantized/offloaded dispatch stack, not simply in PEFT’s preference for GPU placement.


The model.multi_modal_projector key may not be valid

Gemma 4 is multimodal. The official Transformers docs describe the base Gemma 4 model as comprising a vision backbone, an audio backbone, and a language model; the conditional-generation model includes the language modeling head. (Hugging Face)

But exact implementation module names matter.

You used:

"model.multi_modal_projector": "cpu"

I would not assume this key exists for every Gemma 4 checkpoint/implementation.

Before using it, run:

for name, module in base_model.named_modules():
    lname = name.lower()
    if any(k in lname for k in ["vision", "audio", "project", "embed", "multi"]):
        print(name, type(module).__name__)

Then confirm:

print(base_model.hf_device_map)

If the module key does not exist, Accelerate may ignore it or warn that it does not match any submodule. The fallback "": 0 will place everything else on GPU, but your mental model of what was offloaded will be wrong.

For Gemma 4, bridge-like names may be closer to:

model.embed_vision
model.embed_audio
model.audio_tower.output_proj

depending on the specific implementation.

For first stable testing, I would use only:

device_map = {
    "model.vision_tower": "cpu",
    "model.audio_tower": "cpu",
    "": 0,
}

Then add bridge/projector modules only after verifying exact names.
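A small sanity check along those lines (a sketch; it assumes base_model has already been loaded) is to confirm that every non-empty device_map key actually names a submodule:

module_names = {name for name, _ in base_model.named_modules()}

for key in device_map:
    if key and key not in module_names:
        print(f"device_map key {key!r} does not match any submodule")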


The memory warning matters

This warning is important:

no modules could be assigned to device 0 due to insufficient memory:
0: 5668601858 bytes required

That means Accelerate’s current allocation attempt already believes GPU 0 needs about 5.7 GB more free memory for the proposed dispatch plan.

This is not the same as the Python exception, but it is a warning sign. It says the model placement plan is already under memory pressure.

Under memory pressure, the system is more likely to use CPU/disk offload and meta placeholders. That increases the chance that PEFT/Accelerate/bitsandbytes redispatch enters an unsupported or fragile path.

So treat the memory warning as part of the diagnosis:

The dispatch plan is already tight.
PEFT loading then triggers another dispatch/hook pass.
That pass touches bnb quant-state/meta tensors.
The process crashes.
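To quantify that pressure before loading, one option (a sketch, not from the original post) is to query the free VRAM directly:

import torch

# Free vs. total memory on GPU 0, in GiB, before building the dispatch plan.
free_bytes, total_bytes = torch.cuda.mem_get_info(0)
print(f"GPU 0 free: {free_bytes / 2**30:.2f} GiB of {total_bytes / 2**30:.2f} GiB")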

Recommended code shape

Base model loading

Use a raw Windows path:

OFFLOAD_DIR = r"E:\Folder\offload_temp"

Then:

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=False,  # first diagnostic; turn on later
    bnb_4bit_compute_dtype=torch.bfloat16,
    llm_int8_enable_fp32_cpu_offload=True,
)

device_map = {
    "model.vision_tower": "cpu",
    "model.audio_tower": "cpu",
    "": 0,
}

base_model = Gemma4ForConditionalGeneration.from_pretrained(
    MODEL_REGISTRY[model_id_to_load],
    quantization_config=quant_config,
    device_map=device_map,
    max_memory=max_memory,
    offload_folder=OFFLOAD_DIR,
    dtype=torch.bfloat16,
    attn_implementation="sdpa",
    trust_remote_code=False,
    low_cpu_mem_usage=True,
)

I changed three things:

  1. bnb_4bit_use_double_quant=False for the first diagnostic;
  2. "model.multi_modal_projector": "cpu" removed until verified;
  3. low_cpu_mem_usage=True because you are explicitly operating near memory/offload boundaries.


PEFT loading

Do not pass device_map here.

Use:

model = PeftModel.from_pretrained(
    base_model,
    lora_path,
    adapter_name=lora_source_client_name,
    is_trainable=False,
    offload_dir=OFFLOAD_DIR,
    offload_buffers=True,
    ephemeral_gpu_offload=True,
    torch_device="cuda:0",
)

Then:

model.eval()

model.config.use_cache = True
if hasattr(model, "generation_config") and model.generation_config is not None:
    model.generation_config.use_cache = True

Input placement for generation

For mixed CPU/GPU dispatched models, avoid blindly doing:

inputs = inputs.to("cuda:0")

Instead place token tensors on the input embedding device:

def get_input_embedding_device(model):
    candidates = [model]

    if hasattr(model, "base_model"):
        candidates.append(model.base_model)
        if hasattr(model.base_model, "model"):
            candidates.append(model.base_model.model)

    for obj in candidates:
        try:
            emb = obj.get_input_embeddings()
            if emb is not None and hasattr(emb, "weight"):
                return emb.weight.device
        except Exception:
            pass

    return torch.device("cuda:0")


input_device = get_input_embedding_device(model)

for key, value in inputs.items():
    if torch.is_tensor(value):
        inputs[key] = value.to(input_device)

outputs = model.generate(
    **inputs,
    max_new_tokens=max_new_tokens,
    use_cache=True,
)

This does not solve the PEFT loading bug, but it avoids a separate CPU/CUDA mismatch after loading succeeds.


Should you create an issue?

Yes.

This is not just a configuration question. You have two issue-worthy failures.

Issue 1: PEFT load on offloaded 4-bit Gemma 4 fails through bitsandbytes/meta state

Suggested title:

PeftModel.from_pretrained on offloaded 4-bit Gemma4 hits bitsandbytes nested QuantState meta tensor

Include this core traceback:

PeftModel.from_pretrained
→ load_adapter
→ dispatch_model
→ attach_execution_device_hook
→ module.state_dict()
→ bitsandbytes Linear4bit._save_to_state_dict
→ weight.quant_state.as_dict(packed=True)
→ "nested_offset": self.offset.item()
→ Tensor.item() cannot be called on meta tensors

Likely owners:

| Component | Why |
| --- | --- |
| bitsandbytes | The failing `.item()` call is inside bnb quant-state serialization. |
| Accelerate | It calls `state_dict()` while attaching dispatch hooks. |
| PEFT | It triggers the redispatch during adapter loading. |
| Transformers | It integrates Gemma 4, bnb quantization, and device maps. |

I would file first at Transformers or PEFT, and mention bitsandbytes/Accelerate in the issue body. If you can reduce it to a pure Linear4bit / dispatch_model repro, then bitsandbytes or Accelerate becomes the better primary repo.


Issue 2: Passing device_map into PEFT breaks Gemma 4 shared-KV generation

Suggested title:

Gemma4 + PeftModel.from_pretrained(device_map=...) breaks shared_kv_states during generate: KeyError 22

Include:

PeftModel.from_pretrained(..., device_map=device_map)

then:

generate()
→ Gemma4 self_attn
→ shared_kv_states[self.kv_shared_layer_index]
→ KeyError: 22

Likely owners:

| Component | Why |
| --- | --- |
| Transformers | `shared_kv_states` is Gemma 4 model logic. |
| PEFT | The issue appears after PEFT wrapping/adapter loading. |
| Accelerate | The second dispatch/hook layout is likely the trigger. |

I would file this one at Transformers, because the final failure is Gemma 4 model logic, then cross-link PEFT/Accelerate if maintainers request it.


What I would not do

Do not pass the same device map to PEFT

Avoid:

model = PeftModel.from_pretrained(
    base_model,
    lora_path,
    adapter_name=lora_source_client_name,
    device_map=device_map,
    is_trainable=False,
)

It is not equivalent to “offload PEFT too.” It can trigger a second dispatch of the wrapped model and break Gemma 4 shared-KV state.

Do not use merge_and_unload() in this path

This line is not the direct cause here:

if isinstance(base_model, PeftModel):
    base_model = base_model.merge_and_unload()

But merging/unloading adapters into quantized/offloaded models is another fragile path. Keep it out of the first stable inference path.

Do not offload bridge/projector modules until verified

Start with:

"model.vision_tower": "cpu",
"model.audio_tower": "cpu",
"": 0,

Only add bridge modules after verifying exact names and after PEFT loading works.


Practical decision tree

If you need it working now

Use:

device_map = {"": 0}

That is the known-good operational path.

If E4B does not fit all-GPU with your adapter, try:

bnb_4bit_use_double_quant=False

or use a smaller Gemma 4 variant for runtime.

If you need CPU-offloaded vision/audio

Try this sequence:

  1. Base model with vision/audio CPU only.
  2. bnb_4bit_use_double_quant=False.
  3. No PEFT device_map.
  4. PEFT with offload_dir, offload_buffers=True, ephemeral_gpu_offload=True.
  5. use_cache=True.
  6. Inputs placed on embedding device.

If that still fails, it is not a configuration issue anymore; it is a current compatibility bug.

If you want the cleanest bug report

Use a minimal matrix:

| Case | Expected |
| --- | --- |
| all GPU + PEFT | works |
| split base without PEFT | works |
| split base + PEFT without PEFT device_map | `Tensor.item()` meta or Params4bit failure |
| split base + PEFT with PEFT device_map | `KeyError: 22` during generate |

That is a strong report.
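If it helps to automate that matrix, here is a rough sketch; load_case and inputs are hypothetical placeholders for your own per-configuration loading helper and a tiny prompt, not existing library functions:

results = {}
for case in ("all_gpu_peft", "split_no_peft", "split_peft_no_map", "split_peft_with_map"):
    try:
        model = load_case(case)                       # hypothetical helper per configuration
        _ = model.generate(**inputs, max_new_tokens=4)
        results[case] = "PASS"
    except Exception as exc:                          # record the failure class for the report
        results[case] = f"{type(exc).__name__}: {exc}"

print(results)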


Final answer to your direct question

“Am I offloading vision/audio to CPU?”

Yes, at least for the keys that actually exist:

"model.vision_tower": "cpu"
"model.audio_tower": "cpu"

But verify model.multi_modal_projector; that key may not exist or may not be the correct bridge module name.

“Does PEFT expect those to be on GPU?”

No, not in the simple sense.

The real issue is that PEFT adapter loading triggers a second dispatch/hook process over a quantized, partially offloaded model. That path currently interacts badly with bitsandbytes 4-bit quantization state and Gemma 4’s shared-KV architecture.

“Can I offload on PEFT as well?”

Partly, yes, but not by passing the same device_map.

Use PEFT/offload arguments:

offload_dir=...
offload_buffers=True
ephemeral_gpu_offload=True
torch_device="cuda:0"

Do not pass:

device_map=device_map

for Gemma 4 PEFT generation.

“Do I need to create an issue?”

Yes. This is issue-worthy.

The strongest issue is:

PeftModel.from_pretrained on CPU/GPU-dispatched 4-bit Gemma4 fails during Accelerate hook attachment because bitsandbytes nested QuantState contains meta tensor.

The second issue is:

Passing device_map to PeftModel.from_pretrained on Gemma4 causes shared_kv_states KeyError during generate.

Both are real cross-library edge cases, not just user error.


Short summary

  • The all-GPU path works because it avoids split-dispatch/offload hooks.
  • The first error is caused by PEFT/Accelerate touching bitsandbytes 4-bit nested quant-state while something is still on meta.
  • The second error is caused by passing device_map into PEFT, which can break Gemma 4’s shared-KV generation bookkeeping.
  • PEFT does not simply require vision/audio on GPU.
  • Do not pass the same device_map to PEFT.
  • Use offload_dir, offload_buffers=True, and ephemeral_gpu_offload=True instead.
  • Start with only vision_tower and audio_tower on CPU; verify any projector/bridge module name before offloading it.
  • File issues with Transformers/PEFT/Accelerate/bitsandbytes; this is a genuine integration bug surface.

Thanks for the feedback, I am working through it. I will report back.


I’ll post a draft of the issue for now (edited):


The actual issues worth raising are these, in this order.

Issue 1 — Primary: PEFT adapter loading fails only on an already CPU/GPU-dispatched bnb 4-bit Gemma 4 model

File first at: huggingface/transformers
Mention/cross-link: huggingface/peft, huggingface/accelerate, bitsandbytes-foundation/bitsandbytes

Suggested title

PeftModel.from_pretrained fails only on CPU/GPU-dispatched 4-bit Gemma4: Params4bit.__new__ got unexpected _is_hf_initialized

Why this is the strongest issue

This is the core, now updated with the latest Colab T4 evidence:

R05 all-GPU 4-bit:
PASS

R03 split CPU/GPU 4-bit:
FAIL_BNB_PARAM_CONSTRUCTOR
TypeError: Params4bit.__new__() got an unexpected keyword argument '_is_hf_initialized'

The important contrast is:

Works:
device_map = {"": 0}

Fails:
device_map = {
    "model.vision_tower": "cpu",
    "model.audio_tower": "cpu",
    "": 0,
}

This means the current primary failure is not PEFT + bitsandbytes 4-bit in general. The all-GPU path works. The failure is specific to loading a PEFT adapter on top of an already CPU/GPU-dispatched bitsandbytes 4-bit Gemma 4 base model.

The previous Gemma4ClippableLinear blocker is still important, but it is no longer the primary issue because the latest repro bypasses it by targeting the inner .linear modules in Gemma 4 multimodal towers.

Current environment

Runtime: Colab Free / Tesla T4
Python: 3.12.13
torch: 2.10.0+cu128
transformers: 5.6.2
accelerate: 1.13.0
peft: 0.19.1
bitsandbytes: 0.49.2
huggingface_hub: 1.12.0
safetensors: 0.7.0
torchao_importable: false

Model and adapter

Base model:
unsloth/gemma-4-E2B-it

Adapter:
Ayodele01/gemma-4-E2B-Gemini-3.1-Pro-Reasoning-Distill

Adapter probe:

adapter_model.safetensors keys: 786

audio: 72
language_like: 490
vision_or_patch: 224

This adapter is genuinely multimodal, so simply excluding vision/audio LoRA would change the adapter semantics.
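For reference, a probe like that can be reproduced with a short script (a sketch; the modality buckets are simple substring heuristics, not an official classification):

from safetensors import safe_open

counts = {"audio": 0, "vision_or_patch": 0, "language_like": 0}
with safe_open("adapter_model.safetensors", framework="pt") as f:
    for key in f.keys():
        if "audio" in key:
            counts["audio"] += 1
        elif "vision" in key or "patch" in key:
            counts["vision_or_patch"] += 1
        else:
            counts["language_like"] += 1

print(counts)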

Why Transformers first

Transformers is still the best first repo because this issue crosses:

  • Gemma 4 model integration;
  • bitsandbytes quantization integration;
  • PEFT adapter integration expectations;
  • device-map loading behavior;
  • Accelerate redispatch/hook behavior;
  • current _is_hf_initialized parameter reconstruction behavior.

Accelerate owns dispatch_model() and hook attachment. bitsandbytes owns Params4bit. PEFT triggers the adapter-loading path. But the user-facing break is in the Transformers + PEFT + bnb integration path.


Issue 2 — Supporting / separate if requested: PEFT target-module compatibility for Gemma4ClippableLinear

File at: huggingface/peft

Suggested title

Gemma4 multimodal LoRA adapters need inner .linear targeting or Gemma4ClippableLinear support

Why it is supporting, not primary

Before patching the local adapter config, PEFT fails earlier:

ValueError: Target module Gemma4ClippableLinear(...) is not supported.

However, the latest repro bypasses that blocker by targeting the inner .linear modules for Gemma 4 vision/audio towers. With that patch:

R05 all-GPU 4-bit:
PASS

So this is real, but it is not the primary split/offload failure.


Issue 3 — Deferred / separate only if you have the exact trace: PEFT device_map breaks Gemma 4 shared-KV generation

File first at: huggingface/transformers
Mention/cross-link: huggingface/peft, huggingface/accelerate

Suggested title

Gemma4 + PeftModel.from_pretrained(device_map=...) breaks shared_kv_states during generate

Why this is deferred

The latest Colab E2B matrix did not reproduce this failure:

R05 all-GPU, no PEFT device_map:
PASS

R03 split CPU/GPU, no PEFT device_map:
FAIL_BNB_PARAM_CONSTRUCTOR before generation

Only file the shared-KV issue if you attach a separate trace where adapter loading succeeds and generation fails here:

Gemma4Attention.forward
→ shared_kv_states[self.kv_shared_layer_index]
→ KeyError: 22

Optional Issue 4 — Docs / UX: PEFT offload-dir / offload-folder handling is confusing

File at: huggingface/peft

Suggested title

Clarify offload_dir/offload_folder behavior for PeftModel.from_pretrained on already-dispatched models

Why it is lower priority

This is useful, but it is not the core bug. The current core bug has a concrete split/offload Params4bit constructor trace.


What I would not file

Not this

CPU offloading is broken.

Too broad. The base model can load in split CPU/GPU form. The failure occurs during PEFT adapter loading on top of the already-dispatched quantized base model.

Better:

PeftModel.from_pretrained fails on an already CPU/GPU-dispatched bitsandbytes 4-bit Gemma 4 model during Accelerate hook setup.

Not this

PEFT expects vision/audio towers to be on GPU.

The all-GPU path works, and the split path fails in a bnb/Accelerate parameter reconstruction path. The evidence points to split/offload redispatch, not a generic PEFT requirement that vision/audio must be GPU-resident.

Not this as the main issue

Tensor.item() cannot be called on meta tensors

That may be a related double-quant / QuantState.as_dict() variant, but the latest controlled Colab matrix with double_quant=False lands on:

Params4bit.__new__() got an unexpected keyword argument '_is_hf_initialized'

Mention the nested_offset.item() issue only if you attach that exact separate trace.


Recommended filing plan

Best plan

Open one primary Transformers issue:

PeftModel.from_pretrained fails only on CPU/GPU-dispatched 4-bit Gemma4: Params4bit.__new__ got unexpected _is_hf_initialized

Include the Gemma4ClippableLinear patch as a “repro setup note,” not as the headline.

Then say:

I can split the Gemma4ClippableLinear target-module compatibility issue into a PEFT issue if maintainers prefer.

If you want the cleanest tracking

Open two separate issues:

  1. Transformers Issue A: split/offload bnb 4-bit + PEFT + Accelerate Params4bit.__new__(_is_hf_initialized) failure.
  2. PEFT Issue B: Gemma 4 multimodal LoRA adapters and Gemma4ClippableLinear target-module compatibility.

Do not open the shared-KV issue from the latest Colab matrix alone, because it was not reproduced there.


Key evidence to include

Include this exact contrast:

Works:
device_map = {"": 0}

Fails:
device_map = {
    "model.vision_tower": "cpu",
    "model.audio_tower": "cpu",
    "": 0,
}

Include this exact result pair:

R05_ALL_GPU_4BIT:
PASS
generate output shape: (1, 8)
CUDA after generate: allocated 6.507 GiB, reserved 6.693 GiB, free 7.742 GiB

R03_SPLIT_4BIT:
FAIL_BNB_PARAM_CONSTRUCTOR
TypeError: Params4bit.__new__() got an unexpected keyword argument '_is_hf_initialized'

Include the adapter probe:

Ayodele01/gemma-4-E2B-Gemini-3.1-Pro-Reasoning-Distill

audio: 72
language_like: 490
vision_or_patch: 224

Bottom line

The actual issue to raise now is:

  1. Primary bug: PeftModel.from_pretrained() fails on a CPU/GPU-dispatched bitsandbytes 4-bit Gemma 4 base model during Accelerate hook setup, while the same patched multimodal adapter works all-GPU.

  2. Supporting bug: vanilla PEFT currently trips over Gemma 4 multimodal tower Gemma4ClippableLinear wrappers unless the adapter targets inner .linear modules or PEFT supports the wrapper.

  3. Deferred bug: passing device_map into PEFT may break Gemma 4 shared-KV generation, but this needs its own exact trace and should not be merged into the latest Colab E2B primary issue.


Below are ready-to-paste GitHub sections. The ready-to-copy bodies use an outer 4-backtick fence so the inner 3-backtick code fences remain intact when copied.


Issue 1

Target repo

huggingface/transformers

Suggested title

PeftModel.from_pretrained fails only on CPU/GPU-dispatched 4-bit Gemma4: Params4bit.__new__ got unexpected _is_hf_initialized

Suggested labels

bug, Gemma4, PEFT, Accelerate, bitsandbytes, quantization, device_map, cpu-offload

Body

### System Info

- Runtime: Colab Free / Tesla T4
- Python: 3.12.13
- torch: 2.10.0+cu128
- transformers: 5.6.2
- accelerate: 1.13.0
- peft: 0.19.1
- bitsandbytes: 0.49.2
- huggingface_hub: 1.12.0
- safetensors: 0.7.0
- torchao: not importable / removed from environment
- model: `unsloth/gemma-4-E2B-it`
- adapter: `Ayodele01/gemma-4-E2B-Gemini-3.1-Pro-Reasoning-Distill`
- quantization: bitsandbytes 4-bit NF4
- attention implementation: `sdpa`
- trust_remote_code: `False`

### Summary

I can load `unsloth/gemma-4-E2B-it` in bitsandbytes 4-bit, load a patched multimodal LoRA adapter with PEFT, and run a tiny `generate()` smoke test when the whole model is placed on GPU:

```python
device_map = {"": 0}
```

However, the same model/adapter path fails when the base model is loaded with a split CPU/GPU `device_map`:

```python
device_map = {
    "model.vision_tower": "cpu",
    "model.audio_tower": "cpu",
    "": 0,
}
```

The failure happens during `PeftModel.from_pretrained()`, after PEFT starts adapter loading and calls into Accelerate `dispatch_model()` / hook attachment. The failing path reconstructs a bitsandbytes `Params4bit` object and passes `_is_hf_initialized` into its constructor:

```text
TypeError: Params4bit.__new__() got an unexpected keyword argument '_is_hf_initialized'
```

The contrast is important:

```text
R05 all-GPU 4-bit:
PASS

R03 split CPU/GPU 4-bit:
FAIL_BNB_PARAM_CONSTRUCTOR
```

This suggests the issue is not PEFT + bitsandbytes 4-bit in general. It appears specific to loading a PEFT adapter on top of an already CPU/GPU-dispatched bitsandbytes 4-bit Gemma 4 model, where PEFT triggers an additional Accelerate dispatch/hook path.

### Why the adapter config is patched in this repro

The adapter is a real multimodal LoRA adapter. Its safetensors keys include:

```text
audio: 72
language_like: 490
vision_or_patch: 224
```

Without patching, PEFT first fails earlier with:

```text
ValueError: Target module Gemma4ClippableLinear(...) is not supported.
```

That is because Gemma 4 vision/audio towers expose wrapper modules such as:

```text
model.vision_tower.encoder.layers.0.self_attn.q_proj
→ Gemma4ClippableLinear

model.vision_tower.encoder.layers.0.self_attn.q_proj.linear
→ Linear / Linear4bit
```

The adapter weights use inner `.linear` paths for multimodal towers, for example:

```text
base_model.model.model.audio_tower.layers.0.self_attn.k_proj.linear.lora_A.weight
base_model.model.model.vision_tower.encoder.layers.0.self_attn.q_proj.linear.lora_A.weight
```

So the repro patches only the local copy of `adapter_config.json` to target inner `.linear` modules for Gemma 4 multimodal towers while leaving language tower targets at the usual projection modules.

Patched target expression:

```text
.*(?:model\.language_model\.layers\.\d+\.(?:self_attn|mlp)\.(?:q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj)|model\.vision_tower\.encoder\.layers\.\d+\.(?:self_attn|mlp)\.(?:q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj)\.linear|model\.audio_tower\.layers\.\d+\.self_attn\.(?:q_proj|k_proj|v_proj)\.linear)$
```

After this patch, the all-GPU case works, which confirms the Gemma4ClippableLinear target-module issue is bypassed for this repro.
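
For completeness, the patch itself is just a local edit of `adapter_config.json` (sketch; `PATCHED_TARGET_REGEX` stands for the full expression quoted above):

```python
import json

ADAPTER_DIR = "/content/patched_gemma4_multimodal_inner_linear_adapter_v4"
PATCHED_TARGET_REGEX = r"..."  # the full target expression quoted above

with open(f"{ADAPTER_DIR}/adapter_config.json") as f:
    cfg = json.load(f)

# PEFT treats a string target_modules value as a regex over full module names.
cfg["target_modules"] = PATCHED_TARGET_REGEX

with open(f"{ADAPTER_DIR}/adapter_config.json", "w") as f:
    json.dump(cfg, f, indent=2)
```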

### Quantization config

```python
from transformers import BitsAndBytesConfig
import torch

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=False,
    bnb_4bit_compute_dtype=torch.bfloat16,
    llm_int8_enable_fp32_cpu_offload=True,
)
```

### Case A — all-GPU path works

```python
device_map = {"": 0}
max_memory = {0: "14GiB"}

base_model = Gemma4ForConditionalGeneration.from_pretrained(
    "unsloth/gemma-4-E2B-it",
    quantization_config=quant_config,
    device_map=device_map,
    max_memory=max_memory,
    dtype=torch.bfloat16,
    attn_implementation="sdpa",
    trust_remote_code=False,
    low_cpu_mem_usage=True,
)

model = PeftModel.from_pretrained(
    base_model,
    "/content/patched_gemma4_multimodal_inner_linear_adapter_v4",
    is_trainable=False,
)

out = model.generate(**inputs, max_new_tokens=4, do_sample=False, use_cache=True)
```

Observed result:

```text
R05_ALL_GPU_4BIT: PASS
generate output shape: (1, 8)

CUDA after generate:
free_gib: 7.742
total_gib: 14.563
allocated_gib: 6.507
reserved_gib: 6.693
```

### Case B — split CPU/GPU path fails

```python
device_map = {
    "model.vision_tower": "cpu",
    "model.audio_tower": "cpu",
    "": 0,
}
max_memory = {0: "13GiB", "cpu": "10GiB"}

base_model = Gemma4ForConditionalGeneration.from_pretrained(
    "unsloth/gemma-4-E2B-it",
    quantization_config=quant_config,
    device_map=device_map,
    max_memory=max_memory,
    offload_folder="/content/gemma4_offload_v4",
    dtype=torch.bfloat16,
    attn_implementation="sdpa",
    trust_remote_code=False,
    low_cpu_mem_usage=True,
)

model = PeftModel.from_pretrained(
    base_model,
    "/content/patched_gemma4_multimodal_inner_linear_adapter_v4",
    is_trainable=False,
    offload_dir="/content/gemma4_offload_v4",
    offload_buffers=True,
    ephemeral_gpu_offload=True,
    torch_device="cuda:0",
)
```

Observed result:

```text
R03_SPLIT_4BIT: FAIL_BNB_PARAM_CONSTRUCTOR

TypeError: Params4bit.__new__() got an unexpected keyword argument '_is_hf_initialized'
```

Trace tail:

```text
File ".../peft/peft_model.py", line 582, in from_pretrained
    load_result = model.load_adapter(

File ".../peft/peft_model.py", line 1475, in load_adapter
    dispatch_model(

File ".../accelerate/big_modeling.py", line 432, in dispatch_model
    attach_align_device_hook_on_blocks(

File ".../accelerate/hooks.py", line 540, in attach_align_device_hook
    add_hook_to_module(module, hook, append=True)

File ".../accelerate/hooks.py", line 183, in add_hook_to_module
    module = hook.init_hook(module)

File ".../accelerate/hooks.py", line 330, in init_hook
    set_module_tensor_to_device(module, name, "meta")

File ".../accelerate/utils/modeling.py", line 363, in set_module_tensor_to_device
    new_value = param_cls(new_value, requires_grad=old_value.requires_grad, **kwargs).to(

TypeError: Params4bit.__new__() got an unexpected keyword argument '_is_hf_initialized'
```

### Expected behavior

One of the following:

1. `PeftModel.from_pretrained()` should support adapter loading on an already CPU/GPU-dispatched bitsandbytes 4-bit model without reconstructing `Params4bit` with unsupported kwargs.
2. PEFT should avoid redispatching an already-dispatched quantized base model, or should do so without passing `_is_hf_initialized` to bitsandbytes constructors.
3. Accelerate `set_module_tensor_to_device()` should avoid forwarding HF-internal parameter attributes into `bitsandbytes.Params4bit.__new__()` if that constructor does not accept them.
4. bitsandbytes `Params4bit.__new__()` should accept or ignore `_is_hf_initialized` if this attribute is expected in the HF integration path.
5. If this configuration is unsupported, the error should be raised early with a clear message.

### Actual behavior

- All-GPU 4-bit base + patched multimodal PEFT adapter: works.
- Split CPU/GPU 4-bit base + the same patched multimodal PEFT adapter: fails during PEFT adapter loading.
- The failure occurs before generation.
- The failure occurs in the Accelerate dispatch/hook path when reconstructing a bitsandbytes `Params4bit` parameter on the meta-device path.

### Why this seems split/offload-specific

The same environment, same model, same adapter, same adapter patch, same quantization config, and same PEFT call pattern work when the model is all-GPU:

```python
device_map = {"": 0}
```

The failure appears only with:

```python
device_map = {
    "model.vision_tower": "cpu",
    "model.audio_tower": "cpu",
    "": 0,
}
```

Therefore, the issue seems tied to PEFT adapter loading on an already split/offloaded bitsandbytes 4-bit model, not to PEFT + bnb 4-bit generally.

### Related observations

#### 1. Gemma4ClippableLinear target-module blocker

Before applying the local adapter-config patch, both tested Gemma 4 E2B adapters failed earlier with:

```text
ValueError: Target module Gemma4ClippableLinear(...) is not supported.
```

That appears to be the same class of issue as the public PEFT `Gemma4ClippableLinear` support issue. The current repro works around that by targeting the inner `.linear` modules for vision/audio towers.

#### 2. Related `_is_hf_initialized` issue family

There is an existing issue for a similar error family with `Int8Params`:

```text
TypeError: Int8Params.__new__() got an unexpected keyword argument '_is_hf_initialized'
```

This repro appears to be the `Params4bit` variant of that family, reached through PEFT adapter loading and Accelerate hook setup on a split-device model.

### Questions

1. Is `PeftModel.from_pretrained()` expected to support adapter loading on an already CPU/GPU-dispatched bitsandbytes 4-bit model?
2. Should PEFT skip redispatch when the base model already has an `hf_device_map`?
3. Should Accelerate filter `_is_hf_initialized` before reconstructing bitsandbytes parameter classes?
4. Should bitsandbytes `Params4bit` accept or ignore `_is_hf_initialized`?
5. Is the recommended workaround for Gemma 4 E2B on T4 to keep the whole 4-bit model on GPU rather than offloading vision/audio towers to CPU?

### Relevant links

```text
Accelerate big model dispatch docs:
https://huggingface.co/docs/accelerate/package_reference/big_modeling

PEFT PeftModel docs:
https://huggingface.co/docs/peft/package_reference/peft_model

Transformers PEFT integration docs:
https://huggingface.co/docs/transformers/en/peft

Transformers bitsandbytes docs:
https://huggingface.co/docs/transformers/quantization/bitsandbytes

PEFT Gemma4ClippableLinear issue:
https://github.com/huggingface/peft/issues/3129

Related _is_hf_initialized issue:
https://github.com/huggingface/transformers/issues/43872
```


Optional Issue 2

Target repo

huggingface/peft

Suggested title

Gemma4 multimodal LoRA adapters need inner .linear targeting or Gemma4ClippableLinear support

Body

There is also a Gemma 4 target-module compatibility issue before this failure. Without the local `target_modules` patch, PEFT tries to inject LoRA into outer `Gemma4ClippableLinear` wrappers and fails with:

```text
ValueError: Target module Gemma4ClippableLinear(...) is not supported.
```

The adapter is genuinely multimodal, so simply excluding vision/audio targets would change the adapter semantics. The repro instead patches the local adapter config to target the inner `.linear` modules in Gemma 4 vision/audio towers.

With that patch, the all-GPU case passes, but the CPU/GPU-dispatched case still fails with:

```text
Params4bit.__new__() got an unexpected keyword argument '_is_hf_initialized'
```


My recommendation

Open Issue 1 first. It is now much cleaner than the original draft because the latest evidence isolates the trigger:

all-GPU 4-bit path works
split/offloaded 4-bit path fails during PEFT adapter loading

Do not open the shared-KV issue from this latest Colab run alone, because this run did not reach that failure path.


Hi, thanks for all your help. I have tried your recommendations.

During PEFT loading it jumped to my second GPU (GPU 1), which does not have enough VRAM; I use that card for RAG.

See the error below. I have also pasted all the code sections I used for the test.

QUANT_CONFIGS = {
    "gemma4-e4b-it": BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=False,
        bnb_4bit_compute_dtype=torch.bfloat16,
        llm_int8_enable_fp32_cpu_offload=True,
    )
}

MODEL_KWARGS = {
    "gemma4-e4b-it": {
        "dtype": torch.bfloat16,
        "attn_implementation": "sdpa",
        "trust_remote_code": False,
        "low_cpu_mem_usage": True
    },
}

OFFLOAD_DIR = r"E:\Folder\offload_temp"

device_map = {
    "model.vision_tower": "cpu",
    "model.audio_tower": "cpu",
    "": 0,
}

quant_config = QUANT_CONFIGS.get(model_id_to_load)

base_model = Gemma4ForConditionalGeneration.from_pretrained(
    MODEL_REGISTRY[model_id_to_load],
    quantization_config=quant_config,
    device_map=device_map,
    max_memory=max_memory,
    offload_folder=OFFLOAD_DIR,
    **MODEL_KWARGS[model_id_to_load]
)

model = PeftModel.from_pretrained(
    base_model,
    lora_path,
    adapter_name=lora_source_client_name,
    is_trainable=False,
    offload_dir=OFFLOAD_DIR,
    offload_buffers=True,
    ephemeral_gpu_offload=True,
    torch_device="cuda:0",
)
model.eval()

model.config.use_cache = True
if hasattr(model, "generation_config") and model.generation_config is not None:
    model.generation_config.use_cache = True

past_key_values = QuantoQuantizedCache(
    config=model.config,
)

input_device = get_input_embedding_device(model)

for key, value in inputs.items():
    if torch.is_tensor(value):
        inputs[key] = value.to(input_device)

outputs = model.generate(
    **inputs,
    past_key_values=past_key_values,
    tokenizer=tokenizer,
    do_sample=True,
    stop_strings=current_stop_strings,
    **params,
)

Error:

2026-04-25 20:31:12,735 | Worker (17960) | INFO | Based on the current allocation process, no modules could be assigned to the following devices due to insufficient memory:

  • 0: 5668601858 bytes required
    These minimum requirements are specific to this allocation attempt and may vary. Consider increasing the available memory for these devices to at least the specified minimum, or adjusting the model config.
    2026-04-25 20:31:14,660 | Worker (17960) | ERROR | Worker error: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 1 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 3.36 GiB is allocated by PyTorch, and 92.14 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management ( CUDA semantics — PyTorch 2.11 documentation )
Traceback (most recent call last):
  File "E:\Folder\inference_worker.py", line 460, in inference_worker_loop
    model = _worker_load_model(model_id_to_load, lora_source_client_name, supports_image, supports_audio)
  File "E:\Folder\inference_worker.py", line 364, in _worker_load_model
    model = PeftModel.from_pretrained(
  File "E:\Folder\gemma_env\Lib\site-packages\peft\peft_model.py", line 582, in from_pretrained
    load_result = model.load_adapter(
  File "E:\Folder\gemma_env\Lib\site-packages\peft\peft_model.py", line 1475, in load_adapter
    dispatch_model(
  File "E:\Folder\gemma_env\Lib\site-packages\accelerate\big_modeling.py", line 432, in dispatch_model
    attach_align_device_hook_on_blocks(
  File "E:\Folder\gemma_env\Lib\site-packages\accelerate\hooks.py", line 695, in attach_align_device_hook_on_blocks
    attach_align_device_hook_on_blocks(
  File "E:\Folder\gemma_env\Lib\site-packages\accelerate\hooks.py", line 695, in attach_align_device_hook_on_blocks
    attach_align_device_hook_on_blocks(
  File "E:\Folder\gemma_env\Lib\site-packages\accelerate\hooks.py", line 695, in attach_align_device_hook_on_blocks
    attach_align_device_hook_on_blocks(
  [Previous line repeated 3 more times]
  File "E:\Folder\gemma_env\Lib\site-packages\accelerate\hooks.py", line 653, in attach_align_device_hook_on_blocks
    add_hook_to_module(module, hook)
  File "E:\Folder\gemma_env\Lib\site-packages\accelerate\hooks.py", line 183, in add_hook_to_module
    module = hook.init_hook(module)
  File "E:\Folder\gemma_env\Lib\site-packages\accelerate\hooks.py", line 305, in init_hook
    set_module_tensor_to_device(module, name, self.execution_device, tied_params_map=self.tied_params_map)
  File "E:\Folder\gemma_env\Lib\site-packages\accelerate\utils\modeling.py", line 335, in set_module_tensor_to_device
    new_value = old_value.to(device, non_blocking=non_blocking)
  File "E:\Folder\gemma_env\Lib\site-packages\bitsandbytes\nn\modules.py", line 351, in to
    super().to(device=device, dtype=dtype, non_blocking=non_blocking),
  File "E:\Folder\gemma_env\Lib\site-packages\bitsandbytes\nn\modules.py", line 401, in __torch_function__
    return super().__torch_function__(func, types, args, kwargs)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 1 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 3.36 GiB is allocated by PyTorch, and 92.14 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management ( https://pytorch.org/docs/stable/notes/cuda.html#environment-variables )
2026-04-25 20:31:14,676 | Worker (17564) | ERROR | Worker returned error: Worker error: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 1 has a total capacity of 4.00 GiB of which 0 bytes is free. Of the allocated memory 3.36 GiB is allocated by PyTorch, and 92.14 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management ( https://pytorch.org/docs/stable/notes/cuda.html#environment-variables )

This matches the issue “PeftModel.from_pretrained fails only on CPU/GPU-dispatched 4-bit Gemma4: Params4bit.__new__ got unexpected _is_hf_initialized”.

That is why I am on Accelerate v1.14.0.dev0; it was fixed in “Fix is_hf_initialized attribute by SunMarc · Pull Request #3976 · huggingface/accelerate · GitHub”.

I also had a “Gemma4ClippableLinear” error, but at first I couldn’t recollect what fixed it.

Remembered:

# LoRA config — works for both model types
# Gemma4 wraps projections in Gemma4ClippableLinear; target the inner .linear sublayer
is_gemma4_model = "gemma4" in model_id.lower()
if is_gemma4_model:
    lora_targets_full = [
        "q_proj.linear", "k_proj.linear", "v_proj.linear", "o_proj.linear",
        "gate_proj.linear", "up_proj.linear", "down_proj.linear"
    ]
    lora_targets_minimal = ["q_proj.linear", "v_proj.linear", "gate_proj.linear", "up_proj.linear"]
else:
    lora_targets_full = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
    lora_targets_minimal = ["q_proj", "v_proj", "gate_proj", "up_proj"]
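
To double-check which variant applies, a small sketch like the one below (assuming the model has already been loaded as base_model; this check is not part of my original script) lists the projection modules so you can see whether they are wrapped in a .linear sublayer:

for name, module in base_model.named_modules():
    # Gemma4 wraps projections, so names end in "...q_proj.linear";
    # older models expose plain "...q_proj" Linear layers.
    if name.endswith(("q_proj", "q_proj.linear", "gate_proj", "gate_proj.linear")):
        print(name, type(module).__name__)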

For those interested, here is my working configuration for Gemma 4 E4B IT and E2B, along with a list of the trials and fixes I went through to get Gemma 4 running.

Gemma 4 is much smarter than the previous Gemma 3 12B I used. Even the smaller E2B model performs better. However, Gemma 4’s KV cache is much larger than that of previous Gemma models, and offloading is not working, so it is still a struggle to get a viable setup.

My setup uses a 12 GB VRAM GPU. Below is a list of my trials and fixes to make VRAM usage work.

  1. VRAM trials and fixes

Because Gemma 4 uses a much larger KV cache, I tried to optimize VRAM usage as much as possible. In my best working state I have less than 3 GB of VRAM free for the KV cache, which means I can only summarize a ~15 KB text file before running into a KV-cache OOM. (See below: in this best setup vision does not work, because keeping vision usable would leave me with less than 2 GB free.)
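
To see how much headroom is actually left before calling generate(), a quick free-memory check like this works (just a sketch; the numbers above were simply observed during my runs):

import torch

# Free vs. total memory on the main GPU, in GiB.
free_bytes, total_bytes = torch.cuda.mem_get_info(0)
print(f"GPU 0: {free_bytes / 1024**3:.2f} GiB free of {total_bytes / 1024**3:.2f} GiB total")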

1.1

I was blocked from doing any offloading until the _is_hf_initialized bug was fixed. This was recently fixed in Accelerate (or rather, will be fixed in v1.14). I applied the fix by installing the dev version:

pip install git+https://github.com/huggingface/accelerate.git
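
A quick sanity check that the dev build is really the one being imported (optional; the exact version string it reports is an assumption on my part):

import accelerate

# Should print 1.14.0.dev0 or later if the git install worked.
print(accelerate.__version__)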

1.2

I could not use device_map="auto" until it was fixed in Transformers v5.6.0. Before that, I couldn't run any max_memory tests; only device_map={"": 0} worked.

Even now that device_map="auto" is fixed, Gemma 4 does not spill over into shared GPU memory like previous Gemma models did.

1.3

If I set max_memory for the GPU to anything lower than what is required to fit the full Gemma 4 E4B model, I get this error during model loading:

Tensor.item() cannot be called on meta tensors

1.4

If I try to offload the Vision and Audio components, I get the same error during PEFT loading:

Tensor.item() cannot be called on meta tensors

(This is the issue discussed in this forum thread.)

1.5

Since I couldn’t offload anything, I purchased a motherboard that supports two GPUs. I added an old 4 GB GPU and moved my RAG model to that card, freeing up about 1.3 GB of VRAM on my main 12 GB GPU.
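
Pinning the RAG model to the second card is just a matter of giving it its own device map; here is a sketch with a placeholder model name (my actual RAG model is different):

from transformers import AutoModel

# Put the smaller model entirely on GPU 1 (the old 4 GB card),
# so the 12 GB card stays free for Gemma 4.
rag_model = AutoModel.from_pretrained(
    "my-rag-embedding-model",  # placeholder name
    device_map={"": 1},
)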

  2. Vision issues

The vision component was not working correctly — the model only saw grey and distorted shapes and could not identify images or colors.

Unlike Gemma 3, I had to explicitly exclude model.vision_tower and model.multi_modal_projector from quantization before vision started working. I also excluded the audio components, although I haven’t yet confirmed whether that is necessary.

However, excluding the vision and audio components from quantization costs about 1 GB of VRAM. This leaves me with a trade-off:

Quantize vision and audio → <3 GB free VRAM
Do not quantize vision and audio → <2 GB free VRAM


# Needed to add the "llm_int8_skip_modules" section to exclude the vision and audio from quantization.
QUANT_CONFIGS = {
    "gemma4-e4b-it": BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        llm_int8_enable_fp32_cpu_offload=True,
        llm_int8_skip_modules=[
            "model.vision_tower", 
            "model.multi_modal_projector", 
            "model.audio_tower", 
            "lm_head"
        ]
    ),
}
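
# Sketch of a sanity check (run after the model below is loaded): if the skip list
# worked, the skipped modules should not be bitsandbytes Linear4bit layers.
import bitsandbytes as bnb

for name, module in base_model.named_modules():
    if "vision_tower" in name and isinstance(module, bnb.nn.Linear4bit):
        print("still quantized:", name)  # ideally prints nothing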

# As far as I can see, "low_cpu_mem_usage" does not change anything yet.
MODEL_KWARGS = {
    "gemma4-e4b-it": {
        "dtype": torch.bfloat16,
        "attn_implementation": "sdpa",
        "trust_remote_code": False,
        "low_cpu_mem_usage": False
    },
}

# Main model loading

quant_config = QUANT_CONFIGS.get(model_id_to_load)

target_gpu = 0
gpu_props = torch.cuda.get_device_properties(target_gpu)
gpu_total_gb = gpu_props.total_memory / (1024 ** 3)

# I need a factor of 0.91 to fit the complete Gemma 4 E4B IT model

gpu_budget_gb = int(gpu_total_gb * 0.91)
cpu_budget_gb = 24
max_memory = {target_gpu: f"{gpu_budget_gb}GiB", "cpu": f"{cpu_budget_gb}GiB"}

base_model = Gemma4ForConditionalGeneration.from_pretrained(
    MODEL_REGISTRY[model_id_to_load],
    quantization_config=quant_config,
#    device_map={"":0},
    device_map="auto",
#    device_map=device_map,
    max_memory=max_memory,
    offload_folder="e:\\Folder\\offload_temp",
    **MODEL_KWARGS[model_id_to_load]
)
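
# Sketch: check where each block actually ended up after device_map="auto".
# accelerate stores the resolved mapping in hf_device_map.
for module_name, device in base_model.hf_device_map.items():
    print(module_name, "->", device)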

# PEFT model loading

model = PeftModel.from_pretrained(
    base_model,
    lora_path,
    adapter_name=lora_source_client_name,
    is_trainable=False
)

# I tried to quantize the kv_cache but it did not help: the kv_cache is filled completely before it gets quantized, so it OOMs right at the start with big contexts.

past_key_values = QuantoQuantizedCache(
    config=model.config,
)

# I also tried DynamicCache with offloading, but that did not help either: same as above, the whole context is loaded into the kv_cache before the offloading kicks in, so it OOMs on big contexts.

past_key_values = DynamicCache(
    config=model.config, 
    offloading=True
)

outputs = model.generate(
    **inputs,
    past_key_values=past_key_values,
    tokenizer=tokenizer,
    do_sample=True,
    stop_strings=current_stop_strings,
    **params,
) 
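
# Sketch of decoding the result: strip the prompt tokens, then decode only the new part.
prompt_len = inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))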


# PEFT training also had an error; it works differently than with previous Gemma models. I got a "Gemma4ClippableLinear" error, and to fix it I needed to add ".linear" to the target_modules.

if is_gemma4_model:
    lora_targets_full = [
        "q_proj.linear", "k_proj.linear", "v_proj.linear", "o_proj.linear",
        "gate_proj.linear", "up_proj.linear", "down_proj.linear"
    ]
    lora_targets_minimal = ["q_proj.linear", "v_proj.linear", "gate_proj.linear", "up_proj.linear"]
else:
    lora_targets_full = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
    lora_targets_minimal = ["q_proj", "v_proj", "gate_proj", "up_proj"]

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=lora_targets_full,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
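
# Minimal sketch of applying the LoRA config above for training (trainer setup omitted;
# get_peft_model and print_trainable_parameters are the standard PEFT calls).
from peft import get_peft_model

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()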

Hope this helps anyone struggling with Gemma 4.