Qwen-Image-Edit LoRA fine-tuned weights are sparse

I fine-tuned qwen-image-edit-2509 with the LoRA method and inspected the resulting weight matrices. They look very sparse, and the values are very small, close to zero.
Here are the statistics for some selected layers:

Is this normal? I wonder whether it will affect the final result.


Seems normal in this case?


Yes, what you are seeing (very small, near-zero LoRA ΔW values for Qwen-Image-Edit-2509) is normal and expected, and by itself it does not mean your fine-tune is ineffective.

Below is a detailed breakdown of why this happens, how to interpret your statistics, when it would indicate a bug, and how to verify that your LoRA is actually changing Qwen-Image-Edit’s behavior.


1. What LoRA is doing to your weights

For a linear layer with base weight matrix W_0, LoRA re-parametrizes the adapted weight as

W' = W_0 + ΔW,  ΔW = α · B A

where

  • A ∈ ℝ^(r × d_in)
  • B ∈ ℝ^(d_out × r)
  • r is the rank (small)
  • α is a scale factor. (AI Bites)

Key properties:

  1. Only A and B are trainable; the big base weight W_0 is frozen. (ApX Machine Learning)
  2. One of A or B is typically zero-initialized (often B = 0), so at step 0 you have ΔW = 0 and the model is exactly the base model.
  3. Training only needs to learn a low-rank, small correction ΔW instead of a full dense update.

The original LoRA paper and follow-up surveys explicitly assume that the optimal fine-tuning update ΔW lies in a low-rank, low-energy subspace and can be well-approximated by this small BA term. (arXiv)

So by design:

  • ΔW is much smaller than the base W_0,
  • ΔW is low-rank,
  • ΔW looks like a dense cloud of tiny values clustered around zero.

That is exactly what you are observing.
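To make the parameterization concrete, here is a minimal PyTorch sketch (all sizes and the base-weight scale are made up for illustration) showing why ΔW is exactly zero at step 0 and remains a dense, small-magnitude correction afterwards:

import torch

d_in, d_out, r, alpha = 3072, 3072, 16, 1.0   # hypothetical sizes and scale
W0 = torch.randn(d_out, d_in) * 0.02           # frozen base weight (illustrative scale)

# Standard LoRA init: A gets a small random init, B starts at zero
lora_A = torch.randn(r, d_in) / d_in ** 0.5
lora_B = torch.zeros(d_out, r)

delta_W = alpha * (lora_B @ lora_A)            # exactly zero before any training
print(delta_W.abs().max())                     # tensor(0.)

# After training, B becomes non-zero but small, so delta_W remains a dense,
# low-rank correction whose entries are tiny compared to W0.
# (Many frameworks, e.g. PEFT, use alpha / r as the effective scale.)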


2. Interpreting your numbers: not actually “sparse”

From your screenshot, for several to_q projections you computed

delta = (lora_B @ lora_A).flatten()

and reported approximately:

  • mean ≈ 4.8e-08
  • std ≈ 2.86e-04
  • max |ΔW| ≈ 0.002
  • "sparsity" ≈ 0.31% where sparsity = fraction with |ΔW| < 1e-6

Important implications:

  1. Dense, not sparse.

    Your “sparsity” metric says only 0.31% of entries are almost zero (|ΔW| < 1e-6). That means > 99% of the elements are non-zero.
    A truly sparse matrix (in the usual sense) would have the opposite pattern (e.g., 90–99% zero).

  2. Small-magnitude, but non-trivial.

    • Standard deviation ≈ 2.9×10⁻⁴
    • Max ≈ 2×10⁻³

    Those are small compared to typical transformer weights (often ~1e-2 to 1e-1 after pre-training), but they are not "numerical noise". This is a normal scale for LoRA deltas: small relative to W_0 yet clearly non-zero (a quick relative-norm check is sketched at the end of this section).

  3. Zero-mean, bell-shaped histogram is expected.

    • A near-zero mean is what you would expect from a regularized weight distribution.
    • Your histogram, plotted with a log y-axis over the narrow range (−0.0025, 0.0025), looks sharply peaked around 0; the log scale just visually exaggerates a smooth, Gaussian-like distribution.

So the matrix is:

  • Dense,
  • Low-rank by construction,
  • Small-magnitude,

which matches exactly what LoRA theory suggests for ΔW. (cognativ.com)
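If you want a single number for "small relative to W_0", a relative Frobenius-norm check is one option. A sketch, assuming you have the base weight and both LoRA factors as tensors (names are illustrative; apply your framework's α scaling if you want the effective update):

import torch

def relative_update_norm(W0, lora_A, lora_B, scale=1.0):
    # ||scale * B @ A||_F / ||W0||_F: how much energy the LoRA correction
    # adds relative to the frozen base weight.
    delta = scale * (lora_B.float() @ lora_A.float())
    return (delta.norm() / W0.float().norm()).item()

# Values of a few percent or less are common for healthy LoRAs; the exact
# number depends on rank, alpha, learning rate, and training length.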


3. Why LoRA deltas are supposed to be small

Several independent sources point out that LoRA deltas should be small:

  1. LoRA paper (Hu et al., 2021).
    They show that the full fine-tune update ΔW for large models can be well-approximated by a low-rank correction, which implicitly means ΔW is low-energy (small norm) and does not drastically overwrite the base model.

  2. “A Note on LoRA” (Fomenko et al., 2024).
    This paper explicitly states that LoRA delta weights are small, and then studies serving overhead for swapping those small delta weights in and out. (arXiv)

  3. Surveys and tutorials on LoRA / PEFT.

    • A recent survey of LoRA for foundation models emphasizes that LoRA injects small low-rank updates into pre-trained layers to adapt behavior while preserving the base model. (arXiv)
    • Practical guides (e.g., Nebius “Fundamentals of LoRA”, AI-bites summary, apxml course) describe LoRA as learning a small matrix ΔW = BA instead of a large full update, and explicitly show examples where ΔW values are tightly centered near zero. (AI Bites)
  4. Empirical analyses of LoRA weight distributions.
    Analyses on both LLMs and vision models consistently find LoRA-induced weights have significantly smaller standard deviation and spectral norms than full fine-tuning, yet still produce comparable downstream performance. (arXiv)

Intuitively:

  • The base model already solves “most” of the problem.
  • LoRA only needs to add gentle corrections in a few directions.
  • If those corrections were as large as the original weights everywhere, you’d risk catastrophic forgetting and instability.

So the fact that your ΔW entries lie roughly in [−0.002, 0.002] with std ≈ 3×10⁻⁴ is not a red flag; it is close to what you want.


4. Will such small deltas actually affect Qwen-Image-Edit?

Yes. Small but structured deltas can have a large semantic impact because:

  1. They are applied in many layers (attention Q/K/V, MLPs, etc.).
  2. They act on activations that are often order-1 in magnitude.
  3. They compound through the depth of the network (a rough back-of-the-envelope estimate follows below).
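A rough estimate of points 2 and 3, with a hypothetical layer width and treating ΔW as i.i.d. noise (which ignores its low-rank structure but gives the right order of magnitude):

import torch

d_out, d_in = 3072, 3072                     # hypothetical projection size
delta = torch.randn(d_out, d_in) * 2.9e-4    # synthetic ΔW with the std you measured
x = torch.randn(d_in)                        # an order-1 activation vector

y_delta = delta @ x
print(y_delta.std())   # ≈ 2.9e-4 * sqrt(d_in) ≈ 0.016 per layer, per token

# Summed over every adapted projection in every transformer block,
# a shift of this size is far from negligible.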

Real-world LoRAs for Qwen-Image-Edit show exactly this behavior:

  • FlyMy.AI’s “Qwen Image Edit Inscene LoRA” is trained on top of Qwen/Qwen-Image-Edit and is advertised as significantly improving scene coherence, object positioning, and camera perspective for in-scene edits.
  • The model card literally shows side-by-side: output without LoRA vs output with InScene LoRA, where the only difference is this small adapter; the visual change is large (better camera angle, preserved scene, correct actions).
  • Other LoRAs such as lovis93/next-scene-qwen-image-lora-2509 similarly describe clear cinematic behavior changes at modest LoRA strengths (0.7–0.8). (huggingface.co)

Those public Qwen-Image-Edit LoRAs were trained with the same kind of small ΔW updates you are seeing. The fact that they work in practice is strong evidence that small ΔW is enough.

So the key question is not “are the numbers big?” but “does the LoRA change outputs when applied?”.


5. When should you worry?

There are known failure modes where LoRA really does “do nothing”. They look very different from your stats:

  1. Exactly zero deltas (implementation bug).

    • In some PEFT + DeepSpeed setups, users found that lora_B remained all zeros after training; ΔW stayed exactly zero, so the model was identical to the base despite training. (Reddit)
    • In certain Lightning / lit-gpt configs, LoRA layers were instantiated inside a context that prevented proper initialization; A and B never received gradients and stayed zero.

    In those cases you’d see:

    • std == 0, max == 0 (or machine epsilon)
    • Or nearly 100% “sparsity” if you threshold around 1e-6

    That is not what your numbers show (a quick programmatic check for this failure mode is sketched at the end of this section).

  2. LoRA weights exist but are never applied.

    In some Stable Diffusion / SD-WebUI / InvokeAI issues, LoRAs loaded but the inference pipeline didn’t actually add ΔW to the UNet weights, so outputs were identical to the base model. The LoRA files themselves were fine; the bug was in how they were wired into inference. (huggingface.co)

    In that failure mode:

    • Your matrices look normal on disk.
    • But if you generate images with and without the LoRA, they are visually identical (beyond normal sampling noise).

Your stats (non-zero std, non-zero max, only 0.3% under 1e-6) clearly indicate you are not in the “everything zero” pathological regime.
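If you want to rule out the "exactly zero" failure mode programmatically, a small scan of the saved adapter file can flag untrained B matrices. This is a sketch; the file name and key patterns depend on your training framework and save format:

from safetensors.torch import load_file

def find_zero_lora_B(path):
    # Flag LoRA "B" / "up" factors that are still exactly zero after training.
    state = load_file(path)
    for key, tensor in state.items():
        # PEFT/diffusers-style keys contain ".lora_B.", kohya-style files use "lora_up"
        if ".lora_B." in key or "lora_up" in key:
            if tensor.abs().max().item() == 0.0:
                print(f"WARNING: {key} is all zeros -- this layer never trained")

find_zero_lora_B("my_qwen_edit_lora/pytorch_lora_weights.safetensors")  # hypothetical path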


6. Practical checks you should run

To be confident your Qwen-Image-Edit LoRA is doing what you want, run these checks:

6.1. Confirm per-layer deltas are non-zero

You already did this for some to_q layers. Repeat for the other modules you trained (e.g., to_k, to_v, to_out, any MLP or image-side modules):

def describe_delta(lora_A, lora_B, name, atol=1e-6):
    # Unscaled update B @ A for one layer; multiply by alpha (or alpha / r,
    # depending on your framework) if you want the effective delta.
    d = (lora_B.float() @ lora_A.float()).flatten()  # upcast for stable statistics
    mean = d.mean().item()
    std  = d.std().item()
    maxv = d.abs().max().item()
    sparsity = (d.abs() < atol).float().mean().item()
    print(f"{name:25s} mean={mean:+.2e} std={std:.2e} "
          f"max={maxv:.3e} sparsity={sparsity*100:.2f}%")

You want:

  • Non-zero std and max across the trained modules.
  • Reasonably similar scales between layers (some variation is normal).

6.2. Compare outputs with different LoRA strengths

This is the decisive test; a code sketch follows at the end of this subsection.

  1. Load base Qwen-Image-Edit-2509.
  2. Generate a few edited images with LoRA disabled (scale = 0.0 or no adapter).
  3. Load your LoRA and generate with scale = 0.5, 1.0, 2.0 on the same prompts and seed.

You should see:

  • Clear, qualitative differences between scale 0.0 and 1.0.
  • A progressively stronger effect as you go to 2.0 (sometimes too strong / over-stylized).

For reference, public Qwen-Image-Edit LoRAs (InScene, Next Scene, InStyle, etc.) recommend strengths around ~0.7–0.8 for good balance, and their authors show visually obvious differences vs base.

If you see no visible difference even at high scale, then you should investigate:

  • Did you select the right target modules when training?
  • Are you definitely loading the LoRA into the active pipeline?
  • Was the learning rate / number of steps large enough?

But the small ΔW per element is not itself the problem.
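A minimal sketch of that on/off and strength sweep with diffusers. The LoRA path, test image, prompt, and exact pipeline call signature are assumptions to adapt to your setup, and it presumes a diffusers version with LoRA support for this pipeline:

import torch
from PIL import Image
from diffusers import DiffusionPipeline  # resolves the Qwen-Image-Edit pipeline class from the model card

pipe = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image-Edit-2509", torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights("path/to/your_lora", adapter_name="my_lora")  # hypothetical path

input_image = Image.open("input.png").convert("RGB")  # hypothetical test image

for scale in [0.0, 0.5, 1.0, 2.0]:
    pipe.set_adapters(["my_lora"], adapter_weights=[scale])
    generator = torch.Generator("cuda").manual_seed(42)   # same seed for every scale
    image = pipe(image=input_image, prompt="your edit instruction",
                 generator=generator).images[0]
    image.save(f"edit_scale_{scale}.png")

If the four images are indistinguishable, the problem is in loading or wiring, not in the size of ΔW.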

6.3. Compare to a known good Qwen-Image-Edit LoRA

As a sanity check:

  1. Download a public adapter, e.g. flymy-ai/qwen-image-edit-inscene-lora or lovis93/next-scene-qwen-image-lora-2509. (DataCamp)
  2. Run the same ΔW stats (mean, std, max, “sparsity”) on its to_q and to_k matrices.
  3. Compare ranges: you will almost certainly find similar std (1e-4–1e-3) and max (1e-3–1e-2).

If your numbers are in the same ballpark as these known-good LoRAs, that’s strong evidence everything is normal.
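A sketch of steps 1–2, reusing describe_delta from 6.1 (the file name and key layout inside the safetensors are assumptions; inspect state.keys() and adapt):

from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

path = hf_hub_download("flymy-ai/qwen-image-edit-inscene-lora",
                       "pytorch_lora_weights.safetensors")  # assumed file name
state = load_file(path)

# PEFT/diffusers-style keys end in ".lora_A.weight" / ".lora_B.weight"
for key, A in state.items():
    if key.endswith(".lora_A.weight") and "to_q" in key:
        B = state[key.replace(".lora_A.", ".lora_B.")]
        describe_delta(A, B, key.rsplit(".lora_A", 1)[0])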


7. Short answer, restated

  • LoRA is explicitly built on the assumption that the update matrix ΔW is small and low-rank; it is written as ΔW = B A with a small rank r and small-magnitude entries. (AI Bites)

  • Your measured ΔW for Qwen-Image-Edit-2509 has:

    • Dense non-zero entries (>99% non-zero by your threshold),
    • Small but non-trivial std (~2.9×10⁻⁴),
    • Max around 2×10⁻³.

    This is exactly what the LoRA literature and real Qwen LoRAs show as normal behavior, not a sign of failure. (arXiv)

  • What actually matters is whether your LoRA changes outputs when applied (and how). That is best checked by toggling the adapter on/off and sweeping the LoRA strength.


8. Curated links for further reading

You asked for links; here is a small, focused set:

A. Core LoRA background

  • Original LoRA paper – LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021). Excellent for the exact parameterization W' = W_0 + BA and the motivation for low-rank updates.
  • “LoRA paper explained” – accessible walkthrough of the math and ΔW = BA intuition. (AI Bites)
  • Practical LoRA intro in a fine-tuning course (apxml) – very clear explanation of decomposing ΔW into A and B and why the update is small. (ApX Machine Learning)

B. LoRA delta size and theory

  • A Note on LoRA – explicitly discusses how LoRA delta weights are small and how that impacts serving. (arXiv)
  • LoRA surveys for foundation models – overviews of LoRA variants, delta-LoRA, sparsity, etc., with discussion of low-energy updates. (arXiv)

C. Qwen-Image-Edit LoRA examples

  • flymy-ai/qwen-image-edit-inscene-lora – model card showing before/after images for an in-scene LoRA trained on Qwen-Image-Edit. Good visual reference for what a working LoRA can achieve.
  • FlyMyAI/flymyai-lora-trainer – open-source trainer specifically for LoRA on Qwen-Image, Qwen-Image-Edit, and FLUX; useful as a reference for configs and expected behavior. (GitHub)
  • lovis93/next-scene-qwen-image-lora-2509 and related tutorials / spaces – show cinematic “next scene” behavior for Qwen-Image-Edit 2509 using only a LoRA adapter and typical strength settings. (huggingface.co)

If you like, you can take one of those public Qwen-Image-Edit LoRAs, dump its lora_B @ lora_A stats next to yours, and you’ll have a very concrete, side-by-side confirmation that your “tiny, near-zero” ΔW is right where a healthy LoRA usually lives.


As John already mentioned, by default LoRA B is initialized as a zero matrix, so at the start of training ΔW is expected to be close to zero. You should monitor other metrics to see how well the training is going, and possibly increase the learning rate to see results more quickly.

Depending on the framework you use for LoRA training, you may also have initialization options other than zero init for B. For example, there may be an option for orthogonal initialization, which can result in faster learning. Try out a few options; a PEFT-based sketch follows below.
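In PEFT, for example, the choice is exposed through init_lora_weights on LoraConfig. A sketch (the target module names are illustrative, and the non-default values depend on your PEFT version):

from peft import LoraConfig

# Default: A is Kaiming-initialized, B is all zeros, so the update is zero at step 0.
config_default = LoraConfig(
    r=16, lora_alpha=16,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
    init_lora_weights=True,
)

# Recent PEFT versions also accept e.g. "gaussian", or decomposition-based schemes
# such as "pissa" or "olora" (orthonormal factors from the base weight); these adjust
# the frozen weight so the output is unchanged at step 0 but often converge faster
# than the plain zero-B init.
config_olora = LoraConfig(
    r=16, lora_alpha=16,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
    init_lora_weights="olora",
)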
