About training a LoRA for Z-Image Turbo

I’m going to train a LoRA for Z-Image Turbo, but I don’t like the base output (background and colors). I want to add some color-adjustment steps with NAG and Detail Daemon.

Is the base output recommended for LoRA training, or should I train with the most optimal images for my goals?

How will the two approaches affect my results?


When using it as a base model for LoRA training, Base might be fundamentally more suitable than Turbo? But Turbo came out first and is more widespread… making the choice difficult.


Background you need for this decision

A LoRA does not “fix” a model in the abstract. It learns to reproduce the distribution of whatever you train it on (subject, lighting, palette, background types, contrast curve, etc.). This is why dataset choice matters more than most hyperparameters.

Z-Image Turbo adds a second complication: it is a few-step distilled model that does not rely on classifier-free guidance (CFG) at inference and, in the official prompting guidance, does not use negative prompts at all. (Hugging Face) Turbo is typically run at ~9 steps with guidance scale 0.0.
That pushes more of the “look” (including background/color tendencies) into the model prior, and makes classic “negative-prompt fixes” weaker unless you use something like NAG.


1) Is the base output recommended for LoRA training?

It depends on what your LoRA is for

If your LoRA is primarily identity / subject / concept

Using images that are clean, consistent, and not heavily “styled” is often the most robust path.

  • You want the LoRA to learn what the subject is, not “a specific color grade.”
  • You can keep background/color control as an inference-time knob.

In this case, training on “base-ish” outputs can be acceptable, because you are not trying to reprogram the global aesthetic—just teach a concept.

If your LoRA is primarily look / palette / background behavior

Then training on base-like images is not recommended for your goal, because:

  • The LoRA will learn the base model’s palette/background priors as part of the target distribution.
  • You may end up fighting the LoRA with NAG/Detail Daemon every time.

For a look-changing objective, your training images should already reflect your desired look.


2) Should you train with the most optimal images for your goals?

For your specific complaint (background + colors): generally yes

If you want different backgrounds and color behavior, training on your “optimal” images is the direct route.

But “optimal” should mean:

  • Your intended palette/contrast/white balance
  • Not overprocessed (avoid extreme sharpening / HDR micro-contrast / aggressive artifact removal)

Why the caution: Detail Daemon explicitly notes that pushing it too far produces an oversharpened and/or HDR effect. (GitHub) If you train on images that already have that signature and then also apply Detail Daemon at inference, you can get “double enhancement.”


How the two approaches affect results

Approach A — Train on base-like images (neutral training set)

What improves

  • Better generalization across prompts and scenes.
  • Your LoRA is less likely to “force” one palette/background everywhere.
  • NAG and other inference controls remain clean “steering knobs.”

What stays the same

  • The model’s default background/color tendencies will still show up unless you steer them at inference (NAG, prompt strategy, post).

Typical failure mode

  • You finish training and feel: “My LoRA works, but the colors/background are still wrong.”

Approach B — Train on “final-look” images (your optimal goal images)

What improves

  • Background and palette shift by default, with less per-prompt tweaking.
  • More consistent “house style.”

What gets worse

  • Less flexibility: the LoRA can “drag” unrelated prompts toward your dataset’s look.
  • Higher risk of learning workflow artifacts (over-sharpen, haloing, crunchy texture, too-clean backgrounds).

Typical failure mode

  • “Everything looks like my training set, even when I don’t want it to.”

A practical compromise that works well

Train on images that are 80–90% your target look (good grading, but not extreme), then do the last 10–20% with NAG/Detail Daemon.


Where NAG and Detail Daemon fit

NAG (Normalized Attention Guidance)

NAG is explicitly motivated by the problem that CFG-based negative guidance collapses in few-step regimes, and it restores effective negative-style control by operating in attention space. (arXiv)

Implication for you

  • If you keep training neutral (Approach A), NAG is a strong way to suppress unwanted background/color traits at inference without baking them into the LoRA.
  • If you bake the look into the LoRA (Approach B), you typically need less NAG (otherwise you can over-suppress and lose richness).

Detail Daemon

Detail Daemon adjusts sigma/sampling behavior to enhance detail and can reduce unwanted blur, but can produce an oversharpen/HDR look if pushed. (GitHub)

Implication for you

  • Use it as a finisher, not as the “source of truth” for what your LoRA should learn.
  • If you want to train on images processed with it, keep settings moderate and consistent.
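Detail Daemon’s actual math lives in its ComfyUI node; as a toy illustration of the general idea (lowering the sigmas fed to the model over the middle of the schedule so it renders more detail), a sketch might look like the following. The function name and the linear-scaling scheme are illustrative assumptions, not the node’s real implementation:

```python
def adjust_sigmas(sigmas, amount=0.2, start=0.2, end=0.8):
    """Toy illustration of a Detail Daemon-style tweak: scale down sigmas
    in the middle portion of the schedule by `amount`, leaving the
    beginning and end untouched. NOT the node's exact math."""
    n = len(sigmas)
    out = []
    for i, s in enumerate(sigmas):
        t = i / max(n - 1, 1)  # normalized position in the schedule, 0..1
        if start <= t <= end:
            out.append(s * (1.0 - amount))  # lower sigma -> model adds detail
        else:
            out.append(s)
    return out
```

Pushing `amount` high is exactly where the oversharpened/HDR signature comes from, which is why it belongs at inference rather than baked into training data.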

Turbo vs Base as the training base model

Why Z-Image (Base) is generally better for LoRA training

The official Z-Image repo recommends for the Base model:

  • guidance scale 3.0–5.0
  • inference steps 28–50
  • negative prompts strongly recommended (GitHub)

That is exactly the environment where you can:

  • control background/colors at inference (CFG + negatives)
  • train a LoRA without depending on special “distillation-preserving” tricks
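To make the two inference regimes concrete, here is a small settings table reflecting the recommendations quoted above. The parameter names (`guidance_scale`, `num_inference_steps`) are diffusers-style assumptions; check your actual pipeline’s API:

```python
# Illustrative sampler presets for the two models, per the guidance above.
SAMPLER_PRESETS = {
    "z-image-base": {
        "guidance_scale": 4.0,        # recommended band: 3.0-5.0
        "num_inference_steps": 40,    # recommended band: 28-50
        "use_negative_prompt": True,  # negatives strongly recommended
    },
    "z-image-turbo": {
        "guidance_scale": 0.0,        # distilled: CFG off
        "num_inference_steps": 9,     # few-step regime
        "use_negative_prompt": False, # negatives not used by design
    },
}

def preset_for(model_name: str) -> dict:
    """Return a copy of a preset so callers can tweak it safely."""
    return dict(SAMPLER_PRESETS[model_name])
```

The asymmetry is the whole point: Base gives you CFG and negatives as control surfaces; Turbo trades them away for speed.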

Why Z-Image Turbo is riskier to train on directly

Turbo is distilled for fast steps and “no CFG/negative prompts.” (Hugging Face)
A commonly reported issue is that LoRAs trained directly on Turbo can lose the acceleration behavior—images become blurry at “fast” settings while looking normal at slower settings. (Hugging Face)

If you must train on Turbo anyway, there are two established mitigation paths:

  1. Training adapter / de-distillation approach (Ostris adapter) (Hugging Face)
  2. DistillPatch LoRA to restore fast-step behavior after Turbo LoRA training (Hugging Face)

What I would do for your purpose

If your priority is “fix background + colors” and you want stability

Train on Z-Image (Base) with a dataset that reflects your desired palette/background, then:

  • generate on Base when you want maximum control/quality
  • optionally test the same LoRA on Turbo for speed

This leverages Base’s controllability and avoids Turbo’s distillation fragility. (GitHub)

If your priority is “must run on Turbo at 8–9 steps”

Train for Turbo, but plan the pipeline up front:

  1. Train with a Turbo training adapter (Hugging Face)
  2. Evaluate every checkpoint at Turbo inference settings (few steps, guidance off) (Hugging Face)
  3. If fast-step quality degrades, apply DistillPatch (Hugging Face)
  4. Keep your training images “goal-like but not overcooked” to avoid learning artifacts (GitHub)

Concrete “better approach” patterns (high leverage)

Pattern 1: Separate “subject LoRA” and “look LoRA”

  • Train subject/identity on clean images (neutral look).
  • Train look/palette/background on curated goal images.
  • Use lower strengths and avoid stacking too many LoRAs on Turbo (Turbo is more sensitive). (Hugging Face)

Pattern 2: Make the look toggleable with a trigger token

If you want the palette/background sometimes but not always:

  • Caption the look with a distinct token (e.g., my_grade_style) and keep other captions content-focused.
  • Then you can turn the look on/off per prompt.
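A minimal sketch of Pattern 2’s captioning scheme, assuming a made-up trigger token (`my_grade_style` is hypothetical; pick your own rare token):

```python
def build_caption(content: str, apply_look: bool,
                  look_token: str = "my_grade_style") -> str:
    """Prepend the look trigger only when the image exhibits the look;
    keep the rest of the caption content-focused."""
    parts = ([look_token] if apply_look else []) + [content]
    return ", ".join(parts)
```

At inference, including or omitting the token then acts as the on/off switch for the palette.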

Pattern 3: Don’t “teach” Detail Daemon artifacts

Use Detail Daemon primarily at inference; if you must use it in dataset creation, keep it mild and consistent to avoid baking in HDR/sharpen artifacts. (GitHub)


Summary answers to your questions

1) Is the base output recommended for LoRA training?

  • Only if your LoRA is for subject/concept and you want maximum generality.
  • If your complaint is specifically background + colors, base-like images will tend to preserve the base look.

2) Should you train with the most optimal images for your goals?

  • Yes, if your goal is to change palette/background behavior.
  • Use “optimal” as “desired look, but not overprocessed,” to avoid training in artifacts (especially from aggressive detail enhancement). (GitHub)

Turbo vs Base for your purpose?

  • For reliable look control: train on Z-Image (Base) (better CFG/negative/steps control). (GitHub)
  • For Turbo-at-8-steps delivery: train with a Turbo training adapter and expect to use DistillPatch if acceleration quality breaks. (Hugging Face)

Thank you very much. The base model is very inconsistent and produces poor-quality images, or perhaps I’m not getting the desired results because I’m applying the same prompting technique to Turbo as I did to the original. In Turbo, the face changed significantly when the scene changed, so I created one portrait and then converted it into 15 different portraits from various angles using Nano Banana. I trained a Face LoRA with it, and now I can produce images with more consistent faces, almost exactly as I want, in Turbo. My goal, as you suggested, is to produce 40 images with different locations, lighting, and elements, then train a new LoRA with them. Currently, with a 1e-4 LR, rank 16, no quantization, and BF16 output, I get a sweet spot at 2400-3000 of a 3k-step run. I previously tested with a 4k-step run, and the sweet spots were 2200 and 2600. I want to continue with this approach for now.

However, I still have some questions. For 40 images, is rank 16 ideal, or should I choose 32 (I’ll choose FP32 for quality anyway)? Since all parameters affect each other, finding the optimal combination is really difficult. Will 3-4k steps be enough when I go up to 40 images? Should I use a 1280 base resolution or 1536 when creating the images? (Ostris has a 1536 resolution option.) If I increase the rank, I think I’ll need to adjust the LR to 5e-5.

To be honest, it’s a very challenging process for someone who wants to achieve perfection.


There’s no absolute right answer for training parameters by nature…
It becomes a matter of choosing what kind of LoRA (model) behavior is likely to emerge. And often, you can’t say for sure without actually going through fine-tuning…

The inconsistency in prompts between Turbo and Base seems to be because their prompt quirks are completely different…


Why Z-Image (Base) looks “inconsistent” for you (likely cause)

Z-Image Turbo and Z-Image Base want different prompting and sampling behaviors:

  • Turbo is a few-step distilled model and “does not use negative prompts at all” in the intended inference setup. (Hugging Face)
  • Base is meant to be run with CFG (guidance scale ~3–5), more steps (28–50), and negative prompts strongly recommended. (GitHub)

So if you applied “Turbo-style prompting” (short prompt, no negatives, guidance ~0) to Base, Base can look sloppy or inconsistent; and if you apply “Base-style prompting” (strong negatives + higher guidance) to Turbo, Turbo can drift (including face changes) because it’s not operating how it was distilled to operate. (Hugging Face)

Your “face drifts when the scene changes” observation is very consistent with Turbo’s few-step behavior, and using a face LoRA to stabilize identity is a reasonable workaround.


Your current plan (face LoRA → generate 40 diverse images → train a second LoRA) is valid, but has one big constraint

Turbo tends to degrade when stacking multiple LoRAs. This is widely reported in community testing and discussion. (Hugging Face)

That means your plan is actually strategically good because it can produce a single “combined” LoRA (identity + environment handling) so you don’t have to stack a face LoRA + a scene/style LoRA at inference.

The main risk is feedback-loop learning: if your 40 images are all synthetic outputs from the same pipeline, the second LoRA can learn the pipeline’s artifacts and biases very strongly. The fix is simple: curate hard, and force real diversity (lighting, lenses, backgrounds, compositions, distances, expressions).


The key hyperparameters you asked about

1) Rank 16 vs 32 for a 40-image dataset

Rule of thumb:

  • Rank 16: best default for identity consistency + prompt flexibility, lower overfit risk.
  • Rank 32: use when you specifically need micro-detail capture (skin texture, subtle facial structure across angles, fine accessories), and you have enough variety in the dataset to avoid memorization.

Community experience is mixed: some people find rank 16 “enough,” others report rank 32+ helps photorealistic micro-detail. Treat those reports as anecdotal, but directionally useful. (Reddit)

Recommendation for your 40-image “multi-scene” run

  • Start with rank 16 for the first full run.
  • Only move to rank 32 if you can point to a consistent failure mode that looks like “capacity” (e.g., face loses specific traits across angles, fine facial geometry collapses, small defining details disappear).

2) Will 3–4k steps be enough when moving to 40 images?

Steps depend on batch size/repeats, but you can reason with “steps per image”:

  • If (roughly) batch=1 and repeats=1:

    • 3,000 steps on 15 images ≈ 200 updates per image
    • 3,000 steps on 40 images ≈ 75 updates per image

So if your “sweet spot” previously appeared around 2,400–3,000 steps on a smaller set, the same “training intensity” on 40 images will usually occur later (often ~6k–8k steps).

This matches common Turbo LoRA baselines that treat ~2,500–3,000 as a starting point for small datasets, not a hard universal number. (Hugging Face)

Practical recommendation

  • For 40 images, plan a run to 8,000 steps, but save checkpoints at (for example) 2k / 4k / 6k / 8k and pick the best by consistent evaluation prompts.
  • If you see overdominance early (everything looks like your dataset no matter the prompt), stop earlier and/or reduce LR.
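The steps-per-image reasoning above can be written as a back-of-envelope planner (a heuristic only; the function name and the 200-updates default are assumptions drawn from the numbers in this thread, and checkpoint evaluation still decides the real stopping point):

```python
def plan_total_steps(num_images: int, updates_per_image: int = 200,
                     batch_size: int = 1, repeats: int = 1) -> int:
    """Keep 'effective updates per image' roughly constant as the
    dataset grows. Rough planning aid, not a guarantee."""
    effective_images = num_images * repeats
    return (effective_images * updates_per_image) // batch_size
```

This reproduces the arithmetic above: 15 images at ~200 updates/image suggests ~3,000 steps, and 40 images at the same intensity suggests ~8,000.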

3) 1280 vs 1536 resolution for creating/training images

There are two separate issues here:

(A) What Turbo was “in-domain” for
A Tongyi discussion on resolution/latent sizing strongly implies staying within a domain like 768–1280 around the “1024 grid,” and points to predefined resolution choices used in their app. (Hugging Face)

(B) What tooling exposes
Many workflows expose resolution presets including 1024 / 1280 / 1536. (GitHub)

What this means for you

  • If your priority is identity stability across scenes, 1280 is the safer high-detail choice.
  • 1536 can work, but it’s more likely to behave “out-of-domain” unless your whole pipeline (generation, training, and evaluation) is consistently 1536 and your dataset is strong.

Recommendation

  • Generate your 40-image dataset at 1280 (or a mix of 1024 + 1280 buckets).
  • Do final outputs at 1536 using a second pass (upscale/img2img) if needed, rather than forcing the LoRA itself to learn everything at 1536 from only 40 samples.
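A tiny sketch of the bucketing idea, assuming a hypothetical preset list on the 1024/1280 grids (the exact bucket dimensions your trainer supports will differ; this just shows the nearest-aspect-ratio assignment):

```python
# Hypothetical bucket presets; replace with your trainer's actual list.
BUCKETS = [(1024, 1024), (1280, 1280), (1280, 960),
           (960, 1280), (1280, 720), (720, 1280)]

def pick_bucket(width: int, height: int) -> tuple:
    """Assign an image to the preset bucket with the closest aspect ratio."""
    ar = width / height
    return min(BUCKETS, key=lambda b: abs(b[0] / b[1] - ar))
```

Mixing 1024 and 1280 buckets this way lets the LoRA see both scales without forcing everything to 1536.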

4) Learning rate changes if you increase rank

A very common safe band for diffusion LoRA LR is 1e-4 to 5e-5. (Hugging Face)

If you go from rank 16 → 32, reducing LR (e.g., 1e-4 → 5e-5) is a reasonable safety move because you’re increasing capacity and the training can become more “aggressive.”
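One way to encode that safety move is an inverse-with-rank heuristic, which matches the 1e-4 @ r16 → 5e-5 @ r32 example above. This is a rule of thumb I’m stating as an assumption, not an established scaling law:

```python
def lr_for_rank(base_lr: float, base_rank: int, new_rank: int) -> float:
    """Heuristic: scale LR down proportionally as rank (capacity) goes up.
    Matches 1e-4 at rank 16 -> 5e-5 at rank 32. Tune per run."""
    return base_lr * base_rank / new_rank
```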

5) FP32 output for “quality”

If your training is actually happening in bf16/fp16 (which is typical in these toolchains), saving the LoRA as fp32 does not recover lost precision; it mostly stores already-rounded values in a larger container. (Reddit)

BF16 is generally considered a strong default for stability/efficiency, and FP32 is rarely the deciding factor for visual quality compared to dataset quality and stopping at the right step. (RunComfy)

Recommendation

  • Keep bf16 unless you have a concrete numerical instability problem you can reproduce.

Turbo-specific training stability: do this if you want to preserve “fast-step Turbo quality”

Training directly on Turbo can break the acceleration behavior; this is why “training adapter / de-distillation” and “DistillPatch” approaches exist.

  • Ostris’ training adapter is explicitly designed for fine-tuning Turbo without collapsing the distilled behavior. (Hugging Face)
  • DiffSynth’s DistillPatch page describes multiple schemes and recommends a strategy that preserves speed while keeping training simpler. (Hugging Face)

If your LoRA starts looking fine at 20–30 steps but falls apart at 8–9 steps, that’s a classic sign you need one of these “preserve acceleration” approaches. (Hugging Face)


A concrete “good solution” configuration for your 40-image run

Baseline run (most likely to succeed)

  • Base: Turbo + training adapter (to keep distillation intact) (Hugging Face)
  • Rank: 16
  • LR: 1e-4 (or 7e-5 if you want safer identity) (Hugging Face)
  • Steps: train to 8,000; evaluate checkpoints at 2k/4k/6k/8k
  • Resolution buckets: 1024 + 1280 (avoid forcing 1536 with only 40 images unless necessary) (Hugging Face)
  • Prompting for dataset creation: remember Turbo ignores negative prompts; encode constraints directly in the positive prompt. (Hugging Face)

“Higher capacity” run (only if baseline lacks detail)

  • Rank: 32
  • LR: 5e-5 (Hugging Face)
  • Steps: still 8,000 (but watch overdominance)

Why this feels so hard (and how to make it tractable)

Hyperparameters interact, but you can make the search manageable by fixing everything except one axis at a time:

  1. Fix dataset + captions → tune steps (pick the best checkpoint).
  2. Only if needed, try rank 32 at a lower LR.
  3. Only if needed, revisit resolution (1280 vs 1536).

If you try to change rank/LR/steps/resolution together, it becomes almost impossible to tell what helped.


One last warning specific to your “generate 40 images, then train”

Because you’re using generated images as training targets, you should deliberately inject variation so the second LoRA learns identity rather than “that one portrait’s hidden artifacts”:

  • vary camera distance (close/medium/full)
  • vary lighting type (softbox, daylight, tungsten, neon)
  • vary background complexity (simple → cluttered)
  • include some “hard cases” (profile view, harsh shadows, backlight)

This reduces memorization and improves the “face stays the same when the scene changes” property you care about.
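The variation checklist above can be turned into a shot list mechanically; a sketch using made-up axis values (swap in your own locations and lighting setups):

```python
import itertools
import random

# Variation axes from the checklist above (values are examples).
DISTANCES = ["close-up", "medium shot", "full body"]
LIGHTING = ["softbox", "daylight", "tungsten", "neon"]
BACKGROUNDS = ["plain studio", "cafe interior", "city street", "gym", "park"]

def sample_shot_list(n: int = 40, seed: int = 0):
    """Sample n unique (distance, lighting, background) combinations,
    so every training image differs on at least one axis."""
    combos = list(itertools.product(DISTANCES, LIGHTING, BACKGROUNDS))  # 60
    random.Random(seed).shuffle(combos)
    return combos[:n]
```

Drawing from the full cross-product (rather than varying one axis at a time) is what prevents accidental correlations like “neon always means close-up.”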


I can’t thank you enough for your detailed explanations. I really appreciate it. I’m going to follow these steps exactly now. The Face LoRA is working well; only the ear isn’t closing :smiley: because it was visible in all the images, so I added this and similar things to the captions and will train it again. That way I’ll get rid of this problem. Even at strength 1, it gives the perfect exposure for the composition (except for the ear, which requires a very specific prompt to close :D).

So, the Face LoRA seems to be working, hypothetically. Now, the next step is to create 40 images with real variety. But before doing that, I want to train a Look LoRA, as you suggested. I’ve collected many images from Unsplash: travel, gym, cafe, streets, etc. They are all very high quality and high resolution, but I think I’ll need to resize them a bit first.

So, what are your suggestions for the Look LoRA? Since there will be a lot of detail here, and it won’t be learning just one subject, I think rank 32 or higher would be logical. I currently have 60 images. Do you think this is enough to train a Look LoRA? I’m leaving a screenshot below. What are my chances of getting sharp images with the LoRA I’ll be creating from these images? I specifically tried to select images where a person might be present.


I also trained a LoRA with the base model; now I’m going to try it with Turbo.


Claude also suggested that applying img2img with ZIT to the Nano Banana outputs might be a good idea. But I think the images are good: high-quality, realistic, and consistent (except for the watermark, which I remove with another tool). Do you think it’s necessary?


Meanwhile, I think training can be done on the base model and then Turbo can be used for inference. But as you said, 3k steps were insufficient; more are needed.


Using the capabilities and knowledge of high-end commercial AI for training open-source models is generally a good idea, not just for Nano Banana.


What a “Look LoRA” can and can’t do for Z-Image Turbo

A Look LoRA is best at learning global visual biases: color palette, contrast curve, lighting mood, lens/render “feel,” background complexity preference. It is not a reliable way to “upgrade sharpness” beyond what the base model + sampling can already produce—unless the base model is already capable and your LoRA is mainly nudging it toward consistently sharp photographic structure.

With Z-Image Turbo, keep in mind:

  • Turbo is a few-step distilled model that does not use negative prompts in the intended setup. (Hugging Face)
  • Training LoRAs directly on Turbo can cause loss of the “8–9 step” behavior (blurry at fast settings) unless you use mitigation like a training adapter or a distillation “patch” approach. (Hugging Face)

This affects your “chances of sharp images” more than rank alone.


Dataset: are ~60 Unsplash photos enough?

Yes—if the look is coherent

60 high-quality photos is enough for one coherent look (e.g., “clean editorial travel photography, natural skin, balanced contrast, slightly warm highlights”). If your set spans many unrelated aesthetics (harsh HDR city night + soft pastel cafe + gritty gym flash + cinematic teal/orange), the LoRA tends to learn a blurry average of the styles and may feel like it “does nothing” or introduces instability.

What your selection is doing right

You intentionally included scenes where people might be present. That is good if your Look LoRA must preserve:

  • skin rendering
  • exposure on faces across lighting types
  • “person in scene” realism

What to improve before training

  • Remove any images with logos/text/signage dominance or remaining watermarks (even small ones). Turbo follows written instructions unusually well, but a Look LoRA trained on text-heavy photos can accidentally increase “text-like artifacts.”
  • Remove duplicates / near-duplicates (same place, same light, same composition).
  • Ensure you have lighting diversity (daylight shade, indoor tungsten, mixed neon, backlight), but keep the grade consistent.

Rank: 16 vs 32 for a Look LoRA (and why “higher rank” is not automatically better)

Rank is capacity. Higher rank can learn more variation, but it also makes it easier to learn unwanted correlations and artifacts. Diffusers’ LoRA config explicitly treats r (rank) as a core capacity parameter. (Hugging Face)

Recommended starting point for your case

  • Rank 16 is usually the best first run for a Look LoRA with ~60 images.

  • Move to Rank 32 only if you can clearly describe a “capacity failure,” such as:

    • the look doesn’t “take” unless you crank LoRA strength high (and then it breaks identity),
    • fine photographic cues aren’t learned (consistent exposure/contrast behavior across conditions),
    • the LoRA learns the palette but not the lighting behavior.

Why not start at 32?
With only 60 images, Rank 32 can more easily overfit on specific scene content (“this kind of street = this color cast”) instead of learning a transferable grade.

Practical rule for your pipeline

  • Look LoRA: r=16 first
  • Face/identity LoRA: r=16–32 (identity sometimes benefits from more capacity)
  • Final combined LoRA (after you generate the 40 diverse shots): decide based on what fails; often r=16 is still enough.

Steps and LR: what changes when you go from 15 → 60 images

When dataset size increases, a fixed step count means fewer updates per image. So if 3k steps was your “sweet spot” on a smaller set, that does not transfer directly.

A simple planning heuristic:

  • Aim for roughly 100–200 effective updates per image as a starting search band for LoRA-style finetunes (varies by trainer, repeats, batch).
  • For 60 images, that often lands closer to “several thousand” steps, and you should rely on checkpoint evaluation, not a single target number.

Recommendation (works with your “sweet spot hunting” approach)

  • Keep your LR in the conservative band (your idea of dropping LR when increasing rank is reasonable).
  • Train to a higher ceiling (e.g., 8k-ish) but save checkpoints frequently and evaluate them under fixed prompts/settings.

If you increase rank to 32, reducing LR (e.g., toward 5e-5) is a common stability move. (This aligns with typical LoRA guidance that higher capacity usually wants gentler updates.) (Hugging Face)


Resolution: 1280 vs 1536 for Look LoRA training images

Use 1280 (or 1024+1280 buckets) for the Look LoRA dataset

Reasons:

  • With 60 images, you want the LoRA to learn global look cues robustly.
  • 1536 increases detail load and can increase overfitting risk if your dataset doesn’t consistently contain similar “detail distribution” (e.g., always sharp, low noise, no motion blur, consistent lens behavior).
  • You can always render final outputs at higher resolution later.

When 1536 makes sense

Only if:

  • your end goal is consistently 1536,
  • your dataset is consistently sharp at that scale,
  • and you are willing to accept more tuning complexity.

Given your stated goal (look + backgrounds + colors, not microtexture replication), 1280 is the safer choice.


“What are my chances of sharp images?”

High, if these conditions hold:

  1. You preserve Turbo’s fast-step behavior

    • LoRAs trained directly on Turbo can lose acceleration quality (blurry at 8 steps) according to DistillPatch documentation. (Hugging Face)
    • Use a Turbo training adapter if your trainer supports it. (https://cnb.cool)
  2. Your training set is consistently sharp

    • No heavy motion blur, no extreme denoise, no aggressive HDR sharpening.
    • Avoid training on strongly “Detail Daemon–crunched” images: the ComfyUI Detail Daemon port explicitly warns it can become oversharpened/HDR if pushed. (GitHub)
  3. Your captions don’t teach the wrong thing

    • Don’t label content too specifically (you don’t want the LoRA to force “travel/cafe/gym”).
    • Do label the look: lighting mood + color intent + “photorealistic photo / editorial photo / natural skin,” etc.

Conversely, a LoRA can reduce sharpness if it breaks Turbo distillation or if it learns “soft” images as part of the look.


Should you img2img your Nano Banana outputs with Z-Image Turbo before using them?

Not strictly necessary, but sometimes beneficial

When it helps

  • If Nano Banana outputs have consistent “generator fingerprints” (texture patterns, odd skin microdetail, repeated noise structure).
  • If you see “off-manifold” artifacts that Turbo tends to reintroduce or amplify.
  • If you want the training images to sit more cleanly in Turbo’s image manifold, which can make LoRA training more stable.

When it hurts

  • If denoise is too high, it can change identity or reintroduce the drift you already solved.
  • If you do it to all images, you can accidentally make the dataset too homogeneous.

Practical compromise

  • Only do it for images that show artifacts or instability.
  • Keep denoise low (roughly in the “light cleanup” range), and keep the same positive prompt constraints (Turbo ignores negative prompts). (Hugging Face)

A good “Look LoRA” recipe for your exact plan

Step 1 — Define the look in one sentence (critical)

Example:

  • “Clean editorial photography, neutral WB, natural skin, balanced contrast, slightly muted saturation, realistic indoor/outdoor lighting.”

If you can’t summarize the look, the dataset is probably too mixed.

Step 2 — Curate the 60 → ~40–60 “coherent” subset

  • Keep variety in location + lighting, but keep color science consistent.
  • Ensure at least ~20–30 images have a clearly visible person/face if you care about exposure on skin.

Step 3 — Train Look LoRA on Turbo with distillation-safe tooling

  • Prefer using the training adapter path when available. (https://cnb.cool)
  • If you later notice “looks fine at 30 steps but blurry at 8–9,” that’s exactly the failure DistillPatch describes. (Hugging Face)

Step 4 — Use Look LoRA to generate your 40-shot matrix images

  • Apply Face LoRA + Look LoRA together only to generate the 40 images (offline).
  • Then train a final combined LoRA so you don’t need to stack LoRAs at inference (Turbo can be sensitive to stacking).

Step 5 — Keep NAG/Detail Daemon as optional finishing, not baked-in

Turbo doesn’t use negative prompts, so any “don’t do X” must be phrased as positive constraints in your prompt. (Hugging Face)
Detail Daemon can help detail, but don’t overuse it in training data or you risk teaching the HDR/oversharpen signature. (GitHub)


Turbo-friendly Look prompt template (for sampling while building your 40 images)

Use one base template and change only {location}, {lighting}, {angle}:

{trigger_face}, {trigger_look}, photorealistic editorial portrait photo,
identity consistent face, natural skin texture, sharp focus on eyes,
balanced exposure, realistic shadows, clean color science, neutral white balance,
{lighting}, {angle}, {location},
high detail, clean edges, no text, no watermark, no logo, no extra faces

Turbo expects you to put constraints in the positive prompt, because negatives aren’t used. (Hugging Face)
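The template above can be filled programmatically so only the three variable slots change between shots. A minimal sketch (the function name is mine; the string mirrors the template verbatim):

```python
def build_prompt(trigger_face: str, trigger_look: str,
                 lighting: str, angle: str, location: str) -> str:
    """Fill the Turbo-friendly template; all constraints live in the
    positive prompt because Turbo does not use negatives."""
    return (
        f"{trigger_face}, {trigger_look}, photorealistic editorial portrait photo, "
        "identity consistent face, natural skin texture, sharp focus on eyes, "
        "balanced exposure, realistic shadows, clean color science, neutral white balance, "
        f"{lighting}, {angle}, {location}, "
        "high detail, clean edges, no text, no watermark, no logo, no extra faces"
    )
```

Holding the fixed portion constant across all 40 shots keeps the evaluation fair: any drift you see is then attributable to the variable slots, not prompt wording.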


If you want one clear recommendation set to start

  • Look LoRA dataset: 40–60 curated images (from your 60), coherent grade, watermark-free.
  • Resolution: 1280 (or 1024+1280 buckets).
  • Rank: start 16; only try 32 if you see a specific capacity shortfall.
  • Turbo training stability: use the Turbo training adapter if possible; be ready to apply a distillation “patch” approach if 8-step sharpness collapses. (https://cnb.cool)
  • Img2img on Nano Banana outputs: optional; do it only for problematic images, low-denoise.

Hopefully this title will rank highly on Google because we’ve discussed everything necessary throughout this process and provided very detailed links.


Great question! For LoRA training on Z-Image Turbo, I’d recommend using the base output as your training foundation.

Why base output works better:

  1. The model has already learned the optimal latent representations for image generation
  2. Training on post-processed images can introduce artifacts that the model wasn’t trained on
  3. If you want color adjustments, you can apply them during inference rather than training

Alternative approach:
If you specifically need the color-adjusted outputs in your LoRA, consider:

  • Fine-tuning on a small dataset of your preferred style (10-20 images)
  • Using IP-Adapter or ControlNet for structural guidance while keeping your color preferences
  • Training a separate color correction LoRA and blending during inference

The key insight is that a LoRA learns a “difference” from the base model, so if you train on color-adjusted images, the LoRA will learn that adjustment as a standing offset from the base look. Hope this helps!


Thanks for the reply. However, there are no IP-Adapters or similar solutions for ZIT yet. I’m currently researching trigger words. Also, when exactly is a LoRA considered to have learned? When it responds to every prompt that includes a female description and a trigger word, or when it responds to a neutral prompt that uses the trigger word without specifying gender?


@John6666 When do we assume a LoRA has learned? When it responds to gender-specific prompts, or when it responds to neutral prompts as well?


The extent to which you should settle on a LoRA with strong influence as a result of fine-tuning will likely come down to personal preference and intended use. Taking checkpoints during training is probably a good idea for reliability. (Train them and choose later.)


In your case, the real test is neutral prompts, not only gender-specific prompts

A LoRA is best judged by validation prompts across checkpoints, not by one “lucky” prompt. The standard training workflow is to save intermediate checkpoints and compare them with a fixed validation prompt or prompt set. (Hugging Face)

For your Face LoRA

I would treat it as “learned” when:

  • with your trigger token and a short neutral prompt, it still gives the same face reliably
  • the face stays stable across scene changes, lighting changes, and camera angle changes
  • you do not need one very specific gendered prompt to “unlock” the correct face

So, for your case:

  • Good sign: trigger + portrait in a cafe still gives the same person
  • Better sign: trigger + person in a cafe / gym / street / travel scene still gives the same person
  • Weak sign: it only works when you say something like “beautiful woman, female portrait…” every time

If it only responds correctly to gender-specific prompts, that usually means the LoRA has learned the caption pattern too narrowly, not the identity robustly.


What neutral prompts should do

Because your Face LoRA represents a specific face, a neutral prompt should still produce that person’s default presentation.
It does not need to become gender-ambiguous.

That means:

  • Neutral prompt should still work
  • Matching gendered prompt should also work
  • Opposite / contradictory gender prompt is not the main success test

In other words, if the subject is female-presenting, then:

  • trigger, portrait photo → should work
  • trigger, woman in streetwear → should work
  • trigger, man in streetwear → not an important test, and if it fully obeys that while losing identity, that can actually mean the LoRA is too weak

The practical rule

For a Face LoRA, call it “learned enough” when:

It works with:

  1. trigger only + simple neutral prompt
  2. trigger + different scenes
  3. trigger + different lighting / angles

If it passes those, then the LoRA has learned the face in a useful way.

For a Look LoRA, call it “learned enough” when:

A neutral content prompt starts to inherit the color / lighting / background mood of your target look, while still remaining editable.


The failure signs

Your LoRA is probably not learned cleanly yet if:

  • it only works with one very specific phrasing
  • it breaks when the scene changes
  • it needs repeated gender wording to stay on-model
  • it ignores simple neutral prompts

Your LoRA is probably overfit if:

  • every prompt collapses to the same framing / expression / lighting
  • the model starts forcing the same traits everywhere
  • it becomes less responsive to normal prompt changes

Overfitting and “language drift” are well-known concerns in image finetuning, which is why prior-preservation style techniques exist in related workflows. (Hugging Face)


For Z-Image Turbo specifically

Because Turbo is a few-step distilled model and does not use negative prompts in the intended setup, your evaluation should be done with the same Turbo-style inference setup you actually plan to use. (Hugging Face)

So the best test is not:

  • “Does it work under one carefully engineered prompt?”

The best test is:

  • “Does it still hold identity under my normal Turbo workflow, with simple positive prompts, across multiple scenes?”

Bottom line

For your case:

  • Gender-specific prompts working = good, but not enough
  • Neutral prompts also working = stronger sign the LoRA has really learned the face
  • The best standard is identity stability under neutral prompts across scene changes

That is the checkpoint where I would say: the LoRA has actually learned what you wanted.

1 Like

Thank you very much.

I concluded from my tests that caption structure significantly impacts the learning process. With a very small dataset for a single Face LoRA, as you mentioned, overly detailed captions make the model dependent on them, so in such cases the caption dropout (CDO) rate should be increased slightly. With completely caption-less training, on the other hand, the LoRA becomes entirely locked onto the trigger word. So a middle ground is needed.

If you only increase the steps, it overfits. For example, to get the ideal result with a dataset of 15–20 images, you should schedule 4000–5000 steps; if you can’t get there, there’s a problem with your caption structure or the dataset. Scheduling more total steps increases the learning speed; I can say this clearly. That way you can reach the sweet spot around 3k steps. If you start with a low step count and then extend it, I’ve found that it often results in regression, so starting with a high step count yields the ideal result (at least in my tests).
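The caption-dropout middle ground described above can be sketched like this. The trigger token and the 20% dropout rate are illustrative assumptions, not recommended values:

```python
import random

TRIGGER = "ohwx_person"  # hypothetical trigger token

def apply_caption_dropout(caption, p_drop, rng):
    """With probability p_drop, replace the full caption with the trigger
    word alone: detailed captions most of the time so the LoRA stays
    promptable, trigger-only some of the time so the identity binds to
    the trigger instead of to the caption pattern."""
    if rng.random() < p_drop:
        return TRIGGER
    return caption

rng = random.Random(42)
caption = f"{TRIGGER}, close-up, neutral expression, studio lighting"
out = [apply_caption_dropout(caption, p_drop=0.2, rng=rng) for _ in range(1000)]
drop_rate = out.count(TRIGGER) / len(out)
assert 0.15 < drop_rate < 0.25  # roughly 20% collapse to the trigger alone
```

Setting `p_drop` to 0 reproduces the caption-dependent failure mode, and 1.0 reproduces the trigger-locked caption-less mode; the useful range lies in between.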

4k steps
For example, in one dataset the first full-body samples were male; the first clear female figure appeared at 1200 steps, and it started to resemble my character at 2400–2600 steps. In another dataset, the character still hadn’t appeared at 3800 steps. I think the reason is the visuals (the caption structure is exactly the same): the first dataset mainly contains face-only shots and a few upper-body images, while the second generally shows below the shoulders. So if you are going to train a Face LoRA using the method I used, your images should definitely be only the face and neck from various angles, framed as closely as possible. This increases the learning speed and prevents overfitting.

Also, there should definitely be 3–5 upper-body images (I’m saying this for a Face LoRA; a character LoRA should already have a lot of variety), otherwise you will never see your character when you want a full-body image. Also, based on the results from my previous full-character and current Face LoRA tests, I can clearly say that the most consistent caption structure is exactly as follows.

[trigger_word], [framing], [pose/action], [facial expression], [clothing short], [accessories (if any)], [location], [lighting type], [background]

When you apply this caption structure, even if your Face LoRA dataset is all studio shots (which I think is more effective), you can get output in any environment you want: because you describe everything you don’t want it to learn in the captions, the LoRA adapts only to the identity.
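As an illustration, a small helper can assemble captions in exactly that field order, skipping empty fields. The trigger token and field values here are made up:

```python
def build_caption(trigger, framing, pose, expression, clothing,
                  accessories=None, location=None, lighting=None,
                  background=None):
    """Assemble a caption in the field order:
    [trigger_word], [framing], [pose/action], [facial expression],
    [clothing short], [accessories], [location], [lighting type],
    [background]. Empty (None) fields are simply omitted."""
    fields = [trigger, framing, pose, expression, clothing,
              accessories, location, lighting, background]
    return ", ".join(f for f in fields if f)

caption = build_caption(
    trigger="ohwx_person",
    framing="close-up portrait",
    pose="facing the camera",
    expression="soft smile",
    clothing="white t-shirt",
    lighting="soft studio lighting",
    background="plain gray background",
)
assert caption == ("ohwx_person, close-up portrait, facing the camera, "
                   "soft smile, white t-shirt, soft studio lighting, "
                   "plain gray background")
```

Keeping the field order rigid across the whole dataset is what lets the model separate the identity (always first, always present) from the scene attributes (always varying).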

I can’t say I’m very good at explaining and summarizing what I’ve learned, but these are my conclusions. If anyone needs help with this process in the future, I can give better answers and conclusions in a Q&A session.

I think this discussion will shed light on this for many people.

I also agree with John, because when you describe a woman, the LoRA already gives you what you want at 1k steps. But when you want a scene where she’s angry, for example, you see a man, because the LoRA hasn’t learned that your character can be angry, and I think the base model predominantly associates that expression with men. In short, if you want to be sure the LoRA is fully trained, your sampling structure should definitely be as follows. Yes, taking many samples slows down training, but what do you want: quality, or just getting it over with? The choice is yours.


```yaml
- prompt: "[trigger] in a close-up portrait with a serious intense expression. Hard street lighting casts sharp shadows across the face from the left side. Standing on an urban sidewalk at night with the street visible behind"
- prompt: "[trigger] in a close-up portrait with a soft relaxed expression, her lips slightly parted. She has thick straight bangs cut just above the eyes with hair falling past the ears, framing the face. She is outdoors in a park with green trees and grass stretching behind her. Soft natural daylight wraps evenly around her face."
- prompt: "[trigger] laughing with eyes closed and mouth open in a genuine candid moment. Half body shot, standing in front of a suburban house with a white fence. Bright natural daylight fills the scene."
- prompt: "[trigger] crying with visible tears rolling down her cheeks, her brows furrowed and eyes glistening. Her hair is pulled up in a high messy bun with loose strands falling around the face. Half body shot, she is standing in front of a house with the front porch visible behind her. Soft overcast daylight with muted tones throughout."
- prompt: "[trigger] with yellow hair and round glasses, posing casually for the camera in a half body shot. Looking out over a volcanic landscape with the crater and steam visible in the distance. Wide-angle perspective."
- prompt: "A knee-up shot of [trigger], looking at the camera, wearing a plain white T-shirt and jeans, standing barefoot on a sandy beach. Bright midday sunlight, with the sharp blue ocean and horizon stretching behind."
  height: 1248
  width: 832
- prompt: "A knee-up shot of [trigger], looking at the camera, wearing a yellow floral sundress, standing outdoors in a green park. Her hair is tied back in a long ponytail swaying to one side. Bright natural daylight, with trees and a walking path visible in the background."
  height: 1248
  width: 832
- prompt: "A woman with vibrant dark brown hair playing chess at an outdoor park table. An explosion goes off in the far background with smoke and debris rising into the sky."
- prompt: "A Siberian man wearing traditional fur-lined clothing is crouching in front of an igloo, holding a steaming cup of hot tea with both hands. Snow covers the ground and the cold air is visible."
- prompt: "A native American woman wearing traditional clothing is standing in front of a teepee, holding a bow and arrow. The sun is setting behind her, casting a warm glow on the scene."
```
1 Like

Also, the default settings are generally good. Of course, you may need to change them depending on your dataset and goals, but I think the biggest factors will always be your step count, dataset, and caption structure. As someone who has tested many different settings, no other setting changes the result as much as these three.

1 Like

Yes, I just ran the test again. I updated the crop ratios of the photos showing below the shoulders using Birme and re-trained. I’m now getting the first female image at 2400 steps.

1 Like

I’ve currently trained with 45 images for 4500 steps. But I think it’s learning very erratically; even the Native American character is starting to resemble mine, yet a neutral prompt still gets no response after 3250 steps. I think I should try increasing the CDO. Or should I use DOP? Or are this many images too much for ZIT? Also, should I try training on the base model?

Ah, this process is really difficult; even my hypotheses keep going unconfirmed, with every run giving very different results.

1 Like

I disabled caching of text embeddings and increased the CDO, but nothing changed. Ugh, this is really driving me crazy.

1 Like

I can confirm it. For ZIT training, instead of natural-language flow, use tag-style captions that precisely follow the template I provided above; that’s the format that works best.

1 Like