Continued pretraining of Llama 3-8b on a new language


I'm trying to perform CPT (continued pretraining) of Llama on a new language (the language is similar to Hindi, so some of its tokens are already in the tokenizer). The model's validation loss seems to plateau very early in training: one epoch is around 6k steps, and the validation loss is already at its lowest around step 750.

My dataset is around 100k samples. I'm using LoRA as well.

Apply LoRA

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=128,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj", "lm_head"],
    lora_dropout=0,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
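
As a quick sanity check on this setup (not part of the original code), PEFT's print_trainable_parameters() shows how much of the 8B model the adapters actually cover, which is worth knowing when the goal is to teach a whole new language:

# Prints the count and percentage of trainable parameters for the
# LoRA-wrapped model above (a standard PeftModel method).
model.print_trainable_parameters()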

Here are my training args

from trl import SFTConfig, SFTTrainer

sft_config = SFTConfig(
    learning_rate=1e-3,
    lr_scheduler_type="cosine",

    per_device_train_batch_size=8,
    warmup_ratio=0.05,
    num_train_epochs=5,
    max_grad_norm=1.0,

    logging_steps=250,
    eval_strategy="steps",
    eval_steps=250,
    save_strategy="steps",
    save_steps=500,
    output_dir="./llama-cpt",

    bf16=True,
    fp16=False,
    dataset_text_field="text",

    logging_dir="./logs",
    save_total_limit=2,

    report_to="none",
    max_seq_length=512,

    dataset_num_proc=8,

    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
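
(The early_stopping_callback passed to the trainer below isn't defined in the snippet; it is presumably something along the lines of transformers' EarlyStoppingCallback, with the patience value here being only a placeholder:)

from transformers import EarlyStoppingCallback

# Assumed definition of the callback referenced below; the actual patience
# used in the original run is not shown in the post.
early_stopping_callback = EarlyStoppingCallback(early_stopping_patience=3)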

trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    args=sft_config,
    callbacks=[early_stopping_callback],
)

I've tried different arrangements, like a higher r value, adding the embedding and lm_head modules, different learning rates, etc. (see the sketch below), but the validation loss follows a similar trend: either it stays around this range or around 1.59-1.60.
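
For reference, one common way to handle the embedding and output layers when teaching a new language is to train them in full via modules_to_save rather than giving them low-rank adapters. A minimal sketch of that variant (the r/alpha values are illustrative, not necessarily what I ran):

from peft import LoraConfig

# Sketch: adapt the attention/MLP projections with LoRA, but train the
# embedding matrix and lm_head fully, since the token distribution of the
# target language differs from the pretraining data.
lora_config = LoraConfig(
    r=128,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    modules_to_save=["embed_tokens", "lm_head"],  # fully trained, not low-rank
    lora_dropout=0,
    bias="none",
    task_type="CAUSAL_LM",
)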

Moreover, I've also tried Mistral-7B-v0.1, with the same issues.

I thought it might be that the model is not able to learn because too few tokens cover the new language, so I tried vocab expansion, but I see the same issues.
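
Roughly, the expansion step looks like this (new_tokens is a placeholder for the list of new-language tokens mined from the corpus):

# Add new-language tokens to the tokenizer and resize the embedding matrix.
num_added = tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))
# The new embedding rows are randomly initialized, so embed_tokens and
# lm_head need to be trainable for the expansion to have any effect.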

What else could I try?

Continued pretraining is perhaps more involved than normal fine-tuning…
Here are some resources that may be helpful.