Intended Use

A Hebrew classifier that routes free-form queries to single legal section IDs from Israel's Planning & Building Law (Amendment 116). The model exposes legal sections through free-form Hebrew queries for two primary actors:

  1. Enforcement side – ask "which section applies to violation X?" and get a deterministic section ID for consistent enforcement.
  2. Enforced party – ask "what are my rights after receiving an order/fine/warning?" and get routed to the relevant section from which rights and obligations can be derived.

Recommended use: as a routing head before Retrieval-Augmented Generation (RAG). Generate answers only over retrieved, authoritative sources and include citations. Not legal advice.

Inputs & Outputs

  • Input: UTF-8 Hebrew text (free-form).
  • Output: Section ID string (e.g., 243, 239, general, neutral) via id2label.

Model Details

  • Architecture: BertForSequenceClassification
  • Context window: ~512 tokens (max_position_embeddings=512)
  • Files: weights (.safetensors), config.json (with label mapping), tokenizer files, license.
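
As a quick sanity check, the context window and label mapping can be read directly from the shipped config.json. A minimal sketch (the printed values are illustrative, not the actual mapping):

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("david-di-castro/didibert-enforcement116-he-cls")
print(cfg.max_position_embeddings)      # 512
print(cfg.num_labels)                   # number of section classes
print(list(cfg.id2label.items())[:3])   # e.g. [(0, "243"), (1, "239"), (2, "general")]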

📘 Training Process & Methodology

🔹 Deep Supervised Training

The model was trained through intensive supervised fine-tuning across multiple iterations.
Each round of training was carefully monitored and validated.

🔹 Data Manipulation & Expansion

The Hebrew legal dataset was manipulated and expanded between training rounds:

  • Paraphrasing queries
  • Balancing under-represented sections
  • Enriching edge cases

This iterative process steadily improved the model's accuracy and robustness across training rounds.
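
As an illustration of the balancing step, here is a minimal oversampling sketch; the (text, section_id) example format and the balance helper are hypothetical, not part of this repo:

from collections import Counter
import random

def balance(examples):
    # examples: list of (text, section_id) pairs
    counts = Counter(sec for _, sec in examples)
    target = max(counts.values())
    out = list(examples)
    for sec, n in counts.items():
        pool = [e for e in examples if e[1] == sec]
        out += random.choices(pool, k=target - n)  # duplicate rare sections up to the max count
    return out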

🔹 Evaluation & Feature Reinforcement

After every training cycle:

  • Performance was measured with a confusion matrix and F1 score
  • Weak spots were identified and reinforced
  • Learned representations were refined to improve classification accuracy
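
A minimal evaluation sketch along those lines, assuming y_true / y_pred are integer class ids from a held-out set (the values here are illustrative):

from sklearn.metrics import confusion_matrix, f1_score

y_true = [0, 2, 1, 2, 0]   # gold section ids (illustrative)
y_pred = [0, 2, 2, 2, 0]   # model predictions (illustrative)
print(confusion_matrix(y_true, y_pred))
print(f1_score(y_true, y_pred, average="macro"))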

🔹 RAG Integration (Recommended)

The model outputs a deterministic section ID (e.g., 216).
For production use, it is recommended to pair this classifier with RAG (Retrieval-Augmented Generation):

  • Use the predicted section ID
  • Retrieve the authoritative legal text from an external source
  • Return a complete, trustworthy answer

✅ Result

The outcome is a powerful NLP classification tool:

  • Robust enough for free-form Hebrew legal queries
  • Suitable for large-scale dataset preparation toward future LLM training
  • A practical foundation for building Hebrew Legal NLP pipelines

Role in Large-Scale Hebrew LLM Corpus

This classifier is a central component in constructing a large Hebrew corpus for LLM training.
It provides deterministic routing to canonical section IDs, supports de-duplication and curriculum design, and mitigates Hebrew tokenization pitfalls (RTL marks, zero-width chars, niqqud, clitics) via consistent normalization and section-aware segmentation.


Recommended usage (RAG)

Use with Retrieval-Augmented Generation (RAG).
For consistent answers and to avoid hallucinations:

  1. Classify → section_id with this model.
  2. Retrieve authoritative passages (official sources) using the section_id as a hard filter.
  3. Generate with your Hebrew LLM over the retrieved chunks only, and cite sources.

Why this works: stable routing, Hebrew-aware normalization, and section-aware segmentation reduce noise and keep answers aligned with the law text.
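
A minimal sketch of this flow, assuming a hypothetical corpus index keyed by section ID; the retrieval and generation layers are placeholders for your own stack:

from transformers import pipeline

REPO = "david-di-castro/didibert-enforcement116-he-cls"
clf = pipeline("text-classification", model=REPO, truncation=True)

def route_and_retrieve(query: str, corpus: dict) -> dict:
    # 1) Classify -> deterministic section_id
    section_id = clf(query, max_length=512)[0]["label"]
    # 2) Hard-filter retrieval to passages indexed under that section
    passages = corpus.get(section_id, [])
    # 3) Hand `passages` to your Hebrew LLM for generation, and cite sources
    return {"section_id": section_id, "passages": passages}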


Fine-tuning / PEFT

  • Treat labels as section IDs. To add/modify classes, update config.json:
    • id2label, label2id, and num_labels.
  • You can full-finetune or use PEFT (LoRA) on the classification head.
    • Typical PEFT hints: r = 8–16, alpha = 16–32, dropout ≈ 0.05, LR 2e-4–5e-4, batch 16–64, max seq length 512.
    • Keep most of the encoder frozen; a few epochs usually suffice.
  • Include general / neutral as catch-all classes for out-of-scope inputs.
  • After training, export the new mapping back into config.json.

The examples below demonstrate the model's routing behavior using its shipped weights. Adapting it to broader Israeli enforcement domains is straightforward with PEFT or light fine-tuning.


Attribution & Thanks โ€” HeBERT

This work builds upon the open-source HeBERT model. Huge thanks to the HeBERT team and community 🙏. This repository is not affiliated with the HeBERT authors; any mistakes are mine. HeBERT is released under Apache-2.0, and I keep the same license here. If you extend or fine-tune this model, please retain this acknowledgement.

Usage

Quick example

from transformers import pipeline

REPO = "david-di-castro/didibert-enforcement116-he-cls"
clf = pipeline("text-classification", model=REPO, truncation=True)

out = clf("מפקח נתן לי צו הפסקת עבודה. מה זה אומר?", max_length=512)[0]  # "An inspector gave me a stop-work order. What does it mean?"
print(out["label"], out["score"])  # section id + confidence



More examples

from transformers import pipeline

REPO = "david-di-castro/didibert-enforcement116-he-cls"
clf = pipeline("text-classification", model=REPO, truncation=True)

out = clf("בניה בחוף הים זה חמור?")[0]  # "Is building on the seashore serious?"
section = out["label"]
score   = out["score"]
print(f"section: {section}, score: {score:.4f}")


from transformers import pipeline

REPO = "david-di-castro/didibert-enforcement116-he-cls"
clf = pipeline("text-classification", model=REPO, truncation=True)

out = clf("בניתי ללא היתר וקיבלתי צו הריסה, הפרתי אותו והמשכתי לבנות, הרשות הרסה לי גם את תוספת הבניה, זה חוקי?")[0]  # "I built without a permit and got a demolition order; I violated it and kept building; the authority also demolished my building addition. Is that legal?"
section = out["label"]
score   = out["score"]
print(f"section: {section}, score: {score:.4f}")

Fine-tuning

A) Continue training (single-label)

from transformers import AutoModelForSequenceClassification, AutoTokenizer

REPO = "david-di-castro/didibert-enforcement116-he-cls"
tok  = AutoTokenizer.from_pretrained(REPO)
model = AutoModelForSequenceClassification.from_pretrained(REPO)  # 53 labels as shipped
# Train with HF Trainer as usual (labels are integer class ids)
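
For completeness, a minimal Trainer sketch under the same assumptions; train_ds and eval_ds stand in for your own tokenized datasets with integer "labels", and the hyperparameters are illustrative:

from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="./out",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
)
trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()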

B) Multi-label training (one example can map to multiple sections)

Use binary targets (one-hot vector) and enable the built-in BCE loss:


from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "david-di-castro/didibert-enforcement116-he-cls",
    problem_type="multi_label_classification",
)
# Provide labels as float32 vectors of length num_labels (0/1).




Multi-label inference:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

REPO = "david-di-castro/didibert-enforcement116-he-cls"
tok  = AutoTokenizer.from_pretrained(REPO)
mdl  = AutoModelForSequenceClassification.from_pretrained(REPO).eval()

text = "טקסט בעברית…"  # placeholder: any free-form Hebrew text
with torch.inference_mode():
    b = tok(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
    probs = mdl(**b).logits.sigmoid()[0]
    idxs  = (probs >= 0.5).nonzero().flatten().tolist()  # 0.5 is a starting threshold; tune on validation data
    labels = [mdl.config.id2label[i] for i in idxs]      # id2label keys are ints once the config is loaded
print(labels, probs[idxs].tolist())

C) Change the label set size (add/remove classes)

Keep the pretrained encoder; re-init a new classification head:


from transformers import AutoModelForSequenceClassification

NEW_NUM = 80  # your new class count
model = AutoModelForSequenceClassification.from_pretrained(
    "david-di-castro/didibert-enforcement116-he-cls",
    num_labels=NEW_NUM,
    ignore_mismatched_sizes=True,  # preserves encoder weights, re-inits head
)
# Update mapping for transparency
id2label = {i: f"CLASS_{i}" for i in range(NEW_NUM)}  # replace with your real names
label2id = {v: k for k, v in id2label.items()}
model.config.id2label = id2label
model.config.label2id = label2id

PEFT / LoRA (parameter-efficient finetuning)

LoRA is usually enough to adapt to new enforcement domains while keeping the base stable.

Install:

pip install -U peft transformers accelerate

Wrap the model with LoRA (example targets BERT attention projections):


from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSequenceClassification, AutoTokenizer

REPO = "david-di-castro/didibert-enforcement116-he-cls"
tok  = AutoTokenizer.from_pretrained(REPO)
model = AutoModelForSequenceClassification.from_pretrained(REPO)

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none",
    task_type="SEQ_CLS",  # also keeps the classification head trainable
    target_modules=["query","key","value","dense"]  # typical for BERT
)
model = get_peft_model(model, lora_cfg)

# Train with Trainer as usual. Keep seq length 512; batch 16–64; LR 2e-4–5e-4; a few epochs.

Export the label mapping back into config.json after training:


model.config.id2label = {i: name for i, name in enumerate(your_label_names)}  # your_label_names: list of class names
model.config.label2id = {v: k for k, v in model.config.id2label.items()}
model.config.save_pretrained("./your-finetuned-repo")

Metrics to track

  • Single-label: accuracy, macro/micro F1.

  • Multi-label: example-based F1, macro/micro F1, average precision (mAP). Tune the decision threshold (e.g., 0.3–0.6) on a validation set.
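
A minimal threshold-sweep sketch, assuming val_probs (n_examples × num_labels sigmoid outputs) and val_targets (0/1 matrix) come from your own validation loop:

import numpy as np
from sklearn.metrics import f1_score

best_t, best_f1 = 0.5, 0.0
for t in np.arange(0.30, 0.61, 0.05):
    preds = (val_probs >= t).astype(int)
    score = f1_score(val_targets, preds, average="micro")
    if score > best_f1:
        best_t, best_f1 = float(t), score
print(best_t, best_f1)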
