Intended Use
A Hebrew classifier that routes free-form queries to single legal section IDs from Israel's Planning & Building Law (Amendment 116). This model exposes legal sections through free-form Hebrew queries for two primary actors:
- Enforcement side: ask "which section applies to violation X?" and get a deterministic section ID for consistent enforcement.
- Enforced party: ask "what are my rights after receiving an order/fine/warning?" and get routed to the relevant section from which rights and obligations can be derived.
Recommendation: use as a routing head before Retrieval-Augmented Generation (RAG). Generate answers only over retrieved, authoritative sources and include citations. Not legal advice.
Inputs & Outputs
- Input: UTF-8 Hebrew text (free-form).
- Output: Section ID string (e.g., 243, 239, general, neutral) via id2label.
Model Details
- Architecture: BertForSequenceClassification
- Context window: ~512 tokens (max_position_embeddings=512)
- Files: weights (.safetensors), config.json (with label mapping), tokenizer files, license.
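The shipped label mapping and context window can be inspected directly from the config before running anything else; a minimal sketch (the repo ID is the one used in the examples below):

from transformers import AutoConfig

REPO = "david-di-castro/didibert-enforcement116-he-cls"

# Load only the config to inspect the label mapping and context window.
cfg = AutoConfig.from_pretrained(REPO)
print(cfg.num_labels)                   # number of section classes shipped
print(cfg.max_position_embeddings)      # 512-token context window
print(list(cfg.id2label.items())[:5])   # first few (class id, section ID) pairs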
Training Process & Methodology
Deep Supervised Training
The model was trained through intensive supervised fine-tuning across multiple iterations.
Each round of training was carefully monitored and validated.
Data Augmentation & Expansion
The Hebrew legal dataset was augmented and expanded between training rounds:
- Paraphrasing queries
- Balancing under-represented sections
- Enriching edge cases
This iterative process ensured that the model's performance improved with each round.
Evaluation & Feature Reinforcement
After every training cycle:
- Performance was measured with a confusion matrix and F1 score
- Weak spots were identified and reinforced
- Class representations were refined to improve classification accuracy
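As an illustration of this evaluation loop, a minimal sketch with scikit-learn; the label ids below are placeholders, not real validation results:

from sklearn.metrics import confusion_matrix, f1_score

# Placeholder predictions from a held-out validation split (integer class ids).
y_true = [0, 1, 2, 2, 1]
y_pred = [0, 1, 2, 1, 1]

print(confusion_matrix(y_true, y_pred))           # rows = true sections, columns = predictions
print(f1_score(y_true, y_pred, average="macro"))  # macro-F1 treats every section equally
print(f1_score(y_true, y_pred, average="micro"))  # micro-F1 weights sections by support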
RAG Integration (Recommended)
The model outputs a deterministic section ID (e.g., 216).
For production use, it is recommended to pair this classifier with RAG (Retrieval-Augmented Generation):
- Use the predicted section ID
- Retrieve the authoritative legal text from an external source
- Return a complete, trustworthy answer
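A minimal sketch of that retrieval step, assuming the authoritative section texts live in a local mapping; LAW_SECTIONS and its contents are placeholders, not the real law text:

from transformers import pipeline

REPO = "david-di-castro/didibert-enforcement116-he-cls"
clf = pipeline("text-classification", model=REPO, truncation=True)

# Hypothetical store of authoritative section texts keyed by section ID.
LAW_SECTIONS = {
    "216": "Authoritative text of section 216 goes here...",
    "243": "Authoritative text of section 243 goes here...",
}

def route_and_retrieve(query: str):
    pred = clf(query)[0]                       # {"label": section ID, "score": confidence}
    passage = LAW_SECTIONS.get(pred["label"])  # hard lookup by the predicted section ID
    return pred["label"], pred["score"], passage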
Result
The outcome is a powerful NLP classification tool:
- Robust enough for free-form Hebrew legal queries
- Suitable for large-scale dataset preparation toward future LLM training
- A practical foundation for building Hebrew Legal NLP pipelines
Role in Large-Scale Hebrew LLM Corpus
This classifier is a central component in constructing a large Hebrew corpus for LLM training.
It provides deterministic routing to canonical section IDs, supports de-duplication and curriculum design, and mitigates Hebrew tokenization pitfalls (RTL marks, zero-width chars, niqqud, clitics) via consistent normalization and section-aware segmentation.
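The exact normalization used during training is not published here; the following is an illustrative sketch of the kind of cleanup described above (directionality marks, zero-width characters, niqqud):

import re
import unicodedata

# Assumed pre-processing: strip RTL/LTR and zero-width marks, then niqqud/cantillation.
MARKS = re.compile(r"[\u200b-\u200f\u202a-\u202e\u2066-\u2069\ufeff]")
NIQQUD = re.compile(r"[\u0591-\u05bd\u05bf\u05c1\u05c2\u05c7]")

def normalize_he(text: str) -> str:
    text = unicodedata.normalize("NFC", text)
    text = MARKS.sub("", text)
    text = NIQQUD.sub("", text)
    return re.sub(r"\s+", " ", text).strip()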
Recommended usage (RAG)
Use with Retrieval-Augmented Generation (RAG).
For consistent answers and to avoid hallucinations:
- Classify → section_id with this model.
- Retrieve authoritative passages (official sources) using the section_id as a hard filter.
- Generate with your Hebrew LLM over the retrieved chunks only, and cite sources.
Why this works: stable routing, Hebrew-aware normalization, and section-aware segmentation reduce noise and keep answers aligned with the law text.
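A minimal end-to-end sketch of this flow; retrieve_passages and generate_answer stand in for your retrieval layer and Hebrew LLM and are not part of this repository:

from transformers import pipeline

REPO = "david-di-castro/didibert-enforcement116-he-cls"
clf = pipeline("text-classification", model=REPO, truncation=True)

def answer(query: str, retrieve_passages, generate_answer):
    # 1) Classify: free-form Hebrew query -> canonical section ID.
    section_id = clf(query)[0]["label"]
    # 2) Retrieve: authoritative passages only, hard-filtered by the section ID.
    passages = retrieve_passages(section_id)        # placeholder retriever
    # 3) Generate over the retrieved chunks only, and return citations alongside the answer.
    citations = [p["source"] for p in passages]     # assumes each passage carries its source
    return generate_answer(query, passages), citations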
Fine-tuning / PEFT
- Treat labels as section IDs. To add or modify classes, update config.json: id2label, label2id, and num_labels.
- You can fully fine-tune, or use PEFT (LoRA) while training the classification head.
- Typical PEFT hints: r = 8-16, alpha = 16-32, dropout ≈ 0.05, LR 2e-4 to 5e-4, batch size 16-64, max sequence length 512. Keep most of the encoder frozen; a few epochs usually suffice.
- Include general/neutral as catch-all classes for out-of-scope inputs (a confidence-threshold fallback is sketched below).
- After training, export the new mapping back into config.json.
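A minimal sketch of the general/neutral catch-all at inference time, combined with a confidence threshold; the 0.5 value is an assumption to tune on a validation set:

from transformers import pipeline

REPO = "david-di-castro/didibert-enforcement116-he-cls"
clf = pipeline("text-classification", model=REPO, truncation=True)

OUT_OF_SCOPE = {"general", "neutral"}
THRESHOLD = 0.5  # assumption: tune on held-out data

def route(query: str):
    pred = clf(query)[0]
    if pred["label"] in OUT_OF_SCOPE or pred["score"] < THRESHOLD:
        return None           # out of scope: do not answer from the law sections
    return pred["label"]      # confident section ID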
The examples below demonstrate the modelโs routing power from its pretrained weights. Adapting it to broader Israeli enforcement domains is straightforward with PEFT or light finetuning.
Attribution & Thanks - HeBERT
This work builds upon the open-source HeBERT model. Huge thanks to the HeBERT team and community. This repository is not affiliated with the HeBERT authors; any mistakes are mine. HeBERT is released under Apache-2.0, and I keep the same license here. If you extend or fine-tune this model, please retain this acknowledgement.
Usage
Quick examples
from transformers import pipeline
REPO = "david-di-castro/didibert-enforcement116-he-cls"
clf = pipeline("text-classification", model=REPO, truncation=True)
out = clf("ืืคืงื ื ืชื ืื ืฆื ืืคืกืงืช ืขืืืื. ืื ืื ืืืืจ?", max_length=512)[0]
print(out["label"], out["score"]) # section id + confidence
from transformers import pipeline
clf = pipeline("text-classification",
               model="david-di-castro/didibert-enforcement116-he-cls",
               token="hf_...",  # only needed if the repo requires authentication
               truncation=True)
out = clf("ืืคืงื ื ืชื ืื ืฆื ืืคืกืงืช ืขืืืื. ืื ืื ืืืืจ?")[0]
print(out["label"], out["score"])
from transformers import pipeline
REPO = "david-di-castro/didibert-enforcement116-he-cls"
clf = pipeline("text-classification", model=REPO, truncation=True)
out = clf("ืื ืื ืืืืฃ ืืื ืื ืืืืจ?")[0]
section = out["label"]
score = out["score"]
print(f"section: {section}, score: {score:.4f}")
from transformers import pipeline
REPO = "david-di-castro/didibert-enforcement116-he-cls"
clf = pipeline("text-classification", model=REPO, truncation=True)
out = clf("ืื ืืชื ืืื ืืืชืจ ืืงืืืืชื ืฆื ืืจืืกื ืืคืจืชื ืืืชื ืืืืฉืืชื ืืื ืืช ืืจืฉืืช ืืจืกื ืื ืื ืืช ืชืืกืคืช ืืื ืื,ืื ืืืงื?")[0]
section = out["label"]
score = out["score"]
print(f"section: {section}, score: {score:.4f}")
Fine-tuning
A) Continue training (single-label)
from transformers import AutoModelForSequenceClassification, AutoTokenizer
REPO = "david-di-castro/didibert-enforcement116-he-cls"
tok = AutoTokenizer.from_pretrained(REPO)
model = AutoModelForSequenceClassification.from_pretrained(REPO) # 53 labels as shipped
# Train with HF Trainer as usual (labels are integer class ids)
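Continuing the snippet above, a minimal Trainer sketch with a tiny placeholder dataset; replace it with your labeled Hebrew queries (integer class ids):

from datasets import Dataset
from transformers import Trainer, TrainingArguments

# Tiny placeholder dataset standing in for real labeled queries.
raw_train = Dataset.from_dict({"text": ["example query 1", "example query 2"], "label": [0, 1]})

def tokenize(batch):
    return tok(batch["text"], truncation=True, padding="max_length", max_length=512)

train_ds = raw_train.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="./didibert-continued",   # placeholder output path
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)
trainer = Trainer(model=model, args=args, train_dataset=train_ds)
trainer.train()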
B) Multi-label training (one example can map to multiple sections)
Use binary targets (multi-hot 0/1 vectors) and enable the built-in BCE loss:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(
"david-di-castro/didibert-enforcement116-he-cls",
problem_type="multi_label_classification",
)
# Provide labels as float32 vectors of length num_labels (0/1).
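A minimal sketch of building those multi-hot float vectors; the section indices below are placeholders:

import numpy as np

NUM_LABELS = model.config.num_labels

def to_multi_hot(section_indices):
    # section_indices: the integer class ids that apply to one example (placeholder input).
    vec = np.zeros(NUM_LABELS, dtype=np.float32)
    vec[section_indices] = 1.0
    return vec

print(to_multi_hot([3, 7]))  # an example mapped to two sections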
Multi-label inference:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
REPO = "david-di-castro/didibert-enforcement116-he-cls"
tok = AutoTokenizer.from_pretrained(REPO)
mdl = AutoModelForSequenceClassification.from_pretrained(REPO).eval()
text = "ืืงืกื ืืขืืจืืชโฆ"
with torch.inference_mode():
    b = tok(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
    probs = mdl(**b).logits.sigmoid()[0]
idxs = (probs >= 0.5).nonzero().flatten().tolist()
labels = [mdl.config.id2label[i] for i in idxs]  # id2label keys are ints after loading
print(labels, probs[idxs].tolist())
C) Change the label set size (add/remove classes)
Keep the pretrained encoder; re-init a new classification head:
from transformers import AutoModelForSequenceClassification
NEW_NUM = 80 # your new class count
model = AutoModelForSequenceClassification.from_pretrained(
"david-di-castro/didibert-enforcement116-he-cls",
num_labels=NEW_NUM,
ignore_mismatched_sizes=True, # preserves encoder weights, re-inits head
)
# Update mapping for transparency
id2label = {i: f"CLASS_{i}" for i in range(NEW_NUM)}  # replace with your real section names
label2id = {v: k for k, v in id2label.items()}
model.config.id2label = id2label
model.config.label2id = label2id
PEFT / LoRA (parameter-efficient finetuning)
LoRA is usually enough to adapt to new enforcement domains while keeping the base stable.
Install:
pip install -U peft transformers accelerate
Wrap the model with LoRA (example targets BERT attention projections):
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSequenceClassification, AutoTokenizer
REPO = "david-di-castro/didibert-enforcement116-he-cls"
tok = AutoTokenizer.from_pretrained(REPO)
model = AutoModelForSequenceClassification.from_pretrained(REPO)
lora_cfg = LoraConfig(
r=16, lora_alpha=32, lora_dropout=0.05, bias="none",
target_modules=["query","key","value","dense"] # typical for BERT
)
model = get_peft_model(model, lora_cfg)
# Train with Trainer as usual. Keep seq length 512; batch 16โ64; LR 2e-4..5e-4; a few epochs.
Export the label mapping after training:
model.config.id2label = {i: name_i for i, name_i in enumerate(your_label_names)}
model.config.label2id = {v: k for k, v in model.config.id2label.items()}
model.config.save_pretrained("./your-finetuned-repo")
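After exporting the mapping, the LoRA adapter can be saved on its own or merged back into the base weights; a minimal sketch (output paths are placeholders):

# Save only the LoRA adapter (small; loads on top of the base model).
model.save_pretrained("./didibert-lora-adapter")

# Or merge the adapter into the base weights and save a standalone model.
merged = model.merge_and_unload()
merged.save_pretrained("./didibert-merged")
tok.save_pretrained("./didibert-merged")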
Metrics to track
- Single-label: accuracy, macro/micro F1.
- Multi-label: example-based F1, macro/micro F1, average precision (mAP). Tune the decision threshold (e.g., 0.3-0.6) on a validation set.
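A minimal sketch of that threshold tuning for the multi-label case; the probabilities and targets below are placeholders, not real validation outputs:

import numpy as np
from sklearn.metrics import f1_score

# Placeholder validation outputs: sigmoid probabilities and multi-hot targets.
probs = np.array([[0.9, 0.2, 0.7], [0.1, 0.8, 0.4]])
targets = np.array([[1, 0, 1], [0, 1, 0]])

best = max(
    (f1_score(targets, (probs >= t).astype(int), average="micro"), t)
    for t in np.arange(0.30, 0.61, 0.05)
)
print(f"best micro-F1 {best[0]:.3f} at threshold {best[1]:.2f}")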