Intended Use
A Hebrew classifier that routes free-form queries to single legal section IDs from Israel's Planning & Building Law (Amendment 116). This model exposes legal sections through free-form Hebrew queries for two primary actors:
- Enforcement side: ask "which section applies to violation X?" and get a deterministic section ID for consistent enforcement.
- Enforced party: ask "what are my rights after receiving an order/fine/warning?" and get routed to the relevant section from which rights and obligations can be derived.
Recommendation: use as a routing head before Retrieval-Augmented Generation (RAG). Generate answers only over retrieved, authoritative sources and include citations. Not legal advice.
Inputs & Outputs
- Input: UTF-8 Hebrew text (free-form).
- Output: Section ID string (e.g., 243, 239, general, neutral) via id2label.
Model Details
- Architecture: BertForSequenceClassification
- Context window: ~512 tokens (max_position_embeddings=512)
- Files: weights (.safetensors), config.json (with label mapping), tokenizer files, license.
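The shipped label mapping and context window can be inspected directly from the config before running anything else; a minimal sketch (the repo ID is the one used in the examples below):

from transformers import AutoConfig

REPO = "david-di-castro/didibert-enforcement116-he-cls"

# Load only the config to inspect the label mapping and context window.
cfg = AutoConfig.from_pretrained(REPO)
print(cfg.num_labels)                   # number of section classes shipped
print(cfg.max_position_embeddings)      # 512-token context window
print(list(cfg.id2label.items())[:5])   # first few (class id, section ID) pairs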
Training Process & Methodology
Deep Supervised Training
The model was trained through intensive supervised fine-tuning across multiple iterations.
Each round of training was carefully monitored and validated.
Data Augmentation & Expansion
The Hebrew legal dataset was augmented and expanded between training rounds:
- Paraphrasing queries
- Balancing under-represented sections
- Enriching edge cases
This iterative process ensured that the model's performance improved with each round.
Evaluation & Feature Reinforcement
After every training cycle:
- Performance was measured with a confusion matrix and F1 score
- Weak spots were identified and reinforced
- Class representations were refined to improve classification accuracy
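As an illustration of this evaluation loop, a minimal sketch with scikit-learn; the label ids below are placeholders, not real validation results:

from sklearn.metrics import confusion_matrix, f1_score

# Placeholder predictions from a held-out validation split (integer class ids).
y_true = [0, 1, 2, 2, 1]
y_pred = [0, 1, 2, 1, 1]

print(confusion_matrix(y_true, y_pred))           # rows = true sections, columns = predictions
print(f1_score(y_true, y_pred, average="macro"))  # macro-F1 treats every section equally
print(f1_score(y_true, y_pred, average="micro"))  # micro-F1 weights sections by support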
RAG Integration (Recommended)
The model outputs a deterministic section ID (e.g., 216).
For production use, it is recommended to pair this classifier with RAG (Retrieval-Augmented Generation):
- Use the predicted section ID
- Retrieve the authoritative legal text from an external source
- Return a complete, trustworthy answer
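A minimal sketch of that retrieval step, assuming the authoritative section texts live in a local mapping; LAW_SECTIONS and its contents are placeholders, not the real law text:

from transformers import pipeline

REPO = "david-di-castro/didibert-enforcement116-he-cls"
clf = pipeline("text-classification", model=REPO, truncation=True)

# Hypothetical store of authoritative section texts keyed by section ID.
LAW_SECTIONS = {
    "216": "Authoritative text of section 216 goes here...",
    "243": "Authoritative text of section 243 goes here...",
}

def route_and_retrieve(query: str):
    pred = clf(query)[0]                       # {"label": section ID, "score": confidence}
    passage = LAW_SECTIONS.get(pred["label"])  # hard lookup by the predicted section ID
    return pred["label"], pred["score"], passage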
Result
The outcome is a powerful NLP classification tool:
- Robust enough for free-form Hebrew legal queries
- Suitable for large-scale dataset preparation toward future LLM training
- A practical foundation for building Hebrew Legal NLP pipelines
Role in Large-Scale Hebrew LLM Corpus
This classifier is a central component in constructing a large Hebrew corpus for LLM training.
It provides deterministic routing to canonical section IDs, supports de-duplication and curriculum design, and mitigates Hebrew tokenization pitfalls (RTL marks, zero-width chars, niqqud, clitics) via consistent normalization and section-aware segmentation.
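The exact normalization used during training is not published here; the following is an illustrative sketch of the kind of cleanup described above (directionality marks, zero-width characters, niqqud):

import re
import unicodedata

# Assumed pre-processing: strip RTL/LTR and zero-width marks, then niqqud/cantillation.
MARKS = re.compile(r"[\u200b-\u200f\u202a-\u202e\u2066-\u2069\ufeff]")
NIQQUD = re.compile(r"[\u0591-\u05bd\u05bf\u05c1\u05c2\u05c7]")

def normalize_he(text: str) -> str:
    text = unicodedata.normalize("NFC", text)
    text = MARKS.sub("", text)
    text = NIQQUD.sub("", text)
    return re.sub(r"\s+", " ", text).strip()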
Recommended usage (RAG)
Use with Retrieval-Augmented Generation (RAG).
For consistent answers and to avoid hallucinations:
- Classify → section_id with this model.
- Retrieve authoritative passages (official sources) using the section_id as a hard filter.
- Generate with your Hebrew LLM over the retrieved chunks only, and cite sources.
Why this works: stable routing, Hebrew-aware normalization, and section-aware segmentation reduce noise and keep answers aligned with the law text.
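A minimal end-to-end sketch of this flow; retrieve_passages and generate_answer stand in for your retrieval layer and Hebrew LLM and are not part of this repository:

from transformers import pipeline

REPO = "david-di-castro/didibert-enforcement116-he-cls"
clf = pipeline("text-classification", model=REPO, truncation=True)

def answer(query: str, retrieve_passages, generate_answer):
    # 1) Classify: free-form Hebrew query -> canonical section ID.
    section_id = clf(query)[0]["label"]
    # 2) Retrieve: authoritative passages only, hard-filtered by the section ID.
    passages = retrieve_passages(section_id)        # placeholder retriever
    # 3) Generate over the retrieved chunks only, and return citations alongside the answer.
    citations = [p["source"] for p in passages]     # assumes each passage carries its source
    return generate_answer(query, passages), citations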
Fine-tuning / PEFT
- Treat labels as section IDs. To add or modify classes, update config.json: id2label, label2id, and num_labels.
- You can fully fine-tune, or use PEFT (LoRA) while training the classification head.
- Typical PEFT hints: r = 8-16, alpha = 16-32, dropout ≈ 0.05, LR 2e-4 to 5e-4, batch size 16-64, max sequence length 512. Keep most of the encoder frozen; a few epochs usually suffice.
- Include general/neutral as catch-all classes for out-of-scope inputs (a confidence-threshold fallback is sketched below).
- After training, export the new mapping back into config.json.
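A minimal sketch of the general/neutral catch-all at inference time, combined with a confidence threshold; the 0.5 value is an assumption to tune on a validation set:

from transformers import pipeline

REPO = "david-di-castro/didibert-enforcement116-he-cls"
clf = pipeline("text-classification", model=REPO, truncation=True)

OUT_OF_SCOPE = {"general", "neutral"}
THRESHOLD = 0.5  # assumption: tune on held-out data

def route(query: str):
    pred = clf(query)[0]
    if pred["label"] in OUT_OF_SCOPE or pred["score"] < THRESHOLD:
        return None           # out of scope: do not answer from the law sections
    return pred["label"]      # confident section ID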
The examples below demonstrate the modelโs routing power from its pretrained weights. Adapting it to broader Israeli enforcement domains is straightforward with PEFT or light finetuning.
Attribution & Thanks - HeBERT
This work builds upon the open-source HeBERT model. Huge thanks to the HeBERT team and community. This repository is not affiliated with the HeBERT authors; any mistakes are mine. HeBERT is released under Apache-2.0, and I keep the same license here. If you extend or fine-tune this model, please retain this acknowledgement.
Usage
Quick examples
from transformers import pipeline
REPO = "david-di-castro/didibert-enforcement116-he-cls"
clf = pipeline("text-classification", model=REPO, truncation=True)
out = clf("ืืคืงื ื ืชื ืื ืฆื ืืคืกืงืช ืขืืืื. ืื ืื ืืืืจ?", max_length=512)[0]
print(out["label"], out["score"]) # section id + confidence
from transformers import pipeline
clf = pipeline("text-classification",
               model="david-di-castro/didibert-enforcement116-he-cls",
               token="hf_...",  # only needed if the repo requires authentication
               truncation=True)
out = clf("ืืคืงื ื ืชื ืื ืฆื ืืคืกืงืช ืขืืืื. ืื ืื ืืืืจ?")[0]
print(out["label"], out["score"])
from transformers import pipeline
REPO = "david-di-castro/didibert-enforcement116-he-cls"
clf = pipeline("text-classification", model=REPO, truncation=True)
out = clf("ืื ืื ืืืืฃ ืืื ืื ืืืืจ?")[0]
section = out["label"]
score = out["score"]
print(f"section: {section}, score: {score:.4f}")
from transformers import pipeline
REPO = "david-di-castro/didibert-enforcement116-he-cls"
clf = pipeline("text-classification", model=REPO, truncation=True)
out = clf("ืื ืืชื ืืื ืืืชืจ ืืงืืืืชื ืฆื ืืจืืกื ืืคืจืชื ืืืชื ืืืืฉืืชื ืืื ืืช ืืจืฉืืช ืืจืกื ืื ืื ืืช ืชืืกืคืช ืืื ืื,ืื ืืืงื?")[0]
section = out["label"]
score = out["score"]
print(f"section: {section}, score: {score:.4f}")
Fine-tuning
A) Continue training (single-label)
from transformers import AutoModelForSequenceClassification, AutoTokenizer
REPO = "david-di-castro/didibert-enforcement116-he-cls"
tok = AutoTokenizer.from_pretrained(REPO)
model = AutoModelForSequenceClassification.from_pretrained(REPO) # 53 labels as shipped
# Train with HF Trainer as usual (labels are integer class ids)
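Continuing the snippet above, a minimal Trainer sketch with a tiny placeholder dataset; replace it with your labeled Hebrew queries (integer class ids):

from datasets import Dataset
from transformers import Trainer, TrainingArguments

# Tiny placeholder dataset standing in for real labeled queries.
raw_train = Dataset.from_dict({"text": ["example query 1", "example query 2"], "label": [0, 1]})

def tokenize(batch):
    return tok(batch["text"], truncation=True, padding="max_length", max_length=512)

train_ds = raw_train.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="./didibert-continued",   # placeholder output path
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)
trainer = Trainer(model=model, args=args, train_dataset=train_ds)
trainer.train()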
B) Multi-label training (one example can map to multiple sections)
Use binary targets (multi-hot 0/1 vectors) and enable the built-in BCE loss:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(
"david-di-castro/didibert-enforcement116-he-cls",
problem_type="multi_label_classification",
)
# Provide labels as float32 vectors of length num_labels (0/1).
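A minimal sketch of building those multi-hot float vectors; the section indices below are placeholders:

import numpy as np

NUM_LABELS = model.config.num_labels

def to_multi_hot(section_indices):
    # section_indices: the integer class ids that apply to one example (placeholder input).
    vec = np.zeros(NUM_LABELS, dtype=np.float32)
    vec[section_indices] = 1.0
    return vec

print(to_multi_hot([3, 7]))  # an example mapped to two sections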
Multi-label inference:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
REPO = "david-di-castro/didibert-enforcement116-he-cls"
tok = AutoTokenizer.from_pretrained(REPO)
mdl = AutoModelForSequenceClassification.from_pretrained(REPO).eval()
text = "ืืงืกื ืืขืืจืืชโฆ"
with torch.inference_mode():
    b = tok(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
    probs = mdl(**b).logits.sigmoid()[0]
idxs = (probs >= 0.5).nonzero().flatten().tolist()
labels = [mdl.config.id2label[i] for i in idxs]  # id2label keys are ints after loading
print(labels, probs[idxs].tolist())
C) Change the label set size (add/remove classes)
Keep the pretrained encoder; re-init a new classification head:
from transformers import AutoModelForSequenceClassification
NEW_NUM = 80 # your new class count
model = AutoModelForSequenceClassification.from_pretrained(
"david-di-castro/didibert-enforcement116-he-cls",
num_labels=NEW_NUM,
ignore_mismatched_sizes=True, # preserves encoder weights, re-inits head
)
# Update mapping for transparency
id2label = {i: f"CLASS_{i}" for i in range(NEW_NUM)}  # replace with your real section names
label2id = {v: k for k, v in id2label.items()}
model.config.id2label = id2label
model.config.label2id = label2id
PEFT / LoRA (parameter-efficient finetuning)
LoRA is usually enough to adapt to new enforcement domains while keeping the base stable.
Install:
pip install -U peft transformers accelerate
Wrap the model with LoRA (example targets BERT attention projections):
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSequenceClassification, AutoTokenizer
REPO = "david-di-castro/didibert-enforcement116-he-cls"
tok = AutoTokenizer.from_pretrained(REPO)
model = AutoModelForSequenceClassification.from_pretrained(REPO)
lora_cfg = LoraConfig(
r=16, lora_alpha=32, lora_dropout=0.05, bias="none",
target_modules=["query","key","value","dense"] # typical for BERT
)
model = get_peft_model(model, lora_cfg)
# Train with Trainer as usual. Keep seq length 512; batch 16โ64; LR 2e-4..5e-4; a few epochs.
Export the label mapping after training:
model.config.id2label = {i: name_i for i, name_i in enumerate(your_label_names)}
model.config.label2id = {v: k for k, v in model.config.id2label.items()}
model.config.save_pretrained("./your-finetuned-repo")
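After exporting the mapping, the LoRA adapter can be saved on its own or merged back into the base weights; a minimal sketch (output paths are placeholders):

# Save only the LoRA adapter (small; loads on top of the base model).
model.save_pretrained("./didibert-lora-adapter")

# Or merge the adapter into the base weights and save a standalone model.
merged = model.merge_and_unload()
merged.save_pretrained("./didibert-merged")
tok.save_pretrained("./didibert-merged")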
Metrics to track
- Single-label: accuracy, macro/micro F1.
- Multi-label: example-based F1, macro/micro F1, average precision (mAP). Tune the decision threshold (e.g., 0.3-0.6) on a validation set.
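A minimal sketch of that threshold tuning for the multi-label case; the probabilities and targets below are placeholders, not real validation outputs:

import numpy as np
from sklearn.metrics import f1_score

# Placeholder validation outputs: sigmoid probabilities and multi-hot targets.
probs = np.array([[0.9, 0.2, 0.7], [0.1, 0.8, 0.4]])
targets = np.array([[1, 0, 1], [0, 1, 0]])

best = max(
    (f1_score(targets, (probs >= t).astype(int), average="micro"), t)
    for t in np.arange(0.30, 0.61, 0.05)
)
print(f"best micro-F1 {best[0]:.3f} at threshold {best[1]:.2f}")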