---
language:
- ko
- en
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-transformers
- text-embeddings
- retrieval
- mteb
- korean
- multilingual
- e5
pipeline_tag: sentence-similarity
---

# laal-embedding-v0

**laal-embedding-v0** is a Sentence-Transformers embedding model fine-tuned from **`intfloat/multilingual-e5-large-instruct`** for improved **retrieval-oriented semantic search**, with a focus on **Korean fire-safety and legal-domain text**.

* **Base model:** `intfloat/multilingual-e5-large-instruct`
* **Embedding dimension:** 1024
* **Similarity function:** cosine
* **Max tokens:** 512
* **Architecture:** XLM-RoBERTa (24 layers)
* **HF repo:** [https://huggingface.co/jjp97/laal-embedding-v0](https://huggingface.co/jjp97/laal-embedding-v0)

> ⚠️ **Important**
> This model uses **fixed instruction prefixes** defined in `config_sentence_transformers.json`.
> **Always pass raw text to `encode()`**.
> Do **NOT** manually prepend instruction strings.

---

## Prompting (Important)

This model applies different fixed prefixes depending on the input type.

### Query prefix

```
Instruct: Given a web search query, retrieve relevant passages that answer the query.
Query: 
```

### Passage prefix

```
title: none text: 
```

These prefixes are **automatically applied** by Sentence-Transformers via `config_sentence_transformers.json`.

### Correct usage ✅

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jjp97/laal-embedding-v0")

q_emb = model.encode_query("화재 시 대피 방법")
p_emb = model.encode_document("화재가 발생하면 즉시 119에 신고하고 안전한 경로로 대피해야 한다.")
```

### Incorrect usage ❌ (double-prefixing)

```python
# Do NOT do this
q = "Instruct: Given a web search query, retrieve relevant passages...\nQuery: 화재 시 대피 방법"
emb = model.encode(q)
```

---

## Training

### Objective

* Contrastive learning (InfoNCE)
* In-batch negatives
* Temperature (`tau`): **0.05**
* Regularization: **GOR (spread-out loss)**
  * `gor_lambda = 0.001`
  * `gor_max_samples = 64`

A simplified sketch of this objective appears under *Additional examples* below.

### Data

* Training examples: **43,983**
* Format: (query, positive passage)
* Hard negatives: **enabled**
  * `max_hn_per_example_train = 2`

> Training data consists of domain-specific Korean fire-safety and legal documents
> (private / curated dataset).

### Hyperparameters (summary)

* Batch size: **512**
* Epochs: **3**
* Learning rate: **1e-5**
* Warmup ratio: **0.1**
* Approx. total steps: **255**

---

## Model Architecture

This model follows the standard Sentence-Transformers pipeline (a raw-`transformers` equivalent is sketched under *Additional examples* below):

1. **Transformer**: XLM-RoBERTa (24 layers, hidden size 1024)
2. **Pooling**: mean pooling
3. **Normalization**: L2 normalization

---

## Intended Use

* Retrieval and semantic search (RAG pipelines)
* Domain-specific QA (fire safety, legal text)
* Embedding-based similarity and clustering (best performance in retrieval-style settings)

---

## Evaluation

### Sanity check

Query–passage cosine similarity shows reasonable separation between relevant and irrelevant passages.

### MTEB

This model is intended for evaluation on the **MTEB leaderboard**. When reporting results, please specify:

* model name: `jjp97/laal-embedding-v0`
* exact revision (commit hash)
* benchmark suite used

An example evaluation run with the `mteb` package is sketched under *Additional examples* below.

---

## Limitations

* Performance may degrade if instruction prefixes are manually added (double-prefixing).
* Fine-tuned primarily for retrieval; performance on classification/STS tasks may vary.
* Domain bias toward Korean fire-safety / legal text.

---

## License

* Released under **Apache-2.0**, following the base model license.
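---

## Additional examples (sketches)

### Retrieval scoring

The snippet below extends the correct-usage example into a small end-to-end retrieval check: encode one query and a few passages, score them with cosine similarity, and print a ranking. The passages are made-up examples; `encode_query` / `encode_document` and `model.similarity()` assume a recent Sentence-Transformers release that provides these methods.

```python
from sentence_transformers import SentenceTransformer

# Hypothetical corpus; replace with your own passages.
passages = [
    "화재가 발생하면 즉시 119에 신고하고 안전한 경로로 대피해야 한다.",  # call 119 and evacuate via a safe route
    "소화기는 눈에 잘 띄는 곳에 비치해야 한다.",                        # keep extinguishers in visible places
    "전기 요금은 매월 말일에 청구된다.",                                # electricity bills are issued monthly (irrelevant)
]

model = SentenceTransformer("jjp97/laal-embedding-v0")

# Prefixes are applied automatically; pass raw text only.
q_emb = model.encode_query("화재 시 대피 방법")   # shape: (1024,)
p_embs = model.encode_document(passages)          # shape: (3, 1024)

# Embeddings are L2-normalized, so cosine similarity == dot product.
scores = model.similarity(q_emb, p_embs)[0]       # shape: (3,)
for idx in scores.argsort(descending=True):
    i = int(idx)
    print(f"{scores[i].item():.4f}  {passages[i]}")
```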
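### Pooling pipeline with raw `transformers`

For reference, this is roughly what the Sentence-Transformers pipeline above does internally: mean pooling over non-padding tokens followed by L2 normalization. It is a sketch only; when bypassing Sentence-Transformers like this, the fixed prefixes must be added to the text manually, and the exact prefix strings should be taken from `config_sentence_transformers.json`.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Prefixes added manually because Sentence-Transformers is not doing it for us here.
texts = [
    "Instruct: Given a web search query, retrieve relevant passages that answer the query.\nQuery: 화재 시 대피 방법",
    "title: none text: 화재가 발생하면 즉시 119에 신고하고 안전한 경로로 대피해야 한다.",
]

tokenizer = AutoTokenizer.from_pretrained("jjp97/laal-embedding-v0")
model = AutoModel.from_pretrained("jjp97/laal-embedding-v0")

batch = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state          # (batch, seq, 1024)

# Mean pooling over non-padding tokens.
mask = batch["attention_mask"].unsqueeze(-1).float()   # (batch, seq, 1)
pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, 1024)

# L2 normalization, so dot product == cosine similarity.
emb = F.normalize(pooled, p=2, dim=-1)
print(emb @ emb.T)
```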
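### Training objective sketch (InfoNCE + GOR)

A minimal, self-contained sketch of the objective described under *Training*: InfoNCE over in-batch negatives with temperature 0.05, plus a simplified stand-in for the GOR spread-out regularizer (reduced here to the squared mean of non-matching pairwise similarities). The actual training code and the exact GOR formulation may differ.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q_emb: torch.Tensor, p_emb: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """InfoNCE with in-batch negatives.

    q_emb, p_emb: (batch, dim) L2-normalized query / positive-passage embeddings.
    Row i of p_emb is the positive for query i; all other rows act as negatives.
    """
    logits = q_emb @ p_emb.T / tau                       # (batch, batch) similarities / temperature
    labels = torch.arange(q_emb.size(0), device=q_emb.device)
    return F.cross_entropy(logits, labels)

def gor_loss(emb: torch.Tensor, max_samples: int = 64) -> torch.Tensor:
    """Simplified spread-out (GOR) regularizer: push the mean similarity
    of non-matching embedding pairs toward zero."""
    emb = emb[:max_samples]
    n = emb.size(0)
    sim = emb @ emb.T
    mask = ~torch.eye(n, dtype=torch.bool, device=emb.device)
    return sim[mask].mean() ** 2

# Toy usage with random normalized embeddings.
q = F.normalize(torch.randn(8, 1024), dim=-1)
p = F.normalize(torch.randn(8, 1024), dim=-1)
loss = info_nce_loss(q, p, tau=0.05) + 0.001 * gor_loss(p)
```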
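### Running MTEB

One possible way to run the evaluation with the `mteb` package. The task selection below is only an illustration and should be replaced with the benchmark suite you actually report; pin an exact commit hash via `revision=` when reporting results.

```python
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jjp97/laal-embedding-v0")

# Illustrative selection: Korean retrieval tasks.
tasks = mteb.get_tasks(task_types=["Retrieval"], languages=["kor"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/jjp97__laal-embedding-v0")
```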
---

## Acknowledgements

* Base model: **Multilingual E5**, Liang Wang et al., *Multilingual E5 Text Embeddings: A Technical Report*, arXiv:2402.05672
* Sentence-Transformers library

---

## Citation

If you use this model, please cite:

```bibtex
@misc{laal_embedding_v0_2025,
  title        = {laal-embedding-v0},
  author       = {Park, Jeongjae},
  year         = {2025},
  howpublished = {Hugging Face model card},
}
```