---
library_name: transformers
datasets:
- ds4sd/DocLayNet-v1.2
base_model:
- microsoft/layoutlmv3-base
---

# Model Card for kbsooo/layoutlmv3_finetuned_doclaynet

## Model Details

### Model Description

This model is a fine-tuned version of [LayoutLMv3](https://huggingface.co/microsoft/layoutlmv3-base) for token classification on the DocLayNet dataset. It classifies each token in a document image using both textual and layout (bounding-box) information.

- **Developed by:** kbsooo
- **Model type:** LayoutLMv3ForTokenClassification
- **Language(s) (NLP):** Primarily English (the dominant language of DocLayNet documents)
- **License:** See the DocLayNet and LayoutLMv3 licenses
- **Finetuned from model:** microsoft/layoutlmv3-base

### Model Sources

- **Repository:** [Hugging Face Model Hub](https://huggingface.co/kbsooo/layoutlmv3_finetuned_doclaynet)
- **Paper:** [LayoutLMv3 Paper](https://arxiv.org/abs/2204.08387)

## Uses

### Direct Use

This model can be used for:

- Token classification in document images (e.g., identifying headings, paragraphs, tables, images, lists)
- Document understanding tasks where both layout and text information are important

### Downstream Use

- Can be integrated into pipelines for document information extraction
- Useful for document analysis applications such as invoice parsing and form processing

### Out-of-Scope Use

- Not intended for languages or layouts not represented in the DocLayNet dataset
- Not suitable for free-form text without document structure

## Bias, Risks, and Limitations

- The model may misclassify tokens if the document layout or language differs from the training data
- Biases may exist due to the composition of the DocLayNet dataset
- Limited to 10 classes of document layout elements

### Recommendations

- Preprocess documents the same way as during training (tokenization + bounding boxes + image)
- Verify predictions, especially in production or high-stakes scenarios

## How to Get Started with the Model

With `apply_ocr=False` the processor expects words and their bounding boxes (normalized to the 0–1000 range) from your own OCR step; with the default `apply_ocr=True` it runs Tesseract on the image and you pass the image alone.

```python
from transformers import LayoutLMv3ForTokenClassification, AutoProcessor
from PIL import Image
import torch

repo = "kbsooo/layoutlmv3_finetuned_doclaynet"
model = LayoutLMv3ForTokenClassification.from_pretrained(repo)
# apply_ocr=False: we supply words and boxes ourselves (e.g., from an OCR engine)
processor = AutoProcessor.from_pretrained(repo, apply_ocr=False)

image = Image.open("document.png").convert("RGB")  # replace with your document image
words = ["Sample", "document", "text"]             # words from your OCR step
boxes = [[50, 50, 150, 80], [160, 50, 300, 80], [310, 50, 380, 80]]  # 0-1000 normalized

encoding = processor(image, words, boxes=boxes, return_tensors="pt")
with torch.no_grad():
    outputs = model(**encoding)

preds = outputs.logits.argmax(dim=-1).squeeze().tolist()
labels = [model.config.id2label[p] for p in preds]
print(labels)
```

## Training Details

### Training Data

- Dataset: DocLayNet-v1.2
- Train/validation split: 200/100 samples
- Columns: input_ids, attention_mask, bbox, labels, pixel_values, n_words_in, n_words_out

### Training Procedure

- Optimizer: AdamW
- Learning rate: 5e-5
- Epochs: 5
- Mixed precision: FP16 optional
- Loss: cross-entropy per token

A minimal sketch of this loop is shown after the Evaluation section below.

## Evaluation

- Sample metrics (from the validation set):
  - Avg train loss: 0.134
  - Avg validation loss: 0.458
- Token prediction accuracy should be checked against the DocLayNet labels (a sketch follows below)
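The following is a minimal sketch of the training loop described under Training Procedure above, not the exact script used to produce this checkpoint. It reuses `model` from the getting-started snippet; `train_dataset` is a hypothetical processed DocLayNet split (items padded to a fixed length so default collation works), and the batch size is an illustrative assumption.

```python
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
optimizer = AdamW(model.parameters(), lr=5e-5)  # lr from the card

# train_dataset (assumed) yields dicts with input_ids, attention_mask,
# bbox, pixel_values, and labels, as produced by the processor.
train_loader = DataLoader(train_dataset, batch_size=2, shuffle=True)

model.train()
for epoch in range(5):
    running_loss = 0.0
    for batch in train_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        optimizer.zero_grad()
        outputs = model(**batch)  # passing labels yields per-token cross-entropy loss
        outputs.loss.backward()
        optimizer.step()
        running_loss += outputs.loss.item()
    print(f"epoch {epoch + 1}: avg train loss {running_loss / len(train_loader):.3f}")
```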
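And a sketch of the token-accuracy check mentioned in the Evaluation bullets, under the assumption that `val_loader` is built like `train_loader` above and that label tensors use -100 for positions to ignore (special tokens and unlabeled subwords, as the processor assigns when given word labels):

```python
model.eval()
correct, counted = 0, 0
with torch.no_grad():
    for batch in val_loader:  # assumed, built like train_loader above
        batch = {k: v.to(device) for k, v in batch.items()}
        preds = model(**batch).logits.argmax(dim=-1)
        mask = batch["labels"] != -100  # skip special tokens / padding
        correct += (preds[mask] == batch["labels"][mask]).sum().item()
        counted += mask.sum().item()
print(f"token accuracy: {correct / max(counted, 1):.3f}")
```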
## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** NVIDIA A100
- **Hours used:** ~1 hour for 5 epochs (small dataset)

## Technical Specifications

### Model Architecture and Objective

- Base model: LayoutLMv3
- Task: Token classification for document layout elements
- Input: Tokenized text, bounding boxes, and document images
- Output: Token-wise logits for 10 classes

### Compute Infrastructure

- Training performed on Google Colab Pro (A100 GPU)
- Framework: PyTorch + Hugging Face Transformers

## Citation

**BibTeX:**

```bibtex
@article{huang2022layoutlmv3,
  title={LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking},
  author={Huang, Yupan and Lv, Tengchao and Cui, Lei and Lu, Yutong and Wei, Furu},
  journal={arXiv preprint arXiv:2204.08387},
  year={2022}
}
```

**APA:**

Huang, Y., Lv, T., Cui, L., Lu, Y., & Wei, F. (2022). LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking. arXiv preprint arXiv:2204.08387.