document_parsing_donut_v1

Overview

This model implements the Donut (Document Understanding Transformer) architecture. Unlike traditional OCR-based systems, it is OCR-free: it maps raw document images directly to structured JSON output, with no separate text recognition step. It is fine-tuned to parse complex layouts such as invoices, receipts, and technical forms.

Model Architecture

The model uses a vision-encoder, text-decoder framework:

  • Encoder: A Swin Transformer that processes high-resolution images into visual features.
  • Decoder: A BART-based multilingual transformer that generates output tokens autoregressively, conditioned on the encoder's visual features.
  • Objective: The model is trained using a cross-entropy loss to predict the next token based on both the visual input and preceding tokens: $\mathcal{L} = -\sum_{t=1}^{T} \log P(y_t \mid y_{<t}, \mathbf{x})$
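
The objective can be illustrated with a toy calculation. The sketch below is not the training code for this model; it assumes the decoder has already produced a probability distribution over a three-token vocabulary at each step, and the names `sequence_nll`, `step_probs`, and `target_ids` are hypothetical.

```python
import math


def sequence_nll(step_probs, target_ids):
    """Return the sum over t of -log P(y_t | y_<t, x).

    step_probs[t] is assumed to be the decoder's probability distribution
    over the vocabulary at step t, already conditioned on the image x and
    the preceding target tokens (teacher forcing).
    """
    if len(step_probs) != len(target_ids):
        raise ValueError("need one distribution per target token")
    loss = 0.0
    for probs, y_t in zip(step_probs, target_ids):
        # Only the probability assigned to the correct token contributes.
        loss -= math.log(probs[y_t])
    return loss


# Toy example: T = 3 decoding steps, vocabulary of 3 tokens (made-up numbers).
step_probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.25, 0.25, 0.5],
]
target_ids = [0, 1, 2]
loss = sequence_nll(step_probs, target_ids)
print(f"loss = {loss:.4f}")
```

The loss shrinks toward zero as the model places more probability on each correct token, which is what minimizing the objective encourages.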

Intended Use

  • Automated Data Entry: Extracting key-value pairs from digitized business documents.
  • Layout Analysis: Identifying structural components (headers, tables, footers) in multi-page PDFs.
  • Archival Digitization: Converting historical scanned documents into searchable, structured data.
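
For key-value extraction, the decoder's output sequence has to be turned into JSON. The sketch below assumes the output uses the `<s_field>...</s_field>` tag convention from the original Donut release; this card does not state the output format, so treat the tag scheme, the sample string, and the helper name `tags_to_dict` as assumptions. It is a simplified parser that does not handle repeated keys or list separators.

```python
import json
import re

# Assumed tag convention: <s_key>value</s_key>, possibly nested.
OPEN_TAG = re.compile(r"<s_(\w+)>")


def tags_to_dict(sequence):
    """Convert tag-delimited decoder output into a nested dict (simplified)."""
    result = {}
    pos = 0
    while True:
        match = OPEN_TAG.search(sequence, pos)
        if match is None:
            break
        key = match.group(1)
        close_tag = f"</s_{key}>"
        end = sequence.find(close_tag, match.end())
        if end == -1:
            break  # unclosed field: stop rather than guess its extent
        inner = sequence[match.end():end]
        # Recurse when the field contains further tagged fields.
        result[key] = tags_to_dict(inner) if "<s_" in inner else inner.strip()
        pos = end + len(close_tag)
    return result


# Made-up decoder output for a one-item receipt.
raw = "<s_menu><s_nm>Latte</s_nm><s_price>4.50</s_price></s_menu><s_total>4.50</s_total>"
parsed = tags_to_dict(raw)
print(json.dumps(parsed, indent=2))
```

All values stay as strings; converting amounts or dates to typed fields is left to downstream validation.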

Limitations

  • Resolution Sensitivity: Performance drops significantly if images are scaled below 960x1280 pixels.
  • Language Bias: The model can process other scripts, but accuracy is highest for Latin-script documents; CJK and Arabic scripts require specialized fine-tuning.
  • Handwriting: The model is optimized for printed text and may struggle with highly cursive or disorganized handwriting.
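
The resolution limit can be checked before inference. The sketch below assumes that 960x1280 means width x height in pixels, which this card does not specify; the function names are hypothetical.

```python
# Assumption: the 960x1280 threshold is width x height in pixels.
MIN_WIDTH, MIN_HEIGHT = 960, 1280


def below_min_resolution(width, height):
    """True if either side falls below the stated threshold."""
    return width < MIN_WIDTH or height < MIN_HEIGHT


def upscale_factor(width, height):
    """Smallest factor that brings both sides up to the threshold.

    Returns 1.0 when the image is already large enough.
    """
    return max(1.0, MIN_WIDTH / width, MIN_HEIGHT / height)


for size in [(1920, 2560), (960, 1280), (480, 1280)]:
    flag = "too small" if below_min_resolution(*size) else "ok"
    print(size, flag, f"upscale x{upscale_factor(*size):.2f}")
```

Upscaling a small scan does not restore lost detail, so the check is most useful for flagging inputs to rescan or to avoid downscaling past the threshold during preprocessing.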