I’m currently building an intelligent document comparison pipeline, and so far there is nothing intelligent about it. I would really appreciate some guidance from the community.
The goal of the system is to compare a new incoming document (PDF or DOCX) with a repository folder of existing documents and determine whether:
- it is a new document,
- a modified version of an existing document (or a duplicate),
- or an unrelated file.
If a version relationship is detected, the system also tries to identify and summarise the changes between the documents.
Current architecture
The pipeline currently works like this:
- Input processing: accepts PDF or DOCX files. DOCX files are converted to PDF, and text is extracted page by page using PyMuPDF.
- Paragraph extraction: text is segmented into paragraphs, and paragraph embeddings are generated with a local bge-en-v1.5 embedding model.
- Similarity analysis: a paragraph similarity matrix is built from the semantic embeddings, and the Hungarian algorithm aligns paragraphs between the old and new documents. TF-IDF and character n-gram similarity are added as lexical signals, weighted at 20% and 15% respectively.
- Change detection: based on the alignment, the system classifies paragraphs as unchanged, modified, added, or removed.
- Change explanation: a local language model (currently FLAN-T5) summarises the differences.
I am struggling with document structure and alignment:
1. Paragraph extraction
PDF text extraction is very noisy. Paragraph boundaries are often inconsistent due to line breaks, layout blocks, and PDF formatting artifacts.
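The best mitigation I have found so far is to extract line-level text (e.g. PyMuPDF's block output) and then re-join lines into paragraphs heuristically, rather than trusting the extractor's paragraph breaks. A stdlib-only sketch of the joining step, with illustrative heuristics (blank line or sentence-final punctuation closes a paragraph, hyphenated line breaks are re-joined):

```python
import re

def merge_lines_to_paragraphs(lines: list[str]) -> list[str]:
    """Heuristically rebuild paragraphs from PDF-extracted lines."""
    paragraphs, current = [], []
    for line in lines:
        stripped = line.strip()
        if not stripped:                       # blank line => paragraph break
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        if current and current[-1].endswith("-"):
            current[-1] = current[-1][:-1] + stripped   # undo hyphenation
        else:
            current.append(stripped)
        if re.search(r'[.!?:]["\')\]]?$', stripped):    # sentence-final punctuation
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```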
2. Paragraph alignment
I’m currently using the Hungarian algorithm to match paragraphs between document versions.
However, sections can move or expand, which causes paragraph matches to “slide” and produce incorrect alignments.
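One alternative I am considering instead of global bipartite matching is an order-preserving sequence alignment (difflib's `SequenceMatcher` over paragraph "fingerprints"), which respects document order, so a moved or expanded section shows up as delete+insert rather than dragging every subsequent match out of place. A rough stdlib sketch, assuming paragraphs are already segmented (exact fingerprint matching here; in practice you'd bucket near-duplicates first):

```python
from difflib import SequenceMatcher

def align_paragraphs(old: list[str], new: list[str]):
    """Order-preserving alignment: returns matched (old_idx, new_idx) pairs
    plus lists of removed/added paragraph indices."""
    def fingerprint(p: str) -> str:
        return " ".join(p.lower().split())      # normalise whitespace and case
    sm = SequenceMatcher(a=[fingerprint(p) for p in old],
                         b=[fingerprint(p) for p in new], autojunk=False)
    matched, removed, added = [], [], []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            matched.extend(zip(range(i1, i2), range(j1, j2)))
        else:
            removed.extend(range(i1, i2))
            added.extend(range(j1, j2))
    return matched, removed, added
```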
3. Semantic vs lexical comparison
It is difficult to distinguish between a rephrased paragraph, a modified paragraph, and a completely new paragraph.
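My current working idea is a decision rule that reads the semantic and lexical scores jointly: high semantic with low lexical similarity suggests rephrasing, both high suggests unchanged, both low suggests a new paragraph. A minimal sketch (the thresholds are purely illustrative and would need tuning on real data):

```python
def classify_pair(semantic_sim: float, lexical_sim: float) -> str:
    """Rough decision rule combining a semantic score (e.g. embedding cosine)
    with a lexical score (e.g. char n-gram overlap)."""
    if semantic_sim >= 0.92 and lexical_sim >= 0.90:
        return "unchanged"
    if semantic_sim >= 0.85 and lexical_sim < 0.50:
        return "rephrased"      # same meaning, different surface form
    if semantic_sim >= 0.70:
        return "modified"       # overlapping meaning, edited content
    return "new"                # unrelated paragraph
```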
4. Change summarisation
The summaries are weak explanations; more often than not they are just sentences copied directly from the document, because the input evidence from the diff stage is not structured enough.
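One direction I am experimenting with is feeding FLAN-T5 a structured change record per aligned pair, with the token-level insertions and deletions spelled out, instead of two walls of raw paragraph text. A minimal stdlib sketch (the field names are my own, not from any library):

```python
import difflib

def change_record(old_para: str, new_para: str) -> dict:
    """Build structured evidence for one modified paragraph: word-level
    removed/added spans, so the summariser sees *what* changed."""
    old_tok, new_tok = old_para.split(), new_para.split()
    sm = difflib.SequenceMatcher(a=old_tok, b=new_tok)
    added, removed = [], []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag in ("replace", "delete"):
            removed.append(" ".join(old_tok[i1:i2]))
        if tag in ("replace", "insert"):
            added.append(" ".join(new_tok[j1:j2]))
    return {"change_type": "modified", "removed_text": removed,
            "added_text": added, "similarity": round(sm.ratio(), 3)}

# The prompt can then be templated from the record, e.g.:
# "The phrase(s) {removed_text} were replaced by {added_text}. Summarise the change."
```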
Can anyone please help me with these issues? Specifically:
- better alternatives for paragraph segmentation from PDFs,
- approaches for stable paragraph alignment between document versions,
- whether sequence alignment methods or section-based comparison work better than global matching,
- best practices for generating useful change summaries from document diffs.