I’m currently building an intelligent document comparison pipeline, and so far there is nothing intelligent about it. I would really appreciate some guidance from the community.
The goal of the system is to compare a new incoming document (PDF or DOCX) with a repository folder of existing documents and determine whether:
- it is a new document,
- a modified version of an existing document (or a duplicate),
- or an unrelated file.
If a version relationship is detected, the system also tries to identify and summarise the changes between the documents.
Current architecture
The pipeline currently works like this:
- Input processing: accepts PDF or DOCX files. DOCX files are converted to PDF, and text is extracted page by page using PyMuPDF.
- Paragraph extraction: text is segmented into paragraphs, and paragraph embeddings are generated with a local bge-en-v1.5 embedding model.
- Similarity analysis: a paragraph similarity matrix is built from the semantic embeddings, and the Hungarian algorithm aligns paragraphs between the old and new documents. TF-IDF and character n-gram similarity are added as lexical signals, weighted at 20% and 15% respectively.
- Change detection: based on the alignment, the system classifies paragraphs as unchanged, modified, added, or removed.
- Change explanation: a local language model (currently FLAN-T5) summarises the differences.
I am struggling with document structure and alignment:
1. Paragraph extraction
PDF text extraction is very noisy. Paragraph boundaries are often inconsistent due to line breaks, layout blocks, and PDF formatting artifacts.
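The best mitigation I have found so far is to extract line-level text (e.g. PyMuPDF's block output) and then re-join lines into paragraphs heuristically, rather than trusting the extractor's paragraph breaks. A stdlib-only sketch of the joining step, with illustrative heuristics (blank line or sentence-final punctuation closes a paragraph, hyphenated line breaks are re-joined):

```python
import re

def merge_lines_to_paragraphs(lines: list[str]) -> list[str]:
    """Heuristically rebuild paragraphs from PDF-extracted lines."""
    paragraphs, current = [], []
    for line in lines:
        stripped = line.strip()
        if not stripped:                       # blank line => paragraph break
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        if current and current[-1].endswith("-"):
            current[-1] = current[-1][:-1] + stripped   # undo hyphenation
        else:
            current.append(stripped)
        if re.search(r'[.!?:]["\')\]]?$', stripped):    # sentence-final punctuation
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```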
2. Paragraph alignment
I’m currently using the Hungarian algorithm to match paragraphs between document versions.
However, sections can move or expand, which causes paragraph matches to “slide” and produce incorrect alignments.
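One alternative I am considering instead of global bipartite matching is an order-preserving sequence alignment (difflib's `SequenceMatcher` over paragraph "fingerprints"), which respects document order, so a moved or expanded section shows up as delete+insert rather than dragging every subsequent match out of place. A rough stdlib sketch, assuming paragraphs are already segmented (exact fingerprint matching here; in practice you'd bucket near-duplicates first):

```python
from difflib import SequenceMatcher

def align_paragraphs(old: list[str], new: list[str]):
    """Order-preserving alignment: returns matched (old_idx, new_idx) pairs
    plus lists of removed/added paragraph indices."""
    def fingerprint(p: str) -> str:
        return " ".join(p.lower().split())      # normalise whitespace and case
    sm = SequenceMatcher(a=[fingerprint(p) for p in old],
                         b=[fingerprint(p) for p in new], autojunk=False)
    matched, removed, added = [], [], []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            matched.extend(zip(range(i1, i2), range(j1, j2)))
        else:
            removed.extend(range(i1, i2))
            added.extend(range(j1, j2))
    return matched, removed, added
```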
3. Semantic vs lexical comparison
It is difficult to distinguish between a rephrased paragraph, a modified paragraph, and a completely new paragraph.
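My current working idea is a decision rule that reads the semantic and lexical scores jointly: high semantic with low lexical similarity suggests rephrasing, both high suggests unchanged, both low suggests a new paragraph. A minimal sketch (the thresholds are purely illustrative and would need tuning on real data):

```python
def classify_pair(semantic_sim: float, lexical_sim: float) -> str:
    """Rough decision rule combining a semantic score (e.g. embedding cosine)
    with a lexical score (e.g. char n-gram overlap)."""
    if semantic_sim >= 0.92 and lexical_sim >= 0.90:
        return "unchanged"
    if semantic_sim >= 0.85 and lexical_sim < 0.50:
        return "rephrased"      # same meaning, different surface form
    if semantic_sim >= 0.70:
        return "modified"       # overlapping meaning, edited content
    return "new"                # unrelated paragraph
```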
4. Change summarisation
The summaries are weak explanations; more often than not they are just sentences copied directly from the document, because the input evidence from the diff stage is not structured enough.
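One direction I am experimenting with is feeding FLAN-T5 a structured change record per aligned pair, with the token-level insertions and deletions spelled out, instead of two walls of raw paragraph text. A minimal stdlib sketch (the field names are my own, not from any library):

```python
import difflib

def change_record(old_para: str, new_para: str) -> dict:
    """Build structured evidence for one modified paragraph: word-level
    removed/added spans, so the summariser sees *what* changed."""
    old_tok, new_tok = old_para.split(), new_para.split()
    sm = difflib.SequenceMatcher(a=old_tok, b=new_tok)
    added, removed = [], []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag in ("replace", "delete"):
            removed.append(" ".join(old_tok[i1:i2]))
        if tag in ("replace", "insert"):
            added.append(" ".join(new_tok[j1:j2]))
    return {"change_type": "modified", "removed_text": removed,
            "added_text": added, "similarity": round(sm.ratio(), 3)}

# The prompt can then be templated from the record, e.g.:
# "The phrase(s) {removed_text} were replaced by {added_text}. Summarise the change."
```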
Can anyone please help me with these issues? Specifically:
- better alternatives for paragraph segmentation from PDFs,
- approaches for stable paragraph alignment between document versions,
- whether sequence alignment methods or section-based comparison work better than global matching,
- best practices for generating useful change summaries from document diffs.