ngoc committed on
Commit 6125baf · 1 Parent(s): 6d6604d

update readme

Files changed (2)
  1. LICENSE.md +276 -0
  2. README.md +205 -0
LICENSE.md ADDED
@@ -0,0 +1,276 @@
+ # **ProtonX Text Correction Model LICENSE AGREEMENT (v1.2-NC)**
+
+
+ **Effective Date:** 21 November 2025
+
+ **Copyright Holder:** PROTONX TECHNOLOGY COMPANY LIMITED
+
+ ---
+
+ ## **Preamble**
+
+ This License Agreement (“Agreement”) governs Your use, reproduction, modification, and distribution of the ProtonX Text Correction Model and related assets (“Model Materials”). It is designed to maximize openness for research and internal use while preventing unauthorized commercial redistribution of the model.
+
+ WHEREAS, Licensor has developed the ProtonX Text Correction model and intends to release it under an open and controlled framework;
+
+ WHEREAS, traditional open-source licenses (e.g., MIT) do not fully address complexities specific to AI language models, including model weights, fine-tuning data, privacy of user text, and downstream liabilities;
+
+ NOW, THEREFORE, the parties agree as follows.
+ Where this Agreement conflicts with the MIT License, the MIT License prevails.
+ Where MIT is silent, this Agreement supplements it.
+
+ ---
+
+ # **1. Definitions and Interpretation**
+
+ ### **1.1 Licensor**
+
+ “Licensor” means PROTONX TECHNOLOGY COMPANY LIMITED.
+
+ ### **1.2 Licensee**
+
+ “You” or “Licensee” means any individual or entity exercising rights under this Agreement.
+
+ ### **1.3 Model Materials**
+
+ “Model Materials” include but are not limited to:
+
+ (a) model architecture and trained parameters (model weights)
+ (b) preprocessing, training, inference, and fine-tuning code
+ (c) datasets, evaluation scripts, or dataset descriptions
+ (d) documentation, metadata, and configuration files
+
+ The authoritative version is hosted at:
+ **[https://github.com/protonx-engineering/protonx-text-correction](https://github.com/protonx-engineering/protonx-text-correction)**
+
+ ### **1.4 Outputs**
+
+ “Outputs” means all text generated by the Model Materials, including corrected text, grammar fixes, style changes, and rewritten sentences.
+
+ ### **1.5 Priority of Terms**
+
+ The MIT License prevails where conflicts occur.
+ This Agreement governs all areas not addressed by MIT.
+
+ ### **1.6 Derivative Works**
+
+ “Derivative Works” means any modified, fine-tuned, distilled, merged, or adapted model derived in whole or in part from the Model Materials.
+
+ ---
+
+ # **2. Grant of Rights (Non-Commercial Use Only)**
+
+ ### **2.1 Copyright License**
+
+ Licensor grants Licensee a perpetual, worldwide, non-exclusive, royalty-free, **non-commercial** license to:
+
+ * use, test, host, and run the Model Materials
+ * reproduce, modify, fine-tune, and create Derivative Works
+ * integrate the Model into internal systems
+ * deploy internally for research or non-commercial use
+
+ **Licensee is NOT permitted to sell, resell, sublicense, rent, monetize, or commercially distribute the Model Materials or any Derivative Works.**
+
+ ---
+
+ ### **2.2 Commercial Restriction**
+
+ The following activities are strictly prohibited without a separate **paid Commercial License** from Licensor:
+
+ * selling model weights
+ * selling fine-tuned versions
+ * providing paid API access
+ * embedding the model into paid software
+ * offering the model through SaaS, PaaS, or enterprise subscriptions
+ * monetizing access directly or indirectly
+
+ ---
+
+ ### **2.3 Internal Enterprise Use**
+
+ Enterprises may use the Model Materials internally, but must not expose or commercialize the model externally without approval.
+
+ ---
+
+ ### **2.4 Derivative Works**
+
+ Derivative Works may be created, but:
+
+ * must remain non-commercial
+ * may not be sold or commercially distributed
+ * may not be uploaded to commercial model marketplaces
+
+ ---
+
+ ### **2.5 API Exception**
+
+ If the Model is accessed via the official ProtonX API, the ProtonX API Terms apply instead of this Agreement.
+
+ ---
+
+ # **3. Acceptable Use Policy and Prohibited Uses**
+
+ Violation of this Section results in **automatic license termination**.
+
+ ### **3.1 Responsible Use**
+
+ Licensee must use the Model ethically, lawfully, and with respect for user privacy.
+
+ ### **3.2 Enterprise On-Prem Deployments**
+
+ On-premise or closed-source internal deployments are permitted if non-commercial.
+
+ ### **3.3 Prohibited Uses**
+
+ #### **(a) Illegal or Harmful Content Generation**
+
+ Including but not limited to:
+
+ * generating fraudulent or deceptive documents
+ * rewriting text to conceal criminal intent
+ * assisting phishing, impersonation, malware, or cyberattacks
+
+ #### **(b) Privacy Violations**
+
+ Including:
+
+ * processing sensitive personal data without valid consent
+ * re-identifying anonymized text
+ * extracting protected attributes for discriminatory use
+
+ #### **(c) Copyright Abuse**
+
+ The Model must not be used for:
+
+ * large-scale extraction of copyrighted books/articles
+ * reconstructing copyrighted content without rights
+ * generating “corrected” or modified versions of protected works without permission
+
+ #### **(d) Commercial Redistribution (Strictly Prohibited)**
+
+ Licensee shall NOT:
+
+ * sell the Model or Derivative Works
+ * sell access via API or platform
+ * provide paid services built on the Model
+ * distribute the Model on marketplaces for commercial gain
+ * embed the Model in commercial SaaS without permission
+
+ ---
+
+ # **4. Intellectual Property and Contributions**
+
+ ### **4.1 Ownership**
+
+ Licensor retains all rights to the original Model Materials.
+
+ ### **4.2 Patent License**
+
+ Licensee receives a royalty-free patent license to necessary claims.
+ If Licensee initiates a patent claim against Licensor, the patent license terminates.
+
+ ### **4.3 Outputs**
+
+ Outputs are determined by user inputs.
+ Licensor assumes no responsibility for:
+
+ * legal compliance of outputs
+ * copyright violations caused by Licensee use
+
+ ### **4.4 Trademarks**
+
+ Use of “ProtonX”, “ProtonX Text Correction”, or associated logos requires written permission.
+
+ ---
+
+ # **5. Data Governance, Privacy & Security**
+
+ ### **5.1 Data Quality and Bias**
+
+ Licensee must use legal, ethical datasets for any further training.
+
+ ### **5.2 Privacy Requirements**
+
+ (a) No processing of sensitive personal data without consent.
+ (b) Follow data-minimization principles.
+ (c) When building user-facing products, provide clear privacy policies.
+
+ ### **5.3 Security Measures**
+
+ Licensee must maintain reasonable security controls such as encryption and access management.
+
+ ### **5.4 Further Training**
+
+ User data may be used for re-training only with explicit informed consent.
+
+ ---
+
+ # **6. Warranty, Liability, and Risk Allocation**
+
+ ### **6.1 “AS IS” Basis**
+
+ Model Materials are provided without warranty of any kind.
+
+ ### **6.2 Output Accuracy Disclaimer**
+
+ Licensor is not liable for errors, misinterpretations, or downstream damages caused by Outputs.
+
+ ### **6.3 Limitation of Liability**
+
+ Licensor is not liable for:
+
+ * indirect or consequential damages
+ * loss of data
+ * business interruption
+ * misuse of the Model
+
+ Licensor's maximum liability under this License is **$0**.
+
+ ---
+
+ # **7. Attribution**
+
+ ### **7.1 Distribution Requirements**
+
+ Licensee must include this Agreement when sharing the Model Materials (non-commercial only).
+
+ ### **7.2 Notices**
+
+ All copyright and attribution notices must be preserved.
+
+ ### **7.3 Attribution Encouraged**
+
+ Not required but recommended:
+
+ > “Built with ProtonX Text Correction Model.”
+
+ ---
+
+ # **8. Governing Law & Dispute Resolution**
+
+ ### **8.1 Governing Law**
+
+ This Agreement is governed by the laws of **[Select: Vietnam / Singapore / UK]**.
+
+ ### **8.2 Dispute Resolution**
+
+ Disputes shall undergo amicable negotiation first.
+ If unresolved, they shall be resolved by binding arbitration at:
+ **[e.g., Singapore International Arbitration Centre (SIAC)]**
+
+ ---
+
+ # **9. Regulatory Compliance & License Updates**
+
+ Licensor may publish updated versions to comply with new regulations.
+ Licensee must migrate to the updated version within **90 days**.
+
+ ---
+
+ # **10. Security Reporting**
+
+ Security vulnerabilities may be reported to:
+
+ Reports must be coordinated and not publicly disclosed until a fix is confirmed.
+
README.md ADDED
@@ -0,0 +1,205 @@
+ ---
+ license_file: LICENSE.md
+ library_name: protonx-text-correction
+ tags:
+ - text-to-text
+ language:
+ - vi
+ ---
+
+ <div align="center">
+
+ <p align="center">
+ <img src="https://storage.googleapis.com/mle-courses-prod/users/61b6fa1ba83a7e37c8309756/private-files/678dadd0-603b-11ef-b0a7-998b84b38d43-ProtonX_logo_horizontally__1_.png" width="260"/>
+ </p>
+
+ <h1 align="center">
+ Distilled High-Accuracy Vietnamese Legal Document Correction
+ </h1>
+
+ [![GitHub](https://img.shields.io/badge/ProtonX-GitHub-black?logo=github)](https://github.com/protonx-engineering/protonx-text-correction)
+ [![HuggingFace](https://img.shields.io/badge/HuggingFace-Model-black?logo=huggingface)](https://huggingface.co/protonx-models/protonx-tc)
+ [![Website](https://img.shields.io/badge/protonx.co-Website-blue)](https://protonx.co)
+
+ </div>
+
+ ---
+
+ ## **Introduction**
+
+ ### **Distilled ProtonX Legal Text Correction (v1.2-NC)**
+
+ This model is a distilled version of the [ProtonX Legal Text Correction](https://huggingface.co/protonx-models/protonx-legal-tc) model.
+
+ It is a **specialized Vietnamese correction model** engineered for **high-accuracy OCR post-processing**, especially **fixing noisy PaddleOCR outputs** in enterprise and legal workflows.
+
+ #### **Best Use Case (Primary Focus)**: **Fixing PaddleOCR text errors**
+
+ <img src="https://protonx.co/assets/img/paddle-ocr-protonx.png">
+
+ The model is optimized to clean up real-world OCR mistakes such as:
+
+ * missing or incorrect diacritics
+ * broken word segmentation
+ * misrecognized legal terms
+ * punctuation artifacts
+ * formatting inconsistencies
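To make the first error class concrete, the sketch below strips Vietnamese diacritics from clean text, which yields the kind of unaccented input used in the Quick Usage example. It is an illustration only: `strip_diacritics` is a hypothetical helper, not part of the ProtonX code, and nothing in this README says the training data was produced this way.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove Vietnamese tone and vowel marks (hypothetical helper)."""
    # "đ"/"Đ" have no canonical decomposition, so map them by hand.
    text = text.replace("đ", "d").replace("Đ", "D")
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (category Mn) and keep the base letters.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(strip_diacritics("Bảo vệ an ninh mạng"))  # prints: Bao ve an ninh mang
```

Real OCR noise is messier than this (wrong marks, merged words, stray punctuation), so the helper only approximates the "missing diacritics" case.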
+
+ Built on a Seq2Seq Transformer architecture, the model is trained on 70,000 correction pairs, including 20,000 pairs manually annotated by expert Vietnamese annotators, covering:
+
+ * official legal documents
+ * OCR outputs from scanned PDFs
+ * colloquial → standardized legal text
+
+ Strict constraints ensure:
+
+ * **Correction ≠ rewriting**
+ * meaning of legal text must never change
+ * no hallucination / no added legal terms
+ * confidence-based correction
+ * no paraphrasing
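A downstream application may want its own check that a returned correction stays close to its input. The sketch below is one possible guard, not the mechanism the model uses internally: it compares input and output after removing diacritics, case, and whitespace, and accepts the output only when the remaining characters are nearly identical. The function names and the `min_ratio=0.9` threshold are assumptions chosen for illustration.

```python
import unicodedata
from difflib import SequenceMatcher


def _normalize(text: str) -> str:
    """Fold case, drop diacritics and whitespace so only base letters remain."""
    text = text.casefold().replace("đ", "d")
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return "".join(base.split())


def is_conservative_edit(source: str, corrected: str, min_ratio: float = 0.9) -> bool:
    """Return True when the correction keeps almost all base characters.

    min_ratio is a hypothetical threshold; tune it on your own data.
    """
    ratio = SequenceMatcher(None, _normalize(source), _normalize(corrected)).ratio()
    return ratio >= min_ratio


noisy = "2.Báo vé an ninh mang là phòng ngìaphát hiēn,ngǎn chǎn xù ly hành vi"
fixed = "2. Bảo vệ an ninh mạng là phòng ngừa phát hiện, ngăn chặn xử lý hành vi"
print(is_conservative_edit(noisy, fixed))
```

Outputs that fail the check can be routed to manual review instead of being written back into a legal document.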
+
+ ---
+
+ ## **LICENSE**
+
+ This model is released under the ProtonX Text Correction Model License (v1.2-NC).
+
+ See [LICENSE.md](./LICENSE.md) for full terms, conditions, and usage restrictions.
+
+ ## **Highlights**
+
+
+ 1. **ROUGE-L: 97.64**
+    - Achieved on the ProtonX Legal Text Correction Validation Dataset, which will be published in an upcoming release.
+ 2. **Half the size of the teacher model.**
+
+
+ ---
+
+ ## **Quick Usage with Transformers**
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+
+ model_path = "protonx-models/protonx-legal-tc"  # teacher model ID (see Introduction); use this repo's ID for the distilled model
+
+ tokenizer = AutoTokenizer.from_pretrained(model_path)
+ model = AutoModelForSeq2SeqLM.from_pretrained(model_path)
+
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # use the GPU when available
+ model.to(device)
+ model.eval()  # inference mode: disables dropout
+
+ examples = [
+     "can cu bo luat lao dong 2019 va cac van ban huong dan thuc hien.",  # input with its diacritics missing
+ ]
+
+ for text in examples:
+     inputs = tokenizer(
+         text,
+         return_tensors="pt",
+         truncation=True,  # inputs longer than max_length are cut off
+         max_length=128,
+     ).to(device)
+
+     with torch.no_grad():
+         outputs = model.generate(
+             **inputs,
+             num_beams=10,  # beam search decoding
+             max_new_tokens=32,  # matches the 32-token output budget in Training Details
+             length_penalty=1.0,
+             early_stopping=True,
+             repetition_penalty=1.2,
+             no_repeat_ngram_size=2,
+             pad_token_id=tokenizer.pad_token_id,
+             eos_token_id=tokenizer.eos_token_id,
+         )
+
+     result = tokenizer.decode(outputs[0], skip_special_tokens=True)
+
+     print(f"Input: {text}")
+     print(f"Output: {result}")
+     print("-" * 30)
+ ```
+
+ ---
+
+ ## **Benchmark**
+
+ ### **ProtonX Legal Text Correction Validation Dataset**
+
+ | Metric | Score |
+ | ------------- | --------- |
+ | **ROUGE-L** | **97.64** |
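For readers unfamiliar with the metric, ROUGE-L scores a candidate against a reference through their longest common subsequence (LCS). The sketch below is a minimal version under stated assumptions: whitespace tokenization and the balanced F-measure. The README does not say which tokenizer or ROUGE-L variant produced the reported score, so this shows the idea rather than reproducing the benchmark.

```python
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 on whitespace tokens (assumed tokenization), in the range 0..1."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# One of five tokens is still wrong ("mang" instead of "mạng"), so F1 is 0.8.
print(rouge_l_f1("Bảo vệ an ninh mạng", "Bảo vệ an ninh mang"))
```

Multiplying the result by 100 gives the 0 to 100 scale used in the table above.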
+
+ ---
+
+
+ ## **Training Details**
+
+ * Model: Seq2Seq Transformer
+ * Legal-domain augmentation
+ * Beam search decoding
+ * Max sequence length: 64 tokens total (32 tokens for input and 32 tokens for output).
+ * High-precision diacritic + punctuation restoration
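Because the input budget is 32 tokens, longer passages probably need to be split before correction. The sketch below greedily packs words into chunks that fit a token budget. `chunk_by_token_budget` is a hypothetical helper, and its default counter (whitespace words) only stands in for the real tokenizer; with the Quick Usage setup one could pass `lambda s: len(tokenizer(s)["input_ids"])` instead.

```python
from typing import Callable, List


def chunk_by_token_budget(
    text: str,
    max_tokens: int = 32,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Greedily pack whitespace-separated words into chunks within the budget."""
    chunks: List[str] = []
    current: List[str] = []
    for word in text.split():
        candidate = " ".join(current + [word])
        if current and count_tokens(candidate) > max_tokens:
            # Adding this word would exceed the budget: close the chunk.
            chunks.append(" ".join(current))
            current = [word]
        else:
            # A single word that alone exceeds the budget is still kept as is.
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks


long_text = " ".join(f"w{i}" for i in range(70))
print([len(chunk.split()) for chunk in chunk_by_token_budget(long_text)])  # prints: [32, 32, 6]
```

Each chunk can then be corrected on its own and the results joined with spaces. Splitting at sentence or clause boundaries would likely preserve more context than this word-count cut, at the cost of a more involved splitter.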
+
+ ### **Domain Coverage**
+
+ * Government decrees
+ * Resolutions
+ * Contract clauses
+ * Administrative procedures
+ * OCR-normalized scanned documents
+
+ ---
+
+ ## **Example Outputs**
+
+
+ **Input:**
+
+ ```
+ 2.Báo vé an ninh mang là phòng ngìaphát hiēn,ngǎn chǎn xù ly hành vi
+ ```
+
+ **Output:**
+
+ ```
+ 2. Bảo vệ an ninh mạng là phòng ngừa phát hiện, ngăn chặn xử lý hành vi
+ ```
+
+ ---
+
+ ## **Use Cases**
+
+ * Legal OCR text normalization
+ * Standardizing government documents
+ * Contract proofreading
+ * Preprocessing for legal RAG systems
+ * Administrative workflow automation
+ * Compliance document processing
+
+ ---
+
+ ## **Limitations**
+
+ * Does not paraphrase or rewrite legal clauses
+ * Cannot restore missing semantic content
+ * Primarily optimized for Vietnamese
+ * Not designed for informal social media slang
+
+ ---
+
+ ## **Future Work**
+
+ * Achieving even higher ROUGE-L performance on legal-domain datasets
+ * Extending maximum sequence length from 64 to 256 tokens for long-clause legal documents
+ ---
+
+ ## **Acknowledgments**
+
+ Thanks to:
+
+ * [vit5-base](https://huggingface.co/VietAI/vit5-base)