Update README.md
README.md
CHANGED
@@ -85,7 +85,7 @@ All corpora except Europarl and Tilde were collected from [Opus](https://opus.nl
### Data preparation
-All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75. This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE). The filtered datasets are then concatenated to form a final corpus of 6.
+All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75. This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE). The filtered datasets are then concatenated to form a final corpus of 6.258.272. Before training, the punctuation is normalized using a modified version of the `join-single-file.py` script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py).
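A minimal sketch of the deduplicate-then-filter step described in the changed line, not the repository's actual script. It assumes exact-match deduplication on the (source, target) pair and that pairs scoring exactly 0.75 are kept. The names `dedup_and_filter` and `embed` are hypothetical. In practice `embed` could wrap a LaBSE encoder, for example `SentenceTransformer("sentence-transformers/LaBSE").encode` from the sentence-transformers package.

```python
import math
from typing import Callable, Iterable, List, Sequence, Tuple

Pair = Tuple[str, str]
Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def dedup_and_filter(
    pairs: Iterable[Pair],
    embed: Callable[[str], Vector],  # hypothetical: any sentence -> vector function
    threshold: float = 0.75,         # threshold stated in the README
) -> List[Pair]:
    """Drop exact duplicate pairs, then drop pairs whose similarity is below threshold."""
    seen = set()
    kept: List[Pair] = []
    for src, tgt in pairs:
        if (src, tgt) in seen:
            continue  # exact duplicate of an earlier pair
        seen.add((src, tgt))
        if cosine_similarity(embed(src), embed(tgt)) >= threshold:
            kept.append((src, tgt))
    return kept
```

Passing the embedding function as a parameter keeps the filtering logic independent of the model, so it can be checked with small hand-made vectors before running it with a real encoder.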
#### Tokenization