projecte-aina
/

aina-translator-de-ca

Model card Files Files and versions

AudreyVM commited on Dec 11, 2023

Commit

58ba693

·

1 Parent(s): e0ff2bc

Update README.md

Files changed (1) hide show

README.md +1 -1

README.md CHANGED Viewed

@@ -85,7 +85,7 @@ All corpora except Europarl and Tilde were collected from [Opus](https://opus.nl
 ### Data preparation
- All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75. This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE). The filtered datasets are then concatenated to form a final corpus of 6.159.631 and before training the punctuation is normalized using a modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py)
 #### Tokenization

 ### Data preparation
+ All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75. This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE). The filtered datasets are then concatenated to form a final corpus of 6.258.272 and before training the punctuation is normalized using a modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py)
 #### Tokenization