AvitoTech1 / korallll committed
Commit 1d59f03 (verified) · 1 Parent(s): 79fa6cb

Update README.md (#2)

- Update README.md (8da3cd8cc7e0696e85ab463afb77ddca0f3b64e9)


Co-authored-by: kiril <korallll@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +13 -1
README.md CHANGED
@@ -151,7 +151,19 @@ print(f"Embedding shape: {embedding.shape}") # Embedding shape: torch.Size([1, 7
 If you use this model in your research or applications, please cite our work:
 
 ```
- BibTeX citation will be added upon paper publication.
+ @Article{jimaging12010030,
+ AUTHOR = {Kudryavtsev, Vasiliy and Borodin, Kirill and Berezin, German and Bubenchikov, Kirill and Mkrtchian, Grach and Ryzhkov, Alexander},
+ TITLE = {From Visual to Multimodal: Systematic Ablation of Encoders and Fusion Strategies in Animal Identification},
+ JOURNAL = {Journal of Imaging},
+ VOLUME = {12},
+ YEAR = {2026},
+ NUMBER = {1},
+ ARTICLE-NUMBER = {30},
+ URL = {https://www.mdpi.com/2313-433X/12/1/30},
+ ISSN = {2313-433X},
+ ABSTRACT = {Automated animal identification is a practical task for reuniting lost pets with their owners, yet current systems often struggle due to limited dataset scale and reliance on unimodal visual cues. This study introduces a multimodal verification framework that enhances visual features with semantic identity priors derived from synthetic textual descriptions. We constructed a massive training corpus of 1.9 million photographs covering 695,091 unique animals to support this investigation. Through systematic ablation studies, we identified SigLIP2-Giant and E5-Small-v2 as the optimal vision and text backbones. We further evaluated fusion strategies ranging from simple concatenation to adaptive gating to determine the best method for integrating these modalities. Our proposed approach utilizes a gated fusion mechanism and achieved a Top-1 accuracy of 84.28% and an Equal Error Rate of 0.0422 on a comprehensive test protocol. These results represent an 11% improvement over leading unimodal baselines and demonstrate that integrating synthesized semantic descriptions significantly refines decision boundaries in large-scale pet re-identification.},
+ DOI = {10.3390/jimaging12010030}
+ }
 ```
 
 ## Use Cases
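
The abstract in the citation above mentions a gated fusion of a vision embedding (SigLIP2-Giant) and a text embedding (E5-Small-v2). The snippet below is a minimal illustrative sketch of such a gating mechanism in PyTorch, not the model's actual code: the class name `GatedFusion`, the embedding dimensions (1536 for vision, 384 for text), and the layer layout are assumptions for demonstration only.

```python
import torch
import torch.nn as nn


class GatedFusion(nn.Module):
    """Illustrative gated fusion of a vision embedding and a text embedding.

    Hypothetical sketch: dimensions and layers are assumptions, not the
    published architecture.
    """

    def __init__(self, vision_dim: int = 1536, text_dim: int = 384, fused_dim: int = 512):
        super().__init__()
        # Project both modalities into a shared space.
        self.vision_proj = nn.Linear(vision_dim, fused_dim)
        self.text_proj = nn.Linear(text_dim, fused_dim)
        # The gate predicts, per dimension, how much each modality contributes.
        self.gate = nn.Sequential(
            nn.Linear(2 * fused_dim, fused_dim),
            nn.Sigmoid(),
        )

    def forward(self, vision_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        v = self.vision_proj(vision_emb)
        t = self.text_proj(text_emb)
        g = self.gate(torch.cat([v, t], dim=-1))
        # Per-dimension convex mixture of the two modalities.
        return g * v + (1.0 - g) * t


# Example usage with dummy embeddings (batch size 1).
fusion = GatedFusion()
vision_emb = torch.randn(1, 1536)  # placeholder for an image embedding
text_emb = torch.randn(1, 384)     # placeholder for a text embedding
fused = fusion(vision_emb, text_emb)
print(fused.shape)  # torch.Size([1, 512])
```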