|
|
---
language: en
license: mit
tags:
- clip
- multimodal
- contrastive-learning
- cultural-heritage
- reevaluate
- information-retrieval
datasets:
- xuemduan/reevaluate-image-text-pairs
model-index:
- name: REEVALUATE CLIP Fine-tuned Models
  results:
  - task:
      type: image-text-retrieval
      name: Image-Text Retrieval
    dataset:
      name: Cultural Heritage Hybrid Dataset
      type: xuemduan/reevaluate-image-text-pairs
    metrics:
    - name: I2T R@1
      type: recall@1
      value: <TOBE_FILL_IN>
    - name: I2T R@5
      type: recall@5
      value: <TOBE_FILL_IN>
    - name: T2I R@1
      type: recall@1
      value: <TOBE_FILL_IN>
---
|
|
|
|
|
|
|
|
# Domain-Adaptive CLIP for Multimodal Retrieval |
|
|
|
|
|
This repository provides the fine-tuned CLIP (ViT-L/14) model used in **Knowledge-Enhanced Multimodal Retrieval**.
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
## 📦 Available Models |
|
|
|
|
|
| Model | Description | Data Type | |
|
|
|--------|--------------|-----------| |
|
|
| `reevaluate-clip` | Fine-tuned on images, query texts, and description texts | Image+Text | |
|
|
--- |
|
|
|
|
|
## 🧾 Dataset |
|
|
|
|
|
The models were trained and evaluated on the **REEVALUATE Image-Text Pair Dataset**, which contains **43,500 image–text pairs** derived from Wikidata and Pilot Museums.

Each artefact is described by:

- `Image`: artefact image
- `Description text`: a BLIP-generated natural-language portion plus a metadata portion
- `Query text`: text resembling a user query
|
|
|
|
|
Dataset: [xuemduan/reevaluate-image-text-pairs](https://huggingface.co/datasets/xuemduan/reevaluate-image-text-pairs) |
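
To inspect the pairs programmatically, the dataset can be loaded with the 🤗 `datasets` library. A minimal sketch; the `train` split is an assumption, and the exact column names should be checked against the dataset card:

```python
from datasets import load_dataset

# Assumed "train" split; see the dataset card for the actual splits and schema
ds = load_dataset("xuemduan/reevaluate-image-text-pairs", split="train")

print(ds.column_names)  # inspect the actual fields
print(ds[0])            # one image–text pair
```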
|
|
|
|
|
--- |
|
|
|
|
|
## 🚀 Usage |
|
|
|
|
|
```python
import torch
from transformers import CLIPProcessor, CLIPModel
from PIL import Image

# Load the fine-tuned model and its processor
model = CLIPModel.from_pretrained("xuemduan/reevaluate-clip")
processor = CLIPProcessor.from_pretrained("xuemduan/reevaluate-clip")

image = Image.open("artefact.jpg")
text = "yellow flower paintings"

# Encode the image and the text (no gradients needed at inference time)
with torch.no_grad():
    image_embeds = model.get_image_features(**processor(images=image, return_tensors="pt"))
    text_embeds = model.get_text_features(**processor(text=[text], return_tensors="pt"))

# L2-normalize so the dot product is the cosine similarity
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)

similarity = image_embeds @ text_embeds.T
print(similarity)
```
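
For retrieval over a pool of candidates (the setting measured by the I2T/T2I recall metrics above), rank all texts by cosine similarity against the image embedding. A minimal sketch; the image path and candidate texts below are placeholders:

```python
import torch
from transformers import CLIPProcessor, CLIPModel
from PIL import Image

model = CLIPModel.from_pretrained("xuemduan/reevaluate-clip").eval()
processor = CLIPProcessor.from_pretrained("xuemduan/reevaluate-clip")

# Placeholder candidates; in practice these come from the dataset's query/description texts
texts = [
    "yellow flower paintings",
    "a bronze statue of a horse",
    "a medieval illuminated manuscript",
]
image = Image.open("artefact.jpg")

with torch.no_grad():
    img = model.get_image_features(**processor(images=image, return_tensors="pt"))
    txt = model.get_text_features(**processor(text=texts, padding=True, return_tensors="pt"))

# Normalize so the dot product is cosine similarity
img = img / img.norm(dim=-1, keepdim=True)
txt = txt / txt.norm(dim=-1, keepdim=True)

scores = (img @ txt.T).squeeze(0)          # one similarity score per candidate text
ranking = scores.argsort(descending=True)  # best match first
for i in ranking:
    print(f"{scores[i]:.3f}  {texts[i]}")
```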
|
|
|