[**中文说明**](README_CN.md) | [**English**](README.md)
# Introduction
This project aims to provide a better Chinese CLIP model. The training data consists of publicly accessible image URLs paired with Chinese text descriptions, 400M pairs in total; after filtering, 100M pairs were used for training.
This project was developed by the QQ-ARC Joint Lab, Tencent PCG.
For more details, please refer to the [main page of the QA-CLIP project](https://huggingface.co/TencentARC/QA-CLIP). The models are also open-sourced on GitHub at [QA-CLIP](https://github.com/TencentARC-QQ/QA-CLIP), welcome to star!
<br><br>
## Results
For image-text retrieval, we ran zero-shot evaluations on [MUGE Retrieval](https://tianchi.aliyun.com/muge), [Flickr30K-CN](https://github.com/li-xirong/cross-lingual-cap), and [COCO-CN](https://github.com/li-xirong/coco-cn).
For zero-shot image classification, we evaluated on the ImageNet dataset; a minimal sketch of the zero-shot classification protocol is given after the tables. The results are listed below:
**Flickr30K-CN Zero-shot Retrieval (Official Test Set)**:
<table border="1" width="100%">
<tr align="center">
<th>Task</th><th colspan="3">Text-to-Image</th><th colspan="3">Image-to-Text</th>
</tr>
<tr align="center">
<td>Metric</td><td>R@1</td><td>R@5</td><td>R@10</td><td>R@1</td><td>R@5</td><td>R@10</td>
</tr>
<tr align="center">
<td>CN-CLIP<sub>RN50</sub></td><td>48.8</td><td>76.0</td><td>84.6</td><td>60.0</td><td>85.9</td><td>92.0</td>
</tr>
<tr align="center" style="background-color: Honeydew;">
<td>QA-CLIP<sub>RN50</sub></td><td><b>50.5</b></td><td><b>77.4</b></td><td><b>86.1</b></td><td><b>67.1</b></td><td><b>87.9</b></td><td><b>93.2</b></td>
</tr>
<tr align="center">
<td>CN-CLIP<sub>ViT-B/16</sub></td><td>62.7</td><td>86.9</td><td>92.8</td><td>74.6</td><td>93.5</td><td>97.1</td>
</tr>
<tr align="center" style="background-color: Honeydew;">
<td>QA-CLIP<sub>ViT-B/16</sub></td><td><b>63.8</b></td><td><b>88.0</b></td><td><b>93.2</b></td><td><b>78.4</b></td><td><b>96.1</b></td><td><b>98.5</b></td>
</tr>
<tr align="center">
<td>CN-CLIP<sub>ViT-L/14</sub></td><td>68.0</td><td>89.7</td><td>94.4</td><td>80.2</td><td>96.6</td><td>98.2</td>
</tr>
<tr align="center">
<td>AltCLIP<sub>ViT-L/14</sub></td><td><b>69.7</b></td><td>90.1</td><td><b>94.8</b></td><td>84.8</td><td>97.7</td><td>99.1</td>
</tr>
<tr align="center" style="background-color: Honeydew;">
<td>QA-CLIP<sub>ViT-L/14</sub></td><td>69.3</td><td><b>90.3</b></td><td>94.7</td><td><b>85.3</b></td><td><b>97.9</b></td><td><b>99.2</b></td>
</tr>
</table>
<br>
**MUGE Zero-shot Retrieval (Official Validation Set)**:
<table border="1" width="100%">
<tr align="center">
<th>Task</th><th colspan="3">Text-to-Image</th><th colspan="3">Image-to-Text</th>
</tr>
<tr align="center">
<td>Metric</td><td>R@1</td><td>R@5</td><td>R@10</td><td>R@1</td><td>R@5</td><td>R@10</td>
</tr>
<tr align="center">
<td>CN-CLIP<sub>RN50</sub></td><td>42.6</td><td>68.5</td><td>78.0</td><td>30.0</td><td>56.2</td><td>66.9</td>
</tr>
<tr align="center" style="background-color: Honeydew;">
<td>QA-CLIP<sub>RN50</sub></td><td><b>44.0</b></td><td><b>69.9</b></td><td><b>79.5</b></td><td><b>32.4</b></td><td><b>59.5</b></td><td><b>70.3</b></td>
</tr>
<tr align="center">
<td>CN-CLIP<sub>ViT-B/16</sub></td><td>52.1</td><td>76.7</td><td>84.4</td><td>38.7</td><td>65.6</td><td>75.1</td>
</tr>
<tr align="center" style="background-color: Honeydew;">
<td>QA-CLIP<sub>ViT-B/16</sub></td><td><b>53.2</b></td><td><b>77.7</b></td><td><b>85.1</b></td><td><b>40.7</b></td><td><b>68.2</b></td><td><b>77.2</b></td>
</tr>
<tr align="center">
<td>CN-CLIP<sub>ViT-L/14</sub></td><td>56.4</td><td>79.8</td><td>86.2</td><td>42.6</td><td>69.8</td><td>78.6</td>
</tr>
<tr align="center">
<td>AltCLIP<sub>ViT-L/14</sub></td><td>29.6</td><td>49.9</td><td>58.8</td><td>21.4</td><td>42.0</td><td>51.9</td>
</tr>
<tr align="center" style="background-color: Honeydew;">
<td>QA-CLIP<sub>ViT-L/14</sub></td><td><b>57.4</b></td><td><b>81.0</b></td><td><b>87.7</b></td><td><b>45.5</b></td><td><b>73.0</b></td><td><b>81.4</b></td>
</tr>
</table>
<br>
**COCO-CN Zero-shot Retrieval (Official Test Set)**:
<table border="1" width="100%">
<tr align="center">
<th>Task</th><th colspan="3">Text-to-Image</th><th colspan="3">Image-to-Text</th>
</tr>
<tr align="center">
<td>Metric</td><td>R@1</td><td>R@5</td><td>R@10</td><td>R@1</td><td>R@5</td><td>R@10</td>
</tr>
<tr align="center">
<td>CN-CLIP<sub>RN50</sub></td><td>48.1</td><td>81.3</td><td>90.5</td><td>50.9</td><td>81.1</td><td>90.5</td>
</tr>
<tr align="center" style="background-color: Honeydew;">
<td>QA-CLIP<sub>RN50</sub></td><td><b>50.1</b></td><td><b>82.5</b></td><td><b>91.7</b></td><td><b>56.7</b></td><td><b>85.2</b></td><td><b>92.9</b></td>
</tr>
<tr align="center">
<td>CN-CLIP<sub>ViT-B/16</sub></td><td>62.2</td><td>87.1</td><td>94.9</td><td>56.3</td><td>84.0</td><td>93.3</td>
</tr>
<tr align="center" style="background-color: Honeydew;">
<td>QA-CLIP<sub>ViT-B/16</sub></td><td><b>62.9</b></td><td><b>87.7</b></td><td>94.7</td><td><b>61.5</b></td><td><b>87.6</b></td><td><b>94.8</b></td>
</tr>
<tr align="center">
<td>CN-CLIP<sub>ViT-L/14</sub></td><td>64.9</td><td>88.8</td><td>94.2</td><td>60.6</td><td>84.4</td><td>93.1</td>
</tr>
<tr align="center">
<td>AltCLIP<sub>ViT-L/14</sub></td><td>63.5</td><td>87.6</td><td>93.5</td><td>62.6</td><td><b>88.5</b></td><td><b>95.9</b></td>
</tr>
<tr align="center" style="background-color: Honeydew;">
<td>QA-CLIP<sub>ViT-L/14</sub></td><td><b>65.7</b></td><td><b>90.2</b></td><td><b>95.0</b></td><td><b>64.5</b></td><td>88.3</td><td>95.1</td>
</tr>
</table>
<br>
**Zero-shot Image Classification on ImageNet**:
<table border="1" width="100%">
<tr align="center">
<th>Model</th><th colspan="1">ImageNet</th>
</tr>
<tr align="center">
<td>CN-CLIP<sub>RN50</sub></td><td>33.5</td>
</tr>
<tr align="center" style="background-color: Honeydew;">
<td>QA-CLIP<sub>RN50</sub></td><td><b>35.5</b></td>
</tr>
<tr align="center">
<td>CN-CLIP<sub>ViT-B/16</sub></td><td>48.4</td>
</tr>
<tr align="center" style="background-color: Honeydew;">
<td>QA-CLIP<sub>ViT-B/16</sub></td><td><b>49.7</b></td>
</tr>
<tr align="center">
<td>CN-CLIP<sub>ViT-L/14</sub></td><td>54.7</td>
</tr>
<tr align="center" style="background-color: Honeydew;">
<td>QA-CLIP<sub>ViT-L/14</sub></td><td><b>55.8</b></td>
</tr>
</table>
<br>
<br><br>
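For reference, CLIP-style zero-shot classification scores the image against one text per class (each class name wrapped in a Chinese prompt) and takes the highest-scoring class as the prediction. Below is a minimal sketch using the Hugging Face `ChineseCLIPModel` interface; the class names and the prompt template are illustrative assumptions, not the exact prompts used for the ImageNet numbers above.
```python
import torch
from PIL import Image
from transformers import ChineseCLIPProcessor, ChineseCLIPModel

model = ChineseCLIPModel.from_pretrained("TencentARC/QA-CLIP-ViT-L-14")
processor = ChineseCLIPProcessor.from_pretrained("TencentARC/QA-CLIP-ViT-L-14")

# Illustrative class names and prompt template (assumptions, not the official evaluation prompts)
class_names = ["金鱼", "哈士奇", "消防车"]
prompts = [f"一张{name}的照片。" for name in class_names]

image = Image.open("example.jpg")  # replace with a local image of your own

with torch.no_grad():
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    outputs = model(**inputs)
    # logits_per_image: similarity of the image to each class prompt
    probs = outputs.logits_per_image.softmax(dim=-1)

pred = class_names[probs.argmax(dim=-1).item()]
print(pred, probs.squeeze(0).tolist())
```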
# Getting Started
## Inference Code
Inference code example:
```python
from PIL import Image
import requests
from transformers import ChineseCLIPProcessor, ChineseCLIPModel
model = ChineseCLIPModel.from_pretrained("TencentARC/QA-CLIP-ViT-L-14")
processor = ChineseCLIPProcessor.from_pretrained("TencentARC/QA-CLIP-ViT-L-14")
url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
# Squirtle, Bulbasaur, Charmander, Pikachu in English
texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]
# compute image feature
inputs = processor(images=image, return_tensors="pt")
image_features = model.get_image_features(**inputs)
image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True)  # normalize
# compute text features
inputs = processor(text=texts, padding=True, return_tensors="pt")
text_features = model.get_text_features(**inputs)
text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True)  # normalize
# compute image-text similarity scores
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)
```
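Under the standard CLIP formulation that `ChineseCLIPModel` follows, `logits_per_image` is the cosine similarity between the normalized image and text features, scaled by the learned temperature (`logit_scale`). The short sketch below reproduces the scores from the precomputed features; it is an illustrative check that continues the example above, not a required step.
```python
import torch

with torch.no_grad():
    # Cosine similarities between the already-normalized image and text features
    sim = image_features @ text_features.T  # shape: [1, 4]
    # Scale by the learned temperature; this should match outputs.logits_per_image above
    logits = model.logit_scale.exp() * sim
    probs_from_features = logits.softmax(dim=-1)

print(probs_from_features.tolist())
```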
<br><br>
# Acknowledgements
The code of this project is built on <b>[Chinese-CLIP](https://github.com/OFA-Sys/Chinese-CLIP)</b>; many thanks to the authors for their excellent open-source work.
<br><br>