# CLIP-GPT2 Image Captioning Model

This is a fine-tuned image captioning model that combines a CLIP vision encoder with a GPT-2 language model through a projection layer.
## Model Architecture
- Vision Encoder: CLIP (openai/clip-vit-base-patch16)
- Language Model: GPT-2
- Projection Layer: Multi-layer perceptron that bridges CLIP and GPT-2 embedding spaces (see the sketch below)
- Training: Fine-tuned on the Flickr30k dataset
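The exact layer sizes of the projection MLP are not listed on this card. As a rough, hypothetical sketch (assuming CLIP ViT-B/16's 512-dimensional image embedding, GPT-2's 768-dimensional token embedding, and an arbitrary hidden width and prefix length), the bridge might look like:

```python
import torch
import torch.nn as nn

class ClipToGPT2Projection(nn.Module):
    """Hypothetical MLP that maps CLIP image embeddings into GPT-2's embedding space.

    Dimensions are assumptions: CLIP ViT-B/16 produces 512-d image embeddings and
    GPT-2 (base) uses 768-d token embeddings. The layer sizes used for this
    checkpoint may differ.
    """

    def __init__(self, clip_dim: int = 512, gpt2_dim: int = 768,
                 hidden_dim: int = 1024, prefix_length: int = 1):
        super().__init__()
        self.prefix_length = prefix_length
        self.gpt2_dim = gpt2_dim
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, gpt2_dim * prefix_length),
        )

    def forward(self, clip_embeds: torch.Tensor) -> torch.Tensor:
        # (batch, clip_dim) -> (batch, prefix_length, gpt2_dim)
        out = self.mlp(clip_embeds)
        return out.view(-1, self.prefix_length, self.gpt2_dim)
```

In models of this style, the projected vectors are prepended to the caption's token embeddings so that GPT-2 can condition on the image while decoding.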
## Performance
- Training Steps: 176,000
- Accuracy: ~40% on test samples
- Dataset: Flickr30k (158,915 image–caption pairs)
## Usage
```python
from transformers import CLIPProcessor, GPT2Tokenizer
import torch
from PIL import Image

# ClipGPT2ImageCaptioner is the custom model class from this repository's code
# (it is not part of transformers); import it from that code before running this.

# Load the model, image processor, and tokenizer
model = ClipGPT2ImageCaptioner.from_pretrained("your-username/clip-gpt2-image-captioner")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Generate a caption for an image
image = Image.open("your_image.jpg")
caption = model.generate_caption(image, clip_processor, tokenizer)
print(caption)
```
## Training Details
- Loss Function: Cross-entropy loss
- Optimizer: AdamW
- Learning Rate: 1e-4
- Batch Size: 2
- Gradient Accumulation: 4 steps (see the training-step sketch below)
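As a rough illustration of how these settings fit together, here is a minimal, hypothetical training-step sketch; it is not the actual training script, and `model`, `train_loader`, and the forward signature are placeholders:

```python
import torch
from torch.optim import AdamW

# Placeholders: `model` maps (pixel_values, input_ids) to logits over GPT-2's
# vocabulary, and `train_loader` yields micro-batches of size 2.
optimizer = AdamW(model.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()   # padding/masking details omitted
accum_steps = 4                         # effective batch size = 2 * 4 = 8

model.train()
optimizer.zero_grad()
for step, (pixel_values, input_ids) in enumerate(train_loader):
    logits = model(pixel_values=pixel_values, input_ids=input_ids)
    # Shift so each position predicts the next caption token
    loss = loss_fn(logits[:, :-1].reshape(-1, logits.size(-1)),
                   input_ids[:, 1:].reshape(-1))
    (loss / accum_steps).backward()     # scale loss for gradient accumulation
    if (step + 1) % accum_steps == 0:   # optimizer update every 4 micro-batches
        optimizer.step()
        optimizer.zero_grad()
```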
## Model Files
- `model.pt`: Trained model weights
- `config.json`: Model configuration
- `README.md`: This file
## Citation
If you use this model, please cite:
```bibtex
@misc{clip-gpt2-image-captioner,
  title={CLIP-GPT2 Image Captioning with Projection Layer},
  author={Your Name},
  year={2024},
  url={https://huggingface.co/your-username/clip-gpt2-image-captioner}
}
```