
CLIP-GPT2 Image Captioning Model

This is a fine-tuned image captioning model that combines the CLIP vision encoder with the GPT-2 language model through a projection layer.

Model Architecture

  • Vision Encoder: CLIP (openai/clip-vit-base-patch16)
  • Language Model: GPT-2
  • Projection Layer: Multi-layer perceptron that bridges CLIP image embeddings and GPT-2 token embeddings (see the sketch after this list)
  • Training: Fine-tuned on Flickr30k dataset
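
The projection layer is described only as a multi-layer perceptron; the following is a minimal sketch of one common way to wire such a bridge, assuming a two-layer MLP that maps the 512-dimensional CLIP ViT-B/16 image embedding into a short prefix of 768-dimensional GPT-2 token embeddings. The class name, prefix length, and hidden width are illustrative and are not taken from this repository.

import torch
import torch.nn as nn

class ClipToGPT2Projection(nn.Module):
    # Hypothetical bridge module: maps one CLIP image embedding to a prefix of
    # GPT-2 embedding vectors that the language model can attend to.
    def __init__(self, clip_dim=512, gpt2_dim=768, prefix_length=10):
        super().__init__()
        self.prefix_length = prefix_length
        self.gpt2_dim = gpt2_dim
        hidden = gpt2_dim * prefix_length // 2
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, gpt2_dim * prefix_length),
        )

    def forward(self, clip_embedding):        # (batch, clip_dim)
        prefix = self.mlp(clip_embedding)     # (batch, prefix_length * gpt2_dim)
        return prefix.view(-1, self.prefix_length, self.gpt2_dim)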

Performance

  • Training Steps: 176,000
  • Accuracy: ~40% on test samples
  • Dataset: Flickr30k (158,915 image-caption pairs)
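
For scale, assuming the step count refers to optimizer steps: with batch size 2 and 4-step gradient accumulation (see Training Details), each step covers 8 image-caption pairs, so 176,000 steps correspond to roughly 1.4 million pairs processed, or about 9 passes over the dataset.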

Usage

from transformers import CLIPProcessor, GPT2Tokenizer
from PIL import Image

# Load the model. ClipGPT2ImageCaptioner is the custom wrapper class for this model;
# it is not part of transformers and must be imported from the accompanying model code.
model = ClipGPT2ImageCaptioner.from_pretrained("your-username/clip-gpt2-image-captioner")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Generate a caption for a local image
image = Image.open("your_image.jpg").convert("RGB")  # CLIP expects RGB input
caption = model.generate_caption(image, clip_processor, tokenizer)
print(caption)
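
generate_caption is a method of the custom wrapper class, so its exact behavior is defined by this repository's code. As a rough illustration of how prefix-based captioning usually works, the sketch below encodes the image with CLIP, projects the embedding to a GPT-2 prefix (using the hypothetical ClipToGPT2Projection above), and decodes greedily; everything other than the transformers calls is an assumption, not this repo's implementation.

import torch
from transformers import CLIPModel, GPT2LMHeadModel

@torch.no_grad()
def greedy_caption(image, clip_processor, tokenizer, clip_model, gpt2, projection, max_new_tokens=30):
    # Encode the image with CLIP and project it into GPT-2's embedding space.
    pixel_values = clip_processor(images=image, return_tensors="pt").pixel_values
    image_embedding = clip_model.get_image_features(pixel_values=pixel_values)  # (1, 512)
    generated = projection(image_embedding)                                     # (1, prefix_len, 768)

    # Greedy decoding: repeatedly append the most likely next token.
    token_ids = []
    for _ in range(max_new_tokens):
        logits = gpt2(inputs_embeds=generated).logits[:, -1, :]
        next_id = logits.argmax(dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
        token_ids.append(next_id.item())
        next_embedding = gpt2.transformer.wte(next_id).unsqueeze(1)             # (1, 1, 768)
        generated = torch.cat([generated, next_embedding], dim=1)
    return tokenizer.decode(token_ids)

# Example wiring with stock components (the repo's checkpoint carries its own fine-tuned weights):
# clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
# gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")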

Training Details

  • Loss Function: Cross-entropy loss
  • Optimizer: AdamW
  • Learning Rate: 1e-4
  • Batch Size: 2
  • Gradient Accumulation: 4 steps, for an effective batch size of 8 (see the training-loop sketch after this list)
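
A minimal sketch of how these settings typically fit together: teacher-forced cross-entropy over caption tokens, AdamW at 1e-4, micro-batches of 2, and an optimizer step every 4 micro-batches. The model and dataloader interfaces here are placeholders, not this repository's training code.

import torch.nn.functional as F
from torch.optim import AdamW

def train_with_accumulation(model, dataloader, pad_token_id, accumulation_steps=4, lr=1e-4):
    # Hyperparameters mirror the list above: AdamW, lr 1e-4, 4-step gradient accumulation.
    optimizer = AdamW(model.parameters(), lr=lr)
    model.train()
    for step, (images, captions) in enumerate(dataloader):    # dataloader assumed to yield batches of 2
        logits = model(images, captions)                      # assumed shape: (batch, seq_len, vocab)
        loss = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),      # predict token t+1 from tokens up to t
            captions[:, 1:].reshape(-1),
            ignore_index=pad_token_id,                        # skip padded caption positions
        )
        (loss / accumulation_steps).backward()                # average gradients across accumulated micro-batches
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()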

Model Files

  • model.pt: Trained model weights
  • config.json: Model configuration
  • README.md: This file
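
Since the weights ship as a raw model.pt rather than a standard transformers checkpoint, manually reconstructing the model would look roughly like the sketch below. It assumes model.pt holds a state_dict and that config.json holds the constructor arguments for ClipGPT2ImageCaptioner; check the actual files before relying on this.

import json
import torch

# Hypothetical manual load; adjust to the actual contents of config.json and model.pt.
with open("config.json") as f:
    config = json.load(f)

model = ClipGPT2ImageCaptioner(**config)                # custom class, see Usage above
state_dict = torch.load("model.pt", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()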

Citation

If you use this model, please cite:

@misc{clip-gpt2-image-captioner,
  title={CLIP-GPT2 Image Captioning with Projection Layer},
  author={Your Name},
  year={2024},
  url={https://huggingface.co/your-username/clip-gpt2-image-captioner}
}