
CLIP-GPT2 Image Captioning Model

This is a fine-tuned image captioning model that combines the CLIP vision encoder with the GPT-2 language model through a projection layer.

Model Architecture

  • Vision Encoder: CLIP (openai/clip-vit-base-patch16)
  • Language Model: GPT-2
  • Projection Layer: Multi-layer perceptron that bridges CLIP image embeddings and GPT-2 token embeddings (see the sketch after this list)
  • Training: Fine-tuned on Flickr30k dataset
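
The projection layer is described only as a multi-layer perceptron; the following is a minimal sketch of one common way to wire such a bridge, assuming a two-layer MLP that maps the 512-dimensional CLIP ViT-B/16 image embedding into a short prefix of 768-dimensional GPT-2 token embeddings. The class name, prefix length, and hidden width are illustrative and are not taken from this repository.

import torch
import torch.nn as nn

class ClipToGPT2Projection(nn.Module):
    # Hypothetical bridge module: maps one CLIP image embedding to a prefix of
    # GPT-2 embedding vectors that the language model can attend to.
    def __init__(self, clip_dim=512, gpt2_dim=768, prefix_length=10):
        super().__init__()
        self.prefix_length = prefix_length
        self.gpt2_dim = gpt2_dim
        hidden = gpt2_dim * prefix_length // 2
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, gpt2_dim * prefix_length),
        )

    def forward(self, clip_embedding):        # (batch, clip_dim)
        prefix = self.mlp(clip_embedding)     # (batch, prefix_length * gpt2_dim)
        return prefix.view(-1, self.prefix_length, self.gpt2_dim)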

Performance

  • Training Steps: 176,000
  • Accuracy: ~40% on test samples
  • Dataset: Flickr30k (158,915 image-caption pairs)
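
For scale, assuming the step count refers to optimizer steps: with batch size 2 and 4-step gradient accumulation (see Training Details), each step covers 8 image-caption pairs, so 176,000 steps correspond to roughly 1.4 million pairs processed, or about 9 passes over the dataset.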

Usage

from transformers import CLIPProcessor, GPT2Tokenizer
from PIL import Image

# Load the model. ClipGPT2ImageCaptioner is the custom wrapper class for this model;
# it is not part of transformers and must be imported from the accompanying model code.
model = ClipGPT2ImageCaptioner.from_pretrained("your-username/clip-gpt2-image-captioner")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Generate a caption for a local image
image = Image.open("your_image.jpg").convert("RGB")  # CLIP expects RGB input
caption = model.generate_caption(image, clip_processor, tokenizer)
print(caption)
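
generate_caption is a method of the custom wrapper class, so its exact behavior is defined by this repository's code. As a rough illustration of how prefix-based captioning usually works, the sketch below encodes the image with CLIP, projects the embedding to a GPT-2 prefix (using the hypothetical ClipToGPT2Projection above), and decodes greedily; everything other than the transformers calls is an assumption, not this repo's implementation.

import torch
from transformers import CLIPModel, GPT2LMHeadModel

@torch.no_grad()
def greedy_caption(image, clip_processor, tokenizer, clip_model, gpt2, projection, max_new_tokens=30):
    # Encode the image with CLIP and project it into GPT-2's embedding space.
    pixel_values = clip_processor(images=image, return_tensors="pt").pixel_values
    image_embedding = clip_model.get_image_features(pixel_values=pixel_values)  # (1, 512)
    generated = projection(image_embedding)                                     # (1, prefix_len, 768)

    # Greedy decoding: repeatedly append the most likely next token.
    token_ids = []
    for _ in range(max_new_tokens):
        logits = gpt2(inputs_embeds=generated).logits[:, -1, :]
        next_id = logits.argmax(dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
        token_ids.append(next_id.item())
        next_embedding = gpt2.transformer.wte(next_id).unsqueeze(1)             # (1, 1, 768)
        generated = torch.cat([generated, next_embedding], dim=1)
    return tokenizer.decode(token_ids)

# Example wiring with stock components (the repo's checkpoint carries its own fine-tuned weights):
# clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
# gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")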

Training Details

  • Loss Function: Cross-entropy loss
  • Optimizer: AdamW
  • Learning Rate: 1e-4
  • Batch Size: 2
  • Gradient Accumulation: 4 steps, for an effective batch size of 8 (see the training-loop sketch after this list)
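
A minimal sketch of how these settings typically fit together: teacher-forced cross-entropy over caption tokens, AdamW at 1e-4, micro-batches of 2, and an optimizer step every 4 micro-batches. The model and dataloader interfaces here are placeholders, not this repository's training code.

import torch.nn.functional as F
from torch.optim import AdamW

def train_with_accumulation(model, dataloader, pad_token_id, accumulation_steps=4, lr=1e-4):
    # Hyperparameters mirror the list above: AdamW, lr 1e-4, 4-step gradient accumulation.
    optimizer = AdamW(model.parameters(), lr=lr)
    model.train()
    for step, (images, captions) in enumerate(dataloader):    # dataloader assumed to yield batches of 2
        logits = model(images, captions)                      # assumed shape: (batch, seq_len, vocab)
        loss = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),      # predict token t+1 from tokens up to t
            captions[:, 1:].reshape(-1),
            ignore_index=pad_token_id,                        # skip padded caption positions
        )
        (loss / accumulation_steps).backward()                # average gradients across accumulated micro-batches
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()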

Model Files

  • model.pt: Trained model weights
  • config.json: Model configuration
  • README.md: This file
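
Since the weights ship as a raw model.pt rather than a standard transformers checkpoint, manually reconstructing the model would look roughly like the sketch below. It assumes model.pt holds a state_dict and that config.json holds the constructor arguments for ClipGPT2ImageCaptioner; check the actual files before relying on this.

import json
import torch

# Hypothetical manual load; adjust to the actual contents of config.json and model.pt.
with open("config.json") as f:
    config = json.load(f)

model = ClipGPT2ImageCaptioner(**config)                # custom class, see Usage above
state_dict = torch.load("model.pt", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()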

Citation

If you use this model, please cite:

@misc{clip-gpt2-image-captioner,
  title={CLIP-GPT2 Image Captioning with Projection Layer},
  author={Your Name},
  year={2024},
  url={https://huggingface.co/your-username/clip-gpt2-image-captioner}
}