Project Overview

This project presents an end-to-end vision-language model that generates image captions with a transformer-based architecture. It pairs a ViT encoder with a transformer decoder and is trained on the Flickr8k dataset to describe natural images.

Key Features

  • ViT-based image encoding
  • Transformer decoder for natural language generation
  • Custom PyTorch Dataset for Flickr8k
  • Integration with Hugging Face Transformers
  • Evaluation using image-captioning benchmarks
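As a rough illustration of the custom Dataset listed above, the sketch below pairs image file names with captions. It is not the project's actual class: the name `CaptionPairs`, the `load_image` hook, and the sample lines are hypothetical, and the line format (`<image_file>#<index>`, a tab, then the caption) is an assumption about how the caption file is laid out. To stay dependency-free it does not subclass `torch.utils.data.Dataset`, although it follows the same `__len__` / `__getitem__` protocol.

```python
from collections import namedtuple

CaptionPair = namedtuple("CaptionPair", ["image_file", "caption"])


def parse_caption_lines(lines):
    """Parse lines of the ASSUMED form '<image_file>#<index>\\t<caption>'."""
    pairs = []
    for raw in lines:
        line = raw.strip()
        if not line or "\t" not in line:
            continue  # skip blank or malformed lines
        key, caption = line.split("\t", 1)
        image_file = key.split("#", 1)[0]  # drop the per-image caption index
        pairs.append(CaptionPair(image_file, caption.strip()))
    return pairs


class CaptionPairs:
    """Map-style container (hypothetical): one item per (image, caption) pair."""

    def __init__(self, lines, load_image=None):
        self.pairs = parse_caption_lines(lines)
        # By default the "image" is just its file name; a real pipeline would
        # pass a function that opens the file and applies the feature extractor.
        self.load_image = load_image or (lambda name: name)

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        pair = self.pairs[idx]
        return self.load_image(pair.image_file), pair.caption


# Made-up sample lines, for illustration only.
sample = [
    "img_001.jpg#0\tA dog runs across a field .",
    "img_001.jpg#1\tA brown dog is running outside .",
    "img_002.jpg#0\tTwo children play on a beach .",
]
dataset = CaptionPairs(sample)
```

Because each caption becomes its own item, an image with several reference captions appears several times per epoch, which is one common way to use all the available captions during training.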

Methodology

ViTFeatureExtractor preprocesses the raw images, and AutoTokenizer tokenizes the captions. A custom PyTorch Dataset class loads the images and pairs each one with its captions. The EncoderDecoderModel is fine-tuned with teacher forcing, and its captioning ability is evaluated using BLEU and accuracy metrics.
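Teacher forcing, as mentioned above, means the decoder is fed the ground-truth caption tokens and is trained to predict the next token at each position. The following is a minimal, dependency-free sketch of that idea, not the project's training code: the token IDs, the vocabulary size of 5, and the function names are all hypothetical, and in a PyTorch setup the loss would typically come from `torch.nn.CrossEntropyLoss` rather than the hand-written version here.

```python
import math


def shift_for_teacher_forcing(token_ids):
    """Split a caption into decoder inputs and next-token targets.

    At step t the decoder sees the ground-truth tokens up to t and is
    trained to predict the token at t + 1.
    """
    return token_ids[:-1], token_ids[1:]


def cross_entropy(logits, target):
    """Cross-entropy of one target index under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target]


def sequence_loss(step_logits, targets):
    """Mean per-token cross-entropy over one caption."""
    losses = [cross_entropy(l, t) for l, t in zip(step_logits, targets)]
    return sum(losses) / len(losses)


# Hypothetical token IDs: 0 = <bos>, 1 = <eos>, 2 and above = words.
caption_ids = [0, 4, 2, 3, 1]
decoder_inputs, targets = shift_for_teacher_forcing(caption_ids)

# Stand-in for decoder outputs: uniform logits over a 5-token vocabulary.
uniform_logits = [[0.0] * 5 for _ in targets]
loss = sequence_loss(uniform_logits, targets)
```

With uniform logits every token has probability 1/5, so the loss equals log(5), which is a handy sanity check for an untrained decoder over a vocabulary of that size.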

Technical Implementation

  • Dataset: Flickr8k
  • Encoder: Vision Transformer (ViT)
  • Decoder: GPT2-style decoder from Hugging Face
  • Training: Cross-entropy loss, Adam optimizer
  • Evaluation: BLEU score, qualitative caption comparison
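To make the BLEU evaluation listed above concrete, here is a simplified sentence-level BLEU written from scratch: clipped n-gram precision, a geometric mean over n-gram orders, and a brevity penalty. It is a sketch under stated assumptions rather than the project's evaluation script. It applies no smoothing, the example captions are invented, and reported scores are normally computed at corpus level with an established library.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Fraction of candidate n-grams found in the references, with each
    n-gram's count clipped to its maximum count in any single reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def brevity_penalty(candidate, references):
    """Penalize candidates shorter than the closest reference length."""
    c = len(candidate)
    if c == 0:
        return 0.0
    # Closest reference length; ties resolve to the shorter reference.
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    return 1.0 if c > r else math.exp(1 - r / c)


def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform n-gram weights."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # without smoothing, any zero precision gives zero
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, references) * math.exp(log_mean)


# Invented example captions.
reference = "a dog runs across the field".split()
exact = sentence_bleu(reference, [reference])
short = sentence_bleu("a dog runs".split(), [reference], max_n=2)
repeated = sentence_bleu("the the the".split(), ["the cat".split()], max_n=1)
```

The short candidate matches every n-gram it contains, so its score is driven entirely by the brevity penalty. The repeated-word candidate shows why counts are clipped: only one of its three "the" tokens is credited.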

Applications

  • Image accessibility and narration
  • Vision-language understanding
  • Caption-based human-UAV interaction
  • Visual storytelling

Results

The model produces coherent, contextually relevant captions. Fine-tuning yielded better fluency and semantic accuracy than baseline CNN-RNN setups.

Conclusion

This project showcases a scalable and modular framework for image captioning that uses modern transformer architectures to fuse vision and language. It lays the foundation for more advanced Vision-Language-Action (VLA) systems.

Project Information

  • Category: Vision-Language
  • Duration: 1 week
  • Completed: 2025
  • Dataset: Flickr8k
  • Tools: PyTorch, Hugging Face

Technologies Used

  • Python
  • PyTorch
  • Transformers
  • OpenCV
  • ViT
