Project Overview
This project presents an end-to-end vision-language model that generates image captions with a transformer-based architecture. It pairs a Vision Transformer (ViT) encoder with a transformer decoder, trained on the Flickr8k dataset, to produce natural-language image descriptions.
Key Features
- ViT-based image encoding
- Transformer decoder for natural language generation
- Custom PyTorch Dataset for Flickr8k
- Integration with Hugging Face Transformers (model setup sketched after this list)
- Evaluation using image-captioning benchmarks
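The encoder-decoder pairing listed above can be assembled directly from pretrained checkpoints. The sketch below is a minimal, assumed setup; the checkpoint names and the use of VisionEncoderDecoderModel are assumptions, and the project's exact configuration may differ:

```python
from transformers import (
    VisionEncoderDecoderModel,
    ViTFeatureExtractor,
    AutoTokenizer,
)

# Assumed checkpoints; the project may use different ones.
ENCODER_CKPT = "google/vit-base-patch16-224-in21k"
DECODER_CKPT = "gpt2"

# Pair a ViT encoder with a GPT-2 decoder; cross-attention layers are
# added to the decoder automatically.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    ENCODER_CKPT, DECODER_CKPT
)

feature_extractor = ViTFeatureExtractor.from_pretrained(ENCODER_CKPT)
tokenizer = AutoTokenizer.from_pretrained(DECODER_CKPT)

# GPT-2 ships without a pad token; reuse EOS for padding.
tokenizer.pad_token = tokenizer.eos_token
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
```

Reusing the EOS token for padding is a common workaround for GPT-2-style decoders, which do not define a dedicated pad token.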
Methodology
The model uses ViTFeatureExtractor to preprocess raw images and AutoTokenizer to tokenize the captions. A custom PyTorch Dataset class loads the images and pairs them with their captions. The EncoderDecoderModel is fine-tuned with teacher forcing, and captioning ability is evaluated using BLEU and accuracy metrics.
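The Dataset itself is not reproduced in this overview; the following is a minimal sketch of the pairing logic described above, assuming a Flickr8k captions CSV with `image` and `caption` columns and a maximum caption length of 64 tokens:

```python
import pandas as pd
from PIL import Image
from torch.utils.data import Dataset

class Flickr8kDataset(Dataset):
    """Pairs each Flickr8k image with one of its reference captions."""

    def __init__(self, captions_csv, image_dir, feature_extractor, tokenizer,
                 max_length=64):
        # Assumed CSV layout: one row per (image, caption) pair.
        self.df = pd.read_csv(captions_csv)
        self.image_dir = image_dir
        self.feature_extractor = feature_extractor
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        image = Image.open(f"{self.image_dir}/{row['image']}").convert("RGB")

        # ViT preprocessing -> pixel_values tensor of shape (3, 224, 224).
        pixel_values = self.feature_extractor(
            images=image, return_tensors="pt"
        ).pixel_values.squeeze(0)

        # Tokenize the caption; padded positions are set to -100 so the
        # cross-entropy loss ignores them.
        tokens = self.tokenizer(
            row["caption"],
            padding="max_length",
            truncation=True,
            max_length=self.max_length,
            return_tensors="pt",
        )
        labels = tokens.input_ids.squeeze(0)
        labels[labels == self.tokenizer.pad_token_id] = -100

        return {"pixel_values": pixel_values, "labels": labels}
```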
Technical Implementation
- Dataset: Flickr8k
- Encoder: Vision Transformer (ViT)
- Decoder: GPT2-style decoder from Hugging Face
- Training: Cross-entropy loss, Adam optimizer
- Evaluation: BLEU score, qualitative caption comparison (a training and evaluation sketch follows this list)
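Continuing from the sketches above, and assuming hypothetical `train_dataset` / `val_dataset` splits of the Flickr8k data, the teacher-forced training step and BLEU evaluation might look like the outline below. Batch size, learning rate, and the use of NLTK's `corpus_bleu` are assumptions rather than the project's recorded settings:

```python
import torch
from torch.utils.data import DataLoader
from nltk.translate.bleu_score import corpus_bleu

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

# Teacher-forced training: labels are shifted internally and scored with
# cross-entropy, so outputs.loss is the captioning loss directly.
model.train()
for batch in train_loader:
    pixel_values = batch["pixel_values"].to(device)
    labels = batch["labels"].to(device)

    outputs = model(pixel_values=pixel_values, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# BLEU: compare generated captions against the reference captions.
model.eval()
hypotheses, references = [], []
with torch.no_grad():
    for batch in DataLoader(val_dataset, batch_size=8):
        generated = model.generate(
            batch["pixel_values"].to(device), max_length=64
        )
        for ids, ref in zip(generated, batch["labels"]):
            hyp = tokenizer.decode(ids, skip_special_tokens=True)
            ref_text = tokenizer.decode(ref[ref != -100], skip_special_tokens=True)
            hypotheses.append(hyp.split())
            references.append([ref_text.split()])

print("BLEU-4:", corpus_bleu(references, hypotheses))
```

Because the labels are passed to the forward call, the Hugging Face model computes the shifted cross-entropy loss internally, which is what makes this a teacher-forced setup.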
Applications
- Image accessibility and narration
- Vision-language understanding
- Caption-based human-UAV interaction
- Visual storytelling
Results
The model demonstrates promising caption quality, producing coherent and contextually relevant descriptions. Fine-tuning yielded better fluency and semantic accuracy than baseline CNN-RNN setups.
Conclusion
This project showcases a scalable, modular framework for image captioning that leverages modern transformer architectures for vision-language fusion. It lays the foundation for more advanced Vision-Language-Action (VLA) systems.