Project Overview
This project presents an end-to-end vision-language model that generates image captions with a transformer-based architecture. It pairs a Vision Transformer (ViT) encoder with a transformer decoder, trained on the Flickr8k dataset, to produce natural-language image descriptions.
Key Features
- ViT-based image encoding
- Transformer decoder for natural language generation
- Custom PyTorch Dataset for Flickr8k
- Integration with Hugging Face Transformers (model setup sketched after this list)
- Evaluation using image-captioning benchmarks
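The encoder-decoder pairing listed above can be assembled directly from pretrained checkpoints. The sketch below is a minimal, assumed setup; the checkpoint names and the use of VisionEncoderDecoderModel are assumptions, and the project's exact configuration may differ:

```python
from transformers import (
    VisionEncoderDecoderModel,
    ViTFeatureExtractor,
    AutoTokenizer,
)

# Assumed checkpoints; the project may use different ones.
ENCODER_CKPT = "google/vit-base-patch16-224-in21k"
DECODER_CKPT = "gpt2"

# Pair a ViT encoder with a GPT-2 decoder; cross-attention layers are
# added to the decoder automatically.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    ENCODER_CKPT, DECODER_CKPT
)

feature_extractor = ViTFeatureExtractor.from_pretrained(ENCODER_CKPT)
tokenizer = AutoTokenizer.from_pretrained(DECODER_CKPT)

# GPT-2 ships without a pad token; reuse EOS for padding.
tokenizer.pad_token = tokenizer.eos_token
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
```

Reusing the EOS token for padding is a common workaround for GPT-2-style decoders, which do not define a dedicated pad token.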
Methodology
The model uses ViTFeatureExtractor to preprocess raw images and AutoTokenizer to tokenize the captions. A custom PyTorch Dataset class loads the images and pairs them with their captions. The EncoderDecoderModel is fine-tuned with teacher forcing, and captioning ability is evaluated using BLEU and accuracy metrics.
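The Dataset itself is not reproduced in this overview; the following is a minimal sketch of the pairing logic described above, assuming a Flickr8k captions CSV with `image` and `caption` columns and a maximum caption length of 64 tokens:

```python
import pandas as pd
from PIL import Image
from torch.utils.data import Dataset

class Flickr8kDataset(Dataset):
    """Pairs each Flickr8k image with one of its reference captions."""

    def __init__(self, captions_csv, image_dir, feature_extractor, tokenizer,
                 max_length=64):
        # Assumed CSV layout: one row per (image, caption) pair.
        self.df = pd.read_csv(captions_csv)
        self.image_dir = image_dir
        self.feature_extractor = feature_extractor
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        image = Image.open(f"{self.image_dir}/{row['image']}").convert("RGB")

        # ViT preprocessing -> pixel_values tensor of shape (3, 224, 224).
        pixel_values = self.feature_extractor(
            images=image, return_tensors="pt"
        ).pixel_values.squeeze(0)

        # Tokenize the caption; padded positions are set to -100 so the
        # cross-entropy loss ignores them.
        tokens = self.tokenizer(
            row["caption"],
            padding="max_length",
            truncation=True,
            max_length=self.max_length,
            return_tensors="pt",
        )
        labels = tokens.input_ids.squeeze(0)
        labels[labels == self.tokenizer.pad_token_id] = -100

        return {"pixel_values": pixel_values, "labels": labels}
```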
Technical Implementation
- Dataset: Flickr8k
- Encoder: Vision Transformer (ViT)
- Decoder: GPT2-style decoder from Hugging Face
- Training: Cross-entropy loss, Adam optimizer
- Evaluation: BLEU score, qualitative caption comparison (a training and evaluation sketch follows this list)
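Continuing from the sketches above, and assuming hypothetical `train_dataset` / `val_dataset` splits of the Flickr8k data, the teacher-forced training step and BLEU evaluation might look like the outline below. Batch size, learning rate, and the use of NLTK's `corpus_bleu` are assumptions rather than the project's recorded settings:

```python
import torch
from torch.utils.data import DataLoader
from nltk.translate.bleu_score import corpus_bleu

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

# Teacher-forced training: labels are shifted internally and scored with
# cross-entropy, so outputs.loss is the captioning loss directly.
model.train()
for batch in train_loader:
    pixel_values = batch["pixel_values"].to(device)
    labels = batch["labels"].to(device)

    outputs = model(pixel_values=pixel_values, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# BLEU: compare generated captions against the reference captions.
model.eval()
hypotheses, references = [], []
with torch.no_grad():
    for batch in DataLoader(val_dataset, batch_size=8):
        generated = model.generate(
            batch["pixel_values"].to(device), max_length=64
        )
        for ids, ref in zip(generated, batch["labels"]):
            hyp = tokenizer.decode(ids, skip_special_tokens=True)
            ref_text = tokenizer.decode(ref[ref != -100], skip_special_tokens=True)
            hypotheses.append(hyp.split())
            references.append([ref_text.split()])

print("BLEU-4:", corpus_bleu(references, hypotheses))
```

Because the labels are passed to the forward call, the Hugging Face model computes the shifted cross-entropy loss internally, which is what makes this a teacher-forced setup.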
Applications
- Image accessibility and narration
- Vision-language understanding
- Caption-based human-UAV interaction
- Visual storytelling
Results
The model demonstrates promising caption quality, producing coherent and contextually relevant descriptions. Fine-tuning yielded better fluency and semantic accuracy than baseline CNN-RNN setups.
Conclusion
This project showcases a scalable, modular framework for image captioning that leverages modern transformer architectures for vision-language fusion. It lays the foundation for more advanced Vision-Language-Action (VLA) systems.