Project Overview
This project explores the use of Vision Transformers (ViT) for object detection by applying self-attention mechanisms to image patches. The model emphasizes accuracy, computational efficiency, and explainability. It bridges the gap between CNN-based models and transformer-based architectures in the context of lightweight computer vision tasks.
Key Features
- Patch Embeddings: Converts each image into 16x16 patches and projects them into embedding vectors (a minimal sketch follows this list).
- Multi-Head Self-Attention: Captures global context across all patches, enabling effective spatial reasoning.
- Pretrained Backbones: Uses HuggingFace's ViT-B/16 pretrained weights for transfer learning.
- Object Detection Head: A lightweight MLP head fine-tuned for binary/multi-class detection.
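The patch-embedding step referenced above can be expressed in a few lines. The following is a minimal sketch, assuming a 224x224 input, 16x16 patches, and ViT-B/16's 768-dimensional embeddings; the class name and shapes are illustrative, not taken from the project code.

```python
# Sketch of the patch-embedding step: split a 224x224 image into 16x16
# patches via a strided convolution and project each patch to an embedding.
# All sizes here (224, 16, 768) are assumptions matching ViT-B/16 defaults.
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A conv with kernel = stride = patch_size is equivalent to slicing
        # non-overlapping patches and applying a shared linear projection.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # x: (B, 3, 224, 224) -> (B, 768, 14, 14) -> (B, 196, 768)
        x = self.proj(x)
        return x.flatten(2).transpose(1, 2)

patches = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(patches.shape)  # torch.Size([1, 196, 768])
```

The strided convolution keeps the embedding step cheap while producing the same result as explicitly flattening patches and applying a linear layer.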
Methodology
Images were processed into patches and passed through a ViT encoder. The CLS token output was fed into a detection head trained with cross-entropy loss. Transfer learning enabled faster convergence on a small subset of annotated aerial drone images.
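A minimal sketch of this setup is shown below, assuming the google/vit-base-patch16-224-in21k checkpoint from the transformers library and a two-layer MLP head; the hidden size, class count, and the ViTDetector name are illustrative assumptions rather than the project's exact configuration.

```python
# Sketch: pretrained ViT-B/16 backbone with a lightweight MLP head on the
# CLS token, trained with cross-entropy. Checkpoint name, head width, and
# number of classes are assumptions for illustration.
import torch
import torch.nn as nn
from transformers import ViTModel

class ViTDetector(nn.Module):
    def __init__(self, num_classes=2, hidden_dim=256):
        super().__init__()
        self.backbone = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        self.head = nn.Sequential(
            nn.Linear(self.backbone.config.hidden_size, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, pixel_values):
        # last_hidden_state: (B, 197, 768); index 0 is the CLS token.
        cls_token = self.backbone(pixel_values=pixel_values).last_hidden_state[:, 0]
        return self.head(cls_token)

model = ViTDetector()
logits = model(torch.randn(2, 3, 224, 224))
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 1]))
```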
We used an SGD optimizer with cosine annealing and trained for 10 epochs on a single GPU with early stopping and validation checkpoints.
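The loop below is a simplified sketch of this schedule; the learning rate, momentum, patience, and the evaluate() helper are assumptions, not the project's actual settings.

```python
# Sketch of the training schedule: SGD with cosine annealing over 10 epochs,
# saving the best validation checkpoint and stopping early when validation
# loss stops improving. Hyperparameters and evaluate() are assumptions.
import torch

def train(model, train_loader, val_loader, evaluate, epochs=10, patience=3):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

    best_val, epochs_without_improvement = float("inf"), 0
    for epoch in range(epochs):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()

        val_loss = evaluate(model, val_loader, criterion, device)
        if val_loss < best_val:
            best_val, epochs_without_improvement = val_loss, 0
            torch.save(model.state_dict(), "best_checkpoint.pt")  # validation checkpoint
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # early stopping
```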
Applications
- Lightweight UAV detection pipelines
- Vision-based robotics with limited compute
- Real-time object monitoring in resource-constrained settings
Conclusion
This project demonstrates how Vision Transformers can be rapidly adapted for real-world detection tasks. The modular ViT-based pipeline shows promising results in low-data regimes and opens the door to scalable transformer-based perception systems in robotics.