Multi-Axis Attention Network using MaxViT for Fine-Grained Image Classification and Discriminative Feature Learning
Abstract
Fine-grained visual categorization (FGVC) is a challenging computer vision task that requires discriminating between visually similar subcategories of a broader object class. The challenge is further exacerbated in bird species classification, where inter-class variation is subtle and spatially localized, while large intra-class variation arises from differences in pose, lighting, and background. To address these challenges, we propose a weakly supervised, unified approach: a Multi-Axis Attention Network with MaxViT for Fine-Grained Image Classification and Discriminative Feature Learning.
Our method capitalizes on the strengths of the Multi-Axis Vision Transformer (MaxViT), which combines convolutional (MBConv) encoding with block-wise local self-attention and grid-based sparse global self-attention. This combined attention pattern enables the model to capture both the local texture detail and the global context necessary for fine-grained classification. Unlike conventional methods that rely on part annotations or region-level supervision, our model is trained end-to-end with image-level labels alone. To explain the model's predictions and identify the regions responsible for classification, we apply Grad-CAM to the final attention layers. The resulting heatmaps show that the model consistently attends to highly discriminative regions such as wing structures, beak shapes, and feather textures, providing strong visual confirmation of its decision-making. Our framework is evaluated on two benchmark datasets, CUB-200-2011 and NABirds, achieving 95.1% and 94.2% Top-1 classification accuracy, respectively. These results validate the effectiveness and interpretability of MaxViT as a backbone for fine-grained visual classification under weak supervision.
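The block/grid decomposition at the heart of multi-axis attention can be illustrated with a small sketch. The following NumPy snippet (the window size and toy feature map are illustrative assumptions, not taken from the paper) shows how block partitioning yields contiguous local windows while grid partitioning yields strided, globally distributed windows; in MaxViT, self-attention is applied within each window of both partitions.

```python
import numpy as np

def block_partition(x, p):
    """Split an (H, W, C) feature map into non-overlapping p x p windows.
    Attention within each window is local (MaxViT "block attention")."""
    h, w, c = x.shape
    x = x.reshape(h // p, p, w // p, p, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, p * p, c)

def grid_partition(x, g):
    """Split an (H, W, C) feature map using a fixed g x g grid.
    Each window gathers pixels sampled at a stride across the whole map,
    so attention within it is sparse but global (MaxViT "grid attention")."""
    h, w, c = x.shape
    x = x.reshape(g, h // g, g, w // g, c)
    return x.transpose(1, 3, 0, 2, 4).reshape(-1, g * g, c)

# Toy 8x8 feature map with 2 channels.
x = np.arange(8 * 8 * 2).reshape(8, 8, 2)

blocks = block_partition(x, 4)  # 4 windows of 16 contiguous tokens each
grids = grid_partition(x, 4)    # 4 windows of 16 strided tokens each

# First block window covers the top-left 4x4 patch (local context).
assert np.array_equal(blocks[0], x[:4, :4].reshape(16, 2))
# First grid window samples every 2nd pixel over the whole map (global context).
assert np.array_equal(grids[0], x[::2, ::2].reshape(16, 2))
```

Alternating the two partitions lets every token exchange information both with its immediate neighborhood and with tokens across the entire image, at a cost linear in image size rather than the quadratic cost of full self-attention.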