Abstract: Accessible communication based on sign language recognition (SLR) is key to emergency medical assistance for the hearing-impaired community. A significant challenge in SLR for emergency medicine is balancing the capture of local and global information. To address this, we propose a novel approach based on the inter-learning of visual features between global and local information. Specifically, our method enhances the perception capability of the visual feature extractor by combining the strengths of convolutional neural networks, which are adept at capturing local features, and vision transformers, which excel at perceiving global features. Furthermore, to mitigate the overfitting caused by the limited availability of sign language data for emergency medical applications, we introduce an enhanced short temporal module that augments the data with additional subsequences. Experimental results on three publicly available sign language datasets demonstrate the efficacy of the proposed approach.