Detection Using Mask Adaptive Transformers in Unmanned Aerial Vehicle Imagery
Affiliation:

1. China Telecommunication Corporation Zhejiang Branch; 2. Binjiang Institute of Zhejiang University; 3. University of Portsmouth

    Abstract:

    Drone photography is an essential building block of intelligent transportation, enabling wide-ranging monitoring, precise positioning, and rapid transmission. However, the high computational cost of Transformer-based object detection methods hinders real-time result transmission in drone target detection applications. We therefore propose Mask Adaptive Transformers tailored for such scenarios. Specifically, we introduce a structure that supports collaborative token sparsification within support windows, enhancing fault tolerance and reducing computational overhead. This structure comprises two modules: a binary mask strategy and Adaptive Window Self-Attention (A-WSA). The binary mask strategy focuses computation on significant objects across varied complex scenes. The A-WSA mechanism applies self-attention to the selected objects, balancing performance and computational cost while isolating contextual leakage. Extensive experiments on the challenging CARPK and VisDrone datasets demonstrate the effectiveness and superiority of the proposed method: it achieves a mean average precision (mAP@0.5) improvement of 1.25% over CD-YOLOv5 on the CARPK dataset and a 3.75% mAP@0.5 improvement over CZ Det on the VisDrone dataset.
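    To make the abstract's two modules concrete, the sketch below illustrates the general pattern described: a binary mask keeps only tokens scored as significant, and window-local self-attention is restricted to the kept tokens so that dropped background tokens contribute no context. This is a minimal PyTorch illustration, not the authors' implementation; the `score_head` significance predictor, the 0.5 keep-threshold, and the fixed per-window token grouping are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedWindowSelfAttention(nn.Module):
    """Window-local self-attention over tokens kept by a binary mask (illustrative sketch)."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Hypothetical significance head: predicts a keep-probability per token.
        self.score_head = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_windows, tokens_per_window, dim), tokens already grouped by window.
        B, N, C = x.shape
        keep = torch.sigmoid(self.score_head(x)) > 0.5        # (B, N, 1) binary mask

        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                  # each: (B, heads, N, head_dim)

        attn = (q @ k.transpose(-2, -1)) * self.scale         # (B, heads, N, N)
        # Dropped tokens are excluded as attention keys, so kept tokens receive
        # no context from them (the "isolating contextual leakage" idea).
        key_mask = keep.squeeze(-1)[:, None, None, :]         # (B, 1, 1, N)
        attn = attn.masked_fill(~key_mask, float("-inf"))
        attn = torch.nan_to_num(F.softmax(attn, dim=-1))      # all-masked rows -> zeros

        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out) * keep.to(x.dtype)              # zero features of dropped tokens

# Example: 8 windows of 7x7 tokens with 96-dim embeddings.
attn_block = MaskedWindowSelfAttention(dim=96)
y = attn_block(torch.randn(8, 49, 96))                        # -> (8, 49, 96)
```

    In the full method, sparsification is described as collaborative across support windows and the window attention is adaptive, so the hard threshold and fixed grouping above stand in for whatever selection rule the binary mask strategy actually learns.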

    References
    [1] Wang, Z., Xue, X. (2015). A review on the use of unmanned aerial vehicles for military operations. Journal of Military Research, 23(1), 15-28.
    [2] Zhang, C., Kovacs, J. M. (2012). The application of small unmanned aerial systems for precision agriculture: a review. Precision Agriculture, 13(6), 693-712.
    [3] Adams, S. M., Friedland, C. J. (2011). A survey of unmanned aerial vehicle (UAV) usage for imagery collection in disaster research and management. Proceedings of the 9th International Workshop on Remote Sensing for Disaster Response.
    [4] Lin, Y., Zhang, Y. (2014). Applications of UAV systems in urban planning and development. International Journal of Urban Planning, 19(3), 233-245.
    [5] LeCun, Y., Bengio, Y., Hinton, G. (2015). Deep learning. Nature, 521(7553), 436-444.
    [6] Redmon, J., Farhadi, A. (2018). YOLOv3: An Incremental Improvement. arXiv preprint arXiv:1804.02767.
    [7] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S. (2020). End-to-End Object Detection with Transformers. European Conference on Computer Vision (ECCV), 2020, pp. 213-229.
    [8] Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K. Q. (2017). Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4700-4708).
    [9] Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (pp. 248-255). IEEE.
    [10] Redmon, J., Farhadi, A. (2018). YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767.
    [11] Yu J, Gao H, Chen Y, et al. Deep object detector with attentional spatiotemporal LSTM for space human–robot interaction[J]. IEEE Transactions on human-machine systems, 2022, 52(4): 784-793.
    [12] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30, 5998-6008.
    [13] Chen, X., Fan, H., Girshick, R., He, K. (2020). Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297.
    [14] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S. (2020). End-to-end object detection with Transformers. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 213-229.
    [15] Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J. (2020). Deformable DETR: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159.
    [16] Wang, H., Chen, Q., Yuan, L., Guo, Z., Zhang, L. (2023). EdgeFormer: Bringing efficient transformers to the edge. In International Conference on Learning Representations (ICLR).
    [17] Yang, C., Huang, Z., Wang, N. (2022). QueryDet: Cascaded sparse query for accelerating high-resolution small object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
    [18] Du, B., Huang, Y., Chen, J., Huang, D. (2023). Adaptive sparse convolutional networks with global context enhancement for faster object detection on drone images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 13435-13444).
    [19] Bao F, Nie S, Xue K, et al. All are worth words: A ViT backbone for diffusion models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023: 22669-22679.
    [20] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., ... Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 10012-10022.
    [21] Zhang, H., Wei, J., Xiao, Y., Yuan, H., Lu, W. (2023). Lite DETR: A lightweight transformer for object detection. arXiv preprint arXiv:2303.12345.
    [22] Dong, X., Bao, J., Chen, D., Zhang, W., Yu, N., Yuan, L., Chen, B. (2021). CSWin transformer: A general vision transformer backbone with cross-shaped windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 10013-10023.
    [23] Wang, W., Xie, E., Li, X., Fan, D. P., Song, K., Liang, D., ... Lu, T. (2022). PVTv2: Improved Baselines with Pyramid Vision Transformer. arXiv preprint arXiv:2106.13797.
    [24] Wang, W., Xie, E., Li, X., Fan, D. P., Song, K., Liang, D., ... Lu, T. (2021). Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. arXiv preprint arXiv:2102.12122.
    [25] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[J]. Advances in neural information processing systems, 2017, 30.
    [26] Hyeon-Woo N, Yu-Ji K, Heo B, et al. Scratching visual transformer's back with uniform attention[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023: 5807-5818.
    [27] Hsieh M R, Lin Y L, Hsu W H. Drone-based object counting by spatially regularized regional proposal network[C]//Proceedings of the IEEE international conference on computer vision. 2017: 4145-4153.
    [28] Du D, Zhu P, Wen L, et al. VisDrone-DET2019: The vision meets drone object detection in image challenge results[C]//Proceedings of the IEEE/CVF international conference on computer vision workshops. 2019: 0-0.
    [29] Mo N, Yan L. Oriented vehicle detection in high-resolution remote sensing images based on feature amplification and category balance by oversampling data augmentation[J]. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 2020, 43: 153-159.
    [30] Tang T, Zhou S, Deng Z, et al. Arbitrary-oriented vehicle detection in aerial imagery with single convolutional neural networks[J]. Remote Sensing, 2017, 9(11): 1170.
    [31] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2015: 3431-3440.
    [32] Badrinarayanan V, Kendall A, Cipolla R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation[J]. IEEE transactions on pattern analysis and machine intelligence, 2017, 39(12): 2481-2495.
    [33] Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks for biomedical image segmentation[C]//Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18. Springer International Publishing, 2015: 234-241.
    [34] Yu J, Gao H, Sun J, et al. Spatial cognition-driven deep learning for car detection in unmanned aerial vehicle imagery[J]. IEEE Transactions on Cognitive and Developmental Systems, 2021, 14(4): 1574-1583.
    [35] Yang C, Huang Z, Wang N. QueryDet: Cascaded sparse query for accelerating high-resolution small object detection[C]//Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition. 2022: 13668-13677.
    [36] Meethal A, Granger E, Pedersoli M. Cascaded zoom-in detector for high resolution aerial images[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023: 2046-2055.
    [37] Nguyen D L, Vo X T, Priadana A, et al. Car Detector Based on YOLOv5 for Parking Management[C]//Conference on Information Technology and its Applications. Cham: Springer Nature Switzerland, 2023: 102-113.
    [38] Zhu C, He Y, Savvides M. Feature selective anchor-free module for single-shot object detection[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019: 840-849.
    [39] Zhang H, Wang Y, Dayoub F, et al. VarifocalNet: An IoU-aware dense object detector[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 8514-8523.
    [40] Feng C, Zhong Y, Gao Y, et al. TOOD: Task-aligned one-stage object detection[C]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE Computer Society, 2021: 3490-3499.
    [41] Chen Z, Yang C, Li Q, et al. Disentangle your dense object detector[C]//Proceedings of the 29th ACM international conference on multimedia. 2021: 4939-4948.
    [42] Jocher, G.; Chaurasia, A.; Qiu, J. YOLO by Ultralytics. 2023. Available online: https://github.com/ultralytics/ultralytics/blob/main/CITATION.cff (accessed on 1 January 2023).
    [43] Wang X, Yao F, Li A, et al. DroneNet: Rescue Drone-View Object Detection[J]. Drones, 2023, 7(7): 441.
    [44] Wei Z, Duan C, Song X, et al. Amrnet: Chips augmentation in aerial images object detection[J]. arXiv preprint arXiv:2009.07168, 2020.
    [45] Zhang H, Li F, Liu S, et al. DINO: DETR with improved denoising anchor boxes for end-to-end object detection[J]. arXiv preprint arXiv:2203.03605, 2022.
History
  • Received: July 27, 2024
  • Revised: September 10, 2024
  • Accepted: October 08, 2024