2SWUNet: Small Window SWinUNet based on Transformer for Building Extraction from High-Resolution Remote Sensing Images

Author:
yujiamin (Zhejiang University of Technology)
chansixian (Zhejiang University of Technology)
leiyanjing (Zhejiang University of Technology)
wuwei (Zhejiang University of Technology)
wangyuan (Zhejiang University of Technology)
zhouxiaolong (Quzhou University)

Affiliation:1.Zhejiang University of Technology;2.Quzhou University
Fund Project:The National Natural Science Foundation of China (General Program, Key Program, Major Research Plan)


Abstract:

Models designed to capture long-range dependencies often exhibit degraded performance when transferred to remote sensing images. Vision Transformers (ViTs) are a new paradigm in computer vision that uses multi-head self-attention rather than convolution as the main computational module, which gives them global modeling capabilities. However, their performance on small datasets is usually far inferior to that of convolutional neural networks (CNNs). In this work, we propose a Small Window SWinUNet (2SWUNet) for building extraction from high-resolution remote sensing images. Firstly, 2SWUNet is based on the Swin Transformer, with a fully symmetric encoder-decoder U-shaped architecture. Secondly, to construct a reasonable U-shaped architecture for this task, different forms of patch expansion are explored to simulate up-sampling operations and recover feature map resolution. Then, a small window-based multi-head self-attention (W-MSA) is designed to reduce the computational and memory burden; it is also better suited to the characteristics of remote sensing images. Meanwhile, a pre-training mechanism is proposed to make up for the lack of decoder parameters. Finally, comparison experiments with other mainstream CNNs and ViTs validate the superiority of the proposed model. In addition, by visualizing the effective receptive field, we find that local information is more conducive to prediction in remote sensing images.
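
To illustrate why a smaller attention window reduces the computational and memory burden, the following is a minimal sketch based on the complexity formulas given in the original Swin Transformer paper, not on the 2SWUNet implementation itself. The function names, the 56 x 56 feature map, the channel width of 96, and the window sizes 7 and 4 are all illustrative assumptions; the abstract does not state the window size that 2SWUNet actually uses.

```python
def msa_flops(h, w, c):
    """Cost of global multi-head self-attention on an h x w feature map
    with c channels, per the Swin Transformer paper: 4hwC^2 + 2(hw)^2 C."""
    return 4 * h * w * c ** 2 + 2 * (h * w) ** 2 * c


def wmsa_flops(h, w, c, m):
    """Cost of window-based self-attention (W-MSA) with non-overlapping
    m x m windows, per the same paper: 4hwC^2 + 2 M^2 hw C."""
    return 4 * h * w * c ** 2 + 2 * m ** 2 * h * w * c


def attention_entries(h, w, m):
    """Attention-score entries stored per head: one (m*m) x (m*m) matrix
    for each of the (h/m) * (w/m) windows."""
    if h % m or w % m:
        raise ValueError("feature map must be divisible by the window size")
    return (h // m) * (w // m) * (m * m) ** 2


if __name__ == "__main__":
    h = w = 56   # assumed feature-map size (hypothetical)
    c = 96       # assumed channel width (hypothetical)
    print("global MSA FLOPs:", msa_flops(h, w, c))
    for m in (7, 4):  # assumed window sizes (hypothetical)
        print(f"window {m}: FLOPs={wmsa_flops(h, w, c, m)}, "
              f"attention entries per head={attention_entries(h, w, m)}")
```

The projection term 4hwC^2 does not depend on the window size, so the saving comes entirely from the attention term, which grows with M^2 per token rather than with the total number of tokens.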

Key words:convolutional neural networks; semantic segmentation


History

Received: August 29, 2023
Revised: October 13, 2023
Adopted: November 07, 2023
