Abstract: Existing multi-view three-dimensional (3D) reconstruction methods can only capture a single type of feature from each input view, failing to obtain the fine-grained semantics needed to reconstruct complex shapes. They also rarely explore the semantic association between input views, leading to coarse 3D shapes. To address these challenges, we propose a semantics-aware transformer (SATF) for 3D reconstruction. It is composed of two parallel view transformer encoders and a point cloud transformer decoder; it takes two red, green and blue (RGB) images as input and outputs a dense point cloud with rich details. Each view transformer encoder learns multi-level features, which facilitates characterizing the fine-grained semantics of its input view. The point cloud transformer decoder explores a semantically associated feature by aligning the semantics of the two input views, thereby describing the semantic association between them. It then generates a sparse point cloud from this semantically associated feature. Finally, the decoder enriches the sparse point cloud to produce a dense point cloud with richer details. Extensive experiments on the ShapeNet dataset show that SATF outperforms state-of-the-art methods.
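
To make the described pipeline concrete, the following is a minimal sketch of the two-encoder, one-decoder structure, assuming a PyTorch-style implementation. All class names, layer sizes, point counts, and the specific alignment and upsampling choices are hypothetical illustrations, not the authors' actual configuration.

```python
# Hypothetical sketch of the SATF pipeline: two parallel view transformer
# encoders followed by a point cloud transformer decoder. Names and
# hyperparameters are assumptions for illustration only.
import torch
import torch.nn as nn


class ViewTransformerEncoder(nn.Module):
    """Encodes one RGB view into a sequence of per-patch features."""

    def __init__(self, patch_size=16, dim=256, depth=4, heads=8):
        super().__init__()
        # Patch embedding: split the image into patches and project each to `dim`.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, img):                          # img: (B, 3, H, W)
        tokens = self.patch_embed(img)               # (B, dim, H/ps, W/ps)
        tokens = tokens.flatten(2).transpose(1, 2)   # (B, N_patches, dim)
        return self.encoder(tokens)                  # per-view feature sequence


class PointCloudTransformerDecoder(nn.Module):
    """Aligns the two view features, generates a sparse point cloud,
    then enriches it into a dense point cloud."""

    def __init__(self, dim=256, n_sparse=256, upsample=8, heads=8, depth=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_sparse, dim))  # learnable point queries
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=depth)
        self.to_sparse = nn.Linear(dim, 3)                # sparse xyz coordinates
        self.to_dense = nn.Linear(dim, 3 * upsample)      # per-point offsets for densification
        self.upsample = upsample

    def forward(self, feat_a, feat_b):
        # Cross-view association: concatenate both view features as decoder
        # memory so the point queries attend to a jointly aligned feature.
        memory = torch.cat([feat_a, feat_b], dim=1)
        q = self.queries.unsqueeze(0).expand(feat_a.size(0), -1, -1)
        point_feat = self.decoder(q, memory)              # (B, n_sparse, dim)
        sparse = self.to_sparse(point_feat)               # (B, n_sparse, 3)
        # Enrich each sparse point with `upsample` local offsets -> dense cloud.
        offsets = self.to_dense(point_feat).view(sparse.size(0), -1, self.upsample, 3)
        dense = (sparse.unsqueeze(2) + offsets).reshape(sparse.size(0), -1, 3)
        return sparse, dense


class SATF(nn.Module):
    """Two parallel view encoders feeding one point cloud decoder."""

    def __init__(self):
        super().__init__()
        self.enc_a = ViewTransformerEncoder()
        self.enc_b = ViewTransformerEncoder()
        self.dec = PointCloudTransformerDecoder()

    def forward(self, view_a, view_b):
        return self.dec(self.enc_a(view_a), self.enc_b(view_b))


if __name__ == "__main__":
    model = SATF()
    a = torch.randn(2, 3, 224, 224)                  # first RGB view per sample
    b = torch.randn(2, 3, 224, 224)                  # second RGB view per sample
    sparse_pc, dense_pc = model(a, b)
    print(sparse_pc.shape, dense_pc.shape)           # (2, 256, 3) (2, 2048, 3)
```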