Abstract: Recently, research on transformers in deep learning has made tremendous progress in natural language processing and computer vision. Owing to their inherent permutation invariance, transformers offer a natural solution to the unordered-point problem that deep learning faces in point cloud object detection. In this paper, a transformer-based two-stage LiDAR 3D object detection framework is presented, named the Point-Voxel Dual Transformer (PV-DT3D). In the proposed PV-DT3D, point-voxel fusion features are used for proposal refinement. Specifically, keypoints are sampled from the entire point cloud scene and used to encode representative scene features via a proposal-aware voxel set abstraction module. Then, according to the proposals generated by a region proposal network (RPN), the encoded keypoints inside each proposal are fed into a dual transformer encoder-decoder architecture. For the first time in 3D object detection, the proposed PV-DT3D takes advantage of both a point-wise transformer and a channel-wise transformer to capture contextual information along the spatial and channel dimensions. Experiments on the highly competitive KITTI 3D car detection leaderboard show that PV-DT3D achieves superior detection accuracy among state-of-the-art point-voxel-based methods.
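To make the contrast between the two attention styles named in the abstract concrete, the following minimal PyTorch sketch shows the general idea: point-wise attention treats the N keypoints as tokens (an N x N attention map over spatial positions), while channel-wise attention treats the C feature channels as tokens (a C x C map). This is an illustration under stated assumptions, not the authors' PV-DT3D implementation; all class and variable names are hypothetical.

```python
# Minimal sketch (assumption, not the paper's code) of point-wise vs.
# channel-wise self-attention over proposal keypoint features.
import torch
import torch.nn as nn

class PointWiseAttention(nn.Module):
    """Self-attention over the N keypoints (spatial dimension)."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) -- B proposals, N keypoints, C channels
        out, _ = self.attn(x, x, x)          # attention map is N x N per head
        return out

class ChannelWiseAttention(nn.Module):
    """Self-attention over the C channels: tokens are channels, not points."""
    def __init__(self, num_points: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(num_points, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        xt = x.transpose(1, 2)               # (B, C, N): each channel is a token
        out, _ = self.attn(xt, xt, xt)       # attention map is C x C per head
        return out.transpose(1, 2)           # back to (B, N, C)

if __name__ == "__main__":
    feats = torch.randn(2, 128, 64)          # 2 proposals, 128 keypoints, 64 channels
    spatial = PointWiseAttention(dim=64)
    channel = ChannelWiseAttention(num_points=128)
    fused = spatial(feats) + channel(feats)  # toy fusion of the two branches
    print(fused.shape)                       # torch.Size([2, 128, 64])
```

The toy additive fusion in the last lines is only a placeholder; how PV-DT3D actually combines the two branches within its encoder-decoder is described in the body of the paper.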