Abstract: Although deep learning methods have been widely applied to SLAM visual odometry over the past decade with impressive improvements, their accuracy remains limited in complex dynamic environments. In this paper, we use a composite mask-based generative adversarial network to predict camera motion and binocular depth maps. Specifically, a perceptual generator is first designed to obtain the corresponding disparity map and optical flow between two neighboring frames. Then, an iterative pose refinement strategy is proposed to improve the accuracy of pose estimation. Finally, a composite mask is embedded in the discriminator to sense structural deformations in the synthesized virtual image, thus encouraging the generator to learn additional structure-level information and further improve pose estimation accuracy. Detailed quantitative and qualitative evaluations on the KITTI dataset show that the proposed framework outperforms existing conventional, supervised-learning, and unsupervised deep VO methods, yielding better results in both pose estimation and depth estimation.