Abstract: The current mainstream methods for loop closure detection in visual simultaneous localization and mapping (SLAM) are based on the bag-of-words (BoW) model. However, traditional BoW-based approaches are strongly affected by changes in scene appearance, which leads to poor robustness and low precision. To improve the precision and robustness of loop closure detection, a novel approach based on a stacked assorted auto-encoder (SAAE) is proposed. A traditional stacked auto-encoder is built from multiple layers of the same type of auto-encoder; although it extracts scene-image features better than the visual BoW model, its output features are high-dimensional. The proposed SAAE is composed of multiple layers of denoising, convolutional, and sparse auto-encoders: the denoising auto-encoder improves the robustness of image features, the convolutional auto-encoder preserves the spatial information of the image, and the sparse auto-encoder reduces the dimensionality of image features. The SAAE can extract low- to high-dimensional features of the scene image while preserving its local spatial characteristics, which makes the output features more robust. The performance of SAAE is evaluated in a comparative study on the New College and City Centre datasets. The proposed method effectively improves the precision and robustness of loop closure detection in visual SLAM.
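
For illustration only, the following is a minimal sketch of how the three auto-encoder types named in the abstract could be stacked into a single encoder, assuming a PyTorch implementation; the input size, layer widths, noise level, and code dimension are illustrative assumptions rather than the authors' exact architecture.

```python
# Hypothetical sketch of the SAAE idea: denoising corruption -> convolutional
# encoding (preserves spatial structure) -> sparse, low-dimensional code.
import torch
import torch.nn as nn

class SAAEEncoderSketch(nn.Module):
    def __init__(self, noise_std=0.1, code_dim=128):
        super().__init__()
        self.noise_std = noise_std  # denoising stage: corrupt the input image
        # convolutional stage: keeps local spatial information of the scene image
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        # sparse stage: compress to a compact code (sparsity would be encouraged
        # during training, e.g. with an L1 penalty on the code)
        self.fc = nn.Linear(32 * 16 * 16, code_dim)

    def forward(self, x):  # x: (B, 1, 64, 64) grayscale scene image (assumed size)
        if self.training:
            x = x + self.noise_std * torch.randn_like(x)  # denoising corruption
        h = self.conv(x).flatten(1)
        return torch.relu(self.fc(h))  # compact descriptor for loop closure

# Loop closure candidates could then be scored by, e.g., cosine similarity
# between descriptors of the current frame and previously visited frames.
```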