Abstract: Research on human motion prediction has made significant progress due to its importance in the development of various artificial intelligence applications. However, the prediction procedure often suffers from undesirable discontinuities and long-term error accumulation, which strongly limit its accuracy. To address these issues, we propose a robust human motion prediction method via integration of spatial and temporal cues (RISTC). The method captures sufficient spatio-temporal correlation in the observed sequence of human poses by means of a spatio-temporal mixed feature extractor (MFE). Within multi-layer MFEs, channel-graph united attention blocks extract augmented spatial features of the human poses along the channel and spatial dimensions. In addition, multi-scale temporal blocks are designed to effectively capture complicated and highly dynamic temporal information. Experiments on the Human3.6M and CMU Mocap datasets show that the proposed network yields higher prediction accuracy than state-of-the-art methods.
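To give a rough intuition for the multi-scale temporal idea mentioned above, the toy sketch below extracts features from a pose sequence at several temporal window sizes and concatenates them. This is only an illustrative assumption, not the paper's actual block: the function names, window sizes, and the use of simple moving averages (rather than learned temporal convolutions) are all hypothetical.

```python
import numpy as np

def moving_average(x, k):
    # x: (T, J) pose sequence; average over a temporal window of size k.
    # Edge padding keeps the output length equal to the input length.
    T, J = x.shape
    pad = k // 2
    xp = np.pad(x, ((pad, k - 1 - pad), (0, 0)), mode="edge")
    return np.stack([xp[t:t + k].mean(axis=0) for t in range(T)])

def multi_scale_temporal_features(x, scales=(1, 3, 5)):
    # Concatenate features computed at several temporal scales,
    # so short- and long-range dynamics are represented jointly.
    return np.concatenate([moving_average(x, k) for k in scales], axis=1)

poses = np.random.rand(10, 6)   # toy data: 10 frames, 6 joint coordinates
feats = multi_scale_temporal_features(poses)
print(feats.shape)  # (10, 18): one feature set per scale
```

In the actual network one would expect learned temporal convolutions (or similar parametric filters) at each scale instead of fixed averages; the sketch only shows how operating at multiple window sizes captures both fast and slow motion dynamics.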