Abstract: Video summarization aims to select valuable clips so that videos can be browsed efficiently. Previous approaches typically focus on aggregating temporal features while overlooking the potential role of visual representations in summarizing videos. In this paper, we present a global difference-aware network (GDANet) that exploits feature differences between individual frames and the whole video as guidance to enhance visual features. First, a difference optimization module (DOM) is devised to enhance the discriminability of visual features, which in turn helps aggregate temporal cues more accurately. Second, a dual-scale attention module (DSAM) is introduced to capture informative contextual information at two scales. Finally, we design an adaptive feature fusion module (AFFM) that lets the network adaptively learn context representations and fuse features effectively. Experiments on benchmark datasets demonstrate the effectiveness of the proposed framework.
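To make the described pipeline concrete, the sketch below gives a minimal, hypothetical PyTorch rendering of the three modules. Only the module names (DOM, DSAM, AFFM) and their stated roles come from the abstract; every internal design choice here, including the mean-pooled video-level feature, the difference-guided gating, the window size for local attention, the gated fusion, and the scoring head, is an assumption for illustration and may differ from the paper's actual architecture.

```python
import torch
import torch.nn as nn

class DifferenceOptimizationModule(nn.Module):
    """Assumed DOM: gates each frame feature by its difference from a
    video-level (global mean) feature, amplifying distinctive frames."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                  # x: (T, dim)
        diff = x - x.mean(dim=0, keepdim=True)             # frame-vs-video difference
        return x + torch.sigmoid(self.proj(diff)) * x      # difference-guided gating

class DualScaleAttentionModule(nn.Module):
    """Assumed DSAM: self-attention at two temporal scales -- within short
    windows (local) and over the full sequence (global)."""
    def __init__(self, dim, heads=4, window=8):
        super().__init__()
        self.window = window
        self.local = nn.MultiheadAttention(dim, heads)
        self.glob = nn.MultiheadAttention(dim, heads)

    def forward(self, x):                                  # x: (T, dim)
        T, d = x.shape
        pad = (-T) % self.window                           # zero-pad so T divides evenly
        xp = torch.cat([x, x.new_zeros(pad, d)]) if pad else x
        w = xp.view(-1, self.window, d).transpose(0, 1)    # (window, n_windows, d)
        local, _ = self.local(w, w, w)                     # attention inside each window
        local = local.transpose(0, 1).reshape(-1, d)[:T]
        xg = x.unsqueeze(1)                                # (T, 1, d)
        glob, _ = self.glob(xg, xg, xg)                    # attention over all frames
        return local, glob.squeeze(1)

class AdaptiveFeatureFusionModule(nn.Module):
    """Assumed AFFM: learns a per-frame gate to fuse the two context
    streams, then predicts a frame-importance score."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.score = nn.Linear(dim, 1)

    def forward(self, local, glob):
        g = self.gate(torch.cat([local, glob], dim=-1))
        fused = g * local + (1 - g) * glob                 # adaptive fusion
        return torch.sigmoid(self.score(fused)).squeeze(-1)

# Usage: per-frame importance scores for a 120-frame video with 128-d features.
frames = torch.randn(120, 128)
dom, dsam = DifferenceOptimizationModule(128), DualScaleAttentionModule(128)
affm = AdaptiveFeatureFusionModule(128)
scores = affm(*dsam(dom(frames)))                          # shape: (120,)
```

The difference-guided gating in DOM mirrors the abstract's idea of using frame-vs-video feature differences to enhance visual features; the rest is one plausible instantiation, not the authors' method.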