Abstract:Current mainstream unsupervised video object segmentation (UVOS) approaches typically incorporate optical flow as motion information to locate the primary objects in coherent video frames. However, they fuse appearance and motion information without evaluating the quality of the optical flow. When poor-quality optical flow is used for the interaction with the appearance information, it introduces significant noise and leads to a decline in overall performance. To alleviate this issue, we first employ a quality evaluation module (QEM) to evaluate the optical flow. Then, we select high-quality optical flow as motion cues to fuse with the appearance information, which can prevent poor-quality optical flow from diverting the network’s attention. Moreover, we design an appearance-guided fusion module (AGFM) to better integrate appearance and motion information. Extensive experiments on several widely utilized datasets, including DAVIS-16, FBMS-59, and YouTube-Objects, demonstrate that the proposed method outperforms existing methods.