Research on Self-supervised Video Representation Algorithm Based on Playing Rate Prediction
靳巾,张育嘉,徐叙远,刘孟洋
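Abstract:
Spatiotemporal feature learning is essential for unsupervised video representation, and self-supervised methods built on pretext tasks have proven to be one effective approach. Among these, the playback rate prediction pretext task learns temporal features without supervision and has been widely discussed in recent years. The task has three shortcomings, however. First, it considers the self-supervised label of only a single sample and ignores differences in motion frequency across targets. Second, its loss function weights all labels equally and ignores how far each predicted label lies from the ground truth. Third, conventional playback rate prediction uses only a single classification layer for prediction, which affects the overall quality of the video representation. To address these three problems, this paper proposes an improved playback rate pretext task. During training, the method introduces control (reference) samples, uses the Earth Mover's Distance (EMD) to optimize the loss between predictions and the ground truth, and uses a deeper neural network prediction layer to reduce the influence of the prediction task on the video representation. Experiments on the public datasets UCF-101 and HMDB-51 compare the performance gain of the proposed method over conventional methods. The results show that the improved playback rate pretext task yields good video representations.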
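The two ideas the abstract relies on, labeling a clip by its playback rate and penalizing predictions by their distance from the true rate, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the candidate rates in `PLAYBACK_RATES`, the function names `sample_clip_indices`, `softmax`, and `emd_loss`, and the choice of the one-dimensional EMD between a predicted distribution and a one-hot target are all hypothetical, since the abstract gives no formulas.

```python
import math
import random

# Assumed candidate playback rates; the abstract does not list the actual set.
PLAYBACK_RATES = [1, 2, 3, 4]


def sample_clip_indices(num_frames, clip_len, rate, rng=random):
    """Pick clip_len frame indices spaced `rate` frames apart.

    The rate used for sampling serves as the self-supervised label.
    """
    span = (clip_len - 1) * rate + 1  # frames covered by the clip
    if span > num_frames:
        raise ValueError("video too short for this clip length and rate")
    start = rng.randrange(num_frames - span + 1)
    return [start + i * rate for i in range(clip_len)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def emd_loss(probs, true_index):
    """1-D EMD between a predicted distribution and a one-hot target.

    For ordered classes with unit spacing, the EMD equals the sum of
    absolute differences between the two cumulative distributions.
    """
    cdf_pred = 0.0
    loss = 0.0
    for k, p in enumerate(probs):
        cdf_pred += p
        cdf_true = 1.0 if k >= true_index else 0.0
        loss += abs(cdf_pred - cdf_true)
    return loss


# Usage: a confident prediction one class away from the truth costs less
# than one three classes away; cross-entropy would not separate the two.
near = emd_loss([0.0, 1.0, 0.0, 0.0], true_index=0)
far = emd_loss([0.0, 0.0, 0.0, 1.0], true_index=0)
```

Unlike a loss that weights every wrong label equally, this distance grows with the gap between the predicted and true rate classes, which is the property the abstract attributes to its EMD-based loss.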
Keywords: video representation; self-supervised learning; pretext task; neural network; action recognition
Authors: 靳巾, 张育嘉, 徐叙远, 刘孟洋
DOI: 10.20064/j.cnki.2095-347X.2023.02.002