Real-time spatio-temporal action localization via learning motion representation

Yuanzhong Liu, Zhigang Tu, Liyu Lin, Xing Xie, Qianqing Qin

Abstract: Most state-of-the-art spatio-temporal (S-T) action localization methods explicitly use optical flow as auxiliary motion information. Although combining optical flow with RGB significantly improves performance, optical flow estimation incurs a large computational cost and prevents the whole network from being trained end-to-end. These shortcomings hinder interactive fusion between motion information and RGB information, and greatly limit the real-world applications of such methods. In this paper, we explore better ways to use motion information in a unified, end-to-end trainable network architecture. First, we use knowledge distillation to enable the 3D-Convolutional branch to learn motion information from RGB inputs. Second, we propose a novel motion cue, the short-range-motion (SRM) module, which enhances the 2D-Convolutional branch so that it learns both RGB information and dynamic motion information. With this strategy, flow computation at test time is avoided. Finally, we apply our methods to learn powerful RGB-motion representations for action classification and localization. Experimental results show that our method outperforms the state of the art on the J-HMDB-21 and UCF101-24 benchmarks, with improvements of 7% and 3%, respectively.
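
To make the distillation idea concrete, the following is a minimal sketch and not the training objective used in this paper. It assumes a generic feature-matching formulation: a student branch that sees only RGB is trained so that its features match those of a teacher branch that was given optical flow, in addition to the usual classification loss. The function names, the mean-squared-error matching term, the weight `lam`, and the random arrays standing in for network outputs are all illustrative assumptions.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def cross_entropy(logits, labels):
    """Mean cross-entropy between class logits and integer labels."""
    probs = softmax(logits)
    n = logits.shape[0]
    return float(-np.log(probs[np.arange(n), labels] + 1e-12).mean())


def distillation_loss(student_feat, teacher_feat, logits, labels, lam=1.0):
    """Classification loss plus a feature-matching (distillation) term.

    student_feat: features from the RGB-only branch (hypothetical).
    teacher_feat: features from a flow-based teacher, treated as fixed targets.
    lam:          assumed weight of the matching term.
    """
    feat_term = float(np.mean((student_feat - teacher_feat) ** 2))
    cls_term = cross_entropy(logits, labels)
    return cls_term + lam * feat_term, cls_term, feat_term


# Random stand-ins for network outputs: 4 clips, 16-dim features, 21 classes.
rng = np.random.default_rng(0)
teacher_feat = rng.normal(size=(4, 16))
student_feat = teacher_feat + 0.1 * rng.normal(size=(4, 16))
logits = rng.normal(size=(4, 21))
labels = np.array([0, 5, 20, 3])

total, cls_term, feat_term = distillation_loss(
    student_feat, teacher_feat, logits, labels
)
print(f"total={total:.4f} cls={cls_term:.4f} feat={feat_term:.4f}")
```

In a setup of this kind the teacher is needed only during training, so at test time the student runs on RGB frames alone, which is consistent with the claim that flow computation is avoided at test time.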