STM: SpatioTemporal and Motion Encoding for Action Recognition
Boyuan Jiang, MengMeng Wang, Weihao Gan, Wei Wu, Junjie Yan, from ZJU and SenseTime (ICCV 2019)
Spatiotemporal and motion features are two complementary and crucial cues in a video, so extracting them effectively is central to action recognition.
1. The authors argue that the optical-flow stream cannot be referred to as a temporal stream, since it only represents the motion between neighboring frames.
2. Using C3D to extract spatiotemporal features achieves superior performance, but rapidly increases the computation cost.
3. Optical flow is complementary to spatiotemporal features, but expensive to compute.
3. Methods (including framework)
This paper proposes CSTM (Channel-wise SpatioTemporal Module, which extracts spatiotemporal features) and CMM (Channel-wise Motion Module, which extracts motion information).
In CSTM, the input is first reshaped so that the temporal axis T becomes the convolution axis, then a channel-wise 1D convolution (kernel size 3) is applied along T to fuse temporal information. The result is reshaped back and passed through an ordinary 3x3 2D convolution.
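The channel-wise temporal convolution at the heart of CSTM can be sketched in NumPy. The shapes, function name, and the explicit loop over kernel taps are my choices, and the subsequent 3x3 spatial convolution is omitted; this is only meant to show that each channel gets its own 1D kernel along T:

```python
import numpy as np

def channelwise_temporal_conv(x, w):
    """Channel-wise 1D convolution over the temporal axis.

    x: (N, T, C, H, W) feature map; w: (C, 3) per-channel kernels.
    Each channel c is convolved with its own kernel w[c] along T,
    zero-padded so the output keeps T steps.
    """
    n, t, c, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (0, 0), (0, 0), (0, 0)))
    out = np.zeros_like(x)
    for k in range(3):  # the three kernel taps: t-1, t, t+1
        # w[:, k] broadcasts across batch and spatial dims
        out += xp[:, k:k + t] * w[:, k].reshape(1, 1, c, 1, 1)
    return out

# toy input: batch 2, 8 frames, 4 channels, 6x6 spatial
x = np.random.randn(2, 8, 4, 6, 6)
w = np.random.randn(4, 3)
y = channelwise_temporal_conv(x, w)
print(y.shape)  # (2, 8, 4, 6, 6)
```

In a real implementation this is just a grouped `Conv1d` with `groups = C` after the reshape, which is where the module's low cost comes from.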
In CMM, a 1x1 convolution first reduces the channels. Then, for each time step t, a 3x3 2D convolution transforms frame t+1 and frame t is subtracted from it, i.e. M_t = Conv(F_{t+1}) - F_t. Sliding along the whole time axis yields T-1 motion maps, which are concatenated (a zero map pads the last step so the temporal length stays T) and passed through a 1x1 2D convolution to restore the channels.
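A minimal NumPy sketch of the motion step, following the paper's M_t = Conv(F_{t+1}) - F_t. The learned 3x3 convolution is replaced by an identity stand-in here, and the function name and shapes are my own:

```python
import numpy as np

def cmm_motion(x):
    """Sketch of CMM's motion step on a channel-reduced feature map.

    x: (N, T, C, H, W). For each t, frame t+1 is transformed
    (identity here, standing in for the learned 3x3 2D conv) and
    frame t is subtracted; a zero map pads the last step so the
    temporal length stays T.
    """
    transformed = x[:, 1:]            # stand-in for Conv(F_{t+1})
    motion = transformed - x[:, :-1]  # M_t = Conv(F_{t+1}) - F_t
    pad = np.zeros_like(x[:, :1])     # zero map for the last step
    return np.concatenate([motion, pad], axis=1)

x = np.random.randn(2, 8, 4, 6, 6)
m = cmm_motion(x)
print(m.shape)  # (2, 8, 4, 6, 6)
```

Note how this approximates optical flow at the feature level with only a subtraction and a small convolution, which is exactly the cost argument the paper makes against explicit flow.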
After proposing CSTM and CMM, the authors combine them into a so-called STM block, which encodes spatiotemporal and motion features simultaneously. The STM block is dead simple: a CSTM and a CMM between two 1x1 2D convolutions, wrapped by an identity shortcut.
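The block's wiring can be sketched as follows. I am assuming the two branches run in parallel and are fused by element-wise summation (consistent with the fusion question in the comments below); all learned layers are passed in as stand-in callables:

```python
import numpy as np

def stm_block(x, conv_reduce, cstm, cmm, conv_expand):
    """Residual STM block sketch: 1x1 reduce -> CSTM and CMM in
    parallel, fused by element-wise sum -> 1x1 expand -> identity
    shortcut. The conv/module arguments stand in for learned layers.
    """
    y = conv_reduce(x)
    y = cstm(y) + cmm(y)      # sum fusion of the two branches
    return x + conv_expand(y)  # residual connection

# toy demo with identity stand-ins for every learned layer
x = np.ones((2, 8, 4, 6, 6))
ident = lambda t: t
out = stm_block(x, ident, ident, ident, ident)
print(out.shape)  # (2, 8, 4, 6, 6)
```

Because the block keeps the residual form of a ResNet bottleneck, it can replace ordinary residual blocks in a 2D backbone without changing the network's overall shape.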
4. Experiments
Datasets: Something-Something v1 & v2, Jester, Kinetics-400, UCF-101, HMDB-51.
Metrics: top-1 & top-5 accuracy.
Backbone: TSN with ResNet-50.
5. Pros. & Cons.
Pros: outperforms traditional approaches on temporal-related datasets.
Cons: higher computational cost (than vanilla ResNet-50) but no significant improvement on scene-related datasets.
6. Comments
Some comments came up while reading this paper, though they are perhaps not closely related to it:
1. Why does fusing the branches by simply averaging them outperform the other fusion approaches?
2. Maybe the direction information in optical flow is redundant for action recognition. What if we just computed the brightness difference between adjacent frames and treated it as a 4th channel?
3. How can we make a model accept variable input sizes? Most existing models are fixed in input size. And how can we effectively enlarge a well-trained model?
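The idea in comment 2 could be sketched like this. Everything here is my own construction, not the paper's: the function name, the use of BT.601 luma weights for "brightness", and the zero pad on the last frame are all assumptions:

```python
import numpy as np

def append_brightness_diff(frames):
    """Hypothetical: append the brightness difference of adjacent
    frames as a 4th channel to an RGB clip.

    frames: (T, 3, H, W) RGB clip with values in [0, 1].
    """
    # luma-style brightness (ITU-R BT.601 weights, an assumption)
    w = np.array([0.299, 0.587, 0.114]).reshape(1, 3, 1, 1)
    bright = (frames * w).sum(axis=1, keepdims=True)  # (T, 1, H, W)
    diff = np.empty_like(bright)
    diff[:-1] = bright[1:] - bright[:-1]  # temporal difference
    diff[-1] = 0.0                        # pad the last step
    return np.concatenate([frames, diff], axis=1)  # (T, 4, H, W)

clip = np.random.rand(8, 3, 16, 16)
out = append_brightness_diff(clip)
print(out.shape)  # (8, 4, 16, 16)
```

The first conv layer of a pretrained backbone would need its input channels widened from 3 to 4 to consume this.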
7. Closely related papers (less than 3)
1. Ali Diba, et al., Spatio-temporal channel correlation networks for action classification, in ECCV 2018.
2. Limin Wang, et al., Temporal segment networks: Towards good practices for deep action recognition, in ECCV 2016.