[Paper-Reading] TSM: Temporal Shift Module for Efficient Video Understanding

Ji Lin, Chuang Gan, Song Han, from MIT

Chen Xiaoyuan

1 Background

Conventional 2D CNNs are computationally cheap but cannot capture temporal relationships. 3D CNN based methods achieve good performance but are computationally intensive, which makes them expensive to deploy.

2 Motivation

To design a generic and effective model that enjoys both high efficiency and high performance.

3 Methods (including framework)


The core concept of the proposed TSM is “data shifting”. Consider a normal 1-D convolution with a kernel size of 3 as an example. Suppose the kernel is \(W=(w_1, w_2, w_3)\) and the input \(X\) is a 1-D vector of infinite length. Denote the input shifted by -1, 0, and +1 as \(X^{-1}\), \(X^0\), and \(X^{+1}\), where \(X^{s}_i = X_{i+s}\). The convolution result \(Y\) is then:

$$ Y=w_1X^{-1}+w_2X^0+w_3X^{+1} $$

As you can see, once the input is shifted, the 1-D convolution reduces to a plain multiply-accumulate. The shift itself needs no multiplication, so applying it along the temporal dimension lets a 2-D convolutional network capture temporal structure as if it were a 3-D one.
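To make the identity above concrete, here is a minimal NumPy sketch that checks it on a short vector. The helper names `shift` and `conv1d_via_shift` are my own, and the finite input is zero-padded at the boundaries (an assumption, since the text uses an infinite-length vector).

```python
import numpy as np

def shift(x, s):
    """Return a copy of x with out[i] = x[i + s], zero-padded at the boundary."""
    out = np.zeros_like(x)
    if s == 0:
        out[:] = x
    elif s > 0:
        out[:-s] = x[s:]
    else:
        out[-s:] = x[:s]
    return out

def conv1d_via_shift(x, w):
    """Kernel-size-3 convolution written as shift + multiply-accumulate."""
    w1, w2, w3 = w
    return w1 * shift(x, -1) + w2 * shift(x, 0) + w3 * shift(x, +1)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([0.5, 1.0, -1.0])

y_shift = conv1d_via_shift(x, w)
# Reference: the sliding-window form y[i] = w1*x[i-1] + w2*x[i] + w3*x[i+1]
y_ref = np.correlate(x, w, mode="same")

print(np.allclose(y_shift, y_ref))
```

Note that the reference uses `np.correlate` rather than `np.convolve`, because deep-learning "convolution" does not flip the kernel.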

After some experiments, the authors found that shifting 1/4 of the channels is the best choice, and that in-place TSM performs worse than residual TSM.
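The two variants can be sketched as follows. This is only an illustration under my own assumptions, not the authors' code: the input is a `(T, C, H, W)` array, the 1/4 of shifted channels is split evenly between the two temporal directions (hence `shift_div=8`), boundaries are zero-padded, and `branch` is a hypothetical stand-in for the convolution layers of a residual block.

```python
import numpy as np

def temporal_shift(x, shift_div=8):
    """Shift C // shift_div channels one step each way along the time axis.

    x has shape (T, C, H, W); vacated positions are filled with zeros.
    """
    fold = x.shape[1] // shift_div
    out = np.zeros_like(x)
    out[:-1, :fold] = x[1:, :fold]                    # frame t receives frame t+1
    out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]    # frame t receives frame t-1
    out[:, 2 * fold:] = x[:, 2 * fold:]               # remaining channels untouched
    return out

def residual_tsm_block(x, branch):
    """Residual TSM: only the residual branch sees shifted features."""
    return x + branch(temporal_shift(x))

def inplace_tsm_block(x, branch):
    """In-place TSM: the shift is applied before the whole block,
    so the identity path also carries the shifted features."""
    shifted = temporal_shift(x)
    return shifted + branch(shifted)

# Toy input: 3 frames, 8 channels, 1x1 spatial size, with x[t, c] = 8*t + c
x = np.arange(3 * 8, dtype=float).reshape(3, 8, 1, 1)
zero_branch = lambda z: np.zeros_like(z)  # hypothetical branch that outputs zeros

shifted = temporal_shift(x)
res_out = residual_tsm_block(x, zero_branch)
inp_out = inplace_tsm_block(x, zero_branch)
```

With the zero branch, the residual variant returns `x` unchanged while the in-place variant returns the shifted tensor, which shows why the residual form keeps the current frame's original features available through the identity path.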



4 Experiments (data corpus, evaluation metrics, compared methods)

Datasets: Kinetics, UCF101, HMDB51, Something-Something

Framework: TSN with ResNet-50

Evaluation Metrics: Top-1 and Top-5 accuracy



5 Pros. & Cons.

Pros: Simple and easy to implement, yet it brings a large improvement over the 2D conv baseline. This suggests that the shift operation may be a good way to improve other ResNet models.

Cons: (None)

6 Comments (e. g., improvements)

It seems that some kind of shuffling or randomness is indeed useful in our models. For example: dropout, the identity path of ResNet, TSM, etc.

  1. Limin Wang, et al., Temporal Segment Networks: Towards Good Practices for Deep Action Recognition, in ECCV, 2016.