Paper: TSM: Temporal Shift Module for Efficient Video Understanding, by Ji Lin, Chuang Gan, Song Han (MIT)

Chen Xiaoyuan

### 1 Background

Conventional 2D CNNs are computationally cheap but cannot capture temporal relationships; 3D-CNN-based methods can achieve good performance but are computationally intensive, making them expensive to deploy.

### 2 Motivation

To design a generic and effective model that enjoys both high efficiency and high performance.

### 3 Methods (including framework)

The core concept of the proposed TSM module is “data shifting”: part of the feature channels are shifted along the temporal dimension so that information is exchanged between neighboring frames. Consider a normal 1-D convolution with kernel size 3 as an example. Suppose the kernel is $$W=(w_1, w_2, w_3)$$ and the input $$X$$ is a 1-D vector of infinite length. Denoting the versions of $$X$$ shifted by $$-1, 0, +1$$ as $$X^{-1}$$, $$X^0$$, $$X^{+1}$$, the convolution result $$Y$$ is:

$$Y=w_1X^{-1}+w_2X^0+w_3X^{+1}$$

As the equation shows, once the shifted copies are available, the 1-D convolution reduces to plain multiply-accumulate operations. Applied along the temporal dimension, the shift operation therefore lets a 2-D CNN capture temporal structure as if it were a 3-D convolutional network.
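The shift-then-multiply-accumulate equivalence above can be checked numerically. A minimal NumPy sketch (the variable names and the zero-padding at the boundary are my choices, not from the paper):

```python
import numpy as np

w1, w2, w3 = 0.2, 0.5, 0.3
x = np.array([1., 2., 3., 4., 5.])

# Shifted copies of x (zero-filled at the boundary, since x is finite here).
x_m1 = np.concatenate(([0.], x[:-1]))   # X^{-1}: each value moved one step later
x_p1 = np.concatenate((x[1:], [0.]))    # X^{+1}: each value moved one step earlier

# Multiply-accumulate over the shifted copies: Y = w1*X^{-1} + w2*X^0 + w3*X^{+1}
y_shift = w1 * x_m1 + w2 * x + w3 * x_p1

# The same result via a standard 1-D cross-correlation with kernel (w1, w2, w3).
y_conv = np.correlate(np.pad(x, 1), np.array([w1, w2, w3]), mode='valid')

assert np.allclose(y_shift, y_conv)
```

The point is that the shifts move all the data, while the remaining arithmetic is purely per-position, which is what makes the trick cheap to fuse into an existing 2-D network.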

Their ablation experiments show that shifting 1/4 of the channels is the best trade-off, and that in-place TSM performs worse than residual TSM (shifting in place harms the network's spatial feature learning ability).
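The channel-partial temporal shift described above can be sketched as follows. This is a simplified NumPy illustration, not the authors' implementation: the `fold_div=8` default (1/8 of channels shifted each direction, 1/4 in total) and the zero-filling of vacated frames are my assumptions about the layout.

```python
import numpy as np

def temporal_shift(x, fold_div=8):
    """Shift a fraction of channels along the time axis.

    x has shape (N, T, C, H, W). The first C//fold_div channels move one
    frame forward in time, the next C//fold_div move one frame backward,
    and the rest stay put; vacated positions are zero-filled."""
    n, t, c, h, w = x.shape
    fold = c // fold_div
    out = np.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]               # forward temporal shift
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]  # backward temporal shift
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]          # unshifted channels
    return out
```

In the residual variant the shifted tensor feeds the convolution branch while the identity path keeps the original activations, which is why it preserves spatial features better than shifting in place.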

### 4 Experiments (data corpus, evaluation metrics, compared methods)

Datasets: Kinetics, UCF101, HMDB51, Something-Something

Framework: TSN [1] with a ResNet-50 backbone

Evaluation Metrics: Prec@1 and Prec@5

### 5 Pros. & Cons.

Pros: Simple and easy to implement, yet it brings a large improvement over the 2D-convolution baseline. This suggests that the shift operation is a promising way to improve other ResNet-style models.

Cons: (None)

### 6 Comments (e.g., improvements)

It seems that some form of shuffling or rerouting of information is often useful in our models: for example, dropout, the identity path of ResNet, TSM, etc.

[1] Limin Wang et al., “Temporal Segment Networks: Towards Good Practices for Deep Action Recognition,” ECCV 2016.