[Paper-Reading] PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume

Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz, from NVIDIA

Chen Xiaoyuan

(I think the paper didn’t describe the model in detail enough, more accurate comprehension will be formed after reading its code)

1 Background

Optical flow estimation is a core computer vision problem, however, estimating it by traditional approach is computationally expensive for real-time applications.

2 Motivation

The author found that combining domain knowledge with deep learning can increase the accuracy and reduce the size of our model simultaneously. And they argue that the performance gap with FlowNetS and FlowNet2 is due to the partial use of the classical principles (which is, important).

3 Methods (including framework)


The whole PWC-Net can be divided into several components.

Feature Pyramid Extractor (to deal with shadow and lighting changes)

This component consists of L layers, and each layer is a downsampling process which is similar to that of ResNet’s (downsampled by a factor of 2).

Warping Layer (to estimate large motion)

The authors chose to warp the optical flow of the second image toward the first image (these two images are consecutive.)

$$ \textbf{c}^l_w(\textbf{x}) = \textbf{c}^l_2(\textbf{x}+\text{up}_2(\textbf{w}^{l+1})(\textbf{x})) $$

where x is the pixel index and \(\text{up}_2(w^{(l+1)})\) is the upsampled flow.

Cost Volume Layer (a more discriminative representation of the optical flow)

They defined the matching cost as the correlation between features of the first image and warped features of the second one:

$$ \textbf{c}\textbf{v}^l(\textbf{x}_1, \textbf{x}_2)=\frac1N(\textbf{c}^l_1(\textbf{x}_1))^T\textbf{c}^l_w(\textbf{x}_2) $$

As it costs a lot to calculate the whole cost volume, they chose to compute a partial one instead.

Optical Flow Estimator: a normal multilayer CNN network

Context Network (refinement of estimated optical flow): some dilated CNN networks

Plus, the training loss function is:

$$ \mathcal{L}(\Theta) = \sum^L_{l=l_0}\alpha_l\sum_\textbf{x}|\textbf{w}^l_\Theta(\textbf{x})-\textbf{w}^l_\textbf{GT}(\textbf{x})|_2+\gamma|\Theta|_2 $$

where Θ is the parameters of our network.

4 Experiments (data corpus, evaluation metrics, compared methods)

Datasets: FlyingChairs, MPI Sintel, KITTI

Evaluation metrics: EPE(End-Point Error)



5 Pros. & Cons.

Pros: Need much less memory; Outperforms the traditional methods (in accuracy, compared to FlowNetS and etc.)

Cons: Too compact to implement and comprehend; Less accurate than traditional ways.

6 Comments (e. g., improvements)

None yet, as it’s too complex for me, I need to dive into its code firstly.

But, anyway, I don’t like it much. Since I think the best and most useful models are simple ones (but meanwhile, with insights), for instance, the Batch Normalization of Google.

  1. A. Doscovitskiy, et al., FlowNet: Learning optical flow with convolutional networks, in ICCV, 2015.
  2. A. Ranjan, et al., Optical flow estimation using a spatial pyramid network, In CVPR, 2017.