Let’s firstly review how Neural networks work.
In the context of Classification problem, given the input images, shallow layers would extract low level features like image pixel gradients etc, these low level features are recursively combined to form more abstract features in the later layers.
It is the abstract high level features in the last layer make the classification possible or easier.
Naively, we suppose by adding more layers, the network would learn more differentiable features and boost the classification accuracy. But when the number of layers surpass a certain amount, the performance gets poorer.
Is the Degradation problem caused by overfitting?
No, if the network suffers overfitting, the training error would continue decreasing and large test/validation error should be observed. While in the degradation problem, both training error and testing error go up as more layers stacked.
Degradation indicates that deeper neural networks are more difficult to train.
A few lines of reasoning will show us that shallow neural networks should produce no higher performance than their deeper counterpart.
Start from the shallow neural networks, we gradually add more identity layers, until the number of layers match that of the deeper one. Then they two should make identical predictions.
Neural network training is mathematically equivalent to a non-convex optimization problem. And the chance to get trapped into a local minimum is going higher as more layers stacked.
We hypothesize that: the degradation problem comes from the difficulty in optimizing a larger number of parameters.
Compared to a plain CNN layer, Residual Block adds a short cut from the input features to the output of the mapping. Say, the input features are X, the output from the original mapping is f(x), now the output from the residual block is H(x) = f(x) + x, our desired output is H(x), but what the network is trying to fit is the mapping f(x) = H(x) — x, in other words, we are learning the residuals from the output in relate to the input.
Even shallow layers are suffice for a certain problem, the network can choose not to optimize the mapping f(x), and leave the residual block as an identity mapping.
Experiments and real-world applications show that Residual Block do solve the degradation problem.