In the case of a network for classification, we require the output to be of fixed length (say, the number of classes).
Feed a batch of 3-dimensional images (BCHW: batch, channels, height, width) into the network and walk it through the stack of CNN layers; right before the last fully connected layer, the feature maps have some shape bchw of their own.
The length (number of output neurons) of the final output is fixed, so the fully connected layer's dimensions are fixed as well, and it follows that the input to the fc layer must also be of fixed size.
Following the above discussion, the input image's size must also be fixed if there are only plain CNN layers between the input and the fully connected layer, since convolution and pooling simply scale the feature map size with the input size.
If so, this greatly reduces the versatility of the model: it won't be able to classify images of other sizes directly.
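To make this concrete, here is a minimal sketch (PyTorch is assumed, and the layer sizes are made up for illustration) of how a hard-coded fully connected layer pins down the input size:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)  # happy with any H x W
fc = nn.Linear(16 * 32 * 32, 10)                   # hard-coded to 32 x 32 feature maps

x_ok = torch.randn(1, 3, 32, 32)
x_bad = torch.randn(1, 3, 48, 48)

print(fc(conv(x_ok).flatten(1)).shape)  # torch.Size([1, 10])
# fc(conv(x_bad).flatten(1))            # fails: 16*48*48 features, but fc expects 16*32*32
```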
Here is where Spatial Pyramid Pooling layers come to help; we usually know this as the SPP module.
The SPP module takes 4-dimensional features of various sizes, which we denote as BCHW, and its output is of a fixed size BCL, where L is the output's length per channel.
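As a quick shape check, here is a sketch of that contract (PyTorch assumed; a single 4 x 4 adaptive pooling level stands in for the full module, so L = 16 per channel): variable (B, C, H, W) in, fixed (B, C, L) out.

```python
import torch
import torch.nn as nn

pool = nn.AdaptiveMaxPool2d(output_size=(4, 4))  # one pooling level, L = 16 per channel

for h, w in [(13, 17), (32, 32), (40, 48)]:
    x = torch.randn(2, 256, h, w)    # B=2, C=256, arbitrary H x W
    y = pool(x).flatten(2)           # (2, 256, 4, 4) -> (2, 256, 16)
    print(tuple(x.shape), "->", tuple(y.shape))
```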
SPP employs pooling to achieve the fixed size: whatever the size of the input, the output is fixed, because the window of the pooling operation is chosen accordingly. To give you an idea, let's walk through a concrete example.
If the output's length is predefined as 30 per channel, arranged as a 5 x 6 grid of bins, and the input feature map is of size 40 x 48, the SPP module first calculates the pooling window size to be 8 x 8 (40 / 5 = 8 and 48 / 6 = 8). Each pooling window produces a single value, and these values are concatenated, resulting in a one-dimensional fixed-length vector of size 30 for each channel.
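Redoing that arithmetic in code (a sketch assuming PyTorch; the 5 x 6 bin grid is the assumption that makes 30 bins come out of 8 x 8 windows):

```python
import torch
import torch.nn.functional as F

bins_h, bins_w = 5, 6            # 5 * 6 = 30 bins per channel
h, w = 40, 48                    # size of the incoming feature map

win_h, win_w = h // bins_h, w // bins_w   # 8 x 8 pooling window
x = torch.randn(1, 256, h, w)
y = F.max_pool2d(x, kernel_size=(win_h, win_w), stride=(win_h, win_w))
print(y.shape)              # torch.Size([1, 256, 5, 6])
print(y.flatten(2).shape)   # torch.Size([1, 256, 30])
```

In general the window size won't divide the feature map exactly, which is why implementations typically round the window size up (and the stride down), or simply use adaptive pooling.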
Now we have an idea of what is meant by spatial pooling, so where is the "Pyramid"?
In computer vision tasks, there is always a trade-off between large-scale context information and localized detail information.
Intuitively, both kinds of information are essential for robust inference. For example, both cats and dogs are furry; without enough context, we can't tell them apart from localized details alone. Conversely, if we only look at the global information, we can't capture variations like pose and illumination within the images, so a cat sleeping like a dog may be falsely classified as a dog.
So how do we maintain context information and detail information at the same time? In classical CV work like SIFT, image pyramids are key to success.
The idea is to recursively resample the image (usually by half) so that all these images stack up like a pyramid. In our case we build a feature pyramid in the same spirit, pooling the feature map at several spatial resolutions.
For each level of the feature pyramid, we use spatial pooling to get a fixed-length output, then we concatenate all these outputs, which is again of fixed length.
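Putting it all together, here is a minimal SPP module sketch (PyTorch assumed; the 4 x 4, 2 x 2, 1 x 1 levels are a common choice, not the only one), using adaptive max pooling so the window sizes are derived from the input automatically:

```python
import torch
import torch.nn as nn


class SPP(nn.Module):
    def __init__(self, levels=(4, 2, 1)):
        super().__init__()
        self.pools = nn.ModuleList([nn.AdaptiveMaxPool2d(n) for n in levels])

    def forward(self, x):                     # x: (B, C, H, W), any H and W
        # Each level gives (B, C, n, n) -> (B, C, n*n); concatenate along the last dim.
        feats = [pool(x).flatten(2) for pool in self.pools]
        return torch.cat(feats, dim=2)        # (B, C, 16 + 4 + 1) = (B, C, 21)


spp = SPP()
for h, w in [(13, 17), (40, 48)]:
    out = spp(torch.randn(2, 256, h, w))
    print(out.shape)                          # always torch.Size([2, 256, 21])
```

Flatten that (B, C, 21) output once more and it is ready to feed a fully connected layer of fixed input size, regardless of the original image size.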
You see, now you understand why we need the SPP module and how it is implemented!