CSPNet: Cross Stage Partial Network

Neville
6 min read · Jul 19, 2023


“Maximize the differences of gradient combination”

Introduction: From DenseNet to CSPNet

Thanks to efficient feature reuse and improved information flow, DenseNet[2] achieves remarkable accuracy with relatively few parameters and computations.

Fig 1: DenseNet vs CSPDenseNet

In the article Why isn’t DenseNet adopted as extensively as ResNet?, which I published earlier, I listed several villains that hinder the adoption of DenseNet; high memory usage is the most conspicuous one. High memory usage demands higher tiers of hardware, which are costlier. Heavy memory traffic also gives the system poor data locality, which again slows down inference. CSPNet[1] is here to help.

Review of DenseNet’s Gradient Flow Pattern

The connection pattern of DenseNet can be seen as a way to maximize cardinality.

Fig 2: Convolution Operations of a DenseBlock

Here, x_i denotes the output feature map of the i-th dense layer within a dense block, w_i the parameters of the i-th dense layer, and [x_0, x_1] the concatenation of the feature maps x_0 and x_1, and so on.
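Written out (a reconstruction following the CSPNet paper’s notation, with * denoting the convolution operator), the operations in Fig 2 are:

```latex
\[
\begin{aligned}
x_1 &= w_1 * x_0 \\
x_2 &= w_2 * [x_0, x_1] \\
    &\;\;\vdots \\
x_k &= w_k * [x_0, x_1, \dots, x_{k-1}]
\end{aligned}
\]
```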

Fig 3: Weight Updating of a DenseBlock

g_k is the gradient information propagated to the k-th dense layer within a dense block, f is the weight-updating function given the gradients, and w_k' is the updated set of parameters of the k-th layer. As outlined in the red boxes above:

A large amount of gradient information is reused for updating the weights of different dense layers.

So does this hurt parameter efficiency?

Different dense layers repeatedly learn copied gradient information.
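Written out the same way, the weight-updating pattern of Fig 3 that produces this duplication is:

```latex
\[
\begin{aligned}
w_1' &= f(w_1,\, g_0) \\
w_2' &= f(w_2,\, g_0, g_1) \\
     &\;\;\vdots \\
w_k' &= f(w_k,\, g_0, g_1, \dots, g_{k-1})
\end{aligned}
\]
```

The gradient g_0 appears in every update, g_1 in all but the first, and so on.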

A Cross Stage Partial (CSP) dense block consists of a partial dense block and a partial transition layer.

Split and Merge Scheme of CSPNet

Fig 4: Gradient Flow of CSPNet

First, the feature maps x_0 are split channel-wise into two parts, x_0' and x_0'', at the entry of the CSP block.

x_0'' is directed to the partial dense block and transformed in the same way as in a vanilla DenseBlock; x_i and w_i denote the feature maps and weights of the i-th dense layer.

All feature maps produced within the partial dense block are first concatenated; to adjust the channel dimension, the collected feature maps then go through the partial transition layer, resulting in x_T.

x_0' is shortcut-connected to the other end of the partial dense block. x_T and x_0' are then concatenated and transformed by w_U as a whole, resulting in x_U.
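Putting the stage together, the feed-forward pass and the weight updates of a CSP dense block can be written as (again reconstructed from the paper’s notation):

```latex
\[
\begin{aligned}
x_k &= w_k * [x_0'', x_1, \dots, x_{k-1}] \\
x_T &= w_T * [x_0'', x_1, \dots, x_k] \\
x_U &= w_U * [x_0', x_T]
\end{aligned}
\qquad
\begin{aligned}
w_k' &= f(w_k,\, g_0'', g_1, \dots, g_{k-1}) \\
w_T' &= f(w_T,\, g_0'', g_1, \dots, g_k) \\
w_U' &= f(w_U,\, g_0', g_T)
\end{aligned}
\]
```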

Fig 4 shows the information flow and its gradient-update dependencies. As outlined by the green boxes, gradients propagated to the partial dense block have nothing to do with the gradients of the shortcut-connected x_0'. So:

Both sides do not contain duplicate gradient information that belongs to other sides.

CSPDenseNet preserves the feature-reuse characteristics of DenseNet while reducing the duplicated gradient flow within the network.
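For readers who prefer code, here is a minimal PyTorch sketch of a CSP-style dense block following the split / partial-dense / transition / merge flow described above. The channel counts, the growth rate, and the 1x1 convolutions standing in for the two transition layers (w_T and w_U) are illustrative choices, not the exact configuration from the paper.

```python
import torch
import torch.nn as nn


class DenseLayer(nn.Module):
    """One dense layer: BN -> ReLU -> 3x3 conv producing `growth_rate` new feature maps."""
    def __init__(self, in_channels, growth_rate):
        super().__init__()
        self.bn = nn.BatchNorm2d(in_channels)
        self.conv = nn.Conv2d(in_channels, growth_rate, kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        return self.conv(torch.relu(self.bn(x)))


class CSPDenseBlock(nn.Module):
    """Split x0 into (x0', x0''), run a partial dense block on x0'',
    apply a partial transition (w_T), then merge with x0' and apply w_U."""
    def __init__(self, channels, growth_rate, num_layers):
        super().__init__()
        assert channels % 2 == 0
        self.half = channels // 2
        self.layers = nn.ModuleList()
        in_ch = self.half
        for _ in range(num_layers):
            self.layers.append(DenseLayer(in_ch, growth_rate))
            in_ch += growth_rate
        # partial transition layer (w_T): 1x1 conv over the concatenated dense outputs
        self.transition_t = nn.Conv2d(in_ch, self.half, kernel_size=1, bias=False)
        # transition applied after the merge (w_U)
        self.transition_u = nn.Conv2d(self.half * 2, channels, kernel_size=1, bias=False)

    def forward(self, x):
        x0_a, x0_b = x[:, :self.half], x[:, self.half:]   # x0' and x0''
        features = [x0_b]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))
        x_t = self.transition_t(torch.cat(features, dim=1))      # x_T
        x_u = self.transition_u(torch.cat([x0_a, x_t], dim=1))   # x_U
        return x_u


# quick shape check
block = CSPDenseBlock(channels=64, growth_rate=16, num_layers=4)
print(block(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```

Note that x_0' (x0_a in the code) skips the dense layers entirely, which is what truncates the duplicated gradient flow.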

Insights from the split-and-merge scheme

  1. Number of gradient flow paths doubles

Compared to DenseNet, the gradient propagated back from the loss end is directed along twice as many paths in CSPNet because of the feature-map split scheme, which allows a richer gradient combination.

2. Balance computation of each layer

In DenseNet, the number of channels in the base layer is much larger than the growth rate.

In CSPDenseNet, the partial dense block and the partial transition layer only involve half of the base layer’s channels.

Thus nearly half of the computational bottleneck is removed, as the hypothetical example below illustrates.
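As a hypothetical example, take c = 256 base channels and a growth rate of d = 32. The k-th dense layer then convolves

```latex
\[
c + (k-1)\,d \quad \text{vs.} \quad \tfrac{c}{2} + (k-1)\,d
\]
```

input channels in a plain dense block versus a partial dense block, i.e. 256 vs. 128 at the first layer. Since the c term enters every layer’s input and c is much larger than d, halving it removes a large share of the block’s computation.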

3. Reduce Memory Traffic

Assume the size of the base feature maps is w x h x c, and that the growth rate and the number of layers of a dense block are d and m, respectively.

Memory Access Cost (MAC) calculation of a DenseBlock

Fig 5: Connection Pattern of DenseNet

As the resolution is fixed within a dense block, we omit the w x h factor in the MAC calculation; in effect, we are counting the number of times each whole 2D feature map gets accessed.

The base layer is read by all m dense layers that follow, contributing c x m to the total MAC.

Each layer generates d new feature maps. Each set of d maps is written once and then read by every downstream dense layer, so the first layer’s maps are accessed m times (d x m), the second layer’s m - 1 times (d x (m - 1)), and so on down to d x 1. Summing the series gives a MAC cost of d x m x (m + 1) / 2.

So total MAC of a Dense Block is c x m + d x m x (m + 1) / 2

MAC calculation of CSPDenseBlock

The MAC calculation of the partial dense block is almost identical, except that only half of the base channels enter it, so the total MAC of a CSPDenseBlock is c x m / 2 + d x m x (m + 1) / 2.

Comparison of MAC

When the base channel count c is much larger than d x (m + 1) / 2, the c x m term dominates, and the MAC cost of a CSPDenseBlock approaches half of a DenseBlock’s; in general the saving is at most one half.
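A quick sanity check of the two formulas, with hypothetical values for c, d and m:

```python
def mac_dense_block(c, d, m):
    """c x m for the base-layer reads plus d x m x (m + 1) / 2 for the dense-layer outputs."""
    return c * m + d * m * (m + 1) // 2


def mac_csp_dense_block(c, d, m):
    """Identical, except only half of the base channels enter the partial dense block."""
    return c * m // 2 + d * m * (m + 1) // 2


c, d, m = 512, 32, 6  # hypothetical base channels, growth rate, number of dense layers
dense, csp = mac_dense_block(c, d, m), mac_csp_dense_block(c, d, m)
print(dense, csp, round(csp / dense, 2))  # 3744 2208 0.59
```

With these numbers the saving is about 41%; it approaches one half only when the c x m term dominates.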

Question: doesn’t the concatenation of the feature maps also involve reading the other half, x_0', from memory?

Transition Mechanism of CSPNet

After the partial transformation, we face the feature-fusion problem: how should the two parts be merged? The discussion in the original CSPNet paper[1] is truly insightful.

Logically, we have four choices:

  1. split-transform-transition-concatenate-transition
  2. split-transform-concatenate-transition
  3. split-transform-transition-concatenate
  4. no-split-transform only

Fig 6: Feature Fusion Strategies

So, do we have a guiding principle that could help us make the decision?

Yes:

The purpose of designing the transition layer is to maximize the difference of gradient combination.

Then we get the hierarchical feature fusion mechanism:

Truncating the gradient flow to prevent distinct layers from learning duplicate gradient information

In the graphical illustration of Fig 6, the Fusion Last strategy transforms the partial feature maps first and then does the concatenation. Experimentally, this strategy does not hurt performance while significantly reducing the computation cost. This observation conforms to the fact that the Fusion Last strategy truncates the gradient flow: no gradient information is shared between Part 1 and Part 2.

The Fusion First strategy first concatenates the feature maps from the two parts and then transforms them as a whole. Following the guiding principle that “transition layers should maximize the difference of gradient combination”, Fusion First should perform worse, because a large amount of the gradient information reaching the shared transition layer is reused by both parts. The experimental results verify this deduction.
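To make the two variants concrete, here is a minimal PyTorch sketch of the merge step only. It assumes the partial dense block has already produced dense_out from x_0'', and the 1x1 convolutions standing in for the transition layers are illustrative choices.

```python
import torch
import torch.nn as nn


class FusionFirst(nn.Module):
    """Concatenate both parts first, then apply a single transition layer.
    Gradients flowing through the transition reach both x0' and the dense path."""
    def __init__(self, c_part1, c_dense, c_out):
        super().__init__()
        self.transition = nn.Conv2d(c_part1 + c_dense, c_out, kernel_size=1, bias=False)

    def forward(self, x0_part1, dense_out):
        return self.transition(torch.cat([x0_part1, dense_out], dim=1))


class FusionLast(nn.Module):
    """Apply the transition to the dense path only, then concatenate.
    The shortcut x0' receives no gradient from the transition layer,
    so the gradient flow of the two parts stays separate (truncated)."""
    def __init__(self, c_part1, c_dense, c_out):
        super().__init__()
        self.transition = nn.Conv2d(c_dense, c_out - c_part1, kernel_size=1, bias=False)

    def forward(self, x0_part1, dense_out):
        return torch.cat([x0_part1, self.transition(dense_out)], dim=1)


# shape check with hypothetical channel counts
x0_part1 = torch.randn(1, 32, 16, 16)   # x0'
dense_out = torch.randn(1, 96, 16, 16)  # concatenated partial dense block output
print(FusionFirst(32, 96, 64)(x0_part1, dense_out).shape)  # torch.Size([1, 64, 16, 16])
print(FusionLast(32, 96, 64)(x0_part1, dense_out).shape)   # torch.Size([1, 64, 16, 16])
```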

Further Work

  1. The argument that CSPNet is MAC-saving is not entirely convincing.
  2. Why bother combining the Fusion First and Fusion Last strategies if Fusion Last is already good enough, and Fusion First minimizes the gradient differences between the two parts?

References

[1] Wang, Chien-Yao, et al. “CSPNet: A new backbone that can enhance learning capability of CNN.” Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops. 2020.

[2] Huang, Gao, et al. “Densely connected convolutional networks.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
