Why isn’t DenseNet adopted as extensively as ResNet?

7 min read · Jul 15, 2023

Introduction

Generally speaking, any neural network architecture that has a shortcut connection can be seen as a special case of DenseNet, and from this viewpoint, the most relevant difference between those architectures is how dense their connections are. ResNet can be viewed as a DenseNet made of stacked Dense Blocks of only one layer each.

Given that DenseNet [2] requires fewer parameters and can be trained more efficiently, the question arises: why isn’t DenseNet used as extensively as ResNet [1]?

Connection Patterns of Vanilla CNN, ResNet and DenseNet

ResNet is More Straightforward

ResNet adds a shortcut from the input feature to the output of the learnable mapping in each Residual Block. Within a Dense Block, DenseNet goes even further by concatenating the feature maps of all preceding layers and feeding them to each following layer.
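To make the two connection patterns concrete, here is a minimal sketch in plain Python. It assumes each "channel" is a single number, and `toy_layer` is a hypothetical stand-in for a learnable mapping (not a real convolution). The only point is the difference between element-wise addition and concatenation.

```python
def toy_layer(channels, out_channels):
    """Hypothetical stand-in for a learnable mapping.

    Emits `out_channels` new channels, each set to the mean of the inputs.
    """
    mean = sum(channels) / len(channels)
    return [mean] * out_channels


def residual_block(x):
    """ResNet pattern: output = F(x) + x, so the channel count is unchanged."""
    fx = toy_layer(x, len(x))
    return [a + b for a, b in zip(fx, x)]


def dense_block(x, growth_rate, num_layers):
    """DenseNet pattern: every layer sees all earlier feature maps."""
    features = list(x)
    for _ in range(num_layers):
        new_maps = toy_layer(features, growth_rate)  # input: everything so far
        features = features + new_maps               # concatenation, not addition
    return features


x = [1.0, 2.0, 3.0, 4.0]
print(len(residual_block(x)))                             # 4
print(len(dense_block(x, growth_rate=2, num_layers=3)))   # 4 + 3 * 2 = 10
```

The residual block keeps the width fixed, while the dense block's input grows by `growth_rate` channels at every layer.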

The authors of DenseNet claim that DenseNet is superior to ResNet in both efficiency and accuracy. But ResNet’s architecture is clearly much more succinct.

1. A simpler network means less pain to tune

From an engineering standpoint, a less sophisticated architecture means fewer hyper-parameters to tune. This greatly reduces the time and effort needed to train the network to its optimal state.

To give you an idea, suppose N hyper-parameters define a network and its training schedule, and we tune the network using grid search, which tries a fixed number of values for each hyper-parameter. If we try M values per hyper-parameter, the total number of trials is M^N: the running time grows exponentially with the number of hyper-parameters!
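The M^N growth is easy to check with a small sketch. The grid below is hypothetical (the parameter names and values are illustrative assumptions, not taken from either paper), with M = 4 candidate values for each of N = 3 hyper-parameters.

```python
from itertools import product


def grid_search_trials(grid):
    """Enumerate every combination of values in a hyper-parameter grid."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*grid.values())]


# Hypothetical grid: N = 3 hyper-parameters, M = 4 values each.
grid = {
    "learning_rate": [1e-1, 1e-2, 1e-3, 1e-4],
    "weight_decay": [0.0, 1e-5, 1e-4, 1e-3],
    "batch_size": [32, 64, 128, 256],
}
trials = grid_search_trials(grid)
print(len(trials))  # 4 ** 3 = 64

# One extra hyper-parameter (for example a growth rate) multiplies the count by M.
grid["growth_rate"] = [12, 24, 32, 48]
print(len(grid_search_trials(grid)))  # 4 ** 4 = 256
```

Every additional hyper-parameter multiplies the number of full training runs by M, which is why a few extra knobs matter so much in practice.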

For a DenseNet-based backbone, besides all the hyper-parameters that define a ResNet-based architecture, like the number of filters in each convolutional layer, the width of the feature maps and the total number of layers, you also have to pick the growth rate k, the number of Dense Blocks and many other details that define the transition layers. So DenseNet requires more hyper-parameters.

2. Working in a team requires delivery to be more predictable

In the academic arena, the authors of a novel architecture will only show you the best results with regard to training efficiency and inference accuracy. They won’t tell you how many trials failed; they will only show you the one that looks good.

Once a novel architecture comes out, its authors reap intellectual influence and prestige. So they are incentivized to find the seemingly optimal state of the architecture they proposed, no matter how hard that is.

As for real-world applications, engineers and product managers care more about ROI (return on investment) than about metrics like FPS or mAP. More time spent tuning the network means higher expenses. A potential improvement on the model side does not always need to be exploited if the return doesn’t justify the expense.

On top of that, they quite likely have to work to a deadline. Working in a team means you have to reduce the uncertainty in your delivery date.

Compared with spending unbounded time for an infinitesimal gain in performance metrics, a simpler architecture is more likely to be delivered on time.

Memory Access Cost (MAC)

Memory Access Cost is defined as the number of memory access operations. MAC is a topic frequently seen in works related to efficient network architectures, like ShuffleNetV2 [3]. For some networks the memory access cost may be the dominant factor for inference efficiency.

And the take-home message from the discussion of MAC is: fewer parameters do not always mean faster inference.

The connection patterns determine that, for ResNet, the MAC grows linearly with the number of layers, but for DenseNet it grows quadratically with the number of layers in a Dense Block. A higher MAC can make DenseNet’s inference slower even though it has fewer parameters than ResNet.
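The linear versus quadratic growth can be illustrated with a deliberately crude counting model. It assumes the cost of a layer is just the number of input channels it has to read, and it ignores spatial size, weights and output writes. The function names and the channel numbers are illustrative assumptions, not measurements.

```python
def resnet_reads(num_layers, channels):
    """Channels read across a stack of residual layers.

    Each layer reads its input once for the mapping and once for the
    shortcut addition, so the per-layer cost is constant.
    """
    return sum(2 * channels for _ in range(num_layers))


def densenet_reads(num_layers, in_channels, growth_rate):
    """Channels read across one Dense Block.

    Layer i reads the block input plus the i * growth_rate channels
    produced by the layers before it.
    """
    return sum(in_channels + i * growth_rate for i in range(num_layers))


for n in (8, 16, 32):
    print(n, resnet_reads(n, 64), densenet_reads(n, 64, 32))
```

Under this model, doubling the depth doubles the ResNet count, while the DenseNet count grows by a factor that approaches four, because the sum contains a term proportional to growth_rate * n * (n - 1) / 2.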

High Memory Usage

In DenseNet, each layer has to access the feature maps of all preceding layers in the same Dense Block, so those feature maps have to stay loaded in memory. As a result, the memory demand is much higher than that of ResNet.

Higher memory usage demands higher-tier hardware for deployment. Better hardware means higher cost, which again hampers the adoption of DenseNet in commercial computer vision products.

Poor Data Locality

In a computer system, a hierarchy of memory caches helps mitigate the trade-off between large memory capacity and the high cost of fast memory.

Data that is more likely to be fetched next by the processor is placed in the faster, more expensive caches.

If the data is not found at one level, the system searches the next lower level, and so on, and then places the data in the cache nearest to the processor. For example, if the processor can’t find r1 in its registers, it searches the L1 cache. If r1 is found there, it is fetched back, though at a higher time cost than a register hit. If not, the processor searches the even lower-level L2 cache.

Afterwards the caches are updated, and because their sizes are fixed, something has to be evicted. To reduce time cost, we keep the upper-level caches as full as they can be. So whenever a cache misses a value the processor needs, the value is placed into the higher-level caches, and in return a value in a higher-level cache is pushed out to a lower-level cache.
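The lookup, promotion and eviction steps described above can be sketched with a toy two-level hierarchy. This is an illustrative model only: the class name, the least-recently-used eviction policy and the capacities are assumptions, and real hardware caches are considerably more involved.

```python
from collections import OrderedDict


class TwoLevelCache:
    """Toy hierarchy: a small fast level above a larger slow level."""

    def __init__(self, fast_capacity, slow_capacity):
        self.fast = OrderedDict()  # least recently used key comes first
        self.slow = OrderedDict()
        self.fast_capacity = fast_capacity
        self.slow_capacity = slow_capacity

    def access(self, key):
        """Return which level served the request: 'fast', 'slow' or 'miss'."""
        if key in self.fast:
            self.fast.move_to_end(key)  # mark as most recently used
            return "fast"
        served_by = "slow" if key in self.slow else "miss"
        self.slow.pop(key, None)
        self._promote(key)
        return served_by

    def _promote(self, key):
        """Place key in the fast level, pushing the oldest entry down."""
        self.fast[key] = True
        if len(self.fast) > self.fast_capacity:
            victim, _ = self.fast.popitem(last=False)
            self.slow[victim] = True
            if len(self.slow) > self.slow_capacity:
                self.slow.popitem(last=False)  # falls out of the hierarchy


cache = TwoLevelCache(fast_capacity=2, slow_capacity=4)
print([cache.access(k) for k in ["a", "b", "a", "c", "b"]])
# ['miss', 'miss', 'fast', 'miss', 'slow']
```

In the trace, the second access to "b" is served from the slow level because "b" was pushed down when "c" arrived.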

At the bottom of the hierarchy, main memory (RAM) can itself be seen as a cache for the hard disk. Lower levels are orders of magnitude cheaper per byte than higher levels, so their storage capacity can be much larger.

So a hierarchy of caches lets the computer system benefit from both the speed of the higher-level caches and the large storage capacity of the lower-level ones.

Let’s go back to our comparison of ResNet and DenseNet. We already know that DenseNet requires more memory than ResNet, as DenseNet has to keep more feature maps in memory for later use.

Given the same hardware, the cache capacities are fixed. More memory usage means more frequent swaps between cache levels. Consequently, DenseNet exhibits poorer data locality, as more cache misses occur. Each cache miss incurs an extra time penalty, so the inference speed of DenseNet suffers.
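A small simulation shows how sharply the miss rate can change once the working set no longer fits in a fixed-size cache. It is a sketch under stated assumptions: a single cache with least-recently-used eviction and a cyclic access pattern, both chosen for illustration rather than taken from any measurement of DenseNet or ResNet.

```python
from collections import OrderedDict


def lru_miss_rate(accesses, capacity):
    """Fraction of accesses that miss in an LRU cache of the given capacity."""
    cache = OrderedDict()
    misses = 0
    for key in accesses:
        if key in cache:
            cache.move_to_end(key)         # refresh recency on a hit
        else:
            misses += 1
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used key
    return misses / len(accesses)


def cyclic_accesses(working_set, passes):
    """Touch items 0..working_set-1 in order, repeated `passes` times."""
    return [i for _ in range(passes) for i in range(working_set)]


capacity = 8
fits = lru_miss_rate(cyclic_accesses(8, 10), capacity)
overflows = lru_miss_rate(cyclic_accesses(9, 10), capacity)
print(fits, overflows)  # 0.1 1.0
```

With a working set of 8 items only the first pass misses, but a single extra item makes every access miss, because each item is evicted just before it is needed again.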

Backbone’s Influence is Fading for Downstream Tasks

When people in the CV community talk about backbones like ResNet and DenseNet, the discussion usually relates to downstream tasks like object detection, semantic segmentation, etc.

1. Better backbones usually mean better performance

For example, it’s quite common practice to fuse information from the feature maps of different layers, which provides detail from low-level feature maps and broad context from higher-level feature maps.

A better backbone means the feature extractor (the learned CNN filters) can extract more discriminative and robust features from visual data. Better features ease the learning of downstream tasks, whereas noisy feature maps may confuse it.

2. If backbones perform roughly the same, other factors come to dominate

However, the above discussion is only meaningful if the performance gap is large. There are other factors that guide the learning of the model, such as data augmentation techniques, loss functions, training tricks and information fusion patterns.

The influence of the backbone usually fades in downstream tasks. That is to say, you may not observe a performance boost alluring enough to make you choose the more complex DenseNet architecture over ResNet.

The Superiority is Merely an Illusion

Unfortunately, what we are told in papers is not always true. The superiority of DenseNet may just be an illusion.

The Success of DenseNet Lacks Theoretical Backing

We should not take engineering tricks as seriously as theoretical truth! The DenseNet paper only tells us that DenseNet performs better than ResNet on datasets like ImageNet; anything beyond that is not guaranteed. When handling real-world data, you should not be surprised to find your experimental results contradicting claims in papers.

“Storytelling” is pervasive in the AI and deep learning community. In fact, the rationales (“stories”) that papers offer for the success of their neural architectures usually go far beyond what the available mathematical tools and theoretical insights can support.

When you find your neural architecture works well on a dataset, you give some explanations to justify your neural architectures. However, until the networks’ generalization ability is theoretically vindicated, we can only call it a “hypothesis”. We should not be shocked when a hypothesis fails in real life.

For classical works like ResNet, its effectiveness has been tested in countless applications, so people empirically believe ResNet is indeed superior. And for engineers, if ResNet predicts well, why bother asking “the why” like research scientists do?

So the fact that DenseNet is not adopted as widely as ResNet can be read as an indication of DenseNet’s weaker generalization ability.

References

[1] He, Kaiming, et al. “Deep residual learning for image recognition.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

[2] Huang, Gao, et al. “Densely connected convolutional networks.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.

[3] Ma, Ningning, et al. “ShuffleNet V2: Practical guidelines for efficient CNN architecture design.” Proceedings of the European conference on computer vision (ECCV). 2018.