Why Swin Transformer?

Looking Back to Look Forward

Aug 29, 2023

The debut of the Vision Transformer (ViT)[1] confirmed the effectiveness of handling vision data with Transformer[3]-like architectures.

The success of ViT is attributed to its long-range relationship modeling ability: self-attention can capture interactions between distant image parts from the very first layer.

ViT sparked great enthusiasm in the computer vision community, which launched a "rebuild every computer vision task with a Transformer" movement: researchers tried to leverage Transformers for all downstream tasks, like object detection, semantic segmentation, super-resolution, matting, etc.

It did not take long before people realized the limitations of Transformers for solving computer vision tasks. The motivation to bypass or mitigate those limitations yielded the Swin Transformer, which is short for Shifted Window Transformer.

1. ViT cannot differentiate scales between tokens

In the vanilla ViT architecture, all tokens are at a single fixed scale, but objects in images appear at various scales; this mismatch is unsuitable for vision applications like dense prediction.

2. ViT demands computation quadratic in image size

The computation cost of modeling the relationships between tokens is quadratic in the number of tokens, which becomes intractable for high-resolution images.

Swin Transformer borrows the idea of gradually enlarging receptive fields from CNN-like architectures.

Each Swin Transformer block splits the tokens into windows, each containing a fixed number of tokens (patches), and computes self-attention only within each window. The computation per window can therefore be treated as constant, which in turn makes the overall computation cost linear in image size.
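To make the linear-versus-quadratic contrast concrete, here is a minimal sketch that counts pairwise token interactions (ignoring constant factors) for global attention versus window-local attention. The window size of 7 matches the paper's default; the function names are illustrative, not from any library.

```python
# Sketch: compare the token-interaction counts of global self-attention
# (ViT-style) with window-local self-attention (Swin-style).
# Counts are pairwise token interactions, ignoring constant factors.

def global_attention_pairs(h, w):
    """Global attention: every token attends to every token."""
    n = h * w
    return n * n  # quadratic in the number of tokens

def windowed_attention_pairs(h, w, m):
    """Window attention: tokens attend only within their own m x m window."""
    num_windows = (h // m) * (w // m)
    return num_windows * (m * m) ** 2  # linear in the number of tokens

# Doubling the side length quadruples the token count: global cost grows
# 16x, windowed cost only 4x (i.e., linearly with the token count).
for side in (56, 112, 224):
    g = global_attention_pairs(side, side)
    l = windowed_attention_pairs(side, side, 7)
    print(side, g, l)
```

Per window the cost is the constant (M²)², so the total is (number of windows) × constant, i.e., linear in the number of tokens.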

In later layers, Swin Transformer gradually shifts the windows and merges neighboring patches, which enables information exchange over a larger range while at the same time producing tokens at different scales.
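The merging step between stages can be sketched as follows: each 2×2 group of neighboring tokens is concatenated into one token of dimension 4C, then reduced by a linear layer to 2C (halving resolution, doubling channels). The random projection matrix here stands in for the learned weights and is purely an assumption for illustration.

```python
import numpy as np

# Sketch of Swin-style patch merging (the downsampling between stages),
# assuming a feature map of shape (H, W, C).

def patch_merging(x, proj):
    # Gather the four tokens of every 2x2 group and concatenate channels.
    merged = np.concatenate(
        [x[0::2, 0::2], x[1::2, 0::2], x[0::2, 1::2], x[1::2, 1::2]],
        axis=-1,
    )  # shape (H/2, W/2, 4C)
    return merged @ proj  # linear reduction 4C -> 2C

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 16))           # toy feature map, C = 16
proj = rng.standard_normal((4 * 16, 2 * 16))  # stand-in for learned 4C -> 2C weights
y = patch_merging(x, proj)
print(y.shape)  # (4, 4, 32): half the resolution, double the channels
```

This is how later stages obtain coarser-scale tokens without ever changing the per-window attention cost.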

Why Window Shift?

The overall backbone of Swin Transformer consists of stages at different feature scales, just like ResNet[4], but the scale of features/tokens remains constant within each stage.

Confining the self-attention calculation within each window tames the quadratic computation cost between tokens, but it also means tokens cannot exchange information with tokens in other windows.

That is to say, we are restricting the long-range token interaction modeling ability of ViT, the signature attribute that could explain its success.

So the authors proposed shifted windows: in alternating layers, the window partition is shifted by half the window size both downward and rightward. A central window in the shifted layer now spans four windows of the previous layer, so its tokens can exchange information over a longer range, though still a confined one.
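A small sketch makes the "spans four windows" claim checkable: label every token with the id of its window under the original partition, then collect the ids found inside one window of the shifted partition. The toy sizes (8×8 map, window 4, shift 2) are assumptions chosen for readability.

```python
# Sketch: a window of the shifted partition covers tokens from four
# different windows of the original partition.

M, S, H = 4, 2, 8  # window size, shift (half the window), feature-map side

# id of the original window that owns token (i, j)
owner = [[(i // M) * (H // M) + (j // M) for j in range(H)] for i in range(H)]

# one window of the shifted partition starts at (S, S); collect the
# original-window ids of the tokens it contains
shifted_window_ids = {owner[i][j] for i in range(S, S + M) for j in range(S, S + M)}
print(sorted(shifted_window_ids))  # four distinct original windows
```

Attention inside this shifted window therefore mixes information across all four of the previous layer's windows.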

Why Cyclic Shift?

In the shifted configuration, the center window spans four previous windows, but the shift fractures the feature map into nine blocks of unequal size. If we naively pad the fractured blocks with zeros and batch-process them like normal windows, the computation burden becomes 9/4-fold, not to mention the extra memory cost.

The authors invented an elegant cyclic shift mechanism to reorganize the nine blocks.

Fig 1: Cyclic Shift

As shown in Fig 1, blocks A and C are cyclically rolled from the top edge to the bottom, then blocks C and B are rolled from the left edge to the right. The nine fractured blocks are thereby realigned into four windows of the original size, and we can batch-process these windows with no noticeable increase in computation.
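The roll-and-repartition idea can be sketched with `numpy.roll` (the paper's implementation uses the analogous `torch.roll`); the toy sizes are assumptions for illustration.

```python
import numpy as np

# Sketch of the cyclic shift: instead of padding the fractured border
# blocks, the whole feature map is rolled so the pieces on the top/left
# wrap around to the bottom/right, restoring a regular window grid.

H, M, S = 8, 4, 2  # toy feature-map side, window size, shift
x = np.arange(H * H).reshape(H, H)

# roll up and to the left by the shift; the first S rows/cols wrap around
shifted = np.roll(x, shift=(-S, -S), axis=(0, 1))

# the rolled map partitions into (H/M)^2 full windows again -- no padding
windows = shifted.reshape(H // M, M, H // M, M).transpose(0, 2, 1, 3)
print(windows.shape)  # (2, 2, 4, 4): four full-size windows

# rolling back after attention restores the original layout exactly
restored = np.roll(shifted, shift=(S, S), axis=(0, 1))
print(np.array_equal(restored, x))  # True
```

Because the roll is a pure re-indexing, the number of windows (and hence the attention cost) stays exactly the same as in the unshifted layer.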

Why Masking?

After the cyclic shift, the implementation in the paper does not simply compute attention directly: it adds a very large (in absolute value) negative number, like -100, to the attention logits between tokens from windows that were not neighbors in the original window partition. But why is this necessary?

In computer vision, CNN-like architectures preserve locality: semantic information that is spatially close stays spatially close. Locality is especially important for dense prediction tasks like object detection and semantic segmentation.

In the cyclically shifted windows, locality is preserved within each of the original nine blocks but not across the realigned four windows, so the authors mask (add a very negative value to) interactions that violate locality. Since the activations in neural networks are usually small, after masking and softmax the corresponding attention weights become vanishingly close to zero, effectively erasing the interactions we do not want.
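A quick numeric sketch shows why -100 suffices: relative to typical small logits, the masked entries vanish under softmax. The logit values here are made up for illustration.

```python
import math

# Sketch of attention masking: logits between tokens from windows that
# were not neighbors before the cyclic shift get a large negative offset
# (-100, as in the paper's implementation); softmax then drives their
# attention weights to effectively zero.

def softmax(logits):
    m = max(logits)                             # subtract max for stability
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

logits = [0.3, -0.5, 0.8, 0.1]        # typical small attention logits
mask   = [0.0, -100.0, 0.0, -100.0]   # mask out tokens 1 and 3
weights = softmax([l + m for l, m in zip(logits, mask)])
print(weights)  # weights 1 and 3 are ~e^-100 smaller: effectively zero
```

The surviving weights renormalize among the unmasked tokens, so attention behaves as if the masked tokens were never in the window at all.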

In Table 4[2] of the paper, the ablation study shows that without masking, performance deteriorates considerably on dense prediction tasks.


[1] Dosovitskiy, Alexey, et al. “An image is worth 16x16 words: Transformers for image recognition at scale.” arXiv preprint arXiv:2010.11929 (2020).

[2] Liu, Ze, et al. “Swin transformer: Hierarchical vision transformer using shifted windows.” Proceedings of the IEEE/CVF international conference on computer vision. 2021.

[3] Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems 30 (2017).

[4] He, Kaiming, et al. “Deep residual learning for image recognition.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.