A Tour of MAE: Masked AutoEncoder

Looking Back to Look Forward

Aug 23, 2023

Why did progress in self-supervised learning, such as masked autoencoding, lag behind in Computer Vision compared with Natural Language Processing (NLP)?

1. Information density

Words are information-dense, whereas images are highly redundant: a missing region can often be inferred from nearby pixels.

So MAE uses a much higher mask ratio, around 75% (compared with roughly 15% in BERT), which forces the autoencoder to build a holistic understanding of the image rather than rely on low-level details.
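To make the 75% figure concrete, here is a minimal sketch of uniform random patch masking in plain Python. The function name `random_masking`, the 224x224 image size, and the 16x16 patch size are assumptions chosen for illustration, not details taken from this post.

```python
import random

def random_masking(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into visible and masked sets by uniform random sampling."""
    rng = random.Random(seed)
    num_keep = int(num_patches * (1 - mask_ratio))  # patches the encoder will see
    shuffled = list(range(num_patches))
    rng.shuffle(shuffled)
    visible = sorted(shuffled[:num_keep])
    masked = sorted(shuffled[num_keep:])
    return visible, masked

# Assumed setup: a 224x224 image cut into 16x16 patches gives a 14x14 grid.
num_patches = (224 // 16) ** 2  # 196 patches
visible, masked = random_masking(num_patches, mask_ratio=0.75, seed=0)
print(len(visible), len(masked))  # 49 147
```

With these assumed sizes, only 49 of 196 patches stay visible, so the model cannot fill in a missing patch just by copying its immediate neighbors.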

2. Convolutions handle mask tokens and positional embeddings awkwardly

The introduction of Vision Transformers (ViT) has mitigated this gap.

Other highlights of MAE

Asymmetric encoder and decoder

  1. The encoder operates only on the visible part of the image.
  2. The decoder reconstructs both the visible and invisible parts from the latent representation and mask tokens.
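The two steps above can be sketched with a toy example. This is not the paper's implementation: `toy_encoder` is a hypothetical placeholder for the ViT encoder, strings stand in for patch embeddings, and `MASK_TOKEN` stands in for a shared learned vector. The only point is to show which tokens each half of the model receives.

```python
MASK_TOKEN = "<mask>"  # stand-in for a shared, learned mask-token vector

def toy_encoder(visible_patches):
    """Hypothetical placeholder for the encoder: it receives visible patches only."""
    return [f"latent({p})" for p in visible_patches]

def build_decoder_input(latents, visible_idx, num_patches):
    """Put encoded tokens back at their original positions and fill the rest with mask tokens."""
    tokens = [MASK_TOKEN] * num_patches
    for idx, latent in zip(visible_idx, latents):
        tokens[idx] = latent
    return tokens

patches = [f"p{i}" for i in range(8)]
visible_idx = [1, 6]  # 2 of 8 patches visible, i.e. a 75% mask ratio

latents = toy_encoder([patches[i] for i in visible_idx])
decoder_input = build_decoder_input(latents, visible_idx, len(patches))

print(len(latents), len(decoder_input))  # 2 8
print(decoder_input)
```

The encoder processes 2 tokens while the decoder processes all 8, which is where the asymmetry comes from: the expensive encoder runs on a small fraction of the image, and only the decoder sees the full-length sequence.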


[1] He, Kaiming, et al. “Masked autoencoders are scalable vision learners.” Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022.