A Tour of MAE: Masked Autoencoder
Why did the progress of self-supervised learning (e.g., masked autoencoding) in computer vision lag behind that of Natural Language Processing (NLP)? The MAE paper [1] points to two reasons.
1. Information density
Words are information-dense, while images are highly redundant: a missing patch can often be inferred from neighboring pixels.
MAE therefore uses a radically high mask ratio (75%) to force the autoencoder to build a holistic understanding of the image rather than rely on low-level details (see the masking sketch after this list).
2. Convolutions handle mask tokens and positional embeddings awkwardly
This architectural gap has been addressed by the introduction of Vision Transformers (ViT), which operate directly on patch tokens.
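To make the 75% mask ratio concrete, here is a minimal sketch of MAE-style random masking, assuming the image has already been split into a sequence of patch embeddings of shape (batch, num_patches, dim). Function and variable names are illustrative, not taken from the official code.

```python
import torch

def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    """Keep a random subset of patches per sample; return the kept patches,
    a binary mask (1 = masked), and indices to restore the original order."""
    batch, num_patches, dim = patches.shape
    num_keep = int(num_patches * (1 - mask_ratio))

    # Random scores decide which patches survive for each sample.
    noise = torch.rand(batch, num_patches, device=patches.device)
    ids_shuffle = torch.argsort(noise, dim=1)        # lowest scores are kept
    ids_restore = torch.argsort(ids_shuffle, dim=1)  # undoes the shuffle later

    ids_keep = ids_shuffle[:, :num_keep]
    visible = torch.gather(
        patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, dim))

    # Binary mask in the original patch order: 0 = visible, 1 = masked.
    mask = torch.ones(batch, num_patches, device=patches.device)
    mask[:, :num_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)
    return visible, mask, ids_restore


# Example: 196 patches (14x14), 75% masked -> only 49 visible patches remain.
x = torch.randn(2, 196, 768)
visible, mask, ids_restore = random_masking(x, mask_ratio=0.75)
print(visible.shape)  # torch.Size([2, 49, 768])
```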
Other highlights of MAE
Asymmetric encoder and decoder
- The encoder operates only on the visible (unmasked) patches, which keeps computation low even at a 75% mask ratio.
- The lightweight decoder reconstructs both the visible and the masked patches from the latent representation together with mask tokens (see the sketch below).
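The following is a minimal sketch of this asymmetric flow, reusing the `random_masking` helper from the earlier sketch. Class and parameter names are illustrative, positional embeddings and per-patch normalization are omitted for brevity, and generic transformer layers stand in for the actual ViT blocks.

```python
import torch
import torch.nn as nn

class TinyMAE(nn.Module):
    def __init__(self, dim=768, dec_dim=512, patch_pixels=768):
        super().__init__()
        # Encoder sees only the visible patches (~25% of the sequence).
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        self.enc_to_dec = nn.Linear(dim, dec_dim)
        # A shared, learned mask token stands in for every masked patch.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        # The decoder is lightweight; it processes the full-length sequence.
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dec_dim, nhead=8, batch_first=True), num_layers=1)
        self.to_pixels = nn.Linear(dec_dim, patch_pixels)

    def forward(self, patches, mask_ratio=0.75):
        visible, mask, ids_restore = random_masking(patches, mask_ratio)
        latent = self.encoder(visible)            # encodes visible patches only
        latent = self.enc_to_dec(latent)

        # Append mask tokens and restore the original patch order.
        b, n_vis, d = latent.shape
        n_masked = ids_restore.shape[1] - n_vis
        mask_tokens = self.mask_token.expand(b, n_masked, d)
        full = torch.cat([latent, mask_tokens], dim=1)
        full = torch.gather(full, 1, ids_restore.unsqueeze(-1).expand(-1, -1, d))

        pred = self.to_pixels(self.decoder(full))  # reconstructs every patch
        # MAE computes the reconstruction loss only on the masked patches.
        loss = ((pred - patches) ** 2).mean(dim=-1)
        return (loss * mask).sum() / mask.sum()


# Example usage with the toy patch tensor from the masking sketch:
# model = TinyMAE(); loss = model(x); loss.backward()
```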
References
[1] He, Kaiming, et al. “Masked autoencoders are scalable vision learners.” Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022.