Masked Autoencoders

Key Points
Self-supervised learning
It is not contrastive learning, though: there are no negative pairs.
The representation is learned through the masked patch prediction task itself.
Masked Patch Prediction
In ViT, the encoder is trained for class prediction, whereas in MAE the encoder and decoder are trained for masked patch prediction, a form of image reconstruction.
More sample-efficient training compared to supervised ViT training, since self-supervision removes the need for labels.
Can be considered a framework for pre-training ViT-like transformers.
A masked autoencoder can be thought of as BERT for images, where masked tokens are predicted from the neighbouring visible tokens.
Note that the model does not produce high-quality images, i.e. the reconstruction quality is not great. The point is that the representation it learns is good enough to be used by downstream tasks.
This is the pre-train-to-learn-representations, then fine-tune-on-the-task paradigm (a minimal sketch follows).
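A minimal sketch of that two-stage paradigm, assuming a generic PyTorch transformer encoder. The names (`Classifier`, `pretrained_encoder`, `num_classes`) and the dimensions are placeholders chosen for illustration, not from the MAE code base.

```python
import torch
import torch.nn as nn

embed_dim, num_classes = 256, 1000

# Stage 1 (self-supervised): this encoder, plus a small decoder, would be trained on
# masked patch reconstruction using unlabeled images only (see the Approach section below).
pretrained_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True),
    num_layers=4)

# Stage 2 (supervised): discard the decoder, keep the encoder, add a small task head.
class Classifier(nn.Module):
    def __init__(self, encoder, embed_dim, num_classes):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, patch_tokens):            # (B, N, embed_dim) patch embeddings
        feats = self.encoder(patch_tokens)      # reuse the pre-trained representation
        return self.head(feats.mean(dim=1))     # mean-pool tokens, then classify

model = Classifier(pretrained_encoder, embed_dim, num_classes)
```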
Approach
Mask random patches of the image (about 75%), then reconstruct the missing patches/pixels using only the remaining visible patches as input.
The encoder is given only the visible subset of patches as input (the masked patches are dropped, not replaced). A decoder then predicts the complete image (visible and masked patches) from the latent representation (the encoder output) together with mask tokens (learned placeholder tokens).
Because the encoder only processes the visible ~25% of patches, this approach also makes pre-training more compute-efficient, on top of being more sample-efficient (see the sketch below).
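A minimal, self-contained sketch of this forward pass in PyTorch. This is not the official implementation: `TinyMAE`, the layer sizes, and the encoder/decoder depths are assumptions chosen for brevity, and positional embeddings plus per-patch pixel normalization are omitted.

```python
import torch
import torch.nn as nn


class TinyMAE(nn.Module):
    """Sketch of an MAE forward pass: encode visible patches only,
    decode with mask tokens, compute loss on the masked patches."""

    def __init__(self, patch_dim=768, enc_dim=256, dec_dim=128, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.patch_embed = nn.Linear(patch_dim, enc_dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(enc_dim, nhead=4, batch_first=True), num_layers=2)
        self.enc_to_dec = nn.Linear(enc_dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))   # placeholder token
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dec_dim, nhead=4, batch_first=True), num_layers=1)
        self.to_pixels = nn.Linear(dec_dim, patch_dim)               # predict raw patch pixels

    def forward(self, patches):                       # patches: (B, N, patch_dim)
        B, N, D = patches.shape
        n_keep = int(N * (1 - self.mask_ratio))       # e.g. keep 25% of the patches

        # Random shuffle per sample; the first n_keep indices form the visible set.
        shuffle = torch.rand(B, N, device=patches.device).argsort(dim=1)
        keep_idx = shuffle[:, :n_keep]

        # Encoder sees only the visible patches (masked ones are dropped, not zeroed).
        visible = torch.gather(patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
        latent = self.encoder(self.patch_embed(visible))

        # Decoder input: encoded visible tokens + mask tokens, unshuffled back to image order.
        dec_tokens = torch.cat(
            [self.enc_to_dec(latent), self.mask_token.expand(B, N - n_keep, -1)], dim=1)
        unshuffle = shuffle.argsort(dim=1)
        dec_tokens = torch.gather(
            dec_tokens, 1, unshuffle.unsqueeze(-1).expand(-1, -1, dec_tokens.size(-1)))
        pred = self.to_pixels(self.decoder(dec_tokens))

        # Reconstruction (MSE) loss is computed only on the masked patches.
        mask = torch.ones(B, N, device=patches.device)
        mask.scatter_(1, keep_idx, 0.0)               # 1 = masked, 0 = visible
        loss = (((pred - patches) ** 2).mean(dim=-1) * mask).sum() / mask.sum()
        return loss, pred
```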
Questions
Why can't patches in ViT or MAE be overlapping?
The patches can be overlapping, but that comes at the cost of additional computation (more tokens to attend over), likely for only a minor improvement in performance, although overlapping patches would give more continuity across the token embeddings. The quick check below illustrates the token-count cost.
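A quick way to see the extra cost, assuming a 224x224 image, 16x16 patches, and a stride of 8 for the overlapping case (all illustrative choices):

```python
import torch
import torch.nn.functional as F

img = torch.randn(1, 3, 224, 224)

non_overlap = F.unfold(img, kernel_size=16, stride=16)   # stride == patch size
overlap = F.unfold(img, kernel_size=16, stride=8)        # patches overlap by 50%

print(non_overlap.shape[-1])  # 196 tokens (14 * 14)
print(overlap.shape[-1])      # 729 tokens (27 * 27), roughly 3.7x more tokens
# Self-attention cost grows quadratically with token count, so the compute
# increase is much larger than 3.7x.
```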
Why does a higher masking ratio seem to give better performance?
Images are spatially very redundant, so with a low masking ratio the missing patches can be filled in by simply interpolating from nearby visible patches, and the model learns little. Masking around 75% of the patches removes that shortcut and forces the model to capture more global, semantic structure, which yields better representations for downstream tasks; it also shrinks the encoder's input, making pre-training cheaper.
