PERT
PRE-TRAINING BERT WITH PERMUTED LANGUAGE MODEL
https://arxiv.org/abs/2203.06906
Pre-trained language models (PLMs) are widely used in various machine learning tasks (NLP and computer vision). This article discusses a new method for pre-training language models.
PLMs are classified into two categories based on their training protocols:
auto-encoding and auto-regressive.
Permuted Language Model (PerLM): In this architecture, a proportion of the input text is permuted, and the training objective is to predict the original position of each shuffled token. In other words, PerLM tries to recover the word order of a disordered sentence.
The pre-training task used in PERT, the Permuted Language Model (PerLM), works as follows:
- Masking strategy: whole-word and n-gram masking are used for selecting candidate tokens, with percentages of 40%, 30%, 20%, and 10% for word-level unigrams up to 4-grams.
- Randomly select 90% of the candidate tokens and shuffle their order. Note that the shuffle only takes place among these selected tokens, not over the whole input sequence.
- The remaining 10% of the candidate tokens are kept unchanged and treated as negative samples.
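The steps above can be sketched in plain Python. This is a hypothetical helper, not the authors' code: it uses simple token-level sampling instead of the paper's whole-word/n-gram selection, and the function name `perlm_permute` and its ratio parameters are assumptions for illustration.

```python
import random

def perlm_permute(tokens, select_ratio=0.15, shuffle_ratio=0.9, seed=0):
    """Sketch of PerLM input construction: select candidate positions,
    shuffle the order of 90% of them, keep 10% unchanged (negatives),
    and record the original position of the token now at each slot."""
    rng = random.Random(seed)
    n = len(tokens)
    # Select candidate positions (the paper selects via whole-word/n-gram
    # masking; plain token-level sampling is used here for simplicity).
    k = max(2, int(n * select_ratio))
    selected = sorted(rng.sample(range(n), k))
    # 90% of the selected positions get shuffled; the rest stay put.
    n_shuffle = max(2, int(len(selected) * shuffle_ratio))
    shuffled_slots = selected[:n_shuffle]
    order = shuffled_slots[:]
    while order == shuffled_slots:  # make sure the order actually changes
        rng.shuffle(order)
    permuted = list(tokens)
    for slot, src in zip(shuffled_slots, order):
        permuted[slot] = tokens[src]
    # Target: for each selected slot, the original position of its token;
    # the unchanged 10% simply point back at themselves.
    targets = {slot: src for slot, src in zip(shuffled_slots, order)}
    targets.update({slot: slot for slot in selected[n_shuffle:]})
    return permuted, targets
```

The targets are positions in the input sequence, not vocabulary ids, which is the key difference from MLM-style pre-training.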
* PerLM does not employ the artificial [MASK] token for masking purposes, which alleviates the pre-training/fine-tuning discrepancy issue (though it may still suffer from unnatural word orders).
* The prediction space for PerLM is the input sequence rather than the whole vocabulary, making it more computationally efficient than MLM.
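To make the efficiency point concrete, here is a minimal sketch (an assumption about the head's shape, not the paper's exact implementation) showing that a position-prediction head produces one logit per input position, so the softmax is over seq_len entries instead of the full vocabulary as in MLM. The function `position_logits`, the bilinear scoring, and the example sizes are all illustrative.

```python
import numpy as np

def position_logits(hidden, weight):
    """Score each position against every other position in the same
    input sequence via a bilinear form (an illustrative choice of head).
    hidden: (seq_len, d) token representations; weight: (d, d)."""
    return hidden @ weight @ hidden.T  # -> (seq_len, seq_len) logits

seq_len, vocab, d = 512, 30000, 768  # example sizes, not from the paper
rng = np.random.default_rng(0)
h = rng.standard_normal((seq_len, d))
w = rng.standard_normal((d, d))
logits = position_logits(h, w)
# PerLM softmax width (seq_len) vs. MLM softmax width (vocab size):
print(logits.shape[1], vocab)
```

With these example sizes, each prediction's softmax runs over 512 candidates rather than 30,000, which is where the computational saving over MLM comes from.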