PERT

PRE-TRAINING BERT WITH PERMUTED LANGUAGE MODEL
https://arxiv.org/abs/2203.06906

Aquib J. Khan
2 min read · Apr 4, 2022

Pre-trained language models (PLMs) are widely used across machine learning tasks, in both NLP and computer vision. This article discusses PERT, a new method for pre-training language models.

PLMs are commonly classified into two categories based on their training protocols: auto-encoding (e.g., BERT, trained with masked language modeling) and auto-regressive (e.g., GPT, trained with left-to-right next-token prediction).

Figure: high-level architecture of auto-encoding and auto-regressive models.

Permuted Language Model (PerLM): a proportion of the input tokens is permuted, and the training objective is to predict the position of the original token.

In other words, PerLM tries to recover the correct word order from a disordered sentence: for each shuffled position, the model predicts where its original word now appears in the input.
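As a toy illustration (a made-up sentence, not an example from the paper), here is one way to derive such position targets once two tokens have been swapped:

```python
# Hypothetical toy sentence: "quick" and "brown" have swapped places.
original = ["the", "quick", "brown", "fox", "jumps", "high"]
shuffled = ["the", "brown", "quick", "fox", "jumps", "high"]

# For every position whose token changed, the label is the index at which the
# original token now sits in the shuffled sequence (a position, not a vocab id).
targets = {
    i: shuffled.index(tok)
    for i, (tok, new_tok) in enumerate(zip(original, shuffled))
    if tok != new_tok
}
print(targets)  # {1: 2, 2: 1}
```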

Figure: the pre-training task used in PERT, the Permuted Language Model (PerLM). Source: https://arxiv.org/abs/2203.06906

  • Masking strategy: whole-word and n-gram masking are used to select candidate tokens, with proportions of 40%, 30%, 20%, and 10% for word-level unigrams through 4-grams.
  • Randomly select 90% of the candidate tokens and shuffle their order. Note that the shuffling takes place only among these selected tokens, not over the whole input sequence.
  • The remaining 10% of the candidate tokens are kept unchanged and treated as negative samples (see the sketch after this list).
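Below is a minimal Python sketch of this construction. It is a simplification under stated assumptions: candidates are selected per token rather than via the whole-word/n-gram scheme, and the overall ~15% selection ratio is an assumption borrowed from BERT-style masking, not a number given in this article.

```python
import random

def build_perlm_example(tokens, select_ratio=0.15, shuffle_ratio=0.9, seed=0):
    """Hypothetical, token-level sketch of PerLM input construction.

    Simplifications: candidates are picked per token (the paper uses whole-word
    and n-gram selection with 40/30/20/10% proportions for unigrams through
    4-grams), and the 15% selection ratio is an assumption, not from the paper.
    """
    rng = random.Random(seed)
    n = len(tokens)

    # 1. Select candidate positions.
    candidates = sorted(rng.sample(range(n), max(3, int(n * select_ratio))))

    # 2. Keep ~10% of the candidates unchanged (negative samples) and mark
    #    the remaining ~90% for shuffling.
    n_keep = max(1, round(len(candidates) * (1 - shuffle_ratio)))
    kept = set(rng.sample(candidates, n_keep))
    shuffled_pos = [p for p in candidates if p not in kept]

    # 3. Permute tokens among the marked slots. The paper shuffles them
    #    randomly; a one-step rotation is used here so the toy output
    #    always differs from the input.
    new_pos = shuffled_pos[1:] + shuffled_pos[:1]
    permuted = list(tokens)
    for src, dst in zip(shuffled_pos, new_pos):
        permuted[dst] = tokens[src]

    # 4. Training target for each shuffled slot: the index where its original
    #    token now sits in the permuted sequence (a position, not a vocab id).
    targets = dict(zip(shuffled_pos, new_pos))
    return permuted, targets

sentence = ("pre trained language models are widely used in many natural "
            "language processing and computer vision tasks today").split()
permuted, targets = build_perlm_example(sentence)
print(" ".join(permuted))
print(targets)
```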

* PerLM does not employ the artificial [MASK] token, which alleviates the pre-training/fine-tuning discrepancy issue (though it could still suffer from unnatural word orders).

* The prediction space of PerLM is the input sequence rather than the whole vocabulary, making it more computationally efficient than MLM.
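To see why this matters computationally, here is a rough sketch (PyTorch, with assumed BERT-base-like sizes; the position-scoring head shown is one plausible formulation, not necessarily the exact head used in PERT). An MLM head scores every position against the full vocabulary, while a sequence-position head only scores against the positions of the same input.

```python
import torch

seq_len, hidden, vocab = 512, 768, 30522   # assumed BERT-base-like dimensions

h = torch.randn(seq_len, hidden)           # encoder output for one sequence
W_vocab = torch.randn(hidden, vocab)       # MLM output projection

# MLM: every position is classified over the whole vocabulary.
mlm_logits = h @ W_vocab                   # shape (512, 30522)

# PerLM-style position prediction: each position is scored against the other
# positions of the same sequence, so the label space is only seq_len wide.
perlm_logits = h @ h.T                     # shape (512, 512)

print(mlm_logits.shape, perlm_logits.shape)
```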
