PERT

Pre-trained language models (PLMs) are widely used in various machine learning tasks, in both NLP and computer vision. This article talks about PERT, a new method for pre-training language models.

PLMs are classified into two categories based on their training protocols:
auto-encoding and auto-regressive.

High-level architecture of both auto-encoding and auto-regressive models
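
To make the distinction concrete, here is a minimal sketch (not from the paper) of how training examples are built under the two protocols, using a plain token list. The function names and probabilities are illustrative assumptions.

```python
import random

tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Auto-encoding (BERT-style MLM): corrupt some inputs, predict the originals.
def masked_lm_example(tokens, mask_prob=0.15, mask_token="[MASK]"):
    inputs, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            inputs.append(mask_token)
            targets.append(tok)      # model must recover the original token
        else:
            inputs.append(tok)
            targets.append(None)     # no loss on uncorrupted positions
    return inputs, targets

# Auto-regressive (GPT-style causal LM): predict each token from its left context.
def causal_lm_example(tokens):
    inputs = tokens[:-1]             # context
    targets = tokens[1:]             # next-token targets, shifted by one
    return inputs, targets

print(masked_lm_example(tokens))
print(causal_lm_example(tokens))
```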

Permuted Language Model (PerLM): In this pre-training task, a proportion of the input text is permuted, and the training objective is to predict the position of the original token.

The proposed PerLM tries to recover the word order of a disordered sentence, and the objective is to predict the position of the original word.
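
A small hypothetical example (not taken from the paper) of what the inputs and targets look like when two tokens are swapped; the targets are indices into the input sequence:

```python
# Original sentence, with positions shown for reference:
#   0:the  1:cat  2:sat  3:on  4:the  5:mat
# Suppose "cat" and "sat" are selected and swapped, producing the input:
#   0:the  1:sat  2:cat  3:on  4:the  5:mat
# At the two permuted positions the model predicts positions in the input
# sequence (1 -> 2 and 2 -> 1), not ids from the vocabulary.
permuted_input = ["the", "sat", "cat", "on", "the", "mat"]
target_positions = {1: 2, 2: 1}
```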

Paper: PERT: Pre-training BERT with Permuted Language Model (https://arxiv.org/abs/2203.06906)

The pre-training task used in PERT, the Permuted Language Model (PerLM), works as follows:

  • The masking strategy: whole-word and n-gram masking strategies are used to select candidate tokens, with percentages of 40%, 30%, 20%, and 10% for word-level unigrams through 4-grams.
  • Randomly select 90% of the candidate tokens and shuffle their order. Note that the shuffling only takes place among these selected tokens, not across the whole input sequence.
  • The remaining 10% of the candidate tokens are kept unchanged and treated as negative samples (a sketch of the full procedure follows this list).
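
Taken together, these steps can be sketched in a few lines of Python. Everything below is an illustrative approximation rather than the authors' implementation: candidate selection is plain random sampling instead of the whole-word/n-gram strategy, and the position targets follow one plausible reading of the objective (each permuted position is labelled with the original position of the token now occupying it). Note that no [MASK] token appears anywhere; the input is only rearranged.

```python
import random

def build_perlm_example(tokens, select_prob=0.15, shuffle_frac=0.9):
    """Illustrative PerLM input construction (simplified)."""
    n = len(tokens)
    # 1) Select candidate positions (stand-in for whole-word / n-gram masking).
    candidates = [i for i in range(n) if random.random() < select_prob]
    # 2) Shuffle the order of 90% of the candidates; the other 10% are kept
    #    unchanged as negative samples. Non-candidate positions are untouched.
    k = int(round(len(candidates) * shuffle_frac))
    shuffled = random.sample(candidates, k)
    destinations = shuffled[:]
    random.shuffle(destinations)
    inputs = list(tokens)
    targets = [None] * n                       # None = no loss at this position
    for src, dst in zip(shuffled, destinations):
        inputs[dst] = tokens[src]
        targets[dst] = src                     # label: original position
    for i in set(candidates) - set(shuffled):  # negative samples
        targets[i] = i                         # token is already in place
    return inputs, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
print(build_perlm_example(tokens, select_prob=0.4))
```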

* PerLM does not employ the artificial token [MASK] for masking purposes, which alleviates the pretraining-finetuning discrepancy issue (but could still suffer from unnatural word orders).

* The prediction space for PerLM is the input sequence rather than the whole vocabulary, making it more computationally efficient than MLM.
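
One way to see the difference in prediction space is to compare the output heads. The sketch below assumes a pointer-style head that scores each position against every input position; this is an assumption about how a sequence-sized prediction space could be realized, not a claim about PERT's exact head.

```python
import torch

L, H, V = 128, 768, 30000   # sequence length, hidden size, vocabulary size

hidden = torch.randn(1, L, H)            # encoder outputs for one sequence

# MLM-style head: project every position onto the full vocabulary.
mlm_head = torch.nn.Linear(H, V)
mlm_logits = mlm_head(hidden)            # shape (1, L, V) -- a 30k-way softmax

# Pointer-style position head (illustrative): score each position against
# every input position instead of the vocabulary.
query = torch.nn.Linear(H, H)
pos_logits = query(hidden) @ hidden.transpose(1, 2)   # shape (1, L, L)

print(mlm_logits.shape, pos_logits.shape)  # the position head is L-way, not V-way
```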
