
SQuAD and GLUE are tasks for language representation models -- aka BERT-like. This is a language generation model -- GPT-like. Hence, the SQuAD/GLUE test sets are not really applicable. We are reporting on the WikiText and LAMBADA sets that OpenAI also uses for similar models (numbers are in the blog post).
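
For anyone curious what that kind of evaluation looks like in practice, here is a minimal sketch of a WikiText-style perplexity check. It uses the public GPT-2 from the Hugging Face transformers library as a stand-in (my example, not the model or evaluation code from the post):

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    text = "The quick brown fox jumps over the lazy dog."
    inputs = tokenizer(text, return_tensors="pt")

    with torch.no_grad():
        # Passing labels makes the model return the average next-token
        # cross-entropy loss; exponentiating it gives perplexity.
        loss = model(**inputs, labels=inputs["input_ids"]).loss

    print(torch.exp(loss).item())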


What's the difference between the two models?


* BERT & language representation models: They basically turn a sentence into a compact vector that represents it, so you can then do some downstream task on it such as sentiment detection, or measuring the similarity between two sentences, etc.

* GPT & language generation models: Given some context (say a sentence), they can generate text to complete it, or to summarize it, etc. The task here is to actually write something.


Both are language representation models; text generation is just a way of training the model. BERT is also trained on a text generation task: it is asked to fill in gaps in the text (15% of tokens are masked during training).
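
To make the fill-in-the-gaps objective concrete, here is a rough sketch using the bert-base-uncased checkpoint from the transformers library (my own illustration, not from the thread):

    import torch
    from transformers import BertTokenizer, BertForMaskedLM

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForMaskedLM.from_pretrained("bert-base-uncased")
    model.eval()

    text = "The capital of France is [MASK]."
    inputs = tokenizer(text, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, n_tokens, vocab_size)

    # Find the masked position and take the highest-scoring token for it.
    mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    predicted_id = logits[0, mask_index].argmax(dim=-1).item()
    print(tokenizer.decode([predicted_id]))  # likely "paris"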


Maybe I am not understanding your point.

Out of the box, given a sequence of n tokens, BERT returns a tensor of dimension (n_tokens, hidden_size) [1], where hidden_size has no relationship with the vocabulary. You can then fine-tune a model on this representation to do various tasks, e.g. sentiment classification. Thus BERT is said to be a language representation model.

Out of the box, given a sequence, GPT-2 returns a distribution over the vocabulary [2], from which you can sample (or take the argmax) to pick the next word. Thus GPT-2 is said to be a language generation model.
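
A quick way to see the difference in the two "out of the box" outputs -- a sketch using the transformers library rather than the repos linked below, so the shapes are illustrative rather than exact:

    import torch
    from transformers import BertTokenizer, BertModel, GPT2Tokenizer, GPT2LMHeadModel

    text = "Megatron is a large transformer"

    # BERT: one hidden vector per token, no direct tie to the vocabulary size.
    bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
    bert = BertModel.from_pretrained("bert-base-uncased")
    with torch.no_grad():
        bert_out = bert(**bert_tok(text, return_tensors="pt"))
    print(bert_out.last_hidden_state.shape)  # (1, n_tokens, hidden_size=768)

    # GPT-2: a score for every vocabulary entry at each position, from which
    # you can sample (or argmax) the next word.
    gpt_tok = GPT2Tokenizer.from_pretrained("gpt2")
    gpt = GPT2LMHeadModel.from_pretrained("gpt2")
    with torch.no_grad():
        gpt_out = gpt(**gpt_tok(text, return_tensors="pt"))
    print(gpt_out.logits.shape)              # (1, n_tokens, vocab_size=50257)
    next_id = gpt_out.logits[0, -1].argmax().item()
    print(gpt_tok.decode([next_id]))         # most likely next token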

You could of course play with BERT's mask token and call it recursively to force BERT to generate something, and you could chop off some layers of GPT-2 to get a representation of your input sequence, but I think that is a little past the original question.

[1] https://github.com/google-research/bert/blob/master/modeling...

[2] https://github.com/openai/gpt-2/blob/master/src/model.py#L17...


> BERT returns a tensor of dimension (n_tokens, hidden_size) [1], where hidden_size has no relationship with the vocabulary

"BERT returns" is ambiguous here. During pretraining last layer is loggits for one hot vocab vector, the same as in GPT: https://github.com/google-research/bert/blob/master/run_pret...


One is a language generation model, the other is a fill-in-the-blank model. It sounds like they might be similar, but in practice the objectives are different enough (in particular because of the "bi-directional" aspect of BERT-type models) that the models learn different things.



