
SQuAD and GLUE are tasks for language representation models -- aka BERT-like. This is a language generation model -- GPT-like. Hence, the SQuAD/GLUE test sets are not really applicable. We are reporting on the WikiText and LAMBADA sets that OpenAI also uses for similar models (numbers are in the blog post).
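
For anyone curious what that kind of evaluation looks like in practice, here is a minimal sketch of a WikiText-style perplexity check. It uses the public GPT-2 from the Hugging Face transformers library as a stand-in (my example, not the model or evaluation code from the post):

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    text = "The quick brown fox jumps over the lazy dog."
    inputs = tokenizer(text, return_tensors="pt")

    with torch.no_grad():
        # Passing labels makes the model return the average next-token
        # cross-entropy loss; exponentiating it gives perplexity.
        loss = model(**inputs, labels=inputs["input_ids"]).loss

    print(torch.exp(loss).item())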


What's the difference between the two models?


* BERT & language representation models: They basically turn a sentence into a compact vector that represents it, so you can then do some downstream task on it such as sentiment detection, or measuring the similarity between two sentences, etc.

* GPT & language generation models: Given some context (say a sentence), they can generate text to complete it, or to summarize it, etc. The task here is to actually write something.


Both are language representation models; text generation is just a way of training the model. BERT is also trained on a text generation task: it is asked to fill in gaps in the text (15% of tokens are masked during training).
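
To make the fill-in-the-gaps objective concrete, here is a rough sketch using the bert-base-uncased checkpoint from the transformers library (my own illustration, not from the thread):

    import torch
    from transformers import BertTokenizer, BertForMaskedLM

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForMaskedLM.from_pretrained("bert-base-uncased")
    model.eval()

    text = "The capital of France is [MASK]."
    inputs = tokenizer(text, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, n_tokens, vocab_size)

    # Find the masked position and take the highest-scoring token for it.
    mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    predicted_id = logits[0, mask_index].argmax(dim=-1).item()
    print(tokenizer.decode([predicted_id]))  # likely "paris"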


Maybe I am not understanding your point.

Out of the box, given a sequence of n tokens, BERT returns a tensor of dimension (n_tokens, hidden_size) [1], where hidden_size has no relationship with the vocabulary. You can then fine-tune a model on this representation to do various tasks, e.g. sentiment classification. Thus BERT is said to be a language representation model.

Out of the box, given a sequence, GPT-2 returns a distribution over the vocabulary [2], from which you can sample (or take the argmax) to pick the next word. Thus GPT-2 is said to be a language generation model.
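
A quick way to see the difference in the two "out of the box" outputs -- a sketch using the transformers library rather than the repos linked below, so the shapes are illustrative rather than exact:

    import torch
    from transformers import BertTokenizer, BertModel, GPT2Tokenizer, GPT2LMHeadModel

    text = "Megatron is a large transformer"

    # BERT: one hidden vector per token, no direct tie to the vocabulary size.
    bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
    bert = BertModel.from_pretrained("bert-base-uncased")
    with torch.no_grad():
        bert_out = bert(**bert_tok(text, return_tensors="pt"))
    print(bert_out.last_hidden_state.shape)  # (1, n_tokens, hidden_size=768)

    # GPT-2: a score for every vocabulary entry at each position, from which
    # you can sample (or argmax) the next word.
    gpt_tok = GPT2Tokenizer.from_pretrained("gpt2")
    gpt = GPT2LMHeadModel.from_pretrained("gpt2")
    with torch.no_grad():
        gpt_out = gpt(**gpt_tok(text, return_tensors="pt"))
    print(gpt_out.logits.shape)              # (1, n_tokens, vocab_size=50257)
    next_id = gpt_out.logits[0, -1].argmax().item()
    print(gpt_tok.decode([next_id]))         # most likely next token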

You could of course play with BERT's mask token and call it recursively to force BERT to generate something, and you could chop off some layers of GPT-2 to get a representation of your input sequence, but I think that is a little past the original question.

[1] https://github.com/google-research/bert/blob/master/modeling...

[2] https://github.com/openai/gpt-2/blob/master/src/model.py#L17...


> BERT returns a tensor of dimension (n_tokens, hidden_size) [1], where hidden_size has no relationship with the vocabulary

"BERT returns" is ambiguous here. During pretraining last layer is loggits for one hot vocab vector, the same as in GPT: https://github.com/google-research/bert/blob/master/run_pret...


One is a language generation model, the other is a fill-in-the-blank model. It sounds like they might be similar, but in practice the objectives are different enough (in particular because of the "bi-directional" aspect of BERT-type models) that the models learn different things.



