A lot of GPT-4's test performance came from doing hundreds of runs and taking the most common answer (on multiple choice/fill in the blank).

That does speak to the gains you can get from orchestrating multiple runs, even with something as simple as taking the majority. I'm assuming the multiple choice setup let it think in a scratch pad before answering or something, since taking multiple runs of a single next character (A, B, C, or D) would probably be similar to just lowering the temperature and taking one measurement.
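To make that last point concrete, here is a rough Python sketch of majority voting over repeated runs. Everything in it is an assumption for illustration: `sample_answer` is a hypothetical stand-in for one model run that emits a single answer letter, and the probabilities are made up, not measured from any model. With enough runs the vote settles on the most likely letter, which is the same letter greedy (temperature 0) decoding would pick in one shot.

```python
import random
from collections import Counter


def majority_vote(answers):
    """Return the most common answer; ties go to the one seen first."""
    return Counter(answers).most_common(1)[0][0]


def sample_answer(rng, probs):
    """Hypothetical stand-in for one model run: draw one letter from `probs`."""
    letters = list(probs)
    weights = [probs[letter] for letter in letters]
    return rng.choices(letters, weights=weights, k=1)[0]


def voted_answer(rng, probs, n_runs):
    """Sample `n_runs` single-letter answers and take the majority."""
    return majority_vote(sample_answer(rng, probs) for _ in range(n_runs))


def greedy_answer(probs):
    """What temperature 0 would pick: the single most likely letter."""
    return max(probs, key=probs.get)


if __name__ == "__main__":
    # Made-up next-token distribution over the four answer letters.
    probs = {"A": 0.5, "B": 0.3, "C": 0.15, "D": 0.05}
    rng = random.Random(0)
    print("majority of 2001 runs:", voted_answer(rng, probs, 2001))
    print("greedy pick:", greedy_answer(probs))
```

If the model instead reasons in a scratch pad first, each run can land on a different final answer for different reasons, so the vote is aggregating over reasoning paths rather than just re-estimating one token distribution. That is the case where voting should help beyond what lowering the temperature gives you.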