Because the AIs, at least right now, can't generate or change code so that it correctly does what's expected, with the level of confidence we require. I've tried to make it happen, and it just doesn't. As long as that's true, we'll need some way to get the correctness to where it needs to be, and that's going to require a person.
A lot of people have already figured out some tricks for improving code generation.
You can fairly easily augment the “next token” choice with a syntax-check filter. LLMs like ChatGPT produce a ranked list of “likely” candidates, not a single perfect choice, so you can mechanically filter the top-n candidates for syntactic validity. This alone improves output a lot.
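A simplified sketch of that filter, checking whole-line candidates with Python's ast module (a real decoder would work token by token with an incremental parser; the function and variable names here are made up for illustration):

```python
import ast

def filter_syntactically_valid(prefix: str, candidates: list[str]) -> list[str]:
    """Keep only candidate continuations that leave the code parseable.

    `prefix` is the code generated so far; `candidates` are the model's
    top-n proposed continuations.
    """
    valid = []
    for cand in candidates:
        try:
            ast.parse(prefix + cand)  # raises SyntaxError on invalid code
            valid.append(cand)
        except SyntaxError:
            pass
    return valid

# The second candidate is syntactically broken and gets filtered out.
prefix = "def add(a, b):\n"
candidates = ["    return a + b\n", "    return a +* b\n"]
print(filter_syntactically_valid(prefix, candidates))  # ['    return a + b\n']
```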
Similarly, backtracking can be used to fix larger semantic errors: if a completed block fails some check, roll back to an earlier point in the generation and try the next-best candidate instead.
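A rough sketch of that backtracking search, with hypothetical propose() and is_valid() callables standing in for the model's top-n proposals and whatever mechanical check is available:

```python
def generate_with_backtracking(prefix, propose, is_valid, depth):
    """Depth-first search over candidate continuations.

    propose(prefix) stands in for the model returning its top-n continuations;
    is_valid(code) is any mechanical check (parser, type checker, test run).
    If every candidate at some step leads to a dead end, we return None and
    the caller moves on to its next candidate instead of committing to a bad path.
    """
    if depth == 0:
        return prefix if is_valid(prefix) else None
    for cand in propose(prefix):
        result = generate_with_backtracking(prefix + cand, propose, is_valid, depth - 1)
        if result is not None:
            return result
    return None  # nothing at this level worked; backtrack
```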
Last but not least, any scenario where a test case is available can be used to iterate the LLM on the same problem automatically until it gets it right. For example, feed it the compiler's error messages until it has fixed the remaining errors.
This will guarantee output that compiles, but it may still be the wrong solution.
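A minimal sketch of that compile-and-retry loop, assuming a hypothetical llm_complete() call in place of a real API client and Python's built-in compile() standing in for an actual compiler:

```python
def llm_complete(prompt: str) -> str:
    """Hypothetical stand-in for an LLM API call; plug in a real client here."""
    raise NotImplementedError

def iterate_until_it_compiles(task: str, max_rounds: int = 5) -> str | None:
    """Re-prompt the model with its own compiler errors until the code compiles."""
    prompt = task
    for _ in range(max_rounds):
        code = llm_complete(prompt)
        try:
            compile(code, "<generated>", "exec")  # stand-in for a real compiler
            return code  # it compiles -- which still doesn't mean it's correct
        except SyntaxError as err:
            # Feed the error back and ask for a fix.
            prompt = (f"{task}\n\nYour previous attempt:\n{code}\n\n"
                      f"It failed to compile with:\n{err}\n\nPlease fix it.")
    return None  # give up after max_rounds attempts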
As the LLMs get smarter, they will do better. They can also be fine-tuned for specific problems automatically, because the labels are available for free: we can easily determine whether a piece of code compiles, or whether it makes a unit test pass.
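For example, a fine-tuning label can be produced with no human in the loop by just running the candidate against a unit test. A rough sketch (a real pipeline would sandbox the execution):

```python
import os
import subprocess
import sys
import tempfile

def auto_label(code: str, test_code: str) -> bool:
    """Automatic fine-tuning label: True if the generated code passes its test."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + test_code + "\n")
        path = f.name
    try:
        # Run candidate + test in a subprocess; the exit code becomes the label.
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=10)
        return result.returncode == 0
    finally:
        os.unlink(path)

# Example: label a candidate implementation against a trivial assertion.
print(auto_label("def add(a, b):\n    return a + b", "assert add(2, 2) == 4"))  # True
```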
Currently ChatGPT isn't, at least via public access, hooked up to a compiler or interpreter into which it can feed the code it generates to check whether it executes as expected. That wouldn't even seem particularly difficult to do, and once it is, ChatGPT would literally be able to train itself how to get the desired result.
Precisely. I think people should consider the "v4" in "ChatGPT 4" as more like "0.4 alpha".
We're very much in the "early days" of experimenting with how LLMs can be effectively used. The API restrictions enforced by OpenAI are preventing entire categories of use-cases from being tested.
Expect to see fine-tuned versions of LLaMA run circles around ChatGPT once people start hooking them up like this.