I've been trying opencode a bit with gemini pro (and claude via those) with a rust project, and I have a push pre-hook to cargo check the code.
The amount of times I have to "yell" at the llm for adding #[allow] statements to silence the linter instead of fixing the code is crazy and when I point it out they go "Oops, you caught me, let me fix it the proper way".
So the tests don't necessarily make them produce proper code.
I was doing a somewhat elaborate form/graph in Google Worksheets, had to translate a bunch of cells from English to Spanish, and said "Why not use Gemini for this easy, grunt work? They tend to output good translations".
I spent 20 minutes between guiding it because it was putting the translation in the wrong cells, asking it not to convert the cells to a fancy table, and finally, convincing it that it really had access to alter the document, because at some point it denied it. I wasn't being rude, but it seems I somehow made it evasive.
I had to ask it to translate in the chat, and manually copy-pasted the translations in the proper cells myself. Bonus points because it only translated like ten cells at a time, and truncated the reply with a "More cells translated" message.
I can't imagine how hard it would be to handhold an LLM while working in a complex code base. I guess they are a godsend for prototypes and proofs of concept, but they can't beat a competent engineer yet. It's like that joke where a student answers that 2+2=5, and when questioned, he replies that his strength is speed, not accuracy.
This is one of those places I feel like they're trying to do too much with the LLMs and I think this is one of those places where there's "a bubble". I feel like the LLMs are text tools, so trying to take them out of their domain and force them somewhere else you're going to have problems.
Anyways, I replied because I had something else I wanted to say.
I was using Gemini in a google worksheet a while back. I had to cross reference a website and input stuff into a cell. I got Gemini to do it, had it do the first row, then the second, then I told it to do a batch of 10, then 20. It had a hiccup at 20, would take too long I guess. So I had it go back to 10. But then Gemini tells me it can't read my worksheet. I convince it that it can, but then it tells me it can't edit my worksheet. I argue with it, "you've been changing the worksheet wtf?" I convinced it that it could and it started again, but then after doing a couple it told me it couldn't again. We went back and forth a bit, I'd get it working, it would break, repeat. I think it was after the third time I just couldn't get it to do it again.
I looked up the docs, searched online, and I was concerned that I found Google didn't allow Gemini to do a lot of stuff to worksheets/docs/other google workspace stuff. They said they didn't allow it to do a ton of stuff that I definitely had Gemini doing.
Then a week or two went by and google announced they're allowing gemini to directly edit worksheets.
So wtf how did I get it to do it before it could do it???
I added a bunch of lines telling it to never do that in CLAUDE.md and it worked flawlessly.
So I have a different experience with Claude Code, but I'm not trying to say you're holding it wrong, just adding a data point, and then, maybe I got lucky.
The amount of times I have to "yell" at the llm for adding #[allow] statements to silence the linter instead of fixing the code is crazy and when I point it out they go "Oops, you caught me, let me fix it the proper way".
So the tests don't necessarily make them produce proper code.