Hacker News | visarga's comments

The prefix-tuning approach was largely abandoned in favor of LoRA. Whether you tune a prefix or some adapter layers, the process is the same, but LoRAs are more flexible to train.
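For a concrete sense of the LoRA route, here is a minimal sketch using the Hugging Face transformers and peft libraries; the base model and the hyperparameters are just illustrative choices, not a recommendation:

    # Sketch: attach LoRA adapters instead of tuning a prompt prefix.
    # Assumes `pip install transformers peft`; gpt2 and the ranks are illustrative.
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    base = AutoModelForCausalLM.from_pretrained("gpt2")

    config = LoraConfig(
        r=8,                        # low-rank dimension of the adapters
        lora_alpha=16,              # scaling factor
        target_modules=["c_attn"],  # which layers get adapters (model specific)
        lora_dropout=0.05,
    )

    model = get_peft_model(base, config)
    model.print_trainable_parameters()  # only the adapter weights are trainable

The base model stays frozen either way; the flexibility comes from choosing which layers get adapters and from swapping or merging trained LoRAs later.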

The Skills concept emerged naturally from watching how coding agents use docs, CLI tools and code. Their advantage is that they can be edited on the fly to incorporate new information, and they can learn from any feedback source - human, code execution, web search or LLMs.


If we did that it would be much more expensive; keeping all weights in SRAM is what Groq does, for example.

At the end of the day, you either validate every line of code manually, or you have the agent write tests. Automate your review.

I would just put a PR_REVIEW.md file in the repo and have a CI agent run it against the diff/repo and decide pass or reject. The file holds the rules the code must be evaluated against - project-level policy, the constraints you cannot check with code tests. Of course, any constraint that can be a code test is better off as a code test.
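A rough sketch of what that gate could look like, assuming the OpenAI Python client and a plain git diff; the model name, the prompt wording and the PASS/REJECT convention are my assumptions, not a fixed recipe:

    # Sketch of a CI gate: feed the PR_REVIEW.md rules plus the diff to an LLM
    # and fail the job on anything other than PASS. Model and output format assumed.
    import subprocess
    import sys

    from openai import OpenAI

    rules = open("PR_REVIEW.md").read()
    diff = subprocess.run(
        ["git", "diff", "origin/main...HEAD"],
        capture_output=True, text=True,
    ).stdout

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Review the diff against these rules. Answer PASS or "
                        "REJECT on the first line, then give reasons.\n\n" + rules},
            {"role": "user", "content": diff},
        ],
    )

    verdict = resp.choices[0].message.content
    print(verdict)
    sys.exit(0 if verdict.strip().upper().startswith("PASS") else 1)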

My experience is that you can trust any code that is well tested, human or AI generated, and you cannot trust any code that is not well tested (what I call "vibe tested"). But some constraints need to be stated in natural language, and for those you need an LLM to review the PRs. This combination of code tests and LLM review should be able to ensure reliable AI coding. If it does not, iterate on your PR rules and on your tests.


You should take into consideration the time it took to write those 9200 tests originally. If you have good test coverage, the agent can go much farther on its own.

Heh, I mostly use AI in the opposite direction to write tests because:

1. That’s the part of development work I hate the most and the one that never really clicked with me

2. AI up to this point seems to be better at writing tests than code

Take this with the grain of salt that:

1. I suck

2. My work is mostly in the realm of infrastructure where testing has always been weird and a little dumb


AI has become very good at writing pointless and bad tests, at least. It remains difficult to compel it to write good tests consistently.

But even if it wrote great tests every time, the trouble is that testing was designed around the idea of "double entry accounting". Even great tests can test the wrong thing. In the old world you would write a test case and then implement something to satisfy the same. If both sides of the ledger agree, so to speak, you can be pretty confident that both are correct. — In other words, going through the process of implementation gives an opportunity to make sure the test you wrote isn't ill-conceived or broken itself. If you only write the tests, or only write the implementation, or write none of it, there is no point at which you can validate your work.

If you have already built up an application and are reusing its test suite to reimplement the software in another language, like above, that is one thing, but in greenfield work it remains an outstanding problem of how to validate the work when you start to involve AI agents. Another article posted here recently suggests that we can go back to manual testing to validate the work... But that seems like a non-solution.


Every error is a signal that you need better tests. You can let the LLM create a test for every error it stumbles into, on top of all the regular tests it can write on its own. Add every test scenario you can think of, since you are not implementing them by hand. A bad test is invalidated by the code, and bad code is invalidated by the tests, so between them the AI agent can become reliable.
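For example, every failure the agent hits can be frozen into a regression test before the fix lands. A minimal pytest sketch, where parse_price and its module are hypothetical stand-ins for whatever actually broke:

    # Sketch: regression tests captured from errors the agent ran into.
    # `parse_price` and `myproject.pricing` are hypothetical names.
    import pytest

    from myproject.pricing import parse_price


    def test_parse_price_handles_thousands_separator():
        # The agent crashed on "1,299.00" during a run; keep that case forever.
        assert parse_price("1,299.00") == 1299.00


    def test_parse_price_rejects_garbage():
        with pytest.raises(ValueError):
            parse_price("not a price")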

Of course there is - if you write good tests, they compress your validation work, and stand in for your experience. Write tests with AI, but validate their quality and coverage yourself.

I think the whole discussion about coding agent reliability is missing the elephant in the room - it is not vibe coding, but vibe testing. That is when you run the code a few times and say LGTM - the best recipe for shooting yourself in the foot, no matter whether the code was hand written or made with AI. Just put the screws to the agent and let it handle a heavy test harness.


This is a very good point. However, the risk of writing bad or non-extensive tests is still there if you don’t know what good looks like! The grind will still need to happen, but it will be a different way of gaining experience.

Starting to get it!

New skills, not no skills.

There will still be a wide spectrum of people that actually understand the stack - and don’t - and no matter how much easier or harder the tools get, those people aren’t going anywhere.


You 100% need to test work done by AI. If it's code, it needs to pass extensive tests; if it's just a question answered, it needs to be the common conclusion of multiple independent agents. You can trust a single AI about as much as a HN or reddit comment, but you can trust a committee of 4 about as much as a real expert.

More generally, I think testing AI output with its own web search, code execution and ensembling is the missing ingredient for increased usage. We need to define the opposite of AI work - what validates it. This is hard, but once it is done you can trust the system, and it becomes cheaper to change.
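A toy sketch of the committee idea, again using the OpenAI client for concreteness; the model, the sample count of 4 and the exact-match vote are assumptions, and a real setup would compare answers more robustly:

    # Sketch: sample the same question from N independent runs and keep the
    # answer only if a clear majority agrees. Model and N=4 are illustrative.
    from collections import Counter

    from openai import OpenAI

    client = OpenAI()


    def committee_answer(question: str, n: int = 4) -> str | None:
        answers = []
        for _ in range(n):
            resp = client.chat.completions.create(
                model="gpt-4o",
                temperature=1.0,  # keep the samples reasonably independent
                messages=[{"role": "user",
                           "content": question + "\nAnswer with a single short phrase."}],
            )
            answers.append(resp.choices[0].message.content.strip().lower())
        best, count = Counter(answers).most_common(1)[0]
        return best if count > n // 2 else None  # no consensus -> do not trust it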


The stochastic parrot framing makes some assumptions, one of them being that LLMs generate from minimal input prompts, like "tell me about Transformers" or "draw a cute dog". But when the input provides substantial entropy or novelty, the output will not look like any training data. Longer sessions with multiple rounds of messages also drift out of distribution. The model is doing work outside its training distribution.

It's like saying pianos are not creative because they don't make music on their own. Well, yes, you have to play the keys to hear the music, and transformers are no exception. You need to put in your unique magic input to get something new and useful out.


I don't think it will become well rounded, because that is not cost efficient. Intelligence is sensitive to cost; cost is the core constraint shaping it. Any action has a cost - energy, materials, time, opportunity or social. Intelligence is solving the cost equation; if we can't solve it, we die. Cost is also why we specialize: in a group we can offload some intelligence to others. LLMs have their own costs too, and are shaped by them into a kind of jagged intelligence - they are no spherical cows either.

Coding agents are going to get better and be used everywhere, so why train for the artisanal coding style of 2010 when you are closer to 2030? What you need to know is how to break complex projects into small parts, improve testing, organize work, and understand the typical problems and capabilities of agents. In the future no employer is going to have the patience for you to code manually.
