
With a background like that you should be doing machine learning if you want to combine science and software. And climate tech is a burgeoning field.

I see many relevant openings in https://www.climatetechlist.com/jobs


They can charge companies more, since they drive more miles, and profit from it.

Don't you mean testing the interface of the implementation? I see nothing wrong with that, if so.

They mean the dependencies. If you’re testing system A whose sole purpose is to call functions in systems B and C, one approach is to replace B and C with mocks. The test simply checks that A calls the right functions.

The pain comes when system B changes. Oftentimes you can’t even make a benign change (like renaming a function) without updating a million tests.
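
To make that concrete, here is a tiny Python sketch with unittest.mock; process_order, fetch_order, and ship are made-up names standing in for A, B, and C:

    # Minimal sketch (hypothetical names): "system A" just orchestrates B and C.
    from unittest.mock import Mock

    def process_order(b, c, order_id):        # "system A"
        record = b.fetch_order(order_id)      # call into "system B"
        c.ship(record)                        # call into "system C"

    def test_process_order_calls_dependencies():
        b, c = Mock(), Mock()
        process_order(b, c, 42)
        # The test only asserts that A made the right calls...
        b.fetch_order.assert_called_once_with(42)
        c.ship.assert_called_once_with(b.fetch_order.return_value)
        # ...so renaming B's fetch_order later breaks every test written this way.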


Tests are only concerned with the user interface, not the implementation. If System B changes, you only have to change the parts of your implementation that use System B. The user interface remains the same, so the tests can remain the same, and therefore so can the mocks.

I think we’re in agreement. Mocks are usually all about reaching inside the implementation and checking things. I prefer highly accurate “fakes” - for example running queries against a real ephemeral Postgres instance in a Docker container instead of mocking out every SQL query and checking that query.Execute was called with the correct arguments.
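
A rough sketch of that style with testcontainers-python and SQLAlchemy (assumes Docker, a Postgres driver, and both libraries are available; the users table is made up for illustration):

    # Real ephemeral Postgres instead of mocked query.Execute calls.
    import sqlalchemy
    from testcontainers.postgres import PostgresContainer

    def test_insert_and_read_back():
        with PostgresContainer("postgres:16") as pg:   # throwaway container
            engine = sqlalchemy.create_engine(pg.get_connection_url())
            with engine.begin() as conn:
                conn.execute(sqlalchemy.text(
                    "CREATE TABLE users (id serial PRIMARY KEY, name text NOT NULL)"))
                conn.execute(sqlalchemy.text(
                    "INSERT INTO users (name) VALUES (:name)"), {"name": "ada"})
                names = conn.execute(
                    sqlalchemy.text("SELECT name FROM users")).scalars().all()
            assert names == ["ada"]   # real SQL against a real database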

> Mocks are usually all about reaching inside the implementation and checking things.

Unfortunately there is no consistency in the nomenclature used around testing. Testing is, after all, the least understood aspect of computer science. However, the dictionary suggests that a "mock" is something that is not authentic, but does not deceive (i.e. not the real thing, but behaves like the real thing). That is what I consider a "mock", but I'm gathering that is what you call a "fake".

Sticking with your example, a mock data provider to me is something that, for example, uses in-memory data structures instead of SQL, tested with the same test suite as the SQL implementation. It is not the datastore intended to be used, but it behaves the same way (as proven by the shared tests).
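
Concretely, the shared-suite idea might look like this with pytest (the UserStore interface and InMemoryUserStore are hypothetical; the real SQL-backed store would be wired in as a second fixture param):

    import pytest

    class InMemoryUserStore:                # the "mock"/fake: plain dict, no SQL
        def __init__(self):
            self._users = {}
        def add(self, user_id, name):
            self._users[user_id] = name
        def get(self, user_id):
            return self._users.get(user_id)

    @pytest.fixture(params=["memory"])      # add "postgres" here to prove parity
    def store(request):
        if request.param == "memory":
            return InMemoryUserStore()
        raise NotImplementedError(request.param)  # e.g. build the real SQL store

    def test_add_then_get(store):           # same suite, every implementation
        store.add(1, "ada")
        assert store.get(1) == "ada"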

> checking that query.Execute was called with the correct arguments.

That sounds ridiculous and I am not sure why anyone would ever do such a thing. I'm not sure that even needs a name.


You don't know what the model is capable of until you try. Maybe today's models are not good enough. Try again next year.

This is true, but also: everything I try works!

I simply cannot come up with tasks the LLMs can't do when running in agent mode with a feedback loop available to them. Giving the agent a clear goal, and a way to measure its progress towards that goal, is incredibly powerful.

With the problem in the original article, I might have asked it to generate 100 test cases, and run them with the original Perl. Then I'd tell it, "ok, now port that to Typescript, make sure these test cases pass".


Really, you haven't found a single task they can't do? I like agents, but this seems a little unrealistic? Recently, I asked Codex and Claude both to "give me a single command to capture a performance profile while running a playwright test". Codex worked on this one for at least 2 hours and never succeeded, even though it really isn't that hard.

I think I was using Grok Code Fast 1 with Cline and had it trying to fix some code. I came back a bit later and found that, after failing to make progress on the code, it had decided to "fix" the test by replacing it with a trivial test.

That made the test pass of course, leaving the code as broken as it ever was. Guess that one was on me though, I never specified it shouldn't do that...


> I simply cannot come up with tasks the LLMs can't do, when running in agent mode, with a feedback loop available to them. Giving a clear goal, and giving the agent a way to measure it's progress towards that goal is incredibly powerful.

It's really easy to come up with plenty of algorithmic tasks that they can't do.

Like: implement an algorithm / data structure that takes a sequence of priority queue instructions (insert element, delete smallest element) in the comparison model, and returns the elements that would be left in the priority queue at the end.

This is trivial to do in O(n log n). The challenge is doing this in linear time, or proving that it's not possible.

(Spoiler: it's possible, but it's far from trivial.)
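
The trivial O(n log n) baseline is just simulating the instructions with a binary heap, e.g.:

    # Easy O(n log n) simulation; the actual challenge is beating this bound
    # in the comparison model (or proving you can't).
    import heapq

    def surviving_elements(instructions):
        """instructions: list of ("insert", x) or ("delete_min",) tuples."""
        heap = []
        for op in instructions:
            if op[0] == "insert":
                heapq.heappush(heap, op[1])
            else:                   # "delete_min"
                heapq.heappop(heap)
        return list(heap)           # survivors, in no particular order

    # surviving_elements([("insert", 3), ("insert", 1), ("delete_min",), ("insert", 2)])
    # -> the elements 2 and 3 remain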


So it is not reliable enough to run automatically?

Alperen,

Thanks for the article. Perhaps you could write a follow-up article or tutorial on your favored approach, Verification-Guided Development? This is new to most people, including myself, and you only briefly touch on it after spending most of the article on what you don't like.

Good luck with your degree!

P.S. Some links on your Research page are placeholders or broken.


I'll add some links to the original VGD paper and related articles; that should help in the short term. Thank you! I'll look into writing something on VGD itself in the next few weeks.

> I think back to coworkers I’ve had over the years, and their varying preferences. Some people couldn’t start coding until they had a checklist of everything they needed to do to solve a problem. Others would dive right in and prototype to learn about the space they would be operating in.

This is the interesting question for me. Coming from big tech, where a plan is demanded for anything significant, I had the impression that you should always have one. But working at a startup again, where there is no bureaucracy to force me to plan, I find that I can live without detailed plans just fine. Then again, I am more experienced.



How do you use this new feature?

Did any other scanner catch this, and when? A detection lag leaderboard would be neat.
