
With a background like that you should be doing machine learning if you want to combine science and software. And climate tech is a burgeoning field.

I see many relevant openings in https://www.climatetechlist.com/jobs


They can charge companies more, since they drive more miles, and profit from it.

Don't you mean testing the interface of the implementation? I see nothing wrong with that, if so.

They mean the dependencies. If you’re testing system A whose sole purpose is to call functions in systems B and C, one approach is to replace B and C with mocks. The test simply checks that A calls the right functions.

The pain comes when system B changes. Oftentimes you can’t even make a benign change (like renaming a function) without updating a million tests.
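
To make that concrete, here is a tiny Python sketch with unittest.mock; process_order, fetch_order, and ship are made-up names standing in for A, B, and C:

    # Minimal sketch (hypothetical names): "system A" just orchestrates B and C.
    from unittest.mock import Mock

    def process_order(b, c, order_id):        # "system A"
        record = b.fetch_order(order_id)      # call into "system B"
        c.ship(record)                        # call into "system C"

    def test_process_order_calls_dependencies():
        b, c = Mock(), Mock()
        process_order(b, c, 42)
        # The test only asserts that A made the right calls...
        b.fetch_order.assert_called_once_with(42)
        c.ship.assert_called_once_with(b.fetch_order.return_value)
        # ...so renaming B's fetch_order later breaks every test written this way.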


Tests are only concerned with the user interface, not the implementation. If System B changes, you only have to change the parts of your implementation that use System B. The user interface remains the same, so the tests can remain the same, and therefore so can the mocks.

I think we’re in agreement. Mocks are usually all about reaching inside the implementation and checking things. I prefer highly accurate “fakes” - for example running queries against a real ephemeral Postgres instance in a Docker container instead of mocking out every SQL query and checking that query.Execute was called with the correct arguments.
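
A rough sketch of that style with testcontainers-python and SQLAlchemy (assumes Docker, a Postgres driver, and both libraries are available; the users table is made up for illustration):

    # Real ephemeral Postgres instead of mocked query.Execute calls.
    import sqlalchemy
    from testcontainers.postgres import PostgresContainer

    def test_insert_and_read_back():
        with PostgresContainer("postgres:16") as pg:   # throwaway container
            engine = sqlalchemy.create_engine(pg.get_connection_url())
            with engine.begin() as conn:
                conn.execute(sqlalchemy.text(
                    "CREATE TABLE users (id serial PRIMARY KEY, name text NOT NULL)"))
                conn.execute(sqlalchemy.text(
                    "INSERT INTO users (name) VALUES (:name)"), {"name": "ada"})
                names = conn.execute(
                    sqlalchemy.text("SELECT name FROM users")).scalars().all()
            assert names == ["ada"]   # real SQL against a real database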

> Mocks are usually all about reaching inside the implementation and checking things.

Unfortunately there is no consistency in the nomenclature used around testing. Testing is, after all, the least understood aspect of computer science. However, the dictionary suggests that a "mock" is something that is not authentic, but does not deceive (i.e. not the real thing, but behaves like the real thing). That is what I consider a "mock", but I'm gathering that is what you call a "fake".

Sticking with your example, a mock data provider to me is something that, for example, uses in-memory data structures instead of SQL, tested with the same test suite as the SQL implementation. It is not the datastore intended to be used, but it behaves the same way (as proven by the shared tests).
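
Concretely, the shared-suite idea might look like this with pytest (the UserStore interface and InMemoryUserStore are hypothetical; the real SQL-backed store would be wired in as a second fixture param):

    import pytest

    class InMemoryUserStore:                # the "mock"/fake: plain dict, no SQL
        def __init__(self):
            self._users = {}
        def add(self, user_id, name):
            self._users[user_id] = name
        def get(self, user_id):
            return self._users.get(user_id)

    @pytest.fixture(params=["memory"])      # add "postgres" here to prove parity
    def store(request):
        if request.param == "memory":
            return InMemoryUserStore()
        raise NotImplementedError(request.param)  # e.g. build the real SQL store

    def test_add_then_get(store):           # same suite, every implementation
        store.add(1, "ada")
        assert store.get(1) == "ada"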

> checking that query.Execute was called with the correct arguments.

That sounds ridiculous and I am not sure why anyone would ever do such a thing. I'm not sure that even needs a name.


You don't know what the model is capable of until you try. Maybe today's models are not good enough. Try again next year.

This is true, but also: everything I try works!

I simply cannot come up with tasks the LLMs can't do when running in agent mode with a feedback loop available to them. Giving the agent a clear goal, and a way to measure its progress towards that goal, is incredibly powerful.

With the problem in the original article, I might have asked it to generate 100 test cases, and run them with the original Perl. Then I'd tell it, "ok, now port that to Typescript, make sure these test cases pass".


Really, you haven't found a single task they can't do? I like agents, but this seems a little unrealistic? Recently, I asked Codex and Claude both to "give me a single command to capture a performance profile while running a playwright test". Codex worked on this one for at least 2 hours and never succeeded, even though it really isn't that hard.

I think I was using Grok Code Fast 1 with Cline and had it trying to fix some code. I came back a bit later and found that, after failing to make progress on the code, it had decided to "fix" the test by replacing it with a trivial test.

That made the test pass of course, leaving the code as broken as it ever was. Guess that one was on me though, I never specified it shouldn't do that...


> I simply cannot come up with tasks the LLMs can't do, when running in agent mode, with a feedback loop available to them. Giving a clear goal, and giving the agent a way to measure it's progress towards that goal is incredibly powerful.

It's really easy to come up with plenty of algorithmic tasks that they can't do.

Like: implement an algorithm / data structure that takes a sequence of priority queue instructions (insert element, delete smallest element) in the comparison model, and returns the elements that would be left in the priority queue at the end.

This is trivial to do in O(n log n). The challenge is doing this in linear time, or proving that it's not possible.

(Spoiler: it's possible, but it's far from trivial.)
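
The trivial O(n log n) baseline is just simulating the instructions with a binary heap, e.g.:

    # Easy O(n log n) simulation; the actual challenge is beating this bound
    # in the comparison model (or proving you can't).
    import heapq

    def surviving_elements(instructions):
        """instructions: list of ("insert", x) or ("delete_min",) tuples."""
        heap = []
        for op in instructions:
            if op[0] == "insert":
                heapq.heappush(heap, op[1])
            else:                   # "delete_min"
                heapq.heappop(heap)
        return list(heap)           # survivors, in no particular order

    # surviving_elements([("insert", 3), ("insert", 1), ("delete_min",), ("insert", 2)])
    # -> the elements 2 and 3 remain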


So it is not reliable enough to run automatically?

Alperen,

Thanks for the article. Perhaps you could write a follow-up article or tutorial on your favored approach, Verification-Guided Development? This is new to most people, including myself, and you only briefly touch on it after spending most of the article on what you don't like.

Good luck with your degree!

P.S. Some links on your Research page are placeholders or broken.


I'll add some links to the original VGD paper and related articles; that should help in the short term. Thank you! I'll look into writing something on VGD itself in the next few weeks.

> I think back to coworkers I’ve had over the years, and their varying preferences. Some people couldn’t start coding until they had a checklist of everything they needed to do to solve a problem. Others would dive right in and prototype to learn about the space they would be operating in.

This is the interesting question for me. Coming from big tech, where a plan is demanded for anything significant, I had the impression that you should always have one. But working at a startup again, where there is no bureaucracy to force me to plan, I find that I can live without detailed plans just fine. Then again, I am more experienced.



How do you use this new feature?

Did any other scanner catch this, and when? A detection lag leaderboard would be neat.
