
I would argue future engineers should be a bit worried. We no longer need to hire new developers.

I was not trained professionally, yet I'm writing production code that passes code review in languages I've never used. I write a prompt, validate that the result compiles and passes tests, have the model explain the code so I can confirm it does what I intended, write documentation for it, open the PR, and I'm seen as a competent contributor. I can't pass Leetcode level 1, yet here I am being invited to speak to developers.

Velocity goes up and the cost of features drops. This is good. I'm seeing at least 10-to-1 output compared to a year ago since integrating these new tools.



Yeah, it sounds to me like your teammates are going to pick up the tab in the end, when subtle errors are 10x harder to repair, or you're working on toy projects where correctness doesn't really matter.


To add to this.

I was going through Devin's "pass" diffs from SWE-bench.

Every one I ended up tracing back to the actual issue made changes that would reduce maintainability or introduce potential side effects.

I think it may be useful for suggestions in a red-green-refactor model, but it will end up producing code that is hard to maintain and modify.

Note this one here, which introduced circular dependencies and changed a function that only accepted points into one that appears to accept any geometric object but in fact only added support for lines.
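
A hypothetical sketch of that anti-pattern, with made-up names rather than the actual diff:

    class GeometryEntity: ...
    class Point(GeometryEntity): ...
    class Line(GeometryEntity): ...
    class Circle(GeometryEntity): ...

    # Before: signature and behavior agree; only Points are handled.
    def distance_before(a: Point, b: Point) -> float:
        return 0.0  # placeholder computation

    # After: the signature claims to accept any GeometryEntity, but the
    # body only adds a special case for Line. A Circle satisfies the type
    # hints and still blows up at runtime.
    def distance_after(a: Point, b: GeometryEntity) -> float:
        if isinstance(b, Point):
            return 0.0  # placeholder: point-to-point distance
        if isinstance(b, Line):
            return 0.0  # placeholder: point-to-line distance
        raise TypeError(f"unsupported entity: {type(b).__name__}")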

Domain knowledge and writing maintainable code is beyond generative transformers.

https://github.com/CognitionAI/devin-swebench-results/blob/m...

You simply can't get past what Gödel and Rice proved with current technology: non-trivial semantic properties of arbitrary programs are undecidable.

It is like when visual programming languages were supposed to replace programmers. Writing the code isn't really the issue; the details are.


Thank you for reading the diffs and reporting on them.

And to be fair, lots of humans are already at least this bad at writing code. And lots of companies are happy with garbage code so long as it addresses an immediate business requirement.

So Devin wouldn't have to advance much to be competitive in certain simple situations where people don't care about anything that happens more than 2 quarters into the future.

I also agree that producing good code which meets real business needs is a hard problem. In fact, any AI which can truly do the work of a good senior software engineer can probably learn to do a lot of other human jobs as well.


Architectural erosion is an ongoing problem for humans too, but at the SWE level they don't produce tightly coupled, low-cohesion code by default the majority of the time.

With changes of this quality, it won't be long until violations stack up to the point where further changes are beyond any algorithm's ability to unravel.

While lots of companies do only look at the short term, human programmers are incentivized to protect themselves from pain, provided they aren't forced into unrealistic delivery timelines.

AT&T Wireless being destroyed as a company by a failed SAP migration, which was largely due to fragile code, is a good example.

But I guess if the developer jobs that go away are at companies content to underperform in the market because of errors and a codebase that can't adapt to changing market realities, that may happen.

But I would fire any non-intern programmer who constantly did things like removing deprecation comments and introducing circular dependencies in the majority of their commits.

https://github.com/CognitionAI/devin-swebench-results/blob/m...

PAC learning is powerful, but it is still only probably approximately correct.

Until these tools can avoid the most basic bad practices, I don't see any company sticking with them in the long term, but it will probably be a very expensive experiment for many of them.


Can't we just RLHF code reviews?


RLHF works on problems that are difficult to specify yet easy to judge.

While RLHF will help improve systems, code correctness is not easy to judge outside of the simplest cases.

Note how in OpenAI's technical report they admit that performance on college-level tests comes almost exclusively from pre-training. If you look at the LSAT as an example, all of those questions were probably in the corpus.

https://arxiv.org/abs/2303.08774


>RLHF works on problems that are difficult to specify yet easy to judge.

But that's the thing: everyone here on HN (and elsewhere) seems to find it easy to judge the flaws of AI-generated code, and those judgments seem relatively consistent. So if we start offering these critiques as RLHF at scale, we should be able to bring LLM output up to the level where further feedback is hard (or at least inconsistent), right?


> You simply can't get past what Gödel and Rice proved with current technology.

Not this again. Those theorems tell you nothing about your concerns. The worst case of a problem is not equal to its usual case.


Agreed. I use LLMs quite extensively and the amount of production code I ship from an LLM is next to zero.

I even wrote the majority of my codebase in Python, despite not knowing Python, precisely because I would get the best recommendations from LLMs. As a frontend developer with no backend experience in the last decade and no Python experience, I've spent almost 8 months building an app where almost every function has gone through an LLM at some point, and I would still be extremely surprised if any of the code it generated landed in production.


Most software is already this bad, though. And managers won't care (maybe even shouldn't?) as long as the execution more or less delivers.

Think of this as a Facebook page vs. a WordPress website vs. a fully custom website. The best option is the fully custom website. Next is the cheaper option from someone who can put a few lines together. The worst option is a Facebook page you create yourself.

But the Facebook page also does the job. And for some businesses, it's good enough.


> I'm writing production code that's passing code reviews in languages I never used

Your coworkers likely aren't doing a very good job at reviewing, but also I don't blame them. The only way to be sure code works is to use it for its intended task. Brains are bad interpreters, and LLMs are extremely good bullshit generators. If the code makes it to prod and works, good. But honestly, if you aren't just pushing DB records around or slinging HTML, I doubt it'll be good enough to get you very far without taking down prod.


I have yet to see either Copilot or GPT-4 generate code that I would come close to accepting in a PR from one of my devs, so I struggle to imagine what kind of domain you're in where the code it generates actually makes it through review.


You simply don't know how to use it. It's not meant as "develop this feature". It's meant to reduce the time it takes you to do something you're already good at. The prompt will be in the form of "write this function with x/y/z constraints and a/b/c design choices". You do a few touch-ups, which is quick because you're good at said domain, and then you PR it. The bottom line is, it took you much less time to do the same thing.
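
For example, here is a made-up prompt and the kind of function it produces, with the touch-up being the part you do yourself:

    # Hypothetical prompt: "Write a function that chunks an iterable into
    # lists of at most n items; constraints: stdlib only, lazy evaluation;
    # design choice: yield lists, last chunk may be short."
    from typing import Iterable, Iterator, List, TypeVar

    T = TypeVar("T")

    def chunked(items: Iterable[T], n: int) -> Iterator[List[T]]:
        """Yield lists of at most n items; the last chunk may be short."""
        if n < 1:
            raise ValueError("n must be >= 1")  # touch-up added during review
        chunk: List[T] = []
        for item in items:
            chunk.append(item)
            if len(chunk) == n:
                yield chunk
                chunk = []
        if chunk:
            yield chunk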

Then again, it's always the dinosaurs who value their own training above anything else and try to cling to it at any cost, without learning new tools. So while the industry is going through major changes (2023 saw a 30% decrease in new hires, and among 940 companies surveyed, 40% expect layoffs due to AI), people should adapt rather than ignore the signs.


What's your domain?


That you know of


Honestly, that sounds like a problem with the way you're managing PRs. Either the PRs are too big, or you're nitpicking them on unimportant things.


To be fair, Leetcode was never a good indicator of developer skill, primarily because of the time pressure and the restrictive format that dings you for asking questions about the problem.


Speaking of Leetcode... is anyone selling a service to boost Leetcode scores using AI yet? It seems like that's fairly low hanging fruit at this point.


Based on their demos, HackerRank is doing this as part of their existing products, which makes sense since prompt engineering will soon become a minimum requirement for devs of any experience level.


I have come to accept using these tools to help generate code and improve my output. However, when it comes to more niche areas (in my case, retail technology), they fall short.

You still need domain knowledge of whatever you are writing code for or integrating with, especially if the technology is niche or the documentation was never made publicly available and scraped by the AI.

But when it comes to writing boilerplate code, or working with very commonly used frameworks (like frontend JavaScript frameworks in my case), it is great.


> passes tests

Okay, so you are just kicking the can down the road to the test engineers. Now your org needs to spend more resources on test engineering to really make sure the AI code doesn't fuzz your system to death.

If you squint, using a language compiler is analogous to writing tests for generated code. You are really writing a spec and having something automatically generate the actual code that implements the spec.
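
A rough sketch of that framing, with a hypothetical function standing in for whatever gets generated: the test is the spec, and the implementation is just whatever satisfies it.

    # The test is the spec; the generated implementation only has to pass it.
    def slugify(title: str) -> str:
        # Hypothetical generated implementation under test.
        return "-".join(title.lower().split())

    def test_slugify_spec() -> None:
        assert slugify("Hello World") == "hello-world"
        assert slugify("  Already   spaced  ") == "already-spaced"
        assert slugify("") == ""

    if __name__ == "__main__":
        test_slugify_spec()
        print("spec satisfied")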


This doesn’t vibe with my experience at all. We also use LLMs and it’s exceedingly rare that a non-trivial PR/MR gets waved through without comment.


You should create a vfx character and really pizazz up the talk. Let it run and narrate the speech on a huge screen in an auditorium.


I wonder if the reviewers are just using GPT as well.


Meanwhile, I'm paid to edit a single line of code in two weeks, and nothing short of the singularity will replace me.

But sure, call me back when AI can actually reason about possible race conditions, instead of spewing out the definition of one it got from Wikipedia.
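
The flavor of thing I mean is a check-then-act gap like this toy example; in real code it is buried under far more indirection:

    import threading

    balance = 100

    def withdraw(amount: int) -> None:
        global balance
        # Check-then-act race: both threads can pass the check before
        # either performs the (non-atomic) read-modify-write below.
        if balance >= amount:
            balance -= amount

    threads = [threading.Thread(target=withdraw, args=(80,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(balance)  # usually 20, but can be -60 if the threads interleave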


Who's "we"?


Post some example PRs.



