Been reading a book by u/fpham, "The Cranky Man's Guide to LoRA and QLoRA," and it's pretty great. The writing quality isn't all there, but the content is valuable for learning to make good finetunes.
At my company (Charlie Labs), we've had a tremendous amount of success with context awareness over long-running tasks with GPT-5 since getting access a few weeks ago. We ran an eval to solve 10 real GitHub issues so we could measure this against Claude Code, and the differences were surprisingly large. You can see our write-up here:
Often our tasks take 30-45 minutes, and GPT-5 handles massive context threads in Linear or GitHub without getting tripped up by things like changes in direction partway through the thread.
While 10 issues isn't hugely comprehensive, we found the results directionally very impressive, and we'll likely build on this eval to better understand performance going forward.
I am not (usually) photosensitive, but the animated static noise on your website causes noticeable flickering on various screens I use and made it impossible for me to read your article.
For better accessibility and a safer experience[1], I would recommend not animating the background, or at least making it easy to toggle off.
Edited to add: I am, in fact, photosensitive (due to a genetic retinal condition), and to my eyes your site is very easy to read as it is, and the visualizations look great.
Please let me know what you would like to see more of. Evals are something we take seriously. I think this post was OK given our constraints, but I'd like to produce content people find useful, and I think we can do a lot better.
Did you sign any kind of agreement with a non-disparagement clause to get early access? I'm asking because if you did, your data point isn't useful: it would mean anyone else who tried it and got worse results wouldn't be able to post here, and we would only be seeing the successful data points.
They didn't say anything to us and nothing was approved; it was just eng <> eng discussion about the model. Also, nothing was cherry-picked - I don't care what OAI thinks, I care about producing the best product and showing you our findings.
Waiting 30-45 minutes for code that you're still going to have to read top to bottom to make sure it doesn't have anything dumb in it does not seem like a productivity enhancement. I would quit if I were an engineer told to do this.
If you're doing nothing in that 30-45 minutes other than stare at a loading screen, you're doing it wrong.
I'm not sold on the efficacy of AI and I share your reservations about having to scrutinise its output, but I see great value in being able to offload a long-running task to someone/something else and only have to check back later. In the meantime, I can be doing something else - like sitting in those planning meetings we all enjoy!
I love sitting in those planning meetings, too. /s
This is exactly right. We've adapted our workflow to kick off a task and then kick off the next one and the next. Then we review the work of each as they come through. It's just CPU pipelining for human workflow.
The process is far from perfect but the throughput is very high. The limiting factor is review. I spend most of my time doing line-by-line review of AI output and asking questions about things I'm unsure of. It's a very different job from the way I historically operated, which involved tight code -> verify loops of manually written code.
The company I work for generates thousands of these each week for children's personalized storybooks to help them learn how to read. The story text is the core part of the application, but the personalized images are what make them engaging.
That's very interesting. I would have assumed that 4o is internally using a single seed for the entire conversation, or something analogous to that, to control randomness across image generation requests. Can you share the technical name for this reasoning process so I could look up research about it?
In my experience, AI isn't very good at debugging AI-generated code. If it fails to hit on the right insight, it loops continuously until it's completely off the rails. I'm surprised your friend hasn't fully gotten stuck with this, as it seems like a huge risk for his startup.
Having had an inside view of a YC startup that went from seed to Series C, I can tell you that code quality matters a lot less than one would think in the early days of a startup.
The biggest risk to a startup is that you get the business model wrong or you don't ship code, even if the code is buggy and messy.
> I fine-tuned GPT-4o to turn Claude's sketch of changes into a git patch, which would add and remove lines to make the edits. I only finished generating the training data late at night, and the fine-tuning job ran as I slept
Could you say more about this? What was the entirety of your training data, exactly, and how did the sketch of changes and git patch play into that?
Sure! I git cloned some open source projects, and wrote a script (with Codebuff) to pick commits and individual diffs of files. For each of those, I had Claude write a sketch of what changed from the old file to the new.
That's all the data I needed: the old file, the sketch of how Claude would update it, and the ground-truth diff that should be produced. I compiled each of those into an ideal conversation where the assistant responds with the perfect patch, and that became the training set. I think I had on the order of ~300 of these conversations for the first run, and it worked pretty well.
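For anyone curious, here's a minimal sketch of what that compilation step could look like, assuming the standard OpenAI chat fine-tuning JSONL format; the system prompt, user message layout, and example triple are all made up for illustration, not the parent's exact setup:

```python
import json

def build_example(old_file: str, sketch: str, patch: str) -> dict:
    """Turn one (old file, sketch, ground-truth diff) triple into a chat-format training example."""
    return {
        "messages": [
            {"role": "system",
             "content": "You convert a sketch of edits into a git patch for the given file."},
            {"role": "user",
             "content": f"Original file:\n{old_file}\n\nSketch of changes:\n{sketch}"},
            # The ground-truth diff is the "perfect" assistant response we want the model to learn.
            {"role": "assistant", "content": patch},
        ]
    }

# A single made-up triple; the real ones would come from the commit-mining script.
triples = [
    (
        "def add(a, b):\n    return a + b\n",
        "Add a subtract(a, b) helper below add.",
        "@@ -1,2 +1,5 @@\n def add(a, b):\n     return a + b\n+\n+def subtract(a, b):\n+    return a - b\n",
    ),
]

# OpenAI fine-tuning takes one JSON object per line (JSONL).
with open("train.jsonl", "w") as f:
    for old_file, sketch, patch in triples:
        f.write(json.dumps(build_example(old_file, sketch, patch)) + "\n")
```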
I came up with more improvements too, like replacing all the varying placeholder comments like "// ... existing code ..." or "# ... (keep the rest of the function)" with a single [[*REPLACE_WITH_EXISITNG_CODE*]] symbol, which made it more accurate.
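A rough idea of what that normalization might look like (the regex patterns and function name here are hypothetical, just to show collapsing the variants into one token):

```python
import re

# Hypothetical patterns covering the kinds of placeholder comments mentioned above;
# a real sketch corpus would need more variants.
PLACEHOLDER_PATTERNS = [
    r"//\s*\.\.\.\s*existing code\s*\.\.\.",
    r"#\s*\.\.\.\s*\(keep the rest of the function\)",
]

CANONICAL_TOKEN = "[[*REPLACE_WITH_EXISITNG_CODE*]]"

def normalize_placeholders(sketch: str) -> str:
    """Replace every variant placeholder comment with the single canonical symbol."""
    for pattern in PLACEHOLDER_PATTERNS:
        sketch = re.sub(pattern, CANONICAL_TOKEN, sketch, flags=re.IGNORECASE)
    return sketch

print(normalize_placeholders("def foo():\n    # ... (keep the rest of the function)\n"))
```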
I wonder if the script can be flipped to both encourage the use of ChatGPT and scrutinize its output among students. I imagine that analyzing the results of ChatGPT output for something like a routine essay prompt requires a higher degree of precision and subject-matter expertise than writing the essay itself.
How much time do you give candidates to complete their work samples? I'm curious whether work-sample tests filter out, say, parents with young children, for whom a company with a 4-6 hour technical interview might be preferable to an unbounded take-home assignment (in addition to the possibility of more interviews).