Been reading a book by u/fpham, "The Cranky Man's Guide to LoRA and QLoRA," and it's pretty great. The writing quality isn't all there, but the content is valuable for learning to make good finetunes.
At my company (Charlie Labs), we've had a tremendous amount of success with context awareness over long-running tasks with GPT-5 since getting access a few weeks ago. We ran an eval to solve 10 real GitHub issues so we could measure this against Claude Code, and the differences were surprisingly large. You can see our write-up here:
Often our tasks take 30-45 minutes, and GPT-5 handles massive context threads in Linear or GitHub without getting tripped up by things like changes in direction partway through the thread.
While 10 issues isn't hugely comprehensive, we found the results directionally very impressive, and we'll likely build on this eval to better understand performance going forward.
I am not (usually) photosensitive, but the animated static noise on your website causes noticeable flickering on various screens I use and made it impossible for me to read your article.
For better accessibility and a safer experience[1], I would recommend not animating the background, or at least making it easy to toggle off.
Edited to add: I am, in fact, photosensitive (due to a genetic retinal condition), and to my eyes your site is very easy to read as it is, and the visualizations look great.
Please let me know what you would like to see more of. Evals are something we take seriously. I think this post was OK given our constraints, but I'd like to produce content people find useful, and I think we can do a lot better.
Did you sign any kind of agreement with a non-disparagement clause to get early access? I'm asking because if you did, your data point isn't useful: it would mean anyone else who tried it and got worse results wouldn't be able to post here, and we would only be seeing the successful data points.
They didn't say anything to us and nothing was approved; it was just eng <> eng discussion about the model. Also, nothing was cherry-picked - I don't care what OAI thinks, I care about producing the best product and showing you our findings.
Waiting 30-45 minutes for code that you're still going to have to read top to bottom to make sure it doesn't have anything dumb in it does not seem like a productivity enhancement. I would quit if I were an engineer told to do this.
If you're doing nothing in that 30-45 minutes other than stare at a loading screen, you're doing it wrong.
I'm not sold on the efficacy of AI and I share your reservations about having to scrutinise its output, but I see great value in being able to offload a long-running task to someone/something else and only have to check back later. In the meantime, I can be doing something else - like sitting in those planning meetings we all enjoy!
I love sitting in those planning meetings, too. /s
This is exactly right. We've adapted our workflow to kick off a task and then kick off the next one and the next. Then we review the work of each as they come through. It's just CPU pipelining for human workflow.
The process is far from perfect but the throughput is very high. The limiting factor is review. I spend most of my time doing line-by-line review of AI output and asking questions about things I'm unsure of. It's a very different job from the way I historically operated, which involved tight code -> verify loops of manually written code.
The company I work for generates thousands of these each week for children's personalized storybooks to help them learn how to read. The story text is the core part of the application, but the personalized images are what make them engaging.
That's very interesting. I would have assumed that 4o is internally using a single seed for the entire conversation, or something analogous to that, to control randomness across image generation requests. Can you share the technical name for this reasoning process so I could look up research about it?
In my experience, AI isn't very good at debugging AI-generated code. If it fails to hit on the right insight, it loops continuously until it's completely off the rails. I'm surprised your friend hasn't fully gotten stuck with this, as it seems like a huge risk for his startup.
Having had an inside view of a YC startup that went from seed to Series C, I can tell you that code quality matters a lot less than one would think in the early days of a startup.
The biggest risk to a startup is that you get the business model wrong or you don't ship code, even if the code is buggy and messy.
> I fine-tuned GPT-4o to turn Claude's sketch of changes into a git patch, which would add and remove lines to make the edits. I only finished generating the training data late at night, and the fine-tuning job ran as I slept
Could you say more about this? What was the entirety of your training data, exactly, and how did the sketch of changes and git patch play into that?
Sure! I git cloned some open source projects, and wrote a script (with Codebuff) to pick commits and individual diffs of files. For each of those, I had Claude write a sketch of what changed from the old file to the new.
That's all the data I needed: the old file, the sketch of how Claude would update it, and the ground-truth diff that should be produced. I compiled each of those into an ideal conversation where the assistant responds with the perfect patch, and that became the training set. I think I had on the order of ~300 of these conversations for the first run, and it worked pretty well.
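For anyone curious, here's a minimal sketch of what that compilation step could look like, assuming the standard OpenAI chat fine-tuning JSONL format; the system prompt, user message layout, and example triple are all made up for illustration, not the parent's exact setup:

```python
import json

def build_example(old_file: str, sketch: str, patch: str) -> dict:
    """Turn one (old file, sketch, ground-truth diff) triple into a chat-format training example."""
    return {
        "messages": [
            {"role": "system",
             "content": "You convert a sketch of edits into a git patch for the given file."},
            {"role": "user",
             "content": f"Original file:\n{old_file}\n\nSketch of changes:\n{sketch}"},
            # The ground-truth diff is the "perfect" assistant response we want the model to learn.
            {"role": "assistant", "content": patch},
        ]
    }

# A single made-up triple; the real ones would come from the commit-mining script.
triples = [
    (
        "def add(a, b):\n    return a + b\n",
        "Add a subtract(a, b) helper below add.",
        "@@ -1,2 +1,5 @@\n def add(a, b):\n     return a + b\n+\n+def subtract(a, b):\n+    return a - b\n",
    ),
]

# OpenAI fine-tuning takes one JSON object per line (JSONL).
with open("train.jsonl", "w") as f:
    for old_file, sketch, patch in triples:
        f.write(json.dumps(build_example(old_file, sketch, patch)) + "\n")
```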
I came up with more improvements too, like replacing all the varying placeholder comments like "// ... existing code ..." or "# ... (keep the rest of the function)" with a single [[*REPLACE_WITH_EXISITNG_CODE*]] symbol, which made it more accurate.
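A rough idea of what that normalization might look like (the regex patterns and function name here are hypothetical, just to show collapsing the variants into one token):

```python
import re

# Hypothetical patterns covering the kinds of placeholder comments mentioned above;
# a real sketch corpus would need more variants.
PLACEHOLDER_PATTERNS = [
    r"//\s*\.\.\.\s*existing code\s*\.\.\.",
    r"#\s*\.\.\.\s*\(keep the rest of the function\)",
]

CANONICAL_TOKEN = "[[*REPLACE_WITH_EXISITNG_CODE*]]"

def normalize_placeholders(sketch: str) -> str:
    """Replace every variant placeholder comment with the single canonical symbol."""
    for pattern in PLACEHOLDER_PATTERNS:
        sketch = re.sub(pattern, CANONICAL_TOKEN, sketch, flags=re.IGNORECASE)
    return sketch

print(normalize_placeholders("def foo():\n    # ... (keep the rest of the function)\n"))
```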
I wonder if the script can be flipped to both encourage the use of ChatGPT and scrutinize its output among students. I imagine that analyzing the results of ChatGPT output for something like a routine essay prompt requires a higher degree of precision and subject-matter expertise than writing the essay itself.
How much time do you give candidates to complete their work samples? I'm curious whether work-sample tests filter out, say, parents with young children, for whom a company with a 4-6 hour technical interview might be preferable to an unbounded take-home assignment (in addition to the possibility of more interviews).