
So many of the writer's issues with students today are things I did plenty of myself when I was in college, around 15 years ago at this point. I skipped classes all the time, and I was often browsing the web on my phone / laptop even when I did go to class.

If I'm being honest, a lot of my professors (perhaps even a big majority) were just bad teachers, and I got much more value out of the textbook, looking things up on the internet, or just tinkering with the at-home assignments. I can say with 100% certainty that ChatGPT would have been infinitely more helpful for learning calculus than the professor who taught my class at university.

I also don't really align with the issues he has with students asking for the slide decks used in class. If it can help your students learn the material, the whole purpose of the class, then what's the big deal? This point in particular almost made it seem like he's a bit salty over his students not being deferential enough to him.

All in all, despite doing many of the things that this writer takes issue with when I was in college myself years ago, I have a great career and I'm good at my work. So I think the kids are going to be just fine.


Yes, I think you're right about that. Most teachers are not good at teaching. You have to wonder why they need special certification to be a school teacher when the result is so bad. I know that most teachers were bad because I had a few excellent teachers and the contrast made it obvious. I failed high school math, then I went to university and did a more advanced math course; I got a distinction. I didn't even invest more time. The difference is that, at university, I was mostly skipping the lectures and reading directly from the textbook.

Math at school was just insane; it was an endless stream of: if you see this problem, use this formula; if you see that problem, use that formula... But nobody understood what they were doing. Nobody learned math from first principles.

It's weird, though, because at university, as I was doing well in math, I came across John von Neumann's quote: "In mathematics, you don't understand things. You just get used to them."

To me, this suggests that some gifted people have the ability to apply complex rules without understanding them from first principles... But that was absolutely never the case for me. I'm the opposite of that. I can't apply something before I fully understand it.


I always laugh when professors complain that students doing poorly in the class don't show up to class or office hours; the professors might just be really bad teachers, and the students know those things won't help them.


I feel like every time I try these "agentic" programming tools, they fall flat on their face.

Devin was pretty bad and honestly soaked up more time than it saved. I've tried Cursor Composer before and came away with bad results. I tried Copilot again just now with o3-mini, and it completely hallucinated some fields into my project when I asked it to do something...

Am I taking crazy pills or do these tools kinda suck?


You might be asking it too much, or not giving it enough context.

I've found the Cursor Agent to work great when you give it a narrow scope and plenty of examples.


Perhaps, but at that point I feel like I'm spending more time feeding the tool the right prompt and context, going back and forth with corrections, etc... when I could just write the code myself with less time and hassle.

I've definitely had far more success with using AI as a fuzzy search or asking it for one-off pieces of functionality. Any time I ask it to interact directly inside my codebase, it usually fails.


"Project Padawan" looks fairly similar to Devin, at least from a user experience perspective. From personal experience, Devin was pretty terrible so we'll see if Microsoft does any better...


Honestly, that demo was pretty weak.

If 2025 is really going to be the “year of agents”, then they need to do better than this.


I can't even properly put into words my hatred for AI-generated "music".


The AI hype bros on social media are the actual worst and have done more damage than anything else to how I feel about AI these days.

The grift has been at unbearable levels for months now and it actually drove me to delete my X account recently.


Yep. Everything is exactly two models away. It's the new fusion. Yet after $X billion in expenditure, I still have to argue with the finest turd squeezed out of the industry over simple things that a trained monkey could work out.

The whole thing is propped up on faith, hype and investment capital. Now I work in the latter and we're not impressed.

Edit: As for X, removing it from my existence was an improvement in mental health for sure.


That is one really, really impressive monkey. If it were true. Probably just a bad description of reality.


Monkey as a synonym for a completely fungible creature.


Well, deleting the twitverse is an upside to your life, completely outside the whole imaginary intelligence topic.


Yeah I agree that wasn't particularly mind blowing to me and seems fairly in line with what existing SOTA models can do. Especially since they did it in steps. Maybe I'm missing something.


Quote from the creators of the ARC-AGI benchmark: "Passing ARC-AGI does not equate achieving AGI, and, as a matter of fact, I don't think o3 is AGI yet. o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence."


I like the notion, implied in the article, that AGI won't be verified by any single benchmark, but by our collective inability to come up with benchmarks that defeat some eventual AI system. This matches the cat-and-mouse game we've been seeing for a while, where benchmarks have to constantly adapt to better models.

I guess you can say the same thing for the Turing Test. Simple chat bots beat it ages ago in specific settings, but the bar is much higher now that the average person is familiar with their limitations.

If/once we have an AGI, it will probably take weeks to months to really convince ourselves that it is one.


I'd need to see what kinds of easy tasks those are and would be happy to revise my hypothesis if that's warranted.

Also, it depends a great deal on what we define as AGI and whether they need to be a strict superset of typical human intelligence. o3's intelligence is probably superhuman in some aspects but inferior in others. We can find many humans who exhibit such tendencies as well. We'd probably say they think differently but would still call them generally intelligent.


They're in the original post. Also here: https://x.com/fchollet/status/1870172872641261979 / https://x.com/fchollet/status/1870173137234727219

Personally, I think it's fair to call them "very easy". If a person I otherwise thought was intelligent was unable to solve these, I'd be quite surprised.


Thanks! I've analyzed some easy problems that o3 failed at. They involve spatial intelligence including connection and movement. This skill is very hard to learn from textual and still image data.

I believe this sort of core knowledge is learnable through movement and interaction data in a simulated world and it will not present a very difficult barrier to cross.

(OpenAI purchased a company behind a Minecraft clone a while ago. I've wondered if this is the purpose.)


> I believe this sort of core knowledge is learnable through movement and interaction data in a simulated world and it will not present a very difficult barrier to cross.

Maybe! I suppose time will tell. That said, spatial intelligence (connection/movement included) is the whole game in this evaluation set. I think it's revealing that they can't handle these particular examples, and problematic for claims of AGI.


Probably just not trained on this kind of data. We could create a benchmark about it, and they'd shatter it within a year or so.

I'm starting to really see no limits on intelligence in these models.


Doesn't the fact that it can only accomplish tasks with benchmarks imply that it has limitations in intelligence?


> Doesn't the fact that it can only accomplish tasks with benchmarks

That's not a fact


> This skill is very hard to learn from textual and still image data.

I had the same take at first, but thinking about it again, I'm not quite sure?

Take the "blue dots make a cross" example (the second one). The inputs only has four blue dots, which makes it very easy to see a pattern even in text data: two of them have the same x coordinate, two of them have the same y (or the same first-tuple-element and second-tuple-element if you want to taboo any spatial concepts).

Then if you look into the output, you can notice that all the input coordinates are also in the output set, just not always with the same color. If you separate them into "input-and-output" and "output-only", you quickly notice that all of the output-only squares are blue and share a coordinate (tuple-element) with the blue inputs. If you split the "input-and-output" set into "same color" and "color changed", you can notice that the changes only go from red to blue, and that the coordinates that changed are clustered, and at least one element of the cluster shares a coordinate with a blue input.

Of course, it's easy to build this chain of reasoning in retrospect, but it doesn't seem like a complete stretch: each step only requires noticing patterns in the data, and it's how a reasonably puzzle-savvy person might solve this if you didn't let them draw the squares on paper. There are a lot of escape games with chains of reasoning much more complex, and random office workers solve them all the time.
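To make that concrete, here's a rough sketch of that chain as code, treating each grid as a {(x, y): color} dict. The coordinates and colors are invented for illustration; this is not the actual task data.

    # Toy version of the reasoning above; coordinates are made up, not the real task.
    BLUE, RED = "blue", "red"

    inp = {(3, 1): BLUE, (3, 7): BLUE, (1, 4): BLUE, (6, 4): BLUE,
           (3, 3): RED, (4, 4): RED}
    out = {(3, 1): BLUE, (3, 7): BLUE, (1, 4): BLUE, (6, 4): BLUE,
           (3, 3): BLUE, (4, 4): BLUE, (3, 2): BLUE, (2, 4): BLUE}

    blues = [c for c, col in inp.items() if col == BLUE]
    # Two blue dots share an x and two share a y: the "cross" pattern.
    shared_x = {x for x, _ in blues if sum(bx == x for bx, _ in blues) > 1}
    shared_y = {y for _, y in blues if sum(by == y for _, by in blues) > 1}

    output_only = {c for c in out if c not in inp}
    color_changed = {c for c in inp if c in out and inp[c] != out[c]}

    # Every new or recolored cell is blue and shares a coordinate with a blue input.
    for c in output_only | color_changed:
        assert out[c] == BLUE
        assert c[0] in shared_x or c[1] in shared_y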

The visual aspect makes the patterns jump out at us more, but the fact that o3 couldn't find them at all with thousands of dollars of compute budget still seems meaningful to me.

EDIT: Actually, looking at Twitter discussions[1], o3 did find those patterns, but was stumped by ambiguity in the test input that the examples didn't cover. Its failures on the "cascading rectangles" example[2] look much more interesting.

[1]: https://x.com/bio_bootloader/status/1870339297594786064

[2]: https://x.com/_AI30_/status/1870407853871419806


Yeah, the real goalpost is reliable intelligence. A supposedly PhD-level AI failing simple problems is a red flag that we're still missing something.


You've never met a doctor who couldn't figure out how to work their email? Or use street smarts? You can have a PhD but be unable to reliably handle soft skills, or any number of things you might 'expect' someone to be able to do.

Just playing devils' advocate or nitpicking the language a bit...


An important distinction here is you’re comparing skill across very different tasks.

I’m not even going that far, I’m talking about performance on similar tasks. Something many people have noticed about modern AI is it can go from genius to baby-level performance seemingly at random.

Take self driving cars for example, a reasonably intelligent human of sound mind and body would never accidentally mistake a concrete pillar for a road. Yet that happens with self-driving cars, and seemingly here with ARC-AGI problems which all have a similar flavor.


A coworker of mine has a PhD in physics. I showed him the difference between little and big endian in a hex editor, and showed him the file sizes of raw image files and how to compute them... I explained it 3 times, and maybe he understands part of it now.
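For reference, both things fit in a few lines (the numbers here are illustrative, not his actual files):

    import struct

    # Raw image size is just width * height * bytes per pixel.
    width, height, bytes_per_pixel = 4000, 3000, 2   # e.g. 16-bit grayscale
    print(width * height * bytes_per_pixel)          # 24000000 bytes, ~24 MB

    # Endianness only changes the byte order of the same value.
    print(struct.pack("<H", 0x1234).hex())  # little endian: 3412
    print(struct.pack(">H", 0x1234).hex())  # big endian:    1234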


Doctors[1] or, say, pilots are skilled professions that are difficult to master and deserve respect, yes, but they do not require high levels of intelligence to be good at. They require many other skills that are hard, like making decisions under pressure or good motor skills, but not necessarily intelligence.

Also, not knowing something is hardly a criterion; skilled humans focus on their areas of interest above most other knowledge and can be unaware of other subjects.

Fields Medal winners, for example, may not be aware of most pop-culture things; that doesn't make them incapable of it, just not interested.

—

[1] Most doctors, including surgeons and many respected specialists. Some doctors, however, do need those skills, but those are a specialized few, and they generally do know how to use email.


Good nitpick.

A PhD learnt their field. If they learnt that field by reasoning through everything to understand the material, then - given enough time - they are capable of learning email and street smarts.

Which is why a reasoning LLM should be able to do all of those things.

It's not learnt a subject; it's learnt reasoning.


They say it isn't AGI, but I think the way o3 functions can be refined into AGI - it's learning to solve new, novel problems. We just need to make it do that more consistently, which seems achievable.


Direct quote from the ARC-AGI blog:

“SO IS IT AGI?

ARC-AGI serves as a critical benchmark for detecting such breakthroughs, highlighting generalization power in a way that saturated or less demanding benchmarks cannot. However, it is important to note that ARC-AGI is not an acid test for AGI – as we've repeated dozens of times this year. It's a research tool designed to focus attention on the most challenging unsolved problems in AI, a role it has fulfilled well over the past five years.

Passing ARC-AGI does not equate achieving AGI, and, as a matter of fact, I don't think o3 is AGI yet. o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence.

Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over 95% with no training). This demonstrates the continued possibility of creating challenging, unsaturated benchmarks without having to rely on expert domain knowledge. You'll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible.”

The high-compute variant sounds like it cost around *$350,000*, which is kinda wild. Lol, the blog post specifically mentioned how OpenAI asked ARC-AGI to not disclose the exact cost for the high-compute version.

Also, 1 odd thing I noticed is that the graph in their blog post shows the top 2 scores as “tuned” (this was not displayed in the live demo graph). This suggests that in those cases the model was trained to better handle these types of questions, so I do wonder about data / answer contamination in those cases…


> Also, 1 odd thing I noticed is that the graph in their blog post shows the top 2 scores as “tuned”

Something I missed until I scrolled back to the top and reread the page was this

> OpenAI's new o3 system - trained on the ARC-AGI-1 Public Training set

So yeah, the results were specifically from a version of o3 trained on the public training set

Which on the one hand I think is a completely fair thing to do. It's reasonable that you should teach your AI the rules of the game, so to speak. There really aren't any spoken rules though, just pattern observation. Thus, if you want to teach the AI how to play the game, you must train it.

On the other hand, though, I don't think the o1 models or Claude were trained on the dataset, in which case it isn't a completely fair competition. If I had to guess, you could probably get 60% on o1 if you trained it on the public dataset as well.


Lol I missed that even though it's literally the first sentence of the blog, good catch.

Yeah, that makes this result a lot less impressive for me.


ARC co-founder Mike Knoop

"Raising visibility on this note we added to address ARC "tuned" confusion:

> OpenAI shared they trained the o3 we tested on 75% of the Public Training set.

This is the explicit purpose of the training set. It is designed to expose a system to the core knowledge priors needed to beat the much harder eval set.

The idea is each training task shows you an isolated single prior. And the eval set requires you to recombine and abstract from those priors on the fly. Broadly, the eval tasks require utilizing 3-5 priors.

The eval sets are extremely resistant to just "memorizing" the training set. This is why o3 is impressive." https://x.com/mikeknoop/status/1870583471892226343


Great catch. Super disappointing that AI companies continue to do things like this. It’s a great result either way but predictably the excitement is focused on the jump from o1, which is now in question.


To me it's very frustrating because such little caveats make benchmarks less reliable. Implicitly, benchmarks are no different from tests in that someone/something who scores high on a benchmark/test should be able to generalize that knowledge out into the real world.

While that is true with humans taking tests, it's not really true with AIs evaluating on benchmarks.

SWE-bench is a great example. Claude Sonnet can get something like a 50% on verified, whereas I think I might be able to score a 20-25%? So, Claude is a better programmer than me.

Except that isn't really true. Claude can still make a lot of clumsy mistakes. I wouldn't even say these are junior engineer mistakes. I've used it for creative programming tasks and have found one example where it tried to use a library written for d3js for a p5js programming example. The confusion is kind of understandable, but it's also a really dumb mistake.

Some very simple explanations: the models were probably overfitted to a degree on Python given its popularity in AI/ML work, and SWE-bench is all Python. Also, the underlying GitHub issues are quite old, so they probably contaminated the training data and the models have simply memorized the answers.

Or maybe benchmarks are just bad at measuring intelligence in general.

Regardless, every time a model beats a benchmark I'm annoyed by the fact that I have no clue whatsoever how much this actually translates into real world performance. Did OpenAI/Anthropic/Google actually create something that will automate wide swathes of the software engineering profession? Or did they create the world's most knowledgeable junior engineer?


> Some very simple explanations: the models were probably overfitted to a degree on Python given its popularity in AI/ML work, and SWE-bench is all Python. Also, the underlying GitHub issues are quite old, so they probably contaminated the training data and the models have simply memorized the answers.

My understanding is that it works by checking if the proposed solution passes test cases included in the original (human) PR. This seems to present some problems too, because there are surely ways to write code that passes the tests but would fail human review for one reason or another. It would be interesting to see not only the pass rate but also the rate at which the proposed solutions are preferred to the original ones (preferably evaluated by a human, but even an LLM comparing the two solutions would be interesting).
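In rough code, I imagine the check is something like this (an illustrative sketch, not the actual SWE-bench harness; the function and argument names are made up):

    import subprocess

    def looks_resolved(repo_dir, patch_file, fail_to_pass, pass_to_pass):
        """Hypothetical sketch of a SWE-bench-style check, not the real harness.

        A candidate patch counts as resolved if the tests the human PR fixed
        now pass, and the previously passing tests still pass.
        """
        subprocess.run(["git", "apply", patch_file], cwd=repo_dir, check=True)
        tests = list(fail_to_pass) + list(pass_to_pass)
        result = subprocess.run(["python", "-m", "pytest", *tests], cwd=repo_dir)
        return result.returncode == 0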


If I recall correctly the authors of the benchmark did mention on Twitter that for certain issues models will submit an answer that technically passes the test but is kind of questionable, so yeah, good point.


> acid test

The CSS acid test? That can be gamed too.


https://en.wikipedia.org/wiki/Acid_test:

> An acid test is a qualitative chemical or metallurgical assay utilizing acid. Historically, it often involved the use of a robust acid to distinguish gold from base metals. Figuratively, the term represents any definitive test for attributes, such as gauging a person's character or evaluating a product's performance.

Specifically here, they're using the figurative sense of "definitive test".


also a "litmus test" but I guess that's a different chemistry test...


Don't forget the massive population decline set to literally halve the population in the next 70 years...


I wouldn’t consider it a major problem, especially with the coming robotic revolution. Even if the population declines by half, that would still leave 700 million people, roughly twice the population of the U.S. According to predictions, the first signs of demographic challenges are expected to appear about 15–20 years from now. That’s a long time, and a lot can change in two decades. Just compare the world in 2004 to today.

It's a major mistake to underestimate your competition.


> the coming robotic revolution

That's a long ways out. We're barely past the first innings of the chatbot revolution and it's already struggling to keep going. Robotics are way more complex because physics can be cruel.


https://www.physicalintelligence.company/blog/pi0?blog

Show me what was possible 20 years ago versus what we can do now. I think you have enough imagination to envision what might be possible 20 years from now.

