> FPGAs will never rival GPUs or TPUs for inference. The main reason is that GPUs aren't really GPUs anymore.
Yeah. Even for Bitcoin mining, GPUs dominated FPGAs. I created the Bitcoin mining FPGA project(s), and they were only interesting for two reasons: 1) they were far more power efficient, which in the case of mining changes the equation significantly; 2) GPUs at the time had poor binary math support, which hampered their performance, whereas an FPGA is just one giant binary math machine.
I have wondered if it is possible to make a mining algorithm FPGA-hard in the same way that RandomX is CPU-hard and memory-hard. Relative to CPUs, though, an FPGA's "programming time" (reconfiguration) cost is high.
My recollection is that ASIC-resistance involves using lots of scratchpad memory and mixing multiple hashing algorithms, so that you'd have to use a lot of silicon and/or bottleneck hard on external RAM. I think the same would hurt FPGAs too.
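If it helps make that concrete, here's a toy Python sketch of the "big scratchpad + mixed hash primitives" idea. To be clear, this is not any real PoW scheme; the names and constants are made up purely for illustration:

```python
import hashlib

# Toy illustration of the "scratchpad + mixed hashes" idea (NOT a real PoW):
# a large scratchpad pushes small dies onto external RAM, and data-dependent
# switching between primitives forces dedicated hardware to carry circuitry
# for all of them.
SCRATCHPAD_SIZE = 1 << 21  # 2 MiB, purely illustrative

def toy_memory_hard_hash(block_header: bytes, rounds: int = 1024) -> bytes:
    pad = bytearray(SCRATCHPAD_SIZE)
    state = hashlib.sha256(block_header).digest()
    algos = (hashlib.sha256, hashlib.sha3_256, hashlib.blake2s)  # all 32-byte digests
    for _ in range(rounds):
        idx = int.from_bytes(state[:4], "little")
        algo = algos[idx % len(algos)]          # data-dependent primitive choice
        offset = idx % (SCRATCHPAD_SIZE - 32)   # data-dependent memory access
        state = algo(state + bytes(pad[offset:offset + 32])).digest()
        pad[offset:offset + 32] = state         # write back so the whole pad must be kept live
    return hashlib.sha256(state + bytes(pad[:32])).digest()

print(toy_memory_hard_hash(b"example header").hex())
```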
I had to return my Vision Pro after trying it for a week. I'm one of those rare customers who genuinely wanted to keep it, because it's the only VR headset I could _actually_ get work done in, thanks to its stellar resolution and overall screen quality, in spite of its many, many flaws. But I had to ditch the thing because: 1) it's stupidly heavy, and 2) it's the only headset that caused me eyestrain.
I was praying for a new revision, but ... this wasn't it. No mention of making the thing lighter. Seems like instead they _added_ weight to the band to compensate.
Guess I'll keep waiting and hoping someone else fills the space. Maybe, just maybe, there will be a real Quest Pro with the same screen quality as the AVP. The Quest 3 is almost perfect in every regard except for that, so I'd happily drop "stupid" money to grab one with an AVP-level display in it. (With the usual caveats of it being an evil Meta product, etc., etc.)
The problem isn't really total weight; it's unbalanced weight. For comparison, see the BoboVR head straps for the Quest (https://www.bobovr.com/products/s3-pro), which look ridiculous and add a lot of total weight (especially with a battery), but are actually more comfortable than going without one, because they spread out and counterbalance the weight of the headset.
The Dual Knit Band may help AVP comfort, but I'm skeptical. To have a significant benefit it would need to have much stiffer side support so that the entire thing can lever the weight of the headset upwards and pull the center of gravity way back from the face.
At least in the U.S., the equality of women in society (and in law) has slowly risen over the last 100 years. Over that same period, the availability of pornographic images has also slowly risen (from magazines, to VHS, to the Internet, to streaming videos, to VR).
So if we're looking at correlation, doesn't the data imply that _more_ porn is associated with _more_ rights for women?
(Conversely, the vast majority of people calling for and enacting policies for more restrictions on pornography are also rolling back rights for women.)
I used almost 100% AI to build a SCUMM-like parser, interpreter, and engine (https://github.com/fpgaminer/scumm-rust). It was a fun workflow; I could generally focus on my usual work and just pop in occasionally to check on and direct the AI.
I used a combination of OpenAI's online Codex and Claude Sonnet 4 in VSCode agent mode. It was nice that Codex was more automated and had an environment it could work in, but its thought-logs are terrible. Iteration was also slow because it takes a while for it to spin the environment up. And while you _can_ have multiple requests running at once, it usually doesn't make sense for a single, somewhat small project.
Sonnet 4's thoughts were much more coherent, and it was fun to watch it work and figure out problems. But there's something broken in VSCode right now that makes its ability to read console output inconsistent, which made things difficult.
The biggest issue I ran into is that both are set up to seek out and read only small parts of the code. While they're generally good at getting enough context, it does cause some degradation in quality. A frequent problem was duplicated CSS styling between the Rust code (which creates all of the HTML elements) and style.css: the model would be working on the Rust code, forget to check style.css, and manually insert styles on the Rust side even though those elements were already styled in style.css.
Codex is also _terrible_ at formatting and will frequently muck things up, so it's mandatory to use it with an autoformatter and instructions to use it. Even with that, Codex will often claim it ran the formatter without actually running it (or run it somewhere in the middle instead of at the end), so its pull requests fail CI. Sonnet never seemed to have this issue and just used the prevailing style it saw in the files.
Now, when I say "almost 100% AI", it's maybe 99%, because I did have to step in and do some edits myself for things that both failed at. In particular, neither can see the actual game running, so they'd make weird mistakes with the design. (Yes, Sonnet in VS Code can see attached images, and potentially the DOM of VSCode's built-in browser, but the vision of all SOTA models is ass, so it's effectively useless.) I also stepped in once to do one major refactor: the AIs had initially decided on a very strange, messy, and buggy interpreter implementation.
Maybe this is an insane idea, but ... how about a spider P2P network?
At least for local AIs it might not be a terrible idea. Basically a distributed cache of the most common sources our bots might pull from. That would mean only a few fetches from each website per day, and then the rest of the bandwidth load can be shared amongst the bots.
Probably lots of privacy issues to work around with such an implementation though.
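Very roughly, each bot's fetch path might look like this sketch; the `shared_cache` dict here is just a stand-in for whatever DHT/P2P store would actually back it, and the TTL is made up:

```python
import time
import urllib.request

# Hypothetical cache-first fetch: consult the shared store before hitting the
# origin site, so each URL only gets pulled from the source a handful of times
# per day across all participating bots.
CACHE_TTL = 6 * 60 * 60   # seconds; allows roughly 4 origin fetches per URL per day
shared_cache = {}          # url -> (fetched_at, body); stand-in for a real P2P store

def fetch(url: str) -> bytes:
    entry = shared_cache.get(url)
    if entry and time.time() - entry[0] < CACHE_TTL:
        return entry[1]    # served from peers, no load on the origin site
    with urllib.request.urlopen(url) as resp:
        body = resp.read()
    shared_cache[url] = (time.time(), body)
    return body
```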
Usability/Performance/etc aside, I get such a sense of magic and wonder with the new Agent mode in VSCode. Watching a little AI actually wander around the code and make decisions on how to accomplish a task. It's so unfathomably cool.
On the vision side of things: I ran my torture test through it, and while it performed "well" (about the same level as 4o and o1), it still fails to handle spatial relationships well and hallucinated some details. OCR seems a little better, but a more thorough OCR-focused test would be needed to know for sure. My torture tests are more focused on accurately describing the content of images.
Both seem to be better at prompt following and have more up-to-date knowledge.
But honestly, if o3 was only at the same level as o1, it'd still be an upgrade since it's cheaper. o1 is difficult to justify in the API due to cost.
The benchmark is a bit specific, but challenging. It's a prompt optimization task where the model iteratively writes a prompt, the prompt gets evaluated and scored from 0 to 100, and then the model can try again given the feedback. The whole process occurs in one conversation with the model, so it sees its previous attempts and their scores. In other words, it has to do Reinforcement Learning on the fly.
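In pseudo-Python, the setup boils down to something like this rough sketch (not the actual harness; `call_model` and `score_prompt` are placeholders for the LLM API call and the evaluator):

```python
# Sketch of the iterative, conversational prompt-optimization loop.
# call_model(messages) -> str and score_prompt(prompt) -> float (0-100)
# are placeholders; the real harness also retries malformed responses.
def optimize_prompt(call_model, score_prompt, rounds: int = 20) -> tuple[str, float]:
    messages = [{
        "role": "user",
        "content": "Write a prompt for the task. After each attempt you'll be "
                   "given a score from 0 to 100; use the feedback to improve.",
    }]
    best_prompt, best_score = "", 0.0
    for _ in range(rounds):
        candidate = call_model(messages)      # model proposes a new prompt
        score = score_prompt(candidate)       # evaluated on the task, 0-100
        if score > best_score:
            best_prompt, best_score = candidate, score
        # Feedback stays in the same conversation, so the model sees all of
        # its previous attempts and their scores.
        messages.append({"role": "assistant", "content": candidate})
        messages.append({"role": "user", "content": f"Score: {score:.1f}. Try again."})
    return best_prompt, best_score
```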
Quasar did barely better than 4o. I was also surprised to see the thinking variant of Sonnet not provide any benefit. Both Gemini and ChatGPT benefit from their thinking modes. Normal Sonnet 3.7 does do a lot of thinking in its responses by default though, even without explicit prompting, which seems to help it a lot.
Quasar was also very unreliable and frequently did not follow instructions. I had the whole process automated, and the automation would retry a request if the response was incorrect. Quasar took, on average, 4 retries on the first round before it caught on to what it was supposed to be doing. None of the other models had that difficulty, and almost all other retries were the result of a model re-using an existing prompt.
Based on the logs, I'd say only o3-mini and the models above it were genuinely optimizing. By that I mean they continued to try new things, tweak the prompts in subtle ways to see what happened, and consistently introspect on the patterns they were observing. That enabled all of those models to continuously find better and better prompts. In a separate manual run I let Gemini 2.5 Pro go for longer, and it was eventually able to get a prompt to a score of 100.
EDIT: But yes, to the article's point, Quasar was the fastest of all the models, hands down. That does have value on its own.
Didn't they say they were going to open-source some model? "Fast and good but not too cutting-edge" would be a good candidate for a "token model" to open-source without meaningfully hurting your own bottom line.
I'd be pleasantly surprised - GPT-4o is their bread and butter (it powers paid ChatGPT), and Quasar seems to be slightly ahead on benchmarks at similar or lower latency (so, very roughly, it might be cheaper to run).
Are you willing to share this code? I'm working on a project where I'm optimizing the prompt manually, and I wonder if it could be automated. I guess I'd have to find a way to actually objectively measure the output quality.
That's the model automation. To evaluate the prompts it suggests, I have a sample of my dataset with 128 examples. For this particular run, all I cared about was optimizing a prompt for Llama 3.1 that would get it to write responses like those I'm finetuning for. That way the finetuning has a better starting point.
So to evaluate how effective a given prompt is, I go through each example, run <user>prompt</user><assistant>response</assistant> (in the proper format, of course) through Llama 3.1, and measure the NLL on the assistant portion. I then have a simple linear formula to convert the NLL to a score between 0 and 100, scaled based on typical NLL values. It should _probably_ be a non-linear formula, but I'm lazy.
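Roughly, that scoring step looks like the sketch below, assuming a Hugging Face Llama 3.1 checkpoint; the exact model, the chat-template handling, and the NLL_WORST/NLL_BEST range are illustrative assumptions, not my exact setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"   # assumed checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")

NLL_WORST, NLL_BEST = 4.0, 1.0   # "typical" per-token NLL range; purely illustrative

@torch.no_grad()
def score_prompt(prompt: str, examples: list[dict]) -> float:
    """Mean per-token NLL of the assistant responses, mapped linearly onto 0-100."""
    nlls = []
    for ex in examples:
        msgs = [{"role": "user", "content": prompt},
                {"role": "assistant", "content": ex["response"]}]
        full = tok.apply_chat_template(msgs, tokenize=True, return_tensors="pt").to(model.device)
        # Length of the user turn plus the assistant header, so only the
        # assistant's tokens contribute to the loss.
        prefix_len = tok.apply_chat_template(msgs[:1], tokenize=True,
                                             add_generation_prompt=True,
                                             return_tensors="pt").shape[1]
        labels = full.clone()
        labels[:, :prefix_len] = -100
        nlls.append(model(full, labels=labels).loss.item())
    mean_nll = sum(nlls) / len(nlls)
    # Simple linear map from the assumed NLL range onto 0-100, clamped.
    return max(0.0, min(100.0, 100.0 * (NLL_WORST - mean_nll) / (NLL_WORST - NLL_BEST)))
```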
Another approach to prompt optimization is to give the model something like:
> I have some texts along with their corresponding scores. The texts are arranged in ascending order based on their scores from worst (low score) to best (higher score).
>
> Text: {text0}
> Score: {score0}
>
> Text: {text1}
> Score: {score1}
>
> ...
>
> Thoroughly read all of the texts and their corresponding scores.
> Analyze the texts and their scores to understand what leads to a high score. Don't just look for literal patterns of words/tokens. Extensively research the data until you understand the underlying mechanisms that lead to high scores. The underlying, internal relationships. Much like how an LLM is able to predict the token not just from the literal text but also by understanding very complex relationships of the "tokens" between the tokens.
> Take all of the texts into consideration, not just the best.
> Solidify your understanding of how to optimize for a high score.
> Demonstrate your deep and complete understanding by writing a new text that maximizes the score and is better than all of the provided texts.
> Ideally the new text should be under 20 words.
Or some variation thereof. That's the "one off" approach where you don't keep a conversation with the model and instead just call it again with the updated scores. Supposedly that's "better" since the texts are in ascending order, letting the model easily track improvements, but I've had far better luck with the iterative, conversational approach.
Also, the constraint on how long the "new text" can be is important, as all models have a tendency to write longer and longer prompts with each iteration.
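For completeness, the one-off variant just rebuilds a meta-prompt like the one above from scratch each round, with the scored texts sorted in ascending order. Something like this sketch (template abridged, names purely illustrative):

```python
# One-off meta-prompt construction: no conversation history, just all the
# scored texts so far, sorted ascending so improvements are easy to track.
META_HEADER = ("I have some texts along with their corresponding scores. The texts are "
               "arranged in ascending order based on their scores from worst (low score) "
               "to best (higher score).\n\n")
META_FOOTER = ("\nThoroughly read all of the texts and their corresponding scores, and "
               "write a new text that maximizes the score and is better than all of the "
               "provided texts. Ideally the new text should be under 20 words.")  # abridged

def build_meta_prompt(scored_texts: list[tuple[str, float]]) -> str:
    body = "".join(f"Text: {text}\nScore: {score:.0f}\n\n"
                   for text, score in sorted(scored_texts, key=lambda p: p[1]))
    return META_HEADER + body + META_FOOTER
```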