Since Jim Keller is quoted in the article, it would be good to point out that his new company makes heavy use of RISC-V cores [1] and has licensed its designs out [2] to other companies. It seems to have been the right choice for them.
As a consumer maybe that's true, but as a developer or a company making tech, you really can target ARM or RISC-V if you care to, and in a huge number of cases it isn't hard at all.
It still kinda makes me sad that the discussion always stops there. Translation will still break whole classes of programs; it's not magic, and it can be either finicky or just too limited.
Off the top of my head, USB and peripheral access in general is a PITA every time a translation layer is added. Raw device access under emulation has never worked "pretty well".
For Mac it wasn't that much of an issue because anyone with deeper hardware needs was probably not there in the first place. But on Windows I'd predict it will be a significant source of transition pain. A bigger share of people will stick with x86 basically forever to cover their long-tail needs.
I’m guessing from your question you did not bother reading the article before commenting.
That said, this is a solved problem - plenty of people have ARM-based machines and run x86 binaries on them (it obviously comes with a perf cost, but there's only so much that can be done when dealing with poor engineering).
Sure. Multiplication is exceedingly rare compared to any of those 8 instructions, though. The RISC-V base instruction set doesn't include multiplication, for example. If you had room to add more than 8, I would rather add the bitops first. That would also let you write a much more efficient soft mul, in addition to many other things.
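For example, a minimal shift-and-add soft multiply needs nothing beyond add, shift, AND, and a branch (a sketch, assuming 32-bit operands; not from the thread):

    #include <stdint.h>

    /* Shift-and-add soft multiply: only needs add, shift, AND and a branch,
       the kind of routine you fall back on when the ISA has no hardware
       multiplier (e.g. the RISC-V base set without the M extension). */
    uint32_t soft_mul(uint32_t a, uint32_t b) {
        uint32_t result = 0;
        while (b != 0) {
            if (b & 1)       /* lowest multiplier bit set: add the partial product */
                result += a;
            a <<= 1;         /* next partial product */
            b >>= 1;         /* consume one multiplier bit */
        }
        return result;
    }

With extra bit ops such as count-trailing-zeros you can skip over runs of zero bits instead of looping over every bit, which is roughly the kind of speedup being alluded to.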
Isn't it kind of obvious that the instruction set itself cannot directly impact performance, since operations are just single clock cycles? I assume the issue is more about the difficulties of implementation, in particular how x86 has a lot of complicated archaic features and corner cases that just absorb space on silicon, not to mention engineering effort. Apple, being vertically integrated, is free to toss features at will.
This is a pretty simplistic view of how processors execute instructions. They reorder, merge, and speculate on execution. Most instructions do not take a single cycle to complete; they often take less or more (and even this depends on how you look at it, since they can take less if you overlap them and more if you look at latency). While this doesn't mean the ISA matters as much as people might think, how CPUs work is also more complicated than people might think.
Independent operations can be performed in parallel, with an effective cost of a fraction of a clock cycle each. However, this is not something that depends on the instruction set.
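To see what that means in practice, here's a small sketch (function names made up; not from the thread) of the same reduction written as one long dependency chain versus four independent chains that an out-of-order core can overlap, regardless of the ISA it's decoding:

    #include <stddef.h>
    #include <stdint.h>

    /* One long dependency chain: every add waits on the previous one,
       so the loop runs at the latency of the add. */
    uint64_t sum_serial(const uint64_t *v, size_t n) {
        uint64_t s = 0;
        for (size_t i = 0; i < n; i++)
            s += v[i];
        return s;
    }

    /* Four independent accumulators: the adds don't depend on each other,
       so the out-of-order core can issue several per cycle. */
    uint64_t sum_parallel(const uint64_t *v, size_t n) {
        uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += v[i];
            s1 += v[i + 1];
            s2 += v[i + 2];
            s3 += v[i + 3];
        }
        for (; i < n; i++)   /* leftover elements */
            s0 += v[i];
        return s0 + s1 + s2 + s3;
    }

Any gap you measure between the two comes from the width and scheduling of the core, not from whether the loop was encoded as x86 or ARM instructions.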
The ISA can massively impact performance on relevant real-world workloads, but the split isn't along the mythical RISC vs. CISC divide. Take H.264 encoding as a contrived example and try it with different x86 ISA levels (e.g. no SIMD vs. SSE2 vs. SSE4.2 vs. AVX2 vs. AVX-512) on the same Zen 5 desktop CPU. You can't just build a 32-wide OoO core and have it run at a useful clock speed.
Even if you ignore specialized vector operations, the additional registers are a huge boost in the updated x86 architectures. Not to mention memory pipelining in the cache. But none of this is RISC vs CISC, really.
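To make the ISA-level point concrete, here's a minimal sketch (function names made up; not from the article) of the same loop written scalar and with SSE intrinsics; redoing it with AVX2 or AVX-512 intrinsics on the same core is roughly what the per-ISA-level comparison above boils down to:

    #include <stddef.h>
    #include <xmmintrin.h>   /* SSE intrinsics */

    /* Scalar version: what the compiler is limited to if the target
       ISA level has no SIMD at all. */
    void add_scalar(const float *a, const float *b, float *out, size_t n) {
        for (size_t i = 0; i < n; i++)
            out[i] = a[i] + b[i];
    }

    /* SSE version: four floats per instruction. AVX2 and AVX-512 widen
       this to 8 and 16 lanes, which is where the big per-ISA-level gaps
       on codec-style kernels come from -- same core, different extensions. */
    void add_sse(const float *a, const float *b, float *out, size_t n) {
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
        }
        for (; i < n; i++)   /* scalar tail */
            out[i] = a[i] + b[i];
    }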
One could argue that Lunar Lake is exactly that. At the least, it and arguably AMD's monolithic mobile SoCs starting with Zen 2 have shown that one can be competitive in efficiency with ARM equivalents. The 4800U and M2 were more closely matched than I feel many in the media gave them credit for.
Lunar Lake now appears more than competitive in that regard with even the higher-end M3 SKUs, though of course M4 Pro and Max are around the corner. Apple, more than the ARM ISA, seems responsible for the prevailing impression that ARM is inherently more efficient, because Apple was simply the first to put such major investment into targeting mobile SKUs specifically, rather than scaling down from server-focused products. Snapdragon X shows this quite impressively: it is ARM-based, yet efficiency-wise the Apple, Intel, and AMD SKUs appear more competitive.
This [variable-length coding] is a disadvantage for x86, yet it doesn’t really matter for high performance CPUs ...
Why are they leaving out the difficulty of building a compiler backend when the ISA has variable-length encodings? I would assume an ISA needs to consider its burden on compiler authors. (Itanium is an extreme example of an ISA too tedious for compiler authors.)
I once extended a Common Lisp compiler to emit machine code for SSE4.2 instructions (specifically minss and maxss). The experience was a bit bad due to subtle differences in prefixes and specific fields needing to be set to activate some mode for SSE4.2 instructions.
Now suppose you want to debug a compiler backend targeting x86. Good luck: x86 disassembly is an undecidable problem because you don't know where the instructions start. Meanwhile ARM has two kinds of instructions (full-size and half-size, called "thumb"). Thanks to fixed-length instructions and alignment rules, you always know whether you're in thumb or not. Emitting machine code (and disassembling it) is much more straightforward.
The variable-length instructions make very little difference in building a compiler at this point, that it's practically negligible. If you look at GCC or LLVM, while there is some extra code to deal with the variable length encoding, it's not anywhere near big enough of a challenge for them.
> I once extended a Common Lisp compiler to emit machine code for SSE4.2 instructions (specifically minss and maxss). The experience was a bit bad due to subtle differences in prefixes and specific fields needing to be set to activate some mode for SSE4.2 instructions.
I assume this was some toy compiler or a non-optimizing compiler. LLVM or GCC (or any other industrial-strength optimizing compiler) has no trouble whatsoever dealing with any of that. The difficulty with more complex instructions like vector instructions is in optimization, i.e. being able to find the code patterns that can take advantage of them, and that has nothing whatsoever to do with variable-length encoding, prefixes, or knowledge of the instruction set itself. If the program is already written for it, e.g. using intrinsics, emitting and mapping to the machine code is trivial, regardless of how complex the instruction encoding rules are.
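As a small illustration of that (a sketch, not from the thread; clamp_ss is a made-up helper name): the scalar min/max instructions the grandparent mentions have intrinsic forms, and the compiler deals with all the prefix and encoding details:

    #include <xmmintrin.h>   /* _mm_min_ss / _mm_max_ss intrinsics */

    /* Clamp a float using the scalar min/max instructions. The compiler
       picks the prefixes and encoding bytes; the variable-length aspect
       of x86 never surfaces at this level. */
    float clamp_ss(float x, float lo, float hi) {
        __m128 v = _mm_set_ss(x);
        v = _mm_max_ss(v, _mm_set_ss(lo));   /* maxss */
        v = _mm_min_ss(v, _mm_set_ss(hi));   /* minss */
        return _mm_cvtss_f32(v);
    }

At -O2 both GCC and Clang typically lower this straight to a maxss/minss pair; none of the backend's encoding complexity leaks out to the person writing the code.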
Compilers don’t have to disassemble instructions, so the fact that x86 is hard to disassemble doesn’t matter. And it’s not actually hard in practice if the binary contains labels for function starts.
Neither x86 nor ARM is easier to compile for, because they both have different quirks that cause headaches. x86 has destructive two-operand forms, which require more work. ARM has restrictions on conditional-jump distance and immediate encodings, which require more work. So an instruction set that had no two-operand forms like x86's and that allowed any instruction to take any size of immediate would probably be the easiest to compile for. But allowing any instruction to have a 32-bit or even 64-bit immediate would lead either to all instructions being huge or to variable-length instructions. x86's use of variable-length instructions allows 32-bit immediates almost everywhere, so in that sense it makes compiling easier.
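To make the immediate-encoding trade-off concrete, here's a small C sketch (the codegen notes in the comments are assumptions about typical compiler output, worth verifying with -S):

    #include <stdint.h>

    /* Typical codegen (assumption -- check your compiler's -S output):
       x86-64 can encode the full 64-bit constant in a single variable-length
       instruction (a 10-byte movabs), while AArch64, with fixed 4-byte
       instructions, synthesizes it from a movz plus up to three movk
       instructions, 16 bits at a time. */
    uint64_t big_constant(void) {
        return 0x0123456789ABCDEFULL;
    }

    /* 0x12345678 fits x86's 4-byte immediate field directly in the add,
       but doesn't fit AArch64's 12-bit add immediate, so it typically gets
       materialized into a register first (mov + movk) and then added. */
    uint32_t add_constant(uint32_t x) {
        return x + 0x12345678u;
    }

Same semantics either way; the difference is encoding density versus decode simplicity, which is exactly the trade-off being argued about here.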
> Thanks to fixed-length instructions and alignment rules, you always know whether you're in thumb or not.
As someone who did this literally yesterday: no, you do not. Doing this properly requires tracing control flow, which is an undecidable problem. In fact, x86-64 is way easier in practice because real code is somewhat self-synchronizing if you just disassemble linearly; for ARM this doesn't work nearly as well.
> Why are they leaving out the difficulty of building a compiler backend when the ISA has variable-length encodings?
Because compiling into machine code and decoding machine code in the CPU operate under different time constraints. One can even argue that a compiler has infinite time (however impractical that is) to translate a programming language into binary code, whereas the CPU does not have the same luxury, hence the haggling over ISAs. So it makes sense to leave the compiler code generator out of the picture.
> Now suppose you want to debug a compiler backend targeting x86. Good luck: x86 disassembly is an undecidable problem because you don't know where the instructions start. Meanwhile ARM has two kinds of instructions (full-size and half-size, called "thumb"). Thanks to fixed-length instructions and alignment rules, you always know whether you're in thumb or not.
This juxtaposition is (I assume unintentionally) hilarious. ARM is the ISA where you might reasonably see two different ISAs (and thus a disassembler needs to handle both) in the same object file, where x86 only has one, at least since the mid-90's. Yet somehow x86 is undecidable and ARM is straightforward!
In reality, the problem you're talking about for x86 is trivially solvable with what's known as recursive disassembly. You start disassembling from known function locations (since you mention debugging a compiler, that means your binary should be fully symbolized, but finding common function entry points for an unsymbolized binary isn't that difficult of a challenge), and then continue disassembling instructions until you get to various kinds of jump instructions. Then you add branch locations to the list, and rinse and repeat until you're out of new locations to disassemble.
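A rough sketch of that worklist loop, assuming a hypothetical decode_at() single-instruction decoder (any off-the-shelf x86 decoder would fill that role; the struct layout and names here are made up for illustration):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define WORKLIST_MAX 1024

    typedef struct {
        size_t length;       /* bytes consumed by this instruction */
        bool   has_target;   /* direct jmp/jcc/call: queue the target */
        size_t target;       /* branch target offset, valid if has_target */
        bool   ends_block;   /* ret or unconditional jmp: stop linear decoding */
    } Insn;

    /* Hypothetical: decode one instruction at `off`, return false on bad bytes. */
    bool decode_at(const uint8_t *code, size_t size, size_t off, Insn *out);

    void disassemble(const uint8_t *code, size_t size,
                     const size_t *entries, size_t n_entries,
                     bool *visited /* `size` flags, zero-initialized */) {
        size_t worklist[WORKLIST_MAX];
        size_t top = 0;

        /* Seed with known function starts (symbols, exports, entry point). */
        for (size_t i = 0; i < n_entries && top < WORKLIST_MAX; i++)
            worklist[top++] = entries[i];

        while (top > 0) {
            size_t off = worklist[--top];
            while (off < size && !visited[off]) {
                Insn insn;
                if (!decode_at(code, size, off, &insn))
                    break;                          /* undecodable: stop this path */
                visited[off] = true;                /* off is a real instruction start */
                if (insn.has_target && insn.target < size && top < WORKLIST_MAX)
                    worklist[top++] = insn.target;  /* rinse and repeat from the target */
                if (insn.ends_block)
                    break;                          /* ret / unconditional jmp ends the run */
                off += insn.length;                 /* otherwise keep decoding linearly */
            }
        }
    }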
> Here in this article we’ll bring together research, comments from people who are very familiar with CPUs, and a bit of our in-house data to show why focusing on the ISA is a waste of time
Very amateurish question and counterpoint from me: how are Apple M1 performance and power consumption so good compared to other laptop CPUs, then?
It's not the counterpoint you think it is.
ARM is not the primary reason why Apple's M series is so much better. If it were, other ARM chips should be able to match it, but they don't. Simply put, Apple built a good microarchitecture and pipeline: the M1 has a roughly 600-entry reorder buffer, which back then was about 2x that of most other CPUs, regardless of ISA. That's not something you get just by throwing more transistors at the problem; you need all the other parts of the processor optimized to make it useful, and Apple did that. The M series is simply one of the best out-of-order pipelines ever built.
Apple absolutely leverages their full-stack platform control to optimize their silicon. They know with insane accuracy how many cycles would be saved across a billion phones if they added (say) another integer divide unit, and if they decide that’s worth it, they can rev every major developer’s toolchain to schedule instructions for the new ALU months before the new chip ships.
Apple doesn’t do this, as it would reveal details about unreleased products. The scheduling model for their CPUs typically drops months after the chips ship, and even then the benefit you get from that is minor. (You can try it yourself, by comparing code generated using the “wrong” microarchitecture versus the “right” one. And also keep in mind that developers don’t always update their apps immediately to use the new toolchain.)
Silicon economics mean that whoever has the highest volume can invest the most in better designs and better process.
Apple gets volume from the iPhone.
This is also why x86 outperformed expensive (and so low-volume) RISC workstation CPUs for a long time on single-thread perf, and then eventually on multi-thread perf too.
This "why" is more "what are the underlying economics that enable the development of the better chip". Economics doesn't guarantee you'll get a better CPU. It still takes a cutting-edge engineering team with a good CPU architect to build that best chip. e.g. Intel didn't fall behind because of lack of money.
Hard to find reliable numbers, but it looks like the number of x86 chips sold per year is less than the number of iPhones sold per year, based on my totally not serious research.
Oh, I would probably agree. I'm just saying Intel also ships a lot of chips, and it's not like they are hurting for cash from those sales. A direct comparison would be difficult: the BOM of the average iPhone processor is surely much lower than that of an Intel chip, but Apple sells a lot more of them, and Apple's revenue is mixed into a lot of other things, so looking at just the processor isn't really possible, since the money goes into making iPhones, not just chips. If you looked at overall R&D investment on processors in dollars, though, I think the numbers would look a lot closer.
I think that process and the fabs influence things much much more than the microarch design. In fact, what microarch you can aim for (like, how wide it can be, and so how aggressive you get to be with all aspects of it) comes down to how good the process is.
So the comparison is Intel vs TSMC. If you consider that TSMC doesn’t just have Apple as a customer, then holy cow the difference is big.
[1] https://www.tomshardware.com/news/tenstorrent-shares-roadmap...
[2] https://www.anandtech.com/show/21281/tenstorrent-licenses-ri...