Since Jim Keller is quoted in the article, it would be good to point out that his new company makes heavy use of RISC-V cores [1] and has licensed its designs out [2] to other companies. It seems to have been the right choice for them.
As a consumer maybe that's true, but as a developer or a company making tech, you really can target ARM or RISC-V if you care to, and in a huge number of cases it isn't hard at all.
It still kinda makes me sad that the discussion always stops there. Translation will still break whole classes of programs; it's not magic, and it can be either finicky or just too limited.
Off the top of my head, USB and peripheral access in general is a PITA every time a translation layer is added. Raw device access under emulation has never worked "pretty well".
For Mac it wasn't that much of an issue because anyone with deeper hardware needs was probably not there in the first place. But on Windows I'd predict it will be a significant source of transition pain. A bigger share of people will stick with x86 basically forever to cover their long-tail needs.
I’m guessing from your question you did not bother reading the article before commenting.
That said, this is a solved problem - plenty of people have ARM-based machines and run x86 binaries on them (it obviously comes with a perf cost, but there's only so much that can be done when dealing with poor engineering).
Sure. Multiplication is exceedingly rare compared to any of those 8 instructions, though. The RISC-V base instruction set doesn't include multiplication, for example. If you had room to add more than 8, I would rather add the bitops first. That would also let you write a much more efficient soft mul, in addition to many other things.
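For example, a minimal shift-and-add soft multiply needs nothing beyond add, shift, AND, and a branch (a sketch, assuming 32-bit operands; not from the thread):

    #include <stdint.h>

    /* Shift-and-add soft multiply: only needs add, shift, AND and a branch,
       the kind of routine you fall back on when the ISA has no hardware
       multiplier (e.g. the RISC-V base set without the M extension). */
    uint32_t soft_mul(uint32_t a, uint32_t b) {
        uint32_t result = 0;
        while (b != 0) {
            if (b & 1)       /* lowest multiplier bit set: add the partial product */
                result += a;
            a <<= 1;         /* next partial product */
            b >>= 1;         /* consume one multiplier bit */
        }
        return result;
    }

With extra bit ops such as count-trailing-zeros you can skip over runs of zero bits instead of looping over every bit, which is roughly the kind of speedup being alluded to.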
Isn't it kind of obvious that the instruction set itself cannot directly impact performance, since operations are just single clock cycles? I assume the issue is more about the difficulties of implementation, in particular how x86 has a lot of complicated archaic features and corner cases that just absorb space on silicon, not to mention engineering effort. Apple, being vertically integrated, is free to toss features at will.
This is a pretty simplistic view of how processors execute instructions. They reorder, merge, and speculate on execution. Most instructions do not take a single cycle to complete; they often take less or more (and even this depends on how you look at it, since they can take less if you overlap them and more if you look at latency). While this doesn't mean the ISA matters as much as people might think, how CPUs work is also more complicated than people might think.
Independent operations can be performed in parallel, with an effective cost of a fraction of a clock cycle each. However, this is not something that depends on the instruction set.
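To see what that means in practice, here's a small sketch (function names made up; not from the thread) of the same reduction written as one long dependency chain versus four independent chains that an out-of-order core can overlap, regardless of the ISA it's decoding:

    #include <stddef.h>
    #include <stdint.h>

    /* One long dependency chain: every add waits on the previous one,
       so the loop runs at the latency of the add. */
    uint64_t sum_serial(const uint64_t *v, size_t n) {
        uint64_t s = 0;
        for (size_t i = 0; i < n; i++)
            s += v[i];
        return s;
    }

    /* Four independent accumulators: the adds don't depend on each other,
       so the out-of-order core can issue several per cycle. */
    uint64_t sum_parallel(const uint64_t *v, size_t n) {
        uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += v[i];
            s1 += v[i + 1];
            s2 += v[i + 2];
            s3 += v[i + 3];
        }
        for (; i < n; i++)   /* leftover elements */
            s0 += v[i];
        return s0 + s1 + s2 + s3;
    }

Any gap you measure between the two comes from the width and scheduling of the core, not from whether the loop was encoded as x86 or ARM instructions.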
The ISA can massively impact performance on relevant real-world workloads, but the split isn't along the mythical RISC vs. CISC divide. Take H.264 encoding as a contrived example and try it with different x86 ISA levels (e.g. no SIMD vs. SSE2 vs. SSE4.2 vs. AVX2 vs. AVX-512) on the same Zen 5 desktop CPU. You can't just build a 32-wide OoO core and have it run at a useful clock speed.
Even if you ignore specialized vector operations, the additional registers are a huge boost in the updated x86 architectures. Not to mention memory pipelining in the cache. But none of this is RISC vs CISC, really.
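To make the ISA-level point concrete, here's a minimal sketch (function names made up; not from the article) of the same loop written scalar and with SSE intrinsics; redoing it with AVX2 or AVX-512 intrinsics on the same core is roughly what the per-ISA-level comparison above boils down to:

    #include <stddef.h>
    #include <xmmintrin.h>   /* SSE intrinsics */

    /* Scalar version: what the compiler is limited to if the target
       ISA level has no SIMD at all. */
    void add_scalar(const float *a, const float *b, float *out, size_t n) {
        for (size_t i = 0; i < n; i++)
            out[i] = a[i] + b[i];
    }

    /* SSE version: four floats per instruction. AVX2 and AVX-512 widen
       this to 8 and 16 lanes, which is where the big per-ISA-level gaps
       on codec-style kernels come from -- same core, different extensions. */
    void add_sse(const float *a, const float *b, float *out, size_t n) {
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
        }
        for (; i < n; i++)   /* scalar tail */
            out[i] = a[i] + b[i];
    }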
One could argue that Lunar Lake is exactly that. At the least, it and arguably AMD's monolithic mobile SoCs starting with Zen 2 have shown that one can be competitive in efficiency with ARM equivalents. The 4800U and M2 were more closely matched than I feel many in the media gave them credit for.
Lunar Lake now appears more than competitive in that regard with even the higher-end M3 SKUs, though of course M4 Pro and Max are around the corner. Apple, more than the ARM ISA, seems responsible for the prevailing impression that ARM is inherently more efficient, because Apple was simply the first to put such major investment into targeting mobile SKUs specifically, rather than scaling down from server-focused products. Snapdragon X shows this quite impressively: it is ARM-based, yet efficiency-wise the Apple, Intel, and AMD SKUs appear more competitive.
This [variable-length coding] is a disadvantage for x86, yet it doesn’t really matter for high performance CPUs ...
Why are they leaving out the difficulty of building a compiler backend when the ISA has variable-length encodings? I would assume an ISA needs to consider its burden on compiler authors. (Itanium is an extreme example of an ISA too tedious for compiler authors.)
I once extended a Common Lisp compiler to emit machine code for SSE4.2 instructions (specifically minss and maxss). The experience was a bit bad due to subtle differences in prefixes and specific fields needing to be set to activate some mode for SSE4.2 instructions.
Now suppose you want to debug a compiler backend targeting x86. Good luck: x86 disassembly is an undecidable problem because you don't know where the instructions start. Meanwhile ARM has two kinds of instructions (full-size and half-size, called "thumb"). Thanks to fixed-length instructions and alignment rules, you always know whether you're in thumb or not. Emitting machine code (and disassembling it) is much more straightforward.
The variable-length instructions make very little difference in building a compiler at this point, that it's practically negligible. If you look at GCC or LLVM, while there is some extra code to deal with the variable length encoding, it's not anywhere near big enough of a challenge for them.
> I once extended a Common Lisp compiler to emit machine code for SSE4.2 instructions (specifically minss and maxss). The experience was a bit bad due to subtle differences in prefixes and specific fields needing to be set to activate some mode for SSE4.2 instructions.
I assume this was some toy compiler or a non-optimizing compiler. LLVM or GCC (or any other industrial-strength optimizing compiler) has no trouble whatsoever dealing with any of that. The difficulty with more complex instructions like vector instructions is in optimization, i.e. being able to find the code patterns that can take advantage of them, and that has nothing whatsoever to do with variable-length encoding, prefixes, or knowledge of the instruction set itself. If the program is already written for it, e.g. using intrinsics, emitting and mapping to the machine code is trivial, regardless of how complex the instruction encoding rules are.
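As a small illustration of that (a sketch, not from the thread; clamp_ss is a made-up helper name): the scalar min/max instructions the grandparent mentions have intrinsic forms, and the compiler deals with all the prefix and encoding details:

    #include <xmmintrin.h>   /* _mm_min_ss / _mm_max_ss intrinsics */

    /* Clamp a float using the scalar min/max instructions. The compiler
       picks the prefixes and encoding bytes; the variable-length aspect
       of x86 never surfaces at this level. */
    float clamp_ss(float x, float lo, float hi) {
        __m128 v = _mm_set_ss(x);
        v = _mm_max_ss(v, _mm_set_ss(lo));   /* maxss */
        v = _mm_min_ss(v, _mm_set_ss(hi));   /* minss */
        return _mm_cvtss_f32(v);
    }

At -O2 both GCC and Clang typically lower this straight to a maxss/minss pair; none of the backend's encoding complexity leaks out to the person writing the code.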
Compilers don’t have to disassemble instructions, so the fact that x86 is hard to disassemble doesn’t matter. And it’s not actually hard in practice if the binary contains labels for function starts.
Neither x86 nor ARM is easier to compile for, because they both have different quirks that cause headaches. x86 has destructive two-operand forms, which require more work. ARM has restrictions on conditional-jump distance and immediate encodings, which require more work. So an instruction set that had no two-operand forms like x86's and that allowed any instruction to take any size of immediate would probably be the easiest to compile for. But allowing any instruction to have a 32-bit or even 64-bit immediate would lead either to all instructions being huge or to variable-length instructions. x86's use of variable-length instructions allows 32-bit immediates almost everywhere, so in that sense it makes compiling easier.
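To make the immediate-encoding trade-off concrete, here's a small C sketch (the codegen notes in the comments are assumptions about typical compiler output, worth verifying with -S):

    #include <stdint.h>

    /* Typical codegen (assumption -- check your compiler's -S output):
       x86-64 can encode the full 64-bit constant in a single variable-length
       instruction (a 10-byte movabs), while AArch64, with fixed 4-byte
       instructions, synthesizes it from a movz plus up to three movk
       instructions, 16 bits at a time. */
    uint64_t big_constant(void) {
        return 0x0123456789ABCDEFULL;
    }

    /* 0x12345678 fits x86's 4-byte immediate field directly in the add,
       but doesn't fit AArch64's 12-bit add immediate, so it typically gets
       materialized into a register first (mov + movk) and then added. */
    uint32_t add_constant(uint32_t x) {
        return x + 0x12345678u;
    }

Same semantics either way; the difference is encoding density versus decode simplicity, which is exactly the trade-off being argued about here.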
> Thanks to fixed-length instructions and alignment rules, you always know whether you're in thumb or not.
As someone who did this literally yesterday: no, you do not. Doing this properly requires tracing control flow, which is an undecidable problem. In fact, x86-64 is way easier in practice because real code is somewhat self-synchronizing if you just disassemble linearly; for ARM this doesn't work nearly as well.
> Why are they leaving out the difficulty of building a compiler backend when the ISA has variable-length encodings?
Because compiling into machine code and decoding machine code in the CPU operate under different time constraints. One can even argue that a compiler has infinite time (however impractical that is) to translate a programming language into binary code, whereas the CPU does not have the same luxury, hence the haggling over ISAs. So it makes sense to leave the compiler code generator out of the picture.
> Now suppose you want to debug a compiler backend targeting x86. Good luck: x86 disassembly is an undecidable problem because you don't know where the instructions start. Meanwhile ARM has two kinds of instructions (full-size and half-size, called "thumb"). Thanks to fixed-length instructions and alignment rules, you always know whether you're in thumb or not.
This juxtaposition is (I assume unintentionally) hilarious. ARM is the ISA where you might reasonably see two different ISAs (and thus a disassembler needs to handle both) in the same object file, where x86 only has one, at least since the mid-90's. Yet somehow x86 is undecidable and ARM is straightforward!
In reality, the problem you're talking about for x86 is trivially solvable with what's known as recursive disassembly. You start disassembling from known function locations (since you mention debugging a compiler, that means your binary should be fully symbolized, but finding common function entry points for an unsymbolized binary isn't that difficult of a challenge), and then continue disassembling instructions until you get to various kinds of jump instructions. Then you add branch locations to the list, and rinse and repeat until you're out of new locations to disassemble.
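A rough sketch of that worklist loop, assuming a hypothetical decode_at() single-instruction decoder (any off-the-shelf x86 decoder would fill that role; the struct layout and names here are made up for illustration):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define WORKLIST_MAX 1024

    typedef struct {
        size_t length;       /* bytes consumed by this instruction */
        bool   has_target;   /* direct jmp/jcc/call: queue the target */
        size_t target;       /* branch target offset, valid if has_target */
        bool   ends_block;   /* ret or unconditional jmp: stop linear decoding */
    } Insn;

    /* Hypothetical: decode one instruction at `off`, return false on bad bytes. */
    bool decode_at(const uint8_t *code, size_t size, size_t off, Insn *out);

    void disassemble(const uint8_t *code, size_t size,
                     const size_t *entries, size_t n_entries,
                     bool *visited /* `size` flags, zero-initialized */) {
        size_t worklist[WORKLIST_MAX];
        size_t top = 0;

        /* Seed with known function starts (symbols, exports, entry point). */
        for (size_t i = 0; i < n_entries && top < WORKLIST_MAX; i++)
            worklist[top++] = entries[i];

        while (top > 0) {
            size_t off = worklist[--top];
            while (off < size && !visited[off]) {
                Insn insn;
                if (!decode_at(code, size, off, &insn))
                    break;                          /* undecodable: stop this path */
                visited[off] = true;                /* off is a real instruction start */
                if (insn.has_target && insn.target < size && top < WORKLIST_MAX)
                    worklist[top++] = insn.target;  /* rinse and repeat from the target */
                if (insn.ends_block)
                    break;                          /* ret / unconditional jmp ends the run */
                off += insn.length;                 /* otherwise keep decoding linearly */
            }
        }
    }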
> Here in this article we’ll bring together research, comments from people who are very familiar with CPUs, and a bit of our in-house data to show why focusing on the ISA is a waste of time
Very amateurish question and counterpoint from me: how are Apple M1 performance and power consumption so good compared to other laptop CPUs, then?
It's not the counterpoint you think it is.
ARM is not the primary reason why Apple's M series is so much better. If it were, other ARM chips should be able to match it, but they don't. Simply put, Apple built a good microarchitecture and pipeline: the M1 has a roughly 600-entry reorder buffer, which back then was about 2x that of most other CPUs, regardless of ISA. That's not something you get just by throwing more transistors at the problem; you need all the other parts of the processor optimized to make it useful, and Apple did that. The M series is simply one of the best out-of-order pipelines ever built.
Apple absolutely leverages their full-stack platform control to optimize their silicon. They know with insane accuracy how many cycles would be saved across a billion phones if they added (say) another integer divide unit, and if they decide that’s worth it, they can rev every major developer’s toolchain to schedule instructions for the new ALU months before the new chip ships.
Apple doesn’t do this, as it would reveal details about unreleased products. The scheduling model for their CPUs typically drops months after the chips ship, and even then the benefit you get from that is minor. (You can try it yourself, by comparing code generated using the “wrong” microarchitecture versus the “right” one. And also keep in mind that developers don’t always update their apps immediately to use the new toolchain.)
Silicon economics mean that whoever has the highest volume can invest the most in better designs and better process.
Apple gets volume from the iPhone.
This is also why x86 outperformed expensive (and so low-volume) RISC workstation CPUs for a long time on single-thread perf, and then eventually on multi-thread perf too.
This "why" is more "what are the underlying economics that enable the development of the better chip". Economics doesn't guarantee you'll get a better CPU. It still takes a cutting-edge engineering team with a good CPU architect to build that best chip. e.g. Intel didn't fall behind because of lack of money.
Hard to find reliable numbers, but it looks like the number of x86 chips sold per year is less than the number of iPhones sold per year, based on my totally not serious research.
Oh, I would probably agree. I'm just saying Intel also ships a lot of chips, and it's not like they are hurting for cash from those sales. A direct comparison would be difficult: the BOM of the average iPhone processor is surely much lower than that of an Intel chip, but Apple sells a lot more of them, and Apple's revenue is mixed into a lot of other things, so looking at just the processor isn't really possible, since the money goes into making iPhones, not just chips. If you looked at overall R&D investment on processors in dollars, though, I think the numbers would look a lot closer.
I think that process and the fabs influence things much much more than the microarch design. In fact, what microarch you can aim for (like, how wide it can be, and so how aggressive you get to be with all aspects of it) comes down to how good the process is.
So the comparison is Intel vs TSMC. If you consider that TSMC doesn’t just have Apple as a customer, then holy cow the difference is big.
[1] https://www.tomshardware.com/news/tenstorrent-shares-roadmap...
[2] https://www.anandtech.com/show/21281/tenstorrent-licenses-ri...