Can you trust a compiler to optimize your code? (matklad.github.io)
123 points by Bender on Aug 1, 2023 | 108 comments


Many absolutely fall into the trap of "don't optimise because premature optimisation etc., compiler is smarter than you, and library authors are definitely smarter than you". The result of that approach is that in 2023, our desktop is worse in usability and performance than what it was in 1998.

With SIMD, you always have to check what the compiler generates, you can't just write something and expect it to be vectorised. Similar to other optimisations such as de-virtualisation - it might happen or it might not. If it doesn't, you have to do it yourself by hand, or somehow convince the compiler to do it.


> The result of that approach is that in 2023, our desktop is worse in usability and performance than what it was in 1998.

No. This is undoubtedly because every UI nowadays is using JavaScript. This problem is way, waaaaay larger than any premature optimization considerations.


There is another point to consider: more and more people are writing code without being experienced programmers, thanks to easier languages, better IDEs, more streamlined lib APIs, and now even ChatGPT.

A geographer scripting her GIS in Python or a quant doing some pandas analysis has very limited programming experience, since her expertise is in her job.

And so we see more and more code produced by people that don't know what is slow and fast, what is safe or dangerous, what is maintainable or not. This code is popular, because it solves a real problem by people deeply familiar with the problem, way more than a programmer would be.

I have an example at $current_client, where they have some matlab code that handles tremendous amounts of money, with no comments, no doc, no tests, side effects everywhere and some interesting quadratic behaviors. They solved it by load balancing the thing across 6 servers with 17 processes, and manual testing.


I do data engineering, and I'm working on carving out a niche for myself in optimizing code. Most others want to solve sexy problems as soon as possible, but that ends up biting us in the end because it doesn't scale down the road and/or costs tremendous amounts of money.

Our OKRs often include "reduce cost of X by 8%", which is nice but isn't achieved half the time. But if you have deep experience or are thinking outside the box, you can (as I have) reduce by 95%.


The code the inexperienced write is infinitely faster than the non-code that would never get written if the inexperienced didn't write code.


Sometimes it's concretely a specific amount faster because you're replacing some other workflow with the software, but yes sometimes you're enabling something which would otherwise just not happen. Colleagues in our sister group, which doesn't have any actual software engineers, helped some academic librarians build a tool to automate something they knew they couldn't otherwise do at scale. They want to teach students how to use a library effectively, and while any of them could walk, say, one student per hour through this, there are thousands of new students every year; it's not manageable. Midway through the presentation the specialist explaining it said "It's not a program" and I was glad to find I wasn't the only person who wanted to interrupt, because duh, of course it's a program. They've built a program. It's a bunch of instructions, for how the computer does stuff, it's a program.

Yes these people wouldn't have attempted Python, let alone Rust, but what they've built is a program, in a visual programming language. Is that intimidating and so they might struggle to maintain it? Yes, a little bit maybe, but pretending it isn't a program won't help. And that might mean sometimes you need a software engineer, the same way that if we want a bike shed (and sometimes we do, students like bicycles), sometimes we might need the staff architects down the corridor from me, don't just throw up a few brick walls and hope you got it right.


Well, it's a wild guess, of course. But does it sound even remotely possible, that thanks to that messy matlab code $current_client now has money to hire you to look into it?


Hence the "because it solves a real problem by people deeply familiar with the problem, way more than a programmer would be."


Oh, missed that on initial reading. Point taken.


No. Doing UI in JavaScript shouldn't make any noticeable difference. I say that as someone who hates JavaScript and doesn't think it should be used for anything, really.

But, UI apps are long-running, and interactive. There's no real "throughput" or anything like that to be concerned with.

The problem is precisely with the crowd that believes that ALL optimization is premature--or rather that code styles I would consider pessimization are good. For example, in JavaScript, it's considered hip and "functional" to have a local variable that's an array, and instead of mutating it, programmers will string together five calls to `map`, `filter`, etc, making a bunch of temporary copies and iterating the data five times instead of once.

Writing JavaScript with what I've heard called "mechanical sympathy" does not have to be slow at all.
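For concreteness, a rough sketch of the two styles (written in Rust, the article's language, with made-up names; in JavaScript every `map`/`filter` call allocates a fresh array, which is what collecting at every step simulates here):

  // Chained style with a materialized temporary at every step.
  fn chained_with_temporaries(orders: &[i64]) -> i64 {
      let doubled: Vec<i64> = orders.iter().map(|x| x * 2).collect();           // copy #1
      let large: Vec<i64> = doubled.into_iter().filter(|x| *x > 100).collect(); // copy #2
      large.iter().sum()                                                        // third pass
  }

  // Single pass, no temporaries.
  fn single_pass(orders: &[i64]) -> i64 {
      let mut total = 0;
      for &x in orders {
          let doubled = x * 2;
          if doubled > 100 {
              total += doubled;
          }
      }
      total
  }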


Chained folds (map, filter, etc.) immediately within the same expression really should be easy to optimize, but even if the compiler doesn't do that, if the library provides those methods as operations on iterators, then an unoptimized version should only cause redundant loop counters and a single extra array allocation whenever you materialize it. So the array temporaries are a bad implementation/design choice.
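As a sketch of that (Rust iterators, made-up name): the adapters are lazy, so the whole chain is a single pass and the only allocation happens when you materialize at the end:

  fn lazy_chain(orders: &[i64]) -> Vec<i64> {
      orders
          .iter()
          .map(|x| x * 2)        // no allocation here
          .filter(|x| *x > 100)  // nor here
          .collect()             // one Vec allocated when you materialize
  }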

Scala has functional web frameworks that have no problem doing 100k+ requests per second on older mid tier desktop hardware, and while that's slow compared to a good mutable, imperative C implementation, it's fast enough that obviously functional programming per se is not the reason for perceived UI slowness.


"Chained folds (map, filter, etc.) immediately within the same expression really should be easy to optimize"

In a pure functional language they are.

In a non-functional language with mutation it is much more challenging. Map functions are unconstrained in Javascript which means they can do anything they like with regards to breaking optimization.

This is generally why I'm down on the whole "chain maps and filters" style of programming in languages that aren't designed for it. You are paying more than you may realize if you've never profiled your code and compared. In languages that are designed for it and have the ability to optimize it, by all means, go for it.

The other thing you have to remember is that JITs are time-constrained as optimizers go. They have the advantage of access to real run-time information about how things are being used, but they also have very, very tight time budgets; on the whole the time budget's problems seem to dominate the knowledge advantages. The profiler is not allocated a lot of time to prove that a given function is pure and it's OK to synthesize a composed function rather than run two maps. I don't even know if any JS engine is smart enough to try that. (Input from anyone who knows welcome, I'd be interested to hear either way.)


You don't need the language to be designed for it or to prove it. Just document it as a contract for the caller. e.g. "the argument to map must be a pure function. For functions with side effects, use forEach. If an impure function is provided, the behavior is undefined." This isn't very different from contracts that require you not to change the length of an array when using iterators.


Yeah, I'm imagining trying to push that sort of thing out into the Javascript world, and maybe your model of Javascript programmers is a lot different than mine, but I don't see this going terribly well.

In the Javascript world's defense, I don't see this as a practical solution in very many languages at all. I don't think I can name any language that doesn't have a compiler-level enforced definition of "pure function" somehow where its programmers will across-the-board reliably know what a "pure function" is.

Much as I hate to lick HN's boots and inflate the already substantial ego we can have here sometimes, do remember that HN is generally populated by the upper echelon of programmers and there's plenty of people out there who would have no idea what this means. I can think of several coworkers over the years. This isn't even an insult in the sense you might think; not everyone programming is necessarily a "programmer" I expect to be super knowledgable about programming. I'm having a major issue right now at work where I've got a data scientist sort of pushed into doing even just a bit of "programming work" interacting with an engineering organization and it has been a freaking nightmare, and I can't even really complain because while they are wielding Python, programming is really just a side task in the world of data analysis. They may be able to program in Python but they have no idea about performance, super focused understandings of optimizations (at best, sometimes it's just zippo), an incredibly surface level understanding of Python syntax, etc. And why not? It's not actually what we're paying them for.


> In the Javascript world's defense, I don't see this as a practical solution in very many languages at all. I don't think I can name any language that doesn't have a compiler-level enforced definition of "pure function" somehow where its programmers will across-the-board reliably know what a "pure function" is.

IME Scala programmers generally don't have much trouble with this, and the language doesn't do anything to prevent side-effects wherever you feel like putting them. It's just culturally not very common. Similarly, you can use `null` for almost anything in Scala, but it's extremely rare to run into one in code that isn't interacting with Java. But nothing really stops you from using basically any feature as Java++.

For your Python programmer, like I mentioned they already have to deal with some form of no-side-effects with for loops: you should not modify the length of a data structure that you are iterating over (`for x in foo`), or you'll get weird/unexpected results (e.g. [0]). You don't need to even go full UB, and could just specify that the order that the arguments to chained folds get called in is unspecified, and that code that needs to care should use `forEach` or a for loop. IME if you're writing in a pipelined map/filter/fold style in the first place, the ordering question just doesn't come up much. With that style it's rare to need to mutate anything yourself because the mutation you need is baked into those loop primitives.

[0] https://stackoverflow.com/questions/38003119/what-are-the-ru...


I'm certainly not against chaining combinators in general. In fact, I love functional programming and I enjoy Scala. I'm only specifically criticizing it in JavaScript and other languages that decided to import functional programming features onto eager collections.

Scala's standard collections are implemented as persistent collections, so the overhead for each combinator call is small, and could even be optimized by the compiler. JavaScript doesn't have a compiler, so the best you could hope for is a JIT optimization, but even that's impossible for several reasons related to the dynamic nature of the language.

As an aside, I sometimes like to shit on Java, but the language is very well engineered, despite its "original sins" that make me hate it. When Java decided to import functional programming niceties, they didn't just tack `map` and `filter` onto `Collection<T>`. They added the whole `Stream` API which does the best possible thing given what the language already was.


It depends what you're aiming for. For accuracy, map and filter make it impossible to fall off the bounds of your iteration. While it would certainly be nice if the runtime could do that faster, slow + correct is often (not always) better than fast + broken.

(There are languages that make fast + correct easier. Unfortunately, JavaScript isn't really one of them).


JavaScript has `for-of` syntax for iterables, so you won't go out of bounds, regardless.


put the nice looking calls to `map`, `filter`, etc in a comment then rewrite it with a for loop under it.


That's a pretty nice suggestion. Honestly, though, the expanded for loop implementation is never really that complicated in practice.


> This is undoubtedly because every UI nowadays is using JavaScript.

This happens in pure C++ or C in Windows too. It's neither just JS nor the problem the parent mentions (though of course in those cases those can sometimes contribute too). My understanding is that it's the various layers of abstraction + security mitigations code has to go through nowadays.


It's not only JavaScript. For instance, all the C# apps I ever had to use on Windows had GUIs that felt less responsive than their old Win32 counterparts or Qt equivalents (exhibit A: Visual Studio, which became incredibly slow and laggy after the C# / WPF migration compared to... VS2008? and honestly never recovered).


There may be another cause of that. There appears to be an explosion of background tasks running, even when you do nothing. You can look at the task list and see them. The disk drive light comes on when I'm not using the computer.


Using JavaScript... to simulate an entirely duplicate DOM structure and then diffing it to update the real thing.

JavaScript is fine; it's the massive frameworks that add 20 extra layers of abstraction that have to be processed that are the real problem.


Why would JS be the problem itself? Bundling a whole browser engine? Sure. JS itself is quite fast and plenty of frameworks use it for scripting their GUIs, e.g. Qt.


Typical HN take that is nothing but a lie. The worst experiences I ever had in desktop apps were from Java UI applications (JDownloader), Python (Calibre, pgAdmin) or the Win 11 super laggy UI (native). Meanwhile we now have a lot of web applications with much better performance than their "native" counterparts (for example: email clients, text/code editors).


> Win 11 super laggy UI (native)

It's hard to know what this might be. I wouldn't assume it's a lack of perfectly CPU-optimised native UI code when it could be all manner of things.


It’s not difficult to provide a bad experience regardless of technology. But if your baseline is slower and messier, it has a significant impact on your average application. I can also tell you bad personal experiences, but how about not making it personal? Thanks


Is windows 11's UI laggy because of the UI library, or because of fetching things from the internet?


Not because of JS in any case. JS is the whipping boy of desktop performance because its ascendance happened at the same time as what you say: interfaces that update on the fly and need data from the net to do anything at all.

So now download times get blamed on JS and Electron by the haters. Ah, well.


I'm not sure about this. There are simple JS applications like Postman which can be unbearably slow (try closing tabs in fast succession, you will know what I mean). And this is for local-only data.


JavaScript is the fastest scripting language in existence and it's not close.

It's strange that people automatically suspect it has to be the culprit as opposed to e.g. bad application programmers or (more likely) product owners putting more focus on features at the cost of optimization.


> The result of that approach is that in 2023, our desktop is worse in usability and performance than what it was in 1998.

I believe there are many more causes, some of them far more significant. Off the top of my head:

* the Web and the related delays: people started to accept the fact that there is a small delay in everything they do related to network latency

* the general "what Andy giveth Bill taketh away" mindset, and not just among MS developers: "if I have a powerful machine, I should use its full potential, and other people should do it, too"

* the notion that programmer's time is more expensive than machine time

* the notion that in order to win we need to be first, therefore there is no time to optimize things resource-wise

* the belief that users only want features and don't care about speed

* the lack of a high-quality, developer-friendly, modern cross-platform UI framework, resulting in many companies using Electron with its overhead


> the notion that programmer's time is more expensive than machine time

I do this computation each time I want to code something: time_to_develop + time_to_optimize + time_running, and pick the solution that minimizes that. Most of the time the optimizing part is not worth it (although it'd be so much more fun to do).


I think it's important to think about the context of "machine time". If "machine time" is some computer somewhere that only runs your code, then sure. Even as a customer, it might be cheaper to just pay more for a beefier machine than to pay more for optimizing the software.

But when you're building human user-facing apps, you can't always just get a faster computer. As others have said, many instances of lag in windows can be traced to network calls. A 64-core monster would probably wait around just as long as my ultra-low voltage, several year-old lazy ultrabook.


You should work in the gaming industry or robotics where if time_running > some constant you have no product.


I happen to work in the computational fluid dynamics department and I happen to have coded in the video game industry (optimized 3D engines).

So I do know when optimizing is necessary.

It's just that most of the time, for my computation, just waiting a few hours for a one-shot computation is much more efficient than spending 2 days optimizing it so that it runs in 20 minutes.

But I admit my parent message was not quite clear on the context in which I was thinking :-)


It's not a trap; the compiler can do micro-optimizations better than any human can because it considers thousands of variables, per target, that you don't even know about.

The problem is not realizing what the scope of micro-optimization is, and ignoring architecture and design in general. But that reflects an industry that largely doesn't know and doesn't care. As long as it ships and barely works, it's good enough.

Bad software is a victim of excellent hardware.


> As long as it ships and barely works, it's good enough.

Rails in a nutshell.


Rails was at one point advertised as the web platform where the CEO can directly type, almost in English, their app, and he never has to pay a programmer anymore. Programmers are obsolete. So... I guess the culture that formed around it, no wonder, wasn't of the brightest people.


> COBOL was at one point advertised as the platform where the CEO can directly type, almost in English, their app, and he never has to pay a programmer anymore. Programmers are obsolete. So... I guess the culture that formed around it, no wonder, wasn't of the brightest people.

FTFY.

Where you can have great things such as:

a IS GREATER THAN b

a IS GREATER THAN b AND c OR EQ TO d

MULTIPLY PAY-RATE BY 1.5 GIVING OVERTIME-RATE

\s It's as old as dirt.


> With SIMD, you always have to check what the compiler generates, you can't just write something and expect it to be vectorised.

This is indeed a serious problem, and not just because the autovectorizer implementations vary. The SIMD instructions also vary quite a bit between CPUs and CPU architectures.

D solves it in a novel way. First off, D does not autovectorize conventional code. D includes an array language, and array operations are converted to vector instructions, if those instructions exist. So, if you try to use an operation that does not exist on the target CPU, the compiler issues an error.

The error enables the programmer to select the appropriate action - emulate the behavior, or use another set of operations or algorithms that are equivalent and are supported by the SIMD instructions.

This avoids the problem of the compiler silently inserting emulation instructions that may be dramatically slower than alternatives. It saves the programmer the need to do asm dumps of the code to see if it was vectorized or not.

Frankly, the whole autovectorization approach bothers me. A compiler is designed to convert high level code into low level instructions. The autovectorizer converts low level code into higher level instructions. It's just fundamentally wrong.


Nah that’s not lack of premature optimization, that’s just electron apps and everyone learning JavaScript.


JavaScript can be quite “fast” if you’re not copying an array 30 times to get one number out of it.

Yeah. It’s still slow, as it were. But the JavaScript community also very much aligns with various functional programming ideals like immutability, which takes “kind of slow” to “immensely slow”.


With SIMD, you always have to check what the compiler generates, you can't just write something and expect it to be vectorised.

You are right about this but everything else is far off. Modern software that runs slow on modern hardware has to be wildly inefficient to cause a visible slowdown. Even 'OO' programs that are doing lots of small allocations and skipping around in memory are lightning fast on 20 year old computers.

Modern hardware is incredible. It's a testament to just how bad some software is that it runs slow. Most software can be sped up by 10x-100x by going from a scripting language to C++. Most software can be sped up 7x-100x after it is in C++ by weeding out excessive allocations and paying attention to memory access patterns. That's all before SIMD and multi-threading.


You don’t even get information/ diagnostics about copy elision / return value optimization in c++, if that happened or not in particular place in the code.

Only by looking at the assembly :(


Note that RVO is mandatory when returning a prvalue since C++17.

IME NRVO happens pretty reliably in the simple case, even on low optimisation levels.

There is a proposal for mandatory NRVO in some cases. It would be pretty great, but the proposal didn't move forward for quite a while.


The real optimizations are those that deal with memory anyway, which the compiler usually can't and almost always won't do


Machines aren’t slow because people don’t use SIMD.


Of course, that was a jab at the lack of optimisation in general, not SIMD specifically.


They don't say you shouldn't optimize at all. But before you do, check what your performance bottlenecks are, take measurements and start from there.


The issue is different people mean different things when they say "optimization".

For some developers, giving any consideration to the performance of your chosen architecture and data structures is "premature optimization"


in current_year we've invented programming paradigms that can slow down all parts of the program equally


I'm not working on anything close to bare metal hardware though; I rely on the libraries, frameworks and runtimes I use for the majority of performance, and have to watch out for expensive re-renders and the like myself.

That said, I'm sad that we're building apps in React Native; I tried to advocate for native apps which are much more mechanically sympathetic, but they weren't having it. In hindsight, it was probably the best call since we spend nearly no time on anything device specific and developing features is much faster than it would be if it were native, but it still sucks.


It would be possible to make iterators more SIMD-friendly than they are today but that would come at the cost of interpreting what the user might have intended and then dropping or introducing additional side-effects (e.g. by skipping things or grabbing a few more items than needed). Without an effects system that makes purity available on the type level, libraries can't even conservatively apply those optimizations on pure lambdas because they have no way to tell the pure and impure ones apart.


So, using i++ instead of ++i is why this 400mb Electron app is laggy?


Premature optimization is more about not home-growing some bespoke fancy data structure before you've even run the application, for example. If it's a bottleneck you can optimize those pieces later. Requirements can also change. Now you've spent a ton of time building something fancy and are less likely to throw it out, because you're attached to the big fancy work you put in, because we are human.


> "don't optimise because premature optimisation etc., compiler is smarter than you, and library authors are definitely smarter than you"

Everything about that is quite okay if you add "don't optimise prematurely ...". At some point you still need to optimise.


That advice is only for micro optimizations.

It doesn't make sense to spend a week aggressively optimizing one function unless you know that function is a bottleneck and the optimization is worth it.

But many developers take that to mean "don't pay attention to performance at all". They have bad architecture, bad data structures, sloppy code that copies data all over the place, etc.

Then, even if they do go back and micro-optimize that one hot function, the whole app still sucks.


Hardly matters when everyone is shipping Electron junk.


I think there's probably an underexplored sweet spot where IDEs gain the ability to hint at what type of optimizations the compiler will be able to do, and you could even specify which ones you want to select visually. Which in turn is left as a textual annotation that's "statically checked", i.e. if the compiler cannot perform the expected optimization anymore it will reject the annotation with an error or warning, whichever degree of freedom you decided to give the compiler.

Could get a bit unwieldy but it might not have to. You could also combine this with generative ML to suggest refactors that lead to potentially more easily optimizable code. It doesn't have to be perfect just assist enough for you to gradually build intuition. ML could also provide estimated guesses on how an optimization might speed up a given logical unit of code.

What you get is readable code but still a clear idea of how it will behave / impact the compiler. It does of course go counter to the idea of abstraction. But if we do advocate for people to learn how the compiler will optimize or not, we might as well invest in tooling to support that.



Not quite, those are hints to guide the optimizer to do a better job based on programmer knowledge. They aren't attributes to indicate "I expect / need this particular optimization A to be done on this code block".


This is good advice.

Still, a lot of problems seem to remain firmly in the "write it with intrinsics" region even if you do your best to help the compiler out. For any operation that involves filtering a list, the compiler will not generate the lookup table of shuffles that is used in the fastest solutions (maybe the story is different on AVX-512 where there are dedicated filtering instructions or on NEON where movemask costs more because there isn't such an instruction). For parsing and formatting strings, the compiler usually will not take branchy code and decide to compute both branches and vectorize it even if it is guaranteed chunks of the appropriate size. Some concrete examples of how simple these problems can be while remaining out of reach are "Given a mutable list of u32, remove all multiples of 3" and "Given a list of newline-separated UUIDs and an equally large list of mutable u128, parse the hexadecimal bytes of each UUID into the corresponding u128".
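For reference, the scalar form of the first example is as simple as this (a Rust sketch, name made up), and compilers will generally not turn it into the shuffle-table compaction described above:

  fn remove_multiples_of_3(v: &mut Vec<u32>) {
      v.retain(|&x| x % 3 != 0);
  }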


The problem with autovectorization is that there's no good tooling for it. icc at least has

  #pragma simd
to force vectorization, but gcc and clang don't. And there's no way to annotate a loop to say "throw a compile error if this loop fails to autovectorize". You either pray that someone else doesn't cause a silent 4x regression by adding a loop-carried dependency, or you perform some brittle hackery, reading the output of the -fopt-info-vec-missed flag and killing the build when you see a file name + line number that corresponds to a hot loop.

At that point, it just becomes easier to use SIMD intrinsics. You're still deferring to the compiler for the instruction scheduling and register allocation, but you no longer have to worry about whether the SIMD is being emitted at all.


Or you can have some automated benchmarks as part of your CI pipeline that will compare performance numbers to previous builds to know if there is a regression that ends up mattering.

An added benefit is that you can use it to do PGO and potentially eke out some more performance: https://en.wikipedia.org/wiki/Profile-guided_optimization


Automated benchmarks are good practice, but they're not a substitute for compile-time feedback. I would much prefer the immediate results from a compiler error, over the minutes-long wait for a traditional CI pipeline to finish.


It shouldn't be too hard to write a script that objdumps a list of functions and checks if SIMD instructions are used. Strap it to your build system / CI pipeline and you're good to go. Of course, it'd be better if compilers could do that natively.


You can do something similar with OpenMP in GCC: https://gcc.gnu.org/onlinedocs/libgomp/Enabling-OpenMP.html


Really surprising to me that compilers don't support a feature like this. Anybody know why?


In my experience, one of the biggest benefits that isn't too hard to implement is letting the compiler write/read more than it strictly needs to.

For example, pad your arrays to a multiple of 8, 16, etc. and where possible, run loops on the entire array including the padding. IIRC both GCC and Clang are smart enough to understand loops whose limit is rounded to the nearest multiple of 2^n via bitwise operators and they won't have to emit ugly loop tails which bloat code size.
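A rough sketch of that idea (Rust for illustration, names made up; it assumes the operation is harmless on the zero padding):

  const LANES: usize = 8;

  fn round_up(n: usize) -> usize {
      (n + LANES - 1) & !(LANES - 1) // LANES is a power of two
  }

  fn scale(data: &mut Vec<f32>, k: f32) {
      data.resize(round_up(data.len()), 0.0); // pad with zeros
      // The trip count is now a multiple of LANES, so the compiler
      // doesn't need to emit a scalar tail loop.
      for x in data.iter_mut() {
          *x *= k;
      }
  }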

This is a bit off topic, but I remember discovering a missed optimization in GCC this way: I wanted to signal to the compiler that it is allowed to read a particular pointer earlier than C semantics would allow and did something along the lines of (void)*p; Well, it turns out that dead reads are removed pretty early in the optimization pipeline, so GCC removed the read and then refused to touch the pointer. Replacing it with some __builtin, I was able to confirm that was indeed the case.


That sounds like an optimization the compiler could / should do for me, if I enable the option. Same with things like struct ordering and / or padding, although that can affect the functionality in subtle ways, because the language gives you too much power at that low level.


No, because the compiler has no way of knowing if you have some reason to care about the layout. 99.99% of the time you don't--but in general it can't know that last .01% where your layout has outside implications.


It sounds like that, but turns out that those are extremely hard optimizations to do as they require whole program analysis.


Also, because they involve trade-offs. Ideally, a compiler optimization would result in code that is better in every measurable aspect. For example, dead code elimination results in a smaller binary, fewer branch points, and fewer runtime computations. Array padding, on the other hand, requires a larger memory footprint to accommodate the padding.

If an optimization is only beneficial in some cases, compiler authors are rightfully wary of enabling it in all cases.


> Compilers are tools. They can reliably vectorize code if it is written in an amenable-to-vectorization form.

This goes for regular scalar code as well, it's just less obvious in mature compilers with a ton of work put into them.

I've long since given up handwriting scalar assembly code for speed. Instead I've studied the output of the compilers I use, and through iteration learned how to write optimization-friendly code.

If I have a hot spot that matters in my code I check just to be sure. Sometimes the most innocent looking changes can cause issues.


One thing I really want in my compilers is a way to express expectations: if I expect an auto-vectorisation, or a block of unreachable code, I should be able to tell the compiler explicitly!

For dead code, this is different from assertions, or Rust's `unreachable!` macro, in that they tell the compiler to trust me, and crash at runtime if I'm wrong. I want the compiler to check my logic and fail to compile if it can't prove me right. If I write a switch block and I know that only a limited subset of values can possibly happen, I still have to provide code for all possible values. But I expect that code to be subject to dead code elimination, and I'd quite like to be sure that actually happens. I'm not aware of any automation around checking that code I expect to compile to SIMD actually winds up doing that.
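To make that concrete, what exists today looks like this (a Rust sketch; the names are made up): the compiler takes my word for it and inserts a runtime panic, rather than proving the arm dead or failing the build:

  fn category(code: u8) -> &'static str {
      match code {
          0 => "ok",
          1 => "warning",
          2 => "error",
          // I "know" codes above 2 never occur, but the compiler only
          // trusts me; there's no way to ask it to prove this arm dead
          // and refuse to compile otherwise.
          _ => unreachable!("invalid status code"),
      }
  }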


What does the compiler do with that expectation? If the code can't be vectorized for the target architecture, does it emit a warning?


I'd quite like it to emit an error and fail. If some target architectures won't vectorise the code, I can always add the expectation conditionally.

I'm not so much trying to catch deliberate changes, where a warning might make sense as I'd be (maybe) looking for it, but changes that break autovectorisation as a side effect -- maybe something can't be inlined any more, maybe a new compiler version introduces a new optimisation that causes a cascade of tiny changes. I'd hope that we would catch the difference in performance, but it's easier to notice a compilation failure than a performance drop.


On a related note, folks who found this interesting might also like to read “There’s plenty of room at the top”, where computer scientists showed how orders of magnitude of performance are left unused when a naive high-level implementation is not optimized for the hardware (a combination of coding style and compiler insufficiency).

https://news.ycombinator.com/item?id=36921588

Much broader in scope than OP, that also has many insightful nuggets there about how to think about software and system design, especially in the post Moore’s law era.


This is overall very solid. However you have to be careful to not go overboard with this line of reasoning. I've seen people try to explain away inefficient code by inventing a chain of optimizations that the compiler could conceivably do. If this chain is too long, it probably won't happen! The most common pitfall I've seen is relying on inlining too much (specifically a C++ template monster). As far as I can tell, compilers have a fixed number of inline passes, and don't keep inlining until they hit a fixed point.


The article is spot on about SIMD. Often, if you can write your code in exactly the specific form that you would if you were using intrinsics, the compiler will go ahead and SIMDify for you. But you have essentially a 0% chance of writing in that form unless you already have experience writing SIMD code!

Things like branches, loop termination not being regular, alignment, aliasing, function calls, etc all can and do stop your otherwise totally amenable code from vectorizing.

In the terminology of the article, I'm still at Phase 2. By the time I've worked out the dataflow and constraints needed to make a loop parallelize, I've more or less converted it into the form needed for intrinsics already, so why not write them and be sure.

More broadly, the biggest impact I've seen using MSVC (which most of my code is written in), is the function inliner makes absolutely terrible decisions, particularly with lambdas, which is really unfortunate because there's no way to force-inline the lambda. Combine that with the above - the presence of the lambda will prevent vectorization, even though if it was inlined it would then be trivially vectorizable. :(


I don't think the argument "Compilers are tools. They can reliably vectorize code if it is written in an amenable-to-vectorization form." is proved correct here. I think I'll stay at level 2.

In my experience relying on auto-vectorization is brittle:

- Even if the compiler should be able to auto-vectorize code written in an amenable-to-vectorization form, that doesn't mean it will. Maybe the particular compiler and version you're using does auto-vectorize, but that doesn't mean other compilers will. This might be less of a problem in Rust, where you effectively have only one compiler (although there are still compiler updates).

- Even if you managed to once write the code in an amenable-to-vectorization form, that doesn't mean that the code will stay that way forever. Code tends to change.

- Sometimes it's far less obvious what the amenable-to-vectorization form should look like, or what the reasons are for the compiler failing to auto-vectorize. The classic example is the compiler failing to auto-vectorize a floating point sum, since the order in which floating point numbers are added affects the result.
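To make that last point concrete, a hedged sketch (Rust, made-up names): the naive sum fixes the order of additions, so the compiler isn't allowed to reassociate it into lanes; splitting the accumulator by hand changes the rounding slightly (which is exactly the permission the compiler lacked) and gives it independent chains it can vectorize, though you still have to check that it actually does:

  fn naive_sum(xs: &[f32]) -> f32 {
      xs.iter().sum() // strictly left-to-right additions
  }

  fn reassociated_sum(xs: &[f32]) -> f32 {
      let mut acc = [0.0f32; 8];
      let chunks = xs.chunks_exact(8);
      let tail = chunks.remainder();
      for chunk in chunks {
          for i in 0..8 {
              acc[i] += chunk[i]; // 8 independent dependency chains
          }
      }
      acc.iter().sum::<f32>() + tail.iter().sum::<f32>()
  }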


The people who worked on the compiler are far more intelligent and knowledgeable than I am, so I would say that I can trust the compiler with great confidence.

I do not think most people are writing software where niche optimizations are truly necessary. Well, at least I am not.


Another important aspect not brought up is the semantics of the programming language itself. FORTRAN doesn't have memory aliasing (at least in arrays -- it's been a while since I wrote any FORTRAN code) so the compiler can make assumptions (and from them optimizations) that it cannot in, say, C or C++.

The author points out the "myopic" nature of compilation, implying that a more global view of the code might allow the compiler to prove that no aliasing happens in a particular code path. I think that problem is undecidable in general.


Rust's borrow checker could probably be used for alias analysis, and avoid emitting redundant load instructions. That would give it a competitive advantage to C++. Does something like this exist yet?
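The sort of code I have in mind, as a hedged sketch (hypothetical name): because `dst` is `&mut` it cannot alias `src` or `k`, so the compiler should be free to load `*k` once and keep it in a register, whereas the equivalent C++ without `restrict` has to assume the stores might clobber it:

  fn scale_into(dst: &mut [f32], src: &[f32], k: &f32) {
      let n = dst.len().min(src.len());
      for i in 0..n {
          // Writes through `dst` cannot alias `src` or `k`, so `*k`
          // doesn't have to be reloaded on every iteration.
          dst[i] = src[i] * *k;
      }
  }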


Rust actually already does this to some degree, using LLVM’s noalias annotations (after many years of exposing latent LLVM bugs and having to disable them). There’s more that can be done here, see e.g.

https://github.com/rust-lang/rust/issues/16515

There are also soundness issues around async/await:

https://github.com/rust-lang/rust/issues/63818

LLVM’s aliasing model isn’t really optimal for a language like Rust with very precise memory dependence information, but using anything else would mean making a whole new optimizer/backend from scratch.


In my experience the human should do the gross optimizations--choose the right algorithm. And beware the operator--I've had multiple things where an O(n^2) was perfectly reasonable (and probably the fastest runtime given the lack of overhead of setting up better answers) given the original problem--but then n blew up down the road.


> A higher level language might choose to always represent functions with function pointers

This is a dig at C++ in case anybody didn't notice. In C++ if I write three predicates X, Y and Z which decide if a > b, b < a and a == b respectively, for two ints named a and b -- their types are identical. X, Y and Z have the same type, the type of any function which takes two integers and returns a boolean, even though of course these are different functions.

Imagine if your other types worked like this. Oh, IPv4Address and OldMacFileType both consist of four bytes, so now they're interchangeable and your code which takes an IPv4Address also works with an OldMacFileType. Very sensible.

As with many things in C++ this is inherited from C where it wasn't considered a bad problem because nobody has a decent optimising compiler on the PDP-11 anyway.
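For contrast, a rough Rust sketch (names made up): a named function keeps its own unique zero-sized "function item" type until you deliberately coerce it to a pointer, so a generic caller is monomorphized per predicate and the call can be inlined:

  fn count_matching<F: Fn(i32, i32) -> bool>(pairs: &[(i32, i32)], pred: F) -> usize {
      pairs.iter().filter(|&&(a, b)| pred(a, b)).count()
  }

  fn count_matching_ptr(pairs: &[(i32, i32)], pred: fn(i32, i32) -> bool) -> usize {
      pairs.iter().filter(|&&(a, b)| pred(a, b)).count()
  }

  fn greater(a: i32, b: i32) -> bool { a > b }

  // count_matching(pairs, greater):     `greater` keeps its unique fn-item type
  // count_matching_ptr(pairs, greater): erased to a plain function pointer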


I'm not sure I understand the criticism. You have three functions of type (int, int) -> bool. Where the type system accepts one, it accepts the others. I fail to see the issue here? How is the type system supposed to distinguish them? It's up to you, the developer, to do it. If you have two ints x and y, and you mess up and pass y to a function where you wanted to write x, the compiler is never going to catch it for you either.


I have three different functions, and what you wrote is not their type, that's their signature. Actually they are really only two different functions - a < b and b > a are the same thing for this integer type - but whether or not the optimiser knows that will not always be obvious.

Yes, the C++ type system doesn't express this, that's the defect (well, it's the consequence here of a larger defect).

Once they have different types there are two different interesting things we can do, we could just coerce them into a function pointer type based on the signature which is what you've seen in C++ and seem to assume is just naturally the only possibility. Or, we can use parametric polymorphism.


What do you mean by the "type" of a function? It doesn't sound like you're using any kind of standard definition for the word.


A type. The same thing C++ does for a lambda, since you seem to understand C++ better.


Not at all, Rust and C++ are more-or-less equivalent here. In C++, like in Rust, each lambda gets a unique unnamable type, and you can use template parameters to express direct calls.

> The lambda expression is a prvalue expression of unique unnamed non-union non-aggregate class type, known as closure type, which is declared (for the purposes of ADL) in the smallest block scope, class scope, or namespace scope that contains the lambda expression.

https://en.cppreference.com/w/cpp/language/lambda


Of course, but the result is that you need to write a lambda, even when it seems redundant semantically, because otherwise the optimiser may not see what's going on, whereas in Rust the natural expression of what you meant does what you expect.

One of the compatibility snags in Rust's standard library is that although you can write "Some words".contains(char::is_whitespace) and that does what you expect, the same won't work for char::is_ascii_whitespace for historic reasons and so you need to write the lambda in that case, which is uglier. Now, in that case if you write the natural thing you get a compiler error at least, but having the unnecessary lambda be needed to get performance is worse.
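Concretely, the snag (function names here are made up): `char::is_whitespace` takes `char`, so it fits what `contains` expects, while `char::is_ascii_whitespace` takes `&char` for those historic reasons, so you need the lambda:

  fn has_whitespace(s: &str) -> bool {
      s.contains(char::is_whitespace) // works: fn(char) -> bool
  }

  fn has_ascii_whitespace(s: &str) -> bool {
      // s.contains(char::is_ascii_whitespace) // error: takes &char
      s.contains(|c: char| c.is_ascii_whitespace()) // the uglier lambda
  }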

I really thought this was a dig at C++. What were you thinking of?


You could use C++ lambdas instead. They're basically structs with a call operator.

See also the comparisons from the functional header https://en.cppreference.com/w/cpp/header/functional


> ‘&’ doesn’t short circuit

Is this always true? I’ve experimented with this optimisation before but backed away from using it in production because it felt iffy to do. The codegen was far better though


Single & (and single | for "or") are bitwise and/or, not logical and/or. They have to evaluate both sides to be able to return a result.

For example:

  1100 & 0101 -> 0100
  1100 | 0101 -> 1101
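And in Rust (which I believe the quoted line is about), `&` is defined on `bool` as well: both sides are always evaluated, which removes a branch and can help vectorization, but only when the right-hand side is cheap and free of side effects. A sketch (made-up names):

  fn in_range_branchy(x: i32, lo: i32, hi: i32) -> bool {
      x >= lo && x <= hi // short-circuits: the second test may be skipped
  }

  fn in_range_branchless(x: i32, lo: i32, hi: i32) -> bool {
      (x >= lo) & (x <= hi) // both tests always evaluated
  }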


You can't trust compilers anymore because they are way too aggressive about considering well-defined but platform-specific behavior to be "Undefined behavior".


This comment is tangential at best, but coming from someone who has never written a compiler (i.e. I'm a Dunning-Kruger peak), I'm curious to know if ML approaches could drive more optimizations. As I understand it, compiler optimizations are NP-hard so they use heuristics and clever ways to constrain the search space for optimizations, which I would think could be potentially guided by ML?

I think that there would be a lot of cases where engineers would accept 10x longer compile times if it gave 10% more application performance.


There is a fair amount of academic research on this topic. Eg, MLGO[1], but I don't know of any being used for production work.

1. https://arxiv.org/pdf/2101.04808.pdf


Thanks for the link, I've seen statistical models used in computer architecture e.g. branch predictors, so I figured it might translate to compilers as well.

Excited to learn more when I take a compiler course next year, as compilers are the most "dark magic" part of software to me.


ML systems do not guarantee valid output. If you have a system that can prove a given transformation is behavior-preserving, then sure you can let the AI have a go at it and then check the result, but having such a checker is much harder than it sounds.


> ML systems do not guarantee valid output

no one said this has to be an LLM or whatever. you could have a compiler that just uses ML-based cost heuristics. iirc some research has already been done in this direction


How does this `chunks_exact` business handle alignment?


Ken Thompson once made a similar point about compilers and trust but in terms of security:

https://www.cs.cmu.edu/~rdriley/487/papers/Thompson_1984_Ref...


Adding to the security part, there are also situations where compilers will optimize something, only to introduce security issues, hence the need for functions such as explicit_bzero (https://man.openbsd.org/explicit_bzero).
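The Rust-flavoured version of the same problem, as a hedged sketch (hypothetical name): a plain zeroing loop over a secret buffer can be eliminated as a dead store, which is why people reach for volatile writes (or a crate like zeroize) to keep the wipe in place:

  fn wipe(secret: &mut [u8]) {
      for byte in secret.iter_mut() {
          // Volatile writes may not be elided by the optimizer.
          unsafe { std::ptr::write_volatile(byte, 0) };
      }
  }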


Half of Linus' rants are just about compiler bugs.



