Don't mistake "every machine you've seen death spiraling was using swap" for "every machine using swap is death spiraling." Notably, how many machines did you not have to look at, because the swap was doing just fine?
That I’ve administered? None under any significant load!
I even disabled it on the lab Raspberry Pis eventually, and on an SBC I use to rclone 20+ TB of NVR archives, due to performance problems it was causing.
It’s a pretty consistent signal, actually: if I look at a machine and it’s using any swap, it’s probably gotten wonky in the recent past.
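For what it's worth, that "any swap in use?" check is easy to script. A minimal sketch, assuming a Linux box (field names per `/proc/meminfo`):

```python
# Flag any swap usage by parsing /proc/meminfo (Linux-specific).

def swap_used_kib(meminfo_text):
    """Return swap in use (KiB) given the text of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key] = int(parts[0])  # values are reported in kB
    return fields["SwapTotal"] - fields["SwapFree"]

if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        used = swap_used_kib(f.read())
    print(f"swap in use: {used} KiB")
```

Trivial, but it's exactly the kind of thing you can wire into a fleet-wide "this box got wonky recently" alert.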
If you pushed something to swap, you didn’t have enough RAM to run everything at once. Or you have some serious memory leaks or the like.
If you can take the latency hit to load what was swapped out back in, and don’t care that it wasn’t ready when you did the batch process, then hey, that’s cool.
What I’ve had happen way too many times is something like this: the ‘colder’ data paths on a database server get pushed out under memory pressure, but the pressure doesn’t abate before those cold paths get called again (and the kernel will rarely pull those pages back out of swap on its own). That leads to slowness, which leads to bigger queues of work and more memory pressure, which leads to a doom loop of maxed-out I/O, super high latency, and ‘it would have been better off dead’.
These death spirals are particularly problematic because they’re not ‘dead yet’, and may never get so dead that they stop, for instance, accepting TCP connections. They de facto kill services in ways that are harder to detect and repair, and take way longer to recover from, than if they’d just flat-out died.
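One signal that does catch that ‘alive but thrashing’ state, when TCP health checks won’t, is a sustained major page-fault rate. A sketch, again assuming Linux (`pgmajfault` counter from `/proc/vmstat`):

```python
# Sample the cumulative major page-fault counter twice; a machine stuck
# in a swap doom loop shows a sustained high fault rate even while it
# still happily accepts TCP connections.
import time

def pgmajfault(vmstat_text):
    """Extract the cumulative major-fault count from /proc/vmstat text."""
    for line in vmstat_text.splitlines():
        key, _, value = line.partition(" ")
        if key == "pgmajfault":
            return int(value)
    raise KeyError("pgmajfault not found")

def majfault_rate(interval=5.0):
    """Major faults per second over `interval` seconds (Linux only)."""
    with open("/proc/vmstat") as f:
        before = pgmajfault(f.read())
    time.sleep(interval)
    with open("/proc/vmstat") as f:
        after = pgmajfault(f.read())
    return (after - before) / interval
```

The threshold is workload-dependent, but ‘hundreds of major faults per second, for minutes’ is usually the doom loop in progress.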
Certainly won’t happen every time, and if your machine never gets so loaded and always has time to recover before having to do something else, then hey maybe it never doom spirals.
I do a lot of CI/CD where we just have weird load, and it would be a waste of money/resources to shell out for the max memory.
Another example would be something like Prometheus: when it crashes and replays the WAL, memory spikes.
Also, it's probably an unsolved problem to tell applications how much memory they're actually allowed to consume. Java has separate direct buffers, heap, etc.
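To illustrate how fragmented that is on the JVM alone, a sketch of the separate caps involved (the flags are real; the sizes are made-up examples):

```shell
# Hypothetical JVM sizing: the heap (-Xmx) and off-heap direct buffers
# (-XX:MaxDirectMemorySize) are capped separately, and neither cap covers
# metaspace, thread stacks, or the JIT code cache.
java -Xmx4g \
     -XX:MaxDirectMemorySize=1g \
     -XX:MaxMetaspaceSize=256m \
     -jar app.jar
```

So even when you know the machine's budget, translating it into per-process limits the application will actually respect takes several knobs per runtime.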
I have plenty of workloads where I'd prefer to get a warning alert and act on that, instead of dealing with broken builds etc.