I'm assuming you put the inline assembly in a macro, which keeps your C portable. The macro gets pulled in from a CPU-specific header file.
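To make that concrete, here is a rough sketch of what I mean, assuming GCC or Clang. The macro name `io_wmb()`, the function `dev_start()`, and the two-register device layout are all made up for illustration; in a real tree the `#if` ladder would live in the per-CPU header rather than in the driver.

```c
#include <stdint.h>

/* What a hypothetical per-CPU header might provide. io_wmb() is an
 * illustrative name, not taken from any particular project. */
#if defined(__GNUC__) && defined(__x86_64__)
#define io_wmb() __asm__ __volatile__("sfence" ::: "memory")
#elif defined(__GNUC__) && defined(__aarch64__)
#define io_wmb() __asm__ __volatile__("dmb oshst" ::: "memory")
#elif defined(__GNUC__)
#define io_wmb() __sync_synchronize()   /* generic full barrier fallback */
#else
#error "no io_wmb() defined for this compiler/CPU"
#endif

/* Portable driver code: no assembly visible here, only the macro.
 * The register layout (descriptor at index 0, doorbell at index 1)
 * is an assumption for the example. */
static inline void dev_start(volatile uint32_t *regs, uint32_t desc)
{
    regs[0] = desc;   /* write the descriptor first */
    io_wmb();         /* order it before the doorbell write */
    regs[1] = 1;      /* ring the doorbell */
}
```

The driver source stays the same on every CPU; only the header changes.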
MTRR/PAT won't save you, at least without a huge performance loss, when you want write combining (etc.) for most of the addresses in a page of address space but not all of them.
MTRR/PAT won't save you from reordering that happens in a PCI bridge. It's allowed and it is done.
We're into semantics here: portable with a CPU shim vs. portable without one. More complicated CPU/bus interfaces require the first, but if you can just use volatile for I/O, as is done in embedded work, you get much smaller and/or faster code.
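For reference, the embedded style I have in mind looks roughly like this. The UART register layout, the `UART_TX_READY` bit, and `uart_putc()` are invented for the example; this is only a sketch of the pattern, and it assumes the bus does no reordering or combining behind your back.

```c
#include <stdint.h>

/* Hypothetical memory-mapped UART: the layout and bit are made up. */
struct uart_regs {
    volatile uint32_t data;    /* write a byte here to transmit */
    volatile uint32_t status;  /* bit 0 set when the transmitter is ready */
};

#define UART_TX_READY 0x1u

static inline void uart_putc(struct uart_regs *u, char c)
{
    /* volatile forces a fresh load of status on every iteration */
    while ((u->status & UART_TX_READY) == 0)
        ;
    u->data = (uint32_t)(unsigned char)c;
}
```

No shim, no barriers, and the compiler emits plain loads and stores.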
If a device has registers that are sensitive to multiple reads/writes of the same address, and others that need caching and write combining for performance, all in the same 4K page, then it was designed poorly. :)
PCI bridges need to support legacy mode. Otherwise DOS programs and drivers wouldn't run correctly. The BIOS programs a compatibility MTRR configuration for this reason, and you can see the Linux kernel print it out at boot.
So I'm basically agreeing and disagreeing with you on this one. Once the hardware gets past a certain complexity you need to move from volatile to accessor functions/macros.
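The accessor style would look something like the sketch below. The names `mmio_write32()` and `mmio_read32()` are hypothetical, and the C11 fence is only a placeholder for whatever barrier the platform and bus actually require; a CPU fence by itself does not control what a PCI bridge does.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical accessors: drivers call these instead of touching
 * volatile pointers directly, so ordering rules live in one place. */
static inline void mmio_write32(void *base, size_t off, uint32_t val)
{
    /* Placeholder barrier; a real port would use the platform's own. */
    atomic_thread_fence(memory_order_seq_cst);
    *(volatile uint32_t *)((char *)base + off) = val;
}

static inline uint32_t mmio_read32(void *base, size_t off)
{
    uint32_t v = *(volatile uint32_t *)((char *)base + off);
    /* Placeholder barrier, same caveat as above. */
    atomic_thread_fence(memory_order_seq_cst);
    return v;
}
```

On simple hardware the accessors can collapse to a bare volatile access, so you only pay for the barriers where the platform needs them.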