The claim that the compiler is allowed to optimize

    volatile int *p = somevalue;
    int s = *p + *p;

as if it said

    volatile int *p = somevalue;
    int tmp = *p;
    int s = tmp + tmp;

is incorrect. The clause about sequence points starts with "Furthermore, ", so it is an additional restriction, and to boot it only talks about stores to the volatile-qualified object, not loads, so it is entirely inapplicable to this example.
The prior clause saying "Therefore any expression referring to such an object shall be evaluated strictly according to the rules of the abstract machine, as described in 5.1.2.3." applies, and the rules of the abstract machine for the binary + operator are that each operand is evaluated (in an unspecified order). So both dereferences must be evaluated.
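For concreteness, a compilable sketch of the example (the address is made up purely for illustration); a conforming compiler must emit two loads here:

    #include <stdint.h>

    /* Hypothetical MMIO address, for illustration only. */
    #define SOMEVALUE ((volatile int *)0x40000000u)

    int sum_twice(void)
    {
        volatile int *p = SOMEVALUE;
        /* Each *p is a volatile access: the compiler must perform two
           separate loads and may not fold them into one. */
        int s = *p + *p;
        return s;
    }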
I agree, I think this part makes it clear: except as modified by the unknown factors mentioned previously.
Although this is rather contradictory:

5.1.2.3 p4: "In the abstract machine, all expressions are evaluated as specified by the semantics. An actual implementation need not evaluate part of an expression if it can deduce that its value is not used and that no needed side effects are produced (including any caused by calling a function or accessing a volatile object)."
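In practice the two clauses line up: an unused plain read may be elided, while a volatile read is itself a needed side effect that must stay. A quick sketch of the classic case:

    extern int nflag;            /* plain: the load may be hoisted out of the loop */
    extern volatile int vflag;   /* volatile: must be reloaded every iteration */

    void spin_plain(void)    { while (nflag == 0) { } }
    void spin_volatile(void) { while (vflag == 0) { } }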
Volatile is not just for memory mapped hardware, but is also needed when using shared memory, say between two processes.
I have been bitten by not having volatile modifiers on data residing in shared memory. The funny thing (and retrospectively obvious) is, it worked in debug mode, but in production the data was never seen by the other process. It was a fun exercise to find that one.
"shared memory" and "other process" quite hints at the frequent use of volatile where fences and other synchronization primitives are needed.
(Note: I didn't down-vote anything)
Sorry, I assumed it was you. It was the only comment, and there was a downvote.
Shared memory between processes is a different beast than just threads in the same process. Memory was mmap-ed between the two OS procs. We didn't use locks in that part of the code for a list of reasons..., and memory write order didn't matter as much (so I didn't see an immediate need for memory fences, but someone below suggested those might work to fix the compiler optimizing out accesses when not in debug mode).
Not sure what hinted at frequent use of volatile; it was the only use in 10 years. That is what made it memorable.
Did some more research. Compilers have their own intrinsics that only affect the generated code (in addition to the processor ones). VS has _ReadBarrier, _WriteBarrier and _ReadWriteBarrier, while GCC has asm volatile("" ::: "memory"); to prevent reordering optimizations.
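A sketch of how the GCC form is typically wrapped (the flag and busy-wait are invented for illustration):

    /* Compiler-only barrier: stops the compiler from caching memory
       values in registers across this point. It emits no instructions
       and does NOT order anything at the CPU level. */
    #define compiler_barrier() asm volatile("" ::: "memory")

    extern int *flag;  /* hypothetical pointer into shared memory */

    void wait_for_flag(void)
    {
        while (*flag == 0)
            compiler_barrier();  /* force a fresh load each iteration */
    }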
There really isn't any difference between multithreading with common process memory and multiprocess with mmap-ed shared memory. Both have multiple threads accessing the same physical memory; in one case the OS made the memory common, and in the other you made it common.
BTW, when I say multithreaded I'm talking real threads and not light/green/cooperative/fake/whatever threads.
Thanks for doing more research. Did you get the idea that there are two separate issues at hand? 1) The compiler optimizes a memory access away because, by its logic, the value is not written anywhere in that process, so it could use a register instead of a memory read. This is the effect I was seeing. 2) The compiler emits the memory read instruction, but depending on the processor, at runtime some of the writes will not show up at all, or will be reordered. In this case running the program on a different CPU might make it appear to work correctly. I think there is a slight difference between the two from what I read back then. Can the first issue be fixed with a memory barrier?
Yes, you are correct that there are two issues. But solving 2 may solve 1 automatically, or may just need the compiler barrier intrinsics in the same spots as the processor barrier intrinsics.
And yes, the first issue can be fixed with memory barriers. If you put a compiler+processor read barrier in front of the first line that reads the shared variable, the compiler will re-read it from memory back into a register. After that it will use the register, so as not to kill performance, until it hits the barrier again.
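In modern C the combined compiler+processor effect is what C11 atomics give you; a minimal sketch (shared_flag is a made-up name for something living in the mmap-ed region):

    #include <stdatomic.h>

    extern _Atomic int *shared_flag;  /* hypothetical; points into shared memory */

    int read_shared(void)
    {
        /* An acquire load both forbids the compiler from caching the
           value and emits whatever ordering instructions the CPU needs. */
        return atomic_load_explicit(shared_flag, memory_order_acquire);
    }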
So you think wrapping every single access to that shared memory structure in barriers would do the trick? I should dig out that code and try; I am curious now. But that would be kind of a large change. Right now the volatile modifier is in one place only -- where the data is defined, not where it is accessed.
Volatile only affects how the compiler handles memory access. It does not tell the processor anything at all. The processor is still free to reorder reads and writes. You need to use the fence and lock instructions to create memory barriers, especially on multi-core processors.
> Volatile only affects how the compiler handles memory access
My comment specifically mentions memory access being optimized away. You did read it, right ;-) ? It even has an example of how debug mode made it work. There was nothing there about telling something to the processor. If anything, it is about wanting to tell something to the compiler...
> You need to use the fence and lock instructions to create memory barriers especially on multi core processors.
Yep, definitely. But will that still prevent the compiler from optimizing away accesses to a shared memory block? I remember looking for memory barriers but didn't know if my ancient system had those exposed in the pthreads library.
Yes, I read your comment. :) What I was saying is that when you're using shared memory to talk to another thread or process (which by definition is running on another thread), you need to use memory barriers, or the processor may do the same optimizations (and more) that your compiler did. On top of that, other cores may not see your reads/writes.
Debug mode tends to always work because it generates code as if everything were effectively volatile. After that it depends on how the threads/processes are scheduled and on what core. Without proper synchronization it may work, may work slowly, may fail once in a blue moon, or may fail all the time. The current processor may run it fine, and then it may stop working on future processors.
Fortunately, if you use things like mutexes, then the barriers are handled for you.
Whether the barrier intrinsics still need volatile depends on the compiler. I'll have to do some research on that.
Are memory barriers transactional, i.e. something like begin_barrier(), do some shm writes, end_barrier()?
I was declaring data structures in shared mmap-ed memory as volatile so the compiler does not optimise accesses away. How would the compiler know not to optimise away operations between memory fences? A fence, to me, says something about the guarantee of the order in which operations are seen. The order wasn't the problem; a write never making it to the other side was the issue...
There are many types of barriers, depending on the processor architecture. They're usually used in pairs, so for example a read barrier would be used when a mutex is acquired and a write barrier when it's released.
If the compiler is smart about its memory barrier intrinsics, it will make sure to re-read anything after the read barrier and write everything out before the write barrier. This speeds things up by not having to use volatile. Between the barriers it can cache reads in registers and optimize out multiple writes like normal.
If the compiler can't do that, then you need to use volatile. Volatile was meant for I/O, where every read has to hit the bus. So if you're forced to use it, then you just copy the volatile variable into a non-volatile one for use inside your mutex to get some speed back.
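Something like this sketch (all names invented):

    #include <pthread.h>

    extern volatile int shared_val;   /* hypothetical; lives in shared memory */
    extern pthread_mutex_t shared_mu;

    int work(void)
    {
        pthread_mutex_lock(&shared_mu);
        int local = shared_val;       /* exactly one volatile read */
        /* Use the non-volatile copy from here on; the compiler is
           free to keep it in a register. */
        int result = local * local + local;
        pthread_mutex_unlock(&shared_mu);
        return result;
    }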
You do need volatile (or equivalent) to instruct the compiler not to optimize away an access/write. Locked read/write intrinsics will already be annotated such that they don't get optimized.
The justification for not using it for memory-mapped IO is that you are supposed to use accessor functions. Volatile still has uses for implementing those accessor functions (with care taken for the other concerns; and, by their nature, these functions are architecture dependent anyway), or when you are not using a kernel.
Volatile is for memory-mapped IO. Some processors may need extra instructions or configuration, but that doesn't replace the need for volatile with MMIO.
On x86, the MTRR/PAT mechanisms are what disable caching and write combining and enable write-through for MMIO address ranges.
Yes it does replace the need for volatile. The actual access is done via assembly language, either inline (better) or not.
You could make volatile work, probably, by putting inline assembly both before and after it. In that case you could probably even skip marking things volatile, because the inline assembly will make the compiler behave. Why bother? This is more complicated than just doing the access from assembly.
MTRR/PAT is often not enough, or it would only be enough if you killed performance elsewhere. (different types of MMIO in the same page) It's also not something you can rely on if you want to write portable code.
Yes, you can use memory barriers for MMIO, but you'd have to put them between every single statement that touches the MMIO address range. Not only will that bloat your code, it would slow down access to all your regular variables, as the compiler would have to reload everything constantly and the processor wouldn't be able to execute anything out of order.
Why do you think volatile won't work for MMIO with the proper MTRR/PAT setup? The volatile keeps the compiler from messing with the reads/writes, and the MTRR/PAT makes the processor perform all reads and writes exactly in the order the instructions specify, without any caching or write combining.
"This is more complicated than just doing the access from assembly." and "It's also not something you can rely on if you want to write portable code." So in one paragraph you're telling me to do it in assembly and in another you want portable code.
The Linux kernel has gone above and beyond plain volatile, but in embedded work it's how IO is done when using C.
I make the assumption that you put the inline assembly in a macro. This keeps your C portable. The macro gets pulled in from a header file that is specific to the CPU.
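A sketch of what such a CPU-specific header might contain (GCC inline assembly, ARM32 flavored; the macro names are invented):

    /* mmio_arm.h -- hypothetical CPU-specific accessor macros */
    #include <stdint.h>

    /* One 32-bit MMIO read; the "memory" clobber keeps the compiler from
       caching the value or reordering the access relative to other code. */
    #define MMIO_READ32(addr) ({                                  \
        uint32_t _v;                                              \
        asm volatile("ldr %0, [%1]"                               \
                     : "=r"(_v) : "r"(addr) : "memory");          \
        _v;                                                       \
    })

    #define MMIO_WRITE32(addr, val)                               \
        asm volatile("str %1, [%0]"                               \
                     : : "r"(addr), "r"(val) : "memory")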
MTRR/PAT won't save you, at least without a huge performance loss, when you want write combining (etc.) for most of the addresses in a page of address space but not all of them.
MTRR/PAT won't save you from reordering that happens in a PCI bridge. It's allowed and it is done.
We're into semantics here: portable requiring a CPU shim vs. portable not requiring one. More complicated CPU/bus interfaces require the first, but if you can just use volatile for IO, as is done in embedded work, you get much smaller and/or faster code.
If a device has registers that are sensitive to multiple reads/writes of the same address, and ones that need caching and write combining for performance, in the same 4K page, then it was designed poorly. :)
PCI bridges need to support legacy mode. Otherwise DOS programs and drivers wouldn't run correctly. The BIOS sets up a compatibility MTRR setup for this reason and you can see the Linux kernel print it out on boot.
So I'm basically agreeing and disagreeing with you on this one. Once the hardware gets past a certain complexity you need to move from volatile to accessor functions/macros.
Volatile also tempts you to break the aliasing rules. If you find yourself casting pointers, you are probably violating the C standard. You'll need something like -fno-strict-aliasing to enable a non-standard C dialect.
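A made-up snippet showing the pattern: reading a float's bits through a uint32_t pointer is exactly the kind of cast that trips strict aliasing:

    #include <stdint.h>

    float f = 1.0f;

    uint32_t bits(void)
    {
        /* Undefined behavior under strict aliasing: a uint32_t lvalue
           accesses an object of type float. GCC/Clang only promise to
           tolerate this with -fno-strict-aliasing. */
        return *(uint32_t *)&f;
    }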
It was memory used for a low-latency system; it didn't have locks, not even spinlocks. In that case it was manipulating data directly in shm, and the compiler was optimizing away accesses to the data. What is the current/modern mechanism to ensure a structure or flag won't be optimized away in shared memory mapped between two processes? Are accessor functions available in userspace libraries?