Hacker News

Either I'm not understanding something that I thought I understood very well, or TFA's authors don't understand something that they think they understand very well.

Their code is unsafe even on x86. You cannot write a single-writer, single-reader FIFO on modern processors without the use of memory barriers.

Their attempt to use "volatile" instead of memory barriers is not appropriate. It could easily cause problems on x86 platforms in just the same way that it could on ARM. "volatile" does not mean what you think it means; if you're using it for anything other than interacting with hardware registers in a device driver, you're almost certainly using it incorrectly.

You must use the correct memory barriers to protect the read/write of what they call "head" and "tail". Without them, the code is just wrong, no matter what the platform.



> "volatile" does not mean what you think it means; if you're using it for anything other than interacting with hardware registers in a device driver, you're almost certainly using it incorrectly.

Another "correct" use of volatile is as a hack to prevent compilers from optimizing away certain code. It's pretty rare to need that, and often you can just use a lower optimization level (like the usual -O2), but sometimes you need -O3 / -Ofast or something, and a strategically placed volatile qualifier keeps everything working.

A classic example is the Kahan summation algorithm. At -O2 it's fine. At -O3 or higher it silently defeats the algorithm while appearing to work (you get a sum but without the error compensation). Defining the working vars as volatile makes it work again. This is noted in the Wikipedia pseudocode with the comment "// Algebraically, c should always be zero. Beware overly-aggressive optimizing compilers!"

https://en.wikipedia.org/wiki/Kahan_summation_algorithm

Of course -O3 might not be any faster anyway but that's another topic.


I can’t imagine it’s an -O2 vs -O3 thing unless a compiler enables “fast-math” optimization to allow associativity. Neither Clang nor GCC does this (nor does MSVC, I think); optimization levels never silently turn off IEEE 754 floating point. I don’t know about ICC, but it sounds like they stupidly enable fast math by default to try to win at benchmarks.

Do you have anything to actually support this statement or did you just assume “overly aggressive optimizing compilers” and “O3” are somehow linked?

Generally, higher optimization levels may find more opportunities to exploit UB, but they do not change the semantics of the language, and all languages I’m familiar with define floating-point addition as non-associative, because it isn't associative when you’re working with finite precision.

TLDR: Don’t use volatile unless you really know what you’re doing, and unless you know C/C++ really well, you probably do not. If anyone tells you to throw in a volatile to “make things work”, it’s most likely cargo-culting bad advice (not always, but probably).


There is some amount of truth in what the parent is saying. Ages ago, when x86 only had x87 FP, GCC would program the FPU to use 80-bit precision even when dealing with doubles. The excess precision meant that GCC could not implement IEEE math correctly even without fast-math. Forcing intermediate values to be stored to memory via volatile variables was a partial solution to this problem.

MSVC configures the FPU to use 64-bit precision, which means that doubles work fine, but it has no 80-bit long double, and floats still suffer from excess precision.

SSE avoids all these problems.


Kind of, but that still shouldn't have impacted Kahan summation, which only cares about associativity, and extended precision doesn't impact that. They would just end up getting more numerically accurate results on x87.


I did tests on Kahan summation recently on my MacBook Pro, and -O3 defeated the algorithm while -O2 did not. Declaring the below variables as volatile restored error compensation with -O3.

The relevant code is:

          kahan_y=g_sample_z - kahan_c;
          kahan_t=g_sample_z_sum + kahan_y;
          kahan_c=(kahan_t - g_sample_z_sum) - kahan_y;
          g_sample_z_sum=kahan_t;
(this is in an inner loop where a new g_sample_z is calculated and then added to a running g_sample_z_sum with this snippet)


Sounds like a compiler bug to me. Can you file a bug to clang with a reduced standalone test (or I can do it for you if you share the standalone test).


Here is a complete simplified Kahan summation test and indeed it works with -O3 but fails with -Ofast. There must have been something else going on in my real program at -O3. However the original point that 'volatile' can be a workaround for some optimization problems is still valid (you may want the rest of your program to benefit from -Ofast without breaking certain parts).

Changing the three kahan_* variables to volatile makes this work (slowly) with -Ofast.

  #include <stdio.h>

  int main(int argc, char **argv) {
    int i;
    double sample, sum;
    double kahan_y, kahan_t, kahan_c;

    // initial values
    sum=0.0;
    sample=1.0; // start with "large" value

    for (i=0; i <= 1000000000; i++) { // add 1 large value plus 1 billion small values
      // Kahan summation algorithm
      kahan_y=sample - kahan_c;
      kahan_t=sum + kahan_y;
      kahan_c=(kahan_t - sum) - kahan_y;
      sum=kahan_t;

      // pre-load next small value
      sample=1.0E-20;
    }
    printf("sum: %.15f\n", sum);
  }


Correct. `-Ofast`'s claim to fame is that it enables `-ffast-math`, which is why it has huge warning signs around it in the documentation. `-ffast-math` turns on associativity, which is exactly what breaks Kahan summation. Rather than sprinkling in volatiles, which pessimizes the compiler to no end, I would recommend annotating the problematic function to turn off associativity [1][2].

Something like:

    [[gnu::optimize("no-associative-math")]]
    double kahanSummation() {
      ...
    }
That way the compiler applies all the optimizations it can but only turns off associative math. This should work on Clang & GCC & be net faster in all cases.

This is what I mean by "If you're sprinkling volatile around, you probably aren't doing what you want" and are just cargo culting bad advice.

[1] https://stackoverflow.com/questions/26266820/in-clang-how-do... [2] https://gcc.gnu.org/onlinedocs/gcc-4.7.0/gcc/Function-Attrib...


I hope this isn't the actual "real" code, because you've got undefined behavior before you even have to worry about the associativity optimizations. There's an uninitialized read of 'kahan_c' on the first loop iteration.


I think their point is that you only need compiler barriers, not actual barrier instructions, on x86. volatile has in practice been the de facto way to get the effect of a compiler memory barrier for a long time, even though it's not the best way to do it nowadays. Its original purpose is literally to prevent the compiler from eliminating loads and stores or reordering them, which is exactly what is needed when implementing a lockless FIFO. As long as all the stores and loads (including the actual FIFO payload) are volatile, it will work (volatile loads and stores are guaranteed not to be reordered with each other), and after that the x86 guarantees about not reordering are very strong.

Really the best argument against volatile for this kind of thing is actually the opposite of your point: volatile is too strong. It prevents more reordering than you actually want. Acquire/release semantics are weaker and give the compiler more flexibility.


volatile does not work as a memory barrier, in either theory or practice. On GCC, for example, you need an explicit additional compiler barrier before a volatile store and after a volatile load to implement the expected release/acquire semantics on x86.

See, for example, the implementation of smp_store_release and smp_load_acquire in the Linux kernel [1] (barrier() is just a compiler barrier, and {READ,WRITE}_ONCE are a cast to volatile).

Volatile only prevents reordering of volatile accesses (and IO), not all loads and stores.

[1] https://elixir.bootlin.com/linux/latest/source/tools/arch/x8...


The point is that the bug is unexploitable on x86 because although the source code may have a bug, on x86 it gets compiled to machine code that does not. That's the thing with undefined behaviour, sometimes it does work exactly as you expect, which can make it so tricky to nail down.


Right. The challenge is written incorrectly on purpose, otherwise the code isn't vulnerable. The use of volatile is a bit of a misdirection for the CTF players, since you're right that it's a common misconception that volatile acts like a barrier.

> You cannot write a single-writer, single-reader FIFO on modern processors without the use of memory barriers.

I am not sure about this. From my understanding, on x86, given the absence of compiler reordering, processor reordering should not cause a problem for a single-reader-single-writer FIFO. Normally I just use atomics but I think in this specific instance it should still be okay anyways. Obviously it will not work on ARM.

From my testing if you compile the code on x86 with clang or gcc, the resulting binary is not vulnerable.


Without compiler fences in the right place [1] GCC and clang can miscompile the code even on x86. Doesn't mean they will of course.

[1] see the linux kernel implementation of load acquire and store release on x86 for example.


Another place it's meaningful to use `volatile` is in benchmarking and testing: to either ensure that a block of code is run despite not having any side effects, or to ensure that a block of code that should not be run is still compiled and emitted to binary.

But yes, `volatile` for what should be atomics is a clear code smell. I made quite a loud noise when I read "the code quality looks excellent" in the article.


Well their use is definitely UB as it creates data races. Godbolt to the rescue... https://godbolt.org/z/3rsK6n31z


As noted elsethread, the code is indeed wrong even on x86, although you only need compiler fences there.



