I think this is pretty much exactly right, with the caveat that "fast implementations" probably means "the overwhelming majority of implementations". `libpthread` on GNU has been doing this since smartphones were quite the novelty. I'm sure an implementation without it exists somewhere, but I'm having a hard time imagining who wrote a C++11-compliant standard library and didn't do this optimization.