
> Combine that with how cheap `std::mutex` is on most platforms

This is suspiciously hand-wavy. How "cheap" do you think std::mutex is, and on what platforms?

The thrust of your argument makes sense, but one reason it's so tempting is that many C++ programmers are (or at least want to believe they are) writing performance-sensitive code, and yet the C++ language and standard library prioritize compatibility over any other consideration.

So there's a good chance that even though the obvious way to design a mutex on your platform today is a small integer plus OS support, writing `std::mutex` gets you some huge unwieldy mess, because that's what's compatible with programs written when the "First Black President" was still a white guy named Bill...



I mean it's not hard to read the source for your platform. On Linux/x86_64/libc++ it's roughly:

- https://github.com/llvm-mirror/libcxx/blob/master/include/__...

- https://sourceware.org/git/?p=glibc.git;a=blob_plain;f=nptl/...

I don't particularly care to comb through it to see if anything has changed, but historically it was a little spin-CAS to make the non-contended path fast, then dropping into a https://en.wikipedia.org/wiki/Futex. That's about as good as it gets for staying mostly in userspace while still being scheduler-aware, so you're not burning up a core busy-polling, which is what often happens when people try to roll their own shit.

Google wants a bit more latitude on the heuristics and degrees of freedom around read/write ownership, so they did it like this: https://github.com/abseil/abseil-cpp/blob/master/absl/synchr... which is quite a bit better commented/legible.

If anyone reading this can do better than the `abseil-cpp` folks, not only would Google take their PR, they'd probably offer them a job.


I am always disappointed when someone speaks poorly of std::mutex. On Linux it is as good as it can possibly be for a generic catch-all lock, and by that I mean it is really, really good for most use cases. If you want a spinlock to outperform std::mutex you will at least have to do the legwork of using real-time scheduling and guaranteeing that any spinlock you lock will be unlocked within a finite amount of time with a known upper bound. Anything less and your spinlock will cause problems when your thread locks it and then gets interrupted by the OS scheduler.


> On Linux it is as good as it can possibly be for a generic catch all lock

A futex is a 4-byte-aligned 32-bit value, so it needs 4 bytes. But std::mutex on Linux is 40 bytes, ten times larger. Now, maybe where you come from "ten times larger than it needs to be" is "as good as it can possibly be", but where I come from that's not very good.


The only issue with std::mutex on libstdc++ is that it is larger than it needs to be for ABI reasons. Otherwise it is perfectly adequate for many use cases.


Great point. The `libstdc++` ABI is very low on my list of favorite things. Who doesn't love spilling cache lines because of 1990s layouts and taking L3 misses for `auto s = std::string{"hello world"};` because modern SSO breaks that ABI and... /s


It's adequate if you're happy yielding to the kernel and needing some other thread to issue a system call to wake you up.

Some uses of C++ have stricter latency and real-time requirements which aren't compatible with this.


If you're in a serious multicore setting and you don't take waiting threads off the contended cache line while the other 50 threads take their turn with the underlying contended thing, it's very easy to end up way worse off than if you had just let the scheduler park those threads in e.g. a futex and wake them up when it's their turn.

I'm not sure how deep into the low-power or embedded or hard-realtime worlds you'd need to go before `std::mutex` stops doing a modest number of spin-CAS rounds before dropping into the futex, but I'm sure such a platform exists.

And maybe you work at Optiver and you've got an FPGA interacting with the link-layer and you literally never leave userspace after startup and you've got your own hand-crafted DMA busy-poll situation going because you build your machines with the exact number of cores for the number of threads you need and throw them away when the software changes. There are domains like that. </modest-hyperbole>

The number of "aggressively intermediate" people working at serious companies who roll their own concurrency shit because it's Just Fucking Metal Man is terrifyingly high. And it's dangerous, because if you are someone who needs custom concurrency you know, but if you aren't someone who needs custom concurrency, you often still think you know.


As soon as you yield, things become non-deterministic, so they aren't compatible with real-time requirements.

If you have 50 threads trying to access the same thing then what you have is a software design problem. Most synchronization should be done through SPSC (single-producer, single-consumer) queues, which are easy to make lock-free (or even wait-free) and efficient, so long as you're aware of how to deal with backoff on the producing side and idle work on the consuming side.

The Optiver model you're describing is pretty much how I'd build any low-latency application. It doesn't really require special hardware to do these things (you can use io_uring to bypass the kernel context switches for anything). It's also much simpler than what you hint at.


I'm willing to accept that a London-based crypto startup (i.e. LMAX-integrated) could have a use for extreme low-latency, extreme low-variance soft-realtime C++. In the sub-mike p99 regime you probably want to keep multithreading out of it entirely in fact.

Hopefully you can accept that nitpicking the use of well-tested concurrency primitives on a forum full of impressionable up-and-coming hackers is almost certainly going to create downward pressure on sensible engineering choices amongst readers of your comments.


Have you been googling me? A few inaccuracies there ;).


>As soon as you yield, things become non-deterministic, so aren't compatible with real-time requirements.

SCHED_FIFO and SCHED_RR.

Although Linux is not really an RT OS.


Once again, exactly right.

I deeply appreciate you helping to steer this little tire fire of a comment thread I seem to have created off the rocks, I meant well but it seems to have ended up as an advertisement for insane defaults.

With that said, the parent seems pretty committed. And for all I know, is actually doing sub-microsecond software HFT or hard-realtime signal processing.

Thread: listen to this person ^.


Threads can be derailed, that's usually how interesting things come up.


Fast implementations of std::mutex spin for a bounded amount of time before yielding. If you unlock the mutex fast enough it will have roughly the performance of a spinlock, and if your locking thread gets interrupted by the scheduler, other threads waiting for the mutex will stop spinning and yield.

I already mentioned real-time scheduling, which is well outside of the scope of the vast majority of userspace applications.


I think this is pretty much exactly right, with the caveat that "fast implementations" is probably "the overwhelming majority of implementations". `libpthread` on GNU has been doing this since smartphones were quite the novelty. I'm sure it exists, but I'm having a hard time imagining who wrote a C++11 compliant standard library and didn't do this optimization.



