For example, I've been working on a big integer library. Addition is a bit annoying, because you need to add 3 numbers (the prior carry bit, and the two digits you're adding) and get out a new carry bit and a resulting digit. This is a bit cumbersome:
let (res, carry1) = target_digit.overflowing_add(carry as u64);
let (res, carry2) = res.overflowing_add(other_digit);
*target_digit = res;
carry = carry1 || carry2;
The resulting assembly is a fairly literal translation. We perform the addition using an add instruction, and extract the carry flag into a register using setb.
But there's a dedicated instruction for this, adc. adc adds two operands and the carry flag, while itself setting a carry flag. Manually unrolling the loop a bit, I wrote this assembly:
Thanks for this - its been many years since I've done anything touching assembly, and never outside of an academic context, so when I read conversations like this I'm always curious for concrete examples of how people are actually _using_ this stuff beyond something vague like interacting with VT-x or working on an experimental OS.
> What is Rust's policy for signed/unsigned int overflow?
Defined, by default checked in debug mode and unchecked in release, unsigned wraps and signed wraps as two's complement. This can be overridden by explicitly setting overflow-checks in the relevant profile.
It also, separately, has explicitly wrapping, saturating and checking versions of basic arithmetic operations.
overflowing_add, like I was using above, is explicitly wrapping.
I've found it very difficult to provoke rustc into emitting an ADC: the only case where it does so AFAICT is when adding u128s, which are implemented using u64s. Not sure why, except that the shortest function I could think of to emulate ADC is kind of baroque, and it's possible the compiler can't figure it out.
I've mostly been using cpuprofiler[1] and Vtune to simultaneously profile my code and show the assembly. In theory they both provide timing information per-instruction, but I don't really trust it. For the 6 adc instructions above, it shows the number of clock ticks as ranging from 22 million to 3 billion, which doesn't make sense to me. But at least it shows me the assembly!
specifically recommends against the carry variants of addition, because the instructions are still dependent on each other and don't pipeline well. In other words, it's using the same algorithm, just buried in a single instruction, and that doesn't necessarily make it faster.
Have you considered using a strategy similar to what the article suggests? I think the HN comments also had some additional suggestions.
I was briefly very excited when I read that article, actually. But as devit points out in a sibling comment, that technique is only relevant to cases where you're adding more than 2 numbers.
Multiplication initially seemed like a very promising use case, since it's basically repeated addition. But I'm not super optimistic about that, because I think it's dominated by the alternate optimization of noticing that the product of two 64 bit numbers cannot saturate the high 64 bits of the resulting 128 bit number, which causes carries to be bounded [1].
Yikes. I was just trying to link to a relevant technique that the author might find helpful.
A big integer library may have many use cases; a benchmark only shows one data point. It's possible that by deferring carry work across more operations he'd see an even bigger improvement.
The dependency is unavoidable due to the way addition works.
The approach in the article only works if you are adding a lot of numbers together, and then indeed doing carry propagation once at the end is obviously faster.
But of course there is no way that doing the carry propagation yourself on one addition can possibly be faster on a decent CPU that implements add-with-carry efficienly.
For example, I've been working on a big integer library. Addition is a bit annoying, because you need to add 3 numbers (the prior carry bit, and the two digits you're adding) and get out a new carry bit and a resulting digit. This is a bit cumbersome:
The resulting assembly is a fairly literal translation. We perform the addition using an add instruction, and extract the carry flag into a register using setb. But there's a dedicated instruction for this, adc. adc adds two operands and the carry flag, while itself setting a carry flag. Manually unrolling the loop a bit, I wrote this assembly: And got a 3x speedup.