It's not always superior! For example, I've been working on a big integer librar...

sethhochberg · on June 9, 2020

Thanks for this - it's been many years since I've done anything touching assembly, and never outside of an academic context, so when I read conversations like this I'm always curious for concrete examples of how people are actually _using_ this stuff beyond something vague like interacting with VT-x or working on an experimental OS.

secondcoming · on June 9, 2020

What is Rust's policy for signed/unsigned int overflow? I assume that it's not modulo, or else the compiler should have generated ADC for you.

I assume you've been using Compiler Explorer a lot?

masklinn · on June 9, 2020

> What is Rust's policy for signed/unsigned int overflow?

Defined. By default, overflow is checked in debug mode (it panics) and unchecked in release, where unsigned values wrap and signed values wrap as two's complement. This can be overridden by explicitly setting overflow-checks in the relevant profile.

It also, separately, has explicitly wrapping, saturating and checked versions of the basic arithmetic operations.
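
As a small illustration of those explicit variants, here is a sketch using the standard integer methods `wrapping_add`, `saturating_add`, `checked_add` and `overflowing_add`. The helper name `overflow_flavours` is made up for this example.

```rust
/// Hypothetical helper: applies each explicit overflow flavour to the same
/// pair of `u8` operands so the results can be compared side by side.
pub fn overflow_flavours(a: u8, b: u8) -> (u8, u8, Option<u8>, (u8, bool)) {
    (
        a.wrapping_add(b),    // wraps modulo 2^8
        a.saturating_add(b),  // clamps at u8::MAX
        a.checked_add(b),     // None on overflow
        a.overflowing_add(b), // (wrapped result, overflow flag)
    )
}

/// Signed wrapping follows two's complement: MAX + 1 wraps to MIN.
pub fn signed_wrap() -> i8 {
    i8::MAX.wrapping_add(1)
}
```

These methods behave the same in debug and release builds, unlike the plain `+` operator, whose overflow behaviour depends on the overflow-checks setting.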

palmtree3000 · on June 9, 2020

overflowing_add, like I was using above, is explicitly wrapping.

I've found it very difficult to provoke rustc into emitting an ADC: the only case where it does so AFAICT is when adding u128s, which are implemented using u64s. Not sure why, except that the shortest function I could think of to emulate ADC is kind of baroque, and it's possible the compiler can't figure it out.
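
For readers wondering what emulating ADC looks like, here is a minimal sketch built on two `overflowing_add` calls. It is not the code from the library being discussed; the names `adc` and `add_assign_limbs` are assumptions for illustration, and whether a given compiler version turns this into an actual ADC instruction is something to check in the generated assembly.

```rust
/// Add-with-carry on a single 64-bit limb.
/// Returns the wrapped sum and the carry out.
pub fn adc(a: u64, b: u64, carry_in: bool) -> (u64, bool) {
    let (s1, c1) = a.overflowing_add(b);
    let (s2, c2) = s1.overflowing_add(carry_in as u64);
    // At most one of the two additions can overflow, so OR is enough.
    (s2, c1 | c2)
}

/// Adds `rhs` into `lhs` limb by limb, least-significant limb first.
/// Assumes both slices have the same length (extra limbs are ignored).
/// Returns the carry out of the most-significant limb.
pub fn add_assign_limbs(lhs: &mut [u64], rhs: &[u64]) -> bool {
    let mut carry = false;
    for (l, r) in lhs.iter_mut().zip(rhs) {
        let (sum, c) = adc(*l, *r, carry);
        *l = sum;
        carry = c;
    }
    carry
}
```

Each limb's carry feeds the next limb, which is the dependency chain discussed further down the thread.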

I've mostly been using cpuprofiler[1] and VTune to simultaneously profile my code and show the assembly. In theory they both provide timing information per-instruction, but I don't really trust it. For the 6 adc instructions above, it shows the number of clock ticks as ranging from 22 million to 3 billion, which doesn't make sense to me. But at least it shows me the assembly!

[1] https://docs.rs/cpuprofiler/0.0.4/cpuprofiler/index.html

secondcoming · on June 9, 2020

This will change your life!

https://godbolt.org/z/fntvYa

jeffdavis · on June 9, 2020

The article here:

https://news.ycombinator.com/item?id=23351007

specifically recommends against the carry variants of addition, because the instructions are still dependent on each other and don't pipeline well. In other words, it's using the same algorithm, just buried in a single instruction, and that doesn't necessarily make it faster.

Have you considered using a strategy similar to what the article suggests? I think the HN comments also had some additional suggestions.

palmtree3000 · on June 9, 2020

I was briefly very excited when I read that article, actually. But as devit points out in a sibling comment, that technique is only relevant to cases where you're adding more than 2 numbers.

Multiplication initially seemed like a very promising use case, since it's basically repeated addition. But I'm not super optimistic about that, because I think it's dominated by the alternate optimization of noticing that the product of two 64-bit numbers cannot saturate the high 64 bits of the resulting 128-bit number, which causes carries to be bounded [1].

[1] https://github.com/rocurley/bignum/blob/b45448a156fb9100ab06...
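
The bound on the high word can be checked directly. The sketch below is not taken from the linked code; `widening_mul` and `mul_add_carry` are hypothetical names, and the `u128` arithmetic is just one way to get at the full product. The largest product is (2^64 - 1)^2 = 2^128 - 2^65 + 1, whose high word is 2^64 - 2, so there is room to add two more 64-bit values without overflowing 128 bits.

```rust
/// Full 64x64 -> 128-bit product, split into (low, high) 64-bit words.
pub fn widening_mul(a: u64, b: u64) -> (u64, u64) {
    let p = (a as u128) * (b as u128);
    (p as u64, (p >> 64) as u64)
}

/// Computes a * b + addend + carry_in and returns (low, high).
/// This cannot overflow u128: the maximum value is
/// (2^64 - 1)^2 + 2 * (2^64 - 1) = 2^128 - 1.
pub fn mul_add_carry(a: u64, b: u64, addend: u64, carry_in: u64) -> (u64, u64) {
    let p = (a as u128) * (b as u128) + (addend as u128) + (carry_in as u128);
    (p as u64, (p >> 64) as u64)
}
```

So in a schoolbook multiply loop, the carry passed between limbs always fits in a single 64-bit word.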

adwn · on June 9, 2020

> that doesn't necessarily make it faster

Did you miss the "And got a 3x speedup" part of the post you were replying to? Actual benchmarks of real code always trump theoretical deliberations.

jeffdavis · on June 9, 2020

Yikes. I was just trying to link to a relevant technique that the author might find helpful.

A big integer library may have many use cases; a benchmark only shows one data point. It's possible that by deferring carry work across more operations he'd see an even bigger improvement.

devit · on June 9, 2020

The dependency is unavoidable due to the way addition works.

The approach in the article only works if you are adding a lot of numbers together, and then indeed doing carry propagation once at the end is obviously faster.

But of course there is no way that doing the carry propagation yourself on one addition can possibly be faster on a decent CPU that implements add-with-carry efficiently.

jeffdavis · on June 9, 2020

Maybe it's worth considering an interface to a big int library that can defer carry work across many operations, and then normalize at the end?

That certainly sounds useful for, e.g., totaling an array of big integers.
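
A rough sketch of what that could look like for the array-total case, assuming little-endian `u64` limbs and one `u128` accumulator per limb position. The function name `sum_deferred` is hypothetical, and this is not necessarily the representation the linked article uses; it only illustrates accumulating without carries and normalizing once.

```rust
/// Sums big integers given as little-endian vectors of 64-bit limbs.
/// Carries are not propagated while accumulating; each limb position
/// gets its own u128 accumulator, which has room for far more 64-bit
/// addends than a slice can hold.
pub fn sum_deferred(numbers: &[Vec<u64>]) -> Vec<u64> {
    let width = numbers.iter().map(|n| n.len()).max().unwrap_or(0);
    let mut acc = vec![0u128; width];

    // Accumulation phase: limb positions are independent of each other.
    for n in numbers {
        for (slot, &limb) in acc.iter_mut().zip(n) {
            *slot += limb as u128;
        }
    }

    // Normalization phase: a single carry-propagation pass at the end.
    let mut out = Vec::with_capacity(width + 1);
    let mut carry: u128 = 0;
    for slot in acc {
        let total = slot + carry;
        out.push(total as u64); // low 64 bits stay in this limb
        carry = total >> 64;    // the rest moves to the next limb
    }
    while carry != 0 {
        out.push(carry as u64);
        carry >>= 64;
    }
    out
}
```

For example, totaling `[u64::MAX, u64::MAX]` and `[1]` gives limbs `[0, 0, 1]`. The accumulation loop has no dependency between limb positions; only the final pass does.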