
Reports of some jitted Java code running faster than C (after warmup) are old. I remember seeing those claims for some super-hot VM in the old IBM Systems Journal from 2000, and yes, IIRC that was a VM written mostly in Java.

However, those isolated examples rarely if ever translated to high performance in real-world projects. The same issue has an experience report of the travails of IBM's San Francisco project. Performance was a huge issue that to the best of my knowledge they never fully resolved.

More in my article "Jitterdämmerung": http://blog.metaobject.com/2015/10/jitterdammerung.html



Heh ... one of my employees took it into his head to code up some arithmetic algorithms in C++ a month or so ago. We do not use C++ for anything; we are all Python and JVM based. But he decided he was going to achieve an amazing win by optimising some numerical code for order-of-magnitude benefits, and without asking invested 4 hours into coding it up. I wrote a naive implementation of the same thing in Groovy, of all languages. My implementation was initially 20 times as fast, and I coded it in 30 minutes.

So he debugged some more and figured out that he had misunderstood some of the inner workings of how vectors copy data, and also that he did not properly understand the threading library he was using. He fixed those two things, which reduced the difference to a factor of 4. But he was never able to work out why my code was still 4 times as fast as his C++, and abandoned it.

I know for sure that with appropriate expertise the C++ could probably be made perhaps twice as fast as my Groovy code. But the point is, none of the supposed benefits come automatically, regardless of what language you are using. And unless you flip over to GPU- or FPGA-accelerated methods, the final outcome is well and truly in the same ballpark anyway.

But all this is to say that "rarely translated" might be true for applications that are entirely in the high-performance domain. But for all the applications where the high-performance code sits in niches at the edges, and there simply aren't the resources or expertise to fully tune a native implementation ... I think it's translated all the time.


In my experience, writing Java (or Groovy, here) style code in C++ results in horribly slow code that the JVM runs circles around, and it sounds like that's the problem your employee ran into.

> But for all the applications where the high performance code is in niches at the edge and there simply aren't resources or expertise to fully tune the native implementation

It's interesting you say this, because in my experience it's the JVM that requires absurd amounts of tuning, while native programs are much more consistent. The natural way native programs are written lends itself to fairly respectable performance, mostly because the object and stack model of, say, C or C++ is so much friendlier to the CPU than that of most dynamic languages.

In general, for all that I hear statements along this line, I've only twice seen code to back it up, and in both cases the C was so de-optimized relative to the OCaml version that I suspect it was intentional. The author (the same person each time) was a consultant for functional languages; in one case he switched the C inner loop to use indirect calls on every iteration, and in the other he switched the hash function between the C and functional versions.


In addition, a lot of the techniques used to write high-performance Java boil down to "write it like C": avoid interfaces, avoid polymorphic virtual calls (since you can't avoid virtuals entirely), avoid complex object graphs, avoid allocating as much as possible... it's not nearly as nice as naive Java. Still nicer than C, IMO: if your process segfaults, you know for certain it's a platform bug.


The other thing that makes Java nicer than C is the ease and depth with which you can profile it to discover where the bottlenecks actually are. While it's certainly possible to profile in both cases, the runtime reflective and instrumentation capabilities of the JVM really add a lot of power to it.


There's this classic paper from Google that runs an optimisation competition on the same program written in C++, Java, Scala and Go:

https://days2011.scala-lang.org/sites/days2011/files/ws3-1-H...


This is a great benchmark of the fundamental problems with, say, Java: the code itself is fairly simple and the JITs probably generate near-optimal code given their constraints, but the results clearly show that the GC and pointer chasing really hinder your performance.

If you add in cases where simd, software prefetching, or memory access batching help, the difference will only grow.


It’s not native vs VM, but rather “has stack semantics/value types” vs “no stack semantics/value types”. In particular, OCaml’s standard implementation is native, not a VM.

Also worth calling out Go, which is rather unique in that it has stack semantics but it also has a garbage collector, so it’s kind of the best of both worlds in terms of ease of writing correct, performant code.


Go is not rather unique in having GC and stack semantics; there are plenty of languages that have both, all the way back to Mesa/Cedar and CLU.


I should have been more clear I guess; I was comparing it to other popular languages. Few have value types and many that do (like C#) regard them as second-class citizens.


But Go has an imprecise GC (in the reference implementation) or stack maps (in gccgo), so the GC overhead is rather large. It also lacks compaction, so cache misses aren't great either.


Not sure what you mean by imprecise, but Go’s GC does trade throughput for latency. The overhead still isn’t huge if only because there is so much less garbage than in other GC languages. I’m also surprised by your cache misses claim; Go has value types which are used extensively in idiomatic code so generally the cache properties seem quite good—maybe my experience is abnormal?


>Not sure what you mean by imprecise

It's a rigid term:

https://en.wikipedia.org/wiki/Tracing_garbage_collection#Pre...

perf shows how much time the GC eats, and it's quite a lot. Thus in the majority of benchmarks Go lags behind Java, or is on par with it at best.

>there is so much less garbage than in other GC languages

That is not true, since strings and interfaces are heap allocated; thus the only stack-allocated objects are numbers and very simple structs (i.e. ones which contain only numbers). So you would have a lot of garbage unless you are doing number crunching, which could be easily optimized by inlining and register allocation anyway.


> It's a rigid term

Ah, neat! I learned something. :)

You’re mistaken about only numbers and simple structs being stack allocated. All structs are stack allocated unless they escape, regardless of their contents. Further, arrays and constant-sized slices may also be stack allocated. I’m also pretty sure interfaces are only heap allocated if they escape; in other words, if you put a value in an interface and it doesn’t escape, there shouldn’t be an allocation at all.
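One way to check this empirically, short of a full benchmark, is `testing.AllocsPerRun`. A sketch with made-up types; the zero-allocation result assumes the standard gc compiler's escape analysis keeps the non-escaping struct on the stack:

```go
package main

import (
	"fmt"
	"testing"
)

// pair has a non-numeric (string) field, yet it can still live on the
// stack as long as it doesn't escape.
type pair struct {
	s string
	n int
}

var sink int

func useValue() {
	p := pair{s: "hello", n: 42}
	sink = p.n + len(p.s) // p never escapes this function
}

func main() {
	// With no escaping values, the loop body performs zero heap
	// allocations per run.
	fmt.Println(testing.AllocsPerRun(1000, useValue))
}
```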


Both arrays and interfaces are heap allocated. A slice is just a pointer to a heap-allocated array.

A structure could be stack allocated, but any of its fields would not be if they are anything but a number.

A trivial example:

https://segment.com/blog/allocation-efficiency-in-high-perfo...

    func main() {
            x := 42
            fmt.Println(x)
    }

    ./main.go:7: x escapes to heap
So a trivial interface cast leads to allocation.


Looks like you're right about interfaces (full benchmark source code: https://gist.github.com/weberc2/87d2fdc379065a2765d1c9f490ad...)!

    BenchmarkEscapeInterface-4        50000000   33.3 ns/op  8 B/op  1 allocs/op
    BenchmarkEscapeConcreteValue-4    200000000  9.45 ns/op  0 B/op  0 allocs/op
    BenchmarkEscapeConcretePointer-4  100000000  10.0 ns/op  0 B/op  0 allocs/op
But arrays are stack allocated:

    BenchmarkEscapeArray-4  50000000   21.3 ns/op  0 B/op  0 allocs/op
And structs are stack allocated, as are their fields--even fields that are structs, slices, and strings!:

    BenchmarkEscapeStruct-4  100000000  12.8 ns/op  0 B/op  0 allocs/op

The code:

    type Inner struct {
    	Slice  []int
    	String string
    	Int    int
    }
    
    type Struct struct {
    	Int    int
    	String string
    	Nested Inner
    }
    
    func (s Struct) AddThings() int {
    	return s.Int + len(s.String) + len(s.Nested.Slice) + len(s.Nested.String) +
    		s.Nested.Int
    }
    
    func BenchmarkEscapeStruct(b *testing.B) {
    	for i := 0; i < b.N; i++ {
    		s := Struct{
    			Int:    42,
    			String: "Hello",
    			Nested: Inner{
    				Slice:  []int{0, 1, 2},
    				String: "World!",
    				Int:    42,
    			},
    		}
    		_ = s.AddThings()
    	}
    }


I'm sure your strings are not stack allocated; they are statically allocated (and would be statically allocated in any language). Not sure about the arrays, but dynamic arrays should be dynamically allocated too; your arrays are probably static. They would be heap allocated if you used make.


It doesn't matter whether they're stack allocated or statically allocated; neither is garbage, contrary to the original claim ("Go generates a lot of garbage except when dealing with numeric code"). The subsequent supporting claims ("structs with non-numeric members are heap allocated", "struct fields that are not numbers are heap allocated", etc.) were false: non-numeric members are sometimes heap allocated, but often they aren't allocated at all, never merely because they're non-numeric, and their containing struct is never heap allocated on the basis of what its members hold.

I think this matter is sufficiently resolved. Go trades GC throughput for latency and it doesn't need compaction to get good cache properties because it generates much less garbage than traditional GC-based language implementations.


>It doesn't matter whether they're stack allocated or statically allocated

It does. Any language can do static allocation; Go is no different from Java here. The problem is that in any real code nearly all your strings and arrays will be dynamic, and thus heap allocated, as will your interfaces. Consider also that allocations in Go are much more expensive than in Java or Haskell.


We're talking past each other. My claim was that Go doesn't need compaction as badly as other languages because it generates less garbage. You're refuting that with "yeah, well, it still generates some garbage!". Yes, strings and arrays will often be dynamic in practice, but an array of structs in Go is 1 allocation (at most); in many other languages it would be N allocations.

> Consider also that allocations in Go are much more expensive than in java or haskell.

This is true, but unrelated to cache performance, and it's also not a big deal for the same reason--allocations are rarer in Go.

EDIT:

Consider `[]struct{nested []struct{i int}}`. In Go, this is at most 1 allocation for the outer array and one allocation for each nested array. In Python, C#, Haskell, etc, that's something like one allocation for the outer array, one allocation for each object in the array, one allocation for each nested array in each object, and one allocation for each object in each nested array. This is what I mean when I say Go generates less garbage.
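The allocation count for that shape is easy to verify with `testing.AllocsPerRun`. A sketch with made-up sizes: with 4 outer elements, each holding one nested slice, the expectation under the standard gc compiler is 1 + 4 allocations total, because the inner structs are stored inline rather than boxed.

```go
package main

import (
	"fmt"
	"testing"
)

type inner struct{ i int }
type outer struct{ nested []inner }

// keep forces the slices to escape so the compiler can't elide the
// allocations we're trying to count.
var keep []outer

func build() {
	os := make([]outer, 4) // 1 allocation: the outer backing array
	for j := range os {
		os[j].nested = make([]inner, 8) // 1 allocation per element
	}
	keep = os
}

func main() {
	// 1 outer array + 4 nested arrays = 5 allocations; every inner
	// struct lives inline inside one of those five blocks.
	fmt.Println(testing.AllocsPerRun(100, build))
}
```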


>Consider `[]struct{nested []struct{i int}}`.

A typical example, yeah. I already said that about structs of ints: it's not a common type, unfortunately, anywhere beyond number crunching, at which Go sucks anyway.

In Haskell you could have an unboxed array of unboxed records. Check Data.Vector.Unboxed.


> I've said about structs of ints already

Yeah, but you were wrong (you said other kinds of structs would escape to the heap). The innermost struct could have a string member and a `*HeapData` member; it wouldn't matter. The difference in number of allocations between Go and others would remain the same. The difference isn't driven by the leaves, it's driven by number of nodes in the object graph; the deeper or wider the tree, the better Go performs relative to other GC languages.

> In haskell you could have unboxed array with unboxed records. Check Vector.Unboxed.

For sure, but in Go "unboxed" is the default (i.e., common, idiomatic); in Haskell it's an optimization.


Regarding your last point, Crystal has the same features as Go in that regard, while at the same time being vastly more expressive. This is mostly due to the standard library in Crystal being so nice to work with for collections (which perhaps isn't surprising, as the APIs are heavily influenced by Ruby). Blocks being overhead-free is another necessary part of making this work well.


Yeah, I often find myself wishing Go's type system were a bit better, but the reason I prefer it is because it's fast, easy to reason about, and the tooling/deployment stories are generally awesome (not always though--e.g., package management). So far I'm only nominally familiar with Crystal; I'll have to look into it sometime.


.NET is another example of value types in a garbage collected language. It’s also somewhat unique afaik in doing so within a VM.


Definitely. I’m sad that they’re not more idiomatic in C#. I definitely prefer values and references over OOP class objects.


This is exactly it: dynamic languages give you OK-ish performance and fast development speed. Fast C++ code requires a lot of expertise; that kind of expertise is expensive, and there are diminishing returns too. I don’t know anything about your colleague’s C++ expertise, but given that his first optimization was to eliminate some redundant copying, I suspect there is more room for improvement. After unnecessary copying is removed, it usually boils down to things like cache locality, better memory allocation discipline, data alignment, and sometimes knowledge of a better algorithm applicable to the particular situation (e.g. radix sort, perfect hashing, or tries), judicious use of multithreading (in the form of OpenMP), understanding whether single-precision floating point is good enough, etc.


I don’t think this is a property of dynamic languages. This groovy example is almost certainly the best case for the JVM (arithmetic, few allocations or polymorphism, etc). In other words, probably not taking advantage of the dynamism.

Dynamic languages can be fast by being well designed and simple (Wren) or highly optimized (JS) or both (LuaJIT). There’s also the experimental GraalVM, but this is definitely the exception and not the rule.

Writing performant C++ is actually not hard if you have rudimentary C++ knowledge. That said, rudimentary C++ knowledge is a lot more expensive than rudimentary knowledge of other languages. But the options aren’t just dynamic languages vs C++; there’s a ton of middle ground, with VM languages like Java and C# and native languages like Go. The first two aren’t much harder than dynamic languages, and I find Go easier than any dynamic language I’ve used to date (and I’m a professional Python developer). But all of these languages run on the order of half of C’s performance, 100X the speed of CPython or Ruby, and 10X the speed of JS.



I don’t think I’m following your point. That C# isn’t a VM language because there exist AOT compilers? That’s fine; my point is unrelated to the VM/interpreter/AOT taxonomy—just that dynamic languages aren’t particularly performant. I’m happy to concede the “C# is/isn’t a VM (even though 99% of production deployments are VMs)” point if it matters to you.


Oh blimey, four hours...


Are you going to let your employees get away with investing all of 4 hours into things you did not give them permission for? You haven't shown them who the alpha is until they ask you for permission to go to the bathroom.



