Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Doing small network scientific machine learning in Julia faster than PyTorch (julialang.org)
178 points by adgjlsfhk1 on April 14, 2022 | hide | past | favorite | 123 comments


A library I designed a few years ago (https://github.com/Netflix/vectorflow) is also much faster than pytorch/tensorflow in these cases.

In "small" or "very sparse" setups, you're memory bound, not compute bound. TF and Pytorch are bad at that because they assume memory movements are worth it and do very little in-place operations.

Different tools for different jobs.


Thanks for designing this tool, somehow it's missed my radar. Interestingly it is kind of given that D language performance is always fast. It is a shame that it's not being used more on data science since personally I think it's more natural and better fit to replace Python and C eco-system that's very popular now for data science. Heck, it even beats Fortran in its own turf for the BLAS benchmark several years back [1].

Just wondering why did you choose D when you write the Vectorflow library?

[1]http://blog.mir.dlang.io/glas/benchmark/openblas/2016/09/23/...


Is this still used at Netflix? Looks like it's not actively developed anymore.


It's not developed anymore. I still have a service in production using it: estimating the distribution of runtime of any container we launch, through joint conditional quantile regression (output of the net is p5, p10, ..., p95 of the runtime distribution).

It's a good use case for it, because it needs to be low-latency (it's in the critical path when launching a container at Netflix) and it's using a lot of sparse features to do its job (hashed tokens of a lot of poorly structured metadata about the container and its runtime).


In "small" setups, you're actually more likely to be overhead-bound if anything (especially on CPU).


Would you mind defining "overhead"?


IMO, one of the big sources of overhead is the PCIe bus.

Transferring data to-and-from the PCIe / GPU takes time. If the CPU is faster and can perform the calculation before the PCIe is done transferring, then there's no point even touching the GPU.

PCIe is on the scale of ~5000 nanoseconds (20,000 CPU cycles assuming 4GHz). A PCIe write, then read could be ~40,000 CPU cycles or so, and it turns out that a lot of things can be done in that time before the GPU was even _NOTIFIED_ that there was work to do.

CPUs can contact other CPU-cores within 50 to 500 nanoseconds or so, depending on the distance. A 64-core CPU could then have 64-cores * 18000 clocks == ~1-million CPU-clock cycles before the GPU gets any message at all.


(nit picking) 5000ns (aka 5us) seems ... too high ... for PCIe. Infiniband, connected over PCIe can be sub 600ns for latency[1] for 2 generation old hardware.

Still, that's around 1800 clock cycles for a 3GHz CPU. Which implies there is a floor, below which you want to keep computations local on-CPU versus communicating over PCIe. Whether it be to a remote system via RDMA, or to a PCIe connected accelerator.

[1] https://www.aravision.com/products/infiniband/pdf_mellanox_a...


I appreciate the hard data from a different PCIe system.

I came to the 5000ns estimate because I tested some null-kernel a long time ago on a GPU. Basically a "hello world" of GPU kernels, and I tested to see how long it took for the kernel to be called.

I hear that GPUs have sped up a bit since my test a few years ago, but I'd expect it to still be around 5000ns for a "CPU-GPU ping-pong".


There's no mention of GPUs, TPUs, or indeed QPUs in the memory hierarchy described by Wikipedia!

Locality of reference ("Data locality") > Spatial and temporal locality usage : https://en.wikipedia.org/wiki/Locality_of_reference

Memory hierarchy https://en.wikipedia.org/wiki/Memory_hierarchy :

> Most modern CPUs are so fast that for most program workloads, the bottleneck is the locality of reference of memory accesses and the efficiency of the caching and memory transfer between different levels of the hierarchy [citation needed]. As a result, the CPU spends much of its time idling, waiting for memory I/O to complete. This is sometimes called the space cost, as a larger memory object is more likely to overflow a small/fast level and require use of a larger/slower level. The resulting load on memory use is known as pressure (respectively register pressure, cache pressure, and (main) memory pressure). Terms for data being missing from a higher level and needing to be fetched from a lower level are, respectively: register spilling (due to register pressure: register to cache), cache miss (cache to main memory), and (hard) page fault (main memory to disk).

Is it that PCIe is necessarily implied by the debuggable pipeline specified by the von Neumann architecture? https://en.wikipedia.org/wiki/Von_Neumann_architecture

Otherwise, computation within RAM avoids interconnect saturation.

"Neuromorphic" computing, stateful RAM with operators mapped to particle interactions:

Memristor > Derivative devices > memtransistor https://en.wikipedia.org/wiki/Memristor#Derivative_devices

Quantum reservoir computing: https://en.wikipedia.org/wiki/Reservoir_computing#Quantum_re...

But these still need a faster and wider (and qubit) bus than PCIe, too: https://en.wikipedia.org/wiki/PCI_Express


https://en.wikipedia.org/wiki/Programming_the_Universe :

> Lloyd also postulates that the Universe can be fully simulated using a quantum computer; however, in the absence of a theory of quantum gravity, such a simulation is not yet possible. "Particles not only collide, they compute."

Quantum on Silicon looks cheaper in today dollars.

Devide the universe in QFT field-equal halves A and B, take energy from A to make B look like A, then add qubit error correction, and tell me if there's enough energy to simulate the actual universe on a universe QC with no instruction pipeline.


Anything that’s not an actual computational kernel (python interpreter, Pytorch dispatcher, etc.)

See the flame graph here in the overhead section: https://horace.io/brrr_intro.html


Right, all this overhead is very pytorch specific and doesn't have to be present in all cases. If you resolve everything at compile-time for a limited class of cases (which is what vectorflow does), it's possible to have a solution where none of this overhead exists at runtime (no dispatch, no indirections, no memory copies nor allocations...), neither during training nor serving.

I think pytorch has a genius design and makes excellent tradeoffs. But it's not the perfect solution to every problem. Genericity vs specialization: hard to win on all fronts.


Yeah, my main point is that I don't think the issue is memory movement. PyTorch/Tensorflow do care a lot about memory movement, as memory movement doesn't stop being an issue with larger networks.


Loading the model, preprocessing, post processing, i.e. things not even done in the NN.


If you're loading the model for every prediction, you're doing something really wrong. Pre-processing and post-processing are indeed usually trumping the core computation itself, but as you mentioned it's not done in the NN.


Could be something that is just one once per process or the like. But yeah, if you are doing multiple inferences it would be silly to load the model each time.


anything that prevents my code from running at one or more instruction per cycle.


"small" appears to be key here:

> When we get to larger matrix-matrix operations, such as 100x100 * 100x100, we can effectively write off any overheads due to memory allocations. But we definitely see that there is a potential for some fairly significant performance gains in the lower end! Notice too that these gains are realized by using the pure-Julia LoopVectorization.jl as the standard BLAS tools tend to have extra threading overhead in this region (again, not optimizing as much in this region).

> But, if you have been riding the GPU gospel without looking into the details then this plot may be a shocker! However, GPUs are designed as dumb slow chips with many cores, and thus they are only effective on very parallel operations, such as large matrix-matrix multiplications. It is from this point that assumption (2) is derived for large newtork operations. But again, in the case of small networks such GPU kernels will be outperformed by well-designed CPU kernels due to the lack of parallel opportunities.


"Small" is difficult, and it is actually in many cases, harder than "big". Small needs language support, GC support, runtime support, and is delicate - any one thing can throw off performance. You can't hide behind library calls. In Julia, since we can orchestrate everything, from the abstractions all the way down to the instructions, making small problems work well has been possible.

A single grad student can make large problems work (my thesis was large linear algebra and I could just hack away on enough C and MPI to get it done). In our early days on Julia, we realized that 90% of the world actually needs small linear algebra and it is a tantalizingly difficult problem. The work done in the Julia community over the years has made it all possible through a collaboration across lots of different teams and disciplines.


I think this is a pretty key point- the majority of users haven't been well-served by large-scale compute. I've heard from a number of folks in genomics that all they need is a faster way to invert a "big" matrix- and when they show me the matrix, it's tiny compared to what state of the art supercomputers are working on.


Ask them to download Julia and try it, and file an issue if it is not fast enough. We try to have the latest available.

See for example: https://github.com/JuliaLinearAlgebra/RecursiveFactorization...


One of the first questions I ask when discussing issues like this, is to define "big" and "small" for me. Every group has their own set of definitions, and the differences can result in interesting conversations.


That's especially funny because inverting a matrix is almost never what you want to do anyway.


Yeah, in retrospect, I'm sure what they really wanted was to diagonalize the matrix.


The performance advantage was still in the ballpark of 4-8x faster for training MNIST on the CPU, which while smaller than most networks people are training on their GPUs, still has more than 40 thousand parameters.

For someone with a statistical background, this is a lot of parameters. John von Neumann could wiggle a lot of elephant trucks.

A lot of practical/useful models fill the range from the tiny ones we may use in UDEs and SciML to this MNIST convnet.


yeah, SimpleChains definitely won't be useful for the next large language model, but there's a lot of very small models in the world from physics, to pharma, to high frequency trading. Having libraries that are optimized for these usecases is a major benefit for a fairly large niche of users.


Neural network renaissance definitely renewed the interests to use neural networks as function approximators in many areas.

For the one I am interested in recently (dynamics control), many networks are small feed-forward ones or two-layer LSTMs. The importance there is about balance robust controller v.s. update frequency. If you can run the same controller at higher frequency, it is basically free performance improvement (the robot will be more stable, more responsive etc.).


And video games, too. Small models may be useful for procedural content and dynamic environment and actor responses. The barrier, so far, had been the performance overhead of using them.


The question is if these users don't already have all of their code in Python/C++/etc. and would even consider using Julia.


Yeah, also for small networks that perform well in Julia, there's XNNPACK with AVX2, so I wonder how those 2 compare. But as soon as your network is large enough that execution time exceeds PCIe transfer time, GPUs win in any case.


Another factor is that you don't have to ship the data to the other side of the PCI bus.


For a scientific / academic focused language, I can’t help but feel like blog posts about Julia always seem to evangelize performance in the language in a biased or subjective manner. Anyone else get the same sense?


SimpleChains.jl/LoopVectorization.jl author here, and a co-author of the blog post.

I love working on performance. It's fun and challenging. Trying to best high scores (times/benchmarks) is fun and gamifies it.

Personally, I answer tons of questions online (e.g. the Julia discourse and Slack) related to things like SIMD because I find it fascinating and want to share this excitement with others. Same thing when it comes to a blog post or announcing a package; I focus on what motivates me most.


Chris, your work is phenomenal and I super appreciate everything you do.

I think my comment was mainly that if I were viewing it objectively (for example, knowing nothing about Julia), I think a lot of Julia's evangelizing comes off too strong. This is definitely not directed at you btw, but just that the community as a whole is just overly positive and ONLY positive.

For example, if you search for Rust on hacker news or even browse /r/rust, you'll need 80-90% pro rust, but you will also definitely see another 10-20% of posts being very critical about Rust. And these critical posts are even from prominent members of the Rust community. Additionally, there's a roadmap for Rust posted very often.

In Julia however, I can count the number of "negative" Julia posts I've seen on one hand. Don't get me wrong, I really like Julia, but there are A LOT of things that need to be improved that I know about and probably more that I don't even realize. It would be nice to have experts post and comment on that, and the core Julia members describe how they are going to tackle that. IMO, that discussion does happen, but it is largely on Slack and forever lost to Slack's history.


As a Julia evangelist, your point is well received.

I've definitely been making an effort to temper any discussion of benefits with associated drawbacks.

Though, I just want to express WHY I'm so enthusiastic. I've used both R and Python for DS. I'm familiar a bit with matlab.

I've played with Keras and Pytorch.

Aside from all the technical and productivity benefits Julia just feels GOOD to write in, compared to all of the above. It's a real quality of life improvement + a feeling of freedom in that I can express my ideas, compose them with others all without worrying about a different compiler or framework or array type system or having to code units in c++.

This is freeing and honestly, it would be sad if it's relegated to niche status forever. There's definitely a selfish component here, but I'd like everyone to feel the same benefits.

It's hard to go back to pandas or tf or pytorch after using chain.jl, dataframes.jl and flux.

Yes, there are drawbacks with compile times and things like that, which can be annoying. But I keep working in julia because the benefits are worth it, those aren't intrinsic to the language, and most importantly, there's a roadmap to fixing them: https://discourse.julialang.org/t/precompile-why/78770/8 with progress that is tantalizingly close to the end goal of easy dev and easy deployment.


> It's a real quality of life improvement + a feeling of freedom in that I can express my ideas, compose them with others all without worrying about a different compiler or framework or array type system or having to code units in c++.

As someone who spends 98% of his time in Python and the remaining 2% in R I'd love to switch to a more nicely designed, ergonomic, high-performance language. Adopting Julia at work is a nonstarter for now, but what really holds me back from learning it is how thoroughly academic all the evangelism seems to be. Most working data scientists aren't doing things like physics simulations and won't be terribly interested in solving differential equations; even in grad school the topic rarely came up for my econometrics research.

To put it bluntly: reading about Julia kind of makes me feel a bit dumb, not excited.

When I do work with neural networks, the networks just aren't going to be small because the value in them -- for the kind of work I do -- is in their capacity to use large amounts of data to find useful representations. So the benefit of Julia over PyTorch described in the link is only interesting to me in a "huh, I guess that's a bit cool" kind of way.


I guess it's mostly academics who have unusual problems that demand custom solutions, but which are small enough so people can switch to a language their lab group is not using and still be successful with their projects. I also wish for more common adoption but it is hard to make headway against python and R that have years of development for "basic packages" for typical users in them, which is just not the kind of stuff most academics can spend their time on.


I appreciate your honestly and clarity about your intentions.

Maybe this my biggest gripe when someone claims Julia is objectively better than Python or R. The answer to this question is highly subjective, heavily depends on whether you use VSCode or not, and people that usually answer this question are heavily invested in getting other people to use Julia.

But the answer is usually presented as an objective answer, and I guess that just kind of rubs me the wrong way.


It's definitely not objectively better. For normal workflows, the startup latency is a real hindrance, even if the community has collectively built norms around that which help a bit (keeping a long running repl session open, etc.). But in terms of expressiveness plus attainable performance it really is hard to beat. I think it's nice that the path from quick draft to really performant code is continuous, and not a big gap like switching languages.


Come hang out on our #gripes slack channel for a while; it's one of my favorite channels. But core folks also regularly discuss negative points on the discourse board, too. In my recent memory (like, past month or so?) that has included private APIs, default environments, and (of course) precompilation latency.


Thank you for saying this. As an outsider I (I only tried Julia a few times) this is exactly the impression I get every time I read a post (and comments) about Julia on hn. Don't get me wrong, I believe there are lots of interesting things happening in Julia, but it is not a magic wand. Often it is characterised as python with the speed of c, based on some highly optimised benchmarks, while the reality is much more complex. I find this uncritical evangelism quite off-putting tbh.


If all you read is the highlights of the accomplishments, then of course you're getting only the good stuff. Feel free to read all of the details: it's posted all over Github/Reddit/Slack/Discourse/etc. The same individuals share all of the nitty gritty performance issues of every aspect we've ever found.


Point taken.

I, speaking for myself, will keep this in mind next time I set myself to gushing about Julia.


Appreciated. I realise that part of this is just being human, if one likes something one wants to promote it. There is a real danger with overselling though. I got quite excited about Julia initially, but when I looked into it some more I found that it couldn't live up to the expectations that were set, and so I have become very sceptical towards all the news.


It's largely a matter of where the community gathers, that ends up giving this impression.

You mention r/rust downthread - r/Julia is quite inactive in comparison. However, the places where Julia discussion actually happens - Discourse, Zulip, etc. - do contain a lot of criticisms, wishlists from other languages, roadmaps for improvements, etc. But these are not places you would randomly come across if you're not involved with Julia, like a subreddit would be.

And the scientific/academic focus of the language that you mention is actually another reason this happens. When the language's users are (primarily) developers, they (we) nitpick and think about alternate designs and blog about them for no reason at all. The average Julia user instead would just ask in the discourse/other forums, maybe complain, and then move on with their research/engineering problem. So the blog articles that end up getting posted here are most often by the core language/package developers announcing new features, breakthroughs, and other positive news.


Do you have a specific example? Did they miss some pytorch optimization? It takes work to write fast idiomatic code. It's not fair to expect someone to be able to squeeze every ounce of performance across n languages and n frameworks. So it's natural that they're not going to know every trick.

Part of what Julia offers is that you need less tweaking, and the whole stack is in julia so it's quite hackable.


Yes, as a long-term Lisper in research computing who knows a bit about linear algebra implementation and compilation. I don't see the comparisons I think are relevant to claims of how much more performant it's supposed to be. It's worth pointing out that you don't get the highest performance (close to peak) for things like normal GEMM on something like SKX a priori, at least from what I've seen. It needs experimentation, like fiddling with block sizes and prefetching.

An example I've asked about previously: When comparing a recursive implementation of an LAPACK operation with OpenBLAS, is that using the RELAPACK(?) implementation in OB, and if not, how does using a similar algorithm in C or Fortran compare? Generally, I want details of measurements.


In fairness, for the last major scientific/academic ML-based project I was working on, we had to make a lot of compromises due to speed and performance.


My experience is that you use as large a model as your target platform and application allows. Bigger is better.


+1. And aggressively dismiss anyone who doesn’t love it.

I’d quite like to like Julia, but that’s one of my sticking points.


They’re taking what they can. The language isn’t getting the adoption that it would have if it was as good as the small community around it claims that it is. It just comes across as a cult.


I really like Julia. I wouldn't say I come off like a cultist, but I do think they spend an annoying amount of time on PR for themselves.

I think Julia is stuck in a spot where they are better than say R/Python for scientific computing on a fundamental language structure level and package manager level, but that it's not good enough to make up for the fact that those languages have a more robust ecosystem.


I think many people underestimate the inertia present in a large community, particularly one like scientific computing. I think Julia is much-much better than R/Python at basically everything important (as a language and platform). I certainly wouldn't call the python ecosystem robust. It's decades upon decades of ugly hacks and hacks upon hacks. The pandas internals are a Kafkaesque nightmare. But R/Python is where millions of people work daily and produce a huge amount of mindshare and libraries. You'd need an incredible amount of resources to replicate that in any other space.

I'd argue though that this mindshare is not there due to Python/R as a language (and platform) being better, but simply because there were no better alternatives 10-15 years ago and by now the sheer inertia makes it impossible to stop.

You'd need to convince a sufficient portion of people to move to Julia at more or less the same time. Few people want to be first movers. These tend to be the ones who actually care about the qualities of the platform, not just "get the work done and clock out".

If you are a data scientist, you won't be paid for moving to Julia. You'll be paid for coming up with working models. And you take a serious risk by moving to a new platform with little adoption. The platform might die, taking your tools and processes with it. You won't be able to rely on your colleagues advice about technical issues. You can run into bugs more frequently simply because fewer eyes have looked at the ecosystem.

The end result is that few people take the plunge.


I would outright disagree with this - data science and ML is a subfield of software where you have a much higher ratio of greenfield projects over the past decade, and you also have many people who are moving to the field or are starting out in it. It also attracts the kinds of people who will be willing to use cool tech just for the sake of it. If that's not conducive to picking a new programming language with lots of promises, I don't know what is. And they still aren't picking Julia.


What do you disagree with if I may ask? Most of the people I work with really like Julia, we have dabbled with it and would like to see it succeed python/matlab/r.

Why aren't we moving, you ask? Because moving to another ecosystem does not put bread on the table. I cannot go to clients and say: this past 6 months we made no improvements to our strategies, but look, we migrated to a new programming language that is used by a fraction of a percent of our peer group.

The field attracts clever people and there are many greenfield projects. But just because we start a new project, it doesn't mean that doing it in a tiny ecosystem is sensible. Some firms can do it. Jane Street has the means to basically be "OCAML The Systematic Trading Language". This is not a luxury afforded to most.

And thus we get a chicken and egg problem, where nobody wants to be the sole first mover as there's little advantage to it. At the same time, we all see that everyone would be much better off if we moved.


I've been using R since beta and Julia is the first thing that I've seen since then in numerical computing that reminds me of that time.

Julia still has a lot of empty library space compared to R or Python, and isn't perfect, but my guess is it will catch up. R did when it was being compared to SAS, fortran, C/C++, lisp, and so forth.

I'll be honest and say that I wish something else more general-purpose (to the point of say, having a bootstrapped compiler) would be in its spot but I can't really complain right now. Maybe something else will catch up.

I think for me personally is that R and python is a bit in denial about its performance limitations when it comes to hard problems. You pretty much have to drop down into C/C++ to address them, and for certain things, julia really does do many times better time-wise, without the weeds of C/C++. I think there's some people (myself probably included) that are tired of being forced to choose between the C/C++ and R/python worlds. I think there is a bit of overhyping and/or cult-like nature of julia but I also think some of it is trying to convince people that you don't have to choose between expressiveness and performance.


Since when does popularity have a causality with quality?


Since people have been able to choose what tools to use.

It might not be a perfect relationship - there will be lagging effects and other factors, it will vary based on the specific task, and some of it will be subjective - but you can’t pretend that there is no relationship at all.


If programming is your job you seldom have the choice of the programming language.

There are lits of external factors like costs, available support, availability of compilers for certain hardware, support of the existing tool chain, learning curve, career chances etc.

Python and JavaScript aren't the best but sufficient enough


> If programming is your job you seldom have the choice of the programming language.

This isn't true. Sure, you can't unilaterally pick whatever language you want when you work with other people. But for every piece of software ever built somebody had to decide what to build it with. And even though Julia has existed for 10 years, almost nobody is picking it over the alternatives.


Julia's first light may have been 10 years ago, but the first point that it really met a substantial fraction of promise was much more recently. Until 2.0 was released around 2000, Python really was pretty niche and even its popularity was even waning (according to TIOBE) until about 2014.

Java started with a first implementation in 1991, but I didn't find it usable for serious production until about 1999-2000. That isn't quite 10 years.

10 years from inception to mature-enough-for-production seems to be kind of the rule. Building support systems and a community is hard, hard work.


I doubt that any software company switches it main programming language and throw away all of it's gathered experience. So it's not about quality.

You need to gather a certain critical mass of users to get popular. Otherwise Java , JavaScript or Python would have been replaced already.

Most of the time accessibility beats performance.

What do you think why Visual Basic was so popular and why Python is now?


Mainstream tools are gravitate towards lowest common denominator and have to accommodate ease of use. It’s no different in other fields.


I think this is a great opportunity - actually being able to show real-world (though small) nets in an educational context is great. This means students can try & tweak the net without requiring access to a big cluster or tons of beefy GPUs.


The 5x speedup does nothing for me - I want to use a stable, widely adopted, well documented tool, with convenient deployment and lots of available pretrained models for transfer learning. Julia is very far from it right now.


This is one of the benefits of a capable JIT for numerical computing. Even including compile time, the overall execution time is way lower. Chris Elrod is an AVX2/512 whisperer.


JIT-compiled code is way faster than linear algebra kernels which are completely vectorized by GCC et al? How does that work?


if you only vectorize the linear algebra, you leave performance on the table. Vectorizing fused operations reduces the number of memory passes. Also knowing the sizes (which are chosen at runtime) is necessary to make optimal decisions.


I was responding to a blanket statement alluding to AVX512 (which is rather a can of worms). I don't understand what this is on about, especially with reference to the article. Of course you account for matrix dimensions at run time, like to decide whether or not to do packing in GEMM, whether to call out to GEMM for matmul, etc.


SimpleChains does not call out to gemm. All layers are currently implemented as naive for loops. It uses LoopVectorization.jl to compile them, which does a good job leveraging AVX512 -- much better than llvm or gcc.

It doesn't pack, which is why column major A' * B is slow: https://juliasimd.github.io/LoopVectorization.jl/latest/exam... And also why SimpleChains wouldn't scale to large matrix multiplies. But, with 1 MiB L2 cache (e.g. Skylake-X), a 512x256 dense layer of Float32 is still small enough to not really need packing, so I haven't yet needed to implement it (but I will eventually, in a future version of a rewritten LoopVectorization that also actually adds dependency analysis). For an ML library, I'd implement packing via changing the data layout of the parameter vector to tile major, to just skip any runtime packing altogether (i.e., the data would be pre-packed). Only extremely large arrays benefit from a second packing level, so that I don't think it's worthwhile; smaller batch sizes would avoid the need.

The benchmark plots I linked above used dynamically sized arrays. LoopVectorization.jl performed better at small sizes than MKL, and much better than everything else. Compared to that application, LV can also specialize on compile time sizes, fuse the addition of the bias vector, and fuse the activation function.


I've been doing optimization work in C#, and I think findings like this one are unremarkable. I will readily admit C# is slower and less flexible than say C++, but for small things, I can always beat calling out to external code by writing optimized C#. This is usually because even a slower language like C# is still good enough to make the cost of marshalling objects or an IAT lookup become far more costly than the difference between the optimized C++ and C#. I don't know if thats the reason here, but I have my suspicions.


Julia can be as good as it wants, but until I can download JuliaStudio(r studio) or Juliconda(anaconda) and get to Just Work™, forget it.

I tried to use Julia, and all I got was visual studio code complaints about this and that and how the julia plugin cant detect something...ugh

The website doesn't even explain how to get started. I figured out that VScode + plugin was the way to go based of reddit.


All I had to do to "install" Julia on my Mac was download the .app and symlink its executable into /usr/local/bin (or wherever). Iirc using the vscode extension requires a handful of packages but they tell you the ones you need to install.


https://github.com/JuliaLang/juliaup

Rest your weary installers friend. Never install a new version of Julia by hand again.


Ease of installation is definitely a fair criticism until juliaup is made the default way that people are told to install and update on the website. We should change this sooner rather than later.


You don't need juliaconda with julia cause default package manager just works good enough.

But i can agree community is way 2small and compared to big languages tooling are extremely poor.


If I'm understanding right, this CPU vs. GPU bit seems apples-to-oranges for how GPU/parallel alg research would advocate optimizing this:

- if you have one tiny network, no one cares, use pure python

- if you have a bunch of tiny networks, yes, doing a ton of tiny GPU kernel calls will kill you with overhead <-- strawman because ...

- ... you can likely combine them into bulk sparse kernel calls and spread them out. If more of an iterative simulation, same thing, just over time.

This goes back to the 80's with segmented scans, cray, & connection machines. Put all the data together, figure out regular bulk operators, even if individual parts are unbalanced or otherwise irregular, and voila. Since then, a lot of basic tricks for regularizing (sparsity, speculation, ...), and has gone into overdrive w/ GPU era.

If this is as worthy the investments these specialists are doing, as it seems to be many years of R&D by smart groups, I'm surprised they're comparing to an unoptimized GPU alg. Presumably, at some point, they did a more reasonable one?

It used to be that honest CPU<>GPU implementations were generally within 2X of one another here once you considered normalizers like performance-per-watt (= per $), though new CPU HW is closer to GPU HW and new GPU HW is fancier, so one of the funnier results is, algorithmically, you have to use the same fundamental GPU-era tricks either way.


This approach sometimes works, but not always. For a simple example, if you are solving control problems/ or Reinforcement Learning, you need to keep your batch sizes small to keep latency low.


Cool that tiny models have better performance on some Julia frameworks then on pytorch but till Julia get some framework that allows for seamless multi-gpu training a lot of DL people just wont use it sadly. Its for sure easier to just optimize your small model on device exporting it from pytorch then write your own multi-gpu bridge using Nvidia software to train it :/


If your network is small, use a larger batch size to take advantage of GPUs


Yes, we discuss this in the post, and cases where this is not possible. For example, in the adjoints of universal differential equations you have to take a Jacobian-vector product. You can make that a matrix by batching, but then you are making a trade-off because you're effectively solving multiple ODEs simultaneously. When doing this, you might be able to fill the GPU kernels more, but you also have to take more time steps!

This is because adaptive time stepping is only correct if you take the minimum of all of the steps (average would make some above error tolerance, or worse unstable). If you have some ODEs stiff, or worse have stochasticity that can cause random times to shrink the time steps at different points for different solvers, making a solver of the union of equations something that is extremely wasteful. And that becomes a dilemma: we started trying to solve multiple ODEs at the same time to fill a matrix-matrix multiplication kernel, but by doing so we now have to take 100x more matrix-matrix multiplications! So that ended up nowhere, and we focused on matrix-vector performance via direct SIMD and effective use of mutation (as the blog post describes) and got very good results. We are also redoing some of the GPU ODE solvers and seeing about 100x improvement by avoiding batching in a similar way, but that's a post for another time.

All in all, some applications can just slam a big net. Others can't. The two cases need different optimizations.


Taking a union of differential equations is a pretty bad idea for exactly the reasons you describe. But it's absolutely possible to parallelise multiple diffeq solves without needing to use the same time steps for each solve. That is, the time steps can be vectorised, rather than broadcast, over the batch axis.

So the only thing you actually need are the same number of steps for each solve, which can be easily accomplished by just padding out the solves that finish with slightly fewer steps. In practice this ends up introducing negligible overhead whilst solving the above issue very neatly. For example this is precisely what Diffrax (https://github.com/patrick-kidger/diffrax) does under `jax.vmap`.

I've not dug into what Julia does here; is this not already done when broadcasting `DifferentialEquations.solve`?


Yeah, that's why I don't like vmap so much. It just never gets you an optimal algorithm. On CPUs, you don't need to do the same number of steps each solve: you might as well fully decouple all of the solves and max out the threads. Broadcasting DiffEq.solve will do this fully decoupled version, but it's better to parallelize it either with multithreading and distributed (or polyester). That's of course the better thing to do on CPU, and then you don't need to worry about filling kernels or whatnot because you can just oversubscribe if you have to.

On GPU vmap isn't as bad at face value, I thought it would be fairly close to optimal, but when we dug in we found it wasn't the right approach for the ODEs we were looking at either. Julia has KernelAbstractions.jl which can do similar code transformations as vmap, but when digging into the CUDA profiler we found it takes a pretty substantial number of trajectories to fill the kernels. There were a few more optimizations we could do to it, but it was reaching its limit rather quickly. Meanwhile, it turns out that if you can just generate .ptx kernels for the ODE, minimize the usage of global memory by keeping all ODE parts locally in registers, and then only couple within warps, you can beat that by about 100x. So some of our other GPU stuff has gone stale for a bit while we've been working to get this approach completed. It was somewhat of a disappointing conclusion because it means we have to really specialize some codes for GPU, but anyways, full details will come probably around the end of summer.


How does this work with stiffness adaptive solvers? The computation you need to do for when an equation is stiff looks pretty different from when it's not stiff, and I'm having a hard time seeing how that would work if you vectorize the timesteps. Do you need to synchronize after every timestep to split into separate batches for stiff and non-stiff?


You're discussing switching algorithms like LSODA?

This is a really good question that I don't have a neat answer to. You can actually also hit similar issues when naively vectorising some stiff algorithms that detect when to recalculate Jacobians: at each timestep, some batch elements will want the recalculation whilst some won't.

In both cases, something like what you suggest might be a possible solution. And thankfully support for such custom batching will Soon (tm) be coming to JAX.

In the mean time, Diffrax actually side-steps the problem by not implementing those kinds of solvers. New solvers usually get added on a by-request basis, so this just hasn't been an issue yet.

So:

1. Do you know what DifferentialEquations.jl does in this scenario when running on the GPU? (On the CPU needing to sychronise batch elements usually isn't such a concern.)

2. If the price of using JAX is giving up a bit of performance in this one use case, then I think I'm okay with that. The clear contenders in this space are JAX and Julia, and right now it's a concious choice to be using JAX as a better fit for the problems I'm tackling. This comparison is something I've written about before: https://discourse.julialang.org/t/state-of-machine-learning-....


One of the best general purpose solvers (which DifferentialEquations.jl often uses by default) is AutoTsit5. This is a solver that detect stiffness per timestep and uses either Tsit5 (a 5th order RK solver) or Rosenbrock23 depending on the stiffness. These solvers are great because they typically give really good results for a wide variety of problems.

The main way DifferentialEquations.jl deals with this currently is being faster on the CPU than other solvers are on GPU. For small models, GPU has too much overhead, and for large models, you can do linear algebra within timesteps on the GPU without compromising solver efficiency.


The article addresses that:

> This problem is far too small to saturate the GPU, even with such a large batch size. Time is dominated by moving batches from the CPU to the GPU. Unfortunately, as the batch sizes get larger, we need more epochs to reach the same accuracy, so we can hit a limit in terms of maximizing accuracy/time.

The batch size in their example is 2048.


You could also preload your datasets to GPU, if its RAM is large enough.


The Flux.jl example did this. A PR to the PyTorch example to do this would be welcome: https://github.com/chriselrod/LeNetTorch


For reinforcement learning using mini batch updates, batch size beyond 32/64 is shown to decrease performance in various problems.


The article asks "Which Micro-optimizations matter for BLAS3?", implying small dimensions, but doesn't actually tell me. The problem is well-studied, depending on what you consider "small". The most important thing is to avoid the packing step below an appropriate threshold. Implementations include libxsmm, blasfeo, and the "sup" version in blis (with papers on libxsmm and blasfeo). Eigen might also be relevant.

https://libxsmm.readthedocs.io/

https://blasfeo.syscop.de/

https://github.com/flame/blis


Blis is very bad at small sizes, last I checked. Their approach was based on having a single microkernel, and setting up calls to it.

Last I checked, blasfeo did not support AVX512, and this performed poorly on CPUs supporting it.

I'm not really familiar with xsmm, but someone showed me this: https://haampie.github.io/smm-bench/cascadelake/ https://haampie.github.io/smm-bench/skylake-avx512/ LoopVectorization.jl performs best with AVX512, but not badly without it: https://haampie.github.io/smm-bench/znver2/

Versus Eigen, LoopVectorization.jl does better at small sizes (less than a couple hundred, up until packing matters) on my computer.


One thing to note is that LoopVectorization is much more general than traditional BLAS interfaces since it can automatically generate complicated efficient kernels eg tanh(A*v+y). The traditional model of calling out to hand optimized BLAS is severely lacking because that means that your code has to make use of existing hand designed kernels which ends up leading to extra passes over memory.


What does that have to do with what I wrote? At least compare maths function apples with apples.

Most of the point of optimized L3 BLAS is optimizing for memory so it's compute-bound asymptotically but, for instance, header-only libxsmm isn't normal BLAS.

(I'm used to just using -Ofast to get vectorization, not having to include a special library.)


If you want the -Ofast equivalent in Julia, there is @fastmath.

LoopVectorization.jl is substantially more powerful. It also has tons of limitations, but I am rewriting it to fix most of them.

Disclosure: I'm the primary author of both SimpleChains.jl and LoopVectorization.jl.


Could you elaborate on "just using -Ofast to get vectorization"? I can not really believe this works for anything but the simplest of kernels.


-Ofast can replace some functions with more vectorizable versions (in Julia terms, tanh to tanh_fast) which can then let the compiler optimization passes add SIMD where they see fit. The issue that we've seen though is that the compiler optimization passes really do not perform as well as you'd hope.


Does anyone care about small neural networks? What can a small NN do that XGBoost can't?


Better accuracy vs time (especially with optimized frameworks).


It's interesting, because the common wisdom that I've encountered is that neural networks are better when you can feed them large amounts of data. But some of the use cases here (e.g. approximating the solution to a set of partial differential equations) are so far outside the kind of work I do that I have a hard time conceptualizing them or how they work.


Hell, the CPU is probably faster than GPU (... wait for it: for small tensor sizes).


Python will be done soon.


one might hope


I thought that - like numpy - pytorch was a thin wrapper over a C++ library. So why the slowdown?


FFI and allocation overhead, plus not being able to perform global optimizations. Each numpy call is pretty fast, and if that's all you're doing then it's probably pretty close to optimal (ignoring for the moment that their routines explicitly optimize large inputs at the expense of small -- for small vectors you can beat numpy even in pure python if you don't explicitly rewrite the problem as a function of larger vectors with appropriate shapes), but a composite call graph potentially (and often) might have additional optimizations that can be applied.

For an illustrative example, imagine computing the sum of all elements of some `x-1` where x is some numpy vector. Each individual operation is dispatched to numpy without any global context, so it must explicitly allocate a buffer for the x-1 result before computing the sum. The astute observer might note that you can instead use np.subtract(x, 1, out=x) or np.subtract(x, 1, out=buffer) to reduce the impact of allocation, but the problem is _still_ dominated by memory bandwidth on modern hardware (at one point in the past memory access was about as slow as a computation, and now it's 100x slower at least), and having to iterate over roughly the same elements twice rather than doing a bunch of CPU/GPU operations at each element roughly doubles the total computation time if everything doesn't fit an a very fast cache. Even when using numpy to its fullest (which most people don't), the model of stringing together a bunch of local optimizations can't always and doesn't usually compete with the model of at least attempting to globally optimize the whole computation.


In general, numpy has a minimum execution time of around 1 microsecond per operation. If your operations are on small arrays, Julia will be much faster since small operations can be very fast (eg adding 2 100 element arrays takes about 10 ns).


...a C++ library with a CUDA backend. But these high-performance building blocks might only be saturating the GPU fully if the data is large enough.

I haven't looked at implementing these things, but I imagine uf you have smaller networks and thus less data, the large building blocks may not be optimal. You may for example want to fuse some operations to reduce memory latency from repeated memory access.

In PyTorch world, there are approaches for small networks as well, there is https://github.com/NVlabs/tiny-cuda-nn - as far as I understand from the first link in the README, it makes clever use of the CUDA shared memory, which can hold all the weights of a tiny network (but not larger ones).


Not related to the authors and don't have the same machine, but on a V100, tiny-cuda-nn performance for the blog post matrix power example:

    Warning: FullyFusedMLP is not supported for the selected    architecture 70. Falling back to CutlassMLP. For maximum performance, raise the target GPU architecture to 75+.
    Warning: FullyFusedMLP is not supported for the selected architecture 70. Falling back to CutlassMLP. For maximum performance, raise the target GPU architecture to 75+.
    Initial Train Loss: 5.7188
    Initial Test Loss: 5.2812
    Took: 11.41 seconds
    Train Loss: 0.0354
    Test Loss: 0.0514
    Took: 11.58 seconds
    Train Loss: 0.0327
    Test Loss: 0.0511
    Took: 11.42 seconds
    Train Loss: 0.0316
    Test Loss: 0.0505


I think almost of the time here is python overhead because if we increase the batch size 10x, it still takes the same time:

    Warning: FullyFusedMLP is not supported for the selected architecture 70. Falling back to CutlassMLP. For maximum performance, raise the target GPU architecture to 75+.
    Warning: FullyFusedMLP is not supported for the selected architecture 70. Falling back to CutlassMLP. For maximum performance, raise the target GPU architecture to 75+.
    Initial Train Loss: 5.5391
    Initial Test Loss: 5.5938
    Took: 11.03 seconds
    Train Loss: 0.0444
    Test Loss: 0.0545
    Took: 11.16 seconds
    Train Loss: 0.0388
    Test Loss: 0.0496
    Took: 11.01 seconds
    Train Loss: 0.0384
    Test Loss: 0.0490
See [gist](https://gist.github.com/rejuvyesh/6c428ea12154edbb36cd4359fa...) for the implementation.


I guess JAX would be a more fair comparison? It has a JIT compiler


I'd be really interested in a Jax implementation. Any chance you can PR one?


here's an impl: https://jott.live/code/train_matrix_exp_estim_jax.py

it works but surprisingly slow on my devices


    Initial Train Loss: 5.9414
    Initial Test Loss: 5.9414
    Took: 175.90 seconds
    Train Loss: 0.0108
    Test Loss: 0.0108
    Took: 183.89 seconds
    Train Loss: 0.0016
    Test Loss: 0.0016
    Took: 188.95 seconds
    Train Loss: 0.0010
    Test Loss: 0.0010
SimpleChains is about 0.5 seconds on this computer, about 5 seconds to compile. So you're right, it is slow, too slow, like 300x faster than Jax, and I suspect something is wrong with the Jax script. It should probably be closer to like 2.5x or so because of memory handling and vectorization stuff, but not any slower than PyTorch.


I've just run this myself.

Tidying up the JAX script, I find that it runs in ~15 seconds on my laptop CPU. Likewise, fixing the crash error in the blog post (UndefVarError: alloc_threaded_grad not defined) then I find that the Julia implementation also runs in ~15 seconds on my laptop CPU.

So, 1x faster than JAX?

Clearly we're getting very different results here, so (a) I'd like to get to the bottom of this, but (b) perhaps we should be more cautious about making performance claims.

(EDIT: updated script in this GitHub gist: https://gist.github.com/patrick-kidger/68bf7b99ba02c246b20ea...)


"UndefVarError: alloc_threaded_grad not defined" means that you weren't running the released version that is discussed in the blog post but the one of the much older slower ones. Any anyways, removing threaded calls and then benchmarking is clearly going to make it slower. With the updated ones, we seem to see 2x, 10x, and 22x, with increasing performance improvements based on the availability of threads and the existence of AVX512. For anyone who wants the details, just check that same gist:

https://gist.github.com/patrick-kidger/68bf7b99ba02c246b20ea...


Nope, this was on v0.2 of SimpleChains.jl; I did check.

Anyway, something to investigate if I ever get the time.


Do you mean v0.2.0 or v0.2.2? From a cursory look at the repo, it looks like alloc_threaded_grad was added in 0.2.1, so the UndefVarError makes sense if you were using the older 0.2.0 version.


We probably should have released the blog post version as 0.3 instead of 0.2.1.

There were not any breaking changes, but enough fixes and additions (e.g., the convolutional layer and threading) that 0.2.0 isn't really comparable.

Odds are there is a version conflict in 0.2.1 (which also bumped dependency requirements) that does not exist in 0.2.0, and that this is preventing you from upgrading. You could start a smaller environment, or try `] add SimpleChains@0.2.1` to force it/get an error message telling you what the conflict is.


Just to confirm, v0.2.1 is the fast version.


Hmm... looks like he never came back to properly benchmark it with the right version, never did a version with the multithreaded call, and has a 1x number that is suspiciously different from everyone else who ran the benchmark in the Gist. Peculiar.


Thanks for the updated script! I added a comment to my code pointing to yours. I’m curious what the key slowdown in my attempt was.


Not at all -- I was very happy to see some Equinox code out in the wild!

I think the main slowdown was doing the model updates out of the JIT region. JAX-without-JIT is ridiculously slow.


Thanks!




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: