Yet another parallel framework. Without a third-party ecosystem of APIs for matrix math, any framework is doomed to just add noise, not value. Sure, there's some benefit in getting a marginal speedup on some algorithms, but for real speedup you need to know the parallel architecture of the processor (GPU, CPU, or APU), which means a learning curve. The GPGPU industry has been trying for a long time to abstract away the fine details and offer a plug-and-play, easy-to-learn framework, but then we suffer performance losses, and it really doesn't make sense to invest in GPUs for the kind of performance gains you get with these high-level APIs.
Read the release: this is a collaboration with Google and Mozilla. But you are right, one of the main reasons CUDA is so popular is cuBLAS. And it's a pipe dream to think you could program a GPU without being aware of its communication and memory-transfer behavior.
Won't we have HSA in the future? HSA is supposed to provide unified coherent memory access to both CPU and GPU. Do you think HSA is a pipe dream? If so why?
OK, it's not that HSA isn't useful, it's that coordination between the CPU and GPU is still stupidly hard and carries a lot of CPU-side overhead, making it impractical for small workloads. The problem is that a large number of small workloads still can't be done efficiently on a GPU. And I'm seriously doubtful about the limits of "coherent memory access" -- unless the GPU can snoop the CPU's cache (or the GPU and CPU share an L1 cache -- eeek), you will still need cache flushes and fences. Let's hope "HSA" has a lot lower overhead than current CPU/GPU combos from AMD/Intel.
It's not a parallel framework or a GPU feature. It's single-instruction, multiple-data (SIMD), which is used to speed up single-threaded execution on a CPU when working with lists of numbers.
He found himself writing the NEON code entirely by hand in assembly, because the vector intrinsics didn't even expose the CPU features he wanted to use. And that's in C, where the vector intrinsics are already CPU-specific.
Having access to SIMD is definitely better than not having it, but it really should be paired with well-optimized implementations of things like BLAS and FFT libraries.