
Wow, I've been out of the GPU game for a year or two now, and it's clear how much the market has shifted (or how NVIDIA wants it to move). Back in the day we kept asking NVIDIA and AMD for half-precision support; it looks like not only have they done that, but there's 8-bit integer support too! I was about to say "well, who the heck would be able to use 8-bit integers for much?" when I saw in TFA: "offering an 8-bit vector dot product with 32-bit accumulate."
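Concretely, the "8-bit dot product with 32-bit accumulate" semantics look roughly like this numpy sketch (the real instruction operates on packed 4-element int8 vectors in hardware; this is just the arithmetic):

    import numpy as np

    a = np.array([100, -128,   64, 27], dtype=np.int8)
    b = np.array([ 90,  127, -100,  3], dtype=np.int8)

    # Individual products like -128 * 127 = -16256 are far outside the
    # int8 range [-128, 127], so the hardware widens and accumulates in
    # int32. Emulated here by casting before the multiply-add:
    acc = np.dot(a.astype(np.int32), b.astype(np.int32))
    print(acc)  # 9000 - 16256 - 6400 + 81 = -13575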

In case it's not clear from the title (which used to read "...47 INT TOPS"), that's 47 [8-bit integer] tera-operations-per-second. Anandtech says it will "... offer a major boost in inferencing performance, the kind of performance boost in a single generation that we rarely see in the first place, and likely won’t see again." No kidding!



I've read a lot about fp16 being good enough for training, but what people don't mention is that just swapping fp32 for fp16 will make things fail, because your deep learning framework doesn't implement everything you use for fp16; and once you fix that, training will probably diverge, because standard practices weren't developed with such a limited numeric range in mind.

Which isn't to say that the Deep Learning stacks won't get there eventually, but at the moment it's not as easy as flipping a switch.
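To make the range problem concrete, here's a tiny numpy illustration (not tied to any framework): fp16 tops out around 65504 and drops small values to zero, which is exactly where a naive fp32-to-fp16 swap tends to blow up:

    import numpy as np

    x = np.float16(300.0)
    print(x * x)   # inf -- 90000 overflows fp16's max of ~65504,
                   # so e.g. a squared-error loss can explode

    g = np.float16(1e-4)
    print(g * g)   # 0.0 -- 1e-8 underflows fp16's ~6e-8 subnormal
                   # floor, so small gradient terms silently vanish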


For a 4x boost, I imagine TensorFlow and Torch will get support shortly after the GPUs start shipping in real quantities.


It's a 2x boost for training since you can't use int8 for training and need fp16.

INT8 is a 4x for inference, but most people aren't using GPUs for inference atm.
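For anyone wondering how int8 is usable for inference at all: the usual trick (not specific to this card) is to quantize trained fp32 weights with a per-tensor scale factor, do the dot products in int8 with int32 accumulation, and rescale at the end. A rough numpy sketch, with an illustrative quantize() helper:

    import numpy as np

    def quantize(w):
        """Symmetric per-tensor quantization of fp32 values to int8."""
        scale = np.abs(w).max() / 127.0
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    w = np.random.randn(4, 8).astype(np.float32)   # trained weights
    x = np.random.randn(8).astype(np.float32)      # input activations

    qw, sw = quantize(w)
    qx, sx = quantize(x)

    # int8 multiplies with int32 accumulation (what the dot-product
    # instruction gives you), then one fp rescale at the end:
    y_int32 = qw.astype(np.int32) @ qx.astype(np.int32)
    y = y_int32 * (sw * sx)
    print(np.abs(y - w @ x).max())  # small quantization error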



Fair enough; when I said "most people" I meant companies that are not AmaGooFaceSoft and their international equivalents. Though I can't quite tell whether you're doing GPU batch predictions and storing them, or doing them in real time with Spark Streaming.

Unrelated question though: any chance you will do blog post/paper about how DSSTNE does automatic model parallelism and gets good sparse performance compared to cuSparse/etc?


Or as Urs Hölzle would say: "advancing Moore's Law by 7 years(tm)..." Badum ba bum bum...

And given that Frank Seide et al. demonstrated 1-bit SGD in 2014 (https://www.microsoft.com/en-us/research/publication/1-bit-s...), the race to the bottom is just beginning...
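The part that makes 1-bit gradients work in that paper is error feedback: the quantization residual is carried over and folded into the next step's gradient, so the error doesn't accumulate. A minimal numpy sketch of the idea, simplified to a single tensor (the paper applies it to the gradient exchange in data-parallel training):

    import numpy as np

    def one_bit_quantize(grad, residual):
        """Quantize a gradient to 1 bit per value, with error feedback."""
        g = grad + residual                   # fold in last step's error
        scale = np.abs(g).mean()              # one fp scalar per tensor
        q = np.where(g >= 0, scale, -scale)   # 1 bit: sign * scale
        residual = g - q                      # remember what was dropped
        return q, residual

    residual = np.zeros(5)
    for step in range(3):
        grad = np.random.randn(5)
        q, residual = one_bit_quantize(grad, residual)
        # ...apply q as the (compressed) gradient update...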



