WithinReason's comments

Which tools? Even file reads and writes?

Especially these things.

The only tools permissible to root in my scheme are call() and return().


Is it in pi.dev? Don't thinking tokens still take up context?

Folding@home reached 2.43 exaflops by April 12, 2020, which at the time would have made it the largest supercomputer on the planet.

It's down 99% since that peak. But let's compare to it anyway.

It's pretty useless to compare raw FLOPS, but as a general hand-waving guesstimate, F@H is currently doing about 25 petaflops in a mix of FP16 and FP32. AI usually trains at FP8, but to keep things fair the H100 is quoted at 60 FP64 teraflops per unit, so that's 12 FP64 exaflops across the 200k GPUs in Colossus 1.

So F@H at its peak did 2.43 exaflops@FP16/32. Colossus 1 does 12@FP64. These numbers are very hand-wavy, but I think the point is made.
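
For anyone who wants to check the arithmetic, here is a rough sketch that reproduces it. It uses only the figures quoted in this thread (60 FP64 teraflops per H100, a 200k GPU count, the 2.43 exaflop F@H peak and the 99% drop); the variable names are mine, and the precisions still aren't comparable.

```python
# Back-of-envelope only: every figure below is quoted from the thread, not measured.
H100_FP64_TFLOPS = 60        # per-unit FP64 figure quoted above
H100_COUNT = 200_000         # quoted GPU count
FAH_PEAK_EXAFLOPS = 2.43     # F@H peak, mixed FP16/FP32
FAH_DROP = 0.99              # "down 99% since that peak"

# tera -> exa is a factor of 1e6
cluster_exaflops = H100_FP64_TFLOPS * H100_COUNT / 1e6
# exa -> peta is a factor of 1e3
fah_now_petaflops = FAH_PEAK_EXAFLOPS * (1 - FAH_DROP) * 1e3
ratio_vs_peak = cluster_exaflops / FAH_PEAK_EXAFLOPS

print(f"cluster: {cluster_exaflops:.1f} exaflops (FP64)")
print(f"F@H now: {fah_now_petaflops:.1f} petaflops (FP16/FP32)")
print(f"cluster vs F@H peak: {ratio_vs_peak:.1f}x")
```

So on these numbers the cluster is roughly 5x the F@H peak, before even accounting for the precision difference.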

By the way, I'm not trying to crap on F@H - I think it's an outstanding project and I've run it in the past. But a volunteer group simply cannot compete with well-funded, concentrated effort like what's going into AI.


The efficiency difference between training on GPUs and TPUs is 2x at best. You can get very efficient with tensor cores, converging to TPU efficiency. In the end math is math: you can't make a multiplication more efficient than it already is on a GPU.

I guess this was more related to syncing GPUs.

If you were to take 500 computers with older 1080 GPUs, you might have enough compute/RAM to match an H200 GPU for training such a model. Maybe take 10000.

But if those machines are spread over 10000 homes, wired with residential internet service, training a large model will not get anywhere.

You go from "data in the same HBM memory chip" at 4.8 TB/s or "data in adjacent GPU" over NVLink at 1.2 TB/s down to 25 Mbit/s upload speed. Accessing the next piece of data is going to be about a million times slower. At the same time you will generate a thousand times more heat, for a million times longer.
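
A quick sanity check on the "million times slower" figure, as a sketch using only the bandwidths quoted above. The 1 GB payload at the end is a made-up example size, not something from the thread.

```python
# Bandwidths as quoted in the comment above; note the uplink is in bits, not bytes.
HBM_BYTES_PER_S = 4.8e12        # 4.8 TB/s
NVLINK_BYTES_PER_S = 1.2e12     # 1.2 TB/s
UPLINK_BITS_PER_S = 25e6        # 25 Mbit/s residential upload

uplink_bytes_per_s = UPLINK_BITS_PER_S / 8

hbm_ratio = HBM_BYTES_PER_S / uplink_bytes_per_s
nvlink_ratio = NVLINK_BYTES_PER_S / uplink_bytes_per_s

# Hypothetical payload: time to push 1 GB of gradients over the home uplink.
payload_bytes = 1e9
upload_seconds = payload_bytes / uplink_bytes_per_s

print(f"HBM vs uplink:    {hbm_ratio:,.0f}x")
print(f"NVLink vs uplink: {nvlink_ratio:,.0f}x")
print(f"1 GB upload:      {upload_seconds:.0f} s")
```

HBM comes out at about 1.5 million times the uplink and NVLink at a few hundred thousand times, so "about a million" is the right order of magnitude.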


You need to train independently and merge rarely. The problem is the merge step. Weights are too entangled, so you are not going to get an improvement commensurate with the effort. Otherwise, everyone would do it. It is an open research problem.

That sounds like the way. Everyone trains their own small problems to maximally compressed weights and then merges.

The power-constrained part of compute is data movement, not the elementary arithmetic per se. Anyway, it's very possible to tweak the underlying design to increase throughput a lot for any given power budget at the cost of high latency. This seems especially useful for training workloads where we don't really care about latency as much.

Math is math, but sadly math isn't physics nor engineering.

math has physics.

The gradient info can be compressed 10000x with the right tricks. I think it is achievable, and Nous claims they did it already:

https://github.com/NousResearch/DisTrO

There are other gradient compression papers from the past reporting large compression rates.
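
To give a flavor of what gradient compression can look like, here is a minimal sketch of top-k sparsification with a residual kept for error feedback. This is a generic textbook-style technique, not DisTrO's method (I am not claiming to know its internals), and the function names and toy gradient are made up for illustration.

```python
def topk_sparsify(grad, k):
    """Keep the k largest-magnitude entries of grad.

    Returns (indices, values, residual). The residual holds everything that
    was NOT sent, so it can be added to the next step's gradient.
    """
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    keep = sorted(order[:k])
    values = [grad[i] for i in keep]
    residual = list(grad)
    for i in keep:
        residual[i] = 0.0
    return keep, values, residual


def densify(indices, values, size):
    """Rebuild a dense vector on the receiving side; unsent entries are zero."""
    out = [0.0] * size
    for i, v in zip(indices, values):
        out[i] = v
    return out


# Toy gradient, purely illustrative.
grad = [0.01, -2.0, 0.03, 0.5, -0.02, 1.5, 0.0, -0.04]
idx, vals, residual = topk_sparsify(grad, k=2)
restored = densify(idx, vals, len(grad))

print("sent indices:", idx)
print("sent values: ", vals)
print("entries sent:", len(vals), "of", len(grad))
```

Only 2 of 8 entries go over the wire here, and the index overhead is ignored. Getting anywhere near 10000x takes far more than this, which is why the claim is interesting.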


This likely says something about the harness Fable was trained in. It knows how to do this because it has done this millions of times during reinforcement learning.


It's a meme, and HN loves upvoting memes. Just like Reddit!

The clone is you though, assuming it's a perfect copy

There is a similar analysis from the Netherlands

Not everyone can afford millions to publish a paper

That's why you do several small and medium scale tests, fit a curve, and ideally show that the trend persists at several scales. Not a single large or medium run - see the other comments down thread for example sizes.
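
As a sketch of what "fit a curve" means here: assume the loss follows a power law in model size, loss = a * N^(-b), and fit it with least squares in log-log space. The power-law form is my assumption, and the data points below are synthetic (generated from a known a and b), not results from any real run.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of loss = a * size**(-b) in log-log space. Returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# SYNTHETIC data: generated from a = 10, b = 0.3 purely to exercise the fit.
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [10.0 * n ** -0.3 for n in sizes]

a, b = fit_power_law(sizes, losses)
predicted = a * 1e10 ** -b   # extrapolate one decade past the largest run

print(f"a = {a:.3f}, b = {b:.3f}")
print(f"predicted loss at 1e10 params: {predicted:.4f}")
```

With real runs the points won't sit exactly on the line, so the useful evidence is whether the fitted trend still predicts a held-out larger scale.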
