Yhprum
@yhprumslaw
no clue if I’m right but guessing from reading this that it’s some type of tensor splitting where a model is sharded across nodes. no one holds the full weight set; each node gets a tensor chunk, like a weight matrix slice, cutting comms to 1-5MB/step over slow nets.
Yhprum
@yhprumslaw
tensor splitting divides big arrays (e.g., a 1B-param layer) into sub-tensors. node a might get rows 1-100, node b rows 101-200, etc., so the data exchanged per step shrinks to fit low bandwidth.
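roughly what I picture for those row slices, as a quick numpy toy (shard_rows and the 10-node split are my own illustration, not anything from the blog):

```python
# split a weight matrix into row shards, one per node; each node then
# holds and syncs only its slice instead of the full tensor
import numpy as np

def shard_rows(weight: np.ndarray, num_nodes: int) -> list[np.ndarray]:
    return np.array_split(weight, num_nodes, axis=0)

weight = np.random.randn(1000, 1000).astype(np.float32)  # a ~4MB layer
shards = shard_rows(weight, num_nodes=10)
# node a holds rows 0-99, node b rows 100-199, and so on
print(shards[0].shape, shards[0].nbytes / 1e6, "MB on each node")  # (100, 1000) 0.4 MB
```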
Yhprum
@yhprumslaw
maybe then prediction kicks in: each node guesses its neighbors’ tensors with a small model (e.g., an LSTM) trained on past data. syncs happen only when errors spike, saving bandwidth further.
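a minimal sketch of that trigger, assuming an exponential moving average as a stand-in for the small model (an LSTM would slot into the same place); PredictiveSync and the 5% threshold are my inventions:

```python
# sender-side trigger: keep a running guess of the tensor, and only
# flag a sync when the guess drifts too far from the real value
import numpy as np

class PredictiveSync:
    def __init__(self, shape, threshold=0.05, decay=0.9):
        self.prediction = np.zeros(shape, dtype=np.float32)
        self.threshold = threshold  # relative error that forces a sync
        self.decay = decay          # EMA memory, the stand-in "predictor"

    def step(self, actual: np.ndarray) -> bool:
        """return True when a full sync must go over the wire this step."""
        err = np.linalg.norm(actual - self.prediction) / (np.linalg.norm(actual) + 1e-8)
        # update the predictor on history either way
        self.prediction = self.decay * self.prediction + (1 - self.decay) * actual
        return err > self.threshold

sync = PredictiveSync(shape=(100, 1000))
grad = np.random.randn(100, 1000).astype(np.float32)
if sync.step(grad):
    pass  # ship the tensor (or a correction) to the neighbor
```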
Yhprum
@yhprumslaw
over a 10Mbps net, full tensor syncs (100MB) crawl, but sharding + prediction drops it to 1MB bursts. training stays fast as nodes compute locally, syncing less often… such a fascinating intersection of distributed systems and the idea that open source will be won by AI
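quick back-of-envelope behind the “crawl” claim, using the 10Mbps and 100MB/1MB numbers guessed above:

```python
# transfer time = megabytes * 8 (bits per byte) / link rate in megabits/sec
def transfer_seconds(megabytes: float, link_mbps: float) -> float:
    return megabytes * 8 / link_mbps

print(transfer_seconds(100, 10))  # 80.0 s per full 100MB sync
print(transfer_seconds(1, 10))    # 0.8 s per 1MB burst
```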
Yhprum
@yhprumslaw
still thinking about this more… to further minimize communication, nodes could predict incoming tensors (e.g., activations or gradients from other shards) using lightweight models, like an LSTM or small transformer, trained on historical patterns. Long doesn’t explicitly mention prediction in the blog, but my guess is this aligns with solving the “low-bandwidth bottleneck” he highlights. syncs would occur only when predictions diverge significantly, cutting comms frequency. would be interested to get his input if you know his handle, @fredwilson.eth
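putting the whole guess together as a sketch: both ends keep identical predictors, the sender ships the tensor only when the shared guess diverges, and the receiver otherwise falls back on its local prediction. every name here (Predictor, maybe_send, receive) is mine; this is my reading of the idea, not Long’s actual method:

```python
import numpy as np

class Predictor:
    """tiny EMA stand-in for the lightweight model, run identically on both ends."""
    def __init__(self, shape, decay=0.9):
        self.state = np.zeros(shape, dtype=np.float32)
        self.decay = decay

    def predict(self) -> np.ndarray:
        return self.state

    def update(self, tensor: np.ndarray):
        self.state = self.decay * self.state + (1 - self.decay) * tensor

def maybe_send(pred: Predictor, actual: np.ndarray, threshold=0.05):
    """sender: return the tensor only when the shared guess diverges."""
    err = np.linalg.norm(actual - pred.predict()) / (np.linalg.norm(actual) + 1e-8)
    payload = actual if err > threshold else None
    # when nothing is sent, update on the prediction (not the true tensor)
    # so the sender's and receiver's predictors stay in lockstep
    pred.update(actual if payload is not None else pred.predict())
    return payload

def receive(pred: Predictor, payload):
    """receiver: use the real tensor if one arrived, else the local prediction."""
    tensor = payload if payload is not None else pred.predict()
    pred.update(tensor)
    return tensor
```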