July pfp
July
@july
Re-reading "Attention Is All You Need." How did they throw CNNs and RNNs out the window (kinda) and land on the conclusion that this self-attention mechanism would work, i.e. let's just have every token directly measure itself against every other token? I don't get it.

It's sort of going from thinking about nature as a subjective, sequential experience (and seeing that as a bottleneck) to instead thinking about how everything is connected to everything, and asking what those weights are.

Mind-blowing, to be honest, that this works.
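A minimal sketch of that "every token measures itself against every other token" idea, in plain numpy. This is not the paper's full multi-head implementation; the shapes and weight matrices here are illustrative, with `d_k` following the paper's notation for the key dimension.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d_model). Returns a single attention 'head' over all tokens."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    # Every token's query is scored against every other token's key:
    # a (seq_len, seq_len) matrix of pairwise relevance scores.
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # these are "those weights"
    return weights @ V  # each output token is a weighted mix of all value vectors

# toy usage: 5 tokens, model width 8, head width 4
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (5, 4)
```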
2 replies
1 recast
32 reactions

shazow pfp
shazow
@shazow.eth
My understanding is that it came from the need for a kind of selective short-term memory while evaluating the next token: we can treat all prior tokens equally, or we can prefer more recent tokens, or we can try to find particularly relevant clusters to emphasize in the derivation function.

Sometimes I think of it as a simile to "fairness"/MEV: we tried FIFO, but it turns out routing is gamed; we tried randomness/encryption, but side effects can be gamed. There's a nice quote from Nick Szabo's 1997-era piece The God Protocols:

> "Fairness means everybody learning the results in such a way that nobody can gain an advantage by learning first."

In a weird way, I see it as a similar challenge to token attention (especially in adversarial contexts where people are trying to jailbreak).
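A toy sketch of those three weighting policies over prior tokens, treating each as a way to distribute "memory" when predicting the next one. All names and shapes here are illustrative assumptions, and the attention variant is a single-head simplification rather than a full model.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def uniform_weights(n):
    # treat all prior tokens equally
    return np.full(n, 1.0 / n)

def recency_weights(n, decay=0.7):
    # prefer more recent tokens (exponential decay toward the past)
    w = decay ** np.arange(n - 1, -1, -1)
    return w / w.sum()

def attention_weights(query, keys):
    # let the current token decide which prior tokens are relevant
    d_k = keys.shape[-1]
    return softmax(keys @ query / np.sqrt(d_k))

rng = np.random.default_rng(1)
keys = rng.normal(size=(6, 4))   # 6 prior tokens
query = rng.normal(size=4)       # the token being predicted "looks back"
print(uniform_weights(6))
print(recency_weights(6))
print(attention_weights(query, keys))
```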
0 reply
0 recast
1 reaction