July pfp
July
@july
Re-reading "Attention Is All You Need." How did they throw CNNs and RNNs out the window (kinda) and land on the conclusion that this self-attention mechanism would work, i.e. let's just have every token directly measure itself against every other token? I don't get it.

It's sort of going from thinking about nature as a subjective, sequential experience (and seeing that as a bottleneck) to instead thinking about how everything is connected to everything, and asking what those weights are.

Mind-blowing, to be honest, that this works.
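A minimal sketch of that "every token measures itself against every other token" idea, in plain numpy. This is not the paper's full multi-head implementation; the shapes and weight matrices here are illustrative, with `d_k` following the paper's notation for the key dimension.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d_model). Returns a single attention 'head' over all tokens."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    # Every token's query is scored against every other token's key:
    # a (seq_len, seq_len) matrix of pairwise relevance scores.
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # these are "those weights"
    return weights @ V  # each output token is a weighted mix of all value vectors

# toy usage: 5 tokens, model width 8, head width 4
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (5, 4)
```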
2 replies
1 recast
32 reactions

shazow pfp
shazow
@shazow.eth
My understanding is that it came from the need for a kind of selective short-term memory while evaluating the next token: we can treat all prior tokens equally, or we can prefer more recent tokens, or we can try to find particularly relevant clusters to emphasize in the derivation function.

Sometimes I think of it as a simile to "fairness"/MEV: we tried FIFO, but it turns out routing is gamed; we tried randomness/encryption, but side effects can be gamed. There's a nice quote from Nick Szabo's 1997-era piece The God Protocols:

> "Fairness means everybody learning the results in such a way that nobody can gain an advantage by learning first."

In a weird way, I see it as a similar challenge to token attention (especially in adversarial contexts where people are trying to jailbreak).
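A toy sketch of those three weighting policies over prior tokens, treating each as a way to distribute "memory" when predicting the next one. All names and shapes here are illustrative assumptions, and the attention variant is a single-head simplification rather than a full model.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def uniform_weights(n):
    # treat all prior tokens equally
    return np.full(n, 1.0 / n)

def recency_weights(n, decay=0.7):
    # prefer more recent tokens (exponential decay toward the past)
    w = decay ** np.arange(n - 1, -1, -1)
    return w / w.sum()

def attention_weights(query, keys):
    # let the current token decide which prior tokens are relevant
    d_k = keys.shape[-1]
    return softmax(keys @ query / np.sqrt(d_k))

rng = np.random.default_rng(1)
keys = rng.normal(size=(6, 4))   # 6 prior tokens
query = rng.normal(size=4)       # the token being predicted "looks back"
print(uniform_weights(6))
print(recency_weights(6))
print(attention_weights(query, keys))
```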
0 reply
0 recast
1 reaction