ella
@ella-
it's like we're all running transformer models - sustained attention is difficult. attending to incoming input via self-attention is easy for transformers, but keeping attention on key pieces of info over long periods of time is hard - mostly because of the limited context window, and also because info has to survive passing through many layers. makes me think of how LSTMs used to handle sustained attention: they kept a cell state, which acted as a sort of memory and evolved with each time step. whereas RAG feels more like how humans do it - we remember that some specific thing happened, and query our knowledge base specifically about it
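the contrast can be sketched in a few lines. this is a toy illustration, not real model code: a scalar-valued LSTM-style cell step (real LSTMs use learned weight matrices over vectors) next to a keyword-overlap "retrieval" over a list of notes standing in for RAG's query-the-knowledge-base step.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(c_prev, h_prev, x, w=1.0):
    """Toy scalar LSTM cell step: the cell state c is the memory
    that is carried forward and updated at every time step."""
    f = sigmoid(w * (x + h_prev))    # forget gate: how much old memory to keep
    i = sigmoid(w * (x + h_prev))    # input gate: how much new info to write
    g = math.tanh(w * (x + h_prev))  # candidate memory content
    c = f * c_prev + i * g           # cell state evolves each step
    h = sigmoid(w * (x + h_prev)) * math.tanh(c)
    return c, h

def retrieve(query, kb):
    """Toy RAG-style lookup: score each stored note by word overlap
    with the query and return the best match."""
    q = set(query.lower().split())
    return max(kb, key=lambda note: len(q & set(note.lower().split())))
```

the point of the sketch: the LSTM's memory is a single state that every step must flow through (and can wash out), while retrieval leaves the knowledge sitting outside the model and fetches only what the query asks for.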
Sattar01152
@sattar01152
207 $DEGEN for you