gm8xx8
@gm8xx8
nGPT: Normalized Transformer with Representation Learning on the Hypersphere paper: arxiv.org/abs/2410.01131 The normalized Transformer (nGPT) is a neural network that performs representation learning on the hypersphere: all vectors, including embeddings and attention and MLP matrices, are normalized to unit norm. Token representations move across the surface of the hypersphere, with each layer contributing a displacement toward the target output predictions. This approach accelerates training, reducing the required steps by a factor of 4 to 20 depending on sequence length, without needing normalization layers, weight decay, or learning rate warmup.
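A minimal sketch of the two ideas the post describes: projecting vectors onto the unit hypersphere, and a layer update that moves a token's representation toward a sub-layer output before renormalizing. The function names, dimensions, and the fixed step size `alpha` here are illustrative assumptions, not the paper's exact parameterization (the paper learns per-dimension step sizes).

```python
import numpy as np

def unit_norm(x, axis=-1, eps=1e-8):
    # Project vectors onto the unit hypersphere along `axis`.
    # eps guards against division by zero for all-zero vectors.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def ngpt_style_update(h, sublayer_out, alpha=0.1):
    # Sketch of the hypersphere update: step from h toward the
    # sub-layer output, then renormalize back onto the sphere.
    # `alpha` is a hypothetical scalar step size for illustration.
    return unit_norm(h + alpha * (unit_norm(sublayer_out) - h))

# Hypothetical example: 4 tokens with 16-dim embeddings.
h = unit_norm(np.random.randn(4, 16))
out = np.random.randn(4, 16)          # stand-in for attention/MLP output
h_next = ngpt_style_update(h, out)
print(np.linalg.norm(h_next, axis=-1))  # each row stays at norm ~1.0
```

The key point the post makes is that because every vector lives on the unit sphere, separate normalization layers, weight decay, and warmup become unnecessary.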
AllowerloHoted
@allowerlohoted
An innovative approach to training neural networks that speeds up training significantly