"Scaling up self-attention inference."

This webpage outlines the mathematics behind the "attention" mechanism used in large language models, then describes a new mathematical technique that lets the context window of a large language model be split into pieces whose attention results can be computed independently and then combined. The end result is identical to computing the "attention" results from the entire context window at once.
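To make that concrete, the standard way to split softmax attention across chunks and recombine it exactly is to carry, per chunk, a running max, a partial softmax denominator, and an unnormalized output, then merge those summaries with a rescaling step. This is a minimal NumPy sketch of that idea, not necessarily the article's exact formulation; the names chunk_summary and combine are illustrative:

```python
import numpy as np

def chunk_summary(q, K, V):
    """Attention statistics for one chunk of keys/values:
    running max, softmax denominator, unnormalized output."""
    scores = K @ q / np.sqrt(q.shape[0])   # (chunk_len,)
    m = scores.max()
    w = np.exp(scores - m)                  # numerically stabilized weights
    return m, w.sum(), w @ V

def combine(a, b):
    """Merge two chunk summaries. The merge is associative, so chunks
    can be combined in any order (e.g. in a parallel tree reduction)."""
    m_a, s_a, o_a = a
    m_b, s_b, o_b = b
    m = max(m_a, m_b)
    s = s_a * np.exp(m_a - m) + s_b * np.exp(m_b - m)
    o = o_a * np.exp(m_a - m) + o_b * np.exp(m_b - m)
    return m, s, o

rng = np.random.default_rng(0)
d, n, n_chunks = 16, 1024, 8
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

# Full attention over the whole context window.
scores = K @ q / np.sqrt(d)
w = np.exp(scores - scores.max())
full = w @ V / w.sum()

# Chunked attention: summarize each piece independently, then merge.
parts = [chunk_summary(q, Ki, Vi)
         for Ki, Vi in zip(np.split(K, n_chunks), np.split(V, n_chunks))]
acc = parts[0]
for p in parts[1:]:
    acc = combine(acc, p)
m, s, o = acc
print(np.allclose(o / s, full))  # True: same result as full-context attention
```

Because combine is associative, the per-chunk work can run on separate devices and the summaries can be merged pairwise in a tree rather than one after another.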

This should enable large language models (LLMs) to keep growing their context windows, because the computation requirement now scales logarithmically with the size of the context window instead of linearly. So each time you add some linear increment of CPUs and GPUs, you can double the size of the context window you can handle.
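A quick back-of-the-envelope illustration of that scaling claim, assuming the chunk summaries are merged in a balanced binary tree: the number of sequential merge levels grows as log2 of the number of chunks, so every doubling of the context adds only one level.

```python
import math

chunk_len = 1024  # assumed tokens per independently computed chunk
for n in [8_192, 16_384, 32_768, 65_536]:
    levels = math.ceil(math.log2(n / chunk_len))  # depth of the merge tree
    print(f"context {n:>6} tokens -> {levels} merge levels")
# Each doubling of the context window adds just one merge level,
# so a fixed increment of parallel hardware keeps pace with an
# exponentially larger context (under this tree-reduction assumption).
```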

Scaling up self-attention inference

#solidstatelife #ai #genai #llms #transformers