<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE rfc [
  <!ENTITY nbsp    "&#160;">
  <!ENTITY zwsp   "&#8203;">
  <!ENTITY nbhy   "&#8209;">
  <!ENTITY wj     "&#8288;">
]>
<?xml-stylesheet type="text/xsl" href="rfc2629.xslt" ?>
<!-- generated by https://github.com/cabo/kramdown-rfc version 1.7.24 (Ruby 3.2.3) -->
<rfc xmlns:xi="http://www.w3.org/2001/XInclude" ipr="trust200902" docName="draft-shi-moq-kvcache-00" category="info" submissionType="IETF" tocInclude="true" sortRefs="true" symRefs="true" version="3">
  <!-- xml2rfc v2v3 conversion 3.28.0 -->
  <front>
    <title abbrev="KVCache">KVCache over MoQT</title>
    <seriesInfo name="Internet-Draft" value="draft-shi-moq-kvcache-00"/>
    <author initials="H." surname="Shi" fullname="Hang Shi">
      <organization>Huawei Technologies</organization>
      <address>
        <postal>
          <country>China</country>
        </postal>
        <email>shihang9@huawei.com</email>
      </address>
    </author>
    <date year="2025" month="March" day="03"/>
    <area>WIT</area>
    <workgroup>Media over QUIC</workgroup>
    <keyword>AI inference, KVCache</keyword>
    <abstract>
      <?line 46?>

<t>Large language model (LLM) inference involves two stages: prefill and decode. The prefill phase processes the prompt in parallel, generating the KVCache, which the decode phase then uses to produce tokens sequentially. The KVCache can be reused when the model and prompt are the same, reducing the computational cost of prefill. However, its large size makes efficient transfer challenging. Delivering KVCache over an architecture built on a publish/subscribe transport such as MoQT allows local nodes to cache it for later retrieval via new subscriptions, saving bandwidth. This document specifies the transmission of KVCache over MoQT.</t>
    </abstract>
  </front>
  <middle>
    <?line 50?>

<section anchor="introduction-kvcache-in-llm-inference">
      <name>Introduction: KVCache in LLM inference</name>
      <t>The inference process of large language models is typically divided into two distinct stages: prefill and decode. The prefill phase processes the input prompt in parallel, generating a KVCache, which serves as an essential input for the decode phase. The decode phase then uses the KVCache to generate output tokens sequentially, one at a time. Prefill is computationally intensive, whereas decode is constrained by memory bandwidth. Because of these differing resource requirements, the prefill and decode phases are often deployed on separate computing clusters, using hardware optimized for computational throughput on prefill nodes and for memory bandwidth on decode nodes, with the KVCache transferred between them.</t>
      <figure anchor="fig-kvcache">
        <name>LLM inference process</name>
        <artwork><![CDATA[
               +--------------------+
               |    Prompt Input    |
               |  (System + User)   |
               +--------------------+
            Tokenization |
                ---------------------
                |                   |
                v                   |
    +--------------------+          |
    |   Prefill Nodes    |          |
    | (Generate KVCache) |          |
    +--------------------+          |
                |                   |
                v                   |
    +--------------------+          |
    |      KVCache       |<---------+
    | (Stored & Reused)  |
    +--------------------+
                |
      -----------------------------
      |              |            |
      v              v            v
+----------------+       +----------------+
|  Decode Node 1 |  ...  |  Decode Node N |
| (Use KVCache)  |       | (Use KVCache)  |
+----------------+       +----------------+

]]></artwork>
      </figure>
      <t>KVCache is significantly large: a single token requires 160 KB for a 70B model with 8-bit quantization, so the KVCache for a prompt of 1000 tokens reaches 160 MB. To reduce the size of the KVCache, various quantization and compression algorithms have been proposed, such as <xref target="CacheGen"/>. Furthermore, KVCache can be reused across sessions if it is derived from the same prompt and model, as shown in <xref target="fig-kvcache"/>. The most basic reuse strategy is prefix caching, where KVCache is shared among prompts with a common prefix. More advanced methods, such as <xref target="CacheBlend"/>, improve reuse efficiency by selectively reusing KVCache beyond prefix matching. To minimize transmission costs, a publish/subscribe architecture is needed to distribute KVCache. This document defines how to send KVCache over MoQT <xref target="I-D.ietf-moq-transport"/>.</t>
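      <t>As a non-normative illustration, the per-token figure above can be reproduced from typical model dimensions. The sketch below assumes a 70B-class model with grouped-query attention, 80 layers, 8 KV heads, and a head dimension of 128; these dimensions are illustrative assumptions, not part of this specification.</t>
      <sourcecode type="python"><![CDATA[
# Non-normative sketch: reproducing the per-token KVCache size quoted
# above.  The model dimensions are illustrative assumptions for a
# 70B-class model with grouped-query attention, not defined by this
# specification.
def kvcache_bytes_per_token(num_layers, num_kv_heads, head_dim,
                            bytes_per_weight):
    # Each layer stores one Key and one Value vector per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_weight

# 8-bit quantization (1 byte per weight):
per_token = kvcache_bytes_per_token(num_layers=80, num_kv_heads=8,
                                    head_dim=128, bytes_per_weight=1)
print(per_token)          # 163840 bytes = 160 KB per token
print(per_token * 1000)   # 163840000 bytes, roughly 160 MB for 1000 tokens
]]></sourcecode>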
    </section>
    <section anchor="conventions-and-definitions">
      <name>Conventions and Definitions</name>
      <t>The key words "<bcp14>MUST</bcp14>", "<bcp14>MUST NOT</bcp14>", "<bcp14>REQUIRED</bcp14>", "<bcp14>SHALL</bcp14>", "<bcp14>SHALL
NOT</bcp14>", "<bcp14>SHOULD</bcp14>", "<bcp14>SHOULD NOT</bcp14>", "<bcp14>RECOMMENDED</bcp14>", "<bcp14>NOT RECOMMENDED</bcp14>",
"<bcp14>MAY</bcp14>", and "<bcp14>OPTIONAL</bcp14>" in this document are to be interpreted as
described in BCP 14 <xref target="RFC2119"/> <xref target="RFC8174"/> when, and only when, they
appear in all capitals, as shown here.</t>
      <?line -18?>

<t>This document uses the following terms:</t>
      <ul spacing="normal">
        <li>
          <t>LLM: A large language model, which uses the attention mechanism to process and generate text efficiently by capturing long-range dependencies within input sequences.</t>
        </li>
        <li>
          <t>KVCache: A key-value cache storing intermediate representations used in LLM inference.</t>
        </li>
        <li>
          <t>Prompt: A prompt consists of two parts: the system prompt and the user prompt. The system prompt is predefined by the LLM model developer to guide the model's behavior, while the user prompt is provided dynamically by the user to specify the task or request.</t>
        </li>
        <li>
          <t>Token: The smallest unit of processing in LLM inference, typically representing a word or subword.</t>
        </li>
      </ul>
    </section>
    <section anchor="kvcache-data-model">
      <name>KVCache Data Model</name>
      <t>The KVCache data model is structured as follows.</t>
      <t><strong>Naming</strong>: This specification defines a Track Namespace consisting of the following tuples: (moq://kvcache.moq.arpa/v1/), (modelName), (prompt). The Track Name identifies the compression level of the KVCache. A track is thus identified by the tuple (<tt>&lt;compression&gt;</tt>), and the full track name has the following format (when represented as a string):</t>
      <t><tt>
moq://kvcache.moq.arpa/v1/&lt;modelName&gt;/&lt;prompt&gt;/&lt;compression&gt;
</tt></t>
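      <t>As a non-normative illustration, a full track name string can be assembled from the namespace tuple and the compression level as follows; the model name and prompt values used here are hypothetical examples:</t>
      <sourcecode type="python"><![CDATA[
# Non-normative sketch: assembling a KVCache full track name string.
# "llama-70b" and "prompt-digest" are hypothetical example values.
BASE = "moq://kvcache.moq.arpa/v1"

def full_track_name(model_name, prompt, compression):
    # Track Namespace tuple: (BASE, model_name, prompt);
    # Track Name: the compression level.
    return "/".join([BASE, model_name, prompt, compression])

print(full_track_name("llama-70b", "prompt-digest", "FP8"))
# moq://kvcache.moq.arpa/v1/llama-70b/prompt-digest/FP8
]]></sourcecode>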
      <t>The following compression schemes are defined in this specification, along with their per-weight sizes:</t>
      <table anchor="tab-kvcache-compression">
        <name>Compression of KVCache</name>
        <thead>
          <tr>
            <th align="left">Compression</th>
            <th align="left">Description</th>
            <th align="left">Size per Weight</th>
          </tr>
        </thead>
        <tbody>
          <tr>
            <td align="left">FP16</td>
            <td align="left">Quantized using FP16</td>
            <td align="left">2 bytes</td>
          </tr>
          <tr>
            <td align="left">BF16</td>
            <td align="left">Quantized using BF16</td>
            <td align="left">2 bytes</td>
          </tr>
          <tr>
            <td align="left">FP8</td>
            <td align="left">Quantized using FP8</td>
            <td align="left">1 byte</td>
          </tr>
          <tr>
            <td align="left">Int8</td>
            <td align="left">Quantized using Int8</td>
            <td align="left">1 byte</td>
          </tr>
          <tr>
            <td align="left">FP4</td>
            <td align="left">Quantized using FP4</td>
            <td align="left">0.5 byte</td>
          </tr>
          <tr>
            <td align="left">Int4</td>
            <td align="left">Quantized using Int4</td>
            <td align="left">0.5 byte</td>
          </tr>
          <tr>
            <td align="left">AC (5x)</td>
            <td align="left">Compressed using Arithmetic Coding (5x ratio)</td>
            <td align="left">Variable</td>
          </tr>
        </tbody>
      </table>
      <t><strong>Group ID</strong>: Tokens are split into chunks of uniform length (a typical value is 128). The KVCache is organized into groups corresponding to these token chunks. The Group ID represents the index of a token chunk within the KVCache.</t>
      <t><strong>Object ID</strong>: An identifier for a specific token within a group.</t>
      <t><strong>Object Payload</strong>: The content of the KVCache, which varies based on the compression algorithm used for storage and transmission.</t>
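      <t>The mapping from a token's position in the prompt to its Group ID and Object ID can be sketched as follows; the chunk size of 128 is the typical value mentioned above, not a fixed constant of this specification:</t>
      <sourcecode type="python"><![CDATA[
# Non-normative sketch: locating a token's KVCache in the object model.
# The chunk size of 128 is the typical value noted above, not fixed.
CHUNK_SIZE = 128

def locate(token_index, chunk_size=CHUNK_SIZE):
    # Group ID is the index of the token chunk; Object ID is the
    # token's offset within that chunk.
    group_id, object_id = divmod(token_index, chunk_size)
    return group_id, object_id

print(locate(0))     # (0, 0): first token, first group
print(locate(300))   # (2, 44): third group, 45th token within it
]]></sourcecode>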
    </section>
    <section anchor="security-considerations">
      <name>Security Considerations</name>
      <t>TBD</t>
    </section>
    <section anchor="iana-considerations">
      <name>IANA Considerations</name>
      <t>TBD</t>
    </section>
  </middle>
  <back>
    <references anchor="sec-combined-references">
      <name>References</name>
      <references anchor="sec-normative-references">
        <name>Normative References</name>
        <reference anchor="RFC2119">
          <front>
            <title>Key words for use in RFCs to Indicate Requirement Levels</title>
            <author fullname="S. Bradner" initials="S." surname="Bradner"/>
            <date month="March" year="1997"/>
            <abstract>
              <t>In many standards track documents several words are used to signify the requirements in the specification. These words are often capitalized. This document defines these words as they should be interpreted in IETF documents. This document specifies an Internet Best Current Practices for the Internet Community, and requests discussion and suggestions for improvements.</t>
            </abstract>
          </front>
          <seriesInfo name="BCP" value="14"/>
          <seriesInfo name="RFC" value="2119"/>
          <seriesInfo name="DOI" value="10.17487/RFC2119"/>
        </reference>
        <reference anchor="RFC8174">
          <front>
            <title>Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words</title>
            <author fullname="B. Leiba" initials="B." surname="Leiba"/>
            <date month="May" year="2017"/>
            <abstract>
              <t>RFC 2119 specifies common key words that may be used in protocol specifications. This document aims to reduce the ambiguity by clarifying that only UPPERCASE usage of the key words have the defined special meanings.</t>
            </abstract>
          </front>
          <seriesInfo name="BCP" value="14"/>
          <seriesInfo name="RFC" value="8174"/>
          <seriesInfo name="DOI" value="10.17487/RFC8174"/>
        </reference>
      </references>
      <references anchor="sec-informative-references">
        <name>Informative References</name>
        <reference anchor="I-D.ietf-moq-transport">
          <front>
            <title>Media over QUIC Transport</title>
            <author fullname="Luke Curley" initials="L." surname="Curley">
              <organization>Discord</organization>
            </author>
            <author fullname="Kirill Pugin" initials="K." surname="Pugin">
              <organization>Meta</organization>
            </author>
            <author fullname="Suhas Nandakumar" initials="S." surname="Nandakumar">
              <organization>Cisco</organization>
            </author>
            <author fullname="Victor Vasiliev" initials="V." surname="Vasiliev">
              <organization>Google</organization>
            </author>
            <author fullname="Ian Swett" initials="I." surname="Swett">
              <organization>Google</organization>
            </author>
            <date day="1" month="March" year="2025"/>
            <abstract>
              <t>   This document defines the core behavior for Media over QUIC Transport
   (MOQT), a media transport protocol designed to operate over QUIC and
   WebTransport, which have similar functionality.  MOQT allows a
   producer of media to publish data and have it consumed via
   subscription by a multiplicity of endpoints.  It supports
   intermediate content distribution networks and is designed for high
   scale and low latency distribution.

              </t>
            </abstract>
          </front>
          <seriesInfo name="Internet-Draft" value="draft-ietf-moq-transport-09"/>
        </reference>
        <reference anchor="CacheGen" target="https://github.com/UChi-JCL/CacheGen">
          <front>
            <title>CacheGen: Fast Context Loading for Language Model Applications via KV Cache Streaming (SIGCOMM24)</title>
            <author>
              <organization/>
            </author>
            <date year="2024"/>
          </front>
        </reference>
        <reference anchor="CacheBlend" target="https://arxiv.org/abs/2405.16444">
          <front>
            <title>CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion</title>
            <author>
              <organization/>
            </author>
            <date year="2024"/>
          </front>
        </reference>
      </references>
    </references>
  </back>

</rfc>
