<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE rfc [
  <!ENTITY nbsp    "&#160;">
  <!ENTITY zwsp   "&#8203;">
  <!ENTITY nbhy   "&#8209;">
  <!ENTITY wj     "&#8288;">
]>
<?xml-stylesheet type="text/xsl" href="rfc2629.xslt" ?>
<!-- generated by https://github.com/cabo/kramdown-rfc version 1.7.27 (Ruby 3.3.6) -->
<rfc xmlns:xi="http://www.w3.org/2001/XInclude" ipr="trust200902" docName="draft-shi-moq-kvcache-01" category="info" submissionType="IETF" tocInclude="true" sortRefs="true" symRefs="true" version="3">
  <!-- xml2rfc v2v3 conversion 3.28.1 -->
  <front>
    <title abbrev="KVCache">KVCache over MoQT</title>
    <seriesInfo name="Internet-Draft" value="draft-shi-moq-kvcache-01"/>
    <author initials="H." surname="Shi" fullname="Hang Shi">
      <organization>Huawei Technologies</organization>
      <address>
        <postal>
          <country>China</country>
        </postal>
        <email>shihang9@huawei.com</email>
      </address>
    </author>
    <author initials="S." surname="Yue" fullname="Shengnan Yue">
      <organization>China Mobile</organization>
      <address>
        <postal>
          <country>China</country>
        </postal>
        <email>yueshengnan@chinamobile.com</email>
      </address>
    </author>
    <date year="2025" month="April" day="10"/>
    <area/>
    <workgroup>Media Over QUIC</workgroup>
    <keyword>AI inference</keyword>
    <keyword>KVCache</keyword>
    <abstract>

<t>Large language model (LLM) inference involves two stages: prefill and decode. The prefill phase processes the prompt in parallel and generates the KVCache, which the decode phase then uses to produce output tokens sequentially. The KVCache can be reused when the model and the prompt are the same, reducing the computing cost of the prefill phase. However, its large size makes efficient transfer challenging. Delivering the KVCache over architectures enabled by a publish/subscribe transport such as MoQT allows local nodes to cache it for later retrieval via new subscriptions, saving bandwidth. This document specifies the transmission of KVCache over MoQT.</t>
    </abstract>
    <note removeInRFC="true">
      <name>Discussion Venues</name>
      <t>Discussion of this document takes place on the
    Media Over QUIC  mailing list (moq@ietf.org),
    which is archived at <eref target="https://mailarchive.ietf.org/arch/browse/moq/"/>.</t>
      <t>Source for this draft and an issue tracker can be found at
    <eref target="https://github.com/VMatrix1900/draft-moq-kvcache"/>.</t>
    </note>
  </front>
  <middle>

<section anchor="introduction-kvcache-in-llm-inference">
      <name>Introduction: KVCache in LLM inference</name>
      <t>The inference process of large language models is typically divided into two distinct stages: prefill and decode. The prefill phase processes the input prompt in parallel, generating a KVCache that serves as an essential input to the decode phase. The decode phase then uses the KVCache to generate output tokens sequentially, one at a time. Prefill is computationally intensive, whereas decode is constrained by memory bandwidth. Because of these differing resource requirements, prefill and decode are often deployed on separate computing clusters: prefill nodes use hardware optimized for computational throughput, decode nodes use hardware optimized for memory-bandwidth efficiency, and the KVCache is transferred between them.</t>
      <figure anchor="fig-kvcache">
        <name>LLM inference process</name>
        <artwork><![CDATA[
               +--------------------+
               |    Prompt Input    |
               |  (System + User)   |
               +--------------------+
            Tokenization |
                ---------------------
                |                   |
                v                   |
    +--------------------+          |
    |   Prefill Nodes    |          |
    | (Generate KVCache) |          |
    +--------------------+          |
                |                   |
                v                   |
    +--------------------+          |
    |      KVCache       |<---------+
    | (Stored & Reused)  |
    +--------------------+
                |
        +------------------------+
        |                        |
        v                        v
+----------------+       +----------------+
|  Decode Node 1 |  ...  |  Decode Node N |
| (Use KVCache)  |       | (Use KVCache)  |
+----------------+       +----------------+

]]></artwork>
      </figure>
      <t>KVCache is significantly large: a single token requires 160 KB for a 70B model with 8-bit quantization, so the KVCache for a 1000-token prompt reaches 160 MB. To reduce its size, various quantization and compression algorithms have been proposed, such as <xref target="CacheGen"/>. Furthermore, KVCache can be reused across sessions when derived from the same prompt and model, as shown in <xref target="fig-kvcache"/>. The most basic reuse strategy is prefix caching, where the KVCache is shared among prompts with a common prefix. More advanced methods, such as <xref target="CacheBlend"/>, improve reuse efficiency by selectively reusing KVCache beyond prefix matching. To minimize transmission costs, a publish/subscribe architecture is required to distribute the KVCache. This document defines how to send KVCache over MoQT <xref target="I-D.ietf-moq-transport"/>.</t>
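      <t>To illustrate where these figures come from, the per-token size can be derived from a model's attention dimensions: each token stores one key and one value vector per layer. The layer count, KV-head count, and head dimension below are illustrative values for a 70B-class model with grouped-query attention, not values defined by this document:</t>
      <sourcecode type="python"><![CDATA[
# Back-of-the-envelope KVCache sizing (illustrative, non-normative).
NUM_LAYERS = 80        # assumed transformer layer count
NUM_KV_HEADS = 8       # assumed key/value head count (grouped-query attention)
HEAD_DIM = 128         # assumed dimension per head
BYTES_PER_WEIGHT = 1   # 8-bit quantization

# One key vector and one value vector per layer, per token.
bytes_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_WEIGHT
assert bytes_per_token == 160 * 1024   # 160 KB per token

# A 1000-token prompt therefore needs on the order of 160 MB.
prompt_bytes = 1000 * bytes_per_token
print(prompt_bytes)  # 163840000
]]></sourcecode>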
    </section>
    <section anchor="conventions-and-definitions">
      <name>Conventions and Definitions</name>
      <t>The key words "<bcp14>MUST</bcp14>", "<bcp14>MUST NOT</bcp14>", "<bcp14>REQUIRED</bcp14>", "<bcp14>SHALL</bcp14>", "<bcp14>SHALL
NOT</bcp14>", "<bcp14>SHOULD</bcp14>", "<bcp14>SHOULD NOT</bcp14>", "<bcp14>RECOMMENDED</bcp14>", "<bcp14>NOT RECOMMENDED</bcp14>",
"<bcp14>MAY</bcp14>", and "<bcp14>OPTIONAL</bcp14>" in this document are to be interpreted as
described in BCP 14 <xref target="RFC2119"/> <xref target="RFC8174"/> when, and only when, they
appear in all capitals, as shown here.</t>

<t>This document uses the following terms:</t>
      <ul spacing="normal">
        <li>
          <t>LLM: A large language model (LLM) that utilizes the attention mechanism to process and generate text efficiently by capturing long-range dependencies within input sequences.</t>
        </li>
        <li>
          <t>KVCache: A key-value cache storing intermediate representations used in LLM inference.</t>
        </li>
        <li>
          <t>Prompt: A prompt consists of two parts: the system prompt and the user prompt. The system prompt is predefined by the LLM model developer to guide the model's behavior, while the user prompt is provided dynamically by the user to specify the task or request.</t>
        </li>
        <li>
          <t>Token: The smallest unit of processing in LLM inference, typically representing a word or subword.</t>
        </li>
      </ul>
    </section>
    <section anchor="kvcache-data-model">
      <name>KVCache Data Model</name>
      <t>The KVCache data model is structured as follows.</t>
      <t><strong>Naming</strong>: This specification defines a Track Namespace consisting of the tuples (moq://kvcache.moq.arpa/v1/), (modelName), and (prompt). The track name identifies the compression scheme applied to the KVCache; a track is therefore identified by the tuple (<tt>&lt;compression&gt;</tt>), and the full track name has the following format (when represented as a string):</t>
      <t><tt>
moq://kvcache.moq.arpa/v1/&lt;modelName&gt;/&lt;prompt&gt;/&lt;compression&gt;
</tt></t>
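      <t>As a non-normative sketch, the full track name can be formed by joining the Track Namespace tuples with the track name. The model name and prompt values below are placeholders chosen for illustration:</t>
      <sourcecode type="python"><![CDATA[
# Build a full KVCache track name (illustrative helper, not normative).
BASE = "moq://kvcache.moq.arpa/v1"

def full_track_name(model_name: str, prompt: str, compression: str) -> str:
    # Namespace tuples (base, modelName, prompt) followed by the
    # track name (compression).
    return "/".join([BASE, model_name, prompt, compression])

print(full_track_name("example-70b", "example-prompt", "FP8"))
# moq://kvcache.moq.arpa/v1/example-70b/example-prompt/FP8
]]></sourcecode>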
      <t>The following compression schemes are defined in this specification, along with their per-weight sizes:</t>
      <table anchor="tab-kvcache-compression">
        <name>Compression of KVCache</name>
        <thead>
          <tr>
            <th align="left">Compression</th>
            <th align="left">Description</th>
            <th align="left">Size per Weight</th>
          </tr>
        </thead>
        <tbody>
          <tr>
            <td align="left">FP16</td>
            <td align="left">Quantized using FP16</td>
            <td align="left">2 bytes</td>
          </tr>
          <tr>
            <td align="left">BF16</td>
            <td align="left">Quantized using BF16</td>
            <td align="left">2 bytes</td>
          </tr>
          <tr>
            <td align="left">FP8</td>
            <td align="left">Quantized using FP8</td>
            <td align="left">1 byte</td>
          </tr>
          <tr>
            <td align="left">Int8</td>
            <td align="left">Quantized using Int8</td>
            <td align="left">1 byte</td>
          </tr>
          <tr>
            <td align="left">FP4</td>
            <td align="left">Quantized using FP4</td>
            <td align="left">0.5 byte</td>
          </tr>
          <tr>
            <td align="left">Int4</td>
            <td align="left">Quantized using Int4</td>
            <td align="left">0.5 byte</td>
          </tr>
          <tr>
            <td align="left">AC (5x)</td>
            <td align="left">Compressed using Arithmetic Coding (5x ratio)</td>
            <td align="left">Variable</td>
          </tr>
        </tbody>
      </table>
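      <t>The table can be read as a lookup from the compression component of the track name to the size per weight. The per-token weight count below reuses illustrative 70B-class model dimensions (2 for key and value, 80 layers, 8 KV heads, head dimension 128) and is an assumption, not part of this specification:</t>
      <sourcecode type="python"><![CDATA[
# Per-weight sizes from the table above; AC is variable-rate and omitted.
BYTES_PER_WEIGHT = {
    "FP16": 2.0, "BF16": 2.0,
    "FP8": 1.0, "Int8": 1.0,
    "FP4": 0.5, "Int4": 0.5,
}

# Illustrative per-token weight count (assumed model dimensions).
WEIGHTS_PER_TOKEN = 2 * 80 * 8 * 128  # 163,840

for name, size in BYTES_PER_WEIGHT.items():
    kb = WEIGHTS_PER_TOKEN * size / 1024
    print(f"{name}: {kb:g} KB per token")
# FP16/BF16: 320 KB, FP8/Int8: 160 KB, FP4/Int4: 80 KB per token
]]></sourcecode>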
      <t><strong>Group ID</strong>: Tokens are split into chunks of uniform length (a typical value is 128). The KVCache is organized into groups corresponding to these token chunks, and the Group ID is the index of a token chunk within the KVCache.</t>
      <t><strong>Object ID</strong>: An identifier for a specific token within a group.</t>
      <t><strong>Object Payload</strong>: The content of the KVCache, which varies based on the compression algorithm used for storage and transmission.</t>
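      <t>Under this data model, locating the cache entry for a given token is a division by the chunk length. A minimal sketch, assuming the typical chunk length of 128 mentioned above:</t>
      <sourcecode type="python"><![CDATA[
CHUNK_LEN = 128  # typical token-chunk length from the data model

def locate(token_index: int, chunk_len: int = CHUNK_LEN) -> tuple:
    """Map a token index to its (Group ID, Object ID)."""
    return token_index // chunk_len, token_index % chunk_len

print(locate(0))    # (0, 0)
print(locate(130))  # (1, 2): second chunk, third token within it
]]></sourcecode>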
    </section>
    <section anchor="security-considerations">
      <name>Security Considerations</name>
      <t>TBD</t>
    </section>
    <section anchor="iana-considerations">
      <name>IANA Considerations</name>
      <t>TBD</t>
    </section>
  </middle>
  <back>
    <references anchor="sec-combined-references">
      <name>References</name>
      <references anchor="sec-normative-references">
        <name>Normative References</name>
        <reference anchor="RFC2119">
          <front>
            <title>Key words for use in RFCs to Indicate Requirement Levels</title>
            <author fullname="S. Bradner" initials="S." surname="Bradner"/>
            <date month="March" year="1997"/>
            <abstract>
              <t>In many standards track documents several words are used to signify the requirements in the specification. These words are often capitalized. This document defines these words as they should be interpreted in IETF documents. This document specifies an Internet Best Current Practices for the Internet Community, and requests discussion and suggestions for improvements.</t>
            </abstract>
          </front>
          <seriesInfo name="BCP" value="14"/>
          <seriesInfo name="RFC" value="2119"/>
          <seriesInfo name="DOI" value="10.17487/RFC2119"/>
        </reference>
        <reference anchor="RFC8174">
          <front>
            <title>Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words</title>
            <author fullname="B. Leiba" initials="B." surname="Leiba"/>
            <date month="May" year="2017"/>
            <abstract>
              <t>RFC 2119 specifies common key words that may be used in protocol specifications. This document aims to reduce the ambiguity by clarifying that only UPPERCASE usage of the key words have the defined special meanings.</t>
            </abstract>
          </front>
          <seriesInfo name="BCP" value="14"/>
          <seriesInfo name="RFC" value="8174"/>
          <seriesInfo name="DOI" value="10.17487/RFC8174"/>
        </reference>
      </references>
      <references anchor="sec-informative-references">
        <name>Informative References</name>
        <reference anchor="I-D.ietf-moq-transport">
          <front>
            <title>Media over QUIC Transport</title>
            <author fullname="Luke Curley" initials="L." surname="Curley">
              <organization>Discord</organization>
            </author>
            <author fullname="Kirill Pugin" initials="K." surname="Pugin">
              <organization>Meta</organization>
            </author>
            <author fullname="Suhas Nandakumar" initials="S." surname="Nandakumar">
              <organization>Cisco</organization>
            </author>
            <author fullname="Victor Vasiliev" initials="V." surname="Vasiliev">
              <organization>Google</organization>
            </author>
            <author fullname="Ian Swett" initials="I." surname="Swett">
              <organization>Google</organization>
            </author>
            <date day="3" month="March" year="2025"/>
            <abstract>
              <t>   This document defines the core behavior for Media over QUIC Transport
   (MOQT), a media transport protocol designed to operate over QUIC and
   WebTransport, which have similar functionality.  MOQT allows a
   producer of media to publish data and have it consumed via
   subscription by a multiplicity of endpoints.  It supports
   intermediate content distribution networks and is designed for high
   scale and low latency distribution.

              </t>
            </abstract>
          </front>
          <seriesInfo name="Internet-Draft" value="draft-ietf-moq-transport-10"/>
        </reference>
        <reference anchor="CacheGen" target="https://github.com/UChi-JCL/CacheGen">
          <front>
            <title>CacheGen: Fast Context Loading for Language Model Applications via KV Cache Streaming (SIGCOMM24)</title>
            <author>
              <organization/>
            </author>
            <date year="2024"/>
          </front>
        </reference>
        <reference anchor="CacheBlend" target="https://arxiv.org/abs/2405.16444">
          <front>
            <title>CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion</title>
            <author>
              <organization/>
            </author>
            <date year="2024"/>
          </front>
        </reference>
      </references>
    </references>
  </back>

</rfc>
