<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE rfc [
  <!ENTITY nbsp    "&#160;">
  <!ENTITY zwsp   "&#8203;">
  <!ENTITY nbhy   "&#8209;">
  <!ENTITY wj     "&#8288;">
]>
<?xml-stylesheet type="text/xsl" href="rfc2629.xslt" ?>
<!-- generated by https://github.com/cabo/kramdown-rfc version 1.7.24 (Ruby 3.2.3) -->
<rfc xmlns:xi="http://www.w3.org/2001/XInclude" ipr="trust200902" docName="draft-shi-moq-kvcache-00" category="info" submissionType="IETF" tocInclude="true" sortRefs="true" symRefs="true" version="3">
  <!-- xml2rfc v2v3 conversion 3.28.0 -->
  <front>
    <title abbrev="KVCache">KVCache over MoQT</title>
    <seriesInfo name="Internet-Draft" value="draft-shi-moq-kvcache-00"/>
    <author initials="H." surname="Shi" fullname="Hang Shi">
      <organization>Huawei Technologies</organization>
      <address>
        <postal>
          <country>China</country>
        </postal>
        <email>shihang9@huawei.com</email>
      </address>
    </author>
    <date year="2025" month="March" day="03"/>
    <area>WIT</area>
    <workgroup>Media over QUIC</workgroup>
    <keyword>AI inference, KVCache</keyword>
    <abstract>
      <?line 46?>

<t>Large language model (LLM) inference involves two stages: prefill and decode. The prefill phase processes the prompt in parallel, generating the KVCache, which the decode phase then uses to produce tokens sequentially. The KVCache can be reused when the model and prompt are the same, reducing the computational cost of prefill. However, its large size makes efficient transfer challenging. Delivering KVCache over an architecture built on a publish/subscribe transport such as MoQT allows local nodes to cache it for later retrieval via new subscriptions, saving bandwidth. This document specifies the transmission of KVCache over MoQT.</t>
    </abstract>
  </front>
  <middle>
    <?line 50?>

<section anchor="introduction-kvcache-in-llm-inference">
      <name>Introduction: KVCache in LLM inference</name>
      <t>The inference process of large language models is typically divided into two distinct stages: prefill and decode. The prefill phase processes the input prompt in parallel, generating a KVCache, which serves as an essential input for the decode phase. The decode phase then uses the KVCache to generate output tokens sequentially, one at a time. Prefill is computationally intensive, whereas decode is constrained by memory bandwidth. Because of these differing resource requirements, the prefill and decode phases are often deployed on separate computing clusters, using hardware optimized for computational throughput on prefill nodes and for memory bandwidth on decode nodes, with the KVCache transferred between them.</t>
      <figure anchor="fig-kvcache">
        <name>LLM inference process</name>
        <artwork><![CDATA[
               +--------------------+
               |    Prompt Input    |
               |  (System + User)   |
               +--------------------+
            Tokenization |
                ---------------------
                |                   |
                v                   |
    +--------------------+          |
    |   Prefill Nodes    |          |
    | (Generate KVCache) |          |
    +--------------------+          |
                |                   |
                v                   |
    +--------------------+          |
    |      KVCache       |<---------+
    | (Stored & Reused)  |
    +--------------------+
                |
      -----------------------------
      |              |            |
      v              v            v
+----------------+       +----------------+
|  Decode Node 1 |  ...  |  Decode Node N |
| (Use KVCache)  |       | (Use KVCache)  |
+----------------+       +----------------+

]]></artwork>
      </figure>
      <t>KVCache is significantly large: a single token requires 160 KB for a 70B model with 8-bit quantization, so the KVCache for a prompt of 1000 tokens reaches 160 MB. To reduce the size of the KVCache, various quantization and compression algorithms have been proposed, such as <xref target="CacheGen"/>. Furthermore, KVCache can be reused across sessions if it is derived from the same prompt and model, as shown in <xref target="fig-kvcache"/>. The most basic reuse strategy is prefix caching, where KVCache is shared among prompts with a common prefix. More advanced methods, such as <xref target="CacheBlend"/>, improve reuse efficiency by selectively reusing KVCache beyond prefix matching. To minimize transmission costs, a publish/subscribe architecture is needed to distribute KVCache. This document defines how to send KVCache over MoQT <xref target="I-D.ietf-moq-transport"/>.</t>
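      <t>As a non-normative illustration, the per-token figure above can be reproduced from typical model dimensions. The sketch below assumes a 70B-class model with grouped-query attention, 80 layers, 8 KV heads, and a head dimension of 128; these dimensions are illustrative assumptions, not part of this specification.</t>
      <sourcecode type="python"><![CDATA[
# Non-normative sketch: reproducing the per-token KVCache size quoted
# above.  The model dimensions are illustrative assumptions for a
# 70B-class model with grouped-query attention, not defined by this
# specification.
def kvcache_bytes_per_token(num_layers, num_kv_heads, head_dim,
                            bytes_per_weight):
    # Each layer stores one Key and one Value vector per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_weight

# 8-bit quantization (1 byte per weight):
per_token = kvcache_bytes_per_token(num_layers=80, num_kv_heads=8,
                                    head_dim=128, bytes_per_weight=1)
print(per_token)          # 163840 bytes = 160 KB per token
print(per_token * 1000)   # 163840000 bytes, roughly 160 MB for 1000 tokens
]]></sourcecode>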
    </section>
    <section anchor="conventions-and-definitions">
      <name>Conventions and Definitions</name>
      <t>The key words "<bcp14>MUST</bcp14>", "<bcp14>MUST NOT</bcp14>", "<bcp14>REQUIRED</bcp14>", "<bcp14>SHALL</bcp14>", "<bcp14>SHALL
NOT</bcp14>", "<bcp14>SHOULD</bcp14>", "<bcp14>SHOULD NOT</bcp14>", "<bcp14>RECOMMENDED</bcp14>", "<bcp14>NOT RECOMMENDED</bcp14>",
"<bcp14>MAY</bcp14>", and "<bcp14>OPTIONAL</bcp14>" in this document are to be interpreted as
described in BCP 14 <xref target="RFC2119"/> <xref target="RFC8174"/> when, and only when, they
appear in all capitals, as shown here.</t>
      <?line -18?>

<t>This document uses the following terms:</t>
      <ul spacing="normal">
        <li>
          <t>LLM: A large language model, which uses the attention mechanism to process and generate text efficiently by capturing long-range dependencies within input sequences.</t>
        </li>
        <li>
          <t>KVCache: A key-value cache storing intermediate representations used in LLM inference.</t>
        </li>
        <li>
          <t>Prompt: A prompt consists of two parts: the system prompt and the user prompt. The system prompt is predefined by the LLM model developer to guide the model's behavior, while the user prompt is provided dynamically by the user to specify the task or request.</t>
        </li>
        <li>
          <t>Token: The smallest unit of processing in LLM inference, typically representing a word or subword.</t>
        </li>
      </ul>
    </section>
    <section anchor="kvcache-data-model">
      <name>KVCache Data Model</name>
      <t>The KVCache data model is structured as follows.</t>
      <t><strong>Naming</strong>: This specification defines a Track Namespace consisting of the following tuples: (moq://kvcache.moq.arpa/v1/), (modelName), (prompt). The Track Name identifies the compression level of the KVCache. A track is thus identified by the tuple (<tt>&lt;compression&gt;</tt>), and the full track name has the following format (when represented as a string):</t>
      <t><tt>
moq://kvcache.moq.arpa/v1/&lt;modelName&gt;/&lt;prompt&gt;/&lt;compression&gt;
</tt></t>
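      <t>As a non-normative illustration, a full track name string can be assembled from the namespace tuple and the compression level as follows; the model name and prompt values used here are hypothetical examples:</t>
      <sourcecode type="python"><![CDATA[
# Non-normative sketch: assembling a KVCache full track name string.
# "llama-70b" and "prompt-digest" are hypothetical example values.
BASE = "moq://kvcache.moq.arpa/v1"

def full_track_name(model_name, prompt, compression):
    # Track Namespace tuple: (BASE, model_name, prompt);
    # Track Name: the compression level.
    return "/".join([BASE, model_name, prompt, compression])

print(full_track_name("llama-70b", "prompt-digest", "FP8"))
# moq://kvcache.moq.arpa/v1/llama-70b/prompt-digest/FP8
]]></sourcecode>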
      <t>The following compression schemes are defined in this specification, along with their per-weight sizes:</t>
      <table anchor="tab-kvcache-compression">
        <name>Compression of KVCache</name>
        <thead>
          <tr>
            <th align="left">Compression</th>
            <th align="left">Description</th>
            <th align="left">Size per Weight</th>
          </tr>
        </thead>
        <tbody>
          <tr>
            <td align="left">FP16</td>
            <td align="left">Quantized using FP16</td>
            <td align="left">2 bytes</td>
          </tr>
          <tr>
            <td align="left">BF16</td>
            <td align="left">Quantized using BF16</td>
            <td align="left">2 bytes</td>
          </tr>
          <tr>
            <td align="left">FP8</td>
            <td align="left">Quantized using FP8</td>
            <td align="left">1 byte</td>
          </tr>
          <tr>
            <td align="left">Int8</td>
            <td align="left">Quantized using Int8</td>
            <td align="left">1 byte</td>
          </tr>
          <tr>
            <td align="left">FP4</td>
            <td align="left">Quantized using FP4</td>
            <td align="left">0.5 byte</td>
          </tr>
          <tr>
            <td align="left">Int4</td>
            <td align="left">Quantized using Int4</td>
            <td align="left">0.5 byte</td>
          </tr>
          <tr>
            <td align="left">AC (5x)</td>
            <td align="left">Compressed using Arithmetic Coding (5x ratio)</td>
            <td align="left">Variable</td>
          </tr>
        </tbody>
      </table>
      <t><strong>Group ID</strong>: Tokens are split into chunks of uniform length (a typical value is 128). The KVCache is organized into groups corresponding to these token chunks. The Group ID represents the index of a token chunk within the KVCache.</t>
      <t><strong>Object ID</strong>: An identifier for a specific token within a group.</t>
      <t><strong>Object Payload</strong>: The content of the KVCache, which varies based on the compression algorithm used for storage and transmission.</t>
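      <t>The mapping from a token's position in the prompt to its Group ID and Object ID can be sketched as follows; the chunk size of 128 is the typical value mentioned above, not a fixed constant of this specification:</t>
      <sourcecode type="python"><![CDATA[
# Non-normative sketch: locating a token's KVCache in the object model.
# The chunk size of 128 is the typical value noted above, not fixed.
CHUNK_SIZE = 128

def locate(token_index, chunk_size=CHUNK_SIZE):
    # Group ID is the index of the token chunk; Object ID is the
    # token's offset within that chunk.
    group_id, object_id = divmod(token_index, chunk_size)
    return group_id, object_id

print(locate(0))     # (0, 0): first token, first group
print(locate(300))   # (2, 44): third group, 45th token within it
]]></sourcecode>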
    </section>
    <section anchor="security-considerations">
      <name>Security Considerations</name>
      <t>TBD</t>
    </section>
    <section anchor="iana-considerations">
      <name>IANA Considerations</name>
      <t>TBD</t>
    </section>
  </middle>
  <back>
    <references anchor="sec-combined-references">
      <name>References</name>
      <references anchor="sec-normative-references">
        <name>Normative References</name>
        <reference anchor="RFC2119">
          <front>
            <title>Key words for use in RFCs to Indicate Requirement Levels</title>
            <author fullname="S. Bradner" initials="S." surname="Bradner"/>
            <date month="March" year="1997"/>
            <abstract>
              <t>In many standards track documents several words are used to signify the requirements in the specification. These words are often capitalized. This document defines these words as they should be interpreted in IETF documents. This document specifies an Internet Best Current Practices for the Internet Community, and requests discussion and suggestions for improvements.</t>
            </abstract>
          </front>
          <seriesInfo name="BCP" value="14"/>
          <seriesInfo name="RFC" value="2119"/>
          <seriesInfo name="DOI" value="10.17487/RFC2119"/>
        </reference>
        <reference anchor="RFC8174">
          <front>
            <title>Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words</title>
            <author fullname="B. Leiba" initials="B." surname="Leiba"/>
            <date month="May" year="2017"/>
            <abstract>
              <t>RFC 2119 specifies common key words that may be used in protocol specifications. This document aims to reduce the ambiguity by clarifying that only UPPERCASE usage of the key words have the defined special meanings.</t>
            </abstract>
          </front>
          <seriesInfo name="BCP" value="14"/>
          <seriesInfo name="RFC" value="8174"/>
          <seriesInfo name="DOI" value="10.17487/RFC8174"/>
        </reference>
      </references>
      <references anchor="sec-informative-references">
        <name>Informative References</name>
        <reference anchor="I-D.ietf-moq-transport">
          <front>
            <title>Media over QUIC Transport</title>
            <author fullname="Luke Curley" initials="L." surname="Curley">
              <organization>Discord</organization>
            </author>
            <author fullname="Kirill Pugin" initials="K." surname="Pugin">
              <organization>Meta</organization>
            </author>
            <author fullname="Suhas Nandakumar" initials="S." surname="Nandakumar">
              <organization>Cisco</organization>
            </author>
            <author fullname="Victor Vasiliev" initials="V." surname="Vasiliev">
              <organization>Google</organization>
            </author>
            <author fullname="Ian Swett" initials="I." surname="Swett">
              <organization>Google</organization>
            </author>
            <date day="1" month="March" year="2025"/>
            <abstract>
              <t>   This document defines the core behavior for Media over QUIC Transport
   (MOQT), a media transport protocol designed to operate over QUIC and
   WebTransport, which have similar functionality.  MOQT allows a
   producer of media to publish data and have it consumed via
   subscription by a multiplicity of endpoints.  It supports
   intermediate content distribution networks and is designed for high
   scale and low latency distribution.

              </t>
            </abstract>
          </front>
          <seriesInfo name="Internet-Draft" value="draft-ietf-moq-transport-09"/>
        </reference>
        <reference anchor="CacheGen" target="https://github.com/UChi-JCL/CacheGen">
          <front>
            <title>CacheGen: Fast Context Loading for Language Model Applications via KV Cache Streaming (SIGCOMM24)</title>
            <author>
              <organization/>
            </author>
            <date year="2024"/>
          </front>
        </reference>
        <reference anchor="CacheBlend" target="https://arxiv.org/abs/2405.16444">
          <front>
            <title>CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion</title>
            <author>
              <organization/>
            </author>
            <date year="2024"/>
          </front>
        </reference>
      </references>
    </references>
  </back>

</rfc>
