<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?rfc toc="yes"?>
<?rfc tocompact="yes"?>
<?rfc tocdepth="3"?>
<?rfc tocindent="yes"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes"?>
<?rfc comments="yes"?>
<?rfc inline="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<rfc category="info" docName="draft-liu-multicast-for-computing-storage-00"
     ipr="trust200902">
  <front>
    <title abbrev="draft-liu-multicast-for-computing-storage-00">Multicast for
    Computing and Storage</title>

    <author fullname="Yisong Liu" initials="Y." surname="Liu">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street/>

          <city/>

          <region/>

          <code/>

          <country/>
        </postal>

        <phone/>

        <facsimile/>

        <email>liuyisong@chinamobile.com</email>

        <uri/>
      </address>
    </author>

    <author fullname="Xuesong Geng" initials="X." surname="Geng">
      <organization>Huawei</organization>

      <address>
        <email>gengxuesong@huawei.com</email>

        <uri/>
      </address>
    </author>

    <date day="10" month="July" year="2023"/>

    <abstract>
      <t>This document describes multicast use cases for computing and
      storage in data centers.</t>
    </abstract>

    <note title="Requirements Language">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in <xref
      target="RFC2119">RFC 2119</xref>.</t>
    </note>
  </front>

  <middle>
    <section title="Introduction">
      <t>Some data center applications have point-to-multipoint
      communication patterns and would benefit from a network multicast
      service. Without such a service, these applications fall back to
      server-based packet replication when they migrate to public clouds.
      This inflates CPU load and prevents tenants from sustaining high
      throughput and low latency for multicast workloads.</t>

      <t>To illustrate the requirements for computing and storage, this
      document lists three typical scenarios that could use
      point-to-multipoint (P2MP) multicast services: multi-tenant cloud,
      computing, and storage.</t>

      <t>The multicast requirements of each scenario are described in terms
      of the following three aspects:</t>

      <t>- Network Scale: number of switches, number of links, number of
      hosts</t>

      <t>- Multicast Tree Size: number of intermediate nodes, number of
      receivers</t>

      <t>- Multicast Service Number</t>
    </section>

    <section title="Use in Multi-tenant Cloud">
      <t>As illustrated in the following figure, a data center may contain a
      network fabric configured in unicast-only mode, hosts running as
      virtual machines (VMs) managed by tenants, and central replicators
      (C-Rs) that provide an IPv6 Multicast Source Routing (MSR6) packet
      delivery service among the hosts of a tenant.</t>

      <t><figure>
          <artwork><![CDATA[
       +----+   +----+   +----+
       |C-R1|   |C-R2|   |C-Rn| ====>Central-Replicator
       +-+--+   +-+--+   +-+--+ 
         |        |        |    
   +-----+--------+--------+-----+
   | Spine +Leaf +vSwitch Fabric |
   |       (Unicast Only)        |
   +--+--+--+--+--+--+--+--+--+--+
      |  |  |  |  |  |  |  |    
      h1 |  h3 h4 |  h6 |  h8   ====>Tenant-1
         |        |     |       
         h2       h5    h7      ====>Tenant-2
]]></artwork>
        </figure></t>

      <t>Take Tenant-1 as an example. Host h1 can send a multicast flow as
      MSR6 packets to C-R1, with one or more of the destination hosts h3, h4,
      h6, and h8 encoded in the MSR6 header. An MSR6 packet may be sent to
      C-R1, where it is replicated and sent to all of the desired destination
      hosts. Alternatively, an MSR6 packet may be sent to C-R1, where it is
      replicated and sent to some of the destination hosts, with another copy
      sent to C-R2 for replication and delivery to the remaining destination
      hosts.</t>
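
      <t>The two delivery options for Tenant-1 can be viewed as a partition
      of the destination list across replicators. The following Python sketch
      is illustrative only: the function name split_destinations and the
      round-robin assignment are assumptions made for this example and are
      not defined by MSR6 or by this document.</t>

      <t><figure>
          <artwork><![CDATA[
```python
def split_destinations(destinations, replicators):
    """Assign each destination host to one central replicator."""
    if not replicators:
        raise ValueError("at least one replicator is required")
    # Hypothetical policy: round-robin over the available C-Rs.
    plan = {r: [] for r in replicators}
    for i, host in enumerate(destinations):
        plan[replicators[i % len(replicators)]].append(host)
    return plan


# Destinations of h1 in Tenant-1, taken from the figure.
tenant1 = ["h3", "h4", "h6", "h8"]

# Option 1: C-R1 replicates to every destination.
single = split_destinations(tenant1, ["C-R1"])

# Option 2: C-R1 serves part of the hosts, C-R2 the remaining ones.
shared = split_destinations(tenant1, ["C-R1", "C-R2"])

print(single)
print(shared)
```
]]></artwork>
        </figure></t>

      <t>In the second option, h1 sends one copy per replicator rather than
      one copy per destination host, so the number of copies leaving h1
      depends on the number of C-Rs used and not on the group size.</t>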

      <t>A tenant may have a dedicated set of C-Rs for its own use, or it may
      use a shared set of C-Rs to meet its replication requirements among
      VMs.</t>
    </section>

    <section title="Typical Multicast Scenario in Computing">
      <section title="AI Training">
        <t>The following figure shows a typical RDMA AI training scenario.</t>

        <t><figure>
            <artwork><![CDATA[                 PS(Parameter Server) Nodes
               +-------+          +-------+
               |  CPU  |          |  CPU  |
               | Server|          | Server|
               +-+-+-+-+          +-+-+-+-+
    ^            | | |              | | |          |
    |         +--|-|-|--------------+ | |          |
    |       +----+ | +----------------------+      |
    |       | |    +--------+ +-------+ |   |      V
Gradients   | |             | |         |   | Parameters
        +---+-+-+       +---+-+-+     +-+---+-+
        |  GPU  |       |  GPU  |     |  GPU  |
        | Worker|       | Worker|     | Worker|
        +-------+       +-------+     +-------+]]></artwork>
          </figure></t>

        <t>Worker-&gt;PS: Each worker pushes its gradients to the PS
        nodes.</t>

        <t>PS-&gt;Worker: After aggregation, the PS nodes send the updated
        parameters back to all workers.</t>

        <t>In this process, the second stage (PS-&gt;Worker) distributes the
        same data to every worker. With unicast, the data is sent over N
        separate connections, one per worker, so the same content crosses the
        sender's outgoing link N times. The bandwidth efficiency is therefore
        1/N: the larger the scale, the lower the efficiency.</t>
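
        <t>The 1/N figure can be checked with a short calculation. The
        following Python sketch is illustrative only: the function names are
        hypothetical, efficiency is taken to mean one useful copy divided by
        the copies sent on the sender's outgoing link, and the multicast case
        assumes that the network replicates a single copy sent by the
        source.</t>

        <t><figure>
            <artwork><![CDATA[
```python
def copies_on_source_link(receivers, multicast):
    """Copies of the same data on the sender's outgoing link."""
    if receivers < 1:
        raise ValueError("receivers must be at least 1")
    # Assumption: with multicast, the network does the replication.
    return 1 if multicast else receivers


def bandwidth_efficiency(receivers, multicast=False):
    """One useful copy divided by the copies actually sent."""
    return 1 / copies_on_source_link(receivers, multicast)


for n in (10, 1000, 10000):
    unicast = bandwidth_efficiency(n)
    mcast = bandwidth_efficiency(n, multicast=True)
    print(n, unicast, mcast)
```
]]></artwork>
          </figure></t>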

        <t><figure>
            <artwork><![CDATA[                      +---------------+
                      |     Source    |
                      | +---+   +---+ |
                      | |CPU|   |GPU| |
                      | +-+-+   +-+-+ |
                      |   |       |   |
                      |    \     /    |
                      |   +-V---V-+   |
                      |   |  HCA  |   |
                      |   +-------+   |
                      +--+-+-+-+-+-+--+
                         | | ... | |
                      +--V-V-----V-V--+
                      |     Switch    | 
                      +-+-----------+-+
                       /             \
        +-------------V-+           +-V-------------+
        |  Destination  |           |  Destination  |
        |   +-------+   |           |   +-------+   |
        |   |  HCA  |   |           |   |  HCA  |   |
        |   +-V---V-+   |           |   +-V---V-+   |
        |    /     \    |           |    /     \    |
        |   |       |   |           |   |       |   |
        | +-+-+   +-+-+ |           | +-+-+   +-+-+ |
        | |CPU|   |GPU| |           | |CPU|   |GPU| |
        | +---+   +---+ |           | +---+   +---+ |
        +---------------+           +---------------+]]></artwork>
          </figure>If the source sends only one copy to the network and the
        switches replicate the packet to the different destinations,
        bandwidth is used more efficiently and training is faster.</t>

        <t>The large-scale multicast requirements in this scenario are as
        follows:</t>

        <t>- Network Scale: 10 to 10k GPUs</t>

        <t>- Multicast Tree Size: 10 to 10k receivers</t>

        <t>- Multicast Service Number: depends on the scenario</t>
      </section>

      <section title="HPC">
        <t>The following figure shows an example of the Message Passing
        Interface (MPI) in a high-performance computing (HPC) scenario.</t>

        <t><figure>
            <artwork><![CDATA[      +-------------------------------------------+
      |                Dispatcher                 |
      |                  Master                   |
      +---------------------+---------------------+
                            |
          +-----------------+
          |  
      +---+----+  +--------+             +--------+
      |+--V---+|  |+------+|             |+------+|
      ||Dispa-||  ||Dispa-||             ||Dispa-||
      ||Agent ||  ||Agent ||             ||Agent ||
      |+---+--+|  |+---+--+|             |+---+--+|
      |    |   |  |    |   |             |    |   |
      |+---V--+|  |+---V--+|             |+---V--+|
      ||  MPI ||  ||  MPI ||     ...     ||  MPI ||
      ||Proces||  ||Proces||             ||Proces||
      |+---^--+|  |+---^--+|             |+---^--+|
      |    |   |  |    |   |             |    |   |
      |+---V--+|  |+---V--+|             |+---V--+|
      || RoCE |<-->| RoCE |<------------->| RoCE ||
      |+------+|  |+------+|             |+------+|
      +--------+  +--------+             +--------+]]></artwork>
          </figure></t>

        <t>Stage 1: The Dispatcher Master discovers millions of cores and
        schedules million-rank MPI jobs on demand. The Dispatcher Master
        sends the scheduling results to a Dispatcher Agent.</t>

        <t>Stage 2: The Dispatcher Agents start the million-rank MPI job on
        each node. The Dispatcher Agent that receives the message broadcasts
        it to the other Dispatcher Agents, and each agent performs
        initialization before starting the MPI application.</t>

        <t>Stage 3: The Dispatcher Agent broadcasts the message to start the
        MPI application. After the MPI application has started, MPI internal
        initialization synchronizes the RoCE endpoints using an allgather
        operation.</t>

        <t>The last two stages could benefit from multicast, which would
        reduce task completion time.</t>
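
        <t>To see why the allgather step in Stage 3 is sensitive to scale,
        the following Python sketch counts the messages that senders inject
        for one exchange. It is illustrative only: the function names and
        endpoint labels are hypothetical, and the unicast case assumes a
        naive direct exchange in which every rank sends its endpoint to every
        other rank. Real MPI libraries use optimized allgather algorithms, so
        the count is a simplified comparison and not a measurement.</t>

        <t><figure>
            <artwork><![CDATA[
```python
def allgather(endpoints):
    """Every rank ends up with the endpoint of every rank."""
    return {rank: dict(endpoints) for rank in endpoints}


def messages_sent(ranks, multicast):
    """Messages injected by senders for one naive allgather."""
    if ranks < 1:
        raise ValueError("ranks must be at least 1")
    # Unicast: each rank sends to the other ranks individually.
    # Multicast (assumed): each rank sends once to the group.
    per_rank = 1 if multicast else ranks - 1
    return ranks * per_rank


# Hypothetical RoCE endpoint labels for four ranks.
endpoints = {0: "ep0", 1: "ep1", 2: "ep2", 3: "ep3"}
view = allgather(endpoints)

print(view[2])
print(messages_sent(len(endpoints), multicast=False))
print(messages_sent(len(endpoints), multicast=True))
```
]]></artwork>
          </figure></t>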

        <t>The large-scale multicast requirements in this scenario are as
        follows:</t>

        <t>- Network Scale: 1000k CPUs/GPUs</t>

        <t>- Multicast Tree Size: 10k to 100k receivers</t>

        <t>- Multicast Service Number: 1 to 100</t>
      </section>
    </section>

    <section title="Typical Multicast Scenario in Storage">
      <t>Ceph is an open-source distributed software platform. It focuses
      mainly on scale-out file systems, including storage distribution and
      availability, and is widely used in storage.</t>

      <t>Ceph Object Storage Daemons (OSDs) are responsible for storing
      objects on a local file system on behalf of Ceph clients. Ceph OSDs
      also use the CPU, memory, and networking of Ceph cluster nodes for data
      replication, erasure coding, recovery, monitoring, and reporting
      functions.</t>

      <t>The following process requires a P2MP service:</t>

      <t>- An application initiates a "write" operation from a client to a
      server.</t>

      <t>- The client finds the servers to write to, and three copies are
      sent to three servers.</t>

      <t><figure>
          <artwork><![CDATA[               +-------+          +-------+
               |Client1|          |Client2|
               +---+---+          +---+---+
                   |                  |
                   +---------+--------+
                             |
                     +-------+-------+
                     |     Switch    | 
                     +-------+-------+
                             |
            +----------------+----------------+
            |                |                |            
        +---+---+        +---+---+        +---+---+
        | Server|        | Server|        | Server|
        +-------+        +-------+        +-------+]]></artwork>
        </figure></t>

      <t>The large-scale multicast requirements in this scenario are as
      follows:</t>

      <t>- Network Scale: 3k servers (1 pod)</t>

      <t>- Multicast Tree Size: 3 receivers</t>

      <t>- Multicast Service Number: 10k</t>
    </section>

    <section anchor="IANA" title="IANA Considerations">
      <t>This document makes no request of IANA.</t>
    </section>

    <section anchor="Security" title="Security Considerations">
      <t>TBD</t>
    </section>

    <section anchor="Acknowledgements" title="Acknowledgements">
      <t>TBD</t>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include="reference.RFC.2119"?>

      <?rfc include="reference.RFC.8200"?>

      <?rfc include="reference.RFC.3493"?>

      <?rfc include="reference.RFC.3542"?>

      <?rfc include="reference.I-D.ietf-avtcore-rtp-topologies-update"?>

      <?rfc include='reference.I-D.cheng-spring-ipv6-msr-design-consideration'?>
    </references>
  </back>
</rfc>
