<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?rfc toc="yes"?>
<?rfc tocompact="yes"?>
<?rfc tocdepth="3"?>
<?rfc tocindent="yes"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes"?>
<?rfc comments="yes"?>
<?rfc inline="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<rfc category="std" docName="draft-wang-nof-framework-00" ipr="trust200902">
  <front>
    <title abbrev="NoF Network Framework">NVMe over Fabric Network
    Framework</title>

    <author fullname="Haibo Wang" initials="H." surname="Wang">
      <organization>Huawei</organization>

      <address>
        <postal>
          <street>No. 156 Beiqing Road</street>

          <city>Beijing</city>

          <region/>

          <code>100095</code>

          <country>P.R. China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>rainsword.wang@huawei.com</email>

        <uri/>
      </address>
    </author>

    <author fullname="Lily Zhao" initials="L." surname="Zhao">
      <organization>Huawei</organization>

      <address>
        <postal>
          <street>No. 3 Shangdi Information Road</street>

          <city>Beijing</city>

          <region/>

          <code>100085</code>

          <country>P.R. China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>Lily.zhao@huawei.com</email>

        <uri/>
      </address>
    </author>

    <author fullname="Shuanglong Chen" initials="S." surname="Chen">
      <organization>Huawei</organization>

      <address>
        <postal>
          <street>No. 156 Beiqing Road</street>

          <city>Beijing</city>

          <region/>

          <code>100095</code>

          <country>P.R. China</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>chenshuanglong@huawei.com</email>

        <uri/>
      </address>
    </author>

    <date day="7" month="March" year="2022"/>

    <abstract>
      <t>NVMe over Fabrics defines a common architecture for carrying the
      NVMe block storage protocol over a range of storage networking
      fabrics, such as Ethernet, Fibre Channel, and InfiniBand. On
      Ethernet-based networks, RDMA or TCP can be used to transport NVMe,
      but the network management mechanisms are simple and fault detection
      is weak.</t>

      <t>This document defines the architecture of the Ethernet-based NVMe
      control optimization technology, including service processes between
      hosts, storage devices and network switches, and fast fault-aware
      switchover.</t>
    </abstract>

    <note title="Requirements Language">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in <xref
      target="RFC2119">RFC 2119</xref>.</t>
    </note>
  </front>

  <middle>
    <section title="Introduction">
      <t>For a long time, key storage applications with high performance
      requirements have relied mainly on Fibre Channel (FC) networks. As
      transmission rates have increased, the storage medium has evolved from
      HDDs to solid-state storage, and the access protocol has evolved from
      SCSI to NVMe. The emergence of new NVMe technologies brings new
      opportunities.</t>

      <t>Ethernet-based NVMe is the implementation of NVMe over Fabrics that
      best fits NVMe semantics. It surpasses FC in terms of performance,
      cost, and network management, and it represents the development trend
      of future high-speed storage networks. Ethernet-based NVMe has been
      defined by NVM Express. This document optimizes network control in
      terms of ease of use, maintainability, and reliability, making
      Ethernet-based NVMe better suited to the high reliability requirements
      of key applications.</t>

      <t>The ODCC specification <xref target="ODCC-2020-05016"/> defines the
      basic specifications for NVMe over RoCEv2, and this document draws on
      that definition.</t>
    </section>

    <section title="Terminology">
      <t>NoF : NVMe over Fabrics</t>

      <t>FC : Fibre Channel</t>

      <t>NVMe : Non-Volatile Memory Express</t>
    </section>

    <section title="Reference Models">
      <t>An Ethernet-based NVMe network mainly includes three types of roles:
      an initiator (referred to as a host), a switch, and a target (referred
      to as a storage device). Initiators and targets are also referred to as
      endpoint devices. Hosts and storage devices use the Ethernet-based NVMe
      protocol to transmit data over the network to provide high-performance
      storage services.</t>

      <section title="Basic Model">
        <t><figure align="center">
            <artwork><![CDATA[               +--+       +--+   
    Host       |H1|       |H2|   
 (Initiator)   +-,+       +_.+   
                | `',   _-` |    
                |    _-`    |    
                | _-`   `', |    
   Ethernet  +----+       +----+ 
   Network   | SW |       | SW | 
             +---,+       +_.--+ 
                | `',   _-` |    
                |    `',    |    
                | _-`   `', |    
   Storage     +-`+       +`'+   
   (Target)    |S1|       |S2|   
               +--+       +--+     
     Figure 1 : Basic Model
]]></artwork>
          </figure></t>

        <t>This is the basic model for small-scale storage access networks.
        Hosts and storage devices are dual-homed to different switches.</t>

        <t>After a host or storage device connects to a switch, it registers
        its information with the switch and obtains the registration
        information of other hosts and storage devices from that switch.</t>
      </section>

      <section title="CLOS Model">
        <t><figure>
            <artwork align="center"><![CDATA[               +--+      +--+      +--+      +--+ 
   Host        |H1|      |H2|      |H3|      |H4| 
(Initiator)    +/-+      +-,+      +.-+      +/-+ 
                |         | '.   ,-`|         |   
                |         |   `',   |         |   
                |         | ,-`  '. |         |   
              +-\--+    +--`-+    +`'--+    +-\--+
              | SW |    | SW |    | SW |    | SW |
              +--,-+    +---,,    +,.--+    +-.--+
                  `.          `'.,`         .`    
                    `.   _,-'`    ``'.,   .`      
  Ethernet          +--'`+            +`-`-+      
  Network           | SW |            | SW |      
                    +--,,+            +,.,-+      
                    .`   `'.,     ,.-``   ',      
                  .`         _,-'`          `.    
              +--`-+    +--'`+    `'---+    +-`'-+
              | SW |    | SW |    | SW |    | SW |
              +-.,-+    +-..-+    +-.,-+    +-_.-+
                | '.   ,-` |        | `.,   .' |  
                |   `',    |        |    '.`   |  
                | ,-`  '.  |        | ,-`  `', |  
  Storage      +-`+      `'\+      +-`+      +`'+ 
  (Target)     |S1|      |S2|      |S3|      |S4| 
               +--+      +--+      +--+      +--+ 
               Figure 2 : CLOS Model
]]></artwork>
          </figure></t>

        <t>This model applies to relatively large-scale storage access
        networks.</t>

        <t>Hosts and storage nodes connect to different switch nodes and
        register with them. Each switch needs to flood the registration
        information it receives locally to the other switch nodes on the
        network.</t>
      </section>
    </section>

    <section title="Functional Components">
      <t>The Ethernet-based NVMe network consists of storage devices, hosts
      and switches.</t>

      <section title="Storage Device">
        <t>As the server side, storage devices provide storage access services
        for hosts. When a storage device is connected to a switch, it must
        register its storage service information with the switch and
        periodically re-advertise it so that the information remains
        valid.</t>

        <t>If a storage device is interested in information about other
        storage devices or hosts in the storage network, it may also receive
        notifications of such information from the switch.</t>

        <t><figure align="center">
            <artwork><![CDATA[  +-------+                  +------+  
  |Storage|                  |Switch|  
  +-------+                  +------+  
      |      Register Msg       |      
      | ----------------------->|      
      |                         |      
      |     Notification Msg    |      
      | <-----------------------|      
      |                         |      
      |                         |
      Figure 3 : Storage Device
]]></artwork>
          </figure></t>
      </section>

      <section title="Host">
        <t>The host is the client of the storage device. When a host connects
        to a switch, it needs to register its host information with the
        switch and periodically re-advertise it.</t>

        <t>As the client side, a host needs to quickly learn the service
        status of the storage devices that serve it. When the host receives a
        notification message from the switch indicating that a storage device
        has come online, the host may establish a connection to that storage
        device. When the host receives a notification message from the switch
        indicating that a storage device is faulty, the host needs to quickly
        disconnect from that storage device and attempt to establish a
        connection to a redundant storage device.</t>

        <t><figure>
            <artwork><![CDATA[+-------+                  +------+
|  HOST |                  |Switch|
+-------+                  +------+
    |       Register Msg      |    
    | ----------------------->|    
    |                         |    
    |     Notification Msg    |    
    | <-----------------------|    
    |                         |    
    |                         |    
     Figure 4 : Host Device
]]></artwork>
          </figure></t>
      </section>

      <section title="Network Device">
        <t>Switches manage the registration information of the hosts and
        storage devices, and monitor the network status. Switches will
        synchronize this information to the other switches in the network.</t>

        <t><figure>
            <artwork><![CDATA[+------+                  +------+
|Switch|                  |Switch|
+------+                  +------+
   |    Information Sync     |    
   | ----------------------->|    
   |                         |    
   |                         |    
   |                         |    
    Figure 5 : Network Device
]]></artwork>
          </figure></t>
      </section>
    </section>

    <section title="Procedures">
      <t/>

      <section title="IP Domain Management">
        <t>On an FCoE network, users can control access between nodes through
        zones, improving network security. Zones isolate nodes in different
        domains from each other while allowing communication within a
        domain.</t>

        <t>On the Ethernet-based NVMe network, an equivalent of FC zones is
        also needed to isolate and control services between storage devices
        and hosts. On this network, IP addresses are used as the unique
        identifiers of hosts and storage devices, and domains are used as the
        attributes of IP addresses. Hosts and storage devices in the same
        domain can access each other. Hosts and storage devices in different
        domains are isolated. Each IP address needs to be assigned to one or
        more domains. There is also a default domain: if no isolation is
        required, the IP addresses of the hosts and storage devices belong to
        the default domain. A domain may also be called a zone.</t>

        <t><figure align="center">
            <artwork><![CDATA[             _,.---.,,         ,,.--.,,            
          .'`         `'.,  .'`        `'.         
       ,-`                ,'              `\       
      /    +--------+   ,'  \     +--------+`.     
    .'     |StorageA|  /     `,   |StorageB|  \    
   /       +---,----+ /        \  +-_.-----+   \   
  /             `.,  /          ,_-`            \  
  '                '/         _-\                , 
 |                  |`',   _-`   |               | 
/                   / +-`-`--+   \               \ 
|                  |  |Switch|    |               |
|                  |  +- .-,,+    |               |
|                  |  ,'` |  '.   |               |
|                  |-`    |    `',|               |
|                .'|      |       |.,             |
 ,            ,-`   \     |      /   ',          / 
 |     +-----`-+    | +---\---+  |   +-`'----+   | 
  ,    | HostA |    \ | HostB | /    | HostC |   ` 
  \    +-------+     \+-------+ `    +-------+  /  
   \                  \        /               /   
    `.                 \      '               /    
      \                 `,  ,'               `     
       `.     Zone1       `.    Zone2      ,'      
         `'.,         _.-`  '.,        _.'`        
             `'''--''`         `''--''`            
    
             Figure 6 : Zone Management]]></artwork>
          </figure>As shown in the figure above, HostA and StorageA belong to
        Zone1, HostC and StorageB belong to Zone2, and HostB belongs to Zone1
        and Zone2.</t>

        <t>HostA can access StorageA but not StorageB. HostC can access
        StorageB but not StorageA. Because HostB belongs to both Zone1 and
        Zone2, HostB can access StorageA in Zone1 and StorageB in Zone2.</t>
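
        <t>The access rule illustrated by Figure 6 can be sketched as a
        set-intersection check. The following Python fragment is only an
        illustration under stated assumptions: the table layout, the
        function names, and the use of node names in place of IP addresses
        are hypothetical and are not defined by this document.</t>

        <t><figure>
            <artwork><![CDATA[
```python
# Hypothetical zone table mirroring Figure 6. Node names stand in
# for the IP addresses that identify hosts and storage devices.
ZONES = {
    "StorageA": {"Zone1"},
    "StorageB": {"Zone2"},
    "HostA": {"Zone1"},
    "HostB": {"Zone1", "Zone2"},
    "HostC": {"Zone2"},
}
DEFAULT_ZONE = "default"  # assumed name for the default domain


def zones_of(node):
    # A node without an explicit assignment is placed in the
    # default zone.
    return ZONES.get(node) or {DEFAULT_ZONE}


def can_access(a, b):
    # Two nodes may communicate when they share at least one zone.
    return bool(zones_of(a) & zones_of(b))
```
]]></artwork>
          </figure></t>

        <t>With this table, can_access("HostB", "StorageA") and
        can_access("HostB", "StorageB") both hold, while
        can_access("HostA", "StorageB") does not.</t>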
      </section>

      <section title="Network Deployment">
        <t>The NoF network uses standard Ethernet technology, and the
        typical deployment model is the CLOS architecture. Network deployments
        typically use existing IP technologies. For example, OSPF is
        usually deployed as the underlay protocol.</t>
      </section>

      <section title="Storage and Host Access">
        <t>Hosts and storage devices are connected to the Ethernet network.
        The administrator assigns access IP addresses to the hosts and storage
        devices. In most scenarios, routes to these addresses can be
        advertised through the underlay protocol. In addition, after hosts and
        storage devices go online, they need to register their information
        with the switches. It is recommended that registration messages be
        carried over LLDP.</t>

        <t>The registration information includes the IP address type, whether
        to subscribe to host or storage device information changes, device
        role, service protocol type and version number, protocol service port
        number, protocol identifier, etc.</t>
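
        <t>As an illustration only, the registration items listed above
        could be held in a record such as the following Python sketch. The
        class name, field names, and validation rules are assumptions made
        for this example; this document does not define an encoding for the
        registration message.</t>

        <t><figure>
            <artwork><![CDATA[
```python
import ipaddress
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    HOST = "host"
    STORAGE = "storage"


@dataclass(frozen=True)
class Registration:
    # Field names are illustrative; the draft lists the items but
    # does not define how they are encoded.
    ip_address: str
    subscribe: bool          # subscribe to host/storage changes?
    role: Role
    protocol_type: str       # service protocol, e.g. "roce-v2"
    protocol_version: str
    service_port: int        # protocol service port number
    protocol_id: str         # protocol identifier (opaque here)

    def __post_init__(self):
        # Reject malformed addresses and out-of-range ports early.
        ipaddress.ip_address(self.ip_address)
        if not 0 < self.service_port < 65536:
            raise ValueError("service_port out of range")

    @property
    def ip_address_type(self):
        # Derive the IP address type ("ipv4" or "ipv6").
        return f"ipv{ipaddress.ip_address(self.ip_address).version}"
```
]]></artwork>
          </figure></t>

        <t>For example, a storage device might be described as
        Registration("192.0.2.10", True, Role.STORAGE, "roce-v2", "1.0",
        4420, "example-id"), where every value is a placeholder chosen for
        this example.</t>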

        <t>The switch receives and saves the registration information of hosts
        and storage devices. For a host or storage device that subscribes to
        host and storage device information changes, the switch also needs to
        advertise the collected registration information to that subscriber.
        The information to be advertised includes the device status, the
        reason for the device status change, and device attachment
        information. When advertising subscribed information, the switch must
        ensure that a subscriber receives only the registration information of
        the domains to which that subscriber belongs. It is recommended to use
        a new protocol to implement this notification message.</t>
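
        <t>The domain filtering rule described above can be sketched as
        follows. This Python fragment is a minimal illustration, not a
        protocol definition: the Switch class, its method names, the
        "online" status string, and the "default" domain name are all
        hypothetical, and the domain table is assumed to be already present
        on the switch.</t>

        <t><figure>
            <artwork><![CDATA[
```python
class Switch:
    def __init__(self, domains):
        # domains: node IP -> set of domain names, assumed to be
        # configured on or synchronized to this switch beforehand.
        self.domains = domains
        self.registry = {}        # node IP -> registration info
        self.subscribers = set()  # node IPs that asked for updates

    def _domains_of(self, ip):
        # Nodes without an assignment fall into the default domain.
        return self.domains.get(ip) or {"default"}

    def _share_domain(self, a, b):
        return bool(self._domains_of(a) & self._domains_of(b))

    def register(self, ip, info, subscribe=False):
        """Store a registration and return notifications to send.

        The result maps each subscriber IP to the (ip, info, status)
        tuple it should receive. Only subscribers that share a
        domain with the registering node are included.
        """
        self.registry[ip] = info
        if subscribe:
            self.subscribers.add(ip)
        return {
            sub: (ip, info, "online")
            for sub in self.subscribers
            if sub != ip and self._share_domain(sub, ip)
        }
```
]]></artwork>
          </figure></t>

        <t>In this sketch, a subscribed host in Zone1 is notified when a
        storage device in Zone1 registers, while a subscribed host that
        belongs only to Zone2 receives nothing.</t>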
      </section>

      <section title="NoF Information Advertisement">
        <t>Users assign domains to hosts and storage devices. The domain
        information must be available on all access switches in the storage
        network. It can be configured on each access switch, or it can be
        configured on some switches and then synchronized to all other access
        switches throughout the storage network.</t>

        <t>In addition, the local host and storage device registration
        information stored on each access switch needs to be synchronized
        across the entire switch network so that hosts and storage devices
        attached to other access switches can obtain the information.</t>

        <t>The synchronized information about hosts and storage devices is
        application-layer information. A new protocol should be defined to
        implement this synchronization.</t>

        <t><figure align="center">
            <artwork><![CDATA[+-------+           +----+      +------+      +----+      +-------+
|  HOST |-----------|TOR1|------|Spine1|------|TOR3|------|Storage|
+---/---+           +-/--+      +--/---+      +-/--+      +---/---+
    |---------------->|            |            |<------------|    
    |  Register Msg   |----------->|<-----------| Register Msg|    
    |                 |<-----------|----------->|             |    
    |<----------------|  Info Sync |  Info Sync |             |    
    |Notification Msg |            |            |             |    
    |                 |            |            |             |    
            Figure 7 : Information Advertisement
]]></artwork>
          </figure></t>
      </section>
    </section>

    <section title="Reliability Considerations">
      <t/>

      <section title="Storage Failure">
        <t>When a storage device is faulty, the access switch detects the
        fault and propagates a fault notification across the network. After
        receiving the notification, a host that subscribes to the storage
        device can switch to another storage device. The switchover is
        performed by the host; the network needs to notify the host of the
        fault quickly.</t>
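
        <t>The host-side switchover can be sketched as a small state
        machine driven by switch notifications. The following Python
        fragment is an illustration under stated assumptions: the class
        name, the status strings, and the preference-ordered candidate list
        are hypothetical, and connection setup and teardown are reduced to
        updating a single variable.</t>

        <t><figure>
            <artwork><![CDATA[
```python
class HostPathSelector:
    def __init__(self, candidates):
        # candidates: storage IPs in preference order, assumed to
        # be redundant devices offering the same service.
        self.candidates = list(candidates)
        self.online = set()
        self.active = None

    def on_notification(self, storage_ip, status):
        """Apply a switch notification; return the active target."""
        if status == "online":
            self.online.add(storage_ip)
        else:
            # Any other status is treated as a fault here.
            self.online.discard(storage_ip)
            if self.active == storage_ip:
                self.active = None  # disconnect from faulty device
        if self.active is None:
            # Connect to the first preferred device that is online.
            self.active = next(
                (ip for ip in self.candidates if ip in self.online),
                None,
            )
        return self.active
```
]]></artwork>
          </figure></t>

        <t>In this sketch the host keeps its current connection when
        another device comes online and moves only when the active device is
        reported faulty, which is one possible policy among several.</t>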
      </section>

      <section title="Host Failure">
        <t>When a host is faulty, the access switch detects the fault and
        floods a fault notification across the network. Whether a host or
        storage device subscribes to the fault status of a specified host is
        implementation-specific.</t>
      </section>

      <section title="Access Link Failure">
        <t>When an access link is faulty, the access switch detects the fault
        and propagates a fault notification across the network. After
        receiving the notification, a host that subscribes to the affected
        storage device can switch to another storage device.</t>

        <t>BFD or other fast detection technologies can be used to accelerate
        fault detection.</t>
      </section>

      <section title="Network Link Failure">
        <t>ECMP or redundant link protection is usually deployed to protect
        against this failure.</t>

        <t>When multiple links fail on the network side, the switch network
        may be split. In each of the resulting partitions, hosts receive the
        corresponding notifications and handle their connections to the
        storage devices accordingly.</t>
      </section>

      <section title="Network Device Failure">
        <t>A network device failure is equivalent to a network link failure,
        an access link failure, or both.</t>
      </section>
    </section>

    <section anchor="Security" title="Security Considerations">
      <t>TBD</t>
    </section>

    <section anchor="IANA" title="IANA Considerations">
      <t>This document makes no request of IANA.</t>
    </section>

    <section anchor="Acknowledgements" title="Acknowledgements">
      <t>NA</t>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include="reference.RFC.2119"?>
    </references>

    <references title="References">
      <reference anchor="ODCC-2020-05016">
        <front>
          <title>NVMe over RoCEv2 Network Control Optimization Technical
          Requirements and Test Specifications</title>

          <author>
            <organization>Open Data Center Committee</organization>
          </author>

          <date year="2020"/>
        </front>
      </reference>
    </references>
  </back>
</rfc>
