<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?xml-stylesheet type="text/xsl" href="rfc2629.xslt" ?>
<?rfc toc="yes"?>
<?rfc tocompact="yes"?>
<?rfc tocdepth="3"?>
<?rfc tocindent="yes"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes"?>
<?rfc comments="yes"?>
<?rfc inline="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<rfc category="std" docName="draft-talpey-rdma-commit-02" ipr="trust200902"
     submissionType="IETF" updates="5040 7306" xml:lang="en">
  <front>
    <title abbrev="RDMA Placement Extensions">RDMA Extensions for Enhanced
    Memory Placement</title>

    <author fullname="Tom Talpey" initials="T." surname="Talpey">
      <organization>Unaffiliated</organization>

      <address>
        <postal>
          <street/>

          <city/>

          <region/>

          <code/>

          <country/>
        </postal>

        <phone/>

        <facsimile/>

        <email>tom@talpey.com</email>

        <uri/>
      </address>
    </author>

    <date day="25" month="January" year="2023"/>

    <area>Transport</area>

    <workgroup>NFSv4</workgroup>

    <keyword>RDMA</keyword>

    <keyword>Persistent Memory</keyword>

    <abstract>
      <t>This document specifies extensions to RDMA (Remote Direct Memory
      Access) protocols to provide capabilities in support of enhanced
      remotely-directed data placement on memory-addressable devices,
      including persistent memory. The extensions include new operations
      supporting remote commitment to persistence of remotely-managed buffers,
      which can provide enhanced guarantees and improve performance for
      low-latency storage applications, and to the visibility of such buffers
      in support of remote shared memory semantics. This document updates RFC
      5040 (Remote Direct Memory Access Protocol (RDMAP)) and updates RFC 7306
      (RDMA Protocol Extensions).</t>
    </abstract>

    <note title="Requirements Language">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in <xref
      target="RFC2119"/>.</t>
    </note>
  </front>

  <middle>
    <section title="Introduction">
      <t>The RDMA Protocol (RDMAP) <xref target="RFC5040"/> and RDMA Protocol
      Extensions (RDMAPEXT) <xref target="RFC7306"/> provide capabilities for
      secure, zero copy data communications that preserve memory protection
      semantics, enabling more efficient network protocol implementations. The
      RDMA Protocol is part of the iWARP family of specifications which also
      include the Direct Data Placement Protocol (DDP) <xref
      target="RFC5041"/>, and others as described in the relevant documents.
      For additional background on RDMA Protocol applicability, see
      "Applicability of Remote Direct Memory Access Protocol (RDMA) and Direct
      Data Placement Protocol (DDP)" <xref target="RFC5045"/>.</t>

      <t>RDMA protocols are enjoying good success in improving the performance
      of remote storage and network shared memory access, and have been
      well-suited to semantics and latencies of existing storage solutions.
      However, new storage solutions are emerging with much lower latencies,
      which, in combination with the ever-increasing speed of network
      interconnects, are driving new workloads and new performance
      requirements. Also, storage programming paradigms <xref
      target="SNIANVMP"/> are placing new requirements on the remote storage
      layers, further reducing latency tolerances. Overcoming these
      latencies, and providing the means to
      achieve persistence and/or visibility without invoking upper layers and
      remote CPUs for each such request, are the motivators for the extensions
      in this document.</t>

      <t>This document specifies the following extensions to the RDMA Protocol
      (RDMAP) and its local memory ecosystem:</t>

      <t><list style="symbols">
          <t>Flush - support for RDMA requests and responses with enhanced
          placement semantics.</t>

          <t>Atomic Write - support for writing certain data elements into
          memory in an atomically visible fashion.</t>

          <t>Verify - support for validating the contents of remote memory,
          through use of integrity signatures.</t>

          <t>Enhanced memory registration semantics in support of persistence
          and visibility.</t>
        </list>The extensions defined in this document do not require the
      RDMAP version to change.</t>

      <section title="Glossary">
        <t>This document is an extension of RFC 5040 and RFC 7306, and key
        words are additionally defined in the glossaries of the referenced
        documents.</t>

        <t>The following additional terms are used in this document as
        defined.<list style="hanging">
            <t hangText="Flush:">The submitting of previously written data
            from volatile intermediate locations for subsequent placement, in
            a persistent and/or globally visible fashion.</t>

            <t hangText="Persistent:">The property that data is present,
            readable and remains stable after recovery from a power failure or
            other fatal error in an upper layer or hardware. <eref
            target="https://en.wikipedia.org/wiki/Durability_(database_systems)">Durability</eref>,
            <eref
            target="https://en.wikipedia.org/wiki/Disk_buffer#Cache_control_from_the_host">Cache
            control</eref>, <xref target="SCSI"/>.</t>

            <t hangText="Globally Visible:">The property of data being
            available for reading consistently by all processing elements on a
            system. Global visibility and persistence are not necessarily
            causally related; either one may precede the other, or they may
            take effect simultaneously, depending on the architecture of the
            platform.</t>

            <t hangText="ULP:">Upper-Layer Protocol, as defined in RFC 5040
            and RFC 7306.</t>
          </list></t>
      </section>

      <section title="Problem Statement">
        <t>RDMA is widely deployed in support of storage and shared memory
        over increasingly low-latency and high-bandwidth networks. The state
        of the art today yields end-to-end network latencies on the order of
        one to two microseconds for message transfer, and bandwidths exceeding
        100 gigabit/s. These bandwidths are expected to increase over time,
        with latencies decreasing as a direct result, constrained of course by
        certain laws of physics.</t>

        <t>In storage, another trend is emerging - greatly reduced latency of
        persistently storing data. While best-of-class Hard Disk Drives (HDDs)
        have delivered average latencies of several milliseconds for many
        years, Solid State Disks (SSDs) have improved this by one to two
        orders of magnitude. Technologies such as NVM Express <eref
        target="https://www.nvmexpress.org">NVMe</eref> and Compute Express
        Link <eref target="https://www.computeexpresslink.org">CXL</eref>
        yield even higher-performing results by eliminating the traditional
        storage interconnect. The latest technologies providing memory-based
        persistence, such as Nonvolatile Memory DIMM <eref
        target="https://www.jedec.org">NVDIMM-N</eref>, places storage-like
        semantics directly on the memory bus, reducing latency to less than a
        microsecond and increasing bandwidth to potentially many tens of
        gigabyte/s.</t>

        <t>RDMA protocols, in turn, are used for many storage protocols,
        including NFS/RDMA <xref target="RFC8881"/> <xref target="RFC8166"/>
        <xref target="RFC8267"/>, SMB3 over SMBDirect <xref target="MS-SMB2"/>
        <xref target="MS-SMBD"/> and iSER <xref target="RFC7145"/>, to name
        just a few. These protocols allow storage and computing peers to take
        full advantage of these highly performant networks and storage
        technologies to achieve remarkable throughput, while minimizing the
        CPU overhead needed to drive their workloads. This leaves more
        computing resources available for the applications, which in turn can
        scale to even greater levels. Within the context of Cloud-based
        environments, and through scale-out approaches, this can directly
        reduce the number of servers that need to be deployed, making such
        attributes highly compelling.</t>

        <t>However, limiting factors come into play when deploying ultra-low
        latency storage in such environments:<list style="symbols">
            <t>The latency of the fabric, and of the necessary message
            exchanges to ensure reliable transfer is now higher than that of
            the storage itself.</t>

            <t>The requirement that storage be resilient to failure requires
            that multiple copies be sent for processing in multiple locations,
            adding extra hops which increase the latency and computing demand
            placed on implementing the resiliency.</t>

            <t>Processing is required at the receiver in order to ensure that
            the storage data has reached a persistent state, and acknowledge
            the transfer so that the sender can proceed.</t>

            <t>Typical latency optimizations, such as polling a receive memory
            location for a key that determines when the data arrives, can
            create both correctness and security issues because this approach
            requires that the memory remain open to writes; the buffer may
            therefore not remain stable after the application determines that
            the IO has completed. This is of particular concern in
            security-conscious environments.</t>
          </list></t>

        <t>The first issue is fundamental, and due to the nature of serial,
        shared communication channels, presents challenges that are not easily
        bypassed. Communication cannot exceed the speed of light, for example,
        and serialization/deserialization plus packet processing adds further
        delay. Therefore, a solution which offloads and reduces the overhead
        of exchanges which encounter such latencies is highly desirable.</t>

        <t>The second issue requires that outbound transfers be made as
        efficient as possible, so that replication of data can be done with
        minimal overhead and delay (latency). A reliable "push" transfer
        method is highly suited to this.</t>

        <t>The third issue requires that the transfer be performed without an
        upper-layer exchange required. Within security constraints, RDMA
        transfers, arbitrated only by lower layers into well-defined and
        pre-advertised buffers, present an ideal solution.</t>

        <t>The fourth issue requires significant CPU activity, consuming power
        and valuable resources, and may not be guaranteed by the RDMA
        protocols, which themselves impose no requirement on the order in
        which certain received data is placed or becomes visible; such
        guarantees are made only after signaling a completion to upper
        layers.</t>

        <t>The RDMAP and DDP protocols, together, provide data transfer
        semantics with certain consistency guarantees to both the sender and
        receiver. Delivery of data transferred by these protocols is said to
        have been Placed in destination buffers upon Completion of specific
        operations. In general, these guarantees are limited to the visibility
        of the transferred data within the hardware domain of the receiver
        (data sink). Significantly, the guarantees do not necessarily extend
        to the actual storage of the data in memory cells, nor do they convey
        any guarantee that the data integrity is intact, nor that it remains
        present after a catastrophic failure. These guarantees may be provided
        by upper layers, such as the ones mentioned, after processing the
        Completions, and performing the necessary operations.</t>

        <t>The NFSv4.1 and SMB3 protocols are file oriented, and the iSER
        protocol is block oriented; all are used extensively for providing
        access to hard disk and solid state flash drive media. Such devices
        incur certain latencies in their operation, from the
        millisecond-order rotational and seek delays of rotating disk
        hardware to the 100-microsecond-order erase/write and translation
        layers of solid state flash. These file and block protocols have
        benefited from the
        increased bandwidth, lower latency, and markedly lower CPU overhead of
        RDMA to provide excellent performance for such media, approximately
        30-50 microseconds for 4 kilobyte writes in leading
        implementations.</t>

        <t>The above storage protocols employ a "pull" model for write: the
        client, or initiator, sends an upper layer write request which
        contains an RDMA reference to the data to be written. The upper layer
        protocols encode this as one or more memory regions. The server, or
        target, then prepares the request for local write execution, and
        "pulls" the data with one or more RDMA Read operations. After
        processing the write, a response is returned. There are therefore two
        or more roundtrips on the RDMA network in processing the request. This
        is desirable for several reasons, as described in the relevant
        specifications, but it incurs latency. However, since, as mentioned,
        network latency has historically been much less than storage
        processing time, this has been a sound approach.</t>

        <t>Today, a new class of Storage Class Memory is emerging, in the form
        of Non-Volatile DIMM and NVM Express devices, among others. These
        devices are characterized by further reduced latencies, in the
        10-microsecond-order range for NVMe, and sub-microsecond for NVDIMM.
        The 30-50 microsecond write latencies of the above file and block
        protocols are therefore from one to two orders of magnitude larger
        than the storage media! The client/server processing model of
        traditional storage protocols is no longer amortizable at an
        acceptable level into the overall latency of storage access, because
        it requires request/response communication, CPU processing by both
        the server and client (or target and initiator), and interrupts to
        signal such requests.</t>

        <t>Another important property of certain such devices is the
        requirement for explicitly requesting that the data written to them be
        made persistent. Because persistence requires that data be committed
        to memory cells, it is a relatively expensive operation in time (and
        power), and in order to maintain the highest device throughput and
        most efficient operation, the device persistence operation is
        explicit. When the data is written by an application on the local
        platform, this responsibility naturally falls to that application (and
        the CPU on which it runs). However, when data is written by current
        RDMA protocols, no such semantic is provided. As a result, upper layer
        stacks, and the target CPU, must be invoked to perform it, adding
        overhead and latency that is now highly undesirable.</t>

        <t>When such devices are deployed as the remote server, or target,
        storage, and when such a persistence can be requested and guaranteed
        remotely, a new transfer model can be considered. Instead of relying
        on the server, or target, to perform requested processing and to reply
        after the data is persistently stored, it becomes desirable for the
        client, or initiator, to perform these operations itself. By altering
        the transfer models to support a "push mode", that is, by allowing the
        requestor to push data with RDMA Write and subsequently make it
        persistent, a full round trip can be eliminated from the operation.
        Additionally, the signaling, and processing overheads at the remote
        peer (server or target) can be eliminated. This becomes an extremely
        compelling latency advantage.</t>
      </section>

      <section title="Memory Placement Semantics">
        <t>In DDP (RFC 5041), data is considered "Placed" when it is submitted
        by the RDMA Network Interface Controller ("RNIC") to the system. This
        operation is commonly an I/O bus write, e.g. via PCI. The submission
        is ordered, but there is no confirmation or necessary guarantee that
        the data has yet reached its destination, nor become visible to other
        devices in the system. The data will eventually do so, but
        possibly at a later time. The act of "Delivery", on the other hand,
        offers a stronger semantic, guaranteeing that not only have prior
        operations been executed, but also guaranteeing any data is in a
        consistent and visible state. Generally however, such "Delivery"
        requires raising a completion event, necessarily involving the host
        CPU. This is a relatively expensive, and latency-bound operation. Some
        systems perform "DMA snooping" to provide a somewhat higher guarantee
        of visibility after delivery and without CPU intervention, but others
        do not. The RDMA requirements remain the same; therefore, upper layers
        may make no broad assumption. Such platform behaviors, in any case, do
        not address persistence.</t>

        <t>The extensions in this document primarily address a new "flush to
        persistence" RDMA operation. This operation, when invoked by a
        connected remote RDMA peer, can be used to request that
        previously-written data be moved into the persistent storage domain.
        This may be a simple flush to a memory cell, or it may require
        movement across one or more busses within the target platform,
        followed by an explicit persistence operation. Such matters are beyond
        the scope of this specification, which provides only the mechanism to
        request the operation, and to signal its successful completion.</t>

        <t>In a similar vein, many applications desire to achieve visibility
        of remotely-provided data, and to do so with minimum latency. One
        example of such applications is "network shared memory", where
        publish-subscribe access to network-accessible buffers is shared by
        multiple peers, possibly from applications on the platform hosting the
        buffers, and others via network connection. There may therefore be
        multiple local devices accessing the buffer - for example, CPUs, and
        other RNICs. The topology of the hosting platform may be complex, with
        multiple I/O, memory, and interconnect busses, requiring multiple
        intervening steps to process arriving data.</t>

        <t>To address this, the extensions additionally provide a "flush to
        global visibility", which requires the RNIC to perform
        platform-dependent processing in order to guarantee that the contents
        of a specific range are visible for all devices that access them. On
        certain highly-consistent platforms, this may be provided natively. On
        others, it may require platform-specific processing, to flush data
        from volatile caches, invalidate stale cached data from others, and to
        drain pending operations. Ideally, but not universally, this
        processing will take place without CPU intervention. With a global
        visibility guarantee, network shared memory and similar applications
        will be assured of broader compatibility and lower latency across all
        hardware platforms.</t>

        <t>Subsequently, many applications will seek to obtain a guarantee
        that the integrity of the data has been preserved after it has been
        flushed to a persistent or globally visible state. Such verification
        may be requested at any time. Unlike traditional block-based storage,
        the data
        provided by RDMA is neither structured nor segmented, and is therefore
        not self-describing with respect to integrity. Only the originator of
        the data, or an upper layer, is able to determine that. Applications
        requiring such guarantees may include filesystems, database
        logwriters, replication agents, etc.</t>

        <t>To provide an additional integrity guarantee, a new operation is
        provided by the extension, which will calculate, and optionally
        compare an integrity value for an arbitrary region. The operation is
        ordered with respect to preceding and subsequent operations, allowing
        for a request pipeline without "bubbles" - roundtrip delays to
        ascertain success or failure.</t>

        <t>Finally, once data has been transmitted and directly placed by
        RDMA, flushed to its final state, and its integrity verified,
        applications will seek to commit the result with a transaction
        semantic. The previous application examples apply here, logwriters and
        replication are key, and both are highly latency- and
        integrity-sensitive. They desire a pipelined transaction marker which
        is placed atomically to indicate the validity of the preceding
        operations. They may require that the data be in a persistent and/or
        globally visible state, before placing this marker.</t>
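        <t>As an illustration only, the following sketch shows how an upper
        layer might post such a pipelined transaction through a hypothetical
        verbs-style C interface. The queue pair type and the post_*()
        functions are placeholders for whatever local interface an
        implementation provides; they are not part of this
        specification.</t>

        <figure>
          <artwork><![CDATA[
/* Illustrative sketch only: the queue pair type and the post_*()
 * functions below are hypothetical placeholders for a local
 * ("verbs") interface, not a defined API. */
#include <stdint.h>
#include <stddef.h>

struct qp;   /* local queue pair handle (placeholder) */

int post_rdma_write(struct qp *qp, uint32_t stag, uint64_t to,
                    const void *buf, size_t len);
int post_rdma_flush(struct qp *qp, uint32_t stag, uint64_t to,
                    uint32_t len, uint32_t disposition);
int post_rdma_verify(struct qp *qp, uint32_t stag, uint64_t to,
                     uint32_t len, const void *expected_hash,
                     size_t hash_len);
int post_atomic_write(struct qp *qp, uint32_t stag, uint64_t to,
                      uint64_t value);

/* Post the WRITE - FLUSH - VERIFY - ATOMIC WRITE sequence back to
 * back.  The responder executes the queued operations in order, so
 * no intermediate round trips are needed; a completion for the
 * final Atomic Write implies all earlier operations succeeded,
 * while any failure terminates the connection instead. */
int push_transaction(struct qp *qp, uint32_t data_stag,
                     uint64_t data_to, const void *record,
                     size_t record_len, const void *record_hash,
                     size_t hash_len, uint32_t ptr_stag,
                     uint64_t ptr_to, uint64_t new_pointer)
{
    int rc;

    rc = post_rdma_write(qp, data_stag, data_to, record, record_len);
    if (rc != 0)
        return rc;
    rc = post_rdma_flush(qp, data_stag, data_to, (uint32_t)record_len,
                         0x01 /* flush to persistence */);
    if (rc != 0)
        return rc;
    rc = post_rdma_verify(qp, data_stag, data_to, (uint32_t)record_len,
                          record_hash, hash_len);
    if (rc != 0)
        return rc;
    return post_atomic_write(qp, ptr_stag, ptr_to, new_pointer);
}
]]></artwork>
        </figure>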

        <t>Together, the above considerations argue for a new "one sided" transfer
        model supporting extended remote placement guarantees, provided by the
        RDMA transport, and used directly by upper layers on a data source, to
        control persistent storage of data on a remote data sink without
        requiring its remote interaction. Existing, or new, upper layers can
        use such a model in several ways, and evolutionary steps to support
        persistence guarantees without required protocol changes are explored
        in the remainder of this document.</t>

        <t>Note that it is intended that the requirements and concepts of
        these extensions can be applied to any similar RDMA protocol, and
        that a compatible model can be applied broadly.</t>
      </section>

      <section title="Requirements for RDMA Flush">
        <t>The fundamental new requirement for extending RDMA protocols is to
        define the property of <spanx style="strong">persistence</spanx>. This
        new property is to be expressed by new operations to extend Placement
        as defined in existing RDMA protocols. The RFC 5040 protocols specify
        that Placement means that the data is visible consistently within a
        platform-defined domain on which the buffer resides, and to remote
        peers across the network via RDMA to an adapter within the domain. In
        modern hardware designs, this buffer can reside in memory, or also in
        cache, if that cache is part of the hardware consistency domain. Many
        designs use such caches extensively to improve performance of local
        access.</t>

        <t>Persistence, by contrast, requires that the buffer contents be
        preserved across catastrophic failures. While it is possible for
        caches to be persistent, they are typically not, or they provide the
        persistence guarantee for a limited period of time, for example, while
        backup power is applied. Efficient designs, in fact, lead most
        implementations to simply make them volatile. In these designs, an
        explicit flush operation (writing dirty data from caches), often
        followed by a second explicit operation (ensuring the data has
        reached its destination and is in a persistent state), is required to
        provide this
        guarantee. In some platforms, these operations may be combined.</t>

        <t>For the RDMA protocol to remotely provide such guarantees, an
        extension is required. Note that this does not imply support for
        persistence or global visibility by the RDMA hardware implementation
        itself; it is entirely acceptable for the RDMA implementation to
        request these from another subsystem, for example, by requesting that
        the CPU perform the flush to persistence, or that the destination
        memory device do so. But, in an ideal implementation, the RDMA
        implementation will be able to act as a master and provide these
        services without further work requests local to the data sink. Note,
        it is possible that different buffers will require different
        processing, for example one buffer may reside in persistent memory,
        while another may place its blocks in a storage device. Many such
        memory-addressable designs are entering the market, from NVDIMM to
        NVMe and even to SSDs and hard drives.</t>

        <t>Therefore, any local memory registration primitive will
        additionally be enhanced to specify new optional placement
        attributes, along with any local information required to achieve
        them. These attributes do not explicitly traverse the network - as
        with existing local memory registration, the region is fully
        described by a { STag, Tagged offset, length } descriptor, and such
        aspects as the local physical address, memory type, protection
        (remote read, remote write, protection key), etc., are not
        instantiated in the protocol. Indeed, each local RDMA implementation
        maintains these and strictly performs processing based on them, but
        they are not exposed to the peer. Such
        considerations are discussed in the RDMAP security model <xref
        target="RFC5042"/>.</t>

        <t>Note, additionally, that by describing such attributes only through
        the presence of an optional property of each region, it is possible to
        describe regions referring to the same physical segment as a
        combination of attributes, in order to enable efficient processing.
        Processing of writes to regions marked as persistent, globally
        visible, or neither ("ordinary" memory) may be optimized
        appropriately. For example, such memory can be registered multiple
        times, yielding multiple different Steering Tags which nonetheless
        merge data in the underlying memory. This can be used by upper layers
        to enable bulk-type processing with low overhead, by assigning
        specific attributes through use of the Steering Tag.</t>
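        <t>A brief sketch of such a multiple registration follows, using a
        hypothetical local registration call; reg_mr() and the flag names
        are placeholders, and the attributes themselves never appear on the
        wire.</t>

        <figure>
          <artwork><![CDATA[
/* Sketch only: register the same buffer twice, yielding two
 * Steering Tags that merge data into the same underlying memory
 * but carry different placement attributes. */
#include <stdint.h>
#include <stddef.h>

#define MR_REMOTE_WRITE  0x1   /* existing remote write permission    */
#define MR_FLUSHABLE     0x2   /* new, local-only placement attribute */

uint32_t reg_mr(void *addr, size_t len, unsigned flags); /* returns STag */

void register_both(void *pmem_buf, size_t len,
                   uint32_t *stag_ordinary, uint32_t *stag_flushable)
{
    /* RDMA Writes via this STag are treated as ordinary memory. */
    *stag_ordinary  = reg_mr(pmem_buf, len, MR_REMOTE_WRITE);

    /* Writes via this STag may later be made persistent by an
     * RDMA Flush naming this STag. */
    *stag_flushable = reg_mr(pmem_buf, len,
                             MR_REMOTE_WRITE | MR_FLUSHABLE);
}
]]></artwork>
        </figure>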

        <t>When the underlying region is marked as persistent, the placement
        of data into persistence is guaranteed only after a successful RDMA
        Flush directed to the Steering Tag which holds the persistent
        attribute (i.e. any volatile buffering between the network and the
        underlying storage has been flushed, and all appropriate platform- and
        device-specific steps have been performed).</t>

        <t>To enable the maximum generality, the RDMA Flush operation is
        specified to act on a set of bytes in a region, specified by a
        standard RDMA { STag, Tagged offset, length } descriptor. It is
        required that each byte of the specified segment be in the requested
        state before the response to the Flush is generated. However,
        depending on the implementation, other bytes in the region, or in
        other regions, may be acted upon as part of processing any RDMA Flush.
        In fact, any data in any buffer destined for persistent storage, may
        become persistent at any time, even if not requested explicitly. For
        example, the host system may flush cache entries due to cache
        pressure, or as part of platform housekeeping activities. Or, a simple
        and stateless approach to flushing a specific range might be for all
        data to be flushed and made persistent, system-wide. A possibly more
        efficient implementation might track previously written bytes, or
        blocks with "dirty" bytes, and flush only those to persistence. Either
        result provides the required guarantee.</t>

        <t>The RDMA Flush operation provides a response but does not return a
        status; instead, it results in an RDMA Terminate event upon failure.
        There
        are several possibilities for failure. A region permission check is
        performed first, prior to any attempt to process data. If the check is
        successful, the operation proceeds, but the RDMA Flush operation may
        fail to make the data persistent, perhaps due to a hardware failure,
        or a change in device capability (device read-only, device wear, etc).
        The device itself may support an integrity check, similar to modern
        error checking and correction (ECC) memory or media error detection on
        hard drive surfaces, which may signal failure. Or, the request may
        exceed device limits in size, or encounter a transient condition such
        as a temporary media failure. The behavior of the device itself is
        beyond
        the scope of this specification.</t>

        <t>Because the RDMA Flush involves processing on the local platform
        and the actual storage device, in addition to being ordered with
        certain other RDMA operations, it is expected to take a certain time
        to be performed. For these reasons, the operation is required to be
        defined as a "queued" operation on the RDMA device, and therefore also
        the protocol. The RDMA protocol supports RDMA Read (RFC 5040) and
        Atomic (RFC 7306) in such a fashion. The iWARP family defines a "queue
        number" with queue-specific processing that is naturally suited for
        this. Queuing provides a convenient means for supporting ordering
        among other operations, and for flow control. Flow control for RDMA
        Reads and Atomics on any given Queue Pair shares incoming and
        outgoing crediting depths ("IRD/ORD"); operations in this
        specification share these values and do not define their own separate
        values.</t>

        <section title="Non-Requirements">
          <t>The extension does not include a "RDMA Write to persistence",
          that is, a modifier on the existing RDMA Write operation. While it
          might seem a logical approach, several issues become apparent:</t>

          <t><list style="symbols">
              <t>The existing RDMA Write operation is a tagged DDP request
              which is unacknowledged at the DDP layer (RFC 5042). Requiring
              it to provide an indication of remote persistence would require
              it to have an acknowledgement, which would be an undesirable
              extension to the existing defined operation.</t>

              <t>Such an operation would require flow control and therefore
              also buffering on the responding peer. Existing RDMA Write
              semantics are not flow controlled and, as tagged transfers, are
              by design zero-copy, i.e. unbuffered. Requiring these would
              introduce potential pipeline stalls and increase implementation
              complexity in a critical performance path.</t>

              <t>The operation at the requesting peer would stall until the
              acknowledgement of completion, significantly changing the
              semantic of the existing operation, and complicating software by
              blocking the send work queue, a significant new semantic for
              RDMA Write work requests. As each operation would be
              self-describing with respect to persistence, individual
              operations would therefore block with differing semantics and
              complicate the situation even further.</t>

              <t>Even for the possibly-common case of flushing after every
              write, it is highly undesirable to impose new optional semantics
              on an existing operation, and therefore also on the upper layer
              protocol implementation. And, the same result can be achieved by
              sending the Flush merged in the same network packet, and since
              the RDMA Write is unacknowledged while the RDMA Flush is always
              replied-to, no additional overhead is imposed on the combined
              exchange.</t>
            </list>For these reasons, it is deemed a non-requirement to extend
          the existing RDMA Write operation.</t>

          <t>Similarly, the extension does not consider the use of RDMA Read
          to implement Flush. Historically, an RDMA Read has been used by
          applications to ensure that previously written data has been
          processed by the responding RNIC and has been submitted for ordered
          Placement. However, this is inadequate for implementing the required
          RDMA Flush:</t>

          <t><list style="symbols">
              <t>RDMA Read guarantees only that previously written data has
              been Placed; it provides no guarantee that the data has
              reached its destination buffer. In practice, an RNIC satisfies
              the RDMA Read requirement by simply issuing all PCIe Writes
              prior to issuing any PCIe Reads.</t>

              <t>Such PCIe Reads must be issued by the RNIC after all such
              PCIe Writes; therefore, flushing a large region requires the RNIC
              and its attached bus to strictly order (and not cache) its
              writes, to "scoreboard" its writes, or to perform PCIe Reads to
              the entire region. The former approach is significantly complex
              and expensive, and the latter approach requires a large amount
              of PCIe and network read bandwidth, which are often unnecessary
              and expensive. The Reads, in any event, may be satisfied by
              platform-specific caches, never actually reaching the destination
              memory or other device.</t>

              <t>The RDMA Read may begin execution at any time once the
              request is fully received, queued, and the prior RDMA Write
              requirement has been satisfied. This means that the RDMA Read
              operation may not be ordered with respect to other queued
              operations, such as Verify and Atomic Write, in addition to
              other RDMA Flush operations.</t>

              <t>The RDMA Read has no specific error semantic to detect
              failure, and the response may be generated from any cached data
              in a consistently Placed state, regardless of where it may
              reside. For this reason, an RDMA Read may proceed without
              necessarily verifying that a previously ordered RDMA Flush has
              succeeded or failed.</t>

              <t>RDMA Read is heavily used by existing RDMA consumers, and
              its semantics are therefore defined by the existing
              specification. For new applications to expect an extended RDMA
              Read behavior would require an upper layer negotiation to
              determine whether the data sink platform and RNIC implement it
              appropriately; otherwise the requirement would be silently
              ignored, and therefore not met. An explicit extension, rather
              than depending on an overloaded side effect, ensures this will
              not occur.</t>
            </list>Again, for these reasons, it is deemed a non-requirement to
          reuse or extend the existing RDMA Read operation.</t>

          <t>Therefore, no changes to existing specified RDMA operations are
          proposed, and the protocol is unchanged if the extensions are not
          invoked.</t>
        </section>
      </section>

      <section title="Requirements for Atomic Write">
        <t>The persistence of data is a key property by which applications
        implement transactional behavior. Transactional applications, such as
        databases and log-based filesystems, among many others, implement a
        "two phase commit" wherein a write is made persistent, and <spanx
        style="strong">only upon success</spanx>, a validity indicator for the
        written data is set. Such semantics are challenging to provide over an
        RDMA fabric, as it exists today. The RDMA Write operation does not
        generate an acknowledgement at the RDMA layers. And, even when an RDMA
        Write is delivered, if the destination region is persistent, its data
        can be made persistent at any time, even before a Flush is requested.
        Out-of-order DDP processing, packet fragmentation, and other matters
        of scheduling transfers can introduce partial delivery and ordering
        differences. If a region is made persistent, or even globally visible,
        before such sequences are complete, significant application-layer
        inconsistencies can result. Therefore, applications may require
        fine-grained control over the placement of bytes. In current RDMA
        storage solutions, these semantics are implemented in upper layers,
        potentially with additional upper layer message signaling, and
        corresponding roundtrips and blocking behaviors.</t>

        <t>In addition to controlling placement of bytes, the ordering of such
        placement can be important. By providing an ordered relationship among
        write and flush operations, a basic transaction scenario can be
        constructed, in a way which can function with equal semantics both
        locally and remotely. In a "log-based" scenario, for example, a
        relatively large segment (log "record") is placed, and made
        persistent. Once persistence of the segment is assured, a second small
        segment (log "pointer") is written, and optionally also made
        persistent. The <spanx style="strong">visibility</spanx> of the second
        segment is used to imply the validity, and persistence, of the first.
        Any sequence of such log-operation pairs can thereby always have a
        single valid state. In case of failure, the resulting sequence of
        transactions can therefore be recovered up to and including the final
        state.</t>
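        <t>A minimal sketch of a log layout suited to this ordering is shown
        below; the structures, field names, and sizes are illustrative
        only.</t>

        <figure>
          <artwork><![CDATA[
/* Illustrative log layout.  The 64-bit, naturally aligned tail
 * pointer can be placed by a single Atomic Write once the record
 * it refers to has been flushed to persistence; its visibility
 * implies the validity (and persistence) of that record. */
#include <stdint.h>

#define LOG_RECORD_SIZE 4096

struct log_record {
    uint8_t payload[LOG_RECORD_SIZE]; /* RDMA Write, then RDMA Flush */
};

struct log_header {
    uint64_t valid_tail_offset; /* Atomic Write, only after the
                                   record is persistent; 8-byte
                                   aligned for atomic placement */
};
]]></artwork>
        </figure>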

        <t>Such semantics are typically a challenge to implement on general
        purpose hardware platforms, and a variety of application approaches
        have become common. Generally, they employ a small, well-aligned atom
        of storage for the second segment (the one used for validity). For
        example, an integer or pointer, aligned to natural memory address
        boundaries and CPU and other cache attributes, is stored using
        instructions which provide for atomic placement. Existing RDMA
        protocols, however, provide no such capability.</t>

        <t>This document specifies an Atomic Write extension, which,
        appropriately constrained, can serve to provide similar semantics. A
        small (64 bit) payload, sent in a request which is ordered with
        respect to prior RDMA Flush operations on the same stream and targeted
        at a segment which is aligned such that it can be placed in a single
        hardware operation, can be used to satisfy the previously described
        scenario.</t>

        <t>Note that the visibility of this atomically written payload can
        also serve as an indication that all prior operations have succeeded,
        enabling a highly efficient application-visible memory semaphore. This
        may benefit remote shared memory models, by providing stronger
        guarantees than previously obtainable with RDMA protocols.</t>
      </section>

      <section title="Requirements for RDMA Verify">
        <t>An additional matter remains with persistence - the integrity of
        the persistent data. Typically, storage stacks such as filesystems and
        media approaches such as SCSI <xref target="T10DIF"/> or filesystem
        integrity checks provide for block- or file-level protection of data
        at rest on storage devices. With RDMA protocols and physical memory,
        no such stacks are present. And, to add such support would introduce
        CPU processing and its inherent latency, counter to the goals of the
        remote access approach. Requiring the peer to verify by remotely
        reading the data is prohibitive in both bandwidth and latency, and
        without additional mechanisms to ensure the actual stored data is read
        (and not a copy in some volatile cache), cannot provide the necessary
        result.</t>

        <t>To address this, an integrity operation is required. The integrity
        check is initiated by the upper layer or application, which requests
        computation of the hash of a given segment of arbitrary size via an
        RDMA Verify operation targeting the RDMA segment on the responder. The
        operation optionally specifies the expected hash value. The responder
        calculates the value from the contents of the targeted RDMA segment,
        bypassing any volatile copies remaining in caches. The responder
        responds with its computed hash value, or when the optional expected
        hash value was sent and does not match the computed hash value,
        terminates the connection with an appropriate error status. Specifying
        this optional termination behavior enables a transaction to be sent as
        WRITE-FLUSH-VERIFY-ATOMICWRITE, without any pipeline bubble. The
        result (carried by the subsequently ordered Atomic Write) will not
        be marked as valid if any prior operation is terminated, and in this
        case, recovery can be initiated by the requestor immediately from the
        point of failure. On the other hand, an errorless "scrub" can be
        implemented without the optional termination behavior; the responder
        will simply return the computed hash of the contents.</t>
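        <t>The two uses are sketched below with a hypothetical posting
        interface; post_rdma_verify() and ulp_hash() are placeholder names,
        and the hash algorithm is the upper layer's choice rather than part
        of the protocol.</t>

        <figure>
          <artwork><![CDATA[
/* Sketch only: requestor-side use of RDMA Verify. */
#include <stdint.h>
#include <stddef.h>

struct qp;   /* local queue pair handle (placeholder) */

int post_rdma_verify(struct qp *qp, uint32_t stag, uint64_t to,
                     uint32_t len, const void *expected_hash,
                     size_t hash_len);

/* Hash chosen by the upper layer, computed over its local copy. */
void ulp_hash(const void *buf, size_t len,
              uint8_t *out, size_t out_len);

void verify_examples(struct qp *qp, uint32_t stag, uint64_t to,
                     const void *local_copy, uint32_t len)
{
    uint8_t expected[32];

    /* Transactional use: supply the expected value.  A mismatch at
     * the responder terminates the connection, so later queued
     * operations (e.g. Atomic Write) never take effect. */
    ulp_hash(local_copy, len, expected, sizeof(expected));
    post_rdma_verify(qp, stag, to, len, expected, sizeof(expected));

    /* "Scrub" use: no expected value; the responder simply returns
     * the hash it computed for the requestor to examine. */
    post_rdma_verify(qp, stag, to, len, NULL, 0);
}
]]></artwork>
        </figure>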

        <t>The hash algorithm is not specified by the RDMA protocol, instead
        it is left to the upper layer to select an appropriate choice based
        upon the strength, security, length, support by the RNIC, and other
        criteria. The size of the resulting hash is therefore also not
        specified by the RDMA protocol, but is dictated by the hash algorithm.
        The RDMA protocol becomes simply a transport for exchanging the
        values.</t>

        <t>It should be noted that the design of the operation, passing of the
        hash value from requestor to responder (instead of, for example,
        computing it at the responder and simply returning it), allows both
        peers to determine immediately whether the segment is considered
        valid, permitting local processing by both peers if that is not the
        case. For example, a known-bad segment can be immediately marked as
        such ("poisoned") by the responder platform, requiring recovery before
        permitting access. [cf ACPI, JEDEC, SNIA NVMP specifications]</t>
      </section>

      <section title="Local Semantics">
        <t>The new operations imply new access methods ("verbs") to local
        persistent memory which backs registrations. Registrations of memory
        which support persistence will follow all existing practices to ensure
        permission-based remote access. The RDMA protocols do not expose these
        permissions on the wire, instead they are contained in local memory
        registration semantics. Existing attributes are Remote Read and Remote
        Write, which are granted individually through local registration on
        the machine. If an RDMA Read or RDMA Write operation arrives which
        targets a segment without the appropriate attribute, the connection is
        terminated.</t>

        <t>In support of the new operations, new memory attributes are needed.
        For RDMA Flush, a new "Flushable" attribute provides permission to
        invoke the operation on memory in the region for persistence and/or
        global visibility. When registering, along with the attribute,
        additional local information can be provided to the RDMA layer such as
        the type of memory, the necessary processing to make its contents
        persistent, etc. If the attribute is requested for memory which cannot
        be persisted, this also allows the local provider to return an error
        to the upper layer, avoiding the need for the upper layer to provide
        the region to the remote peer.</t>

        <t>For RDMA Verify, the new "Verifiable" attribute provides permission
        to compute the hash of the contents of the region. Again, along with
        the attribute, additional information such as the hash algorithm for
        the region is provided to the local operation. If the attribute is
        requested for non-verifiable memory, or if the hash algorithm is not
        available, the local provider can return an error to the upper layer.
        In the case of success, the upper layer can exchange the necessary
        information with the remote peer. Note that the verification algorithm
        is not identified by the on-the-wire operation as a result.
        Establishing the choice of hash for each region is done by the local
        consumer, and each hash result is merely transported by the RDMA
        protocol. Memory can be registered under multiple regions if
        differing hashes are required; for example, unique keys may be
        provisioned to implement secure hashing. Also note that, for certain
        "reversible" hash algorithms, this may allow peers to effectively read
        the memory, therefore, the local platform may require additional read
        permissions to be associated with the Verifiable permission, when such
        algorithms are selected.</t>
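        <t>The following sketch illustrates such a registration with the new
        attributes; the structure, flag, and function names are hypothetical,
        and the hash algorithm remains a purely local parameter whose
        resulting values alone are carried by the protocol.</t>

        <figure>
          <artwork><![CDATA[
/* Sketch only: local registration with the new attributes. */
#include <stdint.h>
#include <stddef.h>

#define MR_REMOTE_WRITE  0x1
#define MR_FLUSHABLE     0x2
#define MR_VERIFIABLE    0x4

enum hash_alg { HASH_CRC64, HASH_SHA256 };  /* upper layer's choice */

struct mr_attr {
    unsigned      flags;
    enum hash_alg alg;   /* meaningful only with MR_VERIFIABLE */
};

/* Returns an STag, or 0 when registration fails, e.g. MR_FLUSHABLE
 * requested for memory that cannot be made persistent, or an
 * unsupported hash algorithm. */
uint32_t reg_mr_ext(void *addr, size_t len, const struct mr_attr *attr);

int register_flushable_verifiable(void *pmem, size_t len,
                                  uint32_t *stag_out)
{
    struct mr_attr attr = {
        .flags = MR_REMOTE_WRITE | MR_FLUSHABLE | MR_VERIFIABLE,
        .alg   = HASH_SHA256,
    };

    *stag_out = reg_mr_ext(pmem, len, &attr);
    if (*stag_out == 0)
        return -1;  /* report to the ULP; do not advertise the region */
    return 0;
}
]]></artwork>
        </figure>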

        <t>The Atomic Write operation requires no new attributes, however it
        does require the "Remote Write" attribute on the target region, as is
        true for any remotely requested write. If the Atomic Write
        additionally targets a Flushable region, the RDMA Flush is performed
        separately. It is generally not possible to achieve persistence
        atomically with placement, even locally.</t>
      </section>
    </section>

    <section title="RDMA Protocol Extensions">
      <t>This document defines the following new RDMA operations.</t>

      <t>For reference, Figure 1 depicts the format of the DDP Control and
      RDMAP Control Fields, in the style and convention of RFC 5040 and RFC
      7306:</t>

      <t>The DDP Control Field consists of the T (Tagged), L (Last), Resrv,
      and DV (DDP protocol Version) fields as defined in RFC 5041. The RDMAP
      Control Field consists of the RV (RDMA Version), Rsv, and Opcode fields
      as defined in RFC 5040. No change or extension is made to these fields
      by this specification.</t>

      <t>This specification adds values for the RDMA Opcode field to those
      specified in RFC 5040. Figure 2 defines the new values of the RDMA
      Opcode field that are used for the RDMA Messages defined in this
      specification.</t>

      <figure title="DDP Control and RDMAP Control Fields">
        <artwork><![CDATA[
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
                                +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                                |T|L| Resrv | DV| RV|R|  Opcode |
                                | | |       |   |   |s|         |
                                | | |       |   |   |v|         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     Invalidate STag                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]></artwork>
      </figure>

      <t>All RDMA Messages defined in this specification MUST carry the
      following values:<list style="symbols">
          <t>The RDMA Version (RV) field: 01b.</t>

          <t>Opcode field: Set to one of the values in Figure 2.</t>

          <t>Invalidate STag: Set to zero.</t>
        </list></t>

      <t>Figure 2 shows the appropriate Queue Number for each Opcode.</t>

      <t>Note: N/A in the figure below means Not Applicable; the STag
      (Steering Tag) and Tagged Offset DDP fields are not used.</t>

      <figure title="Additional RDMA Usage of DDP Fields">
        <artwork><![CDATA[
-------+------------+-------+------+-------+-----------+-------------
RDMA   | Message    | Tagged| STag | Queue | Invalidate| Message
Opcode | Type       | Flag  | and  | Number| STag      | Length
       |            |       | TO   |       |           | Communicated
       |            |       |      |       |           | between DDP
       |            |       |      |       |           | and RDMAP
-------+------------+-------+------+-------+-----------+-------------
-------+------------+------------------------------------------------
01100b | RDMA Flush |  0    |  N/A |  1    |  N/A      |  Yes
       | Request    |       |      |       |           |
-------+------------+------------------------------------------------
01101b | RDMA Flush |  0    |  N/A |  3    |  N/A      |  No
       | Response   |       |      |       |           |
-------+------------+------------------------------------------------
01110b | RDMA Verify|  0    |  N/A |  1    |  N/A      |  Yes
       | Request    |       |      |       |           |
-------+------------+------------------------------------------------
01111b | RDMA Verify|  0    |  N/A |  3    |  N/A      |  Yes
       | Response   |       |      |       |           |
-------+------------+------------------------------------------------
10000b | Atomic     |  0    |  N/A |  1    |  N/A      |  Yes
       | Write      |       |      |       |           |
       | Request    |       |      |       |           |
-------+------------+------------------------------------------------
10001b | Atomic     |  0    |  N/A |  3    |  N/A      |  No
       | Write      |       |      |       |           |
       | Response   |       |      |       |           |
-------+------------+------------------------------------------------
]]></artwork>
      </figure>

      <t>This extension adds RDMAP use of Queue Number 1 for Untagged Buffers
      for issuing RDMA Flush, RDMA Verify and Atomic Write Requests, and use
      of Queue Number 3 for Untagged Buffers for tracking the respective
      Responses.</t>

      <t>All other DDP and RDMAP Control Fields are set as described in RFC
      5040 and RFC 7306.</t>

      <t>Figure 3 defines which RDMA Headers are used on each new RDMA Message
      and which new RDMA Messages are allowed to carry ULP payload.</t>

      <figure title="RDMA Message Definitions">
        <artwork><![CDATA[
-------+------------+-------------------+-------------------------
RDMA   | Message    | RDMA Header Used  | ULP Message allowed in
Message| Type       |                   | the RDMA Message
OpCode |            |                   |
-------+------------+-------------------+-------------------------
-------+------------+-------------------+-------------------------
01100b | RDMA Flush | None              | No
       | Request    |                   |
-------+------------+-------------------+-------------------------
01101b | RDMA Flush | None              | No
       | Response   |                   |
-------+------------+---------------------------------------------
01110b | RDMA Verify| None              | No
       | Request    |                   |
-------+------------+-------------------+-------------------------
01111b | RDMA Verify| None              | No
       | Response   |                   |
-------+------------+---------------------------------------------
10000b | Atomic     | None              | No
       | Write      |                   |
       | Request    |                   |
-------+------------+---------------------------------------------
10001b | Atomic     | None              | No
       | Write      |                   |
       | Response   |                   |
-------+------------+---------------------------------------------]]></artwork>
      </figure>

      <t>RFC 5042, Section 2.2.2 defines the "read" and/or "write" remote access
      rights, which are assigned to a memory region at registration. These
      rights are expressed only locally on each platform which exposes regions
      remotely, and are not specified in the RDMA protocols, forming part of
      the security model.</t>

      <t>Two new access rights are required to implement the Responder side of
      this specification:</t>

      <t><list style="symbols">
          <t>Flushable</t>

          <t>Verifiable</t>
        </list></t>

      <section title="RDMA Flush">
        <t>The RDMA Flush operation requests that all bytes in a specified
        region are to be made persistent and/or globally visible, under
        control of specified flags. As specified in section 3, its execution
        is ordered after the successful completion of any previously
        requested RDMA Write or certain other operations. The response is
        generated after the
        region has reached its specified state. The implementation MUST fail
        the operation and send a terminate message if the RDMA Flush cannot be
        performed, or has encountered an error.</t>

        <t>The RDMA Flush operation MUST NOT be responded to until all data
        has attained the requested state. Achieving persistence may require
        programming and/or flushing of device buffers, while achieving global
        visibility may require flushing of cached buffers across the entire
        platform interconnect. In no event are persistence and global
        visibility achieved atomically; one may precede the other, and either
        may complete at any time. A subsequent Atomic Write operation may be
        used by an upper layer consumer to indicate that the requested
        dispositions are available after completion of the RDMA Flush, in
        addition to other approaches.</t>

        <section title="RDMA Flush Request">
          <t>The RDMA Flush Request Message makes use of the DDP Untagged
          Buffer Model. RDMA Flush Request messages MUST use the same Queue
          Number as RDMA Read Requests and RDMA Extensions Atomic Operation
          Requests (QN=1). Reusing the same queue number for RDMA Flush
          Requests allows the operations to reuse the same RDMA infrastructure
          (e.g. Outbound and Inbound RDMA Read Queue Depth (ORD/IRD) flow
          control) as that defined for RDMA Read Requests.</t>

          <t>The RDMA Flush Request Message carries a payload that describes
          the ULP Buffer address in the Responder's memory. The following
          figure depicts the Flush Request that is used for all RDMA Flush
          Request Messages:</t>

          <figure title="Flush Request Payload">
            <artwork><![CDATA[
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Data Sink STag                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Data Sink Length                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     Data Sink Tagged Offset                   |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     Flush Disposition Flags             +R+G+P|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]></artwork>
          </figure>

          <t><list style="hanging">
              <t hangText="Data Sink STag: 32 bits">The Data Sink STag
              identifies the Remote Peer's Tagged Buffer targeted by the RDMA
              Flush Request. The Data Sink STag is associated with the RDMAP
              Stream through a mechanism that is outside the scope of the
              RDMAP specification.</t>

              <t hangText="Data Sink Length: 32 bits">The Data Sink Length is
              the length, in octets, of the bytes targeted by the RDMA Flush
              Request. This field MUST be ignored if the 0x04 bit is set in
              the Flags field.</t>

              <t hangText="Data Sink Tagged Offset: 64 bits">The Data Sink
              Tagged Offset specifies the starting offset, in octets, from the
              base of the Remote Peer's Tagged Buffer targeted by the RDMA
              Flush Request. This field MUST be ignored if the 0x04 bit is set
              in the Flags field.</t>

              <t hangText="Flags: 32 bits">Flags specifying the disposition of
              the flushed data and selectivity of the flushed region: <list
                  style="hanging">
                  <t hangText="0x01">Flush to Persistence "P"</t>

                  <t hangText="0x02">Flush to Global Visibility "V"</t>

                  <t hangText="0x04">Flush entire region "R"</t>
                </list></t>
            </list></t>
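
          <t>For illustration only, the following C fragment sketches one way
          a Requester might encode the Flush Request payload shown above. The
          structure and helper names are assumptions of this example and are
          not part of the protocol; only the 20-octet big-endian wire layout
          is normative.</t>

          <figure title="Example Flush Request Encoding (Non-Normative)">
            <artwork><![CDATA[
/* Non-normative sketch: encoding the 20-octet Flush Request payload. */
#include <stdint.h>

#define FLUSH_TO_PERSISTENCE  0x01   /* "P" */
#define FLUSH_TO_GLOBAL_VIS   0x02   /* "G" */
#define FLUSH_ENTIRE_REGION   0x04   /* "R" */

struct flush_request {
    uint32_t sink_stag;      /* Data Sink STag */
    uint32_t sink_length;    /* Data Sink Length, in octets */
    uint64_t sink_offset;    /* Data Sink Tagged Offset */
    uint32_t flags;          /* disposition flags, low-order bits */
};

static void put32(uint8_t *p, uint32_t v)
{ p[0] = v >> 24; p[1] = v >> 16; p[2] = v >> 8; p[3] = v; }

static void put64(uint8_t *p, uint64_t v)
{ put32(p, (uint32_t)(v >> 32)); put32(p + 4, (uint32_t)v); }

/* Serialize the payload into its big-endian wire format. */
static void encode_flush_request(const struct flush_request *fr,
                                 uint8_t wire[20])
{
    put32(wire,      fr->sink_stag);
    put32(wire + 4,  fr->sink_length);   /* ignored when "R" is set */
    put64(wire + 8,  fr->sink_offset);   /* ignored when "R" is set */
    put32(wire + 16, fr->flags);
}
]]></artwork>
          </figure>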
        </section>

        <section title="RDMA Flush Processing">
          <t>As specified in section 3, RDMA Flush, RDMA Verify and Atomic
          Write are executed by the Responder in strict order. Because the
          RDMA Flush MUST ensure that all bytes are in the specified state
          before responding, an RDMA Verify that follows an RDMA Flush can be
          assured that it is operating on flushed data. If unflushed data has
          been sent to the region segment between the two operations, the
          result of any such RDMA Verify is undefined, since data may be made
          persistent and/or globally visible on the Data Sink at any
          time.</t>

          <t>If an RDMA Flush Operation is attempted on a target ULP Buffer
          address whose region does not permit flush:</t>

          <t><list style="symbols">
              <t>The operation MUST NOT be performed</t>

              <t>The Responder's memory MUST NOT be modified</t>

              <t>A terminate message MUST be generated. (See Section 4.2 for
              the contents of the terminate message.)</t>
            </list></t>

          <t>Any region segment specified by the RDMA Flush operation MUST be
          made persistent and/or globally visible, in a platform-specific
          fashion, before successful return of the operation. If RDMA Flush
          processing is successful on the Responder, meaning the requested
          bytes of the region are, or have been made, persistent and/or
          globally visible as requested, the RDMA Flush Response MUST be sent
          to the Requestor.</t>

          <t>If, during RDMA Flush processing on the Responder, an error is
          detected that would prevent the requested region from achieving the
          requested disposition:</t>

          <t><list style="symbols">
              <t>A terminate message MUST be generated. (See Section 4.2 for
              the contents of the terminate message.)</t>
            </list></t>

          <t>There are no ordering requirements for the processing of the
          data, nor are there any requirements for the order in which the data
          is made globally visible and/or persistent. Data received by prior
          operations (e.g. RDMA Write) MAY be submitted for placement at any
          time, and persistence or global visibility MAY occur before the
          flush is requested. After placement, data MAY become persistent or
          globally visible at any time, in the course of operation of the
          persistency management of the storage device, or by other platform
          actions resulting in persistence or global visibility.</t>

          <t>There are no atomicity guarantees provided on the Responder by
          the RDMA Flush Operation with respect to any other operations.
          While the Completion of the RDMA Flush Operation ensures that the
          requested data was placed into, and flushed from, the target
          Buffer, other operations might have also placed or fetched
          overlapping data. The upper layer is responsible for arbitrating
          any shared access.</t>
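
          <t>The following non-normative sketch outlines the Responder-side
          processing described above. The region_permits_flush(),
          persist_range(), make_visible_range(), send_terminate() and
          send_flush_response() routines are hypothetical local primitives
          assumed only for this example; how a platform actually achieves
          persistence or global visibility is implementation-specific.</t>

          <figure title="Example Responder Flush Processing (Non-Normative)">
            <artwork><![CDATA[
/* Non-normative sketch of Responder-side RDMA Flush processing.
 * All helper routines are hypothetical local primitives, stubbed
 * here so the fragment compiles on its own. */
#include <stdbool.h>
#include <stdint.h>

#define FLUSH_TO_PERSISTENCE 0x01
#define FLUSH_TO_GLOBAL_VIS  0x02
#define FLUSH_ENTIRE_REGION  0x04

static bool region_permits_flush(uint32_t stag)
{ (void)stag; return true; }
static bool persist_range(uint32_t stag, uint64_t to, uint32_t len)
{ (void)stag; (void)to; (void)len; return true; }
static bool make_visible_range(uint32_t stag, uint64_t to, uint32_t len)
{ (void)stag; (void)to; (void)len; return true; }
static void send_terminate(void) { }
static void send_flush_response(void) { }

void process_flush(uint32_t stag, uint64_t to, uint32_t len,
                   uint32_t flags, uint64_t region_base,
                   uint32_t region_len)
{
    if (!region_permits_flush(stag)) {
        send_terminate();            /* memory is not modified */
        return;
    }
    if (flags & FLUSH_ENTIRE_REGION) {
        to = region_base;            /* Length and Offset are ignored; */
        len = region_len;            /* flush the whole region         */
    }
    if ((flags & FLUSH_TO_PERSISTENCE) &&
        !persist_range(stag, to, len)) {
        send_terminate();
        return;
    }
    if ((flags & FLUSH_TO_GLOBAL_VIS) &&
        !make_visible_range(stag, to, len)) {
        send_terminate();
        return;
    }
    /* Respond only after the requested dispositions hold. */
    send_flush_response();
}
]]></artwork>
          </figure>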
        </section>

        <section title="RDMA Flush Response">
          <t>The RDMA Flush Response Message makes use of the DDP Untagged
          Buffer Model. RDMA Flush Response messages MUST use the same Queue
          Number as RDMA Extensions Atomic Operation Responses (QN=3). No
          payload is passed to the DDP layer on Queue Number 3.</t>

          <t>Upon successful completion of RDMA Flush processing, an RDMA
          Flush Response MUST be generated.</t>
        </section>
      </section>

      <section title="RDMA Verify">
        <t>The RDMA Verify operation requests that all bytes in a specified
        region be read from the underlying storage and that an integrity hash
        be calculated on their value. As specified in section 3, its
        execution is ordered after the successful completion of any
        previously requested RDMA Write and RDMA Flush, or certain other
        operations.</t>

        <section title="RDMA Verify Request">
          <t>The RDMA Verify Request Message makes use of the DDP Untagged
          Buffer Model. RDMA Verify Request messages MUST use the same Queue
          Number as RDMA Read Requests and RDMA Extensions Atomic Operation
          Requests (QN=1). Reusing the same queue number for RDMA Verify
          Requests allows the operations to reuse the same RDMA
          infrastructure (e.g. Outbound and Inbound RDMA Read Queue Depth
          (ORD/IRD) flow control) as that defined for those Requests.</t>

          <t>The RDMA Verify Request Message carries a payload that describes
          the ULP Buffer address in the Responder's memory. The following
          figure depicts the Verify Request that is used for all RDMA Verify
          Request Messages:</t>

          <figure title="Verify Request Payload">
            <artwork><![CDATA[
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Data Sink STag                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Data Sink Length                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     Data Sink Tagged Offset                   |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                Hash Value (optional, variable)                |
|                              ...                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]></artwork>
          </figure>

          <t><list style="hanging">
              <t hangText="Data Sink STag: 32 bits">The Data Sink STag
              identifies the Remote Peer's Tagged Buffer targeted by the
              Verify Request. The Data Sink STag is associated with the RDMAP
              Stream through a mechanism that is outside the scope of the
              RDMAP specification.</t>

              <t hangText="Data Sink Length: 32 bits">The Data Sink Length is
              the length, in octets, of the bytes targeted by the Verify
              Request.</t>

              <t hangText="Data Sink Tagged Offset: 64 bits">The Data Sink
              Tagged Offset specifies the starting offset, in octets, from the
              base of the Remote Peer's Tagged Buffer targeted by the Verify
              Request.</t>

              <t hangText="Hash Value: (variable)">The Hash Value is
              optionally an octet string representing the expected result, if
              any, of the hash algorithm on the Remote Peer's Tagged Buffer.
              The length of the Hash Value is variable, and dependent on the
              selected algorithm. When non-zero, any mismatch with the
              calculated value causes the Responder to generate a Terminate
              message, and close the connection. The contents of the Terminate
              message are defined in section 4.2.</t>
            </list></t>
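
          <t>As a non-normative illustration, the following C fragment
          sketches how a Requester might encode a Verify Request, appending
          the optional expected Hash Value when one is supplied. The helper
          names are assumptions of this example; the hash length is
          determined by the algorithm selected when the region was
          registered.</t>

          <figure title="Example Verify Request Encoding (Non-Normative)">
            <artwork><![CDATA[
/* Non-normative sketch: encoding a Verify Request payload with an
 * optional, variable-length expected hash. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static void put32(uint8_t *p, uint32_t v)
{ p[0] = v >> 24; p[1] = v >> 16; p[2] = v >> 8; p[3] = v; }

static void put64(uint8_t *p, uint64_t v)
{ put32(p, (uint32_t)(v >> 32)); put32(p + 4, (uint32_t)v); }

/* Returns the total payload length: 16 fixed octets plus the hash. */
static size_t encode_verify_request(uint8_t *wire, uint32_t sink_stag,
                                    uint32_t sink_length,
                                    uint64_t sink_offset,
                                    const uint8_t *hash, size_t hash_len)
{
    put32(wire,     sink_stag);
    put32(wire + 4, sink_length);
    put64(wire + 8, sink_offset);
    if (hash_len != 0)
        memcpy(wire + 16, hash, hash_len);  /* omitted when absent */
    return 16 + hash_len;
}
]]></artwork>
          </figure>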
        </section>

        <section title="RDMA Verify Processing">
          <t>As specified in section 3, RDMA Flush, RDMA Verify and Atomic
          Write are executed by the Responder in strict order. Because the
          RDMA Flush MUST ensure that all bytes are in the specified state
          before responding, an RDMA Verify that follows an RDMA Flush can be
          assured that it is operating on stable data. If any other data has
          been sent to the memory underlying the region segment between the
          two operations, the result of any such RDMA Verify is undefined,
          since data may be made persistent and/or globally visible by the
          Data Sink at any time.</t>

          <t>If an RDMA Verify Operation is attempted on a target ULP Buffer
          address whose region does not permit verification:</t>

          <t><list style="symbols">
              <t>The operation MUST NOT be performed</t>

              <t>A terminate message MUST be generated. (See Section 4.2 for
              the contents of the terminate message.)</t>
            </list></t>

          <t>The algorithm specified by the upper layer protocol or
          application when previously registering the target memory region
          MUST be used to calculate a hash value for the specified Buffer. In
          a platform-dependent manner, the data for the specified Buffer MUST
          NOT be supplied from any intermediate location, for example, from
          an unplaced write, or from a cache.</t>

          <t>If an optional hash value was provided in the Verify Request and
          does not match the calculated result:</t>

          <t><list style="symbols">
              <t>A terminate message MUST be generated. (See Section 4.2 for
              the contents of the terminate message.)</t>
            </list></t>
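
          <t>The following non-normative sketch outlines the Responder-side
          comparison step. FNV-1a is used purely as a stand-in for whatever
          hash algorithm the upper layer selected at registration time, and
          send_terminate() and send_verify_response() are hypothetical local
          primitives assumed only for this example.</t>

          <figure title="Example Responder Verify Processing (Non-Normative)">
            <artwork><![CDATA[
/* Non-normative sketch of Responder-side RDMA Verify processing.
 * The hash algorithm shown is a placeholder, not a recommendation. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static void send_terminate(void) { }
static void send_verify_response(const uint8_t *hash, size_t len)
{ (void)hash; (void)len; }

static uint64_t fnv1a64(const uint8_t *p, size_t len)
{
    uint64_t h = 0xcbf29ce484222325ULL;
    while (len--) { h ^= *p++; h *= 0x100000001b3ULL; }
    return h;
}

/* 'buf' holds the targeted bytes as read back from the underlying
 * storage, never from an intermediate cache or unplaced write. */
void process_verify(const uint8_t *buf, size_t len,
                    const uint8_t *expected, size_t expected_len)
{
    uint64_t h = fnv1a64(buf, len);
    uint8_t wire[8];
    int i;

    for (i = 0; i < 8; i++)
        wire[i] = (uint8_t)(h >> (56 - 8 * i));   /* big-endian */

    if (expected_len != 0 &&
        memcmp(wire, expected, expected_len) != 0) {
        send_terminate();     /* mismatch: no Verify Response is sent */
        return;
    }
    send_verify_response(wire, sizeof(wire));
}
]]></artwork>
          </figure>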
        </section>

        <section title="RDMA Verify Response">
          <t>The Verify Response Message makes use of the DDP Untagged Buffer
          Model. Verify Response messages MUST use the same Queue Number as
          RDMA Flush Responses (QN=3). The RDMAP layer passes the following
          payload to the DDP layer on Queue Number 3. The RDMA Verify Response
          is not sent when a Terminate message is generated due to a
          mismatch.</t>

          <figure title="Verify Response Payload">
            <artwork><![CDATA[
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Hash Value (variable)                      |
|                              ...                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]></artwork>
          </figure>

          <t><list style="hanging">
              <t hangText="Hash Value:">The Hash Value is an octet string
              representing the result of the hash algorithm on the Remote
              Peer's Tagged Buffer. The length of the Hash Value is variable
              and non-zero, and dependent on the algorithm selected by the
              upper layer consumer, among those supported by the RNIC.</t>
            </list></t>
        </section>
      </section>

      <section title="Atomic Write">
        <t>The Atomic Write operation provides a 64-bit block of data which
        is written to a specified region atomically. As specified in section
        3, its execution is ordered after the successful completion of all
        previous RDMA Flush and RDMA Verify operations. The target Buffer
        MUST be 64 bits in length and 64-bit aligned, and the implementation
        MUST fail the operation and send a terminate message if the data
        cannot be written to it atomically.</t>

        <section title="Atomic Write Request">
          <t>The Atomic Write Request Message makes use of the DDP Untagged
          Buffer Model. Atomic Write Request messages MUST use the same Queue
          Number as RDMA Read Requests and RDMA Extensions Atomic Operation
          Requests (QN=1). Reusing the same queue number for Atomic Write
          Requests allows the operations to reuse the same RDMA
          infrastructure (e.g. Outbound and Inbound RDMA Read Queue Depth
          (ORD/IRD) flow control) as that defined for those Requests.</t>

          <t>The Atomic Write Request Message carries an Atomic Write Request
          payload that describes the ULP Buffer address in the Responder's
          memory, as well as the data to be written. The following figure
          depicts the Atomic Write Request that is used for all Atomic Write
          Request Messages:</t>

          <figure title="Atomic Write Request Payload">
            <artwork><![CDATA[
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Data Sink STag                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Data Sink Length                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     Data Sink Tagged Offset                   |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                              Data                             |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]></artwork>
          </figure>

          <t><list style="hanging">
              <t hangText="Data Sink STag: 32 bits">The Data Sink STag
              identifies the Remote Peer's Tagged Buffer targeted by the
              Atomic Write Request. The Data Sink STag is associated with the
              RDMAP Stream through a mechanism that is outside the scope of
              the RDMAP specification.</t>

              <t hangText="Data Sink Length: 32 bits">The Data Sink Length is
              the length in octets of data to be placed, and MUST be 8.</t>

              <t hangText="Data Sink Tagged Offset: 64 bits">The Data Sink
              Tagged Offset specifies the starting offset, in octets, from the
              base of the Remote Peer's Tagged Buffer targeted by the Atomic
              Write Request. This offset can be any value, but the destination
              ULP buffer address MUST be aligned as specified above. Ensuring
              that the STag and Data Sink Tagged Offset values appropriately
              meet such a requirement is an upper layer consumer
              responsibility, and is out of scope for this specification.</t>

              <t hangText="Data: 64 bits">The data to be written, in
              big-endian format.</t>
            </list></t>
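
          <t>For illustration only, the following C fragment sketches how a
          Requester might build the Atomic Write Request payload, including
          the upper-layer check that the target offset satisfies the 64-bit
          alignment rule (assuming, for this example, that the Tagged Buffer
          base is itself 64-bit aligned).</t>

          <figure title="Example Atomic Write Request Encoding (Non-Normative)">
            <artwork><![CDATA[
/* Non-normative sketch: encoding the 24-octet Atomic Write Request. */
#include <stdbool.h>
#include <stdint.h>

static void put32(uint8_t *p, uint32_t v)
{ p[0] = v >> 24; p[1] = v >> 16; p[2] = v >> 8; p[3] = v; }

static void put64(uint8_t *p, uint64_t v)
{ put32(p, (uint32_t)(v >> 32)); put32(p + 4, (uint32_t)v); }

/* Returns false if the target offset would violate the 64-bit
 * alignment rule (assuming a 64-bit-aligned Tagged Buffer base). */
static bool encode_atomic_write(uint8_t wire[24], uint32_t sink_stag,
                                uint64_t sink_offset, uint64_t data)
{
    if (sink_offset % 8 != 0)
        return false;
    put32(wire,      sink_stag);
    put32(wire + 4,  8);            /* Data Sink Length MUST be 8 */
    put64(wire + 8,  sink_offset);
    put64(wire + 16, data);         /* 64-bit data, big-endian */
    return true;
}
]]></artwork>
          </figure>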
        </section>

        <section title="Atomic Write Processing">
          <t>Atomic Write Operations MUST target ULP Buffer addresses that are
          64-bit aligned, and conform to any other platform restrictions on
          the Responder system. The write MUST NOT be performed prior to all
          previous RDMA Flush and RDMA Verify operations completing
          successfully.</t>

          <t>If an Atomic Write Operation is attempted on a target ULP Buffer
          address that is not 64-bit aligned, whose region does not permit
          remote writing as defined in Section 2.2.2 of <xref
          target="RFC5042"/>, or which due to alignment, size, or other
          platform restrictions cannot be performed atomically:</t>

          <t><list style="symbols">
              <t>The operation MUST NOT be performed</t>

              <t>The Responder's memory MUST NOT be modified</t>

              <t>A terminate message MUST be generated. (See Section 4.2 for
              the contents of the terminate message.)</t>
            </list></t>

          <t>Otherwise, the 64-bit data MUST be written to the target ULP
          Buffer address atomically. In a platform-dependent fashion, the
          write MUST NOT be interleaved with other writes to the same buffer,
          and the data MUST NOT be partially visible to other computing
          elements in the responder platform prior to the write being
          complete.</t>

          <t>Note that the Atomic Write operation is not specified to
          additionally perform the RDMA Flush processing of the data. If such
          a result is required by the upper layer, then a subsequent explicit
          RDMA Flush is needed.</t>
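
          <t>The following non-normative fragment illustrates the placement
          step on a cache-coherent Responder host, using a C11 atomic store
          to suggest the "no partial visibility" requirement. This is only a
          sketch under that assumption; actual RNIC and platform mechanisms
          for achieving atomicity are implementation-specific, and, as noted
          above, the store by itself provides no Flush semantics.</t>

          <figure title="Example Atomic Write Placement (Non-Normative)">
            <artwork><![CDATA[
/* Non-normative sketch: a single, non-torn 64-bit store to the
 * aligned target location, illustrated with C11 atomics. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static bool atomic_write_place(_Atomic uint64_t *target, uint64_t data)
{
    if ((uintptr_t)target % 8 != 0)
        return false;               /* alignment violation: terminate */

    /* The value becomes visible to other agents all at once. */
    atomic_store_explicit(target, data, memory_order_release);
    return true;
}
]]></artwork>
          </figure>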
        </section>

        <section title="Atomic Write Response">
          <t>The Atomic Write Response Message makes use of the DDP Untagged
          Buffer Model. Atomic Write Response messages MUST use the same
          Queue Number as RDMA Flush Responses (QN=3). The RDMAP layer passes
          no payload to the DDP layer on Queue Number 3.</t>
        </section>
      </section>

      <section title="Discovery of RDMAP Extensions">
        <t>As with RFC 7306, explicit negotiation by the RDMAP peers of the
        extensions covered by this document is not specified. Instead, it is
        RECOMMENDED that RDMA applications and/or ULPs negotiate any use of
        these extensions at the application or ULP level. The definition of
        such application-specific mechanisms is outside the scope of this
        specification. For backward compatibility, existing applications
        and/or ULPs should not assume that these extensions are supported.</t>

        <t>In the absence of application-specific negotiation of the features
        defined within this specification, the new operations can be
        attempted, and reported errors can be used to determine a remote
        peer's capabilities. In the case of RDMA Flush and Atomic Write, an
        operation to a previously Advertised buffer with remote write
        permission can be used to determine the peer's support. If the
        operation is not supported, the remote peer will report a Remote
        Operation Error or Unexpected OpCode error. For RDMA Verify, such a
        probe operation may target a buffer with remote read permission.</t>
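
        <t>As a non-normative illustration, a Requester might probe for
        support as sketched below. The try_rdma_flush() call and its status
        codes are hypothetical placeholders for a local interface that posts
        the operation and reports its completion or terminate status; they
        are not defined by this specification.</t>

        <figure title="Example Capability Probe (Non-Normative)">
          <artwork><![CDATA[
/* Non-normative sketch: probing for RDMA Flush support when no
 * ULP-level negotiation is available. */
#include <stdbool.h>
#include <stdint.h>

enum probe_status {
    PROBE_OK,                   /* Flush Response received */
    PROBE_UNEXPECTED_OPCODE,    /* terminate: opcode not supported */
    PROBE_REMOTE_OP_ERROR       /* terminate: remote operation error */
};

/* Hypothetical: posts an RDMA Flush to a previously Advertised,
 * remotely writable buffer and waits for its outcome. */
extern enum probe_status try_rdma_flush(uint32_t stag, uint64_t to);

static bool peer_supports_flush(uint32_t stag, uint64_t to)
{
    switch (try_rdma_flush(stag, to)) {
    case PROBE_OK:
        return true;
    case PROBE_UNEXPECTED_OPCODE:
    case PROBE_REMOTE_OP_ERROR:
    default:
        return false;           /* fall back to RFC 5040 behavior */
    }
}
]]></artwork>
        </figure>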
      </section>
    </section>

    <section title="Ordering and Completions">
      <t>The figure below specifies the ordering relationships for the
      operations in this specification, from the standpoint of their execution
      in the Responder. Note that in the figure, where the "First Operation"
      is not listed, there is no additional placement ordering specified to
      the operations defined in this extension, beyond those already defined
      in the base RFC 5040 specification.</t>

      <t>Note: N/A in the figure below means Not Applicable</t>

      <t>THE CONTENT OF THIS SECTION IS INCOMPLETE</t>

      <figure title="Ordering of Operations">
        <artwork><![CDATA[
----------+------------+-------------+-------------+-----------------
First     | Second     | Placement   | Placement   | Ordering
Operation | Operation  | Guarantee at| Guarantee at| Guarantee at
          |            | Remote Peer | Local Peer  | Remote Peer
----------+------------+-------------+-------------+-----------------
RDMA Write| RDMA Flush | Placement   | N/A         | N/A
          |            | Guarantee   |             |
          |            | between     |             |
          |            | Write and   |             |
          |            | Flush       |             |
----------+------------+-------------+-------------+-----------------
RDMA Flush| RDMA Verify| Execution   | N/A         | N/A
          |            | Guarantee   |             |
          |            | between     |             |
          |            | Flush and   |             |
          |            | Verify      |             |
----------+------------+-------------+-------------+-----------------
TODO      | TODO       | Etc         | Etc         | Etc
----------+------------+-------------+-------------+-----------------
]]></artwork>
      </figure>
    </section>

    <section title="Error Processing">
      <t>In addition to error processing described in section 7 of RFC 5040
      and section 8 of RFC 7306, the following rules apply for the new RDMA
      Messages defined in this specification.</t>

      <section title="Errors Detected at the Local Peer">
        <t>The Local Peer MUST send a Terminate Message for each of the
        following cases:<list style="numbers">
            <t>For errors detected while creating an RDMA Flush, RDMA Verify
            or Atomic Write Request, or other reasons not directly associated
            with an incoming Message, the Terminate Message and Error code are
            sent instead of the Message. In this case, the Error Type and
            Error Code fields are included in the Terminate Message, but the
            Terminated DDP Header and Terminated RDMA Header fields are set to
            zero.</t>

            <t>For errors detected on an incoming RDMA Flush, RDMA Verify or
            Atomic Write Request or Response, the Terminate Message is sent at
            the earliest possible opportunity, preferably in the next outgoing
            RDMA Message. In this case, the Error Type, Error Code, and
            Terminated DDP Header fields are included in the Terminate
            Message, but the Terminated RDMA Header field is set to zero.</t>

            <t>For errors detected in the processing of the RDMA Flush or
            RDMA Verify itself, that is, the act of flushing or verifying the
            data, the Terminate Message is generated as per the referenced
            specifications. Even though data is not lost, the upper layer
            MUST be notified of the failure by informing the requester of the
            status, terminating any pending operations, and allowing the
            requester to perform further action, for instance, recovery.</t>
          </list></t>
      </section>

      <section title="Errors Detected at the Remote Peer">
        <t>On incoming RDMA Flush and RDMA Verify Requests, the following MUST
        be validated:<list style="symbols">
            <t>The DDP layer MUST validate all DDP Segment fields.</t>
          </list></t>

        <t>The following additional validation MUST be performed:<list
            style="symbols">
            <t>The RDMAP layer MUST validate the Data Sink STag, Data Sink
            Length and Data Sink Tagged Offset fields of the payload.</t>

            <t>If the RDMA Flush, RDMA Verify or Atomic Write operation cannot
            be satisfied, due to transient or permanent errors detected in the
            processing by the Responder, a Terminate message MUST be returned
            to the Requestor.</t>
          </list></t>
      </section>
    </section>

    <section title="IANA Considerations">
      <t>This document requests that IANA assign the following new operation
      codes in the "RDMAP Message Operation Codes" registry defined in section
      3.4 of <xref target="RFC6580"/>.<list style="hanging">
          <t hangText="0xC">RDMA Flush Request, this specification</t>

          <t hangText="0xD">RDMA Flush Response, this specification</t>

          <t hangText="0xE">RDMA Verify Request, this specification</t>

          <t hangText="0xF">RDMA Verify Response, this specification</t>

          <t hangText="0x10">Atomic Write Request, this specification</t>

          <t hangText="0x11">Atomic Write Response, this specification</t>
        </list></t>

      <t>Note to RFC Editor: this section may be edited and updated prior to
      publication as an RFC.</t>
    </section>

    <section title="Security Considerations">
      <t>This document specifies extensions to the RDMA Protocol specification
      in RFC 5040 and RDMA Protocol Extensions in RFC 7306, and as such the
      Security Considerations discussed in Section 8 of RFC 5040 and Section 9
      of RFC 7306 apply. In particular, all operations use ULP Buffer
      addresses for the Remote Peer Buffer addressing used in RFC 5040 as
      required by the security model described in <xref
      target="RFC5042"/>.</t>

      <t>If the "push mode" transfer model discussed in section 2 is
      implemented by upper layers, new security considerations will be
      potentially introduced in those protocols, particularly on the server,
      or target, if the new memory regions are not carefully protected.
      Therefore, for them to take full advantage of the extension defined in
      this document, additional security design is required in the
      implementation of those upper layers. The facilities of <xref
      target="RFC5042"/> can provide the basis for any such design.</t>

      <t>In addition to protection, in "push mode" the server or target will
      expose memory resources to the peer for potentially extended periods,
      and will allow the peer to perform remote requests which will
      necessarily consume shared resources, e.g. network bandwidth, memory
      bandwidth, power, and memory itself. It is recommended that the upper
      layers provide a means to gracefully adjust such resources, for example
      using upper layer callbacks, without resorting to revoking RDMA
      permissions, which would summarily close connections. Because initiator
      applications rely on the protocol extension itself to manage their
      required persistence and/or global visibility, the lack of such an
      approach would lead to frequent recovery in low-resource situations,
      potentially opening a new threat to such applications.</t>
    </section>

    <section title="Acknowledgements">
      <t>The author wishes to thank Gaurav Agarwal, Kobby Carmona, Brian
      Hausauer, Tony Hurson, Jim Pinkerton and Tom Reu, all of whom
      contributed to earlier versions of the specification, and who provided
      significant review and valuable comments.</t>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include="reference.RFC.2119"?>

      <?rfc include="reference.RFC.5040"?>

      <?rfc include="reference.RFC.5041"?>

      <?rfc include="reference.RFC.5042"?>

      <?rfc include="reference.RFC.6580"?>

      <?rfc include="reference.RFC.7306"?>
    </references>

    <references title="Informative References">
      <?rfc include="reference.RFC.5045"?>

      <?rfc include="reference.RFC.8881"?>

      <?rfc include="reference.RFC.8166"?>

      <?rfc include="reference.RFC.8267"?>

      <?rfc include="reference.RFC.7145"?>

      <reference anchor="MS-SMB2">
        <front>
          <title>Server Message Block (SMB) Protocol Versions 2 and 3
          (MS-SMB2)</title>

          <author fullname="" initials="" surname="Microsoft Corporation">
            <organization>Microsoft Corporation</organization>
          </author>

          <date day="29" month="April" year="2022"/>
        </front>

        <annotation>https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-smb2/5606ad47-5ee0-437a-817e-70c366052962</annotation>
      </reference>

      <reference anchor="MS-SMBD">
        <front>
          <title>SMB2 Remote Direct Memory Access (RDMA) Transport Protocol
          (MS-SMBD)</title>

          <author fullname="" initials="" surname="Microsoft Corporation">
            <organization>Microsoft Corporation</organization>
          </author>

          <date day="5" month="June" year="2021"/>
        </front>

        <annotation>https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-smbd/1ca5f4ae-e5b1-493d-b87d-f4464325e6e3</annotation>
      </reference>

      <reference anchor="SCSI">
        <front>
          <title>SCSI Primary Commands - 3 (SPC-3) (INCITS 408-2005)</title>

          <author fullname="" initials="" surname="ANSI">
            <organization>American National Standards Institute</organization>
          </author>

          <date day="4" month="May" year="2005"/>
        </front>
      </reference>

      <reference anchor="T10DIF">
        <front>
          <title>T10 Data Integrity Field (DIF) / Protection Information
          (PI)</title>

          <author fullname="" initials="" surname="T10 technical subcommittee">
            <organization>INCITS</organization>
          </author>

          <date year="date unknown"/>
        </front>
      </reference>

      <reference anchor="SNIANVMP">
        <front>
          <title>SNIA NVM Programming Model v1.2</title>

          <author fullname="" initials="" surname="SNIA NVM Programming TWG">
            <organization>Storage Networking Industry Association NVM
            Programming TWG</organization>
          </author>

          <date day="19" month="June" year="2017"/>
        </front>

        <annotation>https://www.snia.org/sites/default/files/technical_work/final/NVMProgrammingModel_v1.2.pdf</annotation>
      </reference>
    </references>

    <section title="DDP Segment Formats for RDMA Extensions">
      <t>This appendix is for information only and is NOT part of the
      standard. It simply depicts the DDP Segment format for each of the RDMA
      Messages defined in this specification.</t>

      <t>The DDP Message Offset fields in the Request Messages defined in this
      specification can contain any value and are ignored by the receiver.</t>

      <section title="DDP Segment for RDMA Flush Request">
        <figure title="RDMA Flush Request, DDP Segment">
          <artwork><![CDATA[
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
                                +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                                |   DDP Control | RDMA Control  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Reserved (Not Used)                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|             DDP (Flush Request) Queue Number (1)              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           DDP (Flush Request) Message Sequence Number         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          DDP (Flush Request) Message Offset (Not Used)        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Data Sink STag                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Data Sink Length                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                   Data Sink Tagged Offset                     |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Disposition Flags                  +R+G+P|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]></artwork>
        </figure>
      </section>

      <section title="DDP Segment for RDMA Flush Response">
        <figure title="RDMA Flush Response, DDP Segment">
          <artwork><![CDATA[
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
                                +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                                |   DDP Control | RDMA Control  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Reserved (Not Used)                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|              DDP (Flush Response) Queue Number (3)            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          DDP (Flush Response) Message Sequence Number         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]></artwork>
        </figure>
      </section>

      <section title="DDP Segment for RDMA Verify Request">
        <figure title="RDMA Verify Request, DDP Segment">
          <artwork><![CDATA[
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
                                +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                                |   DDP Control | RDMA Control  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Reserved (Not Used)                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|             DDP (Verify Request) Queue Number (1)             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           DDP (Verify Request) Message Sequence Number        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          DDP (Verify Request) Message Offset (Not Used)       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Data Sink STag                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Data Sink Length                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                   Data Sink Tagged Offset                     |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                Hash Value (optional, variable)                |
|                              ...                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]></artwork>
        </figure>
      </section>

      <section title="DDP Segment for RDMA Verify Response">
        <figure title="RDMA Verify Response, DDP Segment">
          <artwork><![CDATA[
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
                                +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                                |   DDP Control | RDMA Control  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Reserved (Not Used)                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|              DDP (Verify Response) Queue Number (3)           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          DDP (Verify Response) Message Sequence Number        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     Hash Value (variable)                     |
|                              ...                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]></artwork>
        </figure>
      </section>

      <section title="DDP Segment for Atomic Write Request">
        <figure title="Atomic Write Request, DDP Segment">
          <artwork><![CDATA[
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
                                +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                                |   DDP Control | RDMA Control  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Reserved (Not Used)                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          DDP (Atomic Write Request) Queue Number (1)          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|        DDP (Atomic Write Request) Message Sequence Number     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       DDP (Atomic Write Request) Message Offset (Not Used)    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Data Sink STag                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                 Data Sink Length (value=8)                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                   Data Sink Tagged Offset                     |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Data (64 bits)                          |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]></artwork>
        </figure>
      </section>

      <section title="DDP Segment for Atomic Write Response">
        <figure title="Atomic Write Response, DDP Segment">
          <artwork><![CDATA[
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
                                +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                                |   DDP Control | RDMA Control  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Reserved (Not Used)                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           DDP (Atomic Write Response) Queue Number (3)        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       DDP (Atomic Write Response) Message Sequence Number     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]></artwork>
        </figure>
      </section>
    </section>
  </back>
</rfc>
