<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?rfc toc="yes"?>
<?rfc tocompact="yes"?>
<?rfc tocdepth="3"?>
<?rfc tocindent="yes"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes"?>
<?rfc comments="yes"?>
<?rfc inline="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<rfc category="info" docName="draft-dong-fantel-problem-statement-00"
     ipr="trust200902" updates="">
  <front>
    <title abbrev="FaNTEL Problem Statement">Fast Notification Problem
    Statement</title>

    <author fullname="Jie Dong (editor)" initials="J." surname="Dong, Ed.">
      <organization>Huawei Technologies</organization>

      <address>
        <email>jie.dong@huawei.com</email>
      </address>
    </author>

    <author fullname="Mike McBride (editor)" initials="M."
            surname="McBride, Ed.">
      <organization>Futurewei</organization>

      <address>
        <email>mmcbride7@gmail.com</email>
      </address>
    </author>

    <author fullname="Francois Clad (editor)" initials="F."
            surname="Clad, Ed.">
      <organization>Cisco Systems</organization>

      <address>
        <email>fclad@cisco.com</email>
      </address>
    </author>

    <author fullname="Jeffrey Zhang" initials="Z." surname="Zhang">
      <organization>Juniper Networks</organization>

      <address>
        <email>zzhang@juniper.net</email>
      </address>
    </author>

    <author fullname="Yongqing Zhu" initials="Y." surname="Zhu">
      <organization>China Telecom</organization>

      <address>
        <email>zhuyq8@chinatelecom.cn</email>
      </address>
    </author>

    <author fullname="Xiaohu Xu" initials="X." surname="Xu">
      <organization>China Mobile</organization>

      <address>
        <email>xuxiaohu_ietf@hotmail.com</email>
      </address>
    </author>

    <author fullname="Ran Pang" initials="R." surname="Pang">
      <organization>China Unicom</organization>

      <address>
        <email>pangran@chinaunicom.cn</email>
      </address>
    </author>

    <author fullname="Hao Lu" initials="H." surname="Lu">
      <organization>Tencent</organization>

      <address>
        <email>vickkylu@tencent.com</email>
      </address>
    </author>

    <author fullname="Yadong Liu" initials="Y." surname="Liu">
      <organization>Tencent</organization>

      <address>
        <email>zeepliu@tencent.com</email>
      </address>
    </author>

    <author fullname="Luis M. Contreras" initials="L." surname="Contreras">
      <organization>Telefonica</organization>

      <address>
        <email>luismiguel.contrerasmurillo@telefonica.com</email>
      </address>
    </author>

    <author fullname="Mehmet Durmus" initials="M." surname="Durmus">
      <organization>Turkcell</organization>

      <address>
        <email>mehmet.durmus@turkcell.com.tr</email>
      </address>
    </author>

    <date day="20" month="October" year="2025"/>

    <abstract>
      <t>Modern networks require adaptive traffic manipulation, including
      Traffic Engineering (TE), load balancing, flow control, and protection,
      to support applications such as AI training and real-time services. A
      good and timely understanding of network operational status, such as
      congestion and failures, can help improve utilization, reduce latency,
      and enable faster response to critical events. This document describes
      the existing problems and why the IETF may need a new set of fast
      notification solutions to support high-throughput, low-latency, and
      lossless applications.</t>
    </abstract>
  </front>

  <middle>
    <section title="Introduction">
      <t>Modern network applications, ranging from AI training to large-scale
      cloud services, require lossless and adaptive networks to ensure
      reliable, congestion-free data transfer within a single data center or
      across multiple sites. These workloads demand high throughput, low
      latency, and minimal packet loss across dynamically shifting traffic
      patterns. To meet these requirements, networks employ mechanisms such as
      traffic engineering (TE), load balancing, flow control, and protection.
      However, existing solutions often face limitations in responsiveness,
      coverage, and operational complexity, particularly in high-speed,
      large-scale environments.</t>

      <t>This document summarizes the limitations of existing mechanisms that
      prevent rapid notification of, and reaction to, critical network events,
      including link or node failures and congestion. It describes why the
      IETF may need a new set of fast notification solutions to support these
      use cases. <xref
      target="I-D.geng-fantel-fantel-gap-analysis"/> provides a gap analysis
      of existing solutions and where they are deficient in supporting
      high-demand services. This document primarily focuses on describing the
      problem space.</t>

      <section anchor="requirements-language" title="Requirements Language">
        <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
        "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
        document are to be interpreted as described in <xref
        target="RFC2119">RFC 2119</xref>.</t>
      </section>
    </section>

    <section title="Glossary">
      <t>FaNTEL: Fast Notification for Traffic Engineering and Load
      Balancing</t>

      <t>FRR: Fast Re-Route</t>

      <t>ECN: Explicit Congestion Notification</t>

      <t>BFD: Bidirectional Forwarding Detection</t>

      <t>IOAM: In-situ Operations, Administration, and Maintenance</t>

    </section>

    <section title="The Problem">
      <t>Current network traffic manipulation mechanisms, such as TE, load
      balancing, flow control, and protection, have deficiencies in providing
      the low-latency, high-granularity responsiveness needed in modern,
      dynamic networks, at least in part due to the lack of dynamic network
      state information. This results in suboptimal performance, low
      reliability, and delayed recovery. FaNTEL is proposed as a set of
      solutions to address this by enabling fast, real-time, lightweight
      notifications that enhance responsiveness for traffic engineering,
      congestion mitigation, and rapid failure protection. There is a need
      for a standardized framework in the IETF to define these fast
      notification mechanisms, requirements, and integration
      strategies.</t>

      <t>The limitations of existing notification solutions are summarized
      below:<list style="symbols">
          <t>Slow Reaction: Existing control protocols (e.g., routing
          protocols) may be used to disseminate dynamic network state
          information, but they usually rely on control-plane-based
          hop-by-hop distribution, which adds delay when the recipient is
          multiple hops away. In modern high-throughput environments (AI/ML
          clusters, multi-DC WANs), this delay is often prohibitive. Explicit
          Congestion Notification (ECN) <xref target="RFC3168"/> requires
          congestion signals to be sent back to the sender, which can be slow
          if the source node is far away, and it relies on the source node to
          react in the transport layer. What is needed is a lightweight
          signaling method that can provide real-time alerts (e.g., in under
          10 ms) on failures, congestion, or threshold breaches, enabling
          immediate actions (e.g., within milliseconds to tens of
          milliseconds) in the network layer.</t>

          <t>Coarse-Grained Signals: ECN and similar mechanisms only provide
          binary or threshold-based feedback, without the granularity needed
          for rapid, fine-tuned adjustments. This leads to either
          overreaction or underutilization of available capacity. What would
          be useful is a set of notifications that are not just "on-off"
          state reports but can also convey information such as congestion
          level or utilization, latency spikes, queue buildup, or flow
          characteristics, so that they can trigger immediate and precise
          responses like rerouting, rate adjustment, or protection switching
          for specific flows.</t>

          <t>Overhead and Churn: IOAM <xref target="RFC9197"/> and similar
          tools provide detailed telemetry information, but the collection and
          feedback loops are controller-centric. They cannot be used to
          deliver lightweight, real-time alerts for immediate action on
          specific network nodes. In addition, carrying dynamic network state
          information in control protocols (e.g., routing protocols)
          increases the overhead and churn of the control plane, which may
          negatively affect the core functionality of the protocol. It would
          be useful to have solutions designed to avoid the overhead and
          churn introduced by telemetry flooding or route distribution, so
          that they can adapt to large-scale networks and dynamic traffic
          patterns (e.g., AI workloads, cloud WAN bursts).</t>

          <t>Local-Only Decision Making: Current load-balancing, flow-control,
          and fast reroute (FRR) techniques often act on local information and
          fail to capture downstream or cross-domain network conditions,
          limiting their effectiveness. For example, the Point of Local
          Repair (PLR) makes its decision based on its local view of the
          topology and network status. It does not know the state of the
          entire backup path (e.g., whether the backup path itself is
          congested). It would be helpful to send fast notifications to
          upstream nodes, which can then act based on a view of regional or
          global network conditions.</t>

          <t>Scalability Challenges: High-volume information or frequent
          signaling introduces bandwidth and processing overhead. At scale,
          this becomes a bottleneck rather than a solution.</t>
        </list></t>
    </section>

    <section anchor="usecase"
             title="Example: AI Training Cluster with Fiber Link Failure">
      <t>Consider a large-scale AI/ML training job distributed across multiple
      data centers. These clusters exchange terabits per second of data
      between GPU nodes, requiring ultra-low latency and high throughput to
      maintain synchronization.</t>

      <t>In such environments, a single fiber link failure or severe
      congestion event can disrupt the entire training run, leading to:<list
          style="symbols">
          <t>Delays in job completion (hours to days for large models)</t>

          <t>Massive energy and compute cost waste due to
          resynchronization</t>

          <t>Degraded convergence accuracy if synchronization windows are
          missed</t>
        </list></t>

      <section anchor="current-limitations"
               title="Limitations of Existing Mechanisms">
        <t>Today's mechanisms provide partial solutions but are not fast or
        precise enough for these scenarios:<list style="symbols">
            <t>BFD <xref target="RFC5880"/>: Provides fast forwarding path
            failure detection. It can be used for both link and path failure
            detection, but it cannot detect link or path congestion, nor can
            it notify other nodes in the network of a failure or congestion.
            BFD is preconfigured with periodic message exchange, whereas fast
            notifications need to be event-driven. When the transmit interval
            is set to a small value (e.g., on the order of milliseconds),
            frequent BFD message exchange may become a burden to some
            systems.</t>

            <t>FRR (<xref target="RFC4090"/>, <xref target="RFC5714"/>) /
            Route convergence: Without fast notification, failure detection
            can take tens of milliseconds, followed by either local repair
            (FRR) or route convergence. The former lacks a global view of the
            network and thus may cause congestion on the backup paths, while
            the latter may breach strict synchronization deadlines.</t>

            <t>ECN: Provides binary congestion feedback to the endpoints,
            which is too coarse to characterize congestion spikes on
            high-speed links, and the resulting action can be slow.</t>

            <t>Telemetry (e.g., IOAM): Offers detailed information, but
            relies on collection and RTT-based feedback, which delays
            action.</t>

            <t>Receiver/Sender Flow Control: Tied to RTT or packet loss, and
            therefore unsuitable for the bursty nature of AI traffic
            patterns.</t>
          </list></t>

        <t>In practice, this means that by the time a fiber link failure is
        detected and recovery mechanisms are invoked, critical GPU
        synchronization barriers may already be missed, forcing rollbacks or
        restarts of the training process.</t>
      </section>

      <section anchor="fantel-solution" title="How Fast Notification Helps">
        <t>Fast notification mechanisms could improve the response to fiber
        link failures and congestion in AI/ML clusters:<list style="symbols">
            <t>Real-Time Alerts: Nodes adjacent to the failure or congestion
            could immediately (e.g., within 10 ms) send lightweight
            notifications to nodes whose forwarding paths may be
            affected.</t>

            <t>Action-Oriented Response: Upon receiving the notification,
            routing and load balancing mechanisms could promptly shift
            traffic to backup paths or alternative DC interconnects.</t>

            <t>Granularity: Notifications could carry more detailed
            information than "link failure/congestion", e.g., indicating
            specific link utilization, queue buildup, or microburst
            congestion, allowing differentiated responses for different
            traffic flows.</t>

            <t>Complementary: Fast notification solutions are complementary
            to BFD and IOAM. They would bridge the time gap between event
            onset and slower control plane or telemetry-driven responses,
            and enable network-wide optimization.</t>
          </list></t>

        <t>By deploying fast notifications, large AI/ML workloads can maintain
        synchronization across data centers even during transient failures or
        congestion, protecting job completion time and resource
        utilization.</t>

        <figure anchor="ai-link-failure"
                title="AI Training Cluster with Fiber Link Failure">
          <artwork name="" type="ascii-art"><![CDATA[
    
           +-------------------------+       +-------------------------+
           |   Data Center A (GPU)   |-------|   Data Center B (GPU)   |
           +-------------------------+       +-------------------------+
                        |                              |
                        |     High-speed Fiber Link    |
                        +-----------X  (Failure) ------+
                                    |
                              (Failure Event)
    ]]></artwork>
        </figure>

        <t>Existing Approach:<list style="symbols">
            <t>BFD detects failure after tens of ms</t>

            <t>FRR may cause congestion on backup paths</t>

            <t>Reroute/convergence delays impact GPU sync</t>

            <t>Result: Training stalls, job wastes compute</t>
          </list></t>

        <t>Fast Notification Approach:<list style="symbols">
            <t>BFD detects failure after tens of ms</t>

            <t>Fast notification alerts upstream nodes of failure or
            congestion in real time</t>

            <t>Regional or global TE quickly steers traffic to alternate
            links without causing new congestion</t>

            <t>Result: Training continues with minimal disruption</t>
          </list></t>
      </section>
    </section>

    <section title="Fast Notification Problem Statement">

      <section title="Information of Fast Notifications">
        <t>The information carried in a fast notification by the originating
        node can include one or more of the following:</t>

        <t><list style="symbols">
            <t>Failure information: This can include the location of failure,
            and the type of failure.</t>

            <t>Fine-grained Congestion information: This can include link
            utilization, queue length, or the level of congestion, together
            with the location where the congestion happens.</t>

            <t>Fine-grained Performance information: This can include link or
            node delay, jitter, packet loss information etc., together with
            the location where the performance degradation happens.</t>

            <t>Path identification information: This can be used to indicate
            the path along which one service flow is being forwarded.</t>

            <t>Flow identification information: This can include either the
            identification or the 5-tuple of a flow.</t>
          </list>Other information that relates to network status and
        requires timely action may also be carried in the fast notifications.
        Thus there is a need to work on the information model of fast
        notifications to better understand what needs to be carried in
        them.</t>
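
        <t>As a starting point for discussing such an information model, the
        items listed above could be grouped into a single record. The
        following Python sketch is illustrative only: the class name
        FastNotification, its field names, the example location string, and
        the utilization thresholds are assumptions made for this example and
        are not defined by this document or by any existing
        specification.</t>

        <figure>
          <artwork><![CDATA[
```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple

# (src addr, dst addr, src port, dst port, protocol)
FiveTuple = Tuple[str, str, int, int, int]


class EventType(Enum):
    FAILURE = "failure"
    CONGESTION = "congestion"
    PERFORMANCE = "performance"


@dataclass(frozen=True)
class FastNotification:
    """Hypothetical record for the information items listed above."""
    event_type: EventType
    location: str                      # where the event occurred
    failure_type: Optional[str] = None
    link_utilization: Optional[float] = None  # fraction, 0.0 to 1.0
    queue_length: Optional[int] = None
    delay_us: Optional[int] = None     # delay in microseconds
    path_id: Optional[str] = None      # path identification
    flow: Optional[FiveTuple] = None   # flow identification


def congestion_level(note: FastNotification) -> str:
    """Map utilization to a level; thresholds are assumptions."""
    if note.link_utilization is None:
        return "unknown"
    if note.link_utilization >= 0.9:
        return "high"
    if note.link_utilization >= 0.7:
        return "medium"
    return "low"


note = FastNotification(
    event_type=EventType.CONGESTION,
    location="dc-a:leaf1:eth3",  # made-up identifier
    link_utilization=0.93,
    queue_length=4096,
)
print(congestion_level(note))
```
]]></artwork>
        </figure>

        <t>In this sketch, a recipient sees a multi-level congestion value
        rather than a binary mark, which is the property that would let it
        choose between, for example, rate adjustment and rerouting.</t>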
      </section>

      <section anchor="fantel-recipients"
               title="Recipients of Fast Notifications">
        <t>Fast notifications may be consumed by two broad categories of
        recipients: (1) recipient nodes that participate directly in
        forwarding or signaling, and (2) functions and applications that
        consume notifications in order to optimize, monitor, or adapt
        behaviors. Separating these categories clarifies which entities are
        physical/logical nodes and which are higher-level functional
        consumers.</t>

        <t><figure>
            <artwork><![CDATA[    +==================+======================+=======================+
    | Node Type        | Role                 | Example Benefit       |
    +==================+======================+=======================+
    | Adjacent Routers | Data-plane neighbors | Enable local repair   |
    | / Switches       | that forward packets | (e.g., FRR, ECMP      |
    |                  |                      | adjustments)          |
    +------------------+----------------------+-----------------------+
    | Non-Adjacent     | Remote upstream      | Accelerated awareness |
    | Routers /        | forwarding elements  | of failure/congestion |
    | Switches         |                      | on specific nodes     |
    +------------------+----------------------+-----------------------+
    | Ingress Routers  | Traffic entry points | Re-map affected flows |
    |                  | of a network         | before forwarding     |
    |                  | domain               | into failed regions   |
    +------------------+----------------------+-----------------------+
    | End Hosts / Edge | Optional             | Adapt sending rate,   |
    | Nodes            | subscribers, policy- | select alternate      |
    |                  | driven               | uplinks               |
    +------------------+----------------------+-----------------------+
    | Network          | Optional             | Accelerated awareness |
    | Controller / PCE | subscribers, policy- | of failure/congestion |
    |                  | driven               | for global TE/LB      |
    +------------------+----------------------+-----------------------+

                   Table 1: Recipient Nodes]]></artwork>
          </figure><figure>
            <artwork><![CDATA[   +=======================+===============+===========================+
   | Function /            | Role          | Example Benefit           |
   | Application           |               |                           |
   +=======================+===============+===========================+
   | Routing Protocols     | Control-plane | Accelerated path re-      |
   | (OSPF, IS-IS, BGP)    | convergence   | computation after failure |
   +-----------------------+---------------+---------------------------+
   | Traffic Engineering   | Centralized   | Pre-compute new paths     |
   | Controllers (PCE/     | optimization  | before congestion         |
   | SDN)                  |               | propagates                |
   +-----------------------+---------------+---------------------------+
   | Network Operators     | Operational   | Faster troubleshooting,   |
   | (NMS/OSS)             | visibility    | earlier alerting          |
   +-----------------------+---------------+---------------------------+
   | Telemetry /           | Monitoring    | Predictive analytics, ML- |
   | Analytics Systems     | and           | based congestion          |
   |                       | prediction    | forecasting               |
   +-----------------------+---------------+---------------------------+
   | Applications /        | Critical app  | AI workloads, financial   |
   | Services              | consumers     | apps adapt to degraded    |
   |                       |               | links                     |
   +-----------------------+---------------+---------------------------+

                 Table 2: Recipient Functions and Applications]]></artwork>
          </figure></t>

        <figure anchor="fantel-planes-diagram"
                title="Notification Recipients Across Network Planes">
          <artwork><![CDATA[
                   +-----------------------------+
                   |     Application Plane       |
                   |  - Applications / Services  |
                   |  - End Hosts / Edge Nodes   |
                   +-------------^---------------+
                                 |
                   +-------------|---------------+
                   |  Management Plane           |
                   |  - Operators (NMS/OSS)      |
                   |  - Telemetry / Analytics    |
                   +-------------^---------------+
                                 |
                   +-------------|---------------+
                   |  Control Plane              |
                   |  - Routing Protocols        |
                   |  - TE Controllers (PCE/SDN) |
                   +-------------^---------------+
                                 |
                   +-------------|----------------+
                   |  Data Plane                  |
                   |  - Adjacent Routers/Switches |
                   |  - Non-Adjacent Routers      |
                   |  - Ingress Routers           |
                   +------------------------------+
    ]]></artwork>
        </figure>


        <t>As illustrated above, the latency sensitivity of recipients
        decreases as one moves from the data plane to the application plane.
        Recipient nodes (e.g., adjacent forwarding elements, ingress
        routers, etc.) often require near-instantaneous notification, while
        functions and applications (e.g., routing protocols, analytics, NMS,
        etc.) may tolerate slightly longer timescales but still benefit from
        rapid awareness compared to existing mechanisms. The range of
        recipients of a notification depends on both the type of recipient
        and the type of action required. The mechanism to determine the type
        and range of recipients needs further consideration.</t>
      </section>

      <section title="Delivery of Fast Notifications">
        <t>Depending on the position and number of the recipient nodes, fast
        notifications may be sent via one of the following delivery modes:</t>

        <t><list style="symbols">
            <t>Unicast directly to the recipient node</t>

            <t>Multicast to a group of recipient nodes</t>

            <t>Hop-by-hop to a series of recipient nodes along a specified
            path</t>

            <t>Flooding in a specified range of the network</t>
          </list></t>

        <t>Additionally, recipient nodes or functions may subscribe to
        specific types of notifications based on their roles or interests. A
        subscription-based approach enables selective delivery, reduces
        unnecessary signaling overhead, and ensures that each recipient
        receives only the information relevant to its function. Mechanisms
        supporting both delivery and subscription must guarantee timely,
        reliable, and secure propagation of notifications. Examples of
        subscriptions include:</t>

        <t><list style="symbols">
            <t>Adjacent routers subscribing to all local failure
            notifications</t>

            <t>Centralized controllers subscribing only to congestion alerts
            exceeding defined thresholds</t>

            <t>Applications or analytics systems subscribing to performance
            degradation events affecting specific flows or services</t>
          </list>The mechanisms that support the above delivery modes need to
        ensure that a notification is always delivered to the targeted
        recipient nodes in a timely manner. They could be based on existing
        messaging and transport mechanisms, or a new protocol may be
        introduced.</t>
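
        <t>The subscription examples above can be pictured as per-recipient
        filters evaluated by the originating node. The Python sketch below
        is a minimal illustration under stated assumptions: the
        SubscriptionTable class, the dictionary keys ("type",
        "utilization"), the recipient names, and the 80% threshold are all
        hypothetical and do not correspond to any defined protocol or
        API.</t>

        <figure>
          <artwork><![CDATA[
```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

# A notification is modeled as a plain dict for brevity, e.g.
# {"type": "congestion", "utilization": 0.95}
Notification = Dict[str, Any]
Predicate = Callable[[Notification], bool]


class SubscriptionTable:
    """Hypothetical table mapping recipients to their filters."""

    def __init__(self) -> None:
        self._filters: Dict[str, List[Predicate]] = defaultdict(list)

    def subscribe(self, recipient: str, pred: Predicate) -> None:
        self._filters[recipient].append(pred)

    def recipients_for(self, note: Notification) -> List[str]:
        """Return recipients with at least one matching filter."""
        return sorted(
            name
            for name, preds in self._filters.items()
            if any(pred(note) for pred in preds)
        )


table = SubscriptionTable()

# Adjacent router: all failure notifications.
table.subscribe("adjacent-router", lambda n: n["type"] == "failure")

# Controller: congestion alerts above an assumed 80% threshold.
table.subscribe(
    "controller",
    lambda n: n["type"] == "congestion" and n["utilization"] > 0.8,
)

print(table.recipients_for({"type": "failure"}))
print(table.recipients_for({"type": "congestion",
                            "utilization": 0.95}))
```
]]></artwork>
        </figure>

        <t>The resulting recipient list could then feed whichever delivery
        mode applies (unicast for a single match, multicast for several);
        that mapping is left open here.</t>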
      </section>
    </section>

    <section title="Summary">
      <t>Current network mechanisms were not designed for the responsiveness
      and scale required by today's dynamic environments. Techniques such as
      load balancing, protection switching, and flow control rely on telemetry
      and feedback loops that are often too slow, too coarse, or too
      resource-intensive. This results in performance bottlenecks, delayed
      recovery, and inefficiencies in large-scale AI, cloud, and WAN
      deployments. A fast notification mechanism could help to address these
      gaps by providing lightweight, real-time, actionable alerts that
      complement existing tools and enable faster, more accurate network
      management decisions.</t>
    </section>

    <section title="IANA Considerations">
      <t>This document has no IANA actions.</t>
    </section>

    <section title="Security Considerations">
      <t>Fast notifications, if not properly authenticated and rate-limited,
      could be exploited as a vector for Denial-of-Service (DoS) attacks. An
      attacker able to inject or flood spurious notifications may trigger
      unnecessary re-convergence, path changes, or repeated state updates,
      overwhelming both recipient nodes and higher-level applications.
      Implementations must therefore ensure integrity protection, origin
      authentication, and appropriate rate controls on notification
      messages.</t>
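
      <t>One possible way for a recipient to combine these protections is
      sketched below in Python. It is an assumption-laden illustration, not
      a proposed mechanism: HMAC-SHA256 with a pre-shared key stands in for
      whatever integrity and origin authentication scheme is eventually
      chosen, and the TokenBucket class, its rate and burst values, and the
      accept() helper are hypothetical names introduced only for this
      example.</t>

      <figure>
        <artwork><![CDATA[
```python
import hashlib
import hmac


class TokenBucket:
    """Per-origin rate control; rate and burst are illustrative."""

    def __init__(self, rate_per_s: float, burst: int) -> None:
        self.rate = rate_per_s
        self.burst = burst
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at burst.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst,
                          self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def verify(key: bytes, payload: bytes, tag: bytes) -> bool:
    """Check an HMAC-SHA256 tag using a constant-time compare."""
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)


def accept(key: bytes, payload: bytes, tag: bytes,
           bucket: TokenBucket, now: float) -> bool:
    """Accept only authenticated notifications within the limit.

    Verification runs first, so forged messages do not consume
    tokens that belong to the legitimate origin.
    """
    return verify(key, payload, tag) and bucket.allow(now)
```
]]></artwork>
      </figure>

      <t>Checking the tag before charging the bucket keeps an attacker from
      exhausting a legitimate origin's allowance with forged messages, at
      the cost of performing verification on every received message; an
      implementation might add a coarser limit ahead of verification.</t>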
    </section>

    <section title="Acknowledgement">
      <t>The authors would like to thank XXX for the valuable comments and
      discussion.</t>
    </section>

    <section title="Contributors">
      <t>The following people contributed substantially to the content of this
      document.</t>

      <t><figure>
          <artwork><![CDATA[Zafar Ali
Cisco
zali@cisco.com

Tianran Zhou
Huawei
zhoutianran@huawei.com

Xuesong Geng
Huawei
gengxuesong@huawei.com
]]></artwork>
        </figure></t>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include='reference.RFC.2119'?>
    </references>

    <references title="Informative References">
      <?rfc include='reference.I-D.geng-fantel-fantel-gap-analysis'?>

      <?rfc include='reference.RFC.3168'?>

      <?rfc include='reference.RFC.4090'?>

      <?rfc include='reference.RFC.5714'?>

      <?rfc include='reference.RFC.5880'?>

      <?rfc include='reference.RFC.9197'?>
    </references>
  </back>
</rfc>
