<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE rfc [
  <!ENTITY nbsp    "&#160;">
  <!ENTITY zwsp   "&#8203;">
  <!ENTITY nbhy   "&#8209;">
  <!ENTITY wj     "&#8288;">
]>
<?rfc toc="yes"?>
<?rfc tocompact="yes"?>
<?rfc tocdepth="3"?>
<?rfc tocindent="yes"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes"?>
<?rfc comments="yes"?>
<?rfc inline="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<rfc xmlns:xi="http://www.w3.org/2001/XInclude" category="std" docName="draft-yang-dmsc-distributed-model-04" ipr="trust200902" obsoletes="" updates="" submissionType="IETF" xml:lang="en" tocInclude="true" tocDepth="3" symRefs="true" sortRefs="true" version="3">
  <!-- xml2rfc v2v3 conversion 3.29.0 -->
  <front>
    <title abbrev="DMSC-LMT Architecture">Microservice Communication Resource
    Scheduling for Distributed AI Model</title>
    <seriesInfo name="Internet-Draft" value="draft-yang-dmsc-distributed-model-04"/>
    <author fullname="Hui Yang" initials="H" surname="Yang">
      <organization>Beijing University of Posts and
      Telecommunications</organization>
      <address>
        <postal>
          <street>10 Xitucheng Road, Haidian District</street>
          <city>Beijing</city>
          <code>100876</code>
          <region>Beijing</region>
          <country>China</country>
        </postal>
        <phone/>
        <email>yanghui@bupt.edu.cn</email>
      </address>
    </author>
    <author fullname="Tiankuo Yu" initials="T" surname="Yu">
      <organization>Beijing University of Posts and
      Telecommunications</organization>
      <address>
        <postal>
          <street>10 Xitucheng Road, Haidian District</street>
          <city>Beijing</city>
          <code>100876</code>
          <region>Beijing</region>
          <country>China</country>
        </postal>
        <phone/>
        <email>yutiankuo@bupt.edu.cn</email>
      </address>
    </author>
    <author fullname="Qiuyan Yao" initials="Q" surname="Yao">
      <organization>Beijing University of Posts and
      Telecommunications</organization>
      <address>
        <postal>
          <street>10 Xitucheng Road, Haidian District</street>
          <city>Beijing</city>
          <code>100876</code>
          <region>Beijing</region>
          <country>China</country>
        </postal>
        <phone/>
        <email>yqy89716@bupt.edu.cn</email>
      </address>
    </author>
    <author fullname="Zepeng Zhang" initials="Z" surname="Zhang">
      <organization>Beijing University of Posts and
      Telecommunications</organization>
      <address>
        <postal>
          <street>10 Xitucheng Road, Haidian District</street>
          <city>Beijing</city>
          <code>100876</code>
          <region>Beijing</region>
          <country>China</country>
        </postal>
        <phone/>
        <email>2024140574@bupt.cn</email>
      </address>
    </author>
    <date day="3" month="July" year="2025"/>
    <area>IETF Area</area>
    <workgroup>DMSC Working Group</workgroup>
    <keyword>distributed AI</keyword>
    <keyword>service architecture</keyword>
    <abstract>
      <t>This document describes the architecture of microservice
      communication resource scheduling for distributed AI model.</t>
    </abstract>
  </front>
  <middle>
    <section anchor="intro" numbered="true" toc="default">
      <name>Introduction</name>
      <t>With the rapid advancement of Large Models such as GPT, Grok, and
      DeepSeek, training workloads have continuously increased in terms of
      scale, complexity, and resource consumption. This trend imposes stricter
      requirements on compute resource scheduling, communication efficiency,
      system resilience, and overall scalability.</t>
      <t>Traditional centralized control and monolithic architectures face
      several challenges, including network congestion, load imbalances, and
      difficulties in failure recovery. While current mainstream training
      systems typically rely on multi-node distributed parallel frameworks to
      improve training throughput through task partitioning and communication
      synchronization, they still encounter significant limitations when
      dealing with ultra-large model parameters, heterogeneous hardware
      clusters, and high-density communication patterns. These limitations
      include coarse scheduling granularity, inefficient communication path
      optimization, and inadequate fault tolerance, which ultimately lead to
      scalability bottlenecks and degraded system performance.</t>
      <t>To address these challenges, this draft proposes the Distributed
      Microservice Communication Architecture for Large Model Training
      (DMSC-LMT). DMSC-LMT is designed to enhance scheduling flexibility,
      communication efficiency, system resilience, and scalability in
      high-concurrency, large-scale, and heterogeneous training environments
      by introducing mechanisms such as task-level microservice decomposition,
      content-semantic addressing, and computation-aware routing. This
      architecture provides a unified, extensible, and evolvable foundation
      for building the next generation of Large Model training platforms.</t>
    </section>
    <section numbered="true" toc="default">
      <name>Conventions used in this document</name>
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in <xref target="RFC2119" format="default">RFC 2119</xref>.</t>
    </section>
    <section numbered="true" toc="default">
      <name>Terminology</name>
      <t>The following terms are defined in this draft:</t>
      <t>* DMSC-LMT: Distributed Microservice Communication Architecture for
      Large Model Training, the architecture defined in this draft.</t>
      <t>* MG: Microservice Gateway, a core component in the microservice
      architecture that serves as the unified entry point for external
      requests. It is responsible for traffic routing, load balancing, service
      authentication, authorization, request filtering, logging, API
      aggregation, and other tasks, ensuring that external requests are
      efficiently and stably dispatched to service instances.</t>
      <t>* SR: Service Router, responsible for internal communication between
      microservices. It dynamically selects the appropriate service instance
      using the service discovery mechanism and performs traffic routing and
      load distribution based on the traffic scheduling algorithm, ensuring
      efficient and stable communication between services.</t>
      <t>* SRD: Service Registry and Discovery, responsible for centrally
      managing the registration and discovery of microservices. Microservices
      register their instance information with the service registry upon
      startup, and SRD provides service discovery and registration information
      to SR and MG, ensuring they can dynamically locate service instances and
      route traffic.</t>
      <t>* AAM: Authentication and Authorization Module, the core component
      responsible for handling identity authentication and authorization in
      MG. It ensures that all incoming requests are authenticated and have
      appropriate permissions, protecting the system from unauthorized
      access.</t>
      <t>* MI: Microservice Instance, a runtime-executable entity derived from
      a decomposed task, identified by a unique Service ID (SID), and deployed
      within a Microservice Instance Cluster (MIC). It performs a specific
      training-stage function and participates in the service routing and
      discovery ecosystem.</t>
      <t>* MIC: Microservice Instance Cluster, a collection of microservice
      instances that collaborate in a distributed system to provide specific
      functions or services. Each MIC contains a set of microservice instances
      (such as A/1, B/1), and the instances in the cluster can perform the
      same or different tasks, serving as the basic computing unit for data
      parallelism, model parallelism, or hybrid training strategies. The
      cluster's microservices are coordinated by SRD and SR to ensure seamless
      communication and data synchronization.</t>
      <t>* SMO: Service Mesh Orchestrator, a key component in the microservice
      architecture responsible for coordinating, managing, and optimizing
      communication and task scheduling among microservices. It ensures
      efficient collaboration among services and dynamically adjusts the
      allocation of service traffic and computing tasks based on actual load,
      performance requirements, and system status.</t>
    </section>
    <section numbered="true" toc="default">
      <name>DMSC-LMT architecture</name>
      <section numbered="true" toc="default">
        <name>Overview</name>
        <t>The Distributed Microservice Communication Architecture for Large
        Model Training (DMSC-LMT) is designed to support high-concurrency,
        high-throughput, and large-scale training scenarios. It decomposes the
        training system into modular microservice components to ensure
        flexible scheduling, efficient communication, and robust fault
        tolerance. The overall architecture of DMSC-LMT is shown in Figure 1.
        Microservice Instance Clusters (MICs) execute core training
        computations through parallelized microservices. Microservice Gateways
        (MGs) serve as the unified traffic entry point and manage request
        routing. The Authentication and Authorization Module (AAM) validates
        identity and enforces access control for all incoming requests.
        Service Routers (SRs) coordinate internal service communication and
        load balancing. Service Registry and Discovery (SRD) manages
        microservice registration and discovery, enabling SR and MG to
        dynamically locate and route to service instances. Functional Modules
        provide reusable, loosely coupled components, such as AI-driven
        congestion prediction and computation-aware routing, that may interact
        with external stateful services while remaining stateless in
        design.</t>
        <artwork name="Fig. 1  Architecture of DMSC-LMT" type="" align="left" alt=""><![CDATA[+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                                                           |
|                                MIC-1                                                       MIC-2                                                          |
|                +------------------------------------+                      +------------------------------------+                                         |
|                |      Microservice Instance (MI)    |                      |       Microservice Instance (MI)   |                                         |
|                |   +------+   +------+   +------+   |                      |   +------+   +------+   +------+   |                                         |
|                |   |  A/1 |   |  B/1 |   |  ... |   |                      |   |  A/2 |   |  B/2 |   |  ... |   |                                         |
|                |   +------+   +------+   +------+   |                      |   +------+   +------+   +------+   |                                         |
|                |       |          |         |       |                      |      |           |          |      |                                         |
|                |   +------+   +------+   +------+   |                      |   +------+   +------+   +------+   |                                         |
|                |   |  DB  |   |  ... |   |  ... |   |                      |   |  DB  |   |  ... |   |  ... |   |                                         |
|                |   +------+   +------+   +------+   |                      |   +------+   +------+   +------+   |                                         |
|                +-----------------|------------------+                      +------------------|-----------------+                                         |
|                                  |                                                            |                                                           |
|           +---------+      +-----|------+                                              +------|-----+      +---------+                                    |
|           |   AAM   |------|    MG-1    |                                              |    MG-2    |------|   AAM   |                                    |
|           +---------+      +------|-----+                                              +------|-----+      +---------+                                    |
|                                   |                                                           |                                                           |
|                                     \                                                       /                                                             |
|                                       \                                                    /                                                              |
|          +--------------------+      +--\-------+                                +------/---+     +--------------------+                                  |
|          | Functional modules |------|   SR-1   |--------------------------------|   SR-2   |-----| Functional modules |                                  |
|          +--------------------+      +-----|----+                                +-----|----+     +--------------------+                                  |
|                                            |                                           |                                                                  |
|                                            |                                           |                                                                  |
|                                            |                                           |                                                                  |
|                                            |                                           |                                                                  |
|                                            |                                           |                                                                  |
|                                            |                                           |                                                                  |
|                                            |                                           |                                                                  |
|                                            |                                           |                                                                  |
|          +--------------------+      +-----|----+                                +-----|----+     +--------------------+                                  |
|          | Functional modules |------|   SR-3   |--------------------------------|  SR-...  |-----| Functional modules |                                  |
|          +--------------------+      +--/-------+                                +------\---+     +--------------------+                                  |
|                                       /                                                   \                                                               |
|                                     /                                                       \                                                             |
|                                   |                                                           |                                                           |
|           +---------+      +------|-----+                                              +------|-----+      +---------+                                    |
|           |   AAM   |------|    MG-3    |                                              |   MG-...   |------|   AAM   |                                    |
|           +---------+      +-----|------+                                              +------|-----+      +---------+                                    |
|                                  |                                                            |                                                           |
|                +-----------------|------------------+                      +------------------|-----------------+                                         |
|                |   +------+   +------+   +------+   |                      |   +------+   +------+   +------+   |                                         |
|                |   |  DB  |   |  ... |   |  ... |   |                      |   |  DB  |   |  ... |   |  ... |   |                                         |
|                |   +------+   +------+   +------+   |                      |   +------+   +------+   +------+   |                                         |
|                |       |          |         |       |                      |      |           |          |      |                                         |
|                |   +------+   +------+   +------+   |                      |   +------+   +------+   +------+   |                                         |
|                |   |  A/3 |   |  B/3 |   |  ... |   |                      |   |  ... |   |  ... |   |  ... |   |                                         |
|                |   +------+   +------+   +------+   |                      |   +------+   +------+   +------+   |                                         |
|                |       Microservice Instance        |                      |       Microservice Instance        |                                         |  
|                +------------------------------------+                      +------------------------------------+                                         |
|                                MIC-3                                                       MIC-...                                                        |
|                                                                                                                                                           |
|                                                                                                                                                           |
|                                                                                                                                                           |
|                                    +------+             +-----------+            +------+                                                                 |
|                                    |  MG  | ============|    SMO    |============|  SR  |                                                                 |
|                                    +------+             +-----------+            +------+                                                                 |
|                                                                                                                                                           |
|                                               +------+             +-----------+                                                                          |
|                                               |  MG  | ============|    SRD    |                                                                          |
|                                               +------+             +-----------+                                                                          |
|                                                                                                                                                           |
|                                                                                                                                                           |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------+]]></artwork>
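        <t>As a non-normative illustration of the SRD role described above,
        the following sketch shows how instances might register metadata under
        a service prefix and how MG/SR could query the registry. All class,
        field, and prefix names are assumptions made for illustration; the
        draft does not define a concrete API.</t>

```python
# Toy SRD (illustrative only): maps service prefixes to live instance
# records so that MG and SR can discover routable instances.
class ServiceRegistry:
    def __init__(self):
        self._instances = {}  # prefix -> list of instance records

    def register(self, prefix, instance_id, address, capability):
        record = {"id": instance_id, "addr": address, "capability": capability}
        self._instances.setdefault(prefix, []).append(record)

    def discover(self, prefix):
        # Used by MG and SR to dynamically locate service instances.
        return list(self._instances.get(prefix, []))


registry = ServiceRegistry()
registry.register("/train/layer-A", "A/1", "10.0.0.1:7000", {"gpu": 1})
registry.register("/train/layer-A", "A/2", "10.0.0.2:7000", {"gpu": 2})
instances = registry.discover("/train/layer-A")
```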
      </section>
      <section numbered="true" toc="default">
        <name>Function modules</name>
        <t>In DMSC-LMT, Functional Modules are loosely coupled components used
        to perform intelligent policy analysis, path optimization, and
        predictive reasoning. Each functional module itself remains stateless,
        but it can access external state services, and it is pluggable and
        reusable. To support flexible deployment and high-performance
        inference, Functional Modules typically consist of the following core
        components, as shown in Figure 2:</t>
        <t>* Data Ingestion Interface: The functional module first obtains the
        input information required by the system during operation through the
        data ingestion interface. This includes data from MG and SR, module
        dependency topology from MIC, and service instance state from SRD.
        This data may be delivered in multiple forms, such as RPC interfaces
        or REST APIs. This interface ensures that the module has full
        awareness of the current state of the system.</t>
        <t>* Feature Processing: To adapt the data to the input requirements
        of the policy inference engine, feature processing and modeling
        components are integrated into the functional module. This part is
        responsible for preprocessing the raw data, including cleaning,
        normalization, and sliding-window aggregation. At the same time, it
        encodes the dependency topology between modules (e.g., generating an
        adjacency matrix or a graph embedding) and constructs data structures
        suitable for graph-model or sequence-model analysis. Good feature
        modeling directly affects inference accuracy and performance.</t>
        <t>* Policy Inference Engine: This component integrates a series of
        intelligent algorithms for decision generation. Different inference
        logic can be deployed in different modules, such as topology scoring
        based on a graph neural network or optimal routing-path selection
        based on a reinforcement learning strategy. Lightweight supervised
        learning models can also be used to classify and predict link
        congestion risk. This component receives the processed feature input
        and outputs policy suggestions, such as path priorities, node
        avoidance suggestions, or traffic control parameters.</t>
        <t>* Policy Output Interface: The inference results are fed back to
        the system control layer, mainly the SMO, through this interface. The
        interface supports synchronous or asynchronous transmission, and the
        generated policy recommendation is returned to the policy orchestrator
        in a standard format, which is adopted by the SMO and sent to the
        communication components (MG/SR). In addition, the output interface
        can support logging, result echo, or policy version control to enhance
        the traceability and interpretability of policy inference.</t>
        <t>* Lifecycle Control: To ensure independent deployment and
        hot-update capability, lifecycle control logic is included in the
        functional module. This part is responsible for module registration,
        startup, shutdown, and version switching, and supports dynamically
        enabling or disabling functional modules according to the requirements
        of different training tasks. Through this mechanism, the system can
        flexibly load different policy models according to the operating
        scenario without interrupting the core training process.</t>
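        <t>The lifecycle behavior above can be sketched as follows. This is a
        minimal, non-normative illustration; the class and module names are
        assumptions, and the draft does not prescribe a management API.</t>

```python
# Illustrative lifecycle controller: register, start, stop, and
# hot-swap a functional module's policy model without interrupting
# the core training process.
class ModuleLifecycle:
    def __init__(self):
        self.modules = {}  # module name -> {"version": ..., "state": ...}

    def register(self, name, version):
        self.modules[name] = {"version": version, "state": "registered"}

    def start(self, name):
        self.modules[name]["state"] = "running"

    def stop(self, name):
        self.modules[name]["state"] = "stopped"

    def switch_version(self, name, new_version):
        # Hot update: swap the policy model while the module stays enabled.
        self.modules[name]["version"] = new_version


lc = ModuleLifecycle()
lc.register("congestion-predictor", "v1")
lc.start("congestion-predictor")
lc.switch_version("congestion-predictor", "v2")
```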
        <artwork name="Fig. 2  Functional modules" type="" align="left" alt=""><![CDATA[
                                  RPC  | REST API
                                       |
+--------------------------------------|---------------------------------------+
|                        +-------------|---------------+                       |
|                        |   Data Ingestion Interface  |                       |
|                        +-------------|---------------+                       |
|                                      |                                       |
|                        +-------------|---------------+                       |
|                        |      Feature Processing     |                       |
|                        +-------------|---------------+                       |
|                                      |                                       |
|                        +-------------|---------------+                       |
|                        |   Policy Inference Engine   |                       |
|                        +-------------|---------------+                       |
|                                      |                                       |
|                        +-------------|---------------+                       |
|                        |    Policy Output Interface  |                       |
|                        +-------------|---------------+                       |
|                                      |                                       |
|                        +-------------|---------------+                       |
|                        |      Lifecycle Control      |                       |
|                        +-----------------------------+                       |
|                                                                              |
+------------------------------------------------------------------------------+]]></artwork>
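        <t>The data-path stages of Figure 2 (ingestion, feature processing,
        inference, output) can be chained as in the following sketch. The
        function names, threshold, and metric formats are illustrative
        assumptions, not part of the architecture.</t>

```python
# Illustrative functional-module pipeline: ingest, build features,
# infer a policy, and emit it toward the SMO.

def ingest(raw):
    # Data Ingestion Interface: metrics from MG/SR, topology from MIC.
    return {"metrics": raw["links"], "topology": raw["edges"]}

def build_features(snapshot, num_nodes):
    # Feature Processing: normalize link loads and encode the module
    # dependency topology as an adjacency matrix.
    peak = max(snapshot["metrics"].values()) or 1.0
    normalized = {k: v / peak for k, v in snapshot["metrics"].items()}
    adj = [[0] * num_nodes for _ in range(num_nodes)]
    for src, dst in snapshot["topology"]:
        adj[src][dst] = 1
    return normalized, adj

def infer_policy(normalized):
    # Policy Inference Engine (stand-in for a learned model): flag links
    # whose normalized load exceeds a fixed threshold.
    return {link: "avoid" if load > 0.8 else "prefer"
            for link, load in normalized.items()}

def emit_policy(policy):
    # Policy Output Interface: package the recommendation for the SMO.
    return {"type": "path-policy", "rules": policy}


raw = {"links": {"SR-1/SR-2": 90.0, "SR-1/SR-3": 30.0}, "edges": [(0, 1), (0, 2)]}
normalized, adj = build_features(ingest(raw), num_nodes=3)
recommendation = emit_policy(infer_policy(normalized))
```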
      </section>
      <section numbered="true" toc="default">
        <name>The control signaling messages design of DMSC-LMT</name>
        <t>In this draft, we define the control signaling mechanisms of the
        Distributed Microservice Communication Architecture for Large Model
        Training (DMSC-LMT). This architecture introduces multiple
        communication entities and signaling types to support efficient task
        distribution, dynamic routing, security verification, and real-time
        system optimization in large-scale training environments.</t>
        <t>DMSC-LMT introduces the following key communication entities:</t>
        <t>* Service Registry and Discovery (SRD)</t>
        <t>* Microservice Gateway (MG)</t>
        <t>* Authentication and Authorization Module (AAM)</t>
        <t>* Service Router (SR)</t>
        <t>* Functional Modules</t>
        <t>* Microservice Instance Cluster (MIC)</t>
        <t>The control signaling messages exchanged among these entities are
        outlined in Figure 4. The signaling types and their respective
        functions are described as follows:</t>
        <t>Service Instance Registration - Type 1: This signaling is initiated
        by microservice instances within MICs upon startup. Each instance
        registers its metadata (e.g., service prefix, resource type, compute
        capability) with the SRD. This ensures that MG and SR can discover
        and route to active instances dynamically.</t>
        <t>Service Route Advertisement - Type 2: This signaling is exchanged
        between SRD and SR. It distributes route availability and service
        prefix topology across the system. The SRD periodically broadcasts the
        status of all reachable service instances and their location to
        connected SRs, enabling accurate route calculation and service
        reachability.</t>
        <t>Service Access Authentication - Type 3: When an external or
        internal request reaches MG, the MG sends an authentication signaling
        to AAM to validate the identity and permissions of the request. Based
        on policy rules, AAM returns an access decision, ensuring secure
        access control for all service invocations.</t>
        <t>Compute-Aware Routing Notification - Type 4: This signaling is used
        by SRs to report current load conditions, hardware heterogeneity, and
        latency statistics of target service instances to MGs. Based on this
        feedback, MGs and SRs collaboratively determine the most efficient
        path for routing compute-intensive tasks.</t>
        <t>QoS Telemetry and Policy Update - Type 5: Microservice Gateways
        (MGs) and Service Routers (SRs) periodically report communication
        quality metrics (e.g., throughput, error rate, latency) to the Service
        Mesh Orchestrator (SMO). The SMO then updates Quality-of-Service (QoS)
        policies and informs relevant entities to dynamically adjust traffic
        distribution strategies, ensuring optimal performance across the
        system.</t>
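        <t>For illustration only, the five signaling types above could be
        framed as tagged messages as in the following sketch. The field names
        and payload contents are assumptions; the draft does not define a
        concrete wire format.</t>

```python
# Tagged control messages for the five signaling types (illustrative).
SIGNALING_TYPES = {
    1: "service-instance-registration",      # MIC instance to SRD
    2: "service-route-advertisement",        # SRD to SR
    3: "service-access-authentication",      # MG to AAM
    4: "compute-aware-routing-notification", # SR to MG
    5: "qos-telemetry-and-policy-update",    # MG/SR to SMO
}

def make_message(msg_type, sender, payload):
    if msg_type not in SIGNALING_TYPES:
        raise ValueError("unknown signaling type: %d" % msg_type)
    return {"type": msg_type, "name": SIGNALING_TYPES[msg_type],
            "sender": sender, "payload": payload}

# Type 1: a microservice instance registers its metadata with the SRD.
reg = make_message(1, "A/1", {"prefix": "/train/layer-A",
                              "resource": "gpu",
                              "compute_capability": 8.0})

# Type 5: an MG reports QoS metrics to the SMO.
qos = make_message(5, "MG-1", {"throughput_gbps": 9.2,
                               "error_rate": 0.001,
                               "latency_ms": 4.5})
```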
      </section>
      <section numbered="true" toc="default">
        <name>DMSC-LMT communication flow</name>
        <t>This section illustrates the communication flow among key
        components in the system during training task execution. It highlights
        how messages, control signals, and data are exchanged across
        Microservice Gateways (MGs), Service Routers (SRs), Service Mesh
        Orchestrator (SMO), and other entities to ensure coordinated training
        and resource-efficient operation.</t>
        <t>1) Service registration and communication path configuration</t>
        <t>After the training task is initiated, MG and SR first register
        their own node information, computing resources and service
        capabilities with SMO. The SRD synchronously completes the allocation
        of routing identifiers and maintains the logical mapping relationship
        of microservices. SMO combines the module topology information
        provided by MIC to determine the initial configuration scheme of the
        communication path and synchronously issues the strategy to each
        communication node.</t>
        <t>2) Data exchange during the model training process</t>
        <t>After the training task enters the execution stage, MG and SR
        conduct actual data transmission according to the configured
        communication strategy. MG is mainly responsible for data entry and
        aggregation within the module, while SR is responsible for forwarding
        intermediate results and selecting paths between nodes. During the
        communication process, Functional Modules can participate in traffic
        prediction for congestion avoidance and in real-time optimization
        decisions.</t>
        <t>3) Telemetry of communication quality and policy feedback</t>
        <t>MG and SR continuously report communication indicators during
        operation, such as bandwidth occupancy, link delay, and packet loss
        rate. This telemetry information is aggregated and fed back to the
        SMO. Meanwhile, Functional Modules are also involved in parsing the
        telemetry data to evaluate the current performance status of the
        system and predict its trends.</t>
        <t>4) QoS policy update and traffic adjustment</t>
        <t>Based on the observed data and the analysis results of functional
        modules, SMO will conduct a dynamic evaluation of the current QoS
        strategy. When problems such as path performance degradation, sudden
        increase in computing delay, or node load imbalance occur, the SMO
        will update the communication policy and distribute it to each MG and
        SR for execution.</t>
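        <t>The policy re-evaluation described above can be sketched as a
        simple threshold rule. The thresholds and action names below are
        illustrative assumptions, not values mandated by this document.</t>
        <sourcecode type="python"><![CDATA[
# Illustrative thresholds for QoS re-evaluation (invented values).
DELAY_LIMIT_MS = 10.0
LOAD_LIMIT = 0.9

def evaluate_policy(link_metrics):
    """Return per-link actions the SMO might distribute to MGs/SRs."""
    actions = {}
    for link, m in link_metrics.items():
        if m["delay_ms"] > DELAY_LIMIT_MS:
            actions[link] = "reroute"    # path performance degradation
        elif m["load"] > LOAD_LIMIT:
            actions[link] = "rebalance"  # node load imbalance
        else:
            actions[link] = "keep"
    return actions

actions = evaluate_policy({"a->b": {"delay_ms": 15.2, "load": 0.4},
                           "b->c": {"delay_ms": 3.0, "load": 0.95},
                           "c->d": {"delay_ms": 2.0, "load": 0.3}})
]]></sourcecode>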
        <t>5) Exception handling and fault recovery</t>
        <t>During communication, once a node or link fails, the MG or SR
        actively reports the abnormal state to the SMO. The SMO immediately
        coordinates with the SRD and MIC for fault localization and path
        reconstruction. Meanwhile, the Functional Modules can analyze events
        in real time, for example determining whether a fault is an
        instantaneous fluctuation or a structural failure, thereby guiding
        the system on whether to retry in the short term, migrate tasks, or
        switch to a backup link.</t>
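        <t>The distinction between an instantaneous fluctuation and a
        structural failure can be sketched with a persistence rule over
        loss-rate samples; the threshold and window below are illustrative
        assumptions, not normative parameters.</t>
        <sourcecode type="python"><![CDATA[
def classify_fault(loss_samples, threshold=0.05, persist=3):
    """Return 'structural' if the loss rate stays above the threshold for
    `persist` consecutive samples, 'transient' for a short spike, and
    'healthy' otherwise (guiding retry vs. migrate/backup-link decisions)."""
    bad_streak = 0
    for loss in loss_samples:
        bad_streak = bad_streak + 1 if loss > threshold else 0
        if bad_streak >= persist:
            return "structural"
    if any(l > threshold for l in loss_samples):
        return "transient"
    return "healthy"

# A short spike suggests a short-term retry; a persistent fault suggests
# migrating tasks or switching to a backup link.
spike = classify_fault([0.00, 0.20, 0.00, 0.00])       # transient
persistent = classify_fault([0.10, 0.20, 0.30, 0.10])  # structural
]]></sourcecode>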
        <t>6) Communication termination and resource cleanup</t>
        <t>When the training task is completed, the MG and SR report their
        status to the SMO and SRD. The SMO is responsible for cleaning up
        communication routing information and deactivating service instances.
        The SRD updates the service state mapping and releases port and
        routing resources. Meanwhile, the MIC removes the module dependency
        topology of the current task.</t>
      </section>
      <section numbered="true" toc="default">
        <name>Task-Level Microservice Decomposition</name>
        <t>This section defines the implementation mechanisms of task-level
        microservice decomposition in DMSC-LMT. The decomposition process
        translates a large model training task into a set of loosely coupled
        microservice units, each corresponding to a specific computational
        stage, model layer, or data-parallel shard. These microservices are
        scheduled and deployed to Microservice Instance Clusters (MICs),
        registered with the SRD, and coordinated by the SMO for communication
        and resource optimization.</t>
        <t>1) Decomposition Principles</t>
        <t>Task-level microservice decomposition follows the principle of
        decoupling monolithic training workflows into independent
        computational sub-tasks with well-defined input and output boundaries.
        Each decomposed unit encapsulates a specific stage of the training
        pipeline, such as data loading, feature extraction, forward pass,
        backward propagation, or optimization. These units are designed to
        operate independently with minimal state sharing, thereby enabling
        stateless execution where possible. Furthermore, they are schedulable
        on heterogeneous hardware platforms, allowing flexible mapping based
        on compute capabilities or affinity constraints. The granularity of
        decomposition is determined by analyzing the model structure, the
        parallelism strategy employed—such as data, model, pipeline, or
        hybrid parallelism—and the trade-offs between communication
        overhead and task isolation. The overarching goal is to strike an
        effective balance between atomic execution units and manageable
        coordination complexity.</t>
        <t>2) Service Unit Generation and Mapping</t>
        <t>Once the decomposition strategy is determined, each sub-task is
        instantiated as a microservice unit, assigned a unique service
        identifier (SID), and bundled with its corresponding execution
        context. These microservice units are described through metadata,
        which includes information such as the SID, input/output schema,
        compute requirements, and dependency tags. Upon creation, each unit is
        registered with the SRD to facilitate dynamic service discovery. The
        units are then mapped onto a logical task graph, where the edges
        represent communication dependencies between services. This mapping
        process may incorporate various constraints, including affinity
        requirements (such as GPU placement), execution order enforcement
        (e.g., layer-wise propagation), and co-location preferences to
        minimize cross-node latency. The resulting task graph abstraction is
        consumed by the SMO, which uses it to generate an initial deployment
        and routing plan for efficient execution across the system.</t>
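        <t>A minimal sketch of such service units and their logical task
        graph follows; the SIDs, stages, and metadata schema are illustrative
        assumptions rather than a format defined by this document.</t>
        <sourcecode type="python"><![CDATA[
from dataclasses import dataclass, field

@dataclass
class ServiceUnit:
    """A decomposed sub-task instantiated as a microservice unit."""
    sid: str                  # unique service identifier (SID)
    stage: str                # e.g. "data_loading", "forward", "backward"
    compute: dict             # compute requirements for scheduling
    depends_on: list = field(default_factory=list)  # upstream SIDs

units = [
    ServiceUnit("svc-load", "data_loading", {"cpu": 2}),
    ServiceUnit("svc-fwd", "forward", {"gpu": 1}, depends_on=["svc-load"]),
    ServiceUnit("svc-bwd", "backward", {"gpu": 1}, depends_on=["svc-fwd"]),
]

# Edges of the logical task graph (communication dependencies), as would
# be consumed by the SMO to derive a deployment and routing plan.
edges = [(dep, u.sid) for u in units for dep in u.depends_on]
]]></sourcecode>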
        <t>3) Deployment to MICs</t>
        <t>Microservice instances are deployed to one or more MICs based on
        available resources, physical proximity, and the overall communication
        layout. The deployment process involves allocating container or
        runtime slots on MIC nodes, establishing routing configurations
        between dependent service instances via the SR and SRD, and
        registering each instance’s runtime status—such as active,
        pending, or faulted—to support system-wide monitoring and
        orchestration. The deployment mechanism is designed to be elastic,
        allowing new instances to be dynamically spawned or migrated across
        MICs in response to load fluctuations, node failures, or strategy
        updates issued by the SMO.</t>
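        <t>The elastic placement step can be sketched with a greedy rule
        that assigns each unit to the MIC with the most free capacity; the
        rule, capacities, and names below are illustrative and do not mandate
        a particular placement algorithm.</t>
        <sourcecode type="python"><![CDATA[
def place(units, mics):
    """Greedily assign each unit's GPU demand to the MIC with the most
    free GPUs, consuming capacity as slots are allocated."""
    placement = {}
    for unit, need in units:
        mic = max(mics, key=mics.get)  # least-loaded cluster first
        if mics[mic] < need:
            raise RuntimeError("no capacity for " + unit)
        mics[mic] -= need
        placement[unit] = mic
    return placement

# Three units placed across two MICs (capacities are invented).
plan = place([("svc-fwd", 2), ("svc-bwd", 2), ("svc-opt", 1)],
             {"MIC-1": 3, "MIC-2": 2})
]]></sourcecode>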
        <t>4) Dynamic Recomposition and Rescheduling</t>
        <t>To handle runtime dynamics such as failures, congestion, or
        workload imbalance, DMSC-LMT supports dynamic recomposition and task
        rescheduling. Recomposition allows microservice instances to be
        regrouped or restructured, altering the communication topology to
        improve system performance or enhance fault isolation. Rescheduling,
        on the other hand, involves reallocating microservice instances to
        different MICs based on updated resource availability, QoS
        constraints, or degradation detected along communication paths. These
        adjustments are triggered by feedback from Functional
        Modules—such as congestion prediction or node pressure
        estimation—policy updates from the SMO based on runtime metrics
        and operational reports from SR and MG, as well as failure detectors
        within MICs that identify stalled or underperforming service chains.
        During the recomposition process, the SRD updates routing identifiers
        and service reachability maps, while the SMO coordinates configuration
        synchronization across affected nodes. This mechanism enables DMSC-LMT
        to maintain high throughput, adaptivity, and resilience under dynamic
        runtime conditions.</t>
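        <t>The trigger logic described above can be sketched as a decision
        function over feedback signals; the signal names and threshold are
        illustrative assumptions, not normative triggers.</t>
        <sourcecode type="python"><![CDATA[
def decide(signals):
    """Map Functional Module / detector feedback to an adjustment action:
    rescheduling moves instances across MICs, recomposition restructures
    the communication topology."""
    if signals.get("node_failure"):
        return "reschedule"    # move instances to healthy MICs
    if signals.get("congestion_predicted"):
        return "recompose"     # restructure the communication topology
    if signals.get("load_imbalance", 0) > 0.3:
        return "reschedule"
    return "no_action"

assert decide({"congestion_predicted": True}) == "recompose"
assert decide({"load_imbalance": 0.5}) == "reschedule"
]]></sourcecode>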
      </section>
    </section>
    <section numbered="true" toc="default">
      <name>Realization of key functions</name>
      <section numbered="true" toc="default">
        <name>Efficient Microservice Communication and Orchestration</name>
        <t>The system orchestrates and dynamically adjusts the communication
        strategy through the SMO. The SMO generates the routing strategy
        based on the task topology, system status, and the real-time analysis
        results provided by Functional Modules. The SRD is responsible for
        the registration and discovery of microservices, ensuring that the MG
        and SR can dynamically locate communication targets and establish
        available paths. The Functional Modules play an intelligent auxiliary
        role: through congestion prediction, load awareness, and path
        optimization, they improve the adaptability and efficiency of routing
        decisions.</t>
      </section>
      <section numbered="true" toc="default">
        <name>Intelligent decision support for functional modules</name>
        <t>To enhance the adaptability of the communication strategy, the
        system integrates a set of loosely coupled Functional Modules, such
        as AI-driven congestion predictors and computation-aware routers.
        These modules consume the telemetry data stream (from the MG/SR)
        during system operation and output policy suggestions for use by the
        SMO. Although Functional Modules may rely on external stateful
        services (such as historical data analysis), they are designed to be
        reusable, pluggable, scalable, and independent of individual
        services.</t>
      </section>
      <section numbered="true" toc="default">
        <name>Task topology and dependency control</name>
        <t>The MIC is responsible for analyzing the execution dependencies
        among the model modules and modeling them as an interaction topology
        graph. During the task initiation stage, this topology is passed to
        the SMO as the basis for communication path design and routing
        strategy generation. Each microservice (such as an MG or SR)
        registers its instance information with the SRD at startup. The SRD
        centrally maintains the reachability information of these services to
        support dynamic discovery and location during subsequent
        communications. The MIC also participates in exception judgment at
        runtime (such as identifying whether communication failures have
        disrupted the dependencies between modules), assisting the system
        with fault handling and module-level rescheduling.</t>
      </section>
      <section numbered="true" toc="default">
        <name>Computationally aware routing and congestion avoidance mechanisms</name>
        <t>To cope with highly concurrent communication and dynamic changes
        in computing load during large-scale training tasks, DMSC-LMT
        introduces a set of computation-aware routing and AI-driven
        congestion avoidance mechanisms based on Functional Modules. These
        mechanisms combine node computing pressure, communication link
        status, and topology information to achieve dynamic perception,
        prediction, and optimal scheduling of communication paths between
        services.</t>
        <t>A set of machine learning algorithms is deployed in the Functional
        Modules for predictive modeling and decision optimization of
        communication states, for example: topology-aware modeling based on
        Graph Neural Networks (GNNs), routing optimization based on
        reinforcement learning (RL), path classification based on supervised
        learning, and link trend modeling based on time-series
        prediction.</t>
        <t>These policies operate in a stateless, loosely coupled manner,
        analyzing the system state in real time and providing policy
        recommendations to the SMO. Based on these recommendations, the SMO
        dynamically generates and issues communication strategies to adapt
        service routes in real time, avoid congested paths, and improve
        overall training efficiency and communication stability.</t>
      </section>
      <section numbered="true" toc="default">
        <name>Fault handling and communication resilience</name>
        <t>The DMSC-LMT system constructs a multi-level fault handling
        mechanism for the communication path and module interaction process
        to deal with node failure, link interruption, and performance
        degradation, ensuring the continuity and stability of the training
        task.</t>
        <t>The system monitors the status of services and links in real time
        through the exception reporting mechanism of the MG and SR. Once an
        anomaly is detected, the information is immediately reported to the
        SMO. The SMO, together with the MIC, analyzes the impact scope of the
        fault. If the impact is limited, a local recovery strategy is
        preferred, such as switching to a standby instance or adjusting the
        communication path, to avoid interrupting task execution.</t>
      </section>
    </section>
    <section numbered="true" toc="default">
      <name>Local Database Integration and State Management in DMSC-LMT</name>
      <t>In DMSC-LMT, each microservice instance is optionally equipped with a
      lightweight local database to support task-specific state persistence,
      intermediate result caching, runtime metadata storage, and execution
      traceability. While this decentralized storage model aligns with the
      architecture’s goals of modularity, scalability, and low-latency
      operation, it introduces several technical considerations in terms of
      data consistency, recovery, and coordination.</t>
      <section numbered="true" toc="default">
        <name>Roles and Functions of Local Databases</name>
        <t>Local databases within microservice instances serve the following
        core purposes:</t>
        <t>*State Persistence: Retain intermediate training results, feature
        maps, local model slices, or cached gradients to avoid redundant
        computation and support resumption.</t>
        <t>*Configuration and Dependency Storage: Hold task topology metadata,
        execution context, and module-specific parameters.</t>
        <t>*Resilience and Fault Recovery: Enable fast recovery using local
        snapshots during instance restarts, crashes, or task migrations.</t>
        <t>*Operational Metrics and Event Recording: Record policy inference
        decisions, communication statistics, fault signals, and historical
        performance metrics.</t>
        <t>*Heterogeneous Execution Support: Cache localized execution
        parameters that vary by hardware or role.</t>
        <t>*Version Control and Auditing: Track data versions, strategy
        revisions, and training progress checkpoints.</t>
      </section>
      <section numbered="true" toc="default">
        <name>Coordination Across Instances</name>
        <t>Although each microservice instance maintains its own local
        database, coordination is still necessary when tasks span across
        service boundaries. In DMSC-LMT, such coordination is achieved through
        topology-driven synchronization. The dependency graph of the
        decomposed training task is reflected in the service routing layer,
        where the SMO and SRD maintain a global view of service relationships
        and control data exchange accordingly. For example, when one
        microservice instance produces intermediate results required by a
        downstream module, the communication layer ensures timely delivery
        based on this task graph, without requiring direct access to each
        other's databases. This strategy simplifies implementation, minimizes
        coupling, and ensures consistency by enforcing dependency-aware data
        movement. It also allows SMO to monitor inter-instance flows and
        adjust paths or schedules when necessary, further enhancing the
        robustness of coordinated execution.</t>
      </section>
      <section numbered="true" toc="default">
        <name>Integration with Control Plane Components</name>
        <t>In the DMSC-LMT architecture, local databases are not isolated
        components but are designed to work in concert with the control plane.
        They serve as a critical bridge between the execution layer of
        training tasks and the decision-making logic of system orchestration.
        The SMO can instruct a microservice instance to create and save a
        persistent checkpoint, or to export a portion of its runtime state for
        use in global scheduling. A persistent checkpoint refers to a snapshot
        of the instance’s key training state—such as model
        parameters, optimizer states, or execution progress—stored in
        the local database or stable storage. This enables the system to
        resume tasks from the last saved point in the event of migration or
        failure, avoiding redundant computation and ensuring continuity.
        Additionally, Functional Modules may access structured state data from
        the local store to support analytical tasks such as anomaly detection,
        retry path inference, or policy validation. These interactions
        typically occur through lightweight, read-only APIs that expose only
        necessary information. This approach preserves microservice autonomy
        and security while enhancing system observability and cross-module
        coordination. Through this integration, the local database not only
        handles internal state management but also plays an active role in the
        broader control logic of the system, supporting the adaptive,
        resilient, and intelligent orchestration goals of DMSC-LMT.</t>
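        <t>A persistent checkpoint of this kind can be sketched with an
        instance-local store; the sqlite3 schema and state layout below are
        illustrative assumptions, not an interface defined by this
        document.</t>
        <sourcecode type="python"><![CDATA[
import json
import sqlite3

# In-memory sqlite3 stands in for the instance-local database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE checkpoint (step INTEGER PRIMARY KEY, state TEXT)")

def save_checkpoint(step, params, optimizer_state):
    """Persist a snapshot of key training state at a given step."""
    state = json.dumps({"params": params, "optimizer": optimizer_state})
    db.execute("INSERT OR REPLACE INTO checkpoint VALUES (?, ?)",
               (step, state))

def latest_checkpoint():
    """Return (step, state) of the most recent checkpoint, or None."""
    row = db.execute("SELECT step, state FROM checkpoint "
                     "ORDER BY step DESC LIMIT 1").fetchone()
    return (row[0], json.loads(row[1])) if row else None

save_checkpoint(100, {"w": [0.10, 0.20]}, {"lr": 0.01})
save_checkpoint(200, {"w": [0.15, 0.25]}, {"lr": 0.01})

# After a migration or failure, the task resumes from the last saved point.
step, state = latest_checkpoint()
]]></sourcecode>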
      </section>
    </section>
    <section numbered="true" toc="default">
      <name>The specific process of DMSC-LMT</name>
      <t>DMSC-LMT aims to achieve efficient and flexible distributed training
      by means of microservices. The architecture decomposes the training
      task of a large-scale model into multiple microservice units and uses
      dynamic communication and scheduling between microservices to achieve
      efficient task allocation, load balancing, and fault tolerance.</t>
      <section numbered="true" toc="default">
        <name>Overview of Large Model Training Parallel Strategies</name>
        <t>With the increasing complexity of large models, the computing
        power of a single node can no longer meet their training
        requirements; parallel training strategies therefore become key to
        addressing this challenge. Different parallel strategies offer
        solutions suited to the model characteristics and training task
        requirements. Common large model training parallel strategies include
        Data Parallelism (DP), Pipeline Parallelism (PP), Tensor Parallelism
        (TP), and Hybrid Parallelism. These strategies can be used
        independently or in combination to meet more complex training
        needs.</t>
        <t>1) Data Parallelism (DP)</t>
        <t>Data parallelism is one of the most commonly used parallel training
        methods. The basic idea is to divide the training dataset into
        multiple subsets, with each node processing a subset of the data. Each
        node computes gradients locally and synchronizes the gradients at the
        end of each training step to update the model parameters. Data
        parallelism is suitable for large-scale datasets, but when the model
        is large, the computation and storage capacity of a single node may
        become a bottleneck.</t>
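        <t>The DP update can be illustrated with a toy one-parameter model:
        each node computes a gradient on its own data shard, the gradients
        are averaged at the synchronization point, and every replica applies
        the same update. The model, shards, and learning rate are invented
        for the example.</t>
        <sourcecode type="python"><![CDATA[
def local_gradient(w, shard):
    """Gradient of mean squared error for y = w * x on one data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

w = 0.0
# Two "nodes", each holding a subset of the (x, y) training data.
shards = [[(1.0, 2.0), (2.0, 4.0)],
          [(3.0, 6.0), (4.0, 8.0)]]

grads = [local_gradient(w, s) for s in shards]  # computed in parallel
avg_grad = sum(grads) / len(grads)              # gradient synchronization
w -= 0.1 * avg_grad                             # identical update everywhere
]]></sourcecode>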
        <t>2) Pipeline Parallelism (PP)</t>
        <t>Pipeline parallelism splits a large model into several stages, with
        each stage assigned to a different computation node. Each node
        processes its assigned part of the model sequentially, and as data
        flows through the pipeline, different stages of the model are computed
        in parallel. This method is particularly useful for large models where
        each stage of the model requires substantial computation, allowing the
        model to be processed in a pipeline fashion, with each node focusing
        on a specific layer or subset of the model. Pipeline parallelism
        allows for better utilization of computational resources and reduces
        idle time between stages.</t>
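        <t>The pipelining effect can be illustrated with a toy schedule:
        with S stages and M micro-batches, the forward passes complete in
        S + M - 1 time steps because different stages process different
        micro-batches concurrently. The schedule representation below is
        illustrative only.</t>
        <sourcecode type="python"><![CDATA[
def pipeline_schedule(num_stages, num_microbatches):
    """Return {time_step: [(stage, microbatch), ...]} for the forward
    pass: micro-batch mb reaches stage s at step mb + s."""
    schedule = {}
    for mb in range(num_microbatches):
        for stage in range(num_stages):
            t = mb + stage
            schedule.setdefault(t, []).append((stage, mb))
    return schedule

sched = pipeline_schedule(num_stages=4, num_microbatches=3)
total_steps = max(sched) + 1  # 4 + 3 - 1 = 6
]]></sourcecode>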
        <t>3) Tensor Parallelism (TP)</t>
        <t>Tensor parallelism divides the tensors (e.g., weight matrices,
        gradients) in the model into smaller parts, with each node computing
        operations on a portion of the tensor. The advantage of tensor
        parallelism is that it effectively handles high-dimensional tensors,
        avoiding memory bottlenecks when a single node computes large,
        high-dimensional data. It is suitable for large-scale neural networks
        where parameter size and computational complexity are significant.</t>
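        <t>A toy column-parallel example illustrates the idea: a weight
        matrix is split by columns across two nodes, each node computes its
        partial output locally, and the partial outputs are concatenated to
        recover the full result. The matrix and split are invented for the
        example.</t>
        <sourcecode type="python"><![CDATA[
# Full 2x4 weight matrix; the input activation x is replicated on both nodes.
W = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
x = [1, 1]

W0 = [row[:2] for row in W]  # node 0 holds columns 0-1
W1 = [row[2:] for row in W]  # node 1 holds columns 2-3

def xw(x, w_part):
    """Compute x @ w_part for a column shard of the weight matrix."""
    cols = len(w_part[0])
    return [sum(x[r] * w_part[r][c] for r in range(len(x)))
            for c in range(cols)]

# Each node computes its shard independently; concatenating the partial
# outputs (an all-gather in practice) equals x @ W computed on one node.
y = xw(x, W0) + xw(x, W1)
]]></sourcecode>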
        <t>4) Hybrid Parallelism</t>
        <t>Hybrid parallelism combines the advantages of data parallelism,
        pipeline parallelism, and tensor parallelism, dynamically selecting
        the most appropriate parallel strategy based on task requirements. In
        some complex large-scale model training scenarios, multiple parallel
        strategies may need to be used simultaneously to optimize the use of
        computational resources. For instance, data parallelism can be
        combined with pipeline parallelism to handle both data partitioning
        and model splitting, thus improving training efficiency. Hybrid
        parallelism is particularly useful for complex tasks that involve
        processing both large datasets and massive models.</t>
        <artwork name="Fig. 3  Pipeline Parallelism" type="" align="left" alt=""><![CDATA[+----------------------------------------------------------------------------------------------------+
|       +-----------------------------+      +-------------------------------+                       |
|       |  +------+       +------+    |      |     +------+      +------+    |                       |
|       |  | FWD1 |-----> | FWD2 |----|------|---> | FWD3 |----> | FWD4 |----|-                      |
|       |  +------+       +------+    |      |     +------+      +------+    |   \                   |
|       |      |              |       |      |         |             |       |      +-------+        |
|       |      |              |       |      |         |             |       |      |  Loss |        |
|       |      |              |       |      |         |             |       |      +-------+        |
|       |  +------+       +------+    |      |     +------+      +------+    |   /                   |
|       |  | BWD1 |<----- | BWD2 |<---|------|-----| BWD3 |<---- | BWD4 |<---|-                      |
|       |  +------+       +------+    |      |     +------+      +------+    |                       |
|       +-----------------------------+      +-------------------------------+                       |
|                      MI                                     MI                                     |
+----------------------------------------------------------------------------------------------------+]]></artwork>
        <artwork name="Fig. 4  Hybrid Parallelism" type="" align="left" alt=""><![CDATA[+------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                        |
|                                                                                                                        |
|                    |--------------------------------------------DP-------------------------------------------|         |
|    --------------  +---------+     +---------+     +---------+     +---------+     +---------+     +---------+  ----   |
|       |       |    | +-----+ |     | +-----+ |     | +-----+ |     | +-----+ |     | +-----+ |     | +-----+ |    |    |
|       |       |    | | MI1 | |     | | MI1 | |     | | MI1 | |     | | MI1 | |     | | MI1 | |     | | MI1 | |    |    |
|       |       |    | +-----+ |     | +-----+ |     | +-----+ |     | +-----+ |     | +-----+ |     | +-----+ |    |    |
|       |       |    | +-----+ |     | +-----+ |     | +-----+ |     | +-----+ |     | +-----+ |     | +-----+ |    |    |
|       |       |    | | MI2 | |     | | MI2 | |     | | MI2 | |     | | MI2 | |     | | MI2 | |     | | MI2 | |    |    |
|       |    stage 1 | +-----+ |     | +-----+ |     | +-----+ |     | +-----+ |     | +-----+ |     | +-----+ |    TP   |
|       |       |    |    .    |     |    .    |     |    .    |     |    .    |     |    .    |     |    .    |    |    |
|       |       |    |    .    |     |    .    |     |    .    |     |    .    |     |    .    |     |    .    |    |    |
|       |       |    |    .    |     |    .    |     |    .    |     |    .    |     |    .    |     |    .    |    |    |
|       |       |    | +-----+ |     | +-----+ |     | +-----+ |     | +-----+ |     | +-----+ |     | +-----+ |    |    |
|       |       |    | | MIn | |     | | MIn | |     | | MIn | |     | | MIn | |     | | MIn | |     | | MIn | |    |    |
|       |       |    | +-----+ |     | +-----+ |     | +-----+ |     | +-----+ |     | +-----+ |     | +-----+ |    |    |
|       |  --------  +---------+     +---------+     +---------+     +---------+     +---------+     +---------+  ----   |
|       |                MIC-1           MIC-2           MIC-3          MIC-4            MIC-5          MIC-6            |
|       |                                                                                                                |
|       |                                                                                                                |
|       |  --------  +---------+     +---------+     +---------+     +---------+     +---------+     +---------+         |
|       |      |     | +-----+ |     | +-----+ |     | +-----+ |     | +-----+ |     | +-----+ |     | +-----+ |         |
|       |      |     | | MI1 | |     | | MI1 | |     | | MI1 | |     | | MI1 | |     | | MI1 | |     | | MI1 | |         |
|       |      |     | +-----+ |     | +-----+ |     | +-----+ |     | +-----+ |     | +-----+ |     | +-----+ |         |
|       |    stage 2 |    .    |     |    .    |     |    .    |     |    .    |     |    .    |     |    .    |         |
|       |      |     |    .    |     |    .    |     |    .    |     |    .    |     |    .    |     |    .    |         |
|       |      |     |    .    |     |    .    |     |    .    |     |    .    |     |    .    |     |    .    |         |
|       |      |     | +-----+ |     | +-----+ |     | +-----+ |     | +-----+ |     | +-----+ |     | +-----+ |         |
|       |      |     | | MIn | |     | | MIn | |     | | MIn | |     | | MIn | |     | | MIn | |     | | MIn | |         |
|       |      |     | +-----+ |     | +-----+ |     | +-----+ |     | +-----+ |     | +-----+ |     | +-----+ |         |
|       |  --------  +---------+     +---------+     +---------+     +---------+     +---------+     +---------+         |
|       |                ...             ...             ...            ...              ...             ...             |
|       |                 .               .               .              .                .               .              |
|       |                 .               .               .              .                .               .              |
|       PP                .               .               .              .                .               .              |
|       |                                                                                                                |
|       |  --------  +---------+     +---------+     +---------+     +---------+     +---------+     +---------+         |
|       |      |     | +-----+ |     | +-----+ |     | +-----+ |     | +-----+ |     | +-----+ |     | +-----+ |         |
|       |      |     | | MI1 | |     | | MI1 | |     | | MI1 | |     | | MI1 | |     | | MI1 | |     | | MI1 | |         |
|       |      |     | +-----+ |     | +-----+ |     | +-----+ |     | +-----+ |     | +-----+ |     | +-----+ |         |
|       |  stage ... |    .    |     |    .    |     |    .    |     |    .    |     |    .    |     |    .    |         |
|       |      |     |    .    |     |    .    |     |    .    |     |    .    |     |    .    |     |    .    |         |
|       |      |     |    .    |     |    .    |     |    .    |     |    .    |     |    .    |     |    .    |         |
|       |      |     | +-----+ |     | +-----+ |     | +-----+ |     | +-----+ |     | +-----+ |     | +-----+ |         |
|       |      |     | | MIn | |     | | MIn | |     | | MIn | |     | | MIn | |     | | MIn | |     | | MIn | |         |
|       |      |     | +-----+ |     | +-----+ |     | +-----+ |     | +-----+ |     | +-----+ |     | +-----+ |         |
|    --------------  +---------+     +---------+     +---------+     +---------+     +---------+     +---------+         |
|                        ...             ...             ...            ...              ...             MIC...          |
|                                                                                                                        |
|                                                                                                                        |
+------------------------------------------------------------------------------------------------------------------------+]]></artwork>
      </section>
      <section numbered="true" toc="default">
        <name>Detailed mapping of distributed training to DMSC-LMT</name>
        <t>Distributed training for large models involves stages such as data
        preprocessing, model partitioning, forward and backward propagation,
        gradient synchronization, error recovery, and training completion.
        These stages are handled by DMSC-LMT through its microservice-based
        architecture and its dynamic communication, routing, and resource
        scheduling capabilities. This section details how the components of
        DMSC-LMT map to each stage of the distributed training process, with
        Pipeline Parallelism (PP) used as an example.</t>
        <t>1) Data Preprocessing and Loading</t>
        <t>The first step in the distributed training process is data
        preprocessing, which typically involves normalization, augmentation,
        and feature extraction. In the DMSC-LMT architecture, preprocessing
        tasks are distributed across multiple MIs, with each instance
        responsible for a portion of the data. The MG acts as the entry
        point, receiving incoming data and routing it to the appropriate MI
        for processing. The SRD mechanism ensures that new preprocessing
        instances are dynamically registered and available for task
        assignment, providing flexibility and scalability during data
        processing.</t>
        <t>2) Model Partitioning</t>
        <t>Once the data is ready, the model needs to be partitioned into
        smaller, manageable components, which can then be distributed across
        different computing nodes. In DMSC-LMT, each MI is assigned a
        specific part of the model to process. The SR mechanism ensures
        efficient communication between microservices. Here, Functional
        Modules, such as AI-driven congestion prediction and
        computation-aware routing, are employed to optimize the routing of
        data and tasks. These modules predict potential bottlenecks and
        dynamically adjust routing paths to avoid delays, ensuring the model
        is partitioned and executed efficiently. The SMO ensures that model
        partitioning is done optimally based on current system conditions,
        balancing the workload and minimizing resource waste.</t>
        <t>3) Forward Propagation</t>
        <t>In DMSC-LMT, each MI is responsible for executing forward
        propagation for its specific part of the model. Using PP, data flows
        sequentially through each stage of the model, with each MI computing
        its assigned segment in parallel with other stages. The output of one
        MI is routed to the next using the SR, ensuring smooth data transfer
        and efficient computation. Intermediate results can be cached for
        reuse in backward propagation, minimizing redundant computation and
        optimizing training performance.</t>
        <t>4) Backward Propagation</t>
        <t>After the forward pass completes, backward propagation computes
        the gradients needed to update the model parameters. In DMSC-LMT,
        each MI calculates the gradients for its assigned model portion and
        sends them to the previous MI in the pipeline using SR. Gradient
        synchronization is critical in distributed training, and DMSC-LMT
        leverages functional modules to optimize the synchronization process,
        ensuring efficient communication with minimal delay. This enables
        smooth backpropagation across multiple microservices and helps
        maintain high training performance.</t>
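<t>As a non-normative illustration, the reverse gradient flow could be
sketched as follows: each MI computes its own parameter gradient from the
cached forward input and hands the upstream gradient to the previous MI, in
reverse pipeline order. The scalar stages (y = w * x) are purely
illustrative.</t>
<sourcecode type="python"><![CDATA[
```python
# Sketch of gradient flow during backpropagation across pipeline stages.

class Stage:
    def __init__(self, w):
        self.w = w
        self.x = None      # cached input from the forward pass
        self.grad_w = None

    def forward(self, x):
        self.x = x
        return self.w * x

    def backward(self, grad_out):
        self.grad_w = grad_out * self.x   # local parameter gradient
        return grad_out * self.w          # gradient sent to the previous MI

stages = [Stage(2.0), Stage(3.0)]
y = stages[1].forward(stages[0].forward(1.0))   # y = 3 * (2 * 1) = 6
grad = 1.0
for stage in reversed(stages):                  # reverse pipeline order
    grad = stage.backward(grad)
```
]]></sourcecode>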
        <t>5) Gradient Update and Parameter Synchronization</t>
        <t>Following backward propagation, each MI updates its portion of the
        model using the computed gradients. To maintain consistency across the
        distributed system, parameter synchronization is required. In
        DMSC-LMT, the SR mechanism is used to ensure that model parameters are
        updated across all microservices, guaranteeing that the model remains
        synchronized. The MG ensures that the final updated parameters are
        accessible to external systems for evaluation, deployment, or further
        training.</t>
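<t>As a non-normative illustration, the update-and-synchronize step could be
sketched as follows; the version counter standing in for SR-based
synchronization is an assumption of this example, not a mechanism defined by
this document.</t>
<sourcecode type="python"><![CDATA[
```python
# Sketch of local gradient application followed by a consistency check:
# each MI updates its own parameter partition, and training proceeds only
# once every MI reports the same parameter version.

class PartitionedModel:
    def __init__(self, params):
        self.params = params
        self.version = 0

    def apply_gradients(self, grads, lr=0.1):
        # Plain SGD step on this MI's partition.
        self.params = [p - lr * g for p, g in zip(self.params, grads)]
        self.version += 1

def synchronized(instances):
    """All MIs must report the same version before the next iteration."""
    return len({mi.version for mi in instances}) == 1

mis = [PartitionedModel([1.0, 2.0]), PartitionedModel([3.0])]
mis[0].apply_gradients([0.5, 0.5])
in_sync_before = synchronized(mis)   # False: mis[1] has not updated yet
mis[1].apply_gradients([1.0])
in_sync_after = synchronized(mis)    # True: both partitions at version 1
```
]]></sourcecode>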
        <t>6) Error Handling and Recovery</t>
        <t>Distributed systems are prone to errors such as node failures,
        network issues, and computation faults. In DMSC-LMT, the SMO is
        responsible for error handling and recovery. If a failure occurs, the
        SMO dynamically reallocates tasks to available microservices to
        minimize disruption, and it adjusts task allocation based on system
        performance and load. Additionally, functional modules monitor the
        health of the system and apply AI-driven prediction to detect
        impending failures, enabling proactive recovery actions. This dynamic
        error recovery ensures that the distributed training process continues
        without significant interruption.</t>
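<t>As a non-normative illustration, the SMO-style reallocation on failure
could be sketched as follows; the class names and least-loaded heuristic are
illustrative assumptions only.</t>
<sourcecode type="python"><![CDATA[
```python
# Sketch of failure recovery: when an MI fails, the orchestrator reassigns
# each of its tasks to the least-loaded healthy instance.

class Orchestrator:
    def __init__(self):
        self.assignments = {}   # task -> instance
        self.healthy = set()

    def assign(self, task, instance):
        self.assignments[task] = instance
        self.healthy.add(instance)

    def load(self, instance):
        return sum(1 for mi in self.assignments.values() if mi == instance)

    def handle_failure(self, failed):
        """Reallocate every task of the failed MI to healthy peers."""
        self.healthy.discard(failed)
        for task, mi in self.assignments.items():
            if mi == failed:
                target = min(self.healthy, key=self.load)
                self.assignments[task] = target

smo = Orchestrator()
smo.assign("t1", "mi-a")
smo.assign("t2", "mi-b")
smo.assign("t3", "mi-b")
smo.handle_failure("mi-b")   # t2 and t3 migrate to the surviving MI
```
]]></sourcecode>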
      </section>
    </section>
    <section numbered="true" toc="default">
      <name>Conclusion and Outlook</name>
      <t>The DMSC-LMT architecture supports the scalability and flexibility of
      large-scale model training through task-level microservices,
      content-aware communication, functional module inference, and local
      state management. Its decentralized scheduling mechanism and
      microservice autonomy eliminate the centralized bottlenecks typical of
      traditional architectures, enhancing both flexibility and resource
      utilization efficiency. The pluggable intelligent decision-making
      mechanism enables dynamic adjustments based on varying training demands,
      further improving adaptability. Looking ahead, DMSC-LMT will continue to
      evolve in areas such as cross-tenant isolation, enhanced security, and
      cross-cluster deployment optimization, aiming to improve resource
      management capabilities in multi-tenant environments, ensure the
      security of training data and models, and optimize the performance and
      efficiency of large-scale training across global deployments.</t>
    </section>
    <section anchor="iana" numbered="true" toc="default">
      <name>IANA Considerations</name>
      <t>TBD</t>
    </section>
    <section numbered="true" toc="default">
      <name>Acknowledgements</name>
      <t>TBD</t>
    </section>
  </middle>
  <back>
    <references>
      <name>References</name>
      <references>
        <name>Normative References</name>
        <reference anchor="RFC2119" target="https://www.rfc-editor.org/info/rfc2119" xml:base="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.2119.xml">
          <front>
            <title>Key words for use in RFCs to Indicate Requirement Levels</title>
            <author fullname="S. Bradner" initials="S." surname="Bradner"/>
            <date month="March" year="1997"/>
            <abstract>
              <t>In many standards track documents several words are used to signify the requirements in the specification. These words are often capitalized. This document defines these words as they should be interpreted in IETF documents. This document specifies an Internet Best Current Practices for the Internet Community, and requests discussion and suggestions for improvements.</t>
            </abstract>
          </front>
          <seriesInfo name="BCP" value="14"/>
          <seriesInfo name="RFC" value="2119"/>
          <seriesInfo name="DOI" value="10.17487/RFC2119"/>
        </reference>
      </references>
    </references>
  </back>
</rfc>
