Internet-Draft | Network Performance Digital Twin | July 2022 |
Paillisse, et al. | Expires 12 January 2023 |
This draft introduces the concept of a Network Digital Twin (NDT) for performance evaluation. A Performance NDT is able to produce performance estimates (delay, jitter, loss) of a given input network with a specific topology, traffic demand, and routing and scheduling configuration. Also, this draft discusses the interface of the digital twin, how it relates to existing control plane elements, use cases, and possible implementation options.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 12 January 2023.¶
Copyright (c) 2022 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
A Digital Twin for computer networks is a virtual replica of an existing network with a behavior equivalent to that of the real one. The key advantage of a Network Digital Twin (NDT) is the ability to recreate the complexities and particularities of the network infrastructure without the deployment cost of a real network. Hence, network administrators can test, deploy and modify network configurations safely, without worrying about the impact on the real network. Once the administrator has found a configuration that fulfills the expected objectives, it is deployed to the real network. In addition, working with an NDT is faster, safer and more cost-effective than interacting with the physical network. All these characteristics make an NDT useful for different network management tasks, ranging from network planning and troubleshooting to optimization.¶
The concept of an NDT has been proposed for different application domains: network management [I-D.draft-zhou-nmrg-digitaltwin-network-concepts], 5G networks [digital-twin-5G], vehicular networks [digital-twin-vanets], artificial intelligence [digital-twin-AI], or Industry 4.0 [digital-twin-industry], among others.¶
This draft proposes a Digital Twin for network management with a focus on performance evaluation. That is, given several input parameters (topology, traffic matrix, etc.), a Network Performance Digital Twin (NPDT) predicts network performance metrics such as delay (per path or per link), jitter, or loss. This draft defines the inputs and outputs of such a Digital Twin, the associated interfaces with other modules in the network control plane, and details use cases.¶
In addition, this draft discusses possible implementation options for the NPDT, with a special emphasis on those based on Machine Learning. The aim of Section 7 (Implementation Challenges) is to describe the advantages and limitations of these techniques. For example, most Machine Learning technologies rely heavily on large amounts of data to achieve acceptable accuracy. Other considerations include adjusting the architecture of the Neural Network so that it can successfully understand the structure of the input data.¶
In order to use a Network Performance Digital Twin (NPDT) in practical scenarios (cf. Section 6), such as network optimization, it should meet certain requirements:¶
Note that the inputs and outputs described here are an example, but other inputs and outputs are possible depending on the specificities of each scenario.¶
Figure 1 presents an overview of the architecture of a Network Performance Digital Twin (NPDT).¶
Each element is defined as:¶
And the functions of each interface are:¶
This interface can be a simple CLI or a state-of-the-art GUI, depending on the final product. In summary, it has to offer the network administrator the following options/features:¶
This interface is used to configure the Physical Network with the configuration parameters obtained from the optimizer. It can be composed of one or more IETF protocols for network configuration. A non-exhaustive list is: NETCONF [RFC6241], RESTCONF/YANG [RFC8040], PCE [RFC4655], OVSDB [RFC7047], or LISP [RFC6830]. It is also possible to use other standards defined outside the IETF that allow the configuration of elements in the forwarding plane, e.g., OpenFlow [OFspec] or P4 Runtime [P4Rspec].¶
This interface can be defined with any widespread data format, such as CSV files or JSON objects. There are two groups of data. We are assuming a network with N nodes.¶
Note that this is an example of the inputs/outputs of a performance NPDT, but other inputs and outputs are possible depending on the specificities of each scenario.¶
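As an illustration of the data format, the following Python sketch serializes one possible NPDT input and output as JSON objects for a network with N = 3 nodes. The field names (`topology`, `traffic_matrix`, `routing`, `scheduling`, `delay_ms`, etc.), the helper `validate_input`, and all values are hypothetical and are not defined by this draft; they only show how such inputs and outputs could be laid out.

```python
import json

N = 3  # number of nodes in the example network (assumed value)

# Hypothetical NPDT input: field names and values are illustrative only.
npdt_input = {
    "topology": {
        # Directed links as [source, destination, capacity in Mbps].
        "links": [[0, 1, 100], [1, 2, 100], [0, 2, 50]],
    },
    # N x N traffic matrix: entry [i][j] is the demand from node i to node j.
    "traffic_matrix": [
        [0, 10, 5],
        [8, 0, 12],
        [4, 6, 0],
    ],
    # Routing: path (list of nodes) per source-destination pair (subset shown).
    "routing": {"0-1": [0, 1], "1-2": [1, 2], "0-2": [0, 1, 2]},
    "scheduling": {"policy": "FIFO"},
}

# Hypothetical NPDT output: per-path performance estimates.
npdt_output = {
    "delay_ms": {"0-1": 2.0, "1-2": 2.1, "0-2": 4.1},
    "jitter_ms": {"0-1": 0.1, "1-2": 0.2, "0-2": 0.3},
    "loss": {"0-1": 0.0, "1-2": 0.0, "0-2": 0.0},
}


def validate_input(data, n):
    """Check that the traffic matrix is n x n and links use valid node ids."""
    matrix = data["traffic_matrix"]
    if len(matrix) != n or any(len(row) != n for row in matrix):
        return False
    return all(0 <= src < n and 0 <= dst < n
               for src, dst, _capacity in data["topology"]["links"])


encoded = json.dumps(npdt_input)   # what a client could send to the NPDT
decoded = json.loads(encoded)      # what the NPDT would parse
print(validate_input(decoded, N))
```

A CSV encoding would carry the same information, for instance one file for the links and one for the traffic matrix.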
Since the NPDT is a type of Network Digital Twin, its elements can be mapped to the reference architecture of an NDT described in [I-D.draft-zhou-nmrg-digitaltwin-network-concepts]. Table 1 maps the elements of the NDT reference architecture to those of the NPDT. Note that the Physical Network is the same for both architectures.¶
NDT Reference Architecture layer | NDT Reference Architecture element | This draft |
---|---|---|
Application Layer | | Intent-Based Interface |
Application Layer | | Optimizer |
Digital Twin Layer | Management | Management Plane |
Digital Twin Layer | Service Mapping Models | Network Performance Digital Twin |
Digital Twin Layer | Data Repository | Optional in production deployments |
Physical Network | Data Collection | Measurement Interface |
Physical Network | Control | Configuration Interface |
The size and traffic of networks have doubled every year [network-capacity]. To accommodate this growth in users and network applications, networks need periodic upgrades. For example, ISPs might want to increase certain link capacities or add new connections to alleviate the burden on the existing infrastructure. This is typically a cumbersome process that relies on expert knowledge. Furthermore, modern networks are becoming larger and more complex, which makes it harder for existing planning solutions to scale [planning-scalability].¶
Since the NPDT models large infrastructures and can produce accurate and fast performance estimates, it can help in different tasks related to network capacity and planning:¶
The NPDT is a unique tool to perform what-if analysis, that is, to analyze the impact of potential scenarios and configurations safely, without any impact on the real network. In this context, the NPDT acts as a safe sandbox where different configurations are applied to understand their impact on the network. Some examples of what-if analysis are:¶
There are many factors that cause network failures (e.g., invalid network configurations, unexpected protocol interactions). Debugging modern networks is complex and time consuming. Currently, troubleshooting is typically done by human experts with years of experience using networking tools.¶
Network operators can leverage an NPDT to reproduce previous network failures, in order to find the source of service disruptions. Specifically, network operators can replicate past network failure scenarios and analyze their impact on network performance, making it easier to find specific configuration errors. In addition, the NPDT helps in finding more robust network configurations that prevent service disruptions in the future.¶
Since the NPDT models the behaviour of a real-world network, network operators have access to an estimation of the expected network behaviour. When the real-world network behaviour deviates from the NPDT's behaviour, it can act as an indicator of an anomaly in the real-world network. Such anomalies can appear at different places in a network (e.g., core, edge, IoT), and different data sources can be used to detect such anomalies.¶
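The comparison between expected and observed behaviour can be sketched as follows. This is a minimal example, assuming per-path delay as the monitored metric and a relative deviation threshold of 20%. The function name `find_anomalies`, the threshold, and the delay values are hypothetical and are not specified by this draft.

```python
def find_anomalies(predicted, measured, threshold=0.2):
    """Return the paths whose measured delay deviates from the NPDT estimate.

    predicted, measured: dicts mapping a path identifier to a delay value.
    threshold: maximum tolerated relative deviation (assumed value).
    """
    anomalies = {}
    for path, expected in predicted.items():
        if path not in measured or expected <= 0:
            continue  # nothing to compare against for this path
        deviation = abs(measured[path] - expected) / expected
        if deviation > threshold:
            anomalies[path] = deviation
    return anomalies


# Hypothetical delays in milliseconds.
predicted_delay = {"A-B": 10.0, "A-C": 20.0, "B-C": 5.0}
measured_delay = {"A-B": 10.5, "A-C": 31.0, "B-C": 5.2}

# Only path A-C deviates by more than 20% from the NPDT estimate.
print(find_anomalies(predicted_delay, measured_delay))
```

In practice, the threshold would have to account for the prediction error of the NPDT itself, so that normal estimation error is not reported as an anomaly.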
As discussed before, the NPDT can be understood as a safe playground where misconfigurations don't affect the real-world system performance. In this context, the NPDT can play an important role in improving the education and certification process of network professionals, both in basic networking training and advanced scenarios. For example:¶
Since the NPDT can provide performance estimates in short timescales, it is possible to pair it with a network optimizer (Figure 2). The network administrator defines one or more optimization objectives, e.g., a maximum average delay for all paths in the network. The optimizer can be implemented with a classical optimization algorithm, like Constraint Programming [DEFO] or Local Search [LS], or with a Machine Learning one, such as Deep Neural Networks [DNN-TM] or Multi-Agent Reinforcement Learning [MARL-TE]. Regardless of the implementation, the optimizer tests various configurations to find the network configuration parameters that satisfy the optimization objectives. In order to know the performance of a specific network configuration, the optimizer sends that configuration to the NPDT, which predicts its performance metrics.¶
An example optimization use case is multi-objective optimization: commonly, the network administrator defines a set of optimization goals that must be met concurrently [DEFO], for example:¶
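The interaction between the optimizer and the NPDT can be sketched as a simple loop. The example below is only an illustration under strong assumptions. The optimizer is a plain search over a list of candidate routing configurations, not one of the algorithms cited above. The NPDT is replaced by a stand-in function, `toy_npdt`, that adds fixed per-link delays. All names, topologies and delay values are hypothetical.

```python
def toy_npdt(configuration):
    """Stand-in for the NPDT: return a predicted delay (ms) per path.

    Hypothetical model: the delay of a path is the sum of fixed link delays.
    """
    link_delay_ms = {("A", "B"): 2.0, ("B", "C"): 2.0, ("A", "C"): 7.0,
                     ("A", "D"): 1.0, ("D", "C"): 1.5}
    return {
        pair: sum(link_delay_ms[(u, v)] for u, v in zip(path, path[1:]))
        for pair, path in configuration.items()
    }


def optimize(candidates, npdt, max_avg_delay_ms):
    """Return the first candidate whose predicted average delay meets the
    objective, together with that average; (None, None) if none does."""
    for configuration in candidates:
        delays = npdt(configuration)           # query the digital twin
        average = sum(delays.values()) / len(delays)
        if average <= max_avg_delay_ms:
            return configuration, average
    return None, None


# Candidate routing configurations: one path per source-destination pair.
candidates = [
    {("A", "C"): ["A", "C"], ("A", "B"): ["A", "B"]},
    {("A", "C"): ["A", "B", "C"], ("A", "B"): ["A", "B"]},
    {("A", "C"): ["A", "D", "C"], ("A", "B"): ["A", "B"]},
]

best, average = optimize(candidates, toy_npdt, max_avg_delay_ms=3.5)
print(best, average)
```

Because every candidate is evaluated on the NPDT and not on the real network, only the configuration that meets the objective needs to be deployed.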
This section presents different technologies that can be used to build an NPDT, and details the advantages and disadvantages of each. It takes into account how they perform with respect to the requirements of accuracy, speed, and scale of the NPDT predictions.¶
Packet-level simulators, such as OMNET++ [OMNET] and NS-3 [ns-3], simulate network events. In a nutshell, they simulate the operation of a network by processing a series of events, such as the transmission of a packet or the enqueuing and dequeuing of packets in a router. Hence, they offer excellent accuracy when predicting network performance metrics (delay, jitter and loss), but they take a significant amount of time to run the simulation: their run time scales linearly with the number of packets to simulate.¶
In fact, the simulation time depends on the number of events to process [limitations-net-sim]. This limits the scalability of simulators, even if the topology does not change: higher traffic intensities take longer to simulate because more packets enter the network per unit of time. Likewise, simulating the same traffic intensity in larger topologies also increases the simulation time. For example, consider a simulator that takes 11 hours to process 4 billion events (these values are obtained from an actual simulation). Although 4 billion events may appear to be a large figure, consider:¶
These figures show that, despite the high accuracy of network simulators, they take too much time to calculate performance estimations.¶
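To put the figures above in perspective, the following sketch derives the event throughput of the simulator from the two values given in the text (11 hours, 4 billion events), and then estimates the wall-clock time needed to simulate a single link. The link speed, the packet size, the full link load, and the ratio of one simulator event per packet are assumptions made for this example, not values from this draft.

```python
SIM_EVENTS = 4e9   # events processed (figure from the text)
SIM_HOURS = 11     # wall-clock time of that simulation (figure from the text)

# Events the simulator processes per second of wall-clock time.
events_per_second = SIM_EVENTS / (SIM_HOURS * 3600)


def wall_clock_seconds(link_gbps, packet_bytes, simulated_seconds,
                       events_per_packet=1):
    """Estimate the wall-clock time needed to simulate one loaded link.

    Assumptions (not from the draft): fixed packet size, link at full load,
    and a constant number of simulator events per packet.
    """
    packets_per_second = link_gbps * 1e9 / (packet_bytes * 8)
    events = packets_per_second * simulated_seconds * events_per_packet
    return events / events_per_second


print(round(events_per_second))                    # about 101 thousand
print(round(wall_clock_seconds(10, 1000, 1), 1))   # seconds of wall-clock time
```

Under these assumptions, simulating one second of traffic on a single 10 Gbps link with 1000-byte packets takes roughly 12 seconds, and a real simulator generates several events per packet and per hop, so the actual time would be higher.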
Network emulators run the original network software in a virtualized environment. This makes them easy to deploy, and depending on the emulation hardware, they can produce reasonably fast estimations. However, for large-scale networks their speed will eventually decrease because they do not use hardware built specifically for networking. For fully-virtualized networks, emulating a network requires as many resources as the real one, which is not cost-effective.¶
In addition, some studies have reported variable accuracy depending on the emulation conditions, both the parameters and underlying hardware and OS configurations [emulation-perf]. Hence, emulators show some limitations if we want to build a fast and scalable NPDT. However, emulators are useful in other use cases, for example in training, debugging, or testing new features.¶
Queueing Theory (QT) is an analytical tool that models computer networks as a series of interconnected queues. The key advantage of QT is its speed, because the queues are evaluated analytically with mathematical equations instead of being simulated. QT is arguably the most popular modeling technique, and it is a well-established framework that can model complex and large networks.¶
However, the main limitation of QT is the traffic model: although it offers high accuracy for Poisson traffic models, it presents poor accuracy under realistic traffic models [qt-precision]. Internet traffic has been extensively analyzed in the past two decades and, although the community has not agreed on a universal model, there is consensus that aggregated traffic generally shows strong autocorrelation and a heavy tail [inet-traffic].¶
Finally, Neural Networks (NN) and other Machine Learning (ML) tools are as fast as QT (in the order of milliseconds), and can provide similar accuracy to that of packet-level simulators. They represent an interesting alternative, but have two key limitations. First, they require training the NN with a large amount of data from a wide range of network scenarios: different routings, topologies, scheduling configurations, as well as link failures and network congestion. This dataset may not always be accessible, or easy to produce in a production network (see Section 8). Second, not all NNs keep their accuracy when scaling to larger topologies; therefore, some use cases need custom NN architectures.¶
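As an example of why QT is fast, the sketch below evaluates the textbook M/M/1 queue, whose mean time in the system is 1 / (mu - lambda) for an arrival rate lambda below the service rate mu. Modeling each link as an independent M/M/1 queue and adding the per-link delays along a path is a simplifying assumption of this example, and it relies on exactly the Poisson traffic model whose limitations are discussed above. The function names and the rates are hypothetical.

```python
def mm1_delay(arrival_rate, service_rate):
    """Mean time in an M/M/1 queue (waiting plus service), in seconds.

    Rates are in packets per second; the queue is stable only when the
    arrival rate is strictly below the service rate.
    """
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: arrival rate must be below "
                         "service rate")
    return 1.0 / (service_rate - arrival_rate)


def path_delay(links, arrival_rates, service_rates):
    """Sum of per-link M/M/1 delays (links assumed independent)."""
    return sum(mm1_delay(arrival_rates[link], service_rates[link])
               for link in links)


# A 10 Mbps link with 1000-byte packets serves 1250 packets per second.
service = {"A-B": 1250.0, "B-C": 1250.0}
arrivals = {"A-B": 1000.0, "B-C": 750.0}

print(mm1_delay(1000.0, 1250.0))                    # delay on link A-B
print(path_delay(["A-B", "B-C"], arrivals, service))
```

Each estimate is a closed-form expression, so the cost grows with the number of links and not with the number of packets.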
A MultiLayer Perceptron [MLP] is a basic kind of NN from the family of feedforward NNs. In short, input data is propagated unidirectionally from the input layer of neurons to the output layer. There may be an arbitrary number of hidden layers between the input and output layers. MLPs are widely used for basic ML applications, such as regression.¶
Recurrent Neural Networks [RNN] are a more advanced type of NN because they connect some layers to the previous ones, which gives them the ability to store state. They are mostly used to process sequential data, such as handwriting, text, or audio. They have been used extensively in speech processing [RNN-speech] and, more generally, in Natural Language Processing applications [NLP].¶
Convolutional Neural Networks (CNN) are Deep Learning NNs designed to process structured arrays of data, such as images. CNNs are highly performant when detecting patterns in the input data. This makes them widely used in computer vision tasks, and they have become the state of the art for many visual applications, such as image classification [CNN-images]. Because they are designed for array-structured data, their current design presents limited applicability to computer networks.¶
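The unidirectional propagation described above can be sketched in a few lines. The following example is a forward pass only, with no training. The layer sizes and the hand-picked weights are hypothetical, and a real MLP would learn its weights from a dataset.

```python
def relu(value):
    """Rectified linear activation."""
    return max(0.0, value)


def identity(value):
    """Linear activation, typically used in a regression output layer."""
    return value


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b) for each neuron."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + bias)
        for row, bias in zip(weights, biases)
    ]


def mlp_forward(inputs, layers):
    """Propagate the inputs through every layer, from input to output."""
    activations = inputs
    for weights, biases, activation in layers:
        activations = dense(activations, weights, biases, activation)
    return activations


# Two inputs, one hidden layer with two neurons, one output (regression).
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], relu),   # hidden layer
    ([[1.0, 2.0]], [0.1], identity),                 # output layer
]

print(mlp_forward([2.0, 1.0], layers))
```

Note that the size of the input layer is fixed, which is one reason why a plain MLP has difficulties with inputs of variable structure, such as network topologies of different sizes.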
Graph Neural Networks [GNN] are a type of neural network designed to work with graph-structured data. A relevant type of GNN with interesting characteristics for computer networks is the Message Passing Neural Network (MPNN). In a nutshell, an MPNN exchanges a set of messages between the graph nodes in order to understand the relationship between the input graph and the expected outputs of the training dataset. MPNNs are composed of three functions, which are repeated for several iterations, depending on the size of the graph:¶
Note that the internal architecture of an MPNN is rebuilt for each input graph.¶
This ability to understand graph-structured data naturally renders GNNs interesting for a Network Performance Digital Twin. Since computer networks are fundamentally graphs, GNNs have the potential to take as input a graph of the network, and produce as output performance estimations of that input network [qt-precision].¶
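The message-passing mechanism can be illustrated with a small sketch. It follows the common MPNN formulation with a message, an aggregation and an update function, and assumes a sum as the aggregation. In a real MPNN the message and update functions are learned neural networks. Here they are replaced by fixed arithmetic functions with hypothetical coefficients, so the example shows only how information flows through the graph, not how performance is predicted.

```python
def message(sender_state, receiver_state):
    """Message sent along one edge (fixed stand-in for a learned function)."""
    return [0.5 * s + 0.1 * r for s, r in zip(sender_state, receiver_state)]


def aggregate(messages, size):
    """Combine all messages received by a node with an element-wise sum."""
    total = [0.0] * size
    for msg in messages:
        total = [t + x for t, x in zip(total, msg)]
    return total


def update(state, aggregated):
    """New node state (fixed stand-in for a learned function)."""
    return [0.5 * h + 0.5 * a for h, a in zip(state, aggregated)]


def message_passing(states, edges, iterations):
    """Run the three functions over a directed graph for some iterations."""
    size = len(next(iter(states.values())))
    for _ in range(iterations):
        inbox = {node: [] for node in states}
        for src, dst in edges:
            inbox[dst].append(message(states[src], states[dst]))
        states = {node: update(states[node], aggregate(inbox[node], size))
                  for node in states}
    return states


# Line topology A - B - C, with one directed edge per direction.
edges = [("A", "B"), ("B", "A"), ("B", "C"), ("C", "B")]
initial = {"A": [1.0], "B": [0.0], "C": [0.0]}

print(message_passing(initial, edges, iterations=1))  # C not yet reached
print(message_passing(initial, edges, iterations=2))  # C now depends on A
```

The same three functions are applied to any set of nodes and edges, so the computation is assembled from the input graph itself. The example also shows why the number of iterations relates to the size of the graph: the state of node A needs two iterations to influence node C.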
Figure 3 presents a comparison of different types of NN that predict the delay of a given input network. We use a dataset of the performance of different network topologies, created with simulation data (i.e., ground truth) from OMNET++. We measure the error relative to the delay of the simulation data. In order to evaluate how well the different NNs deal with different network topologies, we train each NN in three different scenarios:¶
We can see that all NNs predict the network delay with excellent accuracy if we don't change the topology used during training. However, when it comes to new topologies, the error of the MLP is unacceptable (1150%), as is that of the RNN (around 30%). On the other hand, the GNN can understand new topologies, with an error below 2%. Similarly, if a link fails, the RNN has difficulties offering accurate predictions (60% error), while the GNN maintains its accuracy (4.2% error). These results show the potential of GNNs to build a Network Performance Digital Twin.¶
Digital Twins based on Machine Learning require a training process before they can be deployed. Commonly, the training process makes use of a dataset of inputs and expected outputs, which guides the training process to adjust the internal architecture of, e.g., the neural network. There are some caveats regarding the training process:¶
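An error relative to the simulated delay can be computed as in the following sketch. This draft does not specify the exact metric behind the percentages above, so the mean absolute relative error used here is an assumption, and the delay values are hypothetical and unrelated to the reported results.

```python
def mean_relative_error(predicted, ground_truth):
    """Mean of |prediction - truth| / truth, expressed as a percentage."""
    if len(predicted) != len(ground_truth) or not ground_truth:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [abs(p - t) / t for p, t in zip(predicted, ground_truth)]
    return 100.0 * sum(errors) / len(errors)


# Hypothetical per-path delays in milliseconds.
simulated = [10.0, 20.0, 40.0]   # ground truth from the simulator
estimated = [11.0, 19.0, 40.0]   # predictions of the model under test

print(mean_relative_error(estimated, simulated))
```

With these values the individual relative errors are 10%, 5% and 0%, which gives a mean relative error of 5%.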
This memo includes no request to IANA.¶
An attacker can alter the software image of the NPDT. This could produce inaccurate performance estimations, which could result in network misconfigurations, disruptions or outages. Hence, in order to prevent the accidental deployment of a malicious NPDT, the software image of the NPDT MUST be digitally signed by the vendor.¶