Internet-Draft | Abbreviated-Title | July 2022 |
Guo, et al. | Expires 12 January 2023 | [Page] |
NVMe over Fabrics (NoF) defines a common architecture that supports the NVMe block storage protocol over a range of storage networking fabrics, such as Ethernet, Fibre Channel, and InfiniBand. On IP-based networks, RDMA or TCP can be used to transport NVMe, but network fault detection is weak.¶
This document describes the solution requirements for fast fault detection to improve reliability.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 12 January 2023.¶
Copyright (c) 2022 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
For a long time, key storage applications with high performance requirements have mainly been based on FC networks. As transmission rates have increased, the storage medium has evolved from HDDs to solid-state storage, and the protocol has evolved from SATA to NVMe. The emergence of new NVMe technologies brings new opportunities. With the development of the NVMe protocol, its application scenario has been extended from PCIe to other fabrics, which addresses the problems of NVMe scalability and transmission distance. The block storage protocol uses NoF in place of SCSI, reducing the number of protocol interactions between application hosts and storage systems. The end-to-end NVMe protocol greatly improves performance.¶
The fabrics supported by NoF include Ethernet, Fibre Channel, and InfiniBand. Comparisons of FC-NVMe with Ethernet- or InfiniBand-based alternatives generally weigh the advantages and disadvantages of the underlying networking technologies. Fibre Channel fabrics are noted for their lossless data transmission, predictable and consistent performance, and reliability. Large enterprises tend to favor FC storage for mission-critical workloads. However, Fibre Channel requires special equipment and storage networking expertise to operate and can be more costly than IP-based alternatives. Like FC, InfiniBand is a lossless network that requires special hardware. IP-based NVMe storage products tend to be more plentiful than FC-NVMe-based options, and most storage startups focus on IP-based NVMe. Unlike an FC switch, however, an Ethernet switch does not notify endpoints of changes in device status. When a device fails, the host relies on the NVMe link heartbeat mechanism and takes tens of seconds to complete service failover.¶
+--------------------------------------+
|          NVMe Host Software          |
+--------------------------------------+
+--------------------------------------+
|   Host Side Transport Abstraction    |
+--------------------------------------+
   /\      /\      /\      /\      /\
  /  \    /  \    /  \    /  \    /  \
   FC      IB     RoCE   iWARP    TCP
  \  /    \  /    \  /    \  /    \  /
   \/      \/      \/      \/      \/
+--------------------------------------+
|Controller Side Transport Abstraction |
+--------------------------------------+
+--------------------------------------+
|            NVMe SubSystem            |
+--------------------------------------+
This document describes the application scenarios and capability requirements for IP-based NVMe to implement fast fault detection similar to that of FC. The proposal is already under discussion in a working group of the NVMe organization.¶
IP-based NVMe: using RDMA or TCP to transport NVMe through Ethernet¶
FC: Fibre Channel¶
NVMe: Non-Volatile Memory Express¶
NoF: NVMe over Fabrics¶
The IP-based storage network that carries NVMe over RDMA or TCP is shown below. The network mainly includes three types of roles: an initiator (referred to as a host), a switch, and a target (referred to as a storage device). Initiators and targets are also referred to as endpoint devices.¶
             +--+      +--+      +--+      +--+
Host         |H1|      |H2|      |H3|      |H4|
(Initiator)  +/-+      +-,+      +.-+      +/-+
              |         | '.   ,-`|         |
              |         |   `',   |         |
              |         | ,-`  '. |         |
            +-\--+    +--`-+    +`'--+    +-\--+
            | SW |    | SW |    | SW |    | SW |
            +--,-+    +---,,    +,.--+    +-.--+
                `.          `'.,`         .`
                  `.   _,-'`    ``'.,   .`
IP               +--'`+             +`-`-+
Network          | SW |             | SW |
                 +--,,+             +,.,-+
                  .`  `'.,      ,.-``   ',
                .`         _,-'`          `.
            +--`-+    +--'`+    `'---+    +-`'-+
            | SW |    | SW |    | SW |    | SW |
            +-.,-+    +-..-+    +-.,-+    +-_.-+
              | '.  ,-` |         | `.,  .' |
              |   `',   |         |   '.`   |
              | ,-`  '. |         | ,-` `', |
Storage      +-`+      `'\+      +-`+      +`'+
(Target)     |S1|      |S2|      |S3|      |S4|
             +--+      +--+      +--+      +--+
Hosts and storage devices are connected to the network separately. To achieve high reliability, each host and each storage device is connected to two network planes simultaneously. The host can read and write data once an NVMe connection is established between the host and the storage device.¶
When a storage device link fails during operation, the host cannot detect the fault of an indirectly connected device at the transport layer. In IP-based NVMe, the host uses the NVMe heartbeat to detect the status of the storage device. The heartbeat interval is 5 s, so it takes tens of seconds to determine that the storage device is faulty and to perform service switchover using the multipath software. This exceeds the failure tolerance time of core applications. To meet customer experience and business reliability requirements, fault detection and failover for IP-based NVMe need to be enhanced.¶
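To make the timing concrete, the following minimal Python sketch estimates how long heartbeat-based detection can take under a simplified model. Only the 5 s interval comes from the text. The number of consecutive missed heartbeats that triggers a fault declaration (`missed_limit`) and the multipath switchover delay (`switchover_s`) are hypothetical parameters introduced for illustration, not values taken from the NVMe specifications.

```python
HEARTBEAT_INTERVAL_S = 5.0  # heartbeat interval stated in the text


def detection_window_s(interval_s, missed_limit):
    """Return (best, worst) seconds until the host declares a fault.

    Simplified model (an assumption): the host declares the target
    faulty after `missed_limit` consecutive heartbeats go unanswered.
    Best case: the fault occurs just before a heartbeat is due.
    Worst case: the fault occurs just after a successful heartbeat.
    """
    if missed_limit < 1:
        raise ValueError("missed_limit must be at least 1")
    best = interval_s * (missed_limit - 1)
    worst = interval_s * missed_limit
    return best, worst


def worst_case_failover_s(interval_s, missed_limit, switchover_s):
    """Worst-case detection time plus a hypothetical switchover delay."""
    _, worst = detection_window_s(interval_s, missed_limit)
    return worst + switchover_s
```

With a hypothetical limit of three missed heartbeats, `detection_window_s(HEARTBEAT_INTERVAL_S, 3)` gives a window of 10 to 15 seconds before switchover even begins, which is consistent with the "tens of seconds" figure above.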
This proposal describes a fast fault detection solution in which switches participate. The scheme uses the ability of switches to detect faults quickly at the physical layer and link layer. A switch that detects a fault synchronizes the fault information across the IP network, and the fault status is then notified to the endpoint devices.¶
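The switch-side behavior can be sketched as a small in-memory model. This is an illustration under stated assumptions, not a protocol definition: the `Switch` class, its method names, and the flooding-style synchronization between peer switches are all hypothetical, since the text does not specify how fault information is carried between switches.

```python
class Switch:
    """Toy model of a switch that takes part in fault notification."""

    def __init__(self, name):
        self.name = name
        self.endpoints = set()   # directly attached endpoint devices
        self.peers = []          # neighboring switches in the IP network
        self.seen = set()        # faulty devices already processed
        self.notified = []       # (endpoint, faulty_device) notifications sent

    def connect(self, other):
        """Create a bidirectional link between two switches."""
        self.peers.append(other)
        other.peers.append(self)

    def link_down(self, device):
        """Local detection at the physical or link layer."""
        self.endpoints.discard(device)
        self.sync_fault(device)

    def sync_fault(self, device):
        """Record the fault, notify local endpoints, and pass it on."""
        if device in self.seen:
            return               # already handled; stops loops in the topology
        self.seen.add(device)
        for endpoint in sorted(self.endpoints):
            self.notified.append((endpoint, device))
        for peer in self.peers:
            peer.sync_fault(device)
```

In this model, a switch attached to storage device S1 calls `link_down("S1")`, and every switch that still has endpoints attached records a notification for each of them. The `seen` set keeps the synchronization from looping in a redundant topology.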
Fault detection procedure: The host can detect the fault status of the storage device and quickly switch to the standby path.¶
 +----+      +-------+    +-------+    +-------+
 |Host|      |Switch |    |Switch |    |Storage|
 +----+      +-------+    +-------+    +-------+
   |             |            |-+          |
   |             |            |1|          |
   |             |            |-+          |
   |             |<----2------|            |
   |             |            |            |
   |<----3-------|            |            |
   |             |            |            |
   |<----4-------|------------|----------->|
   |             |            |            |
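The host side of the procedure can be sketched in the same spirit. The example assumes that a fault notification identifies the faulty storage port by address and that the multipath software on the host keeps an ordered list of paths across the two network planes. The class name, method names, and addresses are hypothetical and serve only to illustrate switching to the standby path.

```python
class MultipathHost:
    """Toy model of host multipath state driven by fault notifications."""

    def __init__(self, paths):
        # Paths are target port addresses in priority order.
        # dict preserves insertion order, so priority is retained.
        self.state = {path: "up" for path in paths}

    def active_path(self):
        """Return the highest-priority path that is still up, or None."""
        for path, status in self.state.items():
            if status == "up":
                return path
        return None

    def on_fault_notification(self, target_port):
        """Mark the reported path down and return the path now in use."""
        if target_port in self.state:
            self.state[target_port] = "down"
        return self.active_path()
```

For a host created with `MultipathHost(["192.0.2.10", "198.51.100.10"])`, a notification for the first address moves I/O to the second path immediately, without waiting for heartbeats to time out. Notifications for ports the host does not use are ignored.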