February 12, 2011
As data centers continue to scale, it is inevitable that the data center network will migrate from a tree structure to a fabric architecture. Some people have given Ethernet Fabric a narrow definition, as an alternative term for Fibre Channel over Ethernet, but I believe there will be a generic fabric architecture built with Ethernet technology to scale the data center network.
We have seen many discussions of the benefits of the fabric architecture, including higher bandwidth, lower latency, lower cost, and easier virtual machine migration. However, there are few practical guidelines for building such a fabric. Most of the proposals are proprietary and require special hardware or software.
The most popular example of a fabric architecture is the Clos network, also known as the fat-tree, which theoretically can scale to any number of nodes. There are quite a few good papers illustrating the nodes and cables required to build multi-level fat-tree topologies.
There are several challenges in building a fat-tree topology with top-of-rack switches.
- Cables – the number of cables needed to build a fat-tree far exceeds the number of external ports. For a two-level fat-tree, you need 2X cables to connect X nodes; for a three-level fat-tree, you need 4X.
- Stability – that many cables raises stability concerns, since cables fail far more often than the switches themselves.
- Troubleshooting – the massive number of switches and cables makes troubleshooting difficult. Imagine tracing 512 cables in a 256-node cluster.
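The cable arithmetic above is easy to sanity-check. The sketch below illustrates the rule of thumb used in this post (cabling doubles with each additional level); it is not a formal fat-tree derivation:

```python
def fabric_cables(nodes, levels):
    """Cable count per the rule of thumb above: a two-level
    fat-tree needs 2X cables for X nodes, a three-level needs 4X,
    i.e. the cabling doubles with each additional level."""
    return nodes * 2 ** (levels - 1)

print(fabric_cables(256, 2))  # 512 cables for a two-level, 256-node cluster
print(fabric_cables(256, 3))  # 1024 cables at three levels
```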
Building an Ethernet Fabric with Standard Protocols?
Compared to the abundant examples of fat-tree hardware wiring, there are far fewer guidelines on how to configure the switches to distribute traffic through an Ethernet Fabric. Ethernet is a plug-and-play protocol that relies heavily on broadcast packets for discovery. When users start to build a fabric with Ethernet, these broadcast packets create storms and make the fabric essentially unusable.
There have been several innovations to migrate Ethernet to a fabric architecture. Some have been tested and deployed in production environments, while others are still paper designs. Some are already approved by standards committees, while others remain proprietary.
So far, ECMP (Equal-Cost Multi-Path) is the most widely used standard protocol for distributing traffic through a fat-tree Ethernet fabric. ECMP can be configured over static routes or over OSPF. Depending on the size of your fabric and your cable failure rate, you can decide whether to use OSPF (theoretically more failure-resilient) or static routes (simpler to configure and manage). The configuration of ECMP is non-trivial, but feasible. You can find an example configuration here.
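ECMP implementations typically choose among the equal-cost paths by hashing each packet's flow identifiers, so every packet of one flow takes the same path and is never reordered, while different flows spread across the fabric. A hypothetical sketch of that selection (the hash and field names are illustrative, not any vendor's actual algorithm):

```python
import hashlib

def ecmp_next_hop(src_ip, dst_ip, src_port, dst_port, proto, next_hops):
    """Pick one of several equal-cost next hops by hashing the
    flow 5-tuple: deterministic per flow, spread across flows."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.md5(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]

spines = ["10.0.1.1", "10.0.2.1", "10.0.3.1", "10.0.4.1"]
# The same flow always hashes to the same spine:
a = ecmp_next_hop("10.1.0.5", "10.2.0.9", 33012, 80, "tcp", spines)
b = ecmp_next_hop("10.1.0.5", "10.2.0.9", 33012, 80, "tcp", spines)
assert a == b
```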
Even though ECMP is a mature protocol and many switch vendors support it interoperably, users should know that it is a layer-3 protocol and will partition the network into IP subnets. This may impact some virtualization functionality, such as vMotion, in a data center environment.
TRILL (Transparent Interconnection of Lots of Links) has been proposed as the spanning tree replacement for Ethernet fabrics. The protocol is close to standards-committee approval, and many vendors, such as Cisco, have started to roll out proprietary implementations of TRILL. We expect TRILL to be standardized, with vendors starting to support interoperability, by the end of 2011.
The beauty of TRILL is that it is a layer-2 protocol and will make the Ethernet fabric transparent to virtual machines. However, it is still not clear how load balancing will play out in TRILL.
Pronto switches are TRILL-capable, but the software patch is not available yet. We expect to distribute the TRILL patch by the end of 2011.
There are many other proprietary designs for fat-tree Ethernet protocols. For example, UCSD proposed an elegant design called PortLand, which uses positional pseudo MAC addresses in the data center and implements a centralized Fabric Manager to handle ARP resolution.
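PortLand's pseudo MAC (PMAC) encodes a host's location in the topology as pod.position.port.vmid fields packed into the 48 MAC bits. A rough sketch of that encoding (the field widths follow the PortLand paper; the helper itself is hypothetical):

```python
def make_pmac(pod, position, port, vmid):
    """Encode a PortLand-style positional pseudo MAC:
    48 bits laid out as pod(16).position(8).port(8).vmid(16)."""
    value = (pod << 32) | (position << 24) | (port << 16) | vmid
    raw = value.to_bytes(6, "big")
    return ":".join(f"{b:02x}" for b in raw)

# A VM on pod 1, position 2, switch port 3:
print(make_pmac(pod=1, position=2, port=3, vmid=1))  # 00:01:02:03:00:01
```

Because the PMAC is positional, forwarding a packet only requires decoding these fields, which is what lets PortLand avoid large flat MAC tables.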
Another rising project is OpenFlow, which separates the controller from the switching devices. The OpenFlow controller centralizes the topology intelligence and can manage thousands of switching devices. Pronto is one of the sponsors of the OpenFlow project.
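The split is easy to picture as a match/action flow table that the controller fills in while the switch only performs lookups. This is a toy illustration of the idea, not the actual OpenFlow wire protocol or any controller's API:

```python
# Toy flow table: (priority, match-pattern, action) entries,
# installed by a central controller, consulted by a dumb datapath.
flow_table = []

def install_flow(match, action, priority=0):
    """Controller side: push a match pattern and its action."""
    flow_table.append((priority, match, action))
    flow_table.sort(key=lambda entry: -entry[0])  # highest priority first

def lookup(packet):
    """Switch side: act on the first matching entry, or punt
    unknown traffic back to the controller."""
    for _, match, action in flow_table:
        if all(packet.get(k) == v for k, v in match.items()):
            return action
    return "send-to-controller"

install_flow({"dst_mac": "00:01:02:03:00:01"}, "output:port-3", priority=10)
print(lookup({"dst_mac": "00:01:02:03:00:01"}))  # output:port-3
print(lookup({"dst_mac": "ff:ff:ff:ff:ff:ff"}))  # send-to-controller
```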
Microsoft has also published a paper describing a data center network architecture called VL2, which separates application IP addresses from physical IP addresses and maps between them.
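In VL2's terms these are application addresses (AAs) and location addresses (LAs), resolved through a directory service; a VM keeps its AA while its LA changes as it moves. A toy sketch of that indirection (the addresses and helper names here are made up):

```python
# Hypothetical directory service mapping stable application
# addresses (AAs) to the location addresses (LAs) of the
# servers currently hosting them, in the spirit of VL2.
directory = {"20.0.0.5": "10.3.1.7"}  # AA -> LA

def resolve(aa):
    """Look up where an application address currently lives."""
    return directory[aa]

def migrate(aa, new_la):
    """VM migration only updates the directory entry;
    the application address itself never changes."""
    directory[aa] = new_la

migrate("20.0.0.5", "10.8.2.4")
print(resolve("20.0.0.5"))  # 10.8.2.4
```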
While we have identified the topology to build an Ethernet Fabric and the protocols to distribute traffic, there are still many factors to tune in order to make the fabric work.
- Congestion control – One of the big issues in a fabric environment is how to load-balance traffic and avoid congestion. It is also important to make sure flow control is enabled to avoid packet drops in an Ethernet fabric.
- Broadcast suppression – Even if we can avoid broadcast storms by using new protocols such as TRILL, it is important to reduce the amount of broadcast traffic in a fabric topology. A popular trick is to implement an ARP proxy at the leaf switches, so ARP packets are intercepted and handled at the edge.
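The ARP-proxy trick can be sketched as a leaf switch answering requests from a locally learned cache and only escalating misses, so ARP never floods the whole fabric. The cache contents and the escalation target below are hypothetical:

```python
# Hypothetical leaf-switch ARP proxy: answer from a cache learned
# at the edge; on a miss, escalate instead of flooding the fabric.
arp_cache = {"10.1.0.9": "00:01:02:03:00:09"}  # IP -> MAC

def handle_arp_request(target_ip):
    mac = arp_cache.get(target_ip)
    if mac is not None:
        return ("reply", mac)  # answered at the edge, no broadcast
    return ("escalate", None)  # cache miss: ask a central resolver

print(handle_arp_request("10.1.0.9"))  # ('reply', '00:01:02:03:00:09')
```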