Earlier this September, I attended the Tech Field Day Networking Field Day 8 event. Over the course of three days, we saw presentations from many very interesting vendors including a mix of startups and established market leaders. One trend that really stuck out to me more this time around than at any previous NFD event was a nearly ubiquitous emphasis on data center network fabric management. In other words, truly managing an entire data center network (or at least a sub-block of it) as a single unit.
Just among the NFD8 presenters offering this capability, we had Cisco with their ACI model (though it stands to reason that even the now-well-established FEX model has very similar capabilities), Big Switch Networks with their Big Cloud Fabric, Pluribus Networks with their Netvisor Software Defined Fabric, and Nuage Networks with their Virtual Services Platform. Each of these products has unique value propositions, so I’m not suggesting they’re all the same but rather pointing out that this concept of fabric-level management is clearly at the forefront of most, if not all, leading-edge data center solutions at this point. The concept has been building for a couple of years, and other vendors beyond those mentioned above are also pursuing this model. And I have to say: it’s about damn time!
Fabric-wide management is wholly different from past attempts to manage fleets of network equipment in a unified manner. In the past, network management tools used SNMP, or even Telnet or SSH screen-scraping and session interaction, to try to apply identical policy and configuration to a number of devices, but at the end of the day those devices were, by design, discrete components. It was easy to get a device out of sync when using these techniques, and in many networks I’ve seen, such tools weren’t even deployed in a way that could hope to manage the entire data center network cohesively. Fabric management mechanisms typically rely on a multi-phase commit protocol, such as Three Phase Commit (3PC), which essentially ensures that a change does not get applied to any device in the fabric until it can be applied to all devices in the fabric. That is an assurance an Expect script can never provide. Even if a CLI-interaction script runs in parallel rather than serially, discovering that a change failed to apply on one device, and then backing it out on every device where it had already completed, is very difficult.
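To make the contrast concrete, here is a minimal sketch of that all-or-nothing behavior using a simple prepare/commit pattern (real fabric controllers use two- or three-phase commit variants). The `Device` class and its methods are hypothetical stand-ins for a switch's management agent, not any vendor's actual API:

```python
# Hypothetical sketch of an all-or-nothing fabric change. Each device
# first stages (prepares) the change; only if every device stages it
# successfully does the change get committed anywhere.

class Device:
    def __init__(self, name):
        self.name = name
        self.staged = None   # change staged but not yet active
        self.config = {}     # active configuration

    def prepare(self, change):
        # Validate and stage the change without activating it.
        if not isinstance(change, dict):
            raise ValueError(f"{self.name}: invalid change")
        self.staged = change

    def commit(self):
        # Activate the previously staged change.
        self.config.update(self.staged)
        self.staged = None

    def abort(self):
        # Discard the staged change.
        self.staged = None


def apply_fabric_change(devices, change):
    """Apply `change` to every device, or to none of them."""
    prepared = []
    for dev in devices:
        try:
            dev.prepare(change)
            prepared.append(dev)
        except Exception:
            # One device failed to stage: back the change out everywhere
            # it was already staged, and touch nothing else.
            for p in prepared:
                p.abort()
            return False
    # Every device accepted the change; now it is safe to activate it.
    for dev in devices:
        dev.commit()
    return True
```

This is exactly the guarantee a screen-scraping script struggles to provide: the rollback path is built in, and no device activates the change until all of them have accepted it.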
So what’s the perfect storm that is suddenly helping this seemingly-obvious concept take hold right now? I think there are several key factors:
First, OpenFlow and related technologies, by their very nature, promote the management of a distributed forwarding plane using a unified control plane. This is one of the primary definitions of “Software Defined Networking.” Most of these new fabric-managed solutions are built on (or using) commodity merchant-silicon chipsets like the Broadcom Trident series, which are ready-built for this mode of operation.
Second, the push for strong API support in newer equipment for network automation has resulted in a need for a central network control point. The fewer points an automation tool has to interact with, the fewer places something can get out of sync and result in a Bad Day ©.
Third, we are simply starting to build data center networks as true fabrics. In other words, the new (well, newly applied to data networking) spine-leaf Clos network architecture is resulting in massively-multipathed data center network designs, which drives up the complexity of managing many interswitch links and a spine layer of multiple switches (rather than a core layer of typically just two). It may well be untenable to manage such a design with individual touch points over the long haul. Consider adding a new VLAN or changing the QOS policy on 64 leaf switches, 6 spine switches, and the 384 interswitch links that connect the two layers! By treating the fabric as a single entity, we can ensure high levels of consistency that previously required manual scrutiny or the aforementioned ill-fitted tools to try to maintain.
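The touch-point arithmetic works out like this (the 64-leaf/6-spine figures come from the example above; the link count assumes every leaf uplinks to every spine, as in a standard full-mesh Clos fabric):

```python
# Touch points in a full-mesh spine-leaf fabric: every leaf has one
# uplink to every spine, so links scale multiplicatively.
leaves, spines = 64, 6

interswitch_links = leaves * spines   # 64 * 6 = 384 uplinks to keep consistent
switches_to_touch = leaves + spines   # 70 devices for one VLAN or QOS change

print(interswitch_links, switches_to_touch)  # 384 70
```

With individual touch points, every one of those 70 devices (and the policy on all 384 links) is a chance to fat-finger something; with fabric-level management it is one change, applied once.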
Lastly, scale. Now, here’s what I think is interesting. Early on (that is, a year or two ago), most vendors seemed to be looking at fabric management as something that was basically necessary for managing a data center that could grow to thousands of ports in each of multiple pods, etc. But I think there’s also an argument to be made for fabric management even at the small end of the networking spectrum.
Consider a server environment with, say, 6 server racks, each with twin top-of-rack switches (for dual-homing hosts in that rack) connected up to a small spine of two switches or a traditional collapsed-core design (very typical for a small data center network of this size). That’s still a total of 14 switches which typically must be configured identically with things like VLAN definitions, QOS, management controls, security parameters and features, and service-specific settings. Even in that small environment, adding a new server VLAN can take a surprising amount of time to do manually (with risks that the VLAN ends up with inconsistent name attributes on different devices, etc.), or falls back to the old automation methods which, as noted above, have serious shortcomings in their ability to reliably execute a change across the entire environment.
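As a rough illustration of that drift risk, here is a small consistency check over hand-configured switches; the switch names and per-switch VLAN tables are invented for the example:

```python
# Detect VLAN name drift across hand-configured switches.
# The data below is invented for illustration.
vlan_tables = {
    "tor-1a": {100: "web", 200: "db"},
    "tor-1b": {100: "web", 200: "db"},
    "tor-2a": {100: "web", 200: "database"},  # typo'd during manual config
}

def find_drift(tables):
    """Return {vlan_id: {name: [switches]}} for inconsistently named VLANs."""
    names = {}
    for switch, vlans in tables.items():
        for vid, name in vlans.items():
            names.setdefault(vid, {}).setdefault(name, []).append(switch)
    # Keep only VLAN IDs that appear under more than one name.
    return {vid: by_name for vid, by_name in names.items() if len(by_name) > 1}

print(find_drift(vlan_tables))  # flags VLAN 200, which has two names
```

A fabric-managed network makes this entire class of audit unnecessary: the VLAN is defined once, at the fabric level, and cannot diverge per switch.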
In talking with the vendors at NFD8, I found that they’ve started to recognize the applicability of data center network fabrics even in smaller settings. Several of the vendors confirmed that their solutions are both cost-effective and reasonable to deploy in environments of fewer than 8-10 racks of computing equipment (which describes a decent portion of the environments I work with regularly).
I really think that fabric management will be a boon even to smaller data center networks due to the simplified (and automatically consistent) configuration capabilities, and the fabric-wide monitoring and troubleshooting features. No more hunting a MAC address switch-by-switch to find out where something is connected — just query the fabric for the location of MAC address X. There should no longer be a question of whether the proper QOS policy is assigned to an uplink — all uplinks are treated the same by the fabric controller. How about changing a destination log server or an authentication server for administrative access? A single change at the control plane level covers the entire fabric.
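A fabric-wide MAC lookup might be sketched like this; the table and the `locate` function are hypothetical, standing in for whatever query API a given controller actually exposes:

```python
# Hypothetical fabric-wide MAC lookup: the controller aggregates every
# switch's forwarding table, so one query replaces a switch-by-switch hunt.
fabric_mac_table = {
    "00:1b:21:3a:4f:10": ("leaf-3", "Ethernet1/14"),
    "00:1b:21:9c:00:7e": ("leaf-7", "Ethernet1/2"),
}

def locate(mac):
    """Return (switch, port) for a MAC address, or None if unknown."""
    return fabric_mac_table.get(mac.lower())

print(locate("00:1B:21:3A:4F:10"))  # ('leaf-3', 'Ethernet1/14')
```

The point is not the lookup itself but where the data lives: one authoritative, fabric-wide view instead of fourteen (or seventy) separate tables to interrogate.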
To be sure, there are potential pitfalls here too. For one thing, a configuration goof instantly applies everywhere. Also, I (and, I think, we as an industry) tend to assume that the controller(s) will always be able to communicate with all of the fabric switches. We hope that these network controllers don’t become a bottleneck of some sort (think CAM table updates) or a single point of failure.
There will be challenges to overcome, and changes in thinking to be made, but I’m really excited by the potential of this new network management paradigm for data centers of all sizes.