Inside a TOR switch

It is no secret that most top-of-rack (TOR) L2/L3 switches (Cisco, Juniper, Force10, Arista, Blade Networks, Extreme) use off-the-shelf switch chips from Broadcom, Marvell, or Fulcrum. They all share a similar hardware design. The main difference is the software. Each vendor has its own legacy software stack, and many of them still use an embedded software architecture, i.e. one process running dozens of threads.

In fact, with the rapid advance of off-the-shelf switch silicon, many aggregation and core switches are designed in much the same way as TOR switches. They just use faster ASICs and run a few more protocols. However, to avoid protests from the vendors of overpriced “high performance” switches, let me limit the scope to TOR switches.

Switch Hardware Design

These TOR switches all share a similar architecture.

  • A low-power controller CPU – this CPU handles the control-plane protocols and should be lightly loaded most of the time. Most TOR switches use a PowerPC SoC, while some use ARM or MIPS processors.
  • One or multiple switch ASICs to handle the data plane traffic. These ASICs handle L2 learning, packet exchange, L3 routing, data traffic security and filtering, etc.
  • A PCI or PCI-E bus to connect the controller CPU and the switch ASIC. This bus typically has much lower bandwidth than the data ports of the ASIC. It is used to pass PDUs (Protocol Data Units) between the CPU and the ASIC. It is NOT fast enough to handle any data traffic.
  • Ethernet PHYs – the PHY chips translate serial signals to different media protocols, such as GbT, fiber, or 10GbT. Each PHY can handle 1, 2, 4, or even 8 ports. Depending on the number and speed of its ports, each switch requires a different number of PHYs.
  • Storage – there are typically two types of storage: a boot flash that stores the boot image, and a USB/CompactFlash-type storage card. The boot flash is typically around 32 MB or 64 MB. The USB/CF storage can now easily scale up to tens of GB.
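
To put rough numbers on the list above, here is a small back-of-the-envelope sketch in Python. It counts the PHYs a switch needs and compares the data-plane bandwidth with the CPU bus. The port mix (48 GE plus 4 10GE) is the Pronto 3290 configuration mentioned later in this post. The PHY sizes and the single PCIe 1.x lane are assumptions chosen for illustration, not the actual bill of materials of any product.

```python
import math

def phys_needed(port_count, ports_per_phy):
    """Number of PHY chips needed to serve port_count ports."""
    return math.ceil(port_count / ports_per_phy)

def data_plane_gbps(port_groups):
    """Total front-panel bandwidth; port_groups maps speed in Gbps -> port count."""
    return sum(speed * count for speed, count in port_groups.items())

# 48 GE ports plus 4 10GE uplinks.
ports = {1.0: 48, 10.0: 4}

# Assumption: octal (8-port) PHYs for the GE ports, one PHY per 10GE port.
ge_phys = phys_needed(ports[1.0], 8)
xe_phys = phys_needed(ports[10.0], 1)

# Assumption: one PCIe 1.x lane between CPU and ASIC.
# 2.5 GT/s with 8b/10b encoding leaves 2 Gbps of payload per direction.
cpu_bus_gbps = 2.5 * 8 / 10

total = data_plane_gbps(ports)
print(f"PHYs: {ge_phys} GE + {xe_phys} 10GE")
print(f"data plane {total:.0f} Gbps vs CPU bus {cpu_bus_gbps:.0f} Gbps "
      f"({total / cpu_bus_gbps:.0f}x)")
```

Even with generous assumptions for the bus, the gap is well over an order of magnitude, which is why the CPU only ever sees protocol packets and never the data traffic itself.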

How about the OS?

The term OS means something slightly different in the switching industry than in the computer industry. In switching, the “OS” actually includes both the operating system (which drives the hardware) and the protocol engine (the application that communicates with other switches). For example, Cisco calls its software IOS, Arista calls it EOS, and Force10 calls it FTOS. This is probably because legacy switch software was always built on an embedded OS, where the application and the “OS” are so tightly coupled that they behave as one process.

With an advanced operating system like Linux, the switch software does not need to be squeezed into one process. Since the CPU only handles the control protocols, there is really no reason to use a proprietary scheduler or “manually optimized” thread priorities to run the application.

Switch software is trending toward modern operating systems like Linux. This gives the application better protection, better scheduling, and better resource sharing. More importantly, it allows users to leverage existing device drivers and tools on the switch platform.

Protocol Engine

The most critical application on a switch is the protocol engine, which communicates with other switches in the network and “guesses” the topology of the network. A switch might need multiple protocol engines for different protocols. Ideally, we want to run these protocol engines in separate process spaces, so that they are harder to hack and a failure in one cannot crash the whole switch. However, since most legacy switch software is built on an architecture more than 15 years old, it is difficult to port that software to a modern OS.
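
The process-isolation idea can be sketched in a few lines of Python. This is only an illustration, not how any vendor's software is built: the engine names (ospf, stp, lldp) are examples, each "engine" is a one-line stand-in program, and one of them is made to crash on purpose. The point is that the supervisor and the other engines keep running when one engine dies.

```python
import subprocess
import sys

# Hypothetical protocol engines: each is a tiny stand-in program that runs
# in its own OS process. The "stp" engine crashes on purpose.
ENGINES = {
    "ospf": "print('ospf engine running')",
    "stp": "raise RuntimeError('simulated engine bug')",
    "lldp": "print('lldp engine running')",
}

def run_engines(engines):
    """Start every engine as a separate process and collect the exit codes."""
    procs = {
        name: subprocess.Popen(
            [sys.executable, "-c", code],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        for name, code in engines.items()
    }
    return {name: proc.wait() for name, proc in procs.items()}

if __name__ == "__main__":
    for name, rc in run_engines(ENGINES).items():
        status = "ok" if rc == 0 else f"crashed (exit code {rc}), restart candidate"
        print(f"{name}: {status}")
```

In a single-process, multi-threaded design, the same bug would take every protocol down with it. Here the supervisor only sees a non-zero exit code and can decide to restart that one engine.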

While the protocol engine is the critical core of the switch application, it is only a small portion of the whole implementation. The majority of the software is the management interface, including the command line interface (CLI), configuration file management, event logs, and alerts. Many of these pieces can be replaced with existing tools in Linux. For example, the event log system of Linux is much more structured and extensible than the legacy proprietary implementations, and the Linux event format is already understood by existing management tools.
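
As an example of what "structured" means here, the sketch below builds a syslog message header in the RFC 5424 layout, where the priority value is the facility code times 8 plus the severity code. The host name, application name, and message text are made up for illustration. On a real Linux switch you would normally hand the message to the system logger rather than format it yourself.

```python
from datetime import datetime, timezone

# Numeric codes defined by the syslog protocol (RFC 5424); only a few
# facilities are listed here.
FACILITY = {"kern": 0, "user": 1, "daemon": 3, "local0": 16}
SEVERITY = {"emerg": 0, "alert": 1, "crit": 2, "err": 3,
            "warning": 4, "notice": 5, "info": 6, "debug": 7}

def syslog_line(facility, severity, host, app, msg, when):
    """Format one event as an RFC 5424 style line; `when` must be timezone-aware."""
    pri = FACILITY[facility] * 8 + SEVERITY[severity]
    timestamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    # Layout: <PRI>VERSION TIMESTAMP HOSTNAME APP-NAME PROCID MSGID SD MSG
    # PROCID, MSGID, and structured data are left as "-" (nil) in this sketch.
    return f"<{pri}>1 {timestamp} {host} {app} - - - {msg}"

# Hypothetical event from a hypothetical port manager on a switch named tor1.
line = syslog_line("local0", "warning", "tor1", "portmgr",
                   "link down on port 12",
                   datetime(2011, 1, 1, tzinfo=timezone.utc))
print(line)
```

Because every field has a fixed position and meaning, any collector that speaks syslog can filter by severity or host without knowing anything about the switch vendor.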

The Differences between Switches

Since most switches are built on a similar architecture, what is the difference between them? Well, it is the software and the chip.

Depending on market requirements, switch chips come with different bandwidth, buffer sizes, and features. However, under Moore’s law, the ASICs keep getting more powerful and can handle more and more protocols. This is an area where we are seeing commoditization. At the low end of the market, there is really not much difference between the chips from Marvell, Broadcom, Fulcrum, and Vitesse. At the high end, Marvell, Broadcom, and Fulcrum have been in a close race on port density. On the feature side, Broadcom leads the pack, but not by much. We can fairly say most of these high-end chips have more features than you need (or want to pay for).

The software determines what features are available to users. Since the chips have almost all the features you need, it is really up to the software to decide which features to open up to users. For example, most of the new chips from Broadcom, Marvell, and Fulcrum claim to be “TRILL” capable. However, no vendor has released any TRILL software yet, so users have to wait until the software is done before they can really try out TRILL.
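
Put another way, what users can actually turn on is the intersection of what the chip supports and what the software implements. A toy sketch, with made-up feature names that do not describe any real chip or software release:

```python
# Hypothetical feature sets, for illustration only.
chip_features = {"l2_switching", "l3_routing", "acl", "trill"}
software_features = {"l2_switching", "l3_routing", "acl"}

def usable_features(chip, software):
    """Features a user can enable: supported in silicon AND implemented in software."""
    return chip & software

def dormant_features(chip, software):
    """Features present in silicon that the software does not expose yet."""
    return chip - software

print("usable: ", sorted(usable_features(chip_features, software_features)))
print("dormant:", sorted(dormant_features(chip_features, software_features)))
```

In this toy example TRILL sits in the dormant set: the silicon is ready, but nothing happens until the software catches up.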

How Will TOR Switches Evolve?

Since the hardware is commoditized, software will become the key differentiator between switches.

Soon we will start seeing volume switch manufacturers sell low-cost yet high-performance switches. This will give users the incentive to buy hardware and software from different vendors, just as we buy laptops today. The hardware could ship with Linux (or another modern OS) as the operating system, and users could order software with the right features at the right cost to install on that low-cost switch hardware.

The cost of the software should be tied to its features, such as a better management package or special-purpose protocols, rather than to the speed of the hardware. 10GE switch hardware might be more expensive than GE switch hardware, but the software price should be the same if users need the same features.

This change should happen soon, which is why we founded Pronto and positioned it as an open switch platform. We are seeing interest from software providers. We already see a “Red Hat of switches” (Pica8 Xorplus), and believe we will soon see a “Microsoft of switches” show up and lead the evolution. If you are interested in low-cost, high-performance TOR switches, check out our Pronto 3290 (48 GE ports with 4 10GE uplinks) and Pronto 3780 (48 ports of 10GE).


About James Liao
James is a data center architect, focusing on the scalability and operation of data center infrastructure.
