How can Facebook's data center design apply to your data center plans?
Over the past year, Facebook has thrown some interesting wrenches into the
gears of the traditional networking industry. While mainstream thinking is to
keep most details of your network operations under wraps, Facebook has been
freely sharing its innovations. For a company whose business model is built on
people sharing personal information, I suppose this makes perfect sense.
What makes even more sense is the return Facebook gets on its openness.
Infrastructure VP Jason Taylor estimates that over the past three years Facebook
has saved some $2 billion by letting the members of its Open Compute Project
have a go at its design specifications.
But what really turned heads was last year's announcement of Wedge, an open
top-of-rack switch developed with the OCP community. Wedge was followed eight
months later by 6-Pack, a modular version of Wedge purposed for the network
core. Added to these bare-metal switches is FBOSS, an open Linux-based network
operating system (well, not exactly an operating system – more on that in a
later post), and OpenBMC for system management.
Why this openness matters to the rest of us is that all of this is not just a
mad-science project within Facebook's lair. You can soon buy Wedge through
Taiwanese switch manufacturer Accton, bringing switches into your data center
for a fraction of the cost of proprietary switches with integrated operating
systems. And you're not locked in to running FBOSS on the switch either. You can
shop around, choosing the NOS that makes the most sense to you, such as Open
Network Linux, Cumulus Linux, Big Switch Networks' Switch Light, and possibly others such
as Pica8's PicOS or even Juniper's JUNOS. If you have an intrepid team of
developers with time on their hands you can even build your own.
I'll write more about open switches and open software in subsequent articles,
but for now I want to focus on what Facebook has been sharing about their
innovations in data center network design and what it means for you. Last
November, between the announcements of Wedge and 6-Pack, Facebook opened its
newest data center in Altoona, Iowa. And as it has done with its other network
innovations, Facebook openly shared its new design.
It turns out that there are some valuable takeaways from the Altoona design that
can be applied to data centers of any size.
Hyperscale Misconceptions
Say "hyperscale data center" to most anyone who keeps up with such things,
and they'll reflexively name Facebook, Google, and Amazon. And because of this
association, people think of hyperscale as something that applies only to
mammoth data centers supported by an army of developers.
In reality, hyperscale just means the ability to scale out very rapidly. A
hyperscale data center network might be small, but it can grow exponentially
larger without changing the fundamental components and structures of the
network. You should be able to use the same switches and the same interconnect
patterns as you grow – just more of them. You do not need to throw out one class
of switches for another just to accommodate growth.
You can have a data center consisting of just a few racks, and if the network is
designed right it is a hyperscale data center. Hyperscale is a capability, not a
size.
Another misconception about hyperscale data centers is that they are optimized
for one or a relatively few applications at massive scale across the entire data
center. This stems particularly from the Facebook and Google associations.
Hyperscale designs are in fact ideal for very heavy east-west workloads, but
hyperscale design principles apply just as well to an average enterprise data
center supporting hundreds of business applications as to one running a single
social media, big data, or search app.
Hyperscale also conjures up images of do-it-yourself networks built from the
silicon up by a cadre of brilliant young architects commanding salaries far out
of reach of the average network operator. That might be true of the innovators,
but because Facebook has laid its work right out on the table, mere mortals like
you and me can put those design principles to work in our own data centers.
To appreciate the significance of the Altoona network, let's first have a look
at the network architecture Facebook is using in its earlier data centers.
Good is not good enough: Facebook's cluster design
Figure 1 shows Facebook's pre-Altoona aggregated cluster design, which they
call the "4-post" architecture. Up to 255 server cabinets are connected through
ToR switches (RSW) to high-density cluster switches (CSW). The RSWs have up to
44 10G downlinks and four or eight 10G uplinks. Four CSWs and their connected
RSWs comprise a cluster.
[Figure 1: Facebook's "4-post" aggregated cluster design]
Four "FatCat" (FC) aggregation switches interconnect the clusters. Each CSW has
a 40G connection to each of the four FCs. An 80G protection ring connects the
CSWs within each cluster, and the FCs are connected by a 160G protection ring.
This is a good design in several ways. Redundancy is good; oversubscription is
good (generally 10:1 between RSWs and CSWs, 4:1 between CSWs and FCs); the
topology is reasonably flat with no routers interconnecting clusters; and growth
is managed simply, at least up to the 40G port capacity of the FCs, by adding
new clusters.
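If you want to sanity-check those oversubscription ratios for your own designs, the arithmetic is simple. Here's a quick Python sketch using the RSW link counts described above; it's illustrative math, not a statement about any specific deployment.

```python
# Rough oversubscription arithmetic for the "4-post" cluster design.
# Link counts come from the description above; the rest is just math.

def oversubscription(downlink_count, downlink_gbps, uplink_count, uplink_gbps):
    """Ratio of downstream capacity to upstream capacity on a switch."""
    down = downlink_count * downlink_gbps
    up = uplink_count * uplink_gbps
    return down / up

# RSW (top of rack): up to 44 x 10G down to servers, 4 or 8 x 10G up to the CSWs
print("RSW, 4 uplinks:", oversubscription(44, 10, 4, 10))  # 11.0
print("RSW, 8 uplinks:", oversubscription(44, 10, 8, 10))  # 5.5
```

The four-uplink case works out to 11:1, in line with the roughly 10:1 figure cited above; the stated 4:1 between CSWs and FCs depends on how many RSW uplinks actually land on each CSW, so I haven't tried to recompute it here.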
But Facebook found that good is not good enough.
Most of the problems with this architecture stem from the necessity of very
large switches for the CSWs and FCs:
- With just four boxes handling all intra-cluster traffic and four boxes handling all inter-cluster traffic, a switch failure has a serious impact. One CSW failure reduces intra-cluster capacity by 25%, and one FC failure reduces inter-cluster capacity by 25% (see the sketch after this list).
- Very large switches restrict vendor choice – there are only a few "big iron" manufacturers. And because these few vendors sell relatively few big boxes, the per-port CapEx and OpEx are disproportionately high compared with smaller switches offered by a larger number of vendors.
- The proprietary internals of these big switches prevent customization, complicate management, and extend waits for bug fixes to months or even years.
- Large switches tend to have oversubscribed switching fabrics, so all ports cannot be used simultaneously.
- The cluster switches' port densities limit the scale and bandwidth of these topologies, and make transitions to next-generation port speeds too slow.
- Facebook's distributed application creates machine-to-machine traffic that is difficult to manage within an aggregated network design.
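The first point, the blast radius of a single switch failure, is worth quantifying, because it's where small-switch designs shine. A minimal sketch, assuming capacity is spread evenly across the switches in a layer:

```python
# Capacity lost when one switch in a layer fails, assuming traffic is
# spread evenly across N equivalent switches in that layer.

def capacity_lost_on_failure(switches_in_layer: int) -> float:
    """Fraction of a layer's capacity lost when a single switch dies."""
    return 1.0 / switches_in_layer

# 4-post design: four CSWs per cluster, four FCs for the whole data center
print(f"CSW failure: {capacity_lost_on_failure(4):.0%} of intra-cluster capacity")  # 25%
print(f"FC failure:  {capacity_lost_on_failure(4):.0%} of inter-cluster capacity")  # 25%

# Altoona-style fabric: with up to 48 switches per spine plane, losing one
# spine switch costs only about 2% of that plane's capacity.
print(f"Spine switch failure: {capacity_lost_on_failure(48):.1%} of one plane")
```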
The Altoona fabric
In the Altoona network, Facebook replaced its large clusters with much smaller
pods. Each pod is a standard "unit of network": 48 server racks whose ToR
switches uplink to four fabric switches within the pod.
The individual pods are connected via 40G uplinks to four spine planes, as shown
in Figure 3. Each spine plane can have up to 48 switches. Key to this topology
is that the fabric switches each have an equal number of 40G downlinks and
uplinks – maxing out at 48 down and 48 up – so the fabric is non-blocking and
there is no oversubscription between pods. Bisection bandwidth, running to
multiple petabits, is consistent throughout the data center.
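The non-blocking claim is easy to verify from the port counts alone: a layer can't be oversubscribed if its uplink capacity matches its downlink capacity. A small sketch with the 48-down, 48-up figures from above:

```python
# Why the fabric layer is non-blocking: each fabric switch has equal
# downlink and uplink capacity (48 x 40G each way, as described above).

def is_non_blocking(downlinks: int, uplinks: int, port_gbps: int = 40) -> bool:
    """A switch layer is non-blocking if uplink capacity >= downlink capacity."""
    return uplinks * port_gbps >= downlinks * port_gbps

fabric_down, fabric_up = 48, 48
print(is_non_blocking(fabric_down, fabric_up))                    # True
print("Per-switch capacity each way:", fabric_down * 40, "Gbps")  # 1920 Gbps
```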
The diagram in Figure 3 shows the color-coded connections between fabric
switches and their corresponding spine planes, but doesn't do justice to how it
all ties together. And something that surely strikes you is that there are a lot
of links between fabric switches and spine switches. Optics and cables can
become expensive, so it's important to manage the distances between pods and
spine planes. (If you're interested in learning more about Facebook's
architectures, here are the source documents I used for cluster architecture
(PDF) and the Altoona architecture.)
If you rotate the pods and line them up, the way the 48 racks of each pod would
be arranged into rows in the data center, and then do the same with the spine
planes – but lining them up perpendicular to the pods – you get the
three-dimensional diagram shown in Figure 4, with the fabric switches becoming
part of the spine planes. Distances between fabric switches and spine switches
are reduced. Note that there are also edge pods, which provide external
connectivity to the fabric.
Facebook network engineer Alexey Andreyev describes the fabric this way: "This
highly modular design allows us to quickly scale capacity in any dimension,
within a simple and uniform framework. When we need more compute capacity, we
add server pods. When we need more intra-fabric network capacity, we add spine
switches on all planes. When we need more extra-fabric connectivity, we add edge
pods or scale uplinks on the existing edge switches."
If you want to hear Andreyev describe the Altoona architecture himself, here's
an excellent video:
Altoona Takeaways
You might be wondering by now what any of this has to do with you and your data
center. After all, Facebook is supporting more or less a single distributed
application generating machine-to-machine traffic spanning its entire data
center. You probably aren't. And while a 48-rack pod is a scale-down from its
earlier clusters, most enterprise data centers in their entirety are smaller
than 48 server racks.
So why should you care? Because it's not the scale. It's the scalability.
The fundamental takeaways from the Altoona design are the advantages of
building your data center network using small open switches, in an architecture
that enables you to scale to any size without changing the basic building
blocks. First look at the switches. You don't have to wait for Wedge or 6-Pack
to go on the market (Accton will be selling Wedge soon). You can pick up
bare-metal switches from Accton, Quanta, Celestica, Dell, and others for a
fraction of the cost a big-name vendor will charge. For example, a Quanta switch
with 32 40G ports lists for $7,495. A Juniper QFX5100 with 24 40G ports lists
for a little under $30,000. Is that a fair comparison? That JUNOS premium gives
you a pretty awesome operating system, but the bare-metal switch gives you a
bunch of options for loading an OS of your choice.
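The per-port arithmetic behind that comparison is worth spelling out. The snippet below uses only the two list prices cited above, so treat the output as a rough illustration rather than a current market quote.

```python
# Per-port cost comparison using the list prices cited above.
switches = {
    "Quanta bare-metal (32 x 40G)": (7_495, 32),
    "Juniper QFX5100 (24 x 40G)":   (30_000, 24),  # "a little under $30,000"
}

for name, (list_price, ports_40g) in switches.items():
    print(f"{name}: ~${list_price / ports_40g:,.0f} per 40G port")
# Quanta: ~$234 per port; QFX5100: ~$1,250 per port
```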
As for the pod and core design, that can be adjusted to your own needs. The pod
can be whatever size you want; while the "unit of network" is a wonderful
concept, it's not a rule. You can create a number of pod designs to fit specific
workflow needs, or just to start a migration away from older architectures. Pods
can also be application specific. As your data center network grows, or you
adopt newer technologies, you can non-disruptively "plug in" new pods.
The same goes for the core. You can build it at layer 2 or at layer 3. It
all depends on the workflows you're supporting. Using a simple pod and core
design you can manageably grow your data center network at whatever rate makes
sense to you, from a new pod every few years to an explosive growth of new pods
every few months.
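To make the "just add pods" idea concrete, here's a hedged sketch of a pod-and-spine bill of materials. Every parameter (pod size, fabric switches per pod, number of spine planes) is a design choice of your own, not a prescription from Facebook's fabric; the point is simply that growth changes the quantities, never the building blocks.

```python
# Toy model of pod-and-spine growth: adding capacity never changes the
# building blocks, only how many of them you deploy. All parameters are
# illustrative; size them for your own workloads.

from dataclasses import dataclass

@dataclass
class FabricDesign:
    racks_per_pod: int = 16   # your "unit of network" -- whatever size fits
    fabric_per_pod: int = 4   # fabric switches in each pod
    spine_planes: int = 4     # one plane per fabric-switch position

    def bill_of_materials(self, pods: int, spines_per_plane: int) -> dict:
        """Switch and link counts needed for a given number of pods."""
        return {
            "ToR switches":       pods * self.racks_per_pod,
            "fabric switches":    pods * self.fabric_per_pod,
            "spine switches":     self.spine_planes * spines_per_plane,
            "pod-to-spine links": pods * self.fabric_per_pod * spines_per_plane,
        }

design = FabricDesign()
print(design.bill_of_materials(pods=2, spines_per_plane=4))    # a small starting point
print(design.bill_of_materials(pods=12, spines_per_plane=16))  # the same parts, just more of them
```

Growing from the first configuration to the second is purely additive: new pods and new spine switches plug into the existing planes, and nothing already in place has to be replaced.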