Oct 27 2016

Facebook Network Engineering Philosophy

Facebook's data centers are among the most advanced, energy-efficient facilities in the world and feature the latest in hyper-efficient Open Compute Project (OCP) hardware. It's our hope that an open, disaggregated network stack will enable a faster pace of innovation in the development of networking hardware and software, and ultimately provide infrastructures that are flexible, scalable, and efficient. When building this technology, we have several guiding principles:
  • Focus on the big picture
  • Iterate vs architect: keep it simple
  • Operations before features
  • Fail fast vs fail safe: reliability before features
  • Be bold and open
In the following sections, we'll expand on our guiding principles and dive into practical examples of our work.

Global Scale and Impact

As Facebook rapidly grows, the biggest challenge for the networking team is scaling our infrastructure to match the ever-growing demands, while keeping the network reliable and efficient. This challenge applies equally to the data center networks and the global backbone—calling for fast decision making and innovative solutions in each domain.
Tremendous scale (as evidenced by the data below), coupled with a global user base and unpredictable growth driven by constant product innovation on the Facebook platform, presents a unique challenge. Along with scaling, the networking team focuses on improving the user experience and staying ahead of product changes, so that the network never impedes their deployment or growth.
Staying ahead of the curve requires the team to think ahead and work with the research community and industry innovators to find solutions that fit our future infrastructure demands. Take the evolution of networking silicon, for example. To stay within the power envelope of our data centers while still providing the expected capacity and functionality, we have to work closely with silicon vendors, both established and new. We collaborate with these vendors on decisions that will benefit the industry in the future. There are technologies we may not need today but will need in a few years, and by working with the research community and industry innovators, we can work toward those network evolutions together.

Move Fast and Keep it Simple

A strategy would be meaningless without effective execution. Building networks in the traditional network engineering world revolves around using well-known vendor hardware and software stacks, and solving problems based on recipes from the vendor “cookbooks.” Effectively, it looks like using Lego blocks to build complex solutions, while having little control over the blocks themselves. Following this approach would have limited our ability to move fast toward our long-term goals.
The key to efficiently reaching our networking goals is to iterate fast. We are rebuilding major parts of our network infrastructure on our own hardware and software, with a deliberately minimal feature set. For example, Wedge and 6-pack lay the foundation of the network hardware that we use to build the data center fabric, a scale-out topology that provides resiliency through design rather than by increasing the reliability of individual components. Our own hardware and software stack may not be as feature-rich as what a network vendor would offer, but that's the beauty of it: we don't need the full, complex set of functions.
Not every network component lends itself to being easily replaced with our hardware and software. Evolving our global backbone and edge interconnection networks has proven trickier, due to the significant amount of specific functionality deployed and the interoperability requirements present in wide area networks (WANs). We built a new parallel network running on our own software, which allows for gradual introduction of the new solution. The goal behind this “FB-scale” backbone is to deliver a simpler platform with a stripped-down feature set that solves the problems of intelligent capacity management and optimum resource utilization using our software component: the network controller.

Infrastructure Reliability

We prioritize the reliability of our infrastructure and see it as a system property, achieved through creative and innovative infrastructure design that puts reliability at the forefront, combined with efficient monitoring and remediation systems. We do not strive to achieve high reliability for each component, but rather build the system in such a way that it can withstand failures of individual components efficiently. For example, we design the network in isolated partitions (e.g., spine planes in the data center fabric), such that an issue in one partition does not affect another. We have also built systems for rapid network fault detection, and we are actively working on more efficient fault isolation systems that let us automatically localize an issue of any nature in the network.
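To make the partitioning idea concrete, here is a minimal sketch of how losing one isolated plane affects a rack's uplink capacity. It assumes a simplified fabric in which every rack switch has one equal-capacity uplink into each spine plane. The `Fabric` class, the plane count, and the link speed are hypothetical illustration values, not a description of the production fabric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fabric:
    """Hypothetical fabric: each rack switch has one uplink into every spine plane."""
    planes: int          # number of isolated spine planes (illustrative value)
    uplink_gbps: float   # capacity of one rack uplink into one plane (illustrative)

    def rack_capacity(self, failed_planes=frozenset()):
        """Uplink capacity left for a rack when the given planes are down."""
        failed = set(failed_planes) & set(range(self.planes))
        return (self.planes - len(failed)) * self.uplink_gbps

    def surviving_fraction(self, failed_planes):
        """Share of full capacity that remains after the failures."""
        return self.rack_capacity(failed_planes) / self.rack_capacity()


fabric = Fabric(planes=4, uplink_gbps=40.0)
print(fabric.rack_capacity())           # 160.0 with all planes healthy
print(fabric.surviving_fraction({2}))   # 0.75 after losing one plane
```

Under these assumptions, a fault confined to one of N planes costs each rack 1/N of its uplink capacity and leaves the other planes untouched, which is the sense in which the system tolerates component failures without relying on any single component being highly reliable.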
An important aspect of improving reliability is our engineering culture. We hold the belief that the people responsible for building the systems should also share the responsibility of operating and supporting them. We don't separate architecture, engineering, and operations roles; every engineer is expected to be hands-on with the production network and/or code. The end result is simpler, more reliable system designs, which are built based on need and not just theoretical thinking.

Keys to Scale

The key to a scalable network design is reflected in the ease and efficiency of our network operations. With our architecture, we always keep operations in mind and build features that lend themselves to this model. One example of this is the work we did on router buffer size measurements. Academia has questioned the need for large buffers, but mainstream companies still use them. Our team took multiple measurements in production and found that in nearly every case, we can use much smaller buffer memory (compared to what traditional backbone routers support) in our network devices. In fact, reducing buffer sizes has improved performance in many cases, as validated by our active network probing system (NetNorad). This simple change opened up multiple possibilities for using much less complicated hardware in our edge and backbone networks, and it led to a simpler operations paradigm in the network.
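For a sense of the magnitudes involved, the sketch below compares two buffer sizing rules from the general networking literature: the classic bandwidth-delay-product rule of thumb (buffer = link rate × round-trip time) and the small-buffer result from academic work on router buffer sizing, which divides that figure by the square root of the number of concurrent long-lived flows. Neither formula is stated in this post, and this is not the measurement method our team used. The link rate, RTT, and flow count are hypothetical.

```python
import math


def bdp_buffer_bytes(link_bps, rtt_s):
    """Classic rule of thumb: one bandwidth-delay product of buffering."""
    return link_bps * rtt_s / 8  # bits -> bytes


def small_buffer_bytes(link_bps, rtt_s, flows):
    """Small-buffer rule: bandwidth-delay product divided by sqrt(flow count)."""
    return bdp_buffer_bytes(link_bps, rtt_s) / math.sqrt(flows)


# Hypothetical backbone link: 100 Gb/s, 100 ms RTT, 10,000 concurrent flows.
link_bps, rtt_s, flows = 100e9, 0.100, 10_000

classic = bdp_buffer_bytes(link_bps, rtt_s)
small = small_buffer_bytes(link_bps, rtt_s, flows)
print(f"classic: {classic / 1e6:.1f} MB")
print(f"small:   {small / 1e6:.1f} MB")
print(f"ratio:   {classic / small:.0f}x")
```

With these assumed inputs the classic rule asks for roughly 1.25 GB of buffer while the small-buffer rule asks for about 12.5 MB, a hundredfold difference. That gap is why buffer sizing has such a direct effect on how complex the hardware needs to be.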

Be Bold, Be Open

Our team is not afraid to rethink fundamental aspects of the networking infrastructure to better meet our goals around efficiency and speed. For example, for a long time, our routing protocol of choice was Border Gateway Protocol (BGP); it's widely supported, well-tested, and can be used in very creative ways. We use BGP for routing in our data center fabric and to implement route injection to meet some of our traffic engineering goals. We also built our own BGP stack to support some of these functions and to integrate our hardware into existing networks. However, as we began to work with the Connectivity Lab at Facebook on Terragraph, we realized that we needed more than BGP or any traditional IGP could offer. We were looking for things like fast rerouting and intelligent topology discovery services and, most importantly, simple code we could extend quickly and support by ourselves. This is how Open/R was born. Open/R is our distributed platform for building network applications; it was initially developed for the Terragraph mesh network but has found its way into parts of our internal backbone. Its modular and extensible design is based on modern software components and allows for additional applications on top of the basic routing function.
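As a rough illustration of what rerouting means in a mesh, the sketch below recomputes shortest paths after a link failure. It assumes a link-state style design in which a node knows the full topology and runs Dijkstra's algorithm locally. This post does not describe Open/R's internals, and this is not Open/R code. The four-node topology, the link metrics, and the helper names are hypothetical.

```python
import heapq


def shortest_paths(graph, source):
    """Dijkstra: return (distance, predecessor) maps from source."""
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, metric in graph.get(node, {}).items():
            nd = d + metric
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                prev[neighbor] = node
                heapq.heappush(heap, (nd, neighbor))
    return dist, prev


def path_to(prev, source, target):
    """Walk predecessors back from target; None if target is unreachable."""
    path = [target]
    while path[-1] != source:
        if path[-1] not in prev:
            return None
        path.append(prev[path[-1]])
    return path[::-1]


def without_link(graph, a, b):
    """Copy of the topology with the bidirectional link a-b removed."""
    return {n: {m: w for m, w in nbrs.items() if {n, m} != {a, b}}
            for n, nbrs in graph.items()}


# Hypothetical mesh: adjacency map of node -> {neighbor: metric}.
mesh = {
    "A": {"B": 1, "C": 1},
    "B": {"A": 1, "D": 1},
    "C": {"A": 1, "D": 2},
    "D": {"B": 1, "C": 2},
}

_, prev = shortest_paths(mesh, "A")
print(path_to(prev, "A", "D"))   # ['A', 'B', 'D']

_, prev = shortest_paths(without_link(mesh, "B", "D"), "A")
print(path_to(prev, "A", "D"))   # ['A', 'C', 'D']
```

In a real system, the speed of rerouting depends mostly on how quickly the failure is detected and the updated topology reaches every node. The path computation shown here is the inexpensive part.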
And last but not least, we do not shy away from sharing the results of our work with the industry. Our switch designs (Wedge and 6-pack) are openly available to anyone willing to build their own gear. We open-sourced FBOSS, the software stack that runs on our networking gear, and we plan to open source even more solutions. We open source technologies because we know from experience that the best way to accelerate the pace of innovation is for companies to collaborate and work in the open.

Looking Ahead

Our family of apps and services continues to grow, and so does our infrastructure. Our goal is to remain agile in our ever-changing environment while being proactive in planning our future growth. We believe that continuously innovating and challenging traditional networking approaches is the way to achieve this goal: looking at new ways to build submarine cables, reinventing programmability in the optical layer, questioning the value of Fast Reroute in backbone networks, and so on. Working on the Facebook network is exciting, as we challenge existing paradigms and create simple yet efficient system designs that span different layers of infrastructure, from software to hardware. We believe that we have just scratched the surface of every problem we have set out to solve, and we truly believe the journey is just 1% finished.

We’re Hiring!

We're looking for people to join us and help in our mission to connect the world. View opportunities within Infrastructure.
