Meta currently operates 14 data centers around the world. This rapidly expanding global data center footprint poses new challenges for service owners and for our infrastructure management systems. Systems like Twine, which we use to scale cluster management, and RAS, which handles perpetual region-wide resource allocation, have provided the abstractions and automation necessary for service owners to be machine-agnostic within a region. However, as we expand our number of data center regions, we need new approaches to global service and capacity management.
That’s why we’ve created new systems, called Global Reservations Service and Regional Fluidity, that determine the best placement for a service based on intent, needs, and current congestion.
Twine and RAS allowed service owners to be machine-agnostic within a region and to view the data center as a computer. But we want to take this concept to the next level — beyond the data center. Our goal is for service owners to be region-agnostic, which means they can manage any service at any data center. Once they’ve become region-agnostic, service owners can operate with a computer’s level of abstraction — making it possible to view the entire world as a computer.
Designing a new approach to capacity management
As we prepare for a growing number of regions with different failure characteristics, we have to transform our approach to capacity management. The solution starts with continuously evolving our disaster-readiness strategy and changing how we plan for disaster-readiness buffer capacity.
Having more regions also means redistributing capacity more often to increase volume in new regions, as well as more frequent hardware decommissions and refreshes. Many of today’s systems provide only regional abstractions, which limits an infrastructure’s ability to automate movements across regions. Additionally, many of today’s service owners hard-code specific regions and manually compute the capacity required to be disaster-ready.
We often don’t know service owners’ intentions for using specific regions. As a result, our infrastructure provides less flexibility for shifting capacity across regions to optimize for different goals or to be more efficient. At Meta, we realized we needed to start thinking more holistically about the solution and develop a longer-term vision of transparent, automated global-capacity management.
Latency-tolerant vs. latency-sensitive services
When scaling our global capacity, Meta initially scoped approaches to two common service types:
- Stateless and latency-sensitive services power products that demand a fast response time, such as viewing photos/videos on Facebook and Instagram mobile apps.
- Latency-tolerant services power products in which a slight delay in fulfilling requests is tolerable, such as uploading a large file or seeing a friend’s comments on a video.
When considering placement of a latency-sensitive service, we must also take into account the placement of the service’s upstream and downstream dependencies. Collectively, our infrastructure needs to place these services near one another to minimize latency. However, for latency-tolerant services, we do not have this constraint.
By using infrastructure that has the flexibility to change the placement of a service, we can improve the performance of the products. Our Global