
How Meta Is Reinforcing its Global Network for AI Traffic

In 2022, Meta engineers realized they needed to deal with the incoming tsunami of AI data traffic that was about to overwhelm their networks.
Sep 12th, 2024 11:45am
Image of Jyotsna Sundaresan of Meta, presenting at the company’s Networking @Scale 2024 conference.

It was in 2022 that Meta engineers started to see the first clouds of an incoming storm: how much AI would change the nature, and the volume, of the company’s network traffic.

“Starting 2022, we started seeing a whole other picture,” said Jyotsna Sundaresan, a Meta network strategist, in a talk Wednesday at Meta’s Networking @Scale 2024 conference, held this week both virtually and at the Santa Clara Convention Center in California.

Mind you, Meta owns one of the world’s largest private backbones, a global network physically connecting 25 data centers and 85 points of presence with millions of miles of fiber optic cable, buried under both land and sea. Its reach and throughput allow someone on an Australian beach to see videos being posted by their friend in Greece nearly instantaneously.

And for the past five years, this global capacity has grown consistently by 30% a year.
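
As a rough back-of-the-envelope illustration (the 30% rate is from the talk; the starting capacity is an arbitrary unit), five years of compounding at that pace nearly quadruples the backbone:

```python
# Back-of-the-envelope: compound 30% annual growth over five years.
# The 30% rate comes from the talk; the starting capacity is an arbitrary unit.
capacity = 1.0
for year in range(1, 6):
    capacity *= 1.30
    print(f"Year {year}: {capacity:.2f}x the original capacity")
# Year 5 lands at roughly 3.7x, i.e. the backbone nearly quadruples.
```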

Yet the growth in AI demand on the backbone is bumpy and difficult to predict.

“The impact of large clusters, GenAI, and AGI is yet to be learned,” Sundaresan said. “We haven’t yet fully fleshed out what that means for the backend.”

Nonetheless, the networking team has gotten creative in coming up with ways to mitigate the ravenous networking demands of AI while still increasing the backbone’s throughput.

The Full AI Data Life Cycle

Back in 2022, Facebook, WhatsApp, Instagram and other product groups all started requesting fleets of GPUs for their AI efforts.

There had been requests in earlier years, but they resulted in smaller clusters that did not generate a lot of traffic across data centers, so they largely went unnoticed by Meta’s networking team.

But in 2022, demand for the GPUs grew by 100% year over year.

And this resulted, for the networking department, in “a higher-than-anticipated uptick in growth of traffic on the backbone,” Sundaresan said. “We were not ready for this.”

Initially, the group assumed that most of the traffic these clusters generated would only move from storage to the GPUs. “We had missed several critical elements of this AI life cycle,” she said.

“AI workloads are just not as fungible with hardware heterogeneity”
— Jyotsna Sundaresan, of Meta

Data replication and data placement turned out to be two considerable challenges; the team had not fully anticipated how much traffic they would generate.

AI requires fresh data, and this data is generated from everywhere, both by users and by machines, all the time.

All this data then has to be placed somewhere, usually in one or more remote data centers. It also has to be backed up to other locations to meet various quality control and regulatory mandates.

As a result, “There’s a lot of movement across different regions,” Sundaresan said, noting the movement can total exabytes each day.

Worse, AI workloads are not as fungible as other workloads, Sundaresan noted, meaning their hardware and software requirements can be far fussier than those of a more generic workload.

For instance, an A100 Nvidia GPU has different network interface preferences than an H100 GPU.
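
As a loose sketch of what scheduling around that heterogeneity can involve, the snippet below maps a GPU generation to the network profile a job would inherit. The specific NIC speeds and rail counts are illustrative assumptions, not Meta’s actual configuration.

```python
# Illustrative only: a toy mapping from GPU generation to the network
# profile a scheduler might require. The values below are assumptions,
# not Meta's actual hardware configuration.
NETWORK_PROFILES = {
    "A100": {"nic": "200G RoCE", "rails_per_host": 4},
    "H100": {"nic": "400G RoCE", "rails_per_host": 8},
}

def placement_constraints(gpu_model: str) -> dict:
    """Return the network requirements a job inherits from its GPU type."""
    try:
        return NETWORK_PROFILES[gpu_model]
    except KeyError:
        raise ValueError(f"No network profile known for {gpu_model}")

print(placement_constraints("H100"))
```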

Bending the Demand Curve

You can think of this process as an AI data life cycle, said Abishek Gopalan, a Meta network engineer working on global infrastructure who co-presented the talk.

The network management team set out to find ways to optimize the network to better handle the characteristics of these data movements, a process Gopalan called “backbone dimensioning.”


Abishek Gopalan, of Meta, presenting at the company’s Networking @Scale 2024 conference.

“The backbone is a precious resource, and it’s a shared infrastructure that we’re using to support all of Meta’s products and platforms,” Gopalan said.

The networking team had to take a “holistic view” of the traffic, looking not only at network capacity but also at compute and storage resources.

Once created, fresh data may need to be copied to so many other locations that the cost of moving it exceeds the cost of the original computation.

So the networking team partnered with the storage team on caching and better data placement strategies, and deployed instrumentation to help figure out which data never gets used by AI, so it does not have to be moved at all.
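
A minimal sketch of the kind of decision that instrumentation enables is below: whether a dataset needs to be replicated at all, and to how many regions. The thresholds, fields and region codes are assumptions for illustration, not Meta’s actual placement system.

```python
from dataclasses import dataclass

# Toy sketch of a replication decision informed by access instrumentation.
# Thresholds, fields and region codes are illustrative assumptions.

@dataclass
class DatasetStats:
    name: str
    size_tb: float
    reads_last_30d: int            # how often trainers actually touched it
    regions_requesting: list[str]  # regions that asked for a local copy

def replication_plan(ds: DatasetStats) -> list[str]:
    """Decide which regions (if any) should receive a copy."""
    if ds.reads_last_30d == 0:
        return []                         # unused data: don't move it at all
    if ds.reads_last_30d < 10:
        return ds.regions_requesting[:1]  # rarely used: cache in one region
    return ds.regions_requesting          # hot data: replicate wherever it's needed

plan = replication_plan(DatasetStats("clicks-2024-09", 120.0, 3, ["iad", "lla", "odn"]))
print(plan)  # ['iad'] -- only one remote copy for a rarely-read dataset
```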

Data placement chart.

Also, not all data needs to be moved right away, so better understanding the latency requirements of the data itself helped smooth the flow of traffic across the network. Not every batch of data comes with the same service-level requirements, which allows traffic to be divided into differentiated classes for different workloads through the use of advanced scheduling tools.
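
A rough sketch of that idea, assuming made-up class names and deadlines, classifies each transfer by how soon its data actually needs to arrive:

```python
# Illustrative sketch: sorting bulk transfers into service classes by
# deadline, so the network can schedule latency-tolerant copies off-peak.
# Class names and thresholds are assumptions, not Meta's actual scheme.

def service_class(deadline_hours: float) -> str:
    if deadline_hours <= 1:
        return "latency-sensitive"   # must move now, gets priority bandwidth
    if deadline_hours <= 24:
        return "elastic"             # can be smoothed out over the day
    return "background"              # fills otherwise-idle backbone capacity

transfers = [("checkpoint-sync", 0.5), ("replica-refresh", 12), ("cold-backup", 72)]
for name, deadline in transfers:
    print(name, "->", service_class(deadline))
```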

Without this work, the amount of traffic that would end up on the backbone would be “untenable,” Gopalan said.

Llama 3 Differed from Earlier AI Requirements

What casual observers may not readily appreciate is how much the creation of large language models (LLMs) requires specialized cluster and networking topologies, ones that differ from those used in earlier AI work.

Another talk at the conference, given by a pair of Meta production networking engineers, Pavan Balaji and Adi Gangidi, examined the network optimizations needed to support Meta’s own 405-billion-parameter Llama 3 LLM.


Adi Gangidi, of Meta, speaks at the company’s Networking @Scale 2024 conference.

To run LLMs, you need both accuracy and speed, hence the need for the large clusters of GPUs attached to fast network and storage systems, Balaji said.

To train and serve the first two iterations of Llama, Meta used existing GPU clusters originally built for ranking and recommendation work. Ranking works best in a mesh-like communication topology across all the GPUs.

These proved suboptimal for the larger Llama 3, however, which performs best with a more hierarchical communication pattern.

So the company built two additional clusters, each with 24,000 GPUs, just for Llama 3.

Instead of a full mesh model, these clusters separated GPUs into zones of 3,000 GPUs each, with full bisection bandwidth within each zone.

These zones were connected with “aggregate training switches,” which did not offer full bisection bandwidth but rather oversubscribed links, on the assumption that not all nodes would be using the switch at once.

This was just fine, because “generative AI workloads have hierarchical collectives which produce traffic patterns like trees or rings, and for these patterns, they can tolerate oversubscription just fine,” Gangidi added.
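
The arithmetic behind that trade-off can be sketched as follows. The GPU counts come from the talk; the per-GPU injection bandwidth and zone uplink capacity are illustrative assumptions, chosen only to show how an oversubscription ratio is computed.

```python
# Back-of-the-envelope sketch of oversubscription at the aggregation layer.
# GPU counts come from the talk (24,000 per cluster, zones of ~3,000);
# the bandwidth figures below are illustrative assumptions.

gpus_per_zone = 3_000
zones = 8                       # 8 x 3,000 = 24,000 GPUs per cluster
gbps_per_gpu = 400              # assumed injection bandwidth per GPU
uplink_gbps_per_zone = 300_000  # assumed capacity toward the aggregation switches

demand = gpus_per_zone * gbps_per_gpu        # worst case: every GPU talks cross-zone
oversubscription = demand / uplink_gbps_per_zone
print(f"Oversubscription ratio: {oversubscription:.1f}:1")
# Hierarchical collectives (trees, rings) keep most traffic inside a zone,
# which is why a ratio above 1:1 is tolerable here.
```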

Further fine-tuning was still required, through load balancers, routing techniques such as advanced Equal Cost Multi-Path (ECMP) and other methods of traffic engineering.
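
For reference, the core idea behind plain ECMP can be sketched in a few lines: hash a flow’s identifying fields and use the result to pick one of several equal-cost paths. Meta’s tuned variants and traffic-engineering layers are far more elaborate than this toy version.

```python
import hashlib

# Minimal sketch of plain ECMP: hash a flow's 5-tuple to pick one of
# several equal-cost paths. Path names are made up for illustration.

PATHS = ["spine-1", "spine-2", "spine-3", "spine-4"]

def pick_path(src_ip: str, dst_ip: str, src_port: int, dst_port: int, proto: str) -> str:
    flow_key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}/{proto}".encode()
    digest = int(hashlib.md5(flow_key).hexdigest(), 16)
    return PATHS[digest % len(PATHS)]  # same flow always takes the same path

print(pick_path("10.0.0.1", "10.0.1.7", 51334, 4791, "udp"))
```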

Bigger Backbone Still Needed

While all this work aims to bend the demand curve down considerably, Gopalan acknowledged there is still work to be done on the other side of the equation: fortifying the supply curve.

In other words, the core network will still have to be aggressively expanded, reinforced with more fiber optic cable, and given more storage and power support as well.

“We intentionally design our backbone to allow for more flexible demand patterns, as well as allow for more workload optionality,” Gopalan said, “so that it allows our backbone to really serve potential spikes or changes in demand patterns, which aren’t always easy to predict.”
