Even small performance bottlenecks can noticeably degrade user experience. This is especially true for a streaming service like Netflix, which delivers content to millions of viewers. So when Netflix engineers encountered a mysterious slowdown in their containerized environment, they set out to find the cause, and uncovered some useful insights into how software and hardware interact along the way.
The Container Conundrum
Containers have revolutionized the way applications are deployed and scaled, offering a hardware-agnostic approach. However, as Netflix discovered, when it comes to speed and performance, understanding the underlying hardware becomes crucial.
When Netflix upgraded its container runtime from Docker to containerd, a subtle yet critical issue emerged. As operations scaled up, certain nodes began to stall, causing delays in container initialization. This was a major concern, as a single Netflix user decision could trigger hundreds of containers, putting immense pressure on the system.
The Root of the Problem
After some sleuthing, the Netflix engineers identified the culprit: the way containers were initialized. Specifically, they found that the delay was worse on some CPU architectures than on others. This led to an interesting observation: hardware matters, even in a containerized world.
Too Many UIDs, Too Many Problems
Under the older Docker setup, containers shared the host's user ID (UID) pool, providing a certain level of simplicity. However, when Netflix migrated to containerd and adopted a more secure User Namespace model, each container was isolated with its own unique UID range. While this enhanced security, it also introduced a new challenge.
The kernel now had to ID-map ('idmap') every layer of a container image individually to that container's UID range. For multi-layered images, this sharply increased the number of kernel calls needed to instantiate a single container. While assembling a container's root filesystem, containerd consumed large amounts of CPU time, especially for images with more than 50 layers.
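To make the idea of a per-container UID range concrete, here is a minimal sketch of how a user-namespace ID mapping translates a UID inside a container to a UID on the host. It follows the three-field shape of Linux's /proc/<pid>/uid_map (ID inside the namespace, ID outside, length). The specific ranges and the names IdMapEntry and to_host_uid are hypothetical and chosen only for illustration; this is not containerd's or the kernel's code.

```python
from typing import List, NamedTuple, Optional


class IdMapEntry(NamedTuple):
    """One mapping line: inside-namespace start, host start, range length."""
    inside_start: int
    outside_start: int
    length: int


def to_host_uid(container_uid: int, mapping: List[IdMapEntry]) -> Optional[int]:
    """Translate a UID seen inside the container to the host UID, or None if unmapped."""
    for entry in mapping:
        offset = container_uid - entry.inside_start
        if 0 <= offset < entry.length:
            return entry.outside_start + offset
    return None


# Hypothetical, non-overlapping host ranges for two containers (assumed values).
container_a = [IdMapEntry(inside_start=0, outside_start=100000, length=65536)]
container_b = [IdMapEntry(inside_start=0, outside_start=165536, length=65536)]

print(to_host_uid(0, container_a))      # 100000: "root" in container A
print(to_host_uid(0, container_b))      # 165536: "root" in container B
print(to_host_uid(70000, container_a))  # None: outside the mapped range
```

Because each container gets a different range, the files in a shared image layer have to be presented with a different ownership mapping for every container, which is why the per-layer mapping work described above is repeated on each launch.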
Multi-Core Bottlenecks
Netflix isn't alone in facing these challenges. Meta engineers, managing thousands of containers for AI inference, have also encountered performance drops when managing containers with thousands of UIDs. This issue is not unique to these tech giants; it's a common system design problem in multi-core processing.
Shared data structures often become the bottleneck, as observed by researchers in a widely cited paper presented at the 2008 USENIX Symposium on Operating Systems Design and Implementation (OSDI). They found that many commonly used applications suffer from kernel-level bottlenecks on multi-core servers.
The CPU Bottleneck Unveiled
The hardware differences make the story more interesting. The timeouts occurred predominantly on AWS r5.metal instances, which use Intel Xeon processors of the Skylake/Cascade Lake generation and expose 96 virtual CPUs. The issues were less frequent on the newer seventh-generation AWS instances that were also part of Netflix's Kubernetes deployments: the Intel-based m7i.metal-24xl and the AMD EPYC-based m7a.24xlarge.
The engineering team developed a microbenchmark to compare lock contention across different multi-core systems. They discovered that the older mesh architecture of the processors in r5.metal instances was the bottleneck. The design struggled to synchronize the global mount lock across many cores, leading to massive cache-line contention.
In contrast, the other AWS instances use a distributed architecture, where multiple cores have their own local last-level cache. Lock contention is much rarer in these designs, allowing for smoother scaling under load.
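The article does not describe how Netflix's microbenchmark was built, so the following is only a sketch of the general shape of such an experiment: several threads repeatedly take either one shared lock or their own private lock, and the elapsed time is compared. The function name run_workers and the thread and iteration counts are assumptions. Note that CPython's global interpreter lock serializes the threads, so this illustrates the structure of the comparison and does not measure real cache-line contention; a faithful benchmark would be written in a language such as C, Go, or Rust and pinned to specific cores.

```python
import threading
import time
from typing import Tuple


def run_workers(num_threads: int, iterations: int, shared_lock: bool) -> Tuple[int, float]:
    """Run num_threads workers that each increment a counter under a lock.

    With shared_lock=True every worker contends on one global lock;
    otherwise each worker uses its own private lock.
    Returns (total increments, elapsed seconds).
    """
    global_lock = threading.Lock()
    counters = [0] * num_threads

    def worker(index: int) -> None:
        lock = global_lock if shared_lock else threading.Lock()
        for _ in range(iterations):
            with lock:
                counters[index] += 1

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    start = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    elapsed = time.perf_counter() - start
    return sum(counters), elapsed


if __name__ == "__main__":
    for shared in (True, False):
        total, seconds = run_workers(num_threads=8, iterations=20000, shared_lock=shared)
        label = "one global lock" if shared else "per-thread locks"
        print(f"{label}: {total} increments in {seconds:.4f}s")
```

Running the same experiment on different machine types and comparing how the shared-lock time grows with the thread count is the kind of comparison the microbenchmark above was used for.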
The Fix: A Software and Hardware Approach
The Netflix engineering team tackled the issue at the software level by reducing the number of kernel system calls made by containerd. They created a pull request that changed how containerd handled global lock usage, leveraging the recursive binds feature introduced in the Linux kernel 6.3 release.
This change reduced the number of mount operations per container from O(n) to O(1), where n is the number of layers in the image. The pull request, backed by metrics from the microbenchmark, was merged into containerd version 2.2, released in November.
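A back-of-the-envelope model shows why moving from O(n) to O(1) matters when many containers start at once. The sketch below only counts operations; it assumes exactly one mount per layer in the old scheme and exactly one mount per container in the new one, which are simplifications rather than figures from the pull request. The function names and the workload of 100 containers with 50 layers each are illustrative.

```python
def mounts_per_layer_scheme(num_layers: int) -> int:
    """O(n): assume one ID-mapped mount for every image layer."""
    return num_layers


def mounts_constant_scheme(num_layers: int) -> int:
    """O(1): assume a single mount regardless of layer count."""
    return 1 if num_layers > 0 else 0


def total_mounts(num_containers: int, num_layers: int, scheme) -> int:
    """Total mount operations competing for the global lock in one burst."""
    return num_containers * scheme(num_layers)


# Illustrative burst: 100 containers, each built from a 50-layer image.
before = total_mounts(100, 50, mounts_per_layer_scheme)
after = total_mounts(100, 50, mounts_constant_scheme)
print(before)  # 5000
print(after)   # 100
```

Under these assumptions, the same burst issues 50 times fewer operations that have to pass through the global mount lock, and the saving grows with the number of layers in the image.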
But Netflix didn't stop there. They also considered the hardware aspect, opting to route workloads away from the problematic r5.metal instances and towards architectures that scaled better under these conditions.
The Takeaway
This real-world example highlights the importance of holistic performance engineering. While containers offer a flexible and scalable approach, optimizing performance requires a deep understanding of both the software stack and the underlying hardware. Getting the two to work well together is the key to delivering seamless user experiences at scale.
So, the next time you binge-watch your favorite show on Netflix, remember the intricate engineering that goes on behind the scenes to ensure a smooth and uninterrupted experience.