Automatic tuning for weighted interleaving

It is common, on NUMA systems, to try to allocate all memory on the local node, since it will be the fastest. That is not the only possible policy, though; another is weighted interleaving, which seeks to distribute allocations across memory controllers to maximize the bandwidth utilization on each. Configuring such policies can be challenging, however. At the 2025 Linux Storage, Filesystem, Memory-Management, and BPF Summit, Joshua Hahn ran a session in the memory-management track about how that configuration might be automated.

The core purpose of memory-allocation policy, he began, is to determine which node any given virtual memory area (VMA) should allocate pages from. One possible policy is simple round-robin interleaving to distribute allocations evenly across the system. Weighted interleaving modifies the round-robin approach to allocate more heavily from some nodes than others. Properly distributing allocations can maximize the use of the available memory bandwidth, improving performance overall.
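
The difference between the two interleaving policies can be illustrated with a small user-space simulation. This is only a sketch of the idea, not the kernel's implementation; the function name `weighted_interleave` and the example weights are made up for illustration.

```python
from collections import Counter
from itertools import islice


def weighted_interleave(weights):
    """Yield node IDs forever, taking weights[node] pages from each node in turn.

    `weights` maps a (hypothetical) NUMA node ID to a positive integer weight.
    With every weight set to 1 this degenerates to plain round-robin.
    """
    nodes = [node for node in sorted(weights) if weights[node] > 0]
    if not nodes:
        raise ValueError("at least one node needs a positive weight")
    while True:
        for node in nodes:
            # Allocate `weight` consecutive pages from this node, then move on.
            for _ in range(weights[node]):
                yield node


# Assumed example: node 0 is local DRAM, node 1 is a slower memory node.
placement = list(islice(weighted_interleave({0: 3, 1: 1}), 8))
counts = Counter(placement)

# Plain round-robin interleaving is the special case of equal weights.
round_robin = list(islice(weighted_interleave({0: 1, 1: 1}), 4))
```

With weights of 3 and 1, three out of every four pages land on node 0, while the equal-weight case alternates between the two nodes.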

The question Hahn had is: how can this interleaving be made to work out of the box? The system, he said, should provide good defaults for the interleaving weights. He had a couple of heuristics to help with the setting of those defaults. The weight ratios, intuitively, should be similar to the bandwidth ratios for the controllers involved. Bandwidth is the ultimate limit on the performance of the system, he said; it is more important than the latency of any given memory access. Weights should also be small numbers; the weight of each node is, in the end, the number of pages to allocate from that node before moving on, so smaller weights will lead to faster distribution of allocations.
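
Both heuristics (weights proportional to bandwidth, and kept small) can be sketched in a few lines. The code below is an assumption-laden illustration, not the algorithm Hahn proposed or anything the kernel is known to do: the function name, the `max_weight` cap of 8, and the bandwidth figures are all invented for the example.

```python
from functools import reduce
from math import gcd


def weights_from_bandwidth(bandwidth, max_weight=8):
    """Turn per-node bandwidth figures into small integer interleave weights.

    `bandwidth` maps a node ID to its bandwidth in any consistent unit.
    The fastest node is scaled to `max_weight` (an arbitrary cap chosen for
    this sketch), every node gets at least 1, and the result is divided by
    the greatest common divisor so the weights stay as small as possible.
    """
    if not bandwidth:
        raise ValueError("no nodes given")
    if any(bw <= 0 for bw in bandwidth.values()):
        raise ValueError("bandwidth values must be positive")
    peak = max(bandwidth.values())
    scaled = {
        node: max(1, round(bw * max_weight / peak))
        for node, bw in bandwidth.items()
    }
    common = reduce(gcd, scaled.values())
    return {node: weight // common for node, weight in scaled.items()}


# Hypothetical numbers: node 0 has four times the bandwidth of node 1.
uneven = weights_from_bandwidth({0: 256, 1: 64})
# Equal bandwidth collapses to plain round-robin weights.
even = weights_from_bandwidth({0: 100, 1: 100})
```

Here `uneven` comes out as weights of 4 and 1 rather than 256 and 64. The ratio is the same, but allocations cycle through both nodes every five pages instead of every 320, which is the "faster distribution" that small weights are meant to provide. Rounding to a small range does lose precision when the bandwidth ratio is not a simple fraction; that is a trade-off of this sketch.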

A problem arises, though, when new memory is added to the system; the kernel has to respond and recalculate all of the weights. How that should be done is not entirely clear, especially if the administrator has adjusted the defaults. The administrator should be able to tell the system what to do in that case, he said, with the available options being to recalculate all of the weights from the beginning, or to just set the weight for the new memory to one.
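
The two options described for the hot-add case can be written down as a simple choice between policies. This is a hypothetical sketch: the policy names `"recalculate"` and `"preserve"`, the function `handle_hot_add`, and the `recompute` callback are illustrative assumptions and do not correspond to any existing kernel interface.

```python
def handle_hot_add(current, new_node, policy, recompute):
    """Return the interleave weights to use after `new_node` comes online.

    current   -- existing node -> weight mapping (possibly admin-tuned)
    policy    -- "recalculate": discard `current` and recompute every weight
                 "preserve": keep `current` and give the new node weight 1
    recompute -- callable returning freshly computed weights for all nodes
                 (a stand-in for whatever default heuristic is in use)
    """
    if policy == "recalculate":
        return recompute()
    if policy == "preserve":
        updated = dict(current)  # leave the caller's mapping untouched
        updated[new_node] = 1
        return updated
    raise ValueError(f"unknown policy: {policy!r}")


# Assumed scenario: the administrator has tuned nodes 0 and 1 by hand,
# and node 2 is added later.
tuned = {0: 5, 1: 2}
fresh_defaults = lambda: {0: 4, 1: 1, 2: 1}  # placeholder default weights

preserved = handle_hot_add(tuned, 2, "preserve", fresh_defaults)
recalculated = handle_hot_add(tuned, 2, "recalculate", fresh_defaults)
```

The "preserve" path keeps the administrator's values of 5 and 2 and adds the new node with a weight of one, while "recalculate" throws the manual tuning away in favor of fresh defaults.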

Reprising a theme from an earlier session, Hahn brought up the sort of complications that hardware can bring. Given a system with two host bridges, each of which has two CXL nodes, how many NUMA nodes should the system have? The hardware can present the available resources in a few different ways, with effects that show up in the configuration problem at the kernel level.

Ideally, of course, the tuning of the weights should be dynamic, based on some heuristic, but Hahn said that he is not entirely convinced that bandwidth use is the right metric to optimize for. He wondered if the kernel should be doing the tuning, or whether it should be delegated to user space, which might have more information. Liam Howlett said that the responsibility for this tuning definitely belongs in user space; the kernel cannot know what the user wants. Gregory Price (who did the original weighted-interleaving work) pointed out that there is currently no interface that allows one process to adjust another's weights; that would be needed for a user-space solution.

Michal Hocko said that problems like this show that the kernel's NUMA interfaces are not addressing current needs. That problem needs to be addressed; it presents a good challenge that can lead to the creation of better interfaces. Jonathan Cameron said that user space currently does not have enough data to solve this problem.

Price said that users may want to interleave a given VMA from the moment it is instantiated, and wondered whether the NUMA abstraction is able to handle that. Hocko answered in the negative, saying that the NUMA interface was designed for static hardware, and falls short even on current systems. The kernel's memory-policy interface is constraining; it is really time to create a new NUMA API, hopefully one that will handle CXL as well.

Howlett said that kernel developers were never able to get the out-of-memory killer right, so now that problem is usually handled in user space. He was not convinced that the kernel community would be any more successful with interleaving policy. Hocko responded that user-space out-of-memory killers did not work well either until the pressure-stall information interface was added; before then, nobody had thought that it would be the necessary feature that would enable a solution to that problem.

The session ran out of time; it ended with a general consensus that a better interface for controlling memory policies is needed. Now all that is needed is for somebody to actually do that work.
