A plan to make BPF kfuncs polymorphic

David Vernet kicked off the BPF track at the 2024 Linux Storage, Filesystem, Memory Management, and BPF Summit with a talk about polymorphic kfuncs: kernel functions that can be called from BPF and that use different implementations depending on the calling context. He explained how this would be useful to the sched_ext BPF scheduling framework, but expected it to be helpful in other areas as well.

Alexei Starovoitov gave a talk later in the conference about the history of BPF, including the origin and motivation for kfuncs — stay tuned for an article on that. For now, knowing more about kfuncs is not really needed to understand Vernet's problem and proposed solution.

There are 151 kfuncs in the kernel as of version 6.9, so it should probably not be too surprising that they vary wildly. Some kfuncs, Vernet pointed out, are used for extremely common, basic functionality — such as the functions for acquiring and releasing locks. These kfuncs have the same meaning and implementation in every possible context, because what they do is fairly simple. Other kfuncs, however, can have context-specific semantics. Some may only have "any meaning at all [...] within specific contexts", Vernet said.

One example of this is the functions for manipulating dispatch queues — structures used in sched_ext to store lists of pending tasks. Vernet called them the basic building blocks of scheduler policy. One of the main functions for manipulating them from BPF, scx_bpf_dispatch(), always has the same meaning: adding a task to a different queue. But when called from different BPF callbacks, there are subtle variations in how the function can be used.

When called from a select_cpu() or enqueue() callback, scx_bpf_dispatch() cannot drop the run-queue lock relevant to the task, and can only dispatch tasks to the CPU that triggered the call. Furthermore, only tasks that are being woken or enqueued can be dispatched.

In contrast, when called from a dispatch() callback, scx_bpf_dispatch() is free to drop the run-queue lock, dispatch to any CPU, and dispatch multiple tasks. The difference is that dispatch() is called by a CPU that is about to otherwise go idle, and so there is no existing work on the CPU that needs to be carefully worked around.

In both cases, scx_bpf_dispatch() presents the same logical API, but the differing constraints mean that the implementation in these two cases is quite different. Right now, the code tracks which case it is in with a per-CPU variable, and then uses that to choose which implementation to use. "So you can work around it," Vernet admitted, but he wanted to see if the implementation could be better.
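The run-time workaround described above can be modeled in a few lines of user-space C. This is only a sketch under stated assumptions: every name below (`cur_ctx`, `model_dispatch()`, `run_callback()`, and the two implementations) is hypothetical, a plain array stands in for the per-CPU variable, and the only constraint modeled is the "local CPU only" rule for the enqueue-style path.

```c
/* User-space model only; none of these names come from sched_ext. */
enum call_ctx { CTX_NONE, CTX_ENQUEUE, CTX_DISPATCH };

#define MODEL_NR_CPUS 4

/* Stands in for the per-CPU variable that records the current callback. */
static enum call_ctx cur_ctx[MODEL_NR_CPUS];

/* Restricted variant: may only target the CPU that triggered the call. */
static int dispatch_local_only(int cpu, int task, int target_cpu)
{
	(void)task;
	return target_cpu == cpu ? 0 : -1;
}

/* Unrestricted variant: any target CPU is acceptable. */
static int dispatch_any(int cpu, int task, int target_cpu)
{
	(void)cpu; (void)task; (void)target_cpu;
	return 0;
}

/* One logical API; the implementation is picked at run time. */
int model_dispatch(int cpu, int task, int target_cpu)
{
	switch (cur_ctx[cpu]) {
	case CTX_ENQUEUE:
		return dispatch_local_only(cpu, task, target_cpu);
	case CTX_DISPATCH:
		return dispatch_any(cpu, task, target_cpu);
	default:
		return -1; /* called outside any known callback */
	}
}

/* Models the framework setting the context around a callback invocation. */
int run_callback(int cpu, enum call_ctx ctx, int task, int target_cpu)
{
	int ret;

	cur_ctx[cpu] = ctx;
	ret = model_dispatch(cpu, task, target_cpu);
	cur_ctx[cpu] = CTX_NONE;
	return ret;
}
```

In this model, the cost of the workaround is visible: every call pays for a state lookup and a branch, and the framework has to remember to set and clear the state around each callback.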

Vernet's proposal

Right now, every kfunc is associated with a BPF Type Format (BTF) ID. This is an ID used to represent the kfunc in the debugging information for the BPF program, but it is also used with the BPF instruction that calls a kfunc to indicate which one it wants to invoke. When the BPF program is loaded and then just-in-time compiled, the BTF IDs get resolved, and the resulting code can call them directly.

Vernet suggests extending this mechanism by having the BPF verifier support multiple kfuncs with the same ID — whenever it encounters a call to a kfunc, it would ask the subsystem associated with that kfunc ID what the real kfunc should be (using a new callback). The subsystem would then reply with a "concrete" kfunc ID, and loading would proceed in the same way. This approach moves the tracking of the context of a call from run-time to load-time, and eliminates the need for tracking the state in a per-CPU variable.
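The load-time idea can be sketched as a small user-space model. All identifiers here are assumptions made for illustration (`load_program()`, `sched_resolve()`, the `KF_*` IDs, and `enum callback_kind`); the talk did not specify a callback signature, and this is not the verifier's real interface.

```c
#include <stddef.h>

/* Hypothetical IDs: one generic ID that programs use, two concrete ones. */
enum {
	KF_DISPATCH_GENERIC = 100,
	KF_DISPATCH_ENQ     = 101,
	KF_DISPATCH_DSP     = 102,
};

/* Which callback the program being loaded is attached to. */
enum callback_kind { CB_SELECT_CPU, CB_ENQUEUE, CB_DISPATCH };

/* A call instruction, reduced to the kfunc ID it names. */
struct call_insn { int kfunc_id; };

/* The subsystem-provided callback: generic ID in, concrete ID out. */
typedef int (*resolve_fn)(int generic_id, enum callback_kind cb);

/* A scheduler-like subsystem choosing a concrete kfunc per callback. */
int sched_resolve(int generic_id, enum callback_kind cb)
{
	if (generic_id != KF_DISPATCH_GENERIC)
		return generic_id; /* not polymorphic: leave untouched */
	return cb == CB_DISPATCH ? KF_DISPATCH_DSP : KF_DISPATCH_ENQ;
}

/* Models the load step: rewrite every call to its concrete ID, once. */
int load_program(struct call_insn *insns, size_t n,
		 enum callback_kind cb, resolve_fn resolve)
{
	for (size_t i = 0; i < n; i++) {
		int concrete = resolve(insns[i].kfunc_id, cb);

		if (concrete < 0)
			return -1; /* subsystem rejected the call */
		insns[i].kfunc_id = concrete;
	}
	return 0;
}
```

After `load_program()` returns, the instructions name only concrete IDs, so nothing needs to be looked up when the program runs. That is the sense in which the context tracking moves from run time to load time.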

Vernet said that the advantage of this approach is the ergonomic API it presents, and the control it gives subsystems over how their kfuncs can be called. But the approach does have its drawbacks. For one thing, adding additional callbacks in the verifier threatens to make one of the most complicated parts of BPF even more so. For another, it would use load-time logic for what is really a static configuration — if the compiler understood the different contexts that the kfuncs care about, the correct kfunc implementation could be chosen at build-time.

A build-time configuration would be nicer, Vernet stated, but it would be "kind of a pain in the neck to implement". He suggested that implementing it statically was probably not a high priority. Vernet did think any mechanism for polymorphic kfuncs would probably be useful to areas of the kernel other than sched_ext.

Discussion

The other attendees had questions about Vernet's proposal. One member of the audience pointed out that there is already a similar mechanism for BPF helper functions (a different kind of kernel function callable from BPF programs, with a different interface), and asked that Vernet "look at this more holistically". Vernet replied that the equivalent aspect of helper functions lets the implementations differ depending on the BPF program type — so the same helper function can be implemented differently for a BPF program attached to a trace point or registered as a callback. But that approach won't work for his use case, because the program types in question are not sufficiently granular. As far as the verifier is concerned, all of the callbacks involved in sched_ext are of the same type, because they are all struct_ops programs (a mechanism where different parts of the kernel can define a struct full of function pointers to which BPF programs can be attached). He wants to be able to handle calls from different struct_ops programs differently — which almost certainly requires information the verifier doesn't have, since it is the other subsystems or modules which define struct_ops callbacks that would know which functions should be handled differently.

The discussion went back and forth a little bit, with the other attendee trying to identify ways that the mechanism could be generalized beyond struct_ops programs. Vernet agreed that "if we can abstract it that would be much better for sure," but didn't seem to think that the existing helper mechanism was a suitable basis for that.

Another member of the audience asked whether it would be possible to have kfuncs that behave differently based on the type of their arguments. The motivating use case would be to enable different data types being inserted into a BPF map to be handled differently. "I want to skip the ownership check when the argument is an sk_buf", they explained. Vernet agreed that this would be technically feasible, since the verifier knows the types of the arguments to the kfunc. The question, in Vernet's eyes, is whether this mechanism would be confusing.
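Selection by argument type could look roughly like the following user-space sketch. Everything in it is hypothetical: the `arg_type` values, the `struct obj` ownership field, and both insert functions are invented for illustration, and the ownership check is a stand-in for whatever check the questioner had in mind.

```c
/* User-space model only; nothing below is a real kernel or BPF API. */
enum arg_type { ARG_PLAIN_OBJ, ARG_SKB };

/* A toy object that records who owns it. */
struct obj { int owner; };

typedef int (*insert_fn)(int map_owner, const struct obj *o);

/* Default variant: refuse objects the map's owner does not own. */
static int insert_checked(int map_owner, const struct obj *o)
{
	return o->owner == map_owner ? 0 : -1;
}

/* Special-cased variant: skips the ownership check entirely. */
static int insert_unchecked(int map_owner, const struct obj *o)
{
	(void)map_owner; (void)o;
	return 0;
}

/*
 * Models a load-time choice: the argument type is already known when the
 * program is loaded, so the implementation can be picked once, up front.
 */
insert_fn resolve_insert(enum arg_type t)
{
	return t == ARG_SKB ? insert_unchecked : insert_checked;
}
```

The caller still sees a single insert operation, and supporting another special-cased type would mean extending `resolve_insert()` rather than exposing another kfunc name.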

The first participant in the conversation suggested that this use case could be served just by adding new kfuncs and letting the developer use the right one. The second commenter pushed back, saying that they did not want to introduce many new kfuncs for what is effectively the same behavior — especially not when it seems likely that there will be more types that need special handling to keep in maps in the future. Vernet agreed that it makes sense to give kfuncs the flexibility to decide what they want to do.

That was the end of the discussion at the time, so it remains to be seen whether the proposal will be adopted, and if so in what form.
