Two approaches to better kernel samepage merging
The kernel samepage merging (KSM) subsystem finds pages with identical contents and consolidates them into a single shared copy. KSM can improve memory efficiency, but it has some problems as well. In two memory-management-track sessions at the 2025 Linux Storage, Filesystem, Memory-Management, and BPF Summit, Mathieu Desnoyers and Sourav Panda proposed improvements to KSM to make it useful across a wider range of scenarios.
Supporting user-space text patching
Desnoyers has come to KSM by way of a seemingly unrelated problem that he is working on: code patching for user-space processes. He works on tools such as the LTTng tracing framework, which allows the placement of tracepoints within applications. In the current implementation, each tracepoint has a control variable and a conditional branch that determines whether it fires. Desnoyers said that some customers have applications containing up to 30,000 tracepoints; at that scale, any per-tracepoint overhead adds up quickly. He would like to improve the situation by using code-patching techniques like those used in the kernel. There are other cases where code patching could be useful as well: optimizing code based on the features available, or enabling application features as needed.
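To make the per-tracepoint cost concrete, here is a minimal sketch of the control-variable-plus-branch pattern described above. It is not LTTng's actual implementation; the structure, the function names, and the `request_start` tracepoint are all hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical tracepoint descriptor: one control variable per site. */
struct tracepoint {
    const char *name;
    atomic_bool enabled;   /* control variable, flipped by the tracer */
    unsigned long hits;    /* probe invocations (single-threaded sketch) */
};

static struct tracepoint tp_request_start = { .name = "request_start" };

/* Turn a tracepoint on or off at run time. */
static void tracepoint_set(struct tracepoint *tp, bool on)
{
    assert(tp != NULL);
    atomic_store_explicit(&tp->enabled, on, memory_order_release);
}

/* Probe body: only reached when the tracepoint is enabled. */
static void tracepoint_fire(struct tracepoint *tp, int arg)
{
    tp->hits++;
    printf("%s: arg=%d\n", tp->name, arg);
}

/*
 * Instrumented function. The load and the conditional branch are paid
 * on every call, even when tracing is disabled; this is the overhead
 * that code patching would replace with a no-op in the common case.
 */
static int handle_request(int id)
{
    if (atomic_load_explicit(&tp_request_start.enabled,
                             memory_order_acquire))
        tracepoint_fire(&tp_request_start, id);
    return id * 2;
}
```

Calling `tracepoint_set(&tp_request_start, true)` makes later `handle_request()` calls run the probe. With tens of thousands of such sites, the disabled-path load and branch are what add up.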
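Since this is performance-oriented work, he is quite concerned that code patching could introduce new performance problems of its own. In particular, modifying a page of text causes that page to be copied; code that was once shared is no longer shared. That copy happens even if every process sharing the page modifies it in the same way. KSM could perhaps help to eliminate those duplicated pages, but KSM's page scanning brings overhead of its own, which Desnoyers would like to avoid.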
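Another problem is that KSM is focused on deduplication for virtual-machine environments. It depends on configuration by the system administrator to work, and it brings security concerns. He would prefer a simple solution that just works without additional configuration.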
He had previously proposed a similar solution at the summit, but Linus Torvalds told him that there is no room in the kernel for two KSM implementations. Since he has no plans to replace the current KSM implementation, he has started exploring alternatives.
One idea that he is considering is a new system call to track per-user modifications to binaries. The mechanism would act as an overlay on top of the existing files used by applications; once in place, it would allow the relevant processes to be updated immediately without affecting others. That approach runs into trouble, though, when different parts of a process hierarchy need different instrumentation.
So a better approach might be a text_poke() system call that accepts a vector of instructions to apply. The kernel would track altered pages for each address space (mapped file) across multiple levels. Whenever a process modifies one of its pages in this way, the kernel would look for other modified copies of the same page. If a copy with identical modifications exists, the processes involved could share it. These modified pages would stay cached even after all of their users have exited, so that short-lived applications could reuse them; they could also be reclaimed when memory gets tight.
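The sharing logic can be modeled in user space. The sketch below is only a toy: no text_poke() system call exists, and `struct poke`, `poke_page()`, `release_page()`, and the fixed-size cache are all hypothetical names invented for illustration. It applies a vector of byte patches to a private copy of a page, then reuses a cached copy when one with identical contents is already present.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_BYTES 4096
#define MAX_CACHED 8

/* Hypothetical: one entry in the vector handed to text_poke(). */
struct poke {
    size_t offset;          /* byte offset within the page */
    const uint8_t *bytes;   /* replacement instruction bytes */
    size_t len;
};

/* A patched page that can be shared by users with identical patches. */
struct patched_page {
    uint8_t data[PAGE_BYTES];
    int users;              /* 0 means cached but currently unused */
};

static struct patched_page cache[MAX_CACHED];
static int cached_count;

/*
 * Apply the pokes to a copy of `orig`; if an identical patched page is
 * already cached, share it, otherwise add a new entry. Returns NULL if
 * a poke is out of range or the cache is full.
 */
static struct patched_page *poke_page(const uint8_t *orig,
                                      const struct poke *vec, size_t n)
{
    uint8_t tmp[PAGE_BYTES];

    memcpy(tmp, orig, PAGE_BYTES);
    for (size_t i = 0; i < n; i++) {
        if (vec[i].offset > PAGE_BYTES ||
            vec[i].len > PAGE_BYTES - vec[i].offset)
            return NULL;
        memcpy(tmp + vec[i].offset, vec[i].bytes, vec[i].len);
    }
    for (int i = 0; i < cached_count; i++) {
        if (memcmp(cache[i].data, tmp, PAGE_BYTES) == 0) {
            cache[i].users++;
            return &cache[i];
        }
    }
    if (cached_count == MAX_CACHED)
        return NULL;
    struct patched_page *p = &cache[cached_count++];
    memcpy(p->data, tmp, PAGE_BYTES);
    p->users = 1;
    return p;
}

/* Drop a reference; the page stays cached for later reuse. */
static void release_page(struct patched_page *p)
{
    assert(p->users > 0);
    p->users--;
}
```

Because the lookup happens at the moment of patching, no background scan is needed to find duplicates. A real implementation would also need locking, a reclaim policy, and a lookup faster than this linear comparison.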
It is fair to say that few people in the room showed much enthusiasm for this new proposal.
Matthew Wilcox asked whether Desnoyers was familiar with the "reflink" concept, which he described as a copy-on-write file link. Attempts to add a general reflink capability to Linux have not succeeded over the years, though some filesystems implement the functionality internally. Wilcox suggested that code patching could behave like a reflink under the hood, without exposing the modified files to user space. When code is patched, he said, a new inode could be created for the modified file and stored in the relevant virtual memory areas, keeping the changes encapsulated and invisible to users.
He pointed out that one of the biggest challenges is that the kernel lacks an efficient mechanism for caching reflinked files. When two reflinked files share the same unchanged page, that page is stored as separate entries in the page cache. Fixes for this problem have been proposed on various occasions, but it remains unresolved after years of consideration. Desnoyers asked how the kernel could map a list of modifications to the corresponding inode; Wilcox thought about that for a moment before responding: "Oh well, it was an intriguing thought."
David Hildenbrand expressed strong interest in the reflink idea. A high-level description of the changes to be made would be needed, he said; the kernel would generate a new file from that list on demand and reclaim it when necessary. The idea seems simple and clean, he said, but "it's not as straightforward as it seems". The session concluded with Wilcox saying that it was an interesting problem.
Selective KSM
In the following session, Panda briefly presented two proposals aimed at addressing existing problems with KSM. The feature is effective, but it requires extensive parameter tuning to work well. It also incurs run-time overhead for page scanning and has been recognized as a potential security concern.
The first idea is "synchronous KSM", in which page merging is driven synchronously from user space. Merging would happen only when requested (with the time cost charged to the requesting process), and it would be limited to specified memory regions. The requests themselves could be made through sysfs, madvise(), or some other system call. The security improvement comes from the caller's ability to control which pages are eligible for merging, while CPU usage should be lower than with KSM's current background scanning. The approach does have a significant drawback: once two pages diverge, they remain separate indefinitely, even if their contents later become identical again.
The second proposal is "partitioned KSM", in which processes are placed into sensitive and non-sensitive partitions. The partitions would be managed through a sysfs hierarchy, and new ones could be created as needed. Merging would be controlled by writing a process ID and an address range to a partition's control file; the kernel would assign the process to the partition, then synchronously scan the given address range for merge candidates. Hildenbrand noted that this is much like using madvise(MADV_MERGEABLE) to control merging, with the key difference being that it runs synchronously. He suggested using process_madvise() rather than sysfs to implement this functionality.
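For comparison, this is roughly how a process opts a region into the existing, asynchronous KSM using madvise(MADV_MERGEABLE). It is a minimal sketch; the helper name `map_mergeable()` is hypothetical, while mmap() and madvise() are the standard Linux calls.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Map `len` bytes of anonymous memory, fill it with `fill`, and mark
 * it as a candidate for KSM merging. Returns the mapping, or NULL if
 * mmap() fails. *advise_err is set to 0 if madvise() succeeded, or to
 * its errno value otherwise (EINVAL on kernels built without
 * CONFIG_KSM).
 */
static void *map_mergeable(size_t len, int fill, int *advise_err)
{
    assert(len > 0 && advise_err != NULL);

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    memset(p, fill, len);   /* identical pages are merge candidates */
    *advise_err = (madvise(p, len, MADV_MERGEABLE) == 0) ? 0 : errno;
    return p;
}
```

A successful madvise() call only registers the range; any merging is done later by the background ksmd thread, and only if KSM has been enabled through /sys/kernel/mm/ksm/run. That delay and the scanning behind it are what the synchronous proposals aim to replace.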
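Panda said that an alternative would be to create a new ksm_open() system call, which would take a parameter giving the name of a partition and return a file descriptor representing that partition. A call to ksm_merge() would then request the merging of duplicate pages within the partition. Other system calls would be added to undo page merging or to leave the partition entirely. Hildenbrand said that abandoning the existing KSM implementation would not be a sensible move, so only approaches that add the new partitioning mechanism on top of it are worth considering.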
As the session (and the day) came to a close, Panda asked whether such a feature should be configurable at compile time; Michal Hocko advised against that, since KSM is already an opt-in feature. He said that he prefers the file-descriptor approach, which provides a clear namespace for KSM operations. Hildenbrand added that the global KSM functionality could be retained, provided that it is carefully disabled for any process joining a partition.
