
Comparison of Big Data OLAP Databases: ClickHouse, Druid, and Pinot


In this post, I compare three prominent open-source data stores: ClickHouse, Druid, and Pinot. All three are designed to run analytical queries over large volumes of data with interactive latencies, while also ingesting data in real time.

Important note: this article is very long. If you're short on time, consider reading only the Summary section at the end.

Sources of Information

I learned about ClickHouse's internals from Alexey Zatelepin, one of its core developers. The most comprehensive English-language resource is this documentation section, but it covers only a limited range of topics.

I am a contributor to the Druid project, but I have no vested interest in the system (in fact, I may soon stop participating in its development), so readers can reasonably expect my coverage of Druid to be unbiased.

Everything I write about Pinot is based on the Architecture section of Pinot's wiki and the other wiki pages in its “Design Docs” section, last updated in June 2017, more than half a year before this writing.

This post was reviewed by Alexey Zatelepin and Vitaliy Lyudvichenko (ClickHouse developers); Gian Merlino (Druid PMC member and one of its most active developers); Kishore Gopalakrishna (architect of Pinot); and Jean-François Im (Pinot developer). Many thanks to all the reviewers.

Similarities between the Systems

Coupled Data and Compute

The fundamental similarity between ClickHouse, Druid, and Pinot is that they store data and run query processing on the same nodes, departing from the decoupled BigQuery-style architecture. Recently I described some inherent problems with this coupled architecture in the context of Druid (1, 2). There is currently no fully functional open-source equivalent of BigQuery (except, perhaps, Drill), and I have explored possible approaches to building such a system in this article.

Indexes and Static Data Distribution Distinguish Them from Big Data SQL Systems

The subject systems run queries faster than Big Data processing systems from the SQL-on-Hadoop family (Hive, Impala, Presto, Spark), even when the latter access data stored in columnar formats such as Parquet or Kudu. This is because ClickHouse, Druid, and Pinot:

  • Have their own formats for storing data with indexes and are tightly integrated with their query processing engines, whereas SQL-on-Hadoop systems are generally agnostic to the underlying data format and therefore less "intrusive" in Big Data backends.

  • Distribute data relatively "statically" across the nodes, and distributed query execution takes advantage of this knowledge. On the flip side, ClickHouse, Druid, and Pinot do not support queries that require moving large amounts of data between nodes, such as joins between two large tables.

No Point Updates and Deletes

On the other end of the spectrum, ClickHouse, Druid, and Pinot do not support point updates and deletes, unlike columnar systems such as Kudu, InfluxDB, and Vertica. This allows the three subject systems to apply more aggressive columnar compression and indexing, which means better resource utilization and faster queries.

The ClickHouse developers at Yandex plan to support updates and deletes in the future, but I'm not sure whether those will be true point operations or updates/deletes of ranges of data.

Big Data Style Ingestion

All three systems support streaming data ingestion from Kafka. Druid and Pinot support Lambda-style streaming alongside batch ingestion of the same data. ClickHouse accepts batch inserts directly, so it doesn't need a separate batch ingestion system the way Druid and Pinot do. More details on this are given below.

Proven at Large Scale

All three systems have been proven at large scale: Yandex.Metrica hosts a ClickHouse cluster of around 10,000 CPU cores; Metamarkets operates a Druid cluster of comparable size; and a single Pinot installation at LinkedIn spans thousands of machines.

Immature

All the subject systems are immature by the standards of enterprise databases. They are probably no more immature than the average open-source Big Data system, but still: ClickHouse, Druid, and Pinot all lack some obvious optimizations and features, and they are riddled with bugs (I'm not 100% sure about ClickHouse and Pinot, but I see no reason why they would be any better than Druid in this respect).

This brings us to the next important section —

On Performance Comparisons and Choice of the System

I regularly see people on the web compare and choose Big Data systems as follows: they take a sample of their data, somehow ingest it into the evaluated system, and then measure efficiency (how much memory or disk space the data takes, and how fast queries complete) without understanding how the evaluated system works internally. Then, relying only on such performance numbers, sometimes together with a list of the features they need and the features the systems currently have, they make a choice. Or, worse still, they decide to build their own "better" system from scratch.

I believe that this approach has significant issues, particularly in the context of open-source big data OLAP systems. Constructing a universal big data OLAP system that can efficiently handle a wide range of use cases and features (including all possible combinations thereof) is an extremely large challenge — it would require an estimated minimum of 100 man-years to develop such a system.

ClickHouse, Druid, and Pinot are each optimized only for the specific use cases their developers care about, and they offer only the features their developers need. If you deploy a large cluster of any of these systems and care about efficiency, I can almost guarantee you will hit bottlenecks that are unique to your workload and that the developers simply haven't run into or optimized yet. Moreover, the approach of throwing data into a system you don't understand risks serious inefficiencies caused purely by misconfiguration or poor schema design.

CloudFlare: ClickHouse vs. Druid


I do not question the validity of Cloudflare's final decision to go with ClickHouse; for a deployment of around ten nodes and for their use case, the choice of ClickHouse over Druid seems reasonable. However, the claim that ClickHouse is at least an order of magnitude more efficient (in terms of infrastructure costs) than Druid is unfounded. In fact, among the three systems discussed here, Druid offers the most for making truly cost-efficient installations possible (see the section "Tiering of Query Processing Nodes in Druid" below).

So when selecting a Big Data OLAP system, don't compare how optimal they currently are for your use case; all of them are currently far from optimal. Compare instead how quickly your organization could move each of these systems toward being optimal for your specific needs.

Because of their fundamentally similar architectures, ClickHouse, Druid, and Pinot have approximately the same "efficiency limit" and the same room for performance optimization. No single system has a magic ingredient that makes it significantly faster than the others. Don't draw conclusions from the fact that the systems currently perform quite differently on different benchmarks. For example, Druid doesn't support sorting data by a "primary key", which ClickHouse does, while ClickHouse lacks inverted indexes, which Druid has; each is therefore favored by certain workloads. If you are determined and capable, the missing optimizations can be implemented in the system of your choice with relatively modest effort.

For that, either your organization needs to employ engineers who are able to read, understand, and modify the source code of the chosen system, and have the capacity to let them do so (note that ClickHouse is written in C++, while Druid and Pinot are written in Java);

Or, your organization should sign a contract with a company that supports the chosen system. The options are Altinity for ClickHouse, and Imply and Hortonworks for Druid; there are currently no companies offering commercial support for Pinot.

Other development considerations:

The ClickHouse developers at Yandex have stated that they spend half of their time building features needed internally and the other half on the features with the most community demand. However, this benefits you only if the features you need coincide with those wanted by most of the rest of the ClickHouse community.

The Druid developers at Imply are motivated to build broadly applicable features in order to maximize their future business.

  • Druid's development process closely resembles the Apache model: for the last few years it has been developed by several companies with quite different priorities, none of which dominates the project.
  • ClickHouse and Pinot, by contrast, are developed almost exclusively within Yandex and LinkedIn, respectively.
  • Contributions to Druid therefore carry a smaller risk of being reverted or rejected for diverging from the goals of a primary developer, because Druid has no single dominant developer company.

Druid is committed to supporting a "developer API", which allows contributors to develop custom column types, aggregation algorithms, "deep storage" options, and so on, and to maintain them separately from the core Druid codebase. The Druid developers have documented this API and track its compatibility with prior versions. However, the API is not yet mature and tends to break in nearly every Druid release. As far as I know, ClickHouse and Pinot do not maintain similar APIs.

Judging by GitHub statistics, Pinot currently attracts the most development effort: an estimated 10+ man-years were invested in it over the past year, versus roughly 6 for ClickHouse and 7 for Druid. In theory, this means Pinot is improving faster than the other subject systems.

The architectures of Druid and Pinot are nearly identical to each other, while ClickHouse stands slightly apart. I will first compare ClickHouse against the generic Druid/Pinot architecture, and then discuss the smaller differences between Druid and Pinot.

Differences between ClickHouse and Druid/Pinot

Data Management: Druid and Pinot

In Druid and Pinot, the data of each "table" (a "data source" in Druid terminology) is partitioned into a designated number of parts. Along the time dimension, data is typically partitioned into intervals of a specified size. Each part of the data is sealed into a self-contained entity called a segment, which includes the table metadata, compressed columnar data, and indexes.

Segments are persisted in "deep storage" (e.g., HDFS) and can be loaded onto query processing nodes, but the latter are not responsible for the durability of the segments, so query processing nodes can be replaced relatively freely. Segments are not rigidly attached to particular nodes; they can be loaded on more or less any node. A special dedicated server (called "Coordinator" in Druid and "Controller" in Pinot, but generically referred to as "master" below) is responsible for assigning segments to nodes and for moving segments between nodes when needed. (This doesn't contradict the point above that all three subject systems, including Druid and Pinot, have a "static" data distribution between nodes, because segment loads and movements in Druid (and, I suppose, Pinot) are expensive operations, are not done for each particular query, and usually happen only every several minutes, hours, or days.)

Information about segments, and about their mapping to nodes, is persisted in ZooKeeper: directly in the case of Druid, and via the Helix framework in the case of Pinot. In Druid, this information is additionally persisted in an SQL database; see the section "Differences between Druid and Pinot" below.

Data Management: ClickHouse

ClickHouse doesn't have "segments" containing data that falls into specific time ranges. There is no "deep storage" in a ClickHouse cluster: the nodes are responsible not only for query processing but also for the persistence and durability of the data stored on them. So no HDFS or cloud data storage such as Amazon S3 is needed.

ClickHouse has partitioned tables, each consisting of a specific set of nodes. There is no "central authority" or dedicated metadata server. All nodes across which a table is partitioned hold full, identical copies of the table's metadata, including the addresses of every other node storing partitions of that table.

The metadata of a partitioned table includes "weights" of the nodes for distributing newly written data, e.g. 40% of new data should go to node A and 30% to each of nodes B and C. Normally the distribution should be even; a "skew" like in this example is only needed when a new node is added to the partitioned table and should be filled with data faster. Those weights are updated manually by ClickHouse cluster administrators, or an automated system can be built on top of ClickHouse.
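
For illustration only, here is a minimal Python sketch of how weight-proportional distribution of new batches could work; the node names, the weights, and the random selection scheme are hypothetical, not ClickHouse's actual implementation.

```python
import random

# Hypothetical weights for the nodes of a partitioned table: a newly added
# node-d gets a higher weight so that it fills up with data faster.
node_weights = {"node-a": 1.0, "node-b": 1.0, "node-c": 1.0, "node-d": 2.0}

def pick_node_for_batch(weights):
    """Pick the node that receives the next batch, proportionally to its weight."""
    nodes = list(weights)
    return random.choices(nodes, weights=[weights[n] for n in nodes], k=1)[0]

# Over many batches, node-d receives roughly twice as much new data as the others.
counts = {n: 0 for n in node_weights}
for _ in range(10_000):
    counts[pick_node_for_batch(node_weights)] += 1
print(counts)
```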

Data Management: Comparison

In terms of data management simplicity, ClickHouse beats both Druid and Pinot: it requires only a single node type, and it needs neither dedicated "master" servers nor deep storage. However, this approach becomes problematic when any table grows so large that it must be partitioned across dozens of nodes or more: the query amplification factor becomes as large as the partitioning factor, even for queries that cover only a small interval of data.


Data distribution tradeoff in ClickHouse

In the picture above, the data of a table in Druid or Pinot is distributed between three nodes, but a query over a small interval of data typically hits only two of them (unless the interval crosses a segment boundary). In ClickHouse, any query has to hit all three nodes if the table is partitioned between three nodes. In this example the difference doesn't look significant, but imagine that the number of nodes is in the hundreds, while the partitioning factor in Druid or Pinot could still be, for example, 10.
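
To make the trade-off concrete, here is a small back-of-the-envelope Python sketch; the numbers are my own illustrative assumptions, not benchmark results.

```python
# Rough, illustrative arithmetic for query fan-out, not a benchmark.
# In ClickHouse, a table partitioned over N nodes makes every query touch all N nodes.
# In Druid/Pinot, a small-interval query only touches the nodes holding the
# relevant segments.

cluster_size = 100              # total query processing nodes (hypothetical)
clickhouse_partitioning = 100   # ClickHouse: the table is spread over all nodes
segments_per_small_query = 10   # Druid/Pinot: a small-interval query hits ~10 segments

clickhouse_fanout = clickhouse_partitioning
druid_pinot_fanout = min(segments_per_small_query, cluster_size)

print(f"ClickHouse subqueries per small query: {clickhouse_fanout}")
print(f"Druid/Pinot subqueries per small query: {druid_pinot_fanout}")
# The larger the fan-out, the more per-node overhead (scheduling, tail latency,
# partial result merging) each small query pays.
```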

To mitigate this problem, the largest ClickHouse cluster at Yandex (hundreds of nodes) is in fact split into many "subclusters" of a few dozen nodes each. The cluster powers website analytics, and each data point has a "website ID" dimension. Each website ID is strictly assigned to a specific subcluster, where all the data for that website ID goes. A business logic layer on top of the ClickHouse cluster manages this data separation on both the ingestion and the query side. Thankfully, in their use case few queries need to touch data across multiple website IDs, and such queries don't come from customers of the service, so they don't have a strict real-time SLA.
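
A minimal sketch of what such a routing layer could look like, assuming a plain modulo mapping from website IDs to subclusters; the subcluster names and the mapping scheme are hypothetical, not Yandex's actual implementation.

```python
# Each website ID is pinned to exactly one subcluster, on both the ingestion
# and the query path. Subcluster names and the mapping are illustrative only.
SUBCLUSTERS = ["subcluster-1", "subcluster-2", "subcluster-3"]

def subcluster_for_website(website_id: int) -> str:
    """Statically assign every website ID to exactly one subcluster."""
    return SUBCLUSTERS[website_id % len(SUBCLUSTERS)]

# Both inserts and queries for website 12345 go only to this subcluster.
print(subcluster_for_website(12345))
```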

Another downside of the ClickHouse approach, which becomes apparent when a cluster grows quickly, is that data does not rebalance itself automatically; rebalancing requires manually changing the "node weights" of a partitioned table.

Tiering of Query Processing Nodes in Druid

The "segmented" data organization is easy to reason about, and segments are relatively easy to move between nodes. Those two factors allowed Druid to adopt tiering of query processing nodes: old data is automatically moved to servers with relatively large disks but less memory and CPU, while newer data is kept on nodes with more memory and CPU. This significantly reduces the cost of running a large Druid cluster, at the expense of slower queries over old data.
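
As a rough illustration of the idea (not Druid's actual rule configuration), tier assignment by data age could look like this sketch; the tier names and the 30-day cutoff are arbitrary assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

HOT_TIER = "hot"    # nodes with more CPU and RAM per loaded segment
COLD_TIER = "cold"  # nodes with big disks but little CPU/RAM per segment

def tier_for_segment(segment_end: datetime, now: Optional[datetime] = None) -> str:
    """Assign a segment to a tier purely by the age of its data."""
    now = now or datetime.now(timezone.utc)
    return HOT_TIER if now - segment_end <= timedelta(days=30) else COLD_TIER

print(tier_for_segment(datetime(2017, 1, 1, tzinfo=timezone.utc)))  # -> "cold"
```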

This feature has allowed Metamarkets to save substantially on Druid infrastructure costs every month, compared with a non-tiered cluster.


Tiering of query processing nodes in Druid

As far as I know, ClickHouse and Pinot don't currently have comparable features; all nodes in their clusters are expected to be identical.

Since Pinot's architecture is very similar to Druid's, adding an analogous feature to Pinot shouldn't be too hard. It would be harder in ClickHouse, because the concept of segments is what makes the feature straightforward to implement, but it is still feasible.

Data Replication: Druid and Pinot

The unit of replication in Druid and Pinot is a single segment. Segments are replicated both on the "deep storage" level (e.g., three replicas in HDFS, or transparently inside a cloud blob store such as Amazon S3) and on the query processing level: typically, in both Druid and Pinot, each segment is loaded onto two different nodes. The "master" server monitors the replication level of each segment and loads a segment onto some node if replication falls below the target (e.g., when a node becomes unresponsive).
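
Schematically, the master's replication-maintenance duty could be sketched as follows; the data structures, names, and the target replication factor of 2 are illustrative assumptions, not the actual Druid or Pinot code.

```python
# Sketch of what the "master" (Druid Coordinator / Pinot Controller) checks:
# count live replicas of each segment and schedule extra loads for segments
# that fall below the target. Names and structures are illustrative only.
TARGET_REPLICAS = 2

def find_underreplicated(segment_locations, live_nodes):
    """Return {segment_id: missing_replica_count} for segments below the target."""
    under = {}
    for segment_id, nodes in segment_locations.items():
        alive = [n for n in nodes if n in live_nodes]
        if len(alive) < TARGET_REPLICAS:
            under[segment_id] = TARGET_REPLICAS - len(alive)
    return under

segment_locations = {"seg-2017-01-01": {"node-1", "node-2"},
                     "seg-2017-01-02": {"node-2", "node-3"}}
live_nodes = {"node-1", "node-2"}  # node-3 just went down
print(find_underreplicated(segment_locations, live_nodes))
# -> {'seg-2017-01-02': 1}: the master would ask a live node to load it from deep storage.
```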

Data Replication: ClickHouse

The unit of replication in ClickHouse is a table partition on a server, i.e. all the data from some table stored on a server. Similar to partitioning, replication in ClickHouse is "static and specific" rather than "cloud style": several servers know that they are replicas of each other (for some specific table; for a different table, the replication configuration may be different). Replication provides both durability and query availability. When a disk on some node is corrupted, the data is not lost, because it is also stored on some other node. When some node is temporarily down, queries can be routed to a replica.

In the largest ClickHouse cluster at Yandex, the nodes are split into two equal sets in different data centers and are paired up, giving a replication factor of two.

ClickHouse relies on ZooKeeper to manage replication, so a single-node ClickHouse deployment that doesn't use replication doesn't need ZooKeeper at all.
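
For a concrete flavor of ClickHouse-style replication, here is a minimal sketch that creates a replicated table over the HTTP interface. It assumes a ClickHouse server listening on the default HTTP port 8123, the {shard} and {replica} macros configured in the server config, and made-up table and column names.

```python
import requests  # assumes the `requests` package is installed

# A replicated table is declared on every replica; ZooKeeper (via the path in the
# first argument) coordinates them. Table and column names here are made up.
CREATE_REPLICATED_TABLE = """
CREATE TABLE IF NOT EXISTS hits_local
(
    event_date Date,
    site_id    UInt32,
    user_id    UInt64
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/hits_local', '{replica}')
PARTITION BY toYYYYMM(event_date)
ORDER BY (site_id, event_date)
"""

# ClickHouse exposes an HTTP interface on port 8123; the same DDL is executed on
# every node that should hold a replica of this table.
resp = requests.post("http://localhost:8123/", data=CREATE_REPLICATED_TABLE)
resp.raise_for_status()
```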

Data Ingestion: Druid and Pinot

In Druid and Pinot, query processing nodes are specialized: they load segments and serve queries against the data in those segments; they do not accumulate new data or produce new segments.

When a table can be updated with a delay of an hour or more, segments are created using batch processing engines such as Hadoop or Spark. Both Druid and Pinot have first-class, out-of-the-box support for Hadoop. There is a third-party plugin for Druid indexing in Spark, but it is not officially supported. As far as I know, Pinot doesn't even have that level of support for Spark; you would need to understand the Pinot interfaces and write the Java or Scala code yourself, though that shouldn't be overly difficult. (Update: Ananth PackkilDurai of Slack has recently contributed support for Spark in Pinot.)

When a table needs real-time updates, both Druid and Pinot introduce "realtime nodes", which do three things: accept new data from Kafka (Druid also supports other sources), serve queries over the recent data, and produce segments in the background and push them to "deep storage".

Data Ingestion: ClickHouse

Because ClickHouse doesn't need to prepare "segments" containing all the data that falls into specified time intervals, its data ingestion architecture is simpler. ClickHouse needs neither a batch processing engine like Hadoop nor "realtime" nodes: the ordinary ClickHouse nodes, the same ones that store data and serve queries against it, accept batched writes directly.

If a table is partitioned, the node that accepts a batched write (e.g. 10k rows) distributes the data according to the "weights" of all the nodes over which the partitioned table resides (see the section "Data Management: ClickHouse" above).

Rows written in a single batch form a small "set". The set is immediately converted into columnar format. On each ClickHouse node, a background process merges row sets into larger ones. The ClickHouse documentation calls this principle "MergeTree" and notes its similarity to log-structured merge trees, though the term is a bit confusing, since the data is not organized in trees; it is stored in a flat columnar format.
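
The merging principle itself can be sketched in a few lines of Python; real MergeTree parts are columnar files on disk, so the lists of tuples below are only a stand-in.

```python
import heapq

# The essence of the merge step: every insert produces a small part sorted by the
# primary key, and a background process repeatedly merges several sorted parts
# into one larger sorted part.
part_1 = [(1, "a"), (5, "b"), (9, "c")]   # sorted by primary key
part_2 = [(2, "d"), (5, "e"), (7, "f")]
part_3 = [(3, "g"), (8, "h")]

def merge_parts(*parts):
    """Merge several key-sorted parts into one larger key-sorted part."""
    return list(heapq.merge(*parts, key=lambda row: row[0]))

print(merge_parts(part_1, part_2, part_3))
```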

Data Ingestion: Comparison

Data ingestion in Druid and Pinot is "heavy": it relies on several different services, and managing them takes substantial effort.

Data ingestion in ClickHouse is much more lightweight (at the cost of heavier management of historical data, see above), with one caveat: you need to batch the incoming data before it reaches ClickHouse. Ingestion and batching from Kafka is supported out of the box; for other sources of real-time data (alternative queuing infrastructure, stream processing engines, or plain HTTP endpoints), you need to either run an intermediate batching service or contribute such support directly to ClickHouse.
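
A minimal sketch of such an intermediate batching service, assuming a ClickHouse server reachable over its HTTP interface (port 8123) and a pre-existing table named `events`; the thresholds and the JSONEachRow-based insert are illustrative choices, not a production design.

```python
import json
import time
import requests  # assumes the `requests` package is installed

class ClickHouseBatcher:
    """Buffer incoming rows and flush them to ClickHouse in large batches."""

    def __init__(self, url="http://localhost:8123/", table="events",
                 max_rows=10_000, max_age_seconds=1.0):
        self.url, self.table = url, table
        self.max_rows, self.max_age = max_rows, max_age_seconds
        self.buffer, self.first_row_at = [], None

    def add(self, row: dict):
        if not self.buffer:
            self.first_row_at = time.monotonic()
        self.buffer.append(row)
        # Flush when the batch is big enough or old enough (a real service would
        # also flush on a timer, even if no new rows arrive).
        if (len(self.buffer) >= self.max_rows
                or time.monotonic() - self.first_row_at >= self.max_age):
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        body = "\n".join(json.dumps(r) for r in self.buffer)
        query = f"INSERT INTO {self.table} FORMAT JSONEachRow"
        # ClickHouse accepts the query as a URL parameter and the data in the body.
        requests.post(self.url, params={"query": query}, data=body).raise_for_status()
        self.buffer, self.first_row_at = [], None

batcher = ClickHouseBatcher()
batcher.add({"event_date": "2018-02-01", "site_id": 1, "user_id": 42})
batcher.flush()
```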

Query Execution

Druid and Pinot have a dedicated layer of nodes called "brokers", which accept all queries to the system. Based on the mapping of segments to the nodes on which those segments are loaded, brokers determine which "historical" query processing nodes each subquery should be issued to. Brokers keep this mapping information in memory. Broker nodes send the subqueries down to the query processing nodes and, when the results of those subqueries come back, merge them and serve the combined result to the user.

I can only speculate why it was decided to introduce yet another type of node when Druid and Pinot were designed. But now it appears essential: when clusters grow to tens of millions of segments in total, keeping the segment-to-node mapping requires gigabytes of memory, and it would be wasteful to allocate that much memory on every query processing node. This is another downside of the "segmented" data management architecture of Druid and Pinot.

In ClickHouse, allocating a separate set of nodes for "query distribution" is generally unnecessary. There is a special "distributed" table type in ClickHouse that can be set up on any node, and queries to this table do everything that broker nodes are responsible for in Druid and Pinot. Such tables are usually set up on every node participating in a partitioned table, so, in practice, every node can serve as an "entry point" for queries against the ClickHouse cluster. That node issues the necessary subqueries to the other partitions, processes its own part of the query, and merges it with the partial results from the other partitions.

When an entity (a node hosting a "distributed" table in ClickHouse, or a broker in Druid or Pinot) issues subqueries to other nodes and one or more of those subqueries fail for whatever reason, ClickHouse and Pinot handle this gracefully: they merge the results of all successful subqueries and still return a partial result to the user. Druid currently lacks this capability: if any subquery fails, the whole query fails.
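
Schematically, the fault-tolerant scatter-gather described above could look like the following Python sketch; the node names and the simulated failure are made up, and `query_node` stands in for a real per-node subquery call.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def query_node(node: str, query: str) -> int:
    """Stand-in for issuing a subquery to one node and getting a partial aggregate."""
    if node == "node-3":                     # simulate one failed node
        raise ConnectionError(f"{node} is unreachable")
    return 42                                # partial aggregate from this node

def scatter_gather(nodes, query):
    partial_results, failed = [], []
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        futures = {pool.submit(query_node, n, query): n for n in nodes}
        for fut in as_completed(futures):
            node = futures[fut]
            try:
                partial_results.append(fut.result())
            except Exception:
                failed.append(node)
    # ClickHouse and Pinot merge partial_results and can report the failed shards;
    # Druid (at the time of writing) fails the whole query instead.
    return sum(partial_results), failed

print(scatter_gather(["node-1", "node-2", "node-3"], "SELECT count() FROM t"))
```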

ClickHouse vs. Druid or Pinot: Conclusions

A segmented approach to data management in Druid and Pinot, compared to a simpler method in ClickHouse, defines many other aspects of these systems. Notably, this difference has minimal impact on the potential compression efficiency (despite the current disappointing compression story across all three systems), or query processing efficiency.

ClickHouse resembles traditional relational database management systems (RDBMS) such as PostgreSQL; in particular, it can be deployed on a single server. If the planned deployment is modest (say, under 100 CPU cores for query processing and up to 1 TB of data), ClickHouse has a significant edge over Druid and Pinot thanks to its simplicity and its lack of extra node types such as "master" and real-time ingestion nodes. In this space, ClickHouse competes with InfluxDB more than with Druid or Pinot.

Druid and Pinot resemble Big Data systems from the Hadoop ecosystem, such as HBase. Not in their performance characteristics, but in their dependency on ZooKeeper, their dependency on persistent replicated storage (such as HDFS), their focus on resilience to single-node failures, and their autonomous operation without constant human attention.

For a wide range of applications, none of ClickHouse, Druid, or Pinot is an obvious winner. First of all, I recommend taking into account which system's source code your engineers are able to understand and modify; the section "On Performance Comparisons and Choice of the System" above discusses this in more detail.

Second, you could examine the table that follows. Each cell of this table describes a property of an application that makes either ClickHouse or Druid/Pinot a better fit. Rows are not ordered by importance; the relative importance of each row differs from application to application. But if your application matches many properties from one column and few from the other, you should probably choose the system listed in the corresponding column header.

Note: none of the properties above means that you must use the corresponding system, or that the other should be ruled out. For example, if your cluster is expected to grow large, that doesn't mean you should consider only Druid or Pinot; it only makes them somewhat more likely candidates. ClickHouse may still turn out to be the better choice for a large cluster if other factors in your application weigh more heavily.

Differences between Druid and Pinot

As noted repeatedly above, the architectures of Druid and Pinot are remarkably similar. There are a few notable features that exist in one system and are absent from the other, and areas where one system is far ahead of the other. However, everything I mention below could be replicated in the other system with relatively modest effort.

There is only one major difference between Druid and Pinot that probably won't be eliminated in the foreseeable future: the implementation of segment management in the "master" node. The developers of both systems would rightly be skeptical of changing it, because neither approach strictly supersedes the other; each has its own advantages and drawbacks.

Segment Management in Druid

The "master" node in Druid (and neither in Pinot) does not handle metadata persistence for data segments within a cluster and fails to manage the current mapping between segments and query processing nodes where these segments are loaded. The management of this data is handled by ZooKeeper. Furthermore, Druid additionally maintains this information within an SQL database that should be provided to set up a Druid cluster. However, I am unsure about the initial reasoning behind this choice, but currently, it offers several advantages.

  • Less data needs to be stored in ZooKeeper. Only the minimal information about the mapping of segment IDs to query processing nodes is kept there; the rest of the extended metadata (segment size, the list of dimensions and metrics in its data, and so on) is stored only in the SQL database.

  • When segments are evicted from the cluster because their data has become too old (a standard feature of time-series databases, supported by ClickHouse, Druid, and Pinot alike), they are unloaded from the query processing nodes and their metadata is removed from ZooKeeper, but not from "deep storage" or the SQL database. As long as the segments are not deleted from those places manually, it remains possible to quickly "revive" very old data when it is needed, e.g. for some report or investigation.

This wasn't the original intent, but there are now plans in Druid to make the dependency on ZooKeeper optional. ZooKeeper is currently used for three different things: segment management, service discovery, and property storage, e.g. for managing realtime data ingestion. Service discovery and property storage could be provided by Consul. Segment management could be implemented with HTTP announcements and commands, and that is partially enabled by the fact that ZooKeeper's persistence function is "backed up" by the SQL database.

On the other hand, the dependency on an SQL database is a downside of Druid, especially if the organization doesn't already run one. Druid supports MySQL and PostgreSQL, and there is an open-source extension for Microsoft SQL Server. When deployed in the cloud, a managed RDBMS service such as Amazon RDS can also be used.

Segment Management in Pinot

While Druid implements all segment management functions itself, relying only on Curator for communication with ZooKeeper, Pinot delegates a large portion of segment and cluster management to the Helix framework. On the one hand, this lets the Pinot developers concentrate on other parts of the system. Helix probably also has fewer bugs than the equivalent logic inside Druid, because it is tested in a wider variety of environments and because more development effort has been invested in it.

On the other hand, Helix probably constrains Pinot within its "framework boundaries", and Helix (and therefore Pinot) will likely depend on ZooKeeper forever.

Now on to the shallower differences between Druid and Pinot. By "shallow" I mean that there are clear paths for replicating those features in the system that lacks them, if anyone is interested.

“Predicate pushdown” in Pinot

If the data is partitioned in Kafka by some dimension key, Pinot produces segments that carry metadata about that partitioning. Then, for a query with a predicate on that dimension, a broker node filters out non-matching segments up front, so that a smaller number of segments, and correspondingly fewer query processing nodes, need to be hit.

This feature is important for performance in some applications.

Druid currently supports key-based partitioning only when segments are created in Hadoop, not yet during realtime ingestion, and it does not yet implement predicate pushdown on its brokers.
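
A schematic sketch of this kind of broker-side pruning, with made-up data structures and a CRC32-based stand-in for the Kafka producer's partitioner:

```python
import zlib

NUM_KAFKA_PARTITIONS = 4

# One segment per Kafka partition (real clusters have many segments per partition).
segments = [{"id": f"seg-{p}", "kafka_partition": p}
            for p in range(NUM_KAFKA_PARTITIONS)]

def partition_for_key(key: str) -> int:
    """Deterministic stand-in for the producer's partitioner."""
    return zlib.crc32(key.encode()) % NUM_KAFKA_PARTITIONS

def prune_segments(segments, predicate_key):
    """Keep only segments that can contain rows with the given dimension value."""
    target = partition_for_key(predicate_key)
    return [s for s in segments if s["kafka_partition"] == target]

# Only 1 of 4 segments (and hence fewer query processing nodes) needs to be queried.
print(prune_segments(segments, "customer-42"))
```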

“Pluggable” Druid and Opinionated Pinot

Because Druid has been adopted by, and developed with contributions from, many organizations, over the years it has accumulated multiple interchangeable options for almost every dedicated component or "service":

  • HDFS, Cassandra, Amazon S3, Google Cloud Storage, Azure Blob Storage, etc. as "deep storage";

  • Kafka, RabbitMQ, etc. (via Tranquility) as sources of realtime data ingestion;

  • Druid itself, Graphite, Ambari, StatsD, or Kafka as sinks for telemetry ("metrics") of the Druid cluster.

Pinot, by contrast, was developed almost exclusively at LinkedIn to address LinkedIn's needs, so it leaves users with few choices: either HDFS or Amazon S3 must be used for deep storage, and Kafka is the only option for realtime ingestion. However, if someone needs more flexibility, adding support for alternatives to any of these "pluggable" parts shouldn't be too hard.

Data Format and Query Execution Engine are Optimized Better in Pinot

Specifically, the following features are part of Pinot's segment format but are missing from Druid's:

  • Data in indexed columns is compressed with bit-level granularity, whereas Druid compresses only with byte-level granularity.

  • Inverted indexes are optional for each column, whereas Druid builds them for every column, which is sometimes unnecessary and consumes a lot of memory. The difference in memory consumption between Druid and Pinot observed by Uber is likely attributable to this.

  • Min and max values in numeric columns are recorded per segment.

  • Built-in support for sorting the data, which in Druid can be achieved only manually and in a rather hacky way. Sorted data compresses better, which contributes to Pinot's better space efficiency.

  • A more optimized format is used for multi-value columns than in Druid.

All those functionalities could potentially be integrated into Druid, however. Although Pinot's format is significantly more optimized than Druid's current implementation, it still falls short of achieving true optimality. For instance, both Pinot and Druid rely solely on general-purpose compression methods like Zstandard and have not yet incorporated any of the innovative compression techniques outlined in the Gorilla paper.

On the query execution side, it is unfortunate that Uber compared Druid and Pinot mostly on count(*) queries (1, 2), which in Druid are currently executed as a naive linear scan, even though this could readily be optimized to O(1). Such comparisons illustrate the limitations of "black box" evaluations, as discussed in the section on performance comparisons and choice of the system above.

It can be inferred that the performance differences across GROUP BY queries, as reported by Uber, can be traced back to the absence of data sorting within Druid’s segments, as mentioned earlier in this section.

Druid Has a Smarter Segment Assignment (Balancing) Algorithm

Pinot's algorithm simply assigns a new segment to the query processing node that is currently the least loaded. Druid's algorithm is considerably more sophisticated: it takes into account the table and the time interval of each segment and computes a composite score, by which query processing nodes are ranked to pick the best one for the new segment. In production at Metamarkets, this algorithm yielded 30-40% faster queries. That said, Metamarkets is still not entirely happy with the algorithm; see the section "Historical node performance varies wildly" in this blog post.
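
To illustrate the difference in spirit (not the actual algorithms), here is a small Python caricature of "least loaded" assignment versus a cost-based score; the cost formula and its weights are invented for illustration.

```python
def assign_least_loaded(nodes, new_segment):
    """Pinot-style: pick the node that currently holds the fewest segments."""
    return min(nodes, key=lambda n: len(n["segments"]))

def assign_cost_based(nodes, new_segment):
    """Druid-style caricature: penalize co-locating segments of the same table
    and of adjacent time intervals; the weights are made up for illustration."""
    def cost(node):
        same_table = [s for s in node["segments"] if s["table"] == new_segment["table"]]
        adjacent = sum(abs(s["day"] - new_segment["day"]) <= 1 for s in same_table)
        return len(node["segments"]) + 2 * len(same_table) + 3 * adjacent
    return min(nodes, key=cost)

nodes = [
    {"name": "n1", "segments": [{"table": "hits", "day": 10}]},
    {"name": "n2", "segments": [{"table": "hits", "day": 3},
                                {"table": "ads", "day": 10}]},
]
new_segment = {"table": "hits", "day": 11}
print(assign_least_loaded(nodes, new_segment)["name"])  # -> n1 (fewest segments)
print(assign_cost_based(nodes, new_segment)["name"])    # -> n2 (avoids stacking adjacent "hits" days)
```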

I'm surprised that LinkedIn is still content with such a simplistic balancing algorithm in Pinot; there could be substantial gains if they invested more time in improving it.

Pinot is More Fault Tolerant on the Query Execution Path

As already mentioned in the "Query Execution" section above, when a broker node issues subqueries to other nodes and some of them fail, Pinot merges the results of all successful subqueries and still returns a partial result to the user.

Druid doesn’t implement this feature at the moment.

Tiering of Query Processing Nodes in Druid

As described in the section "Tiering of Query Processing Nodes in Druid" above, Druid allows query processing nodes to be split into tiers for older and newer data, where the tiers for older data have a much lower ratio of CPU and RAM resources per loaded segment. This trades slower queries over old data for a significant reduction in infrastructure costs.

As far as I know, Pinot doesn’t currently have a similar feature.

Summary

From an architectural standpoint, ClickHouse, Druid, and Pinot are highly similar. They occupy their own niche between general-purpose Big Data processing frameworks such as Impala, Presto, and Spark on one side, and columnar databases with proper support for unique primary keys, point updates, and deletes, such as InfluxDB, on the other.

Because of their similar architectures, ClickHouse, Druid, and Pinot have approximately the same "optimization limit". But as of today, all three systems are immature and very far from that limit. Substantial efficiency improvements can be made to any of them (when applied to a specific use case) within a few engineer-months of work. I don't recommend comparing the performance of the subject systems with each other at all.

Among the three, ClickHouse stands slightly apart from Druid and Pinot, while the latter two are almost identical: they are essentially two independently developed implementations of exactly the same system.

ClickHouse more closely resembles traditional databases such as PostgreSQL: it can be deployed on a single node, and at small scale (roughly under 100 CPU cores and 1 TB of data) it is considerably simpler to run than Druid or Pinot (see above).

Druid and Pinot more closely resemble other Big Data systems from the Hadoop ecosystem. They retain "self-driving" properties even at very large scale (more than 500 nodes), while ClickHouse requires a lot of attention from professional SREs. Druid and Pinot are also in a better position to optimize the infrastructure costs of large clusters, and they are better suited to cloud environments than ClickHouse.

The only sustainable difference between Druid and Pinot is that Pinot depends on the Helix framework and will continue to depend on ZooKeeper, while Druid could eventually move away from ZooKeeper. On the other hand, Druid installations will continue to depend on the presence of some SQL database.

Currently, Pinot is better optimized than Druid. But please note once again what I wrote above: I don't recommend comparing the performance of these systems at all; see the corresponding sections of this post.
