
A Hadoop data pipeline to analyze application performance


1. Introduction

In recent years, Hadoop has garnered significant attention for its flexible and efficient architecture for storing and processing vast quantities of data on commodity hardware. It is frequently used to analyze application log files, since the volume of logs generated by applications keeps growing (volume) while the logs themselves are typically unstructured (variety). The architecture follows a distributed master/worker model in which master nodes coordinate the cluster while worker nodes execute many tasks in parallel, which allows the system to scale horizontally as data volumes grow.

In this project, we have constructed a data pipeline to assess application performance by combining application performance metrics (appperfdata) derived from log file records with database performance metrics (db2perfdata) obtained from the DBAU database. The system uses XXX as the example application for analysis; however, it is flexible enough to be adapted to other applications as needed.

In this typical scenario, appperfdata captures the duration of RESTful API calls. The sample log line below contains a minute-level date stamp, a precise timestamp, the call duration in milliseconds, and the requested URL; this particular API call took approximately 283 milliseconds to complete. By ordering the records by duration, we can identify underperforming RESTful APIs and apply optimizations accordingly.

2012-12-14-06-01 06:01:24.743 283 /XXX/webapp/maskingservice/needMasking/4964

On the other hand, db2perfdata holds the database performance metrics collected from DBAU. It allows us to determine how many select, read, insert, update, and delete operations were performed at a particular point in time (currently aggregated at the minute level).

2. System Design
2.1 Overview
A set of Hadoop ecosystem tools, as illustrated in Figure 2.1, is employed to construct the data pipeline, which consists of four stages:
a) Data Collection: Utilizing Flume, we gather both log files and db2perfdata and transfer them into Hadoop storage.
b) Data Storage: HDFS is used to archive the log files, while HBase serves as the repository for db2perfdata.
c) Data Processing: Pig is employed to perform ETL operations on the log files, converting unstructured logs into structured data. Additionally, Oozie triggers Pig jobs based on predefined time intervals and data availability.
d) Data Analysis/Reporting: Hive enables us to run SQL queries for analysis and reporting. A notable feature is that we can join application performance metrics extracted from log files (stored in HDFS) with db2perfdata (stored in HBase).

Figure 2.1 Architecture of the data pipeline

2.2 Data Collection
2.2.1 Flume
Flume is a data transfer tool widely used with Hadoop. It is deployed here in a client/collector pattern: every day an Autosys job triggers the two clients, which run, together with their corresponding collectors, as background processes on the Hadoop VMs. As illustrated in the figure, the data collection pipeline consists of the following components (a collector configuration sketch follows the list):

  • Flume client sources: one Flume client handles Alcazar's daily log file, while the other invokes the show_db2perfdata script provided by DBAU to fetch db2perfdata at minute granularity.
  • Flume client sinks: both clients use the built-in Avro sink.
  • Flume collector sources: both collectors use the built-in Avro source, so clients and collectors communicate over the Avro protocol.
  • Flume collector sinks: the log file collector uses the built-in HDFS sink, while the db2perfdata collector uses the built-in HBase sink; because our data source has a custom format, a custom HBase serializer is required.
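
As a rough illustration of the collector side, the sketch below shows a Flume agent with an Avro source and an HDFS sink writing into the date-partitioned archive path. The agent name (logCollector), the port, and the base path are assumptions for illustration, not the project's actual configuration.

# Sketch of the log-file collector agent; names, port, and paths are illustrative.
logCollector.sources  = avroSrc
logCollector.channels = memCh
logCollector.sinks    = hdfsSink

# Avro source: receives events sent by the client agent's Avro sink
logCollector.sources.avroSrc.type = avro
logCollector.sources.avroSrc.bind = 0.0.0.0
logCollector.sources.avroSrc.port = 41414
logCollector.sources.avroSrc.channels = memCh

# In-memory channel buffering events between source and sink
logCollector.channels.memCh.type = memory

# HDFS sink: writes raw logs into the date-partitioned archive directory
logCollector.sinks.hdfsSink.type = hdfs
logCollector.sinks.hdfsSink.channel = memCh
logCollector.sinks.hdfsSink.hdfs.path = /user/junz/logAnalysis/alcazar/%Y/%m/%d
logCollector.sinks.hdfsSink.hdfs.fileType = DataStream
logCollector.sinks.hdfsSink.hdfs.useLocalTimeStamp = true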

2.3 Data Storage
2.3.1 HDFS
Due to its low cost and scalable architecture, HDFS is an ideal choice for storing massive amounts of raw log files. We organize the logs under a date-partitioned layout, "${nameNode_path}/alcazar/${YEAR}/${MONTH}/${DAY}", which makes storing and retrieving them straightforward.
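
For instance, assuming the same base directory that the ETL archive uses later (the exact prefix is an assumption), the raw logs collected on 14 December 2012 would land under:

/user/junz/logAnalysis/alcazar/2012/12/14/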

2.3.2 HBase
HBase is a column-oriented key/value store that provides real-time read/write access on top of HDFS. It is known for handling time-series and sparse data efficiently, which makes it a good fit for db2perfdata in this project: the metrics form a minute-level (and potentially second-level) time series, and at that granularity many of the counters are zero. Storing such sparse data is cheap in HBase, because absent cells take up no space. Its schema-free design also makes table creation trivial; only a table name and a column family need to be specified:

create 'alcazarDbPerf', 'f1'
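
To make the layout concrete, minute-level counters could be written and read from the HBase shell as follows (the row key format and the values are illustrative assumptions, not data from the project):

put 'alcazarDbPerf', '2012-12-14-06-01', 'f1:Sel', '120'
put 'alcazarDbPerf', '2012-12-14-06-01', 'f1:Upd', '7'
get 'alcazarDbPerf', '2012-12-14-06-01'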

2.4 Data Processing
2.4.1 Oozie
As described in Section 2.2, data collection is automated by an Autosys job that launches the Flume clients every morning. Similarly, data processing is automated by using Oozie to schedule the Pig jobs.

Figure 2.3 Oozie Pipeline

As Figure 2.3 illustrates:
a) The Coordinator task is both time-driven and data-driven. It is scheduled to start once a day at a designated hour; in addition, if the log file archive path "${nameNode}${alcazar_path}/${YEAR}/${MONTH}/${DAY}" does not yet exist at that time, Oozie keeps the coordinator waiting and triggers the workflow as soon as the data becomes available (a minimal coordinator definition is sketched after this list).
b) The Pig workflow task starts once the Coordinator task fires; it processes the data in the log file archive directory and writes the ETL output to a temporary staging area.
c) On completion, the workflow moves the ETL output from the staging area into the predetermined destination archive path.
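
The sketch below shows roughly what such a time-and-data-driven coordinator looks like; the application name, dates, port of the workflow path, and property names are assumptions for illustration rather than the project's actual definition (${nameNode} and ${alcazar_path} would come from the job properties).

<coordinator-app name="alcazar-log-etl" frequency="${coord:days(1)}"
                 start="2012-12-01T07:00Z" end="2020-01-01T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.2">
  <datasets>
    <!-- One dataset instance per day of archived raw logs -->
    <dataset name="rawLogs" frequency="${coord:days(1)}"
             initial-instance="2012-12-01T00:00Z" timezone="UTC">
      <uri-template>${nameNode}${alcazar_path}/${YEAR}/${MONTH}/${DAY}</uri-template>
      <!-- Empty done-flag: the directory's existence alone marks the data as available -->
      <done-flag></done-flag>
    </dataset>
  </datasets>
  <input-events>
    <!-- Hold the workflow until the current day's log directory exists -->
    <data-in name="todaysLogs" dataset="rawLogs">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>${nameNode}/user/junz/logAnalysis/oozie/pig-etl-wf</app-path>
    </workflow>
  </action>
</coordinator-app>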

2.4.2 Pig
Pig is frequently used to build data processing pipelines on Hadoop. As depicted in Figure 2.4, the Pig script developed for this project carries out four processing stages (a simplified sketch follows the list):

  1. Filtering: it employs string matching techniques to eliminate irrelevant logs.
  2. Field Extraction: regular expressions are utilized to identify and extract meaningful data fields such as date, time, url, etc., with two sets of fields being extracted—one for API start logs and another for API end logs.
  3. Joining: the two sets of fields are merged based on url and session id.
  4. Conversion: a custom user-defined function (UDF) is employed to calculate duration and convert datetime formats.
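
A simplified sketch of such a script is shown below. The regular expressions, the START/END markers, the field names, and the duration UDF (a hypothetical myudfs.ComputeDuration) are illustrative assumptions rather than the actual implementation; the output columns mirror the Hive table defined in Section 2.5.

-- Sketch of the four-stage ETL script; regexes, markers, and the UDF are hypothetical.
REGISTER 'myudfs.jar';

-- Stage 1: load the raw daily logs and keep only relevant lines
raw      = LOAD '$INPUT' USING TextLoader() AS (line:chararray);
relevant = FILTER raw BY line MATCHES '.*/webapp/.*';

-- Stage 2: extract one field set for API start entries and one for API end entries
start_lines = FILTER relevant BY line MATCHES '.* START .*';
end_lines   = FILTER relevant BY line MATCHES '.* END .*';
starts = FOREACH start_lines GENERATE FLATTEN(
             REGEX_EXTRACT_ALL(line, '(\\S+) (\\S+) START (\\S+) (\\S+)'))
             AS (startDate:chararray, startTime:chararray, url:chararray, session:chararray);
ends   = FOREACH end_lines GENERATE FLATTEN(
             REGEX_EXTRACT_ALL(line, '(\\S+) (\\S+) END (\\S+) (\\S+)'))
             AS (endDate:chararray, endTime:chararray, url:chararray, session:chararray);

-- Stage 3: join the two field sets on url and session id
joined = JOIN starts BY (url, session), ends BY (url, session);

-- Stage 4: compute the duration and normalize the datetime format with a UDF
result = FOREACH joined GENERATE
             starts::startDate AS startDate,
             starts::startTime AS startTime,
             myudfs.ComputeDuration(starts::startTime, ends::endTime) AS duration,
             starts::url AS url;

STORE result INTO '$OUTPUT' USING PigStorage('\t');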

2.5 Data Analysis/Reporting
2.5.1 Hive

Hive is a schema-on-read system that supports SQL on Hadoop. Schema-on-read means a schema only needs to be specified when the data is read, so the data being queried can be stored in virtually any format. In this project, appperfdata is stored as structured text in HDFS, while db2perfdata is stored in an HBase-specific format.
For appperfdata, the DDL to create the table is:

CREATE EXTERNAL TABLE alcazarPerf (startDate STRING, startTime STRING, duration INT, url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/junz/logAnalysis/alcazar/logETLArchive';

For db2perfdata, an external table (shown here with the HBase row key exposed as a time column) is mapped onto the HBase table alcazarDbPerf through the HBase storage handler:

CREATE EXTERNAL TABLE hbase_alcazarDbPerf (time STRING, Sel INT, Rd INT, Ins INT, Upd INT, Del INT)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,f1:Sel,f1:Rd,f1:Ins,f1:Upd,f1:Del")
TBLPROPERTIES ("hbase.table.name" = "alcazarDbPerf");
Once tables have been created in Hive, it becomes possible to employ SQL queries for analysis or reporting purposes. For instance, ordering application performance data by API duration can be achieved through the following query:
SELECT * FROM alcazarPerf ORDER BY duration DESC;
Joining appperfdata with db2perfdata can be done using this query:
SELECT * FROM alcazarPerf JOIN hbase_alcazarDbPerf ON alcazarPerf.startDate = hbase_alcazarDbPerf.time;
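
These queries can be extended into simple reports; for instance, a query along the following lines (an illustrative sketch, not one taken from the project) would surface the slowest APIs by average duration:

SELECT url, AVG(duration) AS avg_ms, COUNT(*) AS calls
FROM alcazarPerf
GROUP BY url
ORDER BY avg_ms DESC
LIMIT 10;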
