
A Hadoop data pipeline to analyze application performance


1. Introduction

In recent years, Hadoop has garnered significant attention for its flexible and efficient architecture for storing and processing vast quantities of data on commodity hardware. It is frequently used to analyze application log files, since the volume of logs generated by applications keeps growing (volume) while the logs themselves are typically unstructured (variety). The architecture follows a distributed master/worker model in which master nodes coordinate the cluster while worker nodes execute many tasks in parallel, which allows the system to scale horizontally as data volumes grow.

In this project, we have constructed a data pipeline to assess application performance by combining application performance metrics (appperfdata) derived from log file records with database performance metrics (db2perfdata) obtained from the DBAU database. The system uses XXX as the example application for analysis; however, it is flexible enough to be adapted to other applications as needed.

In this typical scenario, appperfdata captures the duration of RESTful API calls. The sample log line below contains a minute-level date stamp, a precise timestamp, the call duration in milliseconds, and the requested URL; this particular API call took approximately 283 milliseconds to complete. By ordering the records by duration, we can identify underperforming RESTful APIs and apply optimizations accordingly.

2012-12-14-06-01 06:01:24.743 283 /XXX/webapp/maskingservice/needMasking/4964

On the other hand, db2perfdata holds the database performance metrics collected from DBAU. It allows us to determine how many select, read, insert, update, and delete operations were performed at a particular point in time (currently aggregated at the minute level).

2. System Design
2.1 Overview
A set of Hadoop ecosystem tools, as illustrated in Figure 2.1, is employed to construct the data pipeline, which consists of four stages:
a) Data Collection: Utilizing Flume, we gather both log files and db2perfdata and transfer them into Hadoop storage.
b) Data Storage: HDFS is used to archive the log files, while HBase serves as the repository for db2perfdata.
c) Data Processing: Pig is employed to perform ETL operations on the log files, converting unstructured logs into structured data. Additionally, Oozie triggers Pig jobs based on predefined time intervals and data availability.
d) Data Analysis/Reporting: Hive enables us to run SQL queries for analysis and reporting. A notable feature is that we can join application performance metrics extracted from log files (stored in HDFS) with db2perfdata (stored in HBase).

Figure 2.1 Architecture of the data pipeline

2.2 Data Collection
2.2.1 Flume
Flume is a data transfer tool widely used with Hadoop. It is deployed here in a client/collector pattern: every day an Autosys job triggers the two clients, which run, together with their corresponding collectors, as background processes on the Hadoop VMs. As illustrated in the figure, the data collection pipeline consists of the following components (a collector configuration sketch follows the list):

  • Flume client sources: one Flume client handles Alcazar's daily log file, while the other invokes the show_db2perfdata script provided by DBAU to fetch db2perfdata at minute granularity.
  • Flume client sinks: both clients use the built-in Avro sink.
  • Flume collector sources: both collectors use the built-in Avro source, so clients and collectors communicate over the Avro protocol.
  • Flume collector sinks: the log file collector uses the built-in HDFS sink, while the db2perfdata collector uses the built-in HBase sink; because our data source has a custom format, a custom HBase serializer is required.
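
As a rough illustration of the collector side, the sketch below shows a Flume agent with an Avro source and an HDFS sink writing into the date-partitioned archive path. The agent name (logCollector), the port, and the base path are assumptions for illustration, not the project's actual configuration.

# Sketch of the log-file collector agent; names, port, and paths are illustrative.
logCollector.sources  = avroSrc
logCollector.channels = memCh
logCollector.sinks    = hdfsSink

# Avro source: receives events sent by the client agent's Avro sink
logCollector.sources.avroSrc.type = avro
logCollector.sources.avroSrc.bind = 0.0.0.0
logCollector.sources.avroSrc.port = 41414
logCollector.sources.avroSrc.channels = memCh

# In-memory channel buffering events between source and sink
logCollector.channels.memCh.type = memory

# HDFS sink: writes raw logs into the date-partitioned archive directory
logCollector.sinks.hdfsSink.type = hdfs
logCollector.sinks.hdfsSink.channel = memCh
logCollector.sinks.hdfsSink.hdfs.path = /user/junz/logAnalysis/alcazar/%Y/%m/%d
logCollector.sinks.hdfsSink.hdfs.fileType = DataStream
logCollector.sinks.hdfsSink.hdfs.useLocalTimeStamp = true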

2.3 Data Storage
2.3.1 HDFS
Due to its low cost and scalable architecture, HDFS is an ideal choice for storing massive amounts of raw log files. We organize the logs under a date-partitioned layout, "${nameNode_path}/alcazar/${YEAR}/${MONTH}/${DAY}", which makes storing and retrieving them straightforward.
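
For instance, assuming the same base directory that the ETL archive uses later (the exact prefix is an assumption), the raw logs collected on 14 December 2012 would land under:

/user/junz/logAnalysis/alcazar/2012/12/14/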

2.3.2 HBase
HBase is a column-oriented key/value store that provides real-time read/write access on top of HDFS. It is known for handling time-series and sparse data efficiently, which makes it a good fit for db2perfdata in this project: the metrics form a minute-level (and potentially second-level) time series, and at that granularity many of the counters are zero. Storing such sparse data is cheap in HBase, because absent cells take up no space. Its schema-free design also makes table creation trivial; only a table name and a column family need to be specified:

create 'alcazarDbPerf', 'f1'
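
To make the layout concrete, minute-level counters could be written and read from the HBase shell as follows (the row key format and the values are illustrative assumptions, not data from the project):

put 'alcazarDbPerf', '2012-12-14-06-01', 'f1:Sel', '120'
put 'alcazarDbPerf', '2012-12-14-06-01', 'f1:Upd', '7'
get 'alcazarDbPerf', '2012-12-14-06-01'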

2.4 Data Processing
2.4.1 Oozie
As described in Section 2.2, data collection is automated by an Autosys job that launches the Flume clients every morning. Similarly, data processing is automated by using Oozie to schedule the Pig jobs.

Figure 2.3 Oozie Pipeline

As Figure 2.3 illustrates:
a) The Coordinator task is both time-driven and data-driven. It is scheduled to start once a day at a designated hour; in addition, if the log file archive path "${nameNode}${alcazar_path}/${YEAR}/${MONTH}/${DAY}" does not yet exist at that time, Oozie keeps the coordinator waiting and triggers the workflow as soon as the data becomes available (a minimal coordinator definition is sketched after this list).
b) The Pig workflow task starts once the Coordinator task fires; it processes the data in the log file archive directory and writes the ETL output to a temporary staging area.
c) On completion, the workflow moves the ETL output from the staging area into the predetermined destination archive path.
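
The sketch below shows roughly what such a time-and-data-driven coordinator looks like; the application name, dates, port of the workflow path, and property names are assumptions for illustration rather than the project's actual definition (${nameNode} and ${alcazar_path} would come from the job properties).

<coordinator-app name="alcazar-log-etl" frequency="${coord:days(1)}"
                 start="2012-12-01T07:00Z" end="2020-01-01T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.2">
  <datasets>
    <!-- One dataset instance per day of archived raw logs -->
    <dataset name="rawLogs" frequency="${coord:days(1)}"
             initial-instance="2012-12-01T00:00Z" timezone="UTC">
      <uri-template>${nameNode}${alcazar_path}/${YEAR}/${MONTH}/${DAY}</uri-template>
      <!-- Empty done-flag: the directory's existence alone marks the data as available -->
      <done-flag></done-flag>
    </dataset>
  </datasets>
  <input-events>
    <!-- Hold the workflow until the current day's log directory exists -->
    <data-in name="todaysLogs" dataset="rawLogs">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>${nameNode}/user/junz/logAnalysis/oozie/pig-etl-wf</app-path>
    </workflow>
  </action>
</coordinator-app>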

2.4.2 Pig
Pig is frequently used to build data processing pipelines on Hadoop. As depicted in Figure 2.4, the Pig script developed for this project carries out four processing stages (a simplified sketch follows the list):

  1. Filtering: it employs string matching techniques to eliminate irrelevant logs.
  2. Field Extraction: regular expressions are utilized to identify and extract meaningful data fields such as date, time, url, etc., with two sets of fields being extracted—one for API start logs and another for API end logs.
  3. Joining: the two sets of fields are merged based on url and session id.
  4. Conversion: a custom user-defined function (UDF) is employed to calculate duration and convert datetime formats.
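
A simplified sketch of such a script is shown below. The regular expressions, the START/END markers, the field names, and the duration UDF (a hypothetical myudfs.ComputeDuration) are illustrative assumptions rather than the actual implementation; the output columns mirror the Hive table defined in Section 2.5.

-- Sketch of the four-stage ETL script; regexes, markers, and the UDF are hypothetical.
REGISTER 'myudfs.jar';

-- Stage 1: load the raw daily logs and keep only relevant lines
raw      = LOAD '$INPUT' USING TextLoader() AS (line:chararray);
relevant = FILTER raw BY line MATCHES '.*/webapp/.*';

-- Stage 2: extract one field set for API start entries and one for API end entries
start_lines = FILTER relevant BY line MATCHES '.* START .*';
end_lines   = FILTER relevant BY line MATCHES '.* END .*';
starts = FOREACH start_lines GENERATE FLATTEN(
             REGEX_EXTRACT_ALL(line, '(\\S+) (\\S+) START (\\S+) (\\S+)'))
             AS (startDate:chararray, startTime:chararray, url:chararray, session:chararray);
ends   = FOREACH end_lines GENERATE FLATTEN(
             REGEX_EXTRACT_ALL(line, '(\\S+) (\\S+) END (\\S+) (\\S+)'))
             AS (endDate:chararray, endTime:chararray, url:chararray, session:chararray);

-- Stage 3: join the two field sets on url and session id
joined = JOIN starts BY (url, session), ends BY (url, session);

-- Stage 4: compute the duration and normalize the datetime format with a UDF
result = FOREACH joined GENERATE
             starts::startDate AS startDate,
             starts::startTime AS startTime,
             myudfs.ComputeDuration(starts::startTime, ends::endTime) AS duration,
             starts::url AS url;

STORE result INTO '$OUTPUT' USING PigStorage('\t');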

2.5 Data Analysis/Reporting
2.5.1 Hive

Hive is a schema-on-read system that supports SQL on Hadoop. Schema-on-read means a schema only needs to be specified when the data is read, so the data being queried can be stored in virtually any format. In this project, appperfdata is stored as structured text in HDFS, while db2perfdata is stored in an HBase-specific format.
For appperfdata, the DDL to create the table is:

CREATE EXTERNAL TABLE alcazarPerf (startDate STRING, startTime STRING, duration INT, url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/junz/logAnalysis/alcazar/logETLArchive';

For db2perfdata, an external table (shown here with the HBase row key exposed as a time column) is mapped onto the HBase table alcazarDbPerf through the HBase storage handler:

CREATE EXTERNAL TABLE hbase_alcazarDbPerf (time STRING, Sel INT, Rd INT, Ins INT, Upd INT, Del INT)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,f1:Sel,f1:Rd,f1:Ins,f1:Upd,f1:Del")
TBLPROPERTIES ("hbase.table.name" = "alcazarDbPerf");
Once tables have been created in Hive, it becomes possible to employ SQL queries for analysis or reporting purposes. For instance, ordering application performance data by API duration can be achieved through the following query:
SELECT * FROM alcazarPerf ORDER BY duration DESC;
Joining appperfdata with db2perfdata can be done using this query:
SELECT * FROM alcazarPerf JOIN hbase_alcazarDbPerf ON alcazarPerf.startDate = hbase_alcazarDbPerf.time;
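
These queries can be extended into simple reports; for instance, a query along the following lines (an illustrative sketch, not one taken from the project) would surface the slowest APIs by average duration:

SELECT url, AVG(duration) AS avg_ms, COUNT(*) AS calls
FROM alcazarPerf
GROUP BY url
ORDER BY avg_ms DESC
LIMIT 10;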
