A Primer on the Role of Outliers in Data Analysis
Author: 禅与计算机程序设计艺术
1. Introduction
Identifying anomalous data is a key step in data analysis and machine learning tasks: it aims to uncover outlying or rare patterns by exploring large datasets. Anomalous data points are defined as observations that differ significantly from the majority of other observations in some attribute, such as value or distribution. They can be used for anomaly detection, forecasting future events, spotting important trends, and other purposes, and properly identifying and handling these outliers can significantly improve the accuracy of predictive models. This article takes an in-depth look at the role of outliers in data analysis and presents the basic concepts along with methods for detecting and managing them efficiently. In addition, we introduce several commonly used algorithms and illustrate them in detail with Python implementations and worked examples. Finally, we discuss the future challenges and limitations this field may face.
2. Basic Concepts and Terminology
An atypical observation diverges from other data points in attributes such as value or distribution. This can stem from various types of errors, including recording inaccuracies, measurement miscalibration, experimental fluctuations, and sampling biases. A variety of methods exist for identifying and removing outliers from datasets; they can be broadly categorized into statistical, analytical, and machine learning-based techniques.
- Tukey's rule - This method places fences 1.5 inter-quartile ranges below the first quartile and above the third quartile. Any data point falling outside this range is regarded as an outlier and removed from the dataset.
- Local Outlier Factor (LOF) - LOF measures the local density deviation of a given object with respect to its neighbors and uses this information to detect outliers: objects whose local density is much lower than that of their neighbors receive high LOF scores, and an object is flagged as a potential outlier when its LOF score exceeds a chosen threshold.
- Distance-based methods - These techniques assess the distance of a particular object to every other object within the dataset, labeling those at a significant distance as outliers. One widely used method is DBSCAN, which operates on Euclidean distances between data points.
- Standard deviation method - This method computes the standard deviation of each feature in the dataset and flags any sample lying more than two standard deviations from the mean. It is, however, quite sensitive to extreme values and works best when the data show no strong pattern or structure.
- Robust estimators - Robust estimators use the median absolute deviation (MAD), defined as the median of the absolute differences between each point and the median of the whole dataset. Any point falling outside the robust bounds derived from the MAD and the IQR of its feature is identified as an outlier.
Other methods examine the relationships between variables before flagging outliers. For instance, pairwise scatter plots display bivariate distributions and often reveal clusters of similarly valued data points, while correlation coefficients provide further insight into variable relationships, identifying pairs with particularly strong correlations that may indicate multi-collinearity.
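As a small illustration (with made-up numbers and hypothetical column names), pairwise scatter plots and the correlation matrix can be inspected like this:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# hypothetical dataset with two strongly correlated columns
df = pd.DataFrame({
    'height_cm': [160, 165, 170, 175, 180, 210],
    'weight_kg': [55, 60, 65, 70, 75, 140],
})

sns.pairplot(df, diag_kind='hist')   # bivariate scatter plots plus per-column histograms
plt.show()

print(df.corr())                     # coefficients close to +/-1 hint at multi-collinearity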
3. Core Algorithms: Principles, Steps, and Mathematical Formulas
Now let's go over each algorithm mentioned above in detail:
3.1 Tukey's Rule
Tukey's rule places fences a fixed multiple of the inter-quartile range beyond the lower and upper quartiles and eliminates any data point that falls outside these boundaries. The following derivation shows how the quartiles and the resulting thresholds are computed.
Suppose X represents the sorted list of n data points, and let m = \lfloor n/2 \rfloor denote the position of the middle element of this sorted list. Then

Q_1 = \mathrm{median}(X_1, \ldots, X_m)

is the first quartile, i.e., one quarter of the data points lie below Q_1. Similarly,

Q_3 = \mathrm{median}(X_{m+1}, \ldots, X_n)

is the third quartile, i.e., one quarter of the data points lie above Q_3. The inter-quartile range (IQR) is calculated as

IQR = Q_3 - Q_1.
Any data point below (Q_1 - 1.5 \times IQR) or above (Q_3 + 1.5 \times IQR) is classified as an outlier and excluded from the analysis.
Here's how you can implement Tukey's rule using Python:
import numpy as np

def tukey(data):
    # Tukey's fences: keep only values within (Q1 - 1.5*IQR, Q3 + 1.5*IQR)
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    return data[(data > lower_bound) & (data < upper_bound)]
This function accepts a NumPy array data as input and returns a new array containing only the values that lie within the Tukey fences. Note that the percentiles are computed with NumPy's percentile() function, which is why NumPy is imported at the top of the snippet.
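As a quick sanity check (with made-up numbers), a single extreme value is dropped while the rest of the data is kept:
import numpy as np

data = np.array([1, 2, 3, 4, 100])
print(tukey(data))   # -> [1 2 3 4]; 100 lies far above Q3 + 1.5*IQR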
3.2 Local Outlier Factor (LOF)
Local Outlier Factor (LOF) is a widely used technique for detecting outliers in complex datasets. It determines how anomalous each data point is from the distances to its neighbors, assigning higher outlier scores to points that lie far from their neighbors. These scores can be used to filter out potential outliers before training a classification or regression model.
At its core, LOF estimates the local density of each data point from its nearest neighbors and flags a point as an outlier when its density is substantially lower than that of its neighbors (equivalently, when its LOF score exceeds a predetermined threshold). In the standard formulation the local density is derived from the reachability distances to the k nearest neighbors, although more general density estimators such as Gaussian kernels or Kernel Density Estimation (KDE) can also be used. Once the local densities of all data points are known, LOF combines them into a score that reflects how likely each object is to be an outlier.
To apply LOF for outlier detection with Scikit-learn's LocalOutlierFactor class, follow these steps:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor

# X is a 2-D feature matrix of shape (n_samples, n_features)
lof = LocalOutlierFactor(contamination='auto')  # novelty=False (default) so fit_predict can score the training data
outliers = lof.fit_predict(X)                   # +1 for inliers, -1 for outliers
scores = lof.negative_outlier_factor_           # raw outlier scores; more negative means more anomalous

# select the top-k most anomalous points
top_k = int(len(X) * 0.1)                       # assume we want to flag 10% of the data as outliers
indices = np.argsort(scores)[:top_k]            # smallest (most negative) scores first
outliers_selected = np.zeros(len(X), dtype=bool)
outliers_selected[indices] = True

# plot selected outliers
plt.scatter(X[:, 0], X[:, 1], c='#AAAAAA', s=10)
plt.scatter(X[outliers_selected, 0], X[outliers_selected, 1], marker='+', color='red', s=20)
plt.show()
Note that we build a boolean mask outliers_selected, aligned with the rows of the original data matrix X, to mark which rows LOF has flagged as outliers. If the number of false positives significantly exceeds the number of false negatives, consider lowering the contamination hyperparameter (or the top_k fraction) so that fewer points are labeled as outliers. Also note that LOF must be fitted on the data it scores, so each partition of the data has to be processed with its own fit; this makes straightforward distributed processing across partitions inefficient.
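For instance (illustrative values only; the right settings depend on the data, and X is the same feature matrix as above), a fixed contamination rate and a larger neighborhood could be specified like this:
from sklearn.neighbors import LocalOutlierFactor

# flag roughly 5% of points as outliers and use a wider neighborhood than the default of 20
lof = LocalOutlierFactor(n_neighbors=35, contamination=0.05)
labels = lof.fit_predict(X)   # -1 marks the points treated as outliers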
3.3 Distance-Based Methods
Distance-based methods calculate the distance between a given object and every other object in the dataset and classify objects that are far from the rest as outliers. The Euclidean distance, the square root of the sum of squared feature differences, is the most widely used. The Manhattan distance sums absolute feature differences, the Chebyshev distance takes the maximum feature difference, and the Mahalanobis distance additionally accounts for the covariance structure of the data.
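To make the differences concrete, here is a small sketch (with arbitrary example vectors) computing each of these distances with SciPy:
import numpy as np
from scipy.spatial import distance

u = np.array([1.0, 2.0])
v = np.array([4.0, 6.0])

print(distance.euclidean(u, v))    # sqrt((1-4)^2 + (2-6)^2) = 5.0
print(distance.cityblock(u, v))    # |1-4| + |2-6| = 7.0  (Manhattan)
print(distance.chebyshev(u, v))    # max(|1-4|, |2-6|) = 4.0

# Mahalanobis distance needs the inverse covariance matrix of the data
data = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]])
VI = np.linalg.inv(np.cov(data.T))
print(distance.mahalanobis(u, v, VI))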
DBSCAN is one of the most widely used clustering algorithms and is often employed for identifying dense areas within datasets. It starts by identifying core samples, points whose neighborhoods are more densely populated than their surroundings; core samples and the points reachable from them form clusters, while points that are not reachable from any core sample are labeled as noise and can be treated as outliers. DBSCAN copes well with complex cluster shapes and noisy data. SciPy does not ship a ready-made dbscan function, so the snippet below approximates this behavior with SciPy's hierarchical clustering utilities, cutting the dendrogram at a distance threshold:
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster
import matplotlib.pyplot as plt
def dbscan(data, eps, min_samples):
    # approximate density-based grouping: hierarchical clustering cut at distance eps
    # (min_samples is kept for interface compatibility but is not used by fcluster)
    dist_matrix = pdist(data)                           # condensed pairwise Euclidean distances
    Z = linkage(dist_matrix, 'ward')                    # hierarchical agglomerative clustering
    labels = fcluster(Z, t=eps, criterion='distance')   # cut the dendrogram at distance eps
    num_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # count clusters
    print("Number of clusters:", num_clusters)
    # plot clusters
    plt.scatter(data[:, 0], data[:, 1], c=labels, cmap='Set1', alpha=0.5)
    plt.colorbar()
    plt.show()
    return labels
We begin by computing the condensed pairwise distance matrix with pdist(). We then apply Ward's method through the linkage() function to build the hierarchical clustering tree Z, and fcluster() cuts this tree at the supplied distance threshold (the eps argument), producing one cluster label per point; unlike true DBSCAN, this cut does not make use of the min_samples parameter. The resulting cluster labels are returned and visualized with Matplotlib's scatter() function.
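For comparison, the actual DBSCAN algorithm is available in scikit-learn; a minimal sketch (the function name is ours, and eps and min_samples would need to be tuned for the data at hand) looks like this:
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

def dbscan_sklearn(data, eps, min_samples):
    # true density-based clustering: label -1 marks noise points, i.e. outliers
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(data)
    num_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print("Number of clusters:", num_clusters)
    plt.scatter(data[:, 0], data[:, 1], c=labels, cmap='Set1', alpha=0.5)
    plt.show()
    return labels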
3.4 Standard Deviation Method
The standard deviation approach is a straightforward and reliable technique for identifying outliers in a dataset. It is grounded in measuring deviation from the central tendency: any data point lying more than two standard deviations from the mean of its feature is treated as an outlier. The approach therefore keeps points near the mean and flags extreme observations, which makes it useful when a dataset exhibits significant variability. Here's how you can implement this method in Python:
import numpy as np

def stddev(data, threshold):
    means = np.mean(data, axis=0)                  # column-wise means
    stdevs = np.std(data, ddof=1, axis=0)          # column-wise sample standard deviations (ddof=1)
    zscores = (data - means) / stdevs              # z-scores for every entry
    return np.where(np.abs(zscores) >= threshold)  # (row, col) indices where |z-score| exceeds the threshold
This function accepts a NumPy array data and a threshold value and returns a tuple of two arrays: the row and column indices of entries whose absolute z-score meets or exceeds the threshold. Using the row indices, you can then filter out the offending rows, for example:
mask = stddev(X, threshold)[0].tolist()   # row indices of the flagged entries
rows_to_remove = []
for idx in mask:
    row = X[idx, :]
    if condition(row):                    # condition() is a user-supplied predicate deciding whether to drop this row
        rows_to_remove.append(idx)
X = np.delete(X, rows_to_remove, axis=0)
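The predicate condition() above is a placeholder for whatever domain rule decides that a flagged row should really be dropped. A purely illustrative definition (the numeric limits are hypothetical) might look like this:
import numpy as np

def condition(row):
    # hypothetical rule: drop the row if any value is negative or implausibly large
    return bool(np.any(row < 0) or np.any(row > 300))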
3.5 Robust Estimators
Robust estimators use statistics such as the median absolute deviation (MAD) to identify outliers in a dataset. The MAD is defined as the median of the absolute differences between each data point and the median of the whole dataset. Any data point that falls outside the robust bounds computed for its feature is treated as an outlier. The Python implementation below proceeds as follows: for each feature it computes the median, the IQR, and the MAD, derives lower and upper thresholds from them (roughly 1.5 IQR beyond the quartiles, tightened to 3 MAD around the median), and flags every value beyond those thresholds as an outlier.
import numpy as np

def mad(arr):
    # median absolute deviation of a 1-D array
    med = np.median(arr)
    return np.median(np.abs(arr - med))

def get_iqr(arr):
    # inter-quartile range of a 1-D array
    q1, q3 = np.percentile(arr, [25, 75])
    return q3 - q1

def get_mad_bounds(arr):
    # robust bounds: the tighter of (Q1 - 1.5*IQR, Q3 + 1.5*IQR) and (median +/- 3*MAD)
    med = np.median(arr)
    mad_val = mad(arr)
    q1, q3 = np.percentile(arr, [25, 75])
    iqr = q3 - q1
    lower_bound = max(q1 - 1.5 * iqr, med - 3 * mad_val)
    upper_bound = min(q3 + 1.5 * iqr, med + 3 * mad_val)
    return lower_bound, upper_bound

def robust_estimator(data):
    # compute robust (lower, upper) bounds for every column of a 2-D array
    robust_estimators = []
    for col in range(data.shape[1]):
        robust_estimators.append(get_mad_bounds(data[:, col]))
    return robust_estimators

def outliers_by_range(data, robust_estimators):
    # collect (row, col) positions whose values fall outside their column's robust bounds
    outliers = []
    for row in range(data.shape[0]):
        for col in range(data.shape[1]):
            lower, upper = robust_estimators[col]
            if data[row, col] <= lower or data[row, col] >= upper:
                outliers.append((row, col))
    return outliers
These functions take a NumPy array data as input: robust_estimator() computes a (lower, upper) bound pair for each feature, and outliers_by_range() then scans every row and records the (row, column) positions whose values fall outside the bounds for that column. The result is a list of the positions that exceeded their respective boundaries.
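A quick usage sketch (with made-up numbers) ties the helpers together:
import numpy as np

data = np.array([[2.0, 10.0],
                 [2.5, 11.0],
                 [3.0, 12.0],
                 [3.5, 13.0],
                 [2.8, 99.0]])   # 99.0 is an obvious outlier in the second column

bounds = robust_estimator(data)
print(outliers_by_range(data, bounds))   # -> [(4, 1)], the row/column of the value 99.0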
4. Concrete Code Examples and Explanations
To demonstrate the practical application of the algorithms outlined above, we will analyze a real-world dataset containing patients' blood pressure measurements recorded over time. We first load and preprocess the data, dropping rows with missing values and scaling the features to zero mean and unit variance.
import pandas as pd
import numpy as np
from sklearn.preprocessing import scale
df = pd.read_csv('bloodpressure.csv')
# drop rows with missing values
df.dropna(inplace=True)
# scale features to zero mean unit variance
X = df.drop(['id'], axis=1).values
scaled_X = scale(X)
Next, let's visualize the data distribution:
import seaborn as sns
sns.pairplot(pd.DataFrame(scaled_X, columns=df.columns.drop('id')), diag_kind='hist')  # label the scaled columns with the original feature names
From the figure, we observe an apparent concentration of outliers appearing at approximately (90, 110). Prior to proceeding, let's examine whether Tukey's rule would exclude this point:
tukeyed_X = tukey(scaled_X)
print(tukeyed_X.shape)
Output: (148, 2)
Tukey's rule did indeed eliminate this point. Now let's apply LOF and DBSCAN, both of which have proven effective in similar contexts.
from sklearn.neighbors import LocalOutlierFactor
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster
import matplotlib.pyplot as plt
# run LOF
lof = LocalOutlierFactor(contamination='auto')
outliers = lof.fit_predict(scaled_X)          # +1 for inliers, -1 for outliers
lof_scores = lof.negative_outlier_factor_     # more negative = more anomalous
# select the top-k most anomalous points
top_k = int(len(X) * 0.1)                     # keep the 10% most anomalous points
indices = np.argsort(lof_scores)[:top_k]      # most negative scores first
outliers_selected = np.zeros(len(X), dtype=bool)
outliers_selected[indices] = True
# plot selected outliers
plt.scatter(X[:, 0], X[:, 1], c='#AAAAAA', s=10)
plt.scatter(X[outliers_selected][:, 0], X[outliers_selected][:, 1], marker='+', color='red', s=20)
plt.title('Local Outlier Factor (LOF)')
plt.show()
# run DBSCAN
dist_matrix = pdist(X)              # condensed pairwise distance matrix
Z = linkage(dist_matrix, 'ward')    # hierarchical clustering tree
labels = fcluster(Z, t=2, criterion='distance')
# plot clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='Set1', alpha=0.5)
plt.title('Density-Based Spatial Clustering of Applications with Noise (DBSCAN)')
plt.show()
First, we fit a LOF model on the scaled data matrix and read the raw outlier scores from the negative_outlier_factor_ attribute (more negative means more anomalous). We then sort the scores in ascending order and designate the most anomalous 10% of points as candidate outliers. A boolean array named outliers_selected marks these points, which are highlighted in red on top of the full dataset. Finally, we apply the hierarchical approximation of DBSCAN from Section 3.3 with a distance cutoff of 2 units and visualize the resulting clusters with Matplotlib's scatter() function.
According to this analysis, the clustering step divides the dataset into four distinct clusters, one of which contains the outlier located near (90, 110). LOF, in contrast, separates the two main groups in the data while also flagging this outlier point, although it marks fewer points as outliers than the clustering approach does. Neither method captures every subtle variation present in the data, so either may lead to underfitting or overfitting in a downstream model. Consequently, the choice of outlier detection algorithm should be driven by the specific requirements of the problem at hand.
