[Machine Learning] 23. Anomaly Detection
Anomaly Detection
- Introduction
- Anomaly Detection
  - 2.1 Problem statement
  - 2.2 Dataset
  - 2.3 Gaussian model
    - 2.3.1 Estimating parameters for a Gaussian
      - Exercise 1
    - 2.3.2 Selecting the threshold
      - Exercise 2
  - 2.4 High-dimensional dataset
- 3. Quiz
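Anomaly detection is effectively a form of unsupervised learning, because the training data carries no labels telling us which examples are anomalous.
1. Imports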
import numpy as np
import matplotlib.pyplot as plt
from utils import *
%matplotlib inline
Code in utils.py:
import numpy as np
import matplotlib.pyplot as plt

def load_data():
    X = np.load("data/X_part1.npy")
    X_val = np.load("data/X_val_part1.npy")
    y_val = np.load("data/y_val_part1.npy")
    return X, X_val, y_val

def load_data_multi():
    X = np.load("data/X_part2.npy")
    X_val = np.load("data/X_val_part2.npy")
    y_val = np.load("data/y_val_part2.npy")
    return X, X_val, y_val

def multivariate_gaussian(X, mu, var):
    """
    Computes the probability
    density function of the examples X under the multivariate gaussian
    distribution with parameters mu and var. If var is a matrix, it is
    treated as the covariance matrix. If var is a vector, it is treated
    as the variances in each dimension (a diagonal covariance matrix).
    """
    k = len(mu)
    if var.ndim == 1:
        var = np.diag(var)
    X = X - mu
    p = (2 * np.pi)**(-k/2) * np.linalg.det(var)**(-0.5) * \
        np.exp(-0.5 * np.sum(np.matmul(X, np.linalg.pinv(var)) * X, axis=1))
    return p

def visualize_fit(X, mu, var):
    """
    This visualization shows you the
    probability density function of the Gaussian distribution. Each example
    has a location (x1, x2) that depends on its feature values.
    """
    X1, X2 = np.meshgrid(np.arange(0, 35.5, 0.5), np.arange(0, 35.5, 0.5))
    Z = multivariate_gaussian(np.stack([X1.ravel(), X2.ravel()], axis=1), mu, var)
    Z = Z.reshape(X1.shape)
    plt.plot(X[:, 0], X[:, 1], 'bx')
    if np.sum(np.isinf(Z)) == 0:
        plt.contour(X1, X2, Z, levels=10**(np.arange(-20., 1, 3)), linewidths=1)
    # Set the title
    plt.title("The Gaussian contours of the distribution fit to the dataset")
    # Set the y-axis label
    plt.ylabel('Throughput (mb/s)')
    # Set the x-axis label
    plt.xlabel('Latency (ms)')
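2. Anomaly Detection
2.1 Problem Statement
In this exercise, you will implement an anomaly detection algorithm to detect anomalous behavior in server computers.
The dataset contains two features:
- throughput (mb/s) and
- response latency (ms) of each server.
While your servers were running, examples of their behavior were collected into an unlabeled dataset.
You suspect that the vast majority of these examples are "normal" (non-anomalous) examples of servers operating normally, but the dataset may also contain some examples of servers behaving anomalously.
You will use a Gaussian model to detect the anomalous examples in your dataset.
- You will start with a 2D dataset, which can be plotted so that you can see what the algorithm is doing.
- You will fit a Gaussian distribution to that dataset and then flag the examples whose probability is very low as anomalies.
- Finally, you will apply the anomaly detection algorithm to a larger dataset with many dimensions.
2.2 Dataset
You will start by loading the dataset for this task.
- The load_data() function (defined in utils.py) loads the data into the variables X_train, X_val and y_val.
- You will use X_train to fit a Gaussian distribution.
- You will use X_val and y_val as a cross-validation set to select a threshold and to separate anomalous from normal examples.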
# Load the dataset
X_train, X_val, y_val = load_data()
View the first five rows of each variable
# Display the first five elements of X_train
print("The first 5 elements of X_train are:\n", X_train[:5])
# Display the first five elements of X_val
print("The first 5 elements of X_val are\n", X_val[:5])
# Display the first five elements of y_val
print("The first 5 elements of y_val are\n", y_val[:5])
Check the shapes
print ('The shape of X_train is:', X_train.shape)
print ('The shape of X_val is:', X_val.shape)
print ('The shape of y_val is: ', y_val.shape)
The shape of X_train is: (307, 2)
The shape of X_val is: (307, 2)
The shape of y_val is: (307,)
Visualizing the data:
Because this dataset has only two features (throughput and latency), you can visualize X_train with a scatter plot.
# Create a scatter plot of the data. To change the markers to blue "x",
# we used the 'marker' and 'c' parameters
plt.scatter(X_train[:, 0], X_train[:, 1], marker='x', c='b')
# Set the title
plt.title("The first dataset")
# Set the y-axis label
plt.ylabel('Throughput (mb/s)')
# Set the x-axis label
plt.xlabel('Latency (ms)')
# Set axis range
plt.axis([0, 30, 0, 30])
plt.show()

2.3 Gaussian Distribution
To perform anomaly detection, you first need to fit a model to the distribution of the data.
Given a training set \{x^{(1)}, ..., x^{(m)}\}, you want to estimate a Gaussian distribution for each of the features x_i.
Recall that the Gaussian distribution is given by
p(x ; \mu,\sigma ^2) = \frac{1}{\sqrt{2 \pi \sigma ^2}}\exp\left( - \frac{(x - \mu)^2}{2 \sigma ^2} \right)
where \mu is the mean and \sigma^2 is the variance.
For each feature i = 1, \ldots, n, you need to find the parameters \mu_i and \sigma_i^2 that fit the data in the i-th dimension, \{x_i^{(1)}, ..., x_i^{(m)}\} (the i-th dimension of each example).
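As a minimal sketch of the model just described, the density of an example under independent per-feature Gaussians is the product of the one-dimensional densities. The helper name `gaussian_pdf` and all numbers below are illustrative assumptions, not part of the lab code.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """Univariate Gaussian density p(x; mu, sigma^2), applied element-wise."""
    return np.exp(-(x - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

# Made-up parameters for two features (illustration only).
mu_demo = np.array([14.0, 15.0])
var_demo = np.array([2.0, 1.5])

x_typical = np.array([14.0, 15.0])  # sits exactly at the mean
x_unusual = np.array([20.0, 9.0])   # far from the mean in both features

# Independent features: multiply the per-feature densities.
p_typical = np.prod(gaussian_pdf(x_typical, mu_demo, var_demo))
p_unusual = np.prod(gaussian_pdf(x_unusual, mu_demo, var_demo))

print(p_typical, p_unusual)
```

The point far from the mean receives a much smaller density, which is the signal the rest of the exercise thresholds on.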
2.3.1 Estimating parameters for a Gaussian
Your task is to complete the code in estimate_gaussian below.
Exercise 1
Complete the estimate_gaussian function below. It computes the mean (mu) and variance (var) of each feature in the input matrix X: the mean is the average value of a feature, and the variance measures how far its values spread around that average.
You can estimate the parameters (\mu_i, \sigma_i^2) of the i-th feature with the following equations. To estimate the mean, use:
\mu_i = \frac{1}{m} \sum_{j=1}^m x_i^{(j)}
and for the variance, use:
\sigma_i^2 = \frac{1}{m} \sum_{j=1}^m (x_i^{(j)} - \mu_i)^2
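The per-feature mean and the 1/m (population) variance can be checked by hand on a tiny matrix. The sketch below uses a made-up 4x2 array (`X_toy` is a hypothetical name) and also compares against `np.var`, whose default `ddof=0` uses the same 1/m normalization.

```python
import numpy as np

# Toy data matrix: m = 4 examples (rows), n = 2 features (columns).
X_toy = np.array([[1.0, 10.0],
                  [2.0, 12.0],
                  [3.0, 14.0],
                  [4.0, 16.0]])
m = X_toy.shape[0]

# Mean of each column: (1/m) * sum over examples.
mu_toy = X_toy.sum(axis=0) / m
# Variance of each column: (1/m) * sum of squared deviations from the mean.
var_toy = ((X_toy - mu_toy) ** 2).sum(axis=0) / m

print(mu_toy)   # [ 2.5 13. ]
print(var_toy)  # [1.25 5.  ]
```

Note that the sample variance with a 1/(m-1) factor (`ddof=1`) would give slightly larger values; the lab's formula uses 1/m.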
# UNQ_C1
# GRADED FUNCTION: estimate_gaussian
def estimate_gaussian(X):
    """
    Calculates mean and variance of all features
    in the dataset
    Args:
        X (ndarray): (m, n) Data matrix
    Returns:
        mu (ndarray): (n,) Mean of all features
        var (ndarray): (n,) Variance of all features
    """
    m, n = X.shape
    ### START CODE HERE ###
    mu = np.mean(X, axis=0)               # remember to specify the axis
    var = np.mean((X - mu)**2, axis=0)    # the square is taken before averaging
    ### END CODE HERE ###
    return mu, var
Calling the function:
# Estimate mean and variance of each feature
mu, var = estimate_gaussian(X_train)
print("Mean of each feature:", mu)
print("Variance of each feature:", var)
Mean of each feature: [14.11222578 14.99771051]
Variance of each feature: [1.83263141 1.70974533]
# Returns the density of the multivariate normal
# at each data point (row) of X_train
p = multivariate_gaussian(X_train, mu, var)
#Plotting code
visualize_fit(X_train, mu, var)
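2.3.2 Selecting the threshold
Now that you have estimated the Gaussian parameters, you can investigate which examples have a high probability under this distribution and which have a very low probability.
- The low-probability examples are more likely to be the anomalies in the dataset.
- A common way to decide which examples are anomalies is to select a threshold based on a cross-validation set.
In this section, you will complete the code in select_threshold to choose the threshold \varepsilon using the F_1 score on a cross-validation set.
- For this, we will use a cross-validation set \{(x_{\rm cv}^{(1)}, y_{\rm cv}^{(1)}),\ldots, (x_{\rm cv}^{(m_{\rm cv})}, y_{\rm cv}^{(m_{\rm cv})})\}, where the label y=1 corresponds to an anomalous example and y=0 corresponds to a normal example.
- For each cross-validation example, we will compute p(x_{\rm cv}^{(i)}). The vector of all these probabilities p(x_{\rm cv}^{(1)}), \ldots, p(x_{\rm cv}^{(m_{\rm cv})}) is passed to select_threshold in the vector p_val.
- The corresponding labels y_{\rm cv}^{(1)}, \ldots, y_{\rm cv}^{(m_{\rm cv})} are passed to the same function in the vector y_val.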
Exercise 2
Complete the select_threshold function below to find the best threshold for selecting outliers, based on the results from the validation set (p_val) and the ground truth (y_val).
The provided select_threshold code already contains a loop that tries many different values of \varepsilon and keeps the one with the best F_1 score.
You need to compute the F_1 score obtained when epsilon is used as the threshold and store it in F1.
Recall that an example x is classified as an anomaly if it has a low probability, that is, if p(x) < \varepsilon.
You can compute the precision and recall as follows:
\begin{aligned} prec &= \frac{tp}{tp + fp}, \\ rec &= \frac{tp}{tp + fn}, \end{aligned}
where prec is the precision (the fraction of examples flagged as anomalies that really are anomalies) and rec is the recall (the fraction of actual anomalies that are flagged).
- tp is the number of true positives: the ground-truth label says the example is an anomaly, and the algorithm correctly classifies it as an anomaly.
- fp is the number of false positives: the ground-truth label says the example is not an anomaly, but the algorithm incorrectly classifies it as one.
- fn is the number of false negatives: the ground-truth label says the example is an anomaly, but the algorithm incorrectly classifies it as normal.
The F_1 score is computed from the precision (prec) and recall (rec):
F_1 = \frac{2 \cdot prec \cdot rec}{prec + rec}
Implementation Note: To compute tp, fp and fn, you can use a vectorized implementation rather than looping over all the examples.
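A vectorized count of tp, fp and fn might look like the following sketch. The arrays and the threshold are made-up values for illustration (`p_val_toy`, `y_val_toy` and `epsilon_toy` are hypothetical names), so the numbers say nothing about the lab's dataset.

```python
import numpy as np

# Hypothetical validation densities and labels (1 = anomaly, 0 = normal).
p_val_toy = np.array([0.30, 0.02, 0.25, 0.01, 0.40, 0.03])
y_val_toy = np.array([0, 1, 0, 1, 0, 0])
epsilon_toy = 0.05

# Boolean mask: True where an example is flagged as an anomaly.
predictions = p_val_toy < epsilon_toy

# Element-wise logical AND, then count the True entries.
tp = np.sum(predictions & (y_val_toy == 1))
fp = np.sum(predictions & (y_val_toy == 0))
fn = np.sum(~predictions & (y_val_toy == 1))

prec = tp / (tp + fp)
rec = tp / (tp + fn)
F1 = 2 * prec * rec / (prec + rec)
print(tp, fp, fn, F1)
```

Here three examples fall below the threshold, two of them true anomalies, so tp = 2, fp = 1, fn = 0 and F1 = 0.8.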
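Code to complete:
The code divides the range between the minimum and maximum of p_val into 1000 steps and tries each resulting value of \varepsilon in turn.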
# UNQ_C2
# GRADED FUNCTION: select_threshold
def select_threshold(y_val, p_val):
    """
    Finds the best threshold to use for selecting outliers
    based on the results from a validation set (p_val)
    and the ground truth (y_val)
    Args:
        y_val (ndarray): Ground truth on validation set
        p_val (ndarray): Results on validation set
    Returns:
        epsilon (float): Threshold chosen
        F1 (float): F1 score by choosing epsilon as threshold
    """
    best_epsilon = 0
    best_F1 = 0
    F1 = 0
    step_size = (max(p_val) - min(p_val)) / 1000
    for epsilon in np.arange(min(p_val), max(p_val), step_size):
        ### START CODE HERE ###
        predictions = # Your code here to calculate predictions for each example using epsilon as threshold
        tp = # Your code here to calculate number of true positives
        fp = # Your code here to calculate number of false positives
        fn = # Your code here to calculate number of false negatives
        prec = # Your code here to calculate precision
        rec = # Your code here to calculate recall
        F1 = # Your code here to calculate F1
        ### END CODE HERE ###
        if F1 > best_F1:
            best_F1 = F1
            best_epsilon = epsilon
    return best_epsilon, best_F1
Answer (the loop inside select_threshold):
    for epsilon in np.arange(min(p_val), max(p_val), step_size):
        ### START CODE HERE ###
        predictions = (p_val < epsilon)
        tp = np.sum((predictions == 1) & (y_val == 1))  # true positives
        fp = np.sum((predictions == 1) & (y_val == 0))  # false positives
        fn = np.sum((predictions == 0) & (y_val == 1))  # false negatives
        # If tp == 0, a 0/0 division below yields nan (NumPy emits a
        # RuntimeWarning); nan > best_F1 is False, so that epsilon is skipped.
        prec = tp / (tp + fp)
        rec = tp / (tp + fn)
        F1 = 2 * prec * rec / (prec + rec)
        ### END CODE HERE ###
        if F1 > best_F1:
            best_F1 = F1
            best_epsilon = epsilon
    return best_epsilon, best_F1
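The loop above divides by zero whenever no true positive is found, which NumPy reports as a RuntimeWarning. One possible way to avoid the warning, sketched here as an assumption rather than as the lab's reference solution, is to skip those thresholds explicitly. The name `select_threshold_safe`, the `n_steps` parameter and the toy arrays are hypothetical.

```python
import numpy as np

def select_threshold_safe(y_val, p_val, n_steps=1000):
    """Hypothetical variant of select_threshold that skips undefined F1 values."""
    best_epsilon, best_F1 = 0.0, 0.0
    p_min, p_max = np.min(p_val), np.max(p_val)
    if p_max == p_min:
        # Degenerate case: every density is identical, nothing to sweep.
        return best_epsilon, best_F1
    step_size = (p_max - p_min) / n_steps
    for epsilon in np.arange(p_min, p_max, step_size):
        predictions = p_val < epsilon
        tp = np.sum(predictions & (y_val == 1))
        fp = np.sum(predictions & (y_val == 0))
        fn = np.sum(~predictions & (y_val == 1))
        if tp == 0:
            # Without true positives F1 is 0 or undefined; it cannot improve.
            continue
        prec = tp / (tp + fp)
        rec = tp / (tp + fn)
        F1 = 2 * prec * rec / (prec + rec)
        if F1 > best_F1:
            best_F1, best_epsilon = F1, epsilon
    return best_epsilon, best_F1

# Made-up validation data: the two anomalies have the two lowest densities.
p_val_toy = np.array([0.30, 0.02, 0.25, 0.01, 0.40, 0.03])
y_val_toy = np.array([0, 1, 0, 1, 0, 0])
eps_toy, f1_toy = select_threshold_safe(y_val_toy, p_val_toy)
print(eps_toy, f1_toy)
```

Because a threshold with tp == 0 can never beat the current best score, skipping it selects the same epsilon as the original loop.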
Test code:
p_val = multivariate_gaussian(X_val, mu, var)
epsilon, F1 = select_threshold(y_val, p_val)
print('Best epsilon found using cross-validation: %e' % epsilon)
print('Best F1 on Cross Validation Set: %f' % F1)
Best epsilon found using cross-validation: 8.990853e-05
Best F1 on Cross Validation Set: 0.875000
Visualizing the anomalies
# Find the outliers in the training set
outliers = p < epsilon
# Visualize the fit
visualize_fit(X_train, mu, var)
# Draw a red circle around those outliers
plt.plot(X_train[outliers, 0], X_train[outliers, 1], 'ro',
         markersize=10, markerfacecolor='none', markeredgewidth=2)

2.4 High-dimensional dataset
In this dataset, each example is described by 11 features that capture many more properties of your compute servers.
- The load_data_multi() function called below loads the data into the variables X_train_high, X_val_high and y_val_high.
- The _high suffix distinguishes these variables from the ones used in the previous part.
- We will use X_train_high to fit a Gaussian distribution.
- We will use X_val_high and y_val_high as a cross-validation set to select a threshold and to separate anomalous from normal examples.
Load the data
# load the dataset
X_train_high, X_val_high, y_val_high = load_data_multi()
Check the dimensions of the data
print ('The shape of X_train_high is:', X_train_high.shape)
print ('The shape of X_val_high is:', X_val_high.shape)
print ('The shape of y_val_high is: ', y_val_high.shape)
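Running anomaly detection
The code below will use your code to
- estimate the Gaussian parameters (\mu_i, \sigma_i^2),
- evaluate the probabilities for the training data from which the parameters were estimated,
- evaluate the probabilities for the cross-validation set, and
- finally use select_threshold to find the best threshold \varepsilon.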
# Apply the same steps to the larger dataset
# Estimate the Gaussian parameters
mu_high, var_high = estimate_gaussian(X_train_high)
# Evaluate the probabilities for the training set
p_high = multivariate_gaussian(X_train_high, mu_high, var_high)
# Evaluate the probabilities for the cross validation set
p_val_high = multivariate_gaussian(X_val_high, mu_high, var_high)
# Find the best threshold
epsilon_high, F1_high = select_threshold(y_val_high, p_val_high)
print('Best epsilon found using cross-validation: %e'% epsilon_high)
print('Best F1 on Cross Validation Set: %f'% F1_high)
print('# Anomalies found: %d'% sum(p_high < epsilon_high))
Best epsilon found using cross-validation: 1.377229e-18
Best F1 on Cross Validation Set: 0.615385
# Anomalies found: 117
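With 11 features the densities are already tiny (the threshold above is on the order of 1e-18), and with many more features a product of small factors can underflow to zero. A common workaround, which is an assumption here and not part of the lab, is to compare log-densities against log(epsilon) instead. The helper name `log_gaussian_diag` and the synthetic data are hypothetical.

```python
import numpy as np

def log_gaussian_diag(X, mu, var):
    """Log-density of each row of X under independent per-feature Gaussians."""
    X = np.asarray(X, dtype=float)
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (X - mu) ** 2 / var, axis=1)

# Synthetic stand-in data (not the lab's dataset): 5 examples, 11 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=0.0, scale=3.0, size=(5, 11))
mu_demo = X_demo.mean(axis=0)
var_demo = X_demo.var(axis=0)

log_p = log_gaussian_diag(X_demo, mu_demo, var_demo)

# Direct product of per-feature densities, for comparison.
direct_p = np.prod(
    np.exp(-(X_demo - mu_demo) ** 2 / (2.0 * var_demo))
    / np.sqrt(2.0 * np.pi * var_demo),
    axis=1,
)

# Thresholding p < epsilon is equivalent to log_p < log(epsilon).
epsilon_demo = 1e-18
flags_log = log_p < np.log(epsilon_demo)
flags_lin = direct_p < epsilon_demo
```

Since the logarithm is monotonic, both comparisons flag the same examples, but the log form stays finite where the raw density would round to zero.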
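3. Quiz
- When to use supervised learning versus anomaly detection:
  - You are building a system to detect failures of computers in a data center. You have 10,000 data records from computers that are running normally and no data from computers that have failed.
  - You are building a system to detect failures of computers in a data center. You have 10,000 data records from computers running normally and 10,000 data records from computers that have failed.
- The scenarios above are easy to tell apart, but what should you do when known anomalies exist but are very rare?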

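Put the data from the anomalous engines (together with some normal engines) in the cross-validation and/or test sets.
If the threshold \varepsilon is made smaller, fewer examples are classified as anomalies.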
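The effect of lowering the threshold can be illustrated on a handful of made-up densities (the array `p_demo` and both thresholds are hypothetical values chosen for this sketch):

```python
import numpy as np

# Hypothetical densities for six examples.
p_demo = np.array([0.30, 0.02, 0.25, 0.01, 0.40, 0.03])

# Count the examples flagged under a larger and a smaller threshold.
n_high = int(np.sum(p_demo < 0.05))
n_low = int(np.sum(p_demo < 0.015))

print(n_high, n_low)  # 3 1
```

Every example flagged under the smaller threshold is also flagged under the larger one, so the count can only shrink as the threshold decreases.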

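You are monitoring the temperature and vibration intensity of newly manufactured aircraft engines. You have measured 100 engines and fit the Gaussian model described in the video lectures to the data. The 100 examples and the resulting distributions are shown in the figure. The most recently tested engine has a temperature reading of 17.5 degrees and a vibration intensity of 48 units, marked in magenta in the figure. What is the probability of an engine having these two measurements?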

