[Machine Learning] 23. Anomaly Detection


Anomaly Detection

1. Imports
2. Anomaly Detection
- 2.1 Problem statement
- 2.2 Dataset
- 2.3 Gaussian distribution
  - 2.3.1 Estimating parameters for a Gaussian
    - Exercise 1
  - 2.3.2 Selecting the threshold
    - Exercise 2
- 2.4 Practice on a larger dataset
3. Quiz questions

Anomaly detection is effectively a form of unsupervised learning, because we do not know in advance what counts as an anomaly.

1. Imports

    import numpy as np
    import matplotlib.pyplot as plt
    from utils import *
    
    %matplotlib inline

Code of utils.py:

    import numpy as np
    import matplotlib.pyplot as plt

    def load_data():
        X = np.load("data/X_part1.npy")
        X_val = np.load("data/X_val_part1.npy")
        y_val = np.load("data/y_val_part1.npy")
        return X, X_val, y_val

    def load_data_multi():
        X = np.load("data/X_part2.npy")
        X_val = np.load("data/X_val_part2.npy")
        y_val = np.load("data/y_val_part2.npy")
        return X, X_val, y_val


    def multivariate_gaussian(X, mu, var):
        """
        Computes the probability 
        density function of the examples X under the multivariate gaussian 
        distribution with parameters mu and var. If var is a matrix, it is
        treated as the covariance matrix. If var is a vector, it is treated
        as the var values of the variances in each dimension (a diagonal
        covariance matrix).
        """

        k = len(mu)

        # A 1-D var holds per-feature variances: expand it to a diagonal covariance matrix
        if var.ndim == 1:
            var = np.diag(var)

        X = X - mu
        p = (2* np.pi)**(-k/2) * np.linalg.det(var)**(-0.5) * \
            np.exp(-0.5 * np.sum(np.matmul(X, np.linalg.pinv(var)) * X, axis=1))

        return p

    def visualize_fit(X, mu, var):
        """
        This visualization shows you the 
        probability density function of the Gaussian distribution. Each example
        has a location (x1, x2) that depends on its feature values.
        """

        X1, X2 = np.meshgrid(np.arange(0, 35.5, 0.5), np.arange(0, 35.5, 0.5))
        Z = multivariate_gaussian(np.stack([X1.ravel(), X2.ravel()], axis=1), mu, var)
        Z = Z.reshape(X1.shape)

        plt.plot(X[:, 0], X[:, 1], 'bx')

        # Only draw contours when the density is finite everywhere on the grid
        if np.sum(np.isinf(Z)) == 0:
            plt.contour(X1, X2, Z, levels=10**(np.arange(-20., 1, 3)), linewidths=1)

        # Set the title
        plt.title("The Gaussian contours of the distribution fit to the dataset")
        # Set the y-axis label
        plt.ylabel('Throughput (mb/s)')
        # Set the x-axis label
        plt.xlabel('Latency (ms)')

2. Anomaly Detection

2.1 Problem statement

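In this exercise, you will implement an anomaly detection algorithm to detect anomalous behavior in server computers.

The dataset contains two features:

- the throughput (mb/s) and
- the response latency (ms) of each server.

The examples were collected while the servers were operating.

You suspect that the vast majority of these examples are "normal" (non-anomalous) examples of servers operating normally, but the dataset may also contain some examples of servers behaving anomalously.

You will use a Gaussian model to detect the anomalous examples in your dataset.

- You will first start on a 2D dataset, which lets you visualize what the algorithm is doing.
- On that dataset you will fit a Gaussian distribution and then find the examples whose probability is very low; these can be considered anomalies.
- Finally, you will apply the anomaly detection algorithm to a larger dataset with many features.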

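2.2 Dataset

You will start by loading the dataset for this task.

The load_data() function shown below loads the data into the variables $X\_train$, $X\_val$ and $y\_val$.
You will use $X\_train$ to fit a Gaussian distribution.
You will use $X\_val$ and $y\_val$ as a cross-validation set to select a threshold and to determine which examples are anomalous and which are normal.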

    # Load the dataset
    X_train, X_val, y_val = load_data()

View the first five examples

    # Display the first five elements of X_train
    print("The first 5 elements of X_train are:\n", X_train[:5])
    
    # Display the first five elements of X_val
    print("The first 5 elements of X_val are\n", X_val[:5]) 
    
    # Display the first five elements of y_val
    print("The first 5 elements of y_val are\n", y_val[:5])

Check the shapes

    print ('The shape of X_train is:', X_train.shape)
    print ('The shape of X_val is:', X_val.shape)
    print ('The shape of y_val is: ', y_val.shape)
    
    The shape of X_train is: (307, 2)
    The shape of X_val is: (307, 2)
    The shape of y_val is:  (307,)

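Visualizing the data:

Because this dataset has only two features (throughput and latency), it can be plotted directly. You can use a scatter plot to visualize the data ($X\_train$).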

    # Create a scatter plot of the data. To change the markers to blue "x",
    # we used the 'marker' and 'c' parameters
    plt.scatter(X_train[:, 0], X_train[:, 1], marker='x', c='b') 
    
    # Set the title
    plt.title("The first dataset")
    # Set the y-axis label
    plt.ylabel('Throughput (mb/s)')
    # Set the x-axis label
    plt.xlabel('Latency (ms)')
    # Set axis range
    plt.axis([0, 30, 0, 30])
    plt.show()

2.3 Gaussian distribution

To perform anomaly detection, you first need to fit a model to the distribution of the data.

Given a training set $\{x^{(1)}, ..., x^{(m)}\}$, you want to estimate a Gaussian distribution for each of the features $x_i$.

Recall that the Gaussian distribution is given by

$p(x ; \mu,\sigma ^2) = \frac{1}{\sqrt{2 \pi \sigma ^2}}\exp\left( - \frac{(x - \mu)^2}{2 \sigma ^2} \right)$

where $\mu$ is the mean and $\sigma^2$ is the variance.

For each feature $i = 1 \ldots n$, you need to find the parameters $\mu_i$ and $\sigma_i^2$ that fit the data in the $i$-th dimension, $\{x_i^{(1)}, ..., x_i^{(m)}\}$ (the $i$-th feature of every example).
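The density formula above can be evaluated directly with NumPy. The following is a minimal sketch; the helper name `gaussian_pdf` and the sample values are illustrative assumptions and are not part of the lab's utils.py.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Density of a univariate Gaussian with mean mu and variance sigma2.

    Hypothetical helper for illustration only.
    """
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

# Evaluate a standard normal (mu = 0, sigma2 = 1) at three points.
x = np.array([-1.0, 0.0, 1.0])
density = gaussian_pdf(x, mu=0.0, sigma2=1.0)

# The density is highest at the mean and symmetric around it.
print(density)
```

Points far from the mean get a small density, which is what the thresholding step later relies on.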

2.3.1 Estimating parameters for a Gaussian

Your task is to complete the code in estimate_gaussian below.

Exercise 1

Complete the $estimate\_gaussian$ function below. It computes the mean (mu) and variance (var) of each feature in the input matrix $X$. The mean is the average value of a feature, and the variance measures how far the values spread around that average.

You can estimate the parameters ($\mu_i$, $\sigma_i^2$) of the $i$-th feature with the following equations. To estimate the mean, use:

$\mu_i = \frac{1}{m} \sum_{j=1}^m x_i^{(j)}$

and for the variance, use:

$\sigma_i^2 = \frac{1}{m} \sum_{j=1}^m (x_i^{(j)} - \mu_i)^2$

    # UNQ_C1
    # GRADED FUNCTION: estimate_gaussian
    
    def estimate_gaussian(X): 
        """
        Calculates mean and variance of all features 
        in the dataset

        Args:
            X (ndarray): (m, n) Data matrix

        Returns:
            mu (ndarray): (n,) Mean of all features
            var (ndarray): (n,) Variance of all features
        """

        m, n = X.shape

        ### START CODE HERE ### 
        mu = np.mean(X, axis=0)               # remember to specify the axis
        var = np.mean((X - mu)**2, axis=0)    # note: the square is taken before averaging
        ### END CODE HERE ### 

        return mu, var

Calling the function:

    # Estimate mean and variance of each feature
    mu, var = estimate_gaussian(X_train)              
    
    print("Mean of each feature:", mu)
    print("Variance of each feature:", var)
    
    Mean of each feature: [14.11222578 14.99771051]
    Variance of each feature: [1.83263141 1.70974533]
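As a quick sanity check on the estimator, the same formulas can be applied to a small hand-made matrix. This is only a sketch; `X_toy` is made-up data, not the lab dataset.

```python
import numpy as np

# Hypothetical toy data: 4 examples, 2 features.
X_toy = np.array([[1.0, 10.0],
                  [2.0, 12.0],
                  [3.0, 14.0],
                  [4.0, 16.0]])

# Per-feature mean and variance, using the same 1/m formulas as estimate_gaussian.
mu_toy = X_toy.mean(axis=0)
var_toy = ((X_toy - mu_toy) ** 2).mean(axis=0)   # divides by m, not by m - 1

# np.var uses ddof=0 by default, which is the same 1/m estimator.
matches_numpy = np.allclose(var_toy, np.var(X_toy, axis=0))

print(mu_toy, var_toy, matches_numpy)
```

Passing `ddof=1` to `np.var` would divide by m - 1 instead and give slightly larger variances, so the default is the one that matches the formula used here.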


    # Returns the density of the multivariate normal
    # at each data point (row) of X_train
    p = multivariate_gaussian(X_train, mu, var)
    
    #Plotting code 
    visualize_fit(X_train, mu, var)
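When var is a vector, the covariance matrix is diagonal, so the joint density is simply the product of the per-feature univariate densities. The sketch below checks this numerically; the helper names and the demo values are assumptions made for illustration.

```python
import numpy as np

def univariate_pdf(x, mu, sigma2):
    """Univariate Gaussian density (hypothetical helper)."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

def diagonal_gaussian_pdf(X, mu, var):
    """Joint density under a Gaussian with covariance diag(var) (hypothetical helper)."""
    k = len(mu)
    quad = np.sum((X - mu) ** 2 / var, axis=1)    # (x - mu)^T diag(var)^-1 (x - mu)
    return (2.0 * np.pi) ** (-k / 2) * np.prod(var) ** (-0.5) * np.exp(-0.5 * quad)

# Made-up values: one point at the mean, one far away from it.
X_demo = np.array([[14.0, 15.0], [20.0, 5.0]])
mu_demo = np.array([14.0, 15.0])
var_demo = np.array([2.0, 1.5])

joint = diagonal_gaussian_pdf(X_demo, mu_demo, var_demo)
product = np.prod(univariate_pdf(X_demo, mu_demo, var_demo), axis=1)

print(joint, product)
```

The point far from the mean receives a much smaller density, which is why it would be a candidate anomaly.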

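2.3.2 Selecting the threshold

Now that you have estimated the Gaussian parameters, you can investigate which examples have a very high probability under this distribution and which have a very low probability.

- Low-probability examples are more likely to be the anomalies in the dataset.
- One way to determine which examples are anomalies is to select a threshold based on a cross-validation set.

In this section, you will complete the code in select_threshold to choose the threshold $\varepsilon$ using the $F_1$ score on a cross-validation set.

For this, we will use a cross-validation set
$\{(x_{\rm cv}^{(1)}, y_{\rm cv}^{(1)}),\ldots, (x_{\rm cv}^{(m_{\rm cv})}, y_{\rm cv}^{(m_{\rm cv})})\}$, where the label $y=1$ corresponds to an anomalous example, and $y=0$ corresponds to a normal example.
For each cross-validation example, we will compute $p(x_{\rm cv}^{(i)})$. The vector of all these probabilities $p(x_{\rm cv}^{(1)}), \ldots, p(x_{\rm cv}^{(m_{\rm cv})})$ is passed to select_threshold in the vector $p\_val$.
The corresponding labels $y_{\rm cv}^{(1)}, \ldots, y_{\rm cv}^{(m_{\rm cv})}$ are passed to the same function in the vector $y\_val$.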

Exercise 2

Complete the select_threshold function below to find the best threshold for selecting outliers, based on the results from a validation set (p_val) and the ground truth (y_val).

The provided select_threshold code already contains a loop that tries many different values of $\varepsilon$ and selects the best one based on the $F_1$ score.

You need to implement the code that computes the $F_1$ score obtained by choosing epsilon as the threshold, and store the value in F1.

Recall that an example $x$ with a low probability, $p(x) < \varepsilon$, is classified as an anomaly.

You can compute the precision and recall with:

$\begin{aligned} prec &= \frac{tp}{tp + fp}, \\ rec &= \frac{tp}{tp + fn}, \end{aligned}$

where $prec$ is the precision (the fraction of examples flagged as anomalies that really are anomalies) and $rec$ is the recall (the fraction of actual anomalies that are flagged).

- $tp$ is the number of true positives: the ground truth label says the example is an anomaly and the algorithm correctly classifies it as an anomaly.
- $fp$ is the number of false positives: the ground truth label says the example is not an anomaly, but the algorithm incorrectly classifies it as one.
- $fn$ is the number of false negatives: the ground truth label says the example is an anomaly, but the algorithm fails to classify it as one.

The $F_1$ score is computed from precision and recall as follows:
$F_1 = \frac{2 \cdot prec \cdot rec}{prec + rec}$

Implementation note: to compute $tp$, $fp$ and $fn$, you can use a vectorized implementation rather than looping over all the examples.
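The vectorized counting mentioned in the implementation note can be sketched as follows for a single threshold. The labels, densities and threshold below are made-up values chosen so the counts are easy to verify by hand.

```python
import numpy as np

# Hypothetical validation labels (1 = anomaly) and densities.
y_demo = np.array([0, 0, 1, 0, 1, 0])
p_demo = np.array([0.30, 0.02, 0.01, 0.25, 0.20, 0.40])
epsilon_demo = 0.05

# Boolean mask: True where the example is flagged as an anomaly.
pred = p_demo < epsilon_demo

# Boolean masks combined with & and summed give the counts without a loop.
tp = np.sum(pred & (y_demo == 1))     # flagged and truly anomalous
fp = np.sum(pred & (y_demo == 0))     # flagged but actually normal
fn = np.sum(~pred & (y_demo == 1))    # missed anomalies

prec = tp / (tp + fp)
rec = tp / (tp + fn)
f1 = 2 * prec * rec / (prec + rec)

print(tp, fp, fn, prec, rec, f1)
```

Here one anomaly is caught, one normal example is flagged by mistake, and one anomaly is missed, so precision, recall and $F_1$ all equal 0.5.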

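Fill in the code:

The code chooses candidate values of $\varepsilon$ by splitting the range between the smallest and largest probability in p_val into 1000 steps and iterating over them.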

    # UNQ_C2
    # GRADED FUNCTION: select_threshold
    
    def select_threshold(y_val, p_val): 
        """
        Finds the best threshold to use for selecting outliers 
        based on the results from a validation set (p_val) 
        and the ground truth (y_val)

        Args:
            y_val (ndarray): Ground truth on validation set
            p_val (ndarray): Results on validation set

        Returns:
            epsilon (float): Threshold chosen 
            F1 (float):      F1 score by choosing epsilon as threshold
        """ 

        best_epsilon = 0
        best_F1 = 0
        F1 = 0

        step_size = (max(p_val) - min(p_val)) / 1000

        for epsilon in np.arange(min(p_val), max(p_val), step_size):

            ### START CODE HERE ### 
            predictions = # Your code here to calculate predictions for each example using epsilon as threshold

            tp = # Your code here to calculate number of true positives
            fp = # Your code here to calculate number of false positives
            fn = # Your code here to calculate number of false negatives

            prec = # Your code here to calculate precision
            rec = # Your code here to calculate recall

            F1 = # Your code here to calculate F1
            ### END CODE HERE ### 

            if F1 > best_F1:
                best_F1 = F1
                best_epsilon = epsilon

        return best_epsilon, best_F1

Answer:

        # This loop replaces the one in select_threshold (indented as part of the function body)
        for epsilon in np.arange(min(p_val), max(p_val), step_size):
            ### START CODE HERE ### 
            # Flag an example as an anomaly when its density is below epsilon
            predictions = (p_val < epsilon)

            tp = np.sum((predictions == 1) & (y_val == 1))  # true positives
            fp = np.sum((predictions == 1) & (y_val == 0))  # false positives
            fn = np.sum((predictions == 0) & (y_val == 1))  # false negatives

            # When tp is 0 these divisions yield nan (with a NumPy warning);
            # a nan F1 never beats best_F1, so such thresholds are skipped
            prec = tp / (tp + fp)
            rec = tp / (tp + fn)

            F1 = 2 * prec * rec / (prec + rec)
            ### END CODE HERE ### 

            if F1 > best_F1:
                best_F1 = F1
                best_epsilon = epsilon

        return best_epsilon, best_F1
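For readers who want to run the threshold search outside the notebook, here is a self-contained sketch of the same idea. The name `select_threshold_safe`, the `n_steps` parameter and the explicit `tp == 0` guard are assumptions added for illustration; they are not part of the graded solution above.

```python
import numpy as np

def select_threshold_safe(y_val, p_val, n_steps=1000):
    """Scan thresholds between min(p_val) and max(p_val) and keep the best F1.

    Hypothetical variant of select_threshold that skips thresholds with no
    true positives instead of dividing by zero.
    """
    best_epsilon, best_f1 = 0.0, 0.0
    step = (p_val.max() - p_val.min()) / n_steps

    for epsilon in np.arange(p_val.min(), p_val.max(), step):
        pred = p_val < epsilon
        tp = np.sum(pred & (y_val == 1))
        fp = np.sum(pred & (y_val == 0))
        fn = np.sum(~pred & (y_val == 1))

        if tp == 0:
            continue  # precision/recall are zero or undefined, F1 cannot improve

        prec = tp / (tp + fp)
        rec = tp / (tp + fn)
        f1 = 2 * prec * rec / (prec + rec)

        if f1 > best_f1:
            best_f1, best_epsilon = f1, epsilon

    return best_epsilon, best_f1

# Made-up validation set: the two anomalies have clearly lower densities.
y_demo = np.array([0, 0, 0, 0, 1, 1])
p_demo = np.array([0.40, 0.35, 0.30, 0.25, 0.02, 0.01])

eps, f1 = select_threshold_safe(y_demo, p_demo)
print(eps, f1)
```

On this toy data any threshold between 0.02 and 0.25 separates the two groups perfectly, and the function returns the first such value it encounters.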

Test code:

    p_val = multivariate_gaussian(X_val, mu, var)
    epsilon, F1 = select_threshold(y_val, p_val)
    
    print('Best epsilon found using cross-validation: %e' % epsilon)
    print('Best F1 on Cross Validation Set: %f' % F1)
    
    Best epsilon found using cross-validation: 8.990853e-05
    Best F1 on Cross Validation Set: 0.875000

Visualizing the anomalies

    # Find the outliers in the training set 
    outliers = p < epsilon
    
    # Visualize the fit
    visualize_fit(X_train, mu, var)
    
    # Draw a red circle around those outliers
    plt.plot(X_train[outliers, 0], X_train[outliers, 1], 'ro',
         markersize= 10,markerfacecolor='none', markeredgewidth=2)

2.4 Practice on a larger dataset

In this dataset, each example is described by 11 features, capturing many more properties of your compute servers.

The load_data_multi() function loads the data into the variables $X\_train\_high$, $X\_val\_high$ and $y\_val\_high$.
- The _high suffix distinguishes these variables from the ones used in the previous part.
- We will use $X\_train\_high$ to fit a Gaussian distribution.
- We will use $X\_val\_high$ and $y\_val\_high$ as a cross-validation set to select a threshold and to determine which examples are anomalous and which are normal.

Load the data

    # load the dataset
    X_train_high, X_val_high, y_val_high = load_data_multi()

Check the dimensions

    print ('The shape of X_train_high is:', X_train_high.shape)
    print ('The shape of X_val_high is:', X_val_high.shape)
    print ('The shape of y_val_high is: ', y_val_high.shape)

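Running anomaly detection

The code below will use your code to

- estimate the Gaussian parameters ($\mu_i$ and $\sigma_i^2$),
- evaluate the probabilities for the training data,
- evaluate the probabilities for the cross-validation set,
- and finally use select_threshold to find the best threshold $\varepsilon$.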

    # Apply the same steps to the larger dataset
    
    # Estimate the Gaussian parameters
    mu_high, var_high = estimate_gaussian(X_train_high)
    
    # Evaluate the probabilities for the training set
    p_high = multivariate_gaussian(X_train_high, mu_high, var_high)

    # Evaluate the probabilities for the cross validation set
    p_val_high = multivariate_gaussian(X_val_high, mu_high, var_high)
    
    # Find the best threshold
    epsilon_high, F1_high = select_threshold(y_val_high, p_val_high)
    
    print('Best epsilon found using cross-validation: %e'% epsilon_high)
    print('Best F1 on Cross Validation Set:  %f'% F1_high)
    print('# Anomalies found: %d'% sum(p_high < epsilon_high))
    
    
    Best epsilon found using cross-validation: 1.377229e-18
    Best F1 on Cross Validation Set:  0.615385
    # Anomalies found: 117

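3. Quiz questions

When to use supervised learning versus anomaly detection:

- You are building a system to detect failures of computers in a data center. You have 10,000 records from computers that are running normally and no records from computers that have failed.
- You are building a system to detect failures of computers in a data center. You have 10,000 records from computers that are running normally and 10,000 records from computers that have failed.

The two scenarios above are easy to tell apart, but what should you do when there are known anomalies, yet very few of them?

Put the data from the anomalous engines (together with some normal engines) in the cross-validation and/or test sets.

When the threshold is made smaller, fewer examples are classified as anomalies.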
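The effect of shrinking the threshold can be seen on a handful of made-up density values. This is only an illustrative sketch; `p_demo` and the three thresholds are assumptions, not values from the lab.

```python
import numpy as np

# Hypothetical densities for six examples.
p_demo = np.array([0.30, 0.02, 0.01, 0.25, 0.20, 0.004])

# Count how many examples fall below each successively smaller threshold.
thresholds = [0.05, 0.015, 0.005]
counts = [int(np.sum(p_demo < eps)) for eps in thresholds]

for eps, n_flagged in zip(thresholds, counts):
    print(f"epsilon={eps}: {n_flagged} flagged")
```

Each reduction of the threshold removes examples from the flagged set, so the number of detected anomalies can only stay the same or decrease.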

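You are monitoring the temperature and vibration intensity of newly manufactured aircraft engines. You have measured 100 engines and fit the Gaussian model described in the video lectures to the data. The 100 examples and the resulting distributions are shown in the figure. The most recent engine you are testing has a temperature reading of 17.5 degrees and a vibration intensity of 48 units, marked in magenta in the figure. What is the probability of an engine having these two measurements?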
