
[Machine Learning] 23. Anomaly Detection


Anomaly Detection

    1. Imports
    2. Anomaly Detection
    • 2.1 Problem Statement

    • 2.2 Dataset

    • 2.3 Gaussian Distribution

      • 2.3.1 Estimating parameters for a Gaussian
        • Exercise 1
      • 2.3.2 Selecting the threshold
        • Exercise 2
    • 2.4 Practice on a larger dataset

    3. Quiz

Anomaly detection is in effect a form of unsupervised learning, because we do not know in advance what an anomaly looks like.

1. Imports

    import numpy as np
    import matplotlib.pyplot as plt
    from utils import *
    
    %matplotlib inline

Code in utils.py:

    import numpy as np
    import matplotlib.pyplot as plt

    def load_data():
        X = np.load("data/X_part1.npy")
        X_val = np.load("data/X_val_part1.npy")
        y_val = np.load("data/y_val_part1.npy")
        return X, X_val, y_val

    def load_data_multi():
        X = np.load("data/X_part2.npy")
        X_val = np.load("data/X_val_part2.npy")
        y_val = np.load("data/y_val_part2.npy")
        return X, X_val, y_val


    def multivariate_gaussian(X, mu, var):
        """
        Computes the probability 
        density function of the examples X under the multivariate gaussian 
        distribution with parameters mu and var. If var is a matrix, it is
        treated as the covariance matrix. If var is a vector, it is treated
        as the variances in each dimension (a diagonal
        covariance matrix).
        """

        k = len(mu)

        # A 1-D var holds per-feature variances: expand it to a diagonal covariance matrix
        if var.ndim == 1:
            var = np.diag(var)

        X = X - mu
        p = (2 * np.pi)**(-k/2) * np.linalg.det(var)**(-0.5) * \
            np.exp(-0.5 * np.sum(np.matmul(X, np.linalg.pinv(var)) * X, axis=1))

        return p

    def visualize_fit(X, mu, var):
        """
        This visualization shows you the 
        probability density function of the Gaussian distribution. Each example
        has a location (x1, x2) that depends on its feature values.
        """

        # Evaluate the density on a regular grid covering the plotted area
        X1, X2 = np.meshgrid(np.arange(0, 35.5, 0.5), np.arange(0, 35.5, 0.5))
        Z = multivariate_gaussian(np.stack([X1.ravel(), X2.ravel()], axis=1), mu, var)
        Z = Z.reshape(X1.shape)

        plt.plot(X[:, 0], X[:, 1], 'bx')

        # Draw contours only if every density value is finite
        if np.sum(np.isinf(Z)) == 0:
            plt.contour(X1, X2, Z, levels=10**(np.arange(-20., 1, 3)), linewidths=1)

        # Set the title
        plt.title("The Gaussian contours of the distribution fit to the dataset")
        # Set the y-axis label
        plt.ylabel('Throughput (mb/s)')
        # Set the x-axis label
        plt.xlabel('Latency (ms)')

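2. Anomaly Detection

2.1 Problem Statement

In this exercise, you will implement an anomaly detection algorithm to detect anomalous behavior in server computers.

The dataset contains two features:

  • throughput (mb/s) and
  • response latency (ms) of each server.

The examples were collected while your servers were running.

You suspect that the vast majority of these examples are "normal" (non-anomalous) examples of servers operating normally, but the dataset may also contain some examples of servers behaving anomalously.

You will use a Gaussian model to detect the anomalous examples in your dataset.

  • You will first start on a 2D dataset, which lets you visualize what the algorithm is doing.
  • After fitting a Gaussian distribution to this dataset, you will identify the examples with very low probability, which can be considered anomalies.
  • Finally, you will apply the anomaly detection algorithm to a larger dataset with many dimensions.

2.2 Dataset

You will start by loading the dataset for this task.

  • The load_data() function shown below loads the data into the variables X_train, X_val and y_val.
  • You will use X_train to fit a Gaussian distribution.
  • You will use X_val and y_val as a cross-validation set to select a threshold and to distinguish anomalous from normal examples.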
    # Load the dataset
    X_train, X_val, y_val = load_data()

View the first five entries

    # Display the first five elements of X_train
    print("The first 5 elements of X_train are:\n", X_train[:5])
    
    # Display the first five elements of X_val
    print("The first 5 elements of X_val are\n", X_val[:5]) 
    
    # Display the first five elements of y_val
    print("The first 5 elements of y_val are\n", y_val[:5])

Check the shapes

    print ('The shape of X_train is:', X_train.shape)
    print ('The shape of X_val is:', X_val.shape)
    print ('The shape of y_val is: ', y_val.shape)
    
    The shape of X_train is: (307, 2)
    The shape of X_val is: (307, 2)
    The shape of y_val is:  (307,)

Visualizing the data:

Because this dataset has only two features, throughput and latency, you can visualize X_train with a scatter plot.

    # Create a scatter plot of the data. To change the markers to blue "x",
    # we used the 'marker' and 'c' parameters
    plt.scatter(X_train[:, 0], X_train[:, 1], marker='x', c='b') 
    
    # Set the title
    plt.title("The first dataset")
    # Set the y-axis label
    plt.ylabel('Throughput (mb/s)')
    # Set the x-axis label
    plt.xlabel('Latency (ms)')
    # Set axis range
    plt.axis([0, 30, 0, 30])
    plt.show()
[Figure: scatter plot "The first dataset", Latency (ms) on the x-axis and Throughput (mb/s) on the y-axis]

2.3 Gaussian Distribution

To perform anomaly detection, you first need to fit a model to the distribution of the data.

Given a training set \{x^{(1)}, ..., x^{(m)}\}, you want to estimate a Gaussian distribution for each of the features x_i.

Recall that the Gaussian distribution is given by

p(x ; \mu,\sigma ^2) = \frac{1}{\sqrt{2 \pi \sigma ^2}}\exp\left( - \frac{(x - \mu)^2}{2 \sigma ^2} \right)

where \mu is the mean and \sigma^2 is the variance.

For each feature i = 1, ..., n, you need to find the parameters \mu_i and \sigma_i^2 that fit the data in the i-th dimension \{x_i^{(1)}, ..., x_i^{(m)}\} (the i-th dimension of each example).
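The density formula above can be sketched directly in NumPy. This is only an illustration: the helper names gaussian_pdf and independent_gaussian_density and the demo values are assumptions, not part of the exercise. Multiplying the per-feature densities gives the same result as a multivariate Gaussian with a diagonal covariance matrix, which is the case utils.py handles when var is passed as a vector.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """Univariate Gaussian density p(x; mu, sigma^2), with var = sigma^2."""
    return np.exp(-(x - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def independent_gaussian_density(X, mu, var):
    """Density of each row of X as the product of its per-feature densities."""
    return np.prod(gaussian_pdf(X, mu, var), axis=1)

# Illustrative values only (not taken from the exercise data)
X_demo = np.array([[14.0, 15.0],    # a point exactly at the mean
                   [20.0, 10.0]])   # a point far from the mean
mu_demo = np.array([14.0, 15.0])
var_demo = np.array([2.0, 1.5])

p_demo = independent_gaussian_density(X_demo, mu_demo, var_demo)
print(p_demo)  # the point at the mean has a much higher density than the distant one
```

A point with a very small density under the fitted model is a candidate anomaly; the rest of the exercise is about choosing how small "very small" should be.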

2.3.1 Estimating parameters for a Gaussian

Your task is to complete the code in estimate_gaussian below.

Exercise 1

Complete the estimate_gaussian function below. It computes the mean (mu) and the variance (var) of each feature (column) of the input matrix X. The mean is the average value of a feature, and the variance measures how far the values spread around that mean.

You can estimate the parameters (\mu_i, \sigma_i^2) of the i-th feature with the following equations. To estimate the mean, use:

\mu_i = \frac{1}{m} \sum_{j=1}^m x_i^{(j)}

and for the variance, use:

\sigma_i^2 = \frac{1}{m} \sum_{j=1}^m (x_i^{(j)} - \mu_i)^2

    # UNQ_C1
    # GRADED FUNCTION: estimate_gaussian

    def estimate_gaussian(X): 
        """
        Calculates mean and variance of all features 
        in the dataset

        Args:
            X (ndarray): (m, n) Data matrix

        Returns:
            mu (ndarray): (n,) Mean of all features
            var (ndarray): (n,) Variance of all features
        """

        m, n = X.shape

        ### START CODE HERE ### 
        mu = np.mean(X, axis=0)               # average over examples (rows); don't forget the axis
        var = np.mean((X - mu) ** 2, axis=0)  # the square goes inside the mean; divides by m
        ### END CODE HERE ### 

        return mu, var

Calling the function:

    # Estimate mean and variance of each feature
    mu, var = estimate_gaussian(X_train)              
    
    print("Mean of each feature:", mu)
    print("Variance of each feature:", var)
    
    Mean of each feature: [14.11222578 14.99771051]
    Variance of each feature: [1.83263141 1.70974533]
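The per-feature estimates can be sanity-checked on a tiny array where the arithmetic is easy to do by hand. The matrix X_small below is made up for illustration. The check also shows that the formula used here divides by m, which matches NumPy's default np.var (ddof=0) rather than the unbiased m - 1 estimator.

```python
import numpy as np

# Tiny made-up matrix: 4 examples (rows), 2 features (columns)
X_small = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [3.0, 30.0],
                    [4.0, 40.0]])

mu_small = X_small.mean(axis=0)                       # means: 2.5 and 25.0
var_small = ((X_small - mu_small) ** 2).mean(axis=0)  # variances: 1.25 and 125.0

print("mean:", mu_small)
print("variance (1/m):", var_small)

# np.var uses the same 1/m normalisation by default (ddof=0)
assert np.allclose(var_small, np.var(X_small, axis=0))

# The unbiased estimator (ddof=1) divides by m - 1 and is therefore larger
print("variance (1/(m-1)):", np.var(X_small, axis=0, ddof=1))
```

For a training set of a few hundred examples the two normalisations differ only slightly, so either would give a very similar fit.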
    # Returns the density of the multivariate normal
    # at each data point (row) of X_train
    p = multivariate_gaussian(X_train, mu, var)
    
    #Plotting code 
    visualize_fit(X_train, mu, var)

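2.3.2 Selecting the threshold

Now that you have estimated the Gaussian parameters, you can examine which examples have a high probability under this distribution and which have a low probability.

  • Low-probability examples are more likely to be the anomalies in our dataset.
  • One common way to decide which examples are anomalies is to select a threshold using a cross-validation set.

In this section, you will complete the code in select_threshold to choose the threshold \varepsilon using the F_1 score on a cross-validation set.

  • For this, we will use a cross-validation set
    \{(x_{\rm cv}^{(1)}, y_{\rm cv}^{(1)}),\ldots, (x_{\rm cv}^{(m_{\rm cv})}, y_{\rm cv}^{(m_{\rm cv})})\}, where the label y=1 corresponds to an anomalous example, and y=0 corresponds to a normal example.

  • For each cross-validation example, we compute p(x_{\rm cv}^{(i)}). The vector of all these probabilities p(x_{\rm cv}^{(1)}), \ldots, p(x_{\rm cv}^{(m_{\rm cv})}) is passed to select_threshold as the vector p_val.

  • The corresponding labels y_{\rm cv}^{(1)}, \ldots, y_{\rm cv}^{(m_{\rm cv})} are passed to the same function as the vector y_val.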

Exercise 2

Complete the select_threshold function below to find the best threshold for selecting outliers, based on the results from the validation set (p_val) and the ground truth (y_val).

The provided select_threshold code already contains a loop that tries many different values of \varepsilon and keeps the best one according to the F_1 score.

You need to compute the F_1 score obtained by choosing epsilon as the threshold and store the value in F1.

Recall that an example x is classified as an anomaly when it has a low probability, that is, when p(x) < \varepsilon.

You can compute precision and recall with the following formulas:

\begin{aligned} prec &= \frac{tp}{tp + fp}, \\ rec &= \frac{tp}{tp + fn}, \end{aligned}

where prec is the precision (the fraction of examples flagged as anomalies that really are anomalies) and rec is the recall (the fraction of actual anomalies that are flagged).

  • tp is the number of true positives: the ground truth label says the example is an anomaly and our algorithm correctly classifies it as an anomaly.
  • fp is the number of false positives: the ground truth label says the example is not an anomaly, but our algorithm incorrectly classifies it as one.
  • fn is the number of false negatives: the ground truth label says the example is an anomaly, but our algorithm fails to classify it as one.

The F_1 score is computed from precision (prec) and recall (rec) as follows:

F_1 = \frac{2 \cdot prec \cdot rec}{prec + rec}

Implementation Note: For computing tp, fp, and fn, you can utilize a vectorized approach instead of iterating through each example.
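As a worked example of the vectorized counting mentioned in the note, the sketch below computes tp, fp, fn, precision, recall and F_1 for a single threshold. The arrays p_val_demo and y_val_demo and the value epsilon_demo are made up for illustration and are not the exercise data.

```python
import numpy as np

# Made-up validation densities and labels, for illustration only
p_val_demo = np.array([0.30, 0.02, 0.25, 0.01, 0.40, 0.03])
y_val_demo = np.array([0, 1, 0, 1, 0, 0])
epsilon_demo = 0.05

predictions = p_val_demo < epsilon_demo          # True means "flagged as anomaly"

tp = np.sum(predictions & (y_val_demo == 1))     # flagged and truly anomalous
fp = np.sum(predictions & (y_val_demo == 0))     # flagged but actually normal
fn = np.sum(~predictions & (y_val_demo == 1))    # anomalies that were missed

prec = tp / (tp + fp)                            # 2 / 3
rec = tp / (tp + fn)                             # 2 / 2
F1_demo = 2 * prec * rec / (prec + rec)          # 0.8

print(tp, fp, fn, prec, rec, F1_demo)
```

Here three examples fall below the threshold, two of which are labelled anomalous, so precision is 2/3 while recall is 1. If no example is flagged at all, tp + fp is zero and the precision is undefined, which is worth guarding against when sweeping over thresholds.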

Fill in the code:

The code sweeps \varepsilon over the range between the smallest and the largest probability in p_val, divided into 1000 equal steps.

    # UNQ_C2
    # GRADED FUNCTION: select_threshold

    def select_threshold(y_val, p_val): 
        """
        Finds the best threshold to use for selecting outliers 
        based on the results from a validation set (p_val) 
        and the ground truth (y_val)

        Args:
            y_val (ndarray): Ground truth on validation set
            p_val (ndarray): Results on validation set

        Returns:
            epsilon (float): Threshold chosen 
            F1 (float):      F1 score by choosing epsilon as threshold
        """ 

        best_epsilon = 0
        best_F1 = 0
        F1 = 0

        step_size = (max(p_val) - min(p_val)) / 1000

        for epsilon in np.arange(min(p_val), max(p_val), step_size):

            ### START CODE HERE ### 
            # Replace each "..." with your own code and uncomment the line.
            # predictions = ...  # predictions for each example using epsilon as threshold
            #
            # tp = ...           # number of true positives
            # fp = ...           # number of false positives
            # fn = ...           # number of false negatives
            #
            # prec = ...         # precision
            # rec = ...          # recall
            #
            # F1 = ...           # F1 score
            ### END CODE HERE ### 

            if F1 > best_F1:
                best_F1 = F1
                best_epsilon = epsilon

        return best_epsilon, best_F1

Answer:

        # Loop of select_threshold with the START/END CODE section filled in
        for epsilon in np.arange(min(p_val), max(p_val), step_size):
            ### START CODE HERE ### 
            # Flag an example as an anomaly when its density is below epsilon
            predictions = (p_val < epsilon)

            tp = np.sum((predictions == 1) & (y_val == 1))  # true positives
            fp = np.sum((predictions == 1) & (y_val == 0))  # false positives
            fn = np.sum((predictions == 0) & (y_val == 1))  # false negatives

            # Without true positives, precision and recall are 0 or 0/0, so F1
            # can never beat the current best; skip to avoid dividing by zero
            if tp == 0:
                continue

            prec = tp / (tp + fp)
            rec = tp / (tp + fn)

            F1 = 2 * prec * rec / (prec + rec)
            ### END CODE HERE ### 

            if F1 > best_F1:
                best_F1 = F1
                best_epsilon = epsilon

        return best_epsilon, best_F1
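The same sweep can also be written without a Python loop by comparing every candidate threshold against every validation density at once. The sketch below is an alternative written for illustration, not the course's solution; the name select_threshold_vectorized and the n_steps parameter are assumptions. It relies on the identity F_1 = 2tp / (2tp + fp + fn), which follows from substituting the precision and recall formulas into the F_1 definition.

```python
import numpy as np

def select_threshold_vectorized(y_val, p_val, n_steps=1000):
    """Hypothetical variant of select_threshold that scores all thresholds at once."""
    step_size = (p_val.max() - p_val.min()) / n_steps
    epsilons = np.arange(p_val.min(), p_val.max(), step_size)

    # preds[k, i] is True when example i is flagged under epsilons[k]
    preds = p_val[None, :] < epsilons[:, None]
    is_anomaly = (y_val == 1)[None, :]

    tp = np.sum(preds & is_anomaly, axis=1)
    fp = np.sum(preds & ~is_anomaly, axis=1)
    fn = np.sum(~preds & is_anomaly, axis=1)

    # F1 = 2*tp / (2*tp + fp + fn); treated as 0 when there are no true positives
    denom = np.maximum(2 * tp + fp + fn, 1)   # avoid dividing by zero
    f1 = np.where(tp > 0, 2 * tp / denom, 0.0)

    best = int(np.argmax(f1))                 # first maximum, like the loop's strict ">"
    if f1[best] == 0:
        return 0, 0                           # mirrors the loop's initial values
    return epsilons[best], f1[best]

# Made-up example: the two smallest densities are the labelled anomalies
p_demo = np.array([0.5, 0.4, 0.3, 0.001, 0.002, 0.45])
y_demo = np.array([0, 0, 0, 1, 1, 0])
eps_demo, f1_demo = select_threshold_vectorized(y_demo, p_demo)
print(eps_demo, f1_demo)
```

The intermediate boolean matrix has one row per threshold and one column per validation example, so this trades memory for speed. Because the F_1 value is computed through a different but algebraically equal expression, the chosen threshold could differ from the loop version when two thresholds tie up to floating-point rounding.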

Test code:

    p_val = multivariate_gaussian(X_val, mu, var)
    epsilon, F1 = select_threshold(y_val, p_val)
    
    print('Best epsilon found using cross-validation: %e' % epsilon)
    print('Best F1 on Cross Validation Set: %f' % F1)
    
    Best epsilon found using cross-validation: 8.990853e-05
    Best F1 on Cross Validation Set: 0.875000

Visualizing the anomalies

    # Find the outliers in the training set 
    outliers = p < epsilon
    
    # Visualize the fit
    visualize_fit(X_train, mu, var)
    
    # Draw a red circle around those outliers
    plt.plot(X_train[outliers, 0], X_train[outliers, 1], 'ro',
         markersize= 10,markerfacecolor='none', markeredgewidth=2)
[Figure: Gaussian contours of the fitted distribution, with the detected outliers circled in red]

2.4 Practice on a larger dataset

In this dataset, each example is described by 11 features, capturing many more properties of your compute servers.

  • The load_data_multi() function, called below, loads the data into the variables X_train_high, X_val_high and y_val_high.

  • The _high suffix distinguishes these variables from the ones used in the previous part.

  • We will use X_train_high to fit the Gaussian distribution.

  • We will use X_val_high and y_val_high as a cross-validation set to select a threshold and to determine which examples are anomalous and which are normal.

Load the data

    # load the dataset
    X_train_high, X_val_high, y_val_high = load_data_multi()

Check the dimensions

    print ('The shape of X_train_high is:', X_train_high.shape)
    print ('The shape of X_val_high is:', X_val_high.shape)
    print ('The shape of y_val_high is: ', y_val_high.shape)

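Running anomaly detection

The code below will use your code to

  • estimate the Gaussian parameters (\mu_i, \sigma_i^2),
  • evaluate the probabilities for the training data from which the parameters were estimated,
  • evaluate the probabilities for the cross-validation set,
  • and finally use select_threshold to find the best threshold \varepsilon.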
    # Apply the same steps to the larger dataset
    
    # Estimate the Gaussian parameters
    mu_high, var_high = estimate_gaussian(X_train_high)
    
    # Evaluate the probabilities for the training set
    p_high = multivariate_gaussian(X_train_high, mu_high, var_high)

    # Evaluate the probabilities for the cross validation set
    p_val_high = multivariate_gaussian(X_val_high, mu_high, var_high)
    
    # Find the best threshold
    epsilon_high, F1_high = select_threshold(y_val_high, p_val_high)
    
    print('Best epsilon found using cross-validation: %e'% epsilon_high)
    print('Best F1 on Cross Validation Set:  %f'% F1_high)
    print('# Anomalies found: %d'% sum(p_high < epsilon_high))
    
    
    Best epsilon found using cross-validation: 1.377229e-18
    Best F1 on Cross Validation Set:  0.615385
    # Anomalies found: 117
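With 11 features the density values become very small, as the best epsilon on the order of 1e-18 above suggests. One common way to keep such values numerically manageable is to work with log-densities instead. The sketch below is an illustration under stated assumptions, not part of the exercise: log_gaussian_diag is a hypothetical helper that is not in utils.py, it assumes var is a vector of per-feature variances, and the demo values are made up.

```python
import numpy as np

def log_gaussian_diag(X, mu, var):
    """Log-density of each row of X under independent per-feature Gaussians.

    Hypothetical helper (not part of utils.py); assumes var is a 1-D vector
    of per-feature variances, i.e. a diagonal covariance matrix.
    """
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (X - mu) ** 2 / var, axis=1)

# Made-up 11-feature example: one point at the mean, one far away from it
mu_demo = np.zeros(11)
var_demo = np.full(11, 4.0)
X_demo = np.vstack([np.zeros(11), np.full(11, 30.0)])

log_p = log_gaussian_diag(X_demo, mu_demo, var_demo)
print(log_p)          # both log-densities are finite
print(np.exp(log_p))  # the second density underflows to 0.0 in float64
```

Since the logarithm is monotonic, the rule p(x) < \varepsilon is equivalent to \log p(x) < \log \varepsilon, so a threshold could be selected on the log scale in the same way.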

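3. Quiz

  1. When to use supervised learning and when to use anomaly detection:
  • You are building a system to detect failing computers in a data center. You have 10,000 data records from computers that are running normally and no records from computers that have failed.

  • You are building a system to detect failing computers in a data center. You have 10,000 data records from computers that are running normally and 10,000 records from computers that have failed.

    2. The scenarios above are easy to decide, but what should you do when there are known anomalies but only very few of them?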

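Put the data from the anomalous engines (together with some normal engines) into the cross-validation and/or test sets.

When the threshold is made smaller, fewer examples are classified as anomalies.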


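You are monitoring the temperature and vibration intensity of newly manufactured aircraft engines. You have measured 100 engines and fit the Gaussian model described in the video lectures to the data. The 100 examples and the resulting distributions are shown in the figure. The most recent engine you are testing has a temperature reading of 17.5 and a vibration intensity of 48, marked in magenta in the figure. What is the probability of an engine having these two measurements?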
