
Stanford CS229 (taught by Andrew Ng) Study Notes (5)

CS229-notes3

  • Outline
    • Notes
    • Main Content
      • Problem Set #2: Kernels, SVMs, and Theory
        • 1. Kernel ridge regression
        • 2. \ell_2 norm soft margin SVMs
        • 3. SVM with Gaussian kernel
        • 4. Naive Bayes and SVMs for Spam Classification

Notes

These study notes are compiled from part 3 of the Stanford CS229 lecture notes and correspond to lectures 7 and 8 of the course videos posted on Bilibili: optimal margin classifiers and the sequential minimal optimization (SMO) algorithm. Some basic material from part 1 of the CS229 notes may also come up along the way.

Main Content

Problem Set #2: Kernels, SVMs, and Theory

1. Kernel ridge regression

Solution:
(a)
J(\theta)=\frac{1}{2}\sum_{i=1}^m(\theta^{T}x^{(i)}-y^{(i)})^2+\frac{\lambda}{2}||\theta||^2=\frac{1}{2}\sum_{i=1}^m(\theta^{T}x^{(i)}-y^{(i)})^2+\frac{\lambda}{2}\theta^T\theta
Therefore,
\begin{aligned} \frac{\partial}{\partial\theta}J(\theta) =&\frac{\partial}{\partial\theta}\left[\frac{1}{2}\sum_{i=1}^m(\theta^{T}x^{(i)}-y^{(i)})^2+\frac{\lambda}{2}\theta^T\theta\right] \\ =&\frac{1}{2}\sum_{i=1}^m\frac{\partial}{\partial\theta}(\theta^{T}x^{(i)}-y^{(i)})^2+\lambda\theta\\ =&\frac{1}{2}\sum_{i=1}^m2\cdot(\theta^{T}x^{(i)}-y^{(i)})\cdot x^{(i)}+\lambda\theta\\ =&\sum_{i=1}^m(\theta^{T}x^{(i)}-y^{(i)})x^{(i)}+\lambda\theta\\ \end{aligned}

\begin{aligned} X=&\left[ \begin{matrix} -&(x^{(1)})^T& -&\\ -&(x^{(2)})^T &-\\ &\vdots\\ -&(x^{(m)})^T& - \end{matrix} \right]\\ \vec y=&\left[ \begin{matrix} y^{(1)}\\ y^{(2)}\\ \vdots\\ y^{(m)} \end{matrix} \right]\\ \end{aligned}

\begin{aligned} \frac{\partial}{\partial\theta}J(\theta) =&\sum_{i=1}^m(\theta^{T}x^{(i)}-y^{(i)})x^{(i)}+\lambda\theta\\ =&X^T(X\theta-\vec y)+\lambda\theta\\ =&(X^TX+\lambda I)\theta-X^T\vec y \end{aligned}
Setting \frac{\partial}{\partial\theta}J(\theta)=0 gives \theta=(X^TX+\lambda I)^{-1}X^T\vec y.
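As a quick numerical check of this closed-form solution (a minimal sketch with arbitrary random data and an arbitrary \lambda, not part of the problem set), the gradient of J(\theta) evaluated at this \theta should be numerically zero:

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, lam = 50, 5, 0.5
    X = rng.normal(size=(m, n))                      # design matrix, m x n
    y = X @ rng.normal(size=n) + 0.1 * rng.normal(size=m)

    # theta = (X^T X + lam*I)^{-1} X^T y
    theta = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

    # gradient of J at theta: X^T (X theta - y) + lam * theta, should be ~0
    grad = X.T @ (X @ theta - y) + lam * theta
    print(np.allclose(grad, 0))                      # expected: True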
(b)
First, note the identity (\lambda I+BA)^{-1}B=B(\lambda I+AB)^{-1}, which follows from B(\lambda I+AB)=(\lambda I+BA)B; assuming B is invertible, it can also be derived step by step:
\begin{aligned} (\lambda I+BA)^{-1}B=&(\lambda I+BA)^{-1}(B^{-1})^{-1}\\ =&(B^{-1}(\lambda I+BA))^{-1}\\ =&(\lambda B^{-1}+A)^{-1}\\ =&BB^{-1}(\lambda B^{-1}+A)^{-1}\\ =&B(\lambda B^{-1}B+AB)^{-1}\\ =&B(\lambda I+AB)^{-1} \end{aligned}
If we replace x^{(i)} with \phi(x^{(i)}), then \theta=(\phi(X)^T\phi(X)+\lambda I)^{-1}\phi(X)^T\vec y.
Therefore,
\begin{aligned} \theta^T\phi(x_{new})=&((\phi(X)^T\phi(X)+\lambda I)^{-1}\phi(X)^T\vec y)^T\phi(x_{new})\\ =&(\vec y^T\phi(X)((\phi(X)^T\phi(X)+\lambda I)^T)^{-1})\phi(x_{new})\\ =&\vec y^T\phi(X)(\lambda I+\phi(X)^T\phi(X))^{-1}\phi(x_{new})\\ \end{aligned}
By the identity proved above,
\begin{aligned} \theta^T\phi(x_{new})=&\vec y^T\phi(X)(\lambda I+\phi(X)^T\phi(X))^{-1}\phi(x_{new})\\ =&\vec y^T(\lambda I+\phi(X)\phi(X)^T)^{-1}\phi(X)\phi(x_{new})\\ \end{aligned}
Finally, replace \phi(X)\phi(X)^T (the Gram matrix) and \phi(X)\phi(x_{new}) with kernel evaluations, e.g. the Gaussian kernel, so the prediction never requires computing \phi explicitly. (Note that this differs from the lecture notes' definition k=\phi(x)^T\phi(x'), because here X is defined as m\times n and x_{new} as n\times 1; the two are essentially the same.)
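As a numerical illustration of the kernelized prediction \vec y^T(\lambda I+\phi(X)\phi(X)^T)^{-1}\phi(X)\phi(x_{new}), here is a minimal NumPy sketch using the Gaussian kernel (the toy data, \lambda and \tau are arbitrary illustrative choices):

    import numpy as np

    def gaussian_kernel(A, B, tau):
        # k(a, b) = exp(-||a - b||^2 / tau^2) for every pair of rows of A and B
        sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-sq_dists / tau ** 2)

    def kernel_ridge_predict(X, y, x_new, lam, tau):
        # prediction = y^T (lam*I + K)^{-1} k, with K the Gram matrix and k_i = k(x^(i), x_new)
        K = gaussian_kernel(X, X, tau)                        # m x m Gram matrix
        k = gaussian_kernel(X, x_new[None, :], tau)[:, 0]     # m-vector
        return y @ np.linalg.solve(lam * np.eye(len(y)) + K, k)

    # toy data: the prediction at a training point should be close to its label
    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))
    y = np.sin(X[:, 0])
    print(kernel_ridge_predict(X, y, X[0], lam=0.1, tau=1.0), y[0])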

2. \ell _2 norm soft margin SVMs

Solution:
(a) The hyperplane of the soft margin SVM is determined by the sample points with y^{(i)}(\omega^Tx^{(i)}+b)\leq1. A point with \xi_i<0 has y^{(i)}(\omega^Tx^{(i)}+b)>1, i.e. it lies strictly outside the margin, far from the support vectors, and such points have no influence on the optimal hyperplane. Moreover, for any feasible solution with some \xi_i<0, resetting that \xi_i to 0 still satisfies y^{(i)}(\omega^Tx^{(i)}+b)\geq1-\xi_i and strictly decreases the penalty term \frac{C}{2}\sum_i\xi_i^2, so the optimum automatically has \xi_i\geq0. Hence the constraints \xi_i\geq0 can be removed without changing the solution.
(b)
\begin{aligned} \mathcal{L}(\omega,b,\xi,\alpha)=&\frac{1}{2}\omega^T\omega+\frac{C}{2}\sum^m_{i=1}\xi^2_i+\sum^m_{i=1}\alpha_i[1-\xi_i-y^{(i)}(\omega^Tx^{(i)}+b)]\\ =&\frac{1}{2}\omega^T\omega+\frac{C}{2}\sum^m_{i=1}\xi^2_i-\sum^m_{i=1}\alpha_i[y^{(i)}(\omega^Tx^{(i)}+b)-1+\xi_i] \end{aligned}
(c)
\begin{aligned} \nabla_\omega\mathcal{L}=&\omega-\nabla_\omega\sum^m_{i=1}\alpha_iy^{(i)}(\omega^Tx^{(i)}+b)\\ =&\omega-\sum^m_{i=1}\nabla_\omega\alpha_iy^{(i)}\omega^Tx^{(i)}\\ =&\omega-\sum^m_{i=1}\alpha_iy^{(i)}x^{(i)}\\ \end{aligned}
Setting \nabla_\omega\mathcal{L}=0 gives \omega=\sum^m_{i=1}\alpha_iy^{(i)}x^{(i)}.
\begin{aligned} \frac{\mathcal{\partial L}}{\partial b}=&\frac{\mathcal{\partial}}{\partial b}-\sum^m_{i=1}\alpha_i[y^{(i)}(\omega^Tx^{(i)}+b)-1+\xi_i]\\ =&\frac{\mathcal{\partial}}{\partial b}-\sum^m_{i=1}\alpha_iy^{(i)}(\omega^Tx^{(i)}+b)\\ =&\frac{\mathcal{\partial}}{\partial b}-\sum^m_{i=1}\alpha_iy^{(i)}b\\ =&-\sum^m_{i=1}\alpha_iy^{(i)} \end{aligned}
Setting \frac{\partial\mathcal{L}}{\partial b}=0 gives \sum^m_{i=1}\alpha_iy^{(i)}=0.
\begin{aligned} \nabla_{\xi_i}\mathcal{L}=&\nabla_{\xi_i}\{\frac{C}{2}\sum^m_{i=1}\xi^2_i-\sum^m_{i=1}\alpha_i[y^{(i)}(\omega^Tx^{(i)}+b)-1+\xi_i]\}\\ =&\nabla_{\xi_i}\{\frac{C}{2}\sum^m_{i=1}\xi^2_i-\sum^m_{i=1}\alpha_i\xi_i\}\\ =&C\xi_i-\alpha_i\\ \end{aligned}
Setting \nabla_{\xi_i}\mathcal{L}=0 gives \xi_i=\frac{\alpha_i}{C}, so
\begin{aligned} \xi=&\frac{1}{C}\left[ \begin{matrix} \alpha_1\\ \alpha_2\\ \vdots\\ \alpha_m \end{matrix} \right]\\ \end{aligned}
(d) Substituting the expressions from (c) into \mathcal{L} (and using \sum^m_{i=1}\alpha_iy^{(i)}=0 to drop the b term) gives
\begin{aligned} \mathcal{L}(\omega,b,\xi,\alpha)=&\frac{1}{2}\omega^T\omega+\frac{C}{2}\sum^m_{i=1}\xi^2_i-\sum^m_{i=1}\alpha_i[y^{(i)}(\omega^Tx^{(i)}+b)-1+\xi_i]\\ =&\frac{1}{2}\sum^m_{i=1}\alpha_iy^{(i)}(x^{(i)})^T\sum^m_{j=1}\alpha_jy^{(j)}x^{(j)}+\frac{C}{2}\cdot\frac{1}{C^2}\sum^m_{i=1}\alpha_i^2-\sum^m_{i=1}\alpha_i[y^{(i)}(\omega^Tx^{(i)}+b)-1+\frac{\alpha_i}{C}]\\ =&\frac{1}{2}\sum^m_{i=1}\alpha_iy^{(i)}(x^{(i)})^T\sum^m_{j=1}\alpha_jy^{(j)}x^{(j)}+\frac{C}{2}\cdot\frac{1}{C^2}\sum^m_{i=1}\alpha_i^2-\sum^m_{i=1}\alpha_iy^{(i)}(\omega^Tx^{(i)}+b)+\sum^m_{i=1}\alpha_i-\frac{1}{C}\sum^m_{i=1}\alpha_i^2\\ =&\frac{1}{2}\sum^m_{i=1}\alpha_iy^{(i)}(x^{(i)})^T\sum^m_{j=1}\alpha_jy^{(j)}x^{(j)}+\frac{C}{2}\cdot\frac{1}{C^2}\sum^m_{i=1}\alpha_i^2-\sum^m_{i=1}\alpha_iy^{(i)}\omega^Tx^{(i)}-\sum^m_{i=1}\alpha_iy^{(i)}b+\sum^m_{i=1}\alpha_i-\frac{1}{C}\sum^m_{i=1}\alpha_i^2\\ =&\frac{1}{2}\sum^m_{i=1}\alpha_iy^{(i)}(x^{(i)})^T\sum^m_{j=1}\alpha_jy^{(j)}x^{(j)}-\sum^m_{i=1}\alpha_iy^{(i)}\sum^m_{j=1}\alpha_jy^{(j)}(x^{(j)})^Tx^{(i)}+\sum^m_{i=1}\alpha_i-\frac{1}{2C}\sum^m_{i=1}\alpha_i^2\\ =&-\frac{1}{2}\sum^m_{i=1}\alpha_iy^{(i)}(x^{(i)})^T\sum^m_{j=1}\alpha_jy^{(j)}x^{(j)}+\sum^m_{i=1}\alpha_i-\frac{1}{2C}\sum^m_{i=1}\alpha_i^2\\ =&\sum^m_{i=1}\alpha_i-\frac{1}{2C}\sum^m_{i=1}\alpha_i^2-\frac{1}{2}\sum^m_{i,j=1}y^{(i)}y^{(j)}\alpha_i\alpha_j(x^{(i)})^Tx^{(j)}\\ \end{aligned}
From the \max_\alpha\min_{\omega,b,\xi}\mathcal{L} form of the dual problem, the \ell_2 norm soft margin SVM optimization problem is
\begin{aligned} \max_\alpha\quad W(\alpha)=&\sum^m_{i=1}\alpha_i-\frac{1}{2C}\sum^m_{i=1}\alpha_i^2-\frac{1}{2}\sum^m_{i,j=1}y^{(i)}y^{(j)}\alpha_i\alpha_j(x^{(i)})^Tx^{(j)}\\ s.t. \qquad&\alpha_i\geq0,\ i=1,...,m\\ &\sum^m_{i=1}\alpha_iy^{(i)}=0 \end{aligned}
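Note that since (y^{(i)})^2=1, the extra \frac{1}{2C}\sum_i\alpha_i^2 term can be folded into the quadratic part, so this is exactly the usual hard margin dual with the kernel entry (x^{(i)})^Tx^{(j)} replaced by (x^{(i)})^Tx^{(j)}+\frac{\delta_{ij}}{C} (where \delta_{ij}=1 if i=j and 0 otherwise):
W(\alpha)=\sum^m_{i=1}\alpha_i-\frac{1}{2}\sum^m_{i,j=1}y^{(i)}y^{(j)}\alpha_i\alpha_j\left((x^{(i)})^Tx^{(j)}+\frac{\delta_{ij}}{C}\right)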

3. SVM with Gaussian kernel

Solution:
(a) Following the hint, once \alpha_i and b are fixed, \tau can ultimately be expressed as a function of \epsilon.
Here we fix \alpha_i=1 for i=1...m and b=0 (other combinations could also be assumed).
Then
f(x)=\sum^m_{i=1}y^{(i)}K(x^{(i)},x)=\sum^m_{i=1}y^{(i)}exp(-\frac{||x^{(i)}-x||^2}{\tau^2})
Therefore,
\begin{aligned} f(x^{(j)})-y^{(j)}=&-y^{(j)}+\sum^m_{i=1}y^{(i)}exp(-\frac{||x^{(i)}-x^{(j)}||^2}{\tau^2})\\ =&-y^{(j)}+y^{(j)}+\sum^m_{i=1,i\neq j}y^{(i)}exp(-\frac{||x^{(i)}-x^{(j)}||^2}{\tau^2})\\ =&\sum^m_{i=1,i\neq j}y^{(i)}exp(-\frac{||x^{(i)}-x^{(j)}||^2}{\tau^2})\\ |f(x^{(j)})-y^{(j)}|\leq&\sum^m_{i=1,i\neq j}|y^{(i)}exp(-\frac{||x^{(i)}-x^{(j)}||^2}{\tau^2})|\\ \leq&\sum^m_{i=1,i\neq j}|y^{(i)}||exp(-\frac{||x^{(i)}-x^{(j)}||^2}{\tau^2})|\\ \leq&\sum^m_{i=1,i\neq j}|y^{(i)}||exp(-\frac{\epsilon^2}{\tau^2})|=(m-1)exp(-\frac{\epsilon^2}{\tau^2})\\ \end{aligned}
For every training point to be classified correctly we need |f(x^{(j)})-y^{(j)}|<1, so that f(x^{(j)}) has the same sign as y^{(j)}.
Requiring (m-1)exp(-\frac{\epsilon^2}{\tau^2})<1 gives
\tau^2<\frac{\epsilon^2}{ln(m-1)}
So with \alpha_i=1 for i=1...m, b=0, and any \tau^2<\frac{\epsilon^2}{ln(m-1)} (assuming m\geq3; for m\leq2 any \tau works), all sample points are classified correctly.
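A small numerical sanity check of this bound (a sketch with made-up data: 20 grid points whose minimum pairwise distance is \epsilon=1, random \pm1 labels, \alpha_i=1, b=0, and \tau^2 chosen just below \epsilon^2/ln(m-1)):

    import numpy as np

    rng = np.random.default_rng(1)
    X = np.array([[i, j] for i in range(4) for j in range(5)], dtype=float)  # m = 20, min pairwise distance 1
    y = rng.choice([-1.0, 1.0], size=len(X))
    m, eps = len(X), 1.0
    tau2 = 0.99 * eps ** 2 / np.log(m - 1)           # slightly below the bound from part (a)

    # f(x^(j)) = sum_i y^(i) exp(-||x^(i) - x^(j)||^2 / tau^2), with alpha_i = 1, b = 0
    K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1) / tau2)
    f = K @ y
    print(np.all(np.sign(f) == y))                   # every training point correctly classified -> True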
(b) Yes, the points can still be classified correctly. By the result of (a), whether or not the sample points are separable in the original low-dimensional space (i.e. without the kernel trick), the Gaussian kernel maps them into an infinite-dimensional space where they become separable (the slack penalty term is of course still present), so all sample points can be classified correctly. The SVM with slack variables (the modified SVM designed for the non-separable case) is a generalization of the basic SVM, so under the same \tau, since the basic SVM can classify all sample points correctly, the more general slack-variable SVM can do so as well.
(c) Not necessarily. The key issue is that the value of C is arbitrary. According to the lecture notes, when a sample point has functional margin 1-\xi_i, a penalty term C\xi_i is added to the objective. If C is arbitrary, then when C is very small (or zero), even if a sample point's functional margin is badly violated, the penalty is too weak to matter, so the optimized separating hyperplane is effectively what you would get by applying the linearly separable SVM directly to non-separable data or data with outliers, and correct classification of every point cannot be guaranteed.

4. Naive Bayes and SVMs for Spam Classification

Solution:
WEKA can be downloaded from https://www.cs.waikato.ac.nz/ml/weka/. The version I downloaded is weka-3-8-3jre-x64.exe, which bundles an installer for Oracle's 64-bit Java VM 1.8.
(a)
On Windows, open WEKA and choose the 'Simple CLI' module to launch the SimpleCLI window.

Type the command into the text box at the bottom and press Enter; pay attention to the file paths. The command entered here is

    java weka.classifiers.bayes.NaiveBayesMultinomial -t $file_path/q4/spam_train_1000.arff -T $file_path/q4/spam_test.arff
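Repeating this by hand for every training file is tedious. Below is a hypothetical batch driver that runs the same evaluations from an ordinary shell instead of the SimpleCLI; the weka.jar location, the q4/ directory layout, and the use of weka.classifiers.functions.SMO for part (b) are assumptions about this particular setup, not something prescribed by the problem set:

    import subprocess

    # Assumed layout: weka.jar and the q4/ folder with the .arff files in the current directory.
    sizes = [10, 25, 50, 100, 200, 300, 400, 500, 750, 1000, 1250, 1500, 1750, 2000]
    classifiers = [
        "weka.classifiers.bayes.NaiveBayesMultinomial",  # part (a)
        "weka.classifiers.functions.SMO",                # part (b), assumed
    ]

    for clf in classifiers:
        for n in sizes:
            cmd = ["java", "-cp", "weka.jar", clf,
                   "-t", f"q4/spam_train_{n}.arff", "-T", "q4/spam_test.arff"]
            result = subprocess.run(cmd, capture_output=True, text=True)
            print(f"=== {clf} on spam_train_{n}.arff ===")
            print(result.stdout)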

The execution results are shown in the screenshots below.

[Screenshots: NaiveBayesMultinomial training-set and test-set performance for spam_train_10 / 25 / 50 / 100 / 200 / 300 / 400 / 500 / 750 / 1000 / 1250 / 1500 / 1750 / 2000.arff, each evaluated against spam_test.arff.]

From the above data (excluding the training sets that produced errors), the following plot can be drawn.

(b)
The same evaluation, this time presumably using WEKA's SMO classifier (weka.classifiers.functions.SMO, as in the driver sketch above) in place of NaiveBayesMultinomial.
[Screenshots: SMO training-set and test-set performance for spam_train_10 / 25 / 50 / 100 / 200 / 300 / 400 / 500 / 750 / 1000 / 1250 / 1500 / 1750 / 2000.arff, each evaluated against spam_test.arff.]

From the above data (excluding the training sets that produced errors), the following plot can be drawn.

Comparing the results of NaiveBayesMultinomial (NBM) and SMO, shown in the figure below,

we can draw the following conclusions:
1) On the training data, both NBM and SMO achieve fairly low misclassification rates; SMO performs better, and the gap becomes more pronounced as the amount of training data grows.
2) On the test data, when the training set has fewer than 500 examples, both NBM and SMO achieve fairly low misclassification rates, with NBM performing better; with larger training sets the misclassification of both rises noticeably, but SMO still maintains the relatively lower level.
