An Introduction to Linear Regression
Intent
supervised learning:
- regression (continuous)
- classification (discrete)
h_{\theta}(x)=\displaystyle\sum_{i=0}^{n}\theta_i x_i=\theta^{T}x
For historical reasons, this function h is called a hypothesis.
The \theta_i's are the parameters.
(x^{(i)},y^{(i)})\qquad\text{(training example)}
\{(x^{(i)},y^{(i)});\ i=1,\dots,m\}\qquad\text{(training set)}

We want h_\theta(x^{(i)}) to be close to the corresponding y^{(i)} in the training set.
cost function: J(\theta) =\frac{1}{2}\displaystyle\sum_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})^2
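As a concrete illustration, here is a minimal NumPy sketch of the hypothesis and this cost function. The toy design matrix X (with a leading column of ones for the intercept \theta_0), the targets y, and the function names are illustrative assumptions, not something from the notes.

```python
import numpy as np

def hypothesis(theta, X):
    """h_theta(x) = theta^T x, evaluated for every row of the design matrix X."""
    return X @ theta

def cost(theta, X, y):
    """J(theta) = 1/2 * sum_i (h_theta(x^(i)) - y^(i))^2."""
    residuals = hypothesis(theta, X) - y
    return 0.5 * np.sum(residuals ** 2)

# Toy data: m = 4 examples, one feature plus a column of ones for the intercept.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
print(cost(np.zeros(2), X, y))  # cost at theta = 0
```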
To obtain a model with stable performance, it is essential to trade off bias against variance. Regularized regression adds an L1 penalty (Lasso) or an L2 penalty (Ridge); the L1 penalty effectively performs feature selection (it drives some parameters to zero). The main effect of regularization is to reduce variance. Cross-validation and the bootstrap are also common techniques for balancing bias and variance.
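For reference, the penalized cost functions this refers to can be written as follows (\lambda is the regularization strength, a hyperparameter chosen by the user):

J_{\text{ridge}}(\theta) = \frac{1}{2}\displaystyle\sum_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})^2 + \lambda\displaystyle\sum_{j=1}^{n}\theta_j^2

J_{\text{lasso}}(\theta) = \frac{1}{2}\displaystyle\sum_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})^2 + \lambda\displaystyle\sum_{j=1}^{n}|\theta_j|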
LMS
gradient descent
\theta_j is updated to its current value minus the learning rate times the partial derivative of the objective function with respect to \theta_j, i.e. \theta_j = \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta), where \alpha is the learning rate.
Taking the partial derivative of J with respect to \theta_j for a single training example, the factor of \frac{1}{2} cancels with the 2 from the chain rule, leaving (h_\theta(x)-y)\,x_j.
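Spelling out that chain-rule step for a single example (x, y):

\frac{\partial}{\partial\theta_j}J(\theta) = \frac{\partial}{\partial\theta_j}\frac{1}{2}(h_\theta(x)-y)^2 = (h_\theta(x)-y)\cdot\frac{\partial}{\partial\theta_j}\Big(\displaystyle\sum_{i=0}^{n}\theta_i x_i - y\Big) = (h_\theta(x)-y)\,x_j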
For a single training example, this gives the update rule:
\theta_j = \theta_j +\alpha(y^{(i)}-h_\theta(x^{(i)}))x^{(i)}_j
This rule is called the LMS update rule (LMS stands for "least mean squares").
batch gradient descent
Repeat \; until \; convergence \quad \{
\quad \theta_j = \theta_j + \alpha \displaystyle\sum_{i=1}^{m} (y^{(i)}-h_\theta(x^{(i)}))\,x^{(i)}_j \qquad (\text{for every } j)
\}
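Here is a minimal NumPy sketch of batch gradient descent with this update; the learning rate, iteration count, and toy data are arbitrary choices for illustration.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, n_iters=1000):
    """Batch LMS: each update sums the error over all m training examples."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        errors = y - X @ theta                  # y^(i) - h_theta(x^(i)) for every i
        theta = theta + alpha * (X.T @ errors)  # update every theta_j at once
    return theta

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
print(batch_gradient_descent(X, y))  # approaches [0, 2] on this noise-free data
```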
stochastic gradient descent
Loop \quad\{
\quad for \; i = 1 \; to \; m \; \{
\qquad \theta_j = \theta_j + \alpha\,(y^{(i)}-h_\theta(x^{(i)}))\,x^{(i)}_j \qquad (\text{for every } j)
\quad \}
\}
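A matching NumPy sketch of the stochastic version; shuffling the examples each pass, the learning rate, and the epoch count are my own illustrative choices.

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.05, n_epochs=200, seed=0):
    """Stochastic LMS: update theta using one training example at a time."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        for i in rng.permutation(m):         # visit the examples in random order
            error = y[i] - X[i] @ theta      # y^(i) - h_theta(x^(i))
            theta = theta + alpha * error * X[i]
    return theta

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
print(stochastic_gradient_descent(X, y))  # settles near [0, 2] on this toy data
```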
Compared with the batch method, stochastic gradient descent usually gets close to the minimum much faster, since it makes progress after every single example instead of scanning the entire training set before each update (though it may keep oscillating around the minimum rather than converging exactly).
A probabilistic interpretation
Please refer to [1].
A linear algebra interpretation
Ng's draft derives the linear regression model in two ways, one based on linear algebra and one based on probability theory. The probabilistic approach models the problem by maximum likelihood estimation under a Gaussian distribution; by comparison, the linear algebra approach is more tedious to work through, but viewed in terms of projection it can be more intuitive [2].
I will try to lay this out clearly here [2]:
Projection onto a Line
A line goes through the origin in the direction of a = (a_1, \dots, a_m). Along that line, we want the point p closest to b = (b_1, \dots, b_m). The key to projection is orthogonality: the line from b to p is perpendicular to the vector a. This is the error e = b - p, which we now compute by algebra.

The projection p is a multiple of a. Call it p = \hat{x}a, "x hat" times a. Computing the scalar \hat{x} gives the vector p, and from the formula for p we can read off the projection matrix P. These three steps lead to all projection matrices: find \hat{x}, then find the vector p, then find the matrix P.
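Carrying out those three steps (the standard computation from [2]): the orthogonality condition a^T(b - \hat{x}a) = 0 gives

\hat{x} = \frac{a^{T}b}{a^{T}a},\qquad p = \hat{x}\,a = \frac{a^{T}b}{a^{T}a}\,a,\qquad P = \frac{a\,a^{T}}{a^{T}a}\quad(\text{so that } p = Pb)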
Projection onto a Subspace
We now compute projections onto an n-dimensional subspace, namely the column space of a matrix A whose columns are a_1, \dots, a_n. The same three steps apply: find the coefficient vector \hat{x}, compute the projection p = A\hat{x} onto that subspace, and determine the projection matrix P.
The residual b - A\hat{x} is perpendicular to each of the vectors a_1, \dots, a_n. These n right angles give n equations:
a^T_1(b-A\hat{x})=0\\
\quad\quad\vdots
a^T_n(b-A\hat{x})=0\\
The matrix with those rows a^T_i is A^T, so the n equations are exactly A^T(b - A\hat{x}) = 0. Rewritten in its famous form, this is A^TA\hat{x} = A^Tb: the equation for \hat{x}, with coefficient matrix A^TA. Now we can find \hat{x}, p, and P, in that order:
\hat{x}=(A^TA)^{-1}A^Tb,\qquad p = A\hat{x}=A(A^TA)^{-1}A^Tb
The projection matrix is:
P=A(A^TA)^{-1}A^T
With the projection matrix P we can compute p = Pb, the projection of b onto the column space of A.
This requires the columns of A to be linearly independent, so that A^TA is invertible.
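This is exactly where the least-squares solution of linear regression comes from: with the design matrix X in the role of A and the targets y in the role of b, the normal equation yields \theta = (X^TX)^{-1}X^Ty. A small NumPy sketch, reusing the toy data from above (in practice np.linalg.lstsq is the numerically safer route, since it avoids forming X^TX explicitly):

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

# Normal equation: theta = (X^T X)^{-1} X^T y (columns of X must be independent).
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# The same answer via the library least-squares routine.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta_normal)  # approximately [0, 2]
print(theta_lstsq)   # approximately [0, 2]
```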
classification (Logistic regression)
Logistic regression is just linear regression passed through the sigmoid function; that choice needs little further explanation here.
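For completeness, the hypothesis this refers to (the standard form from [1]):

h_\theta(x) = g(\theta^{T}x) = \frac{1}{1+e^{-\theta^{T}x}}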
Generalized Linear Models
Both linear regression and logistic regression belong to the family of Generalized Linear Models (GLMs). Depending on the probability distribution assumed for the response variable y, a whole series of specific models can be derived: linear regression corresponds to a Gaussian-distributed response, logistic regression to a binary classification problem, and softmax regression to a response that falls into one of a finite set of classes. There is also an R implementation of softmax regression referred to in the literature; the cost function it solves is the elastic net's cost function.
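As a pointer, the softmax hypothesis mentioned above has the form below (writing \theta_k for the parameter vector of class k; this notation is my own assumption):

p(y=k \mid x;\theta) = \frac{e^{\theta_k^{T}x}}{\displaystyle\sum_{j=1}^{K}e^{\theta_j^{T}x}}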
- [1] CS 229 Notes: Supervised Learning (Andrew Ng)
- [2] Introduction to Linear Algebra, Chapter 4, Gilbert Strang
