北京大学生物信息学-第四周-马尔可夫 HMM及其应用
从状态到马尔可夫链
A Markov chain represents a discrete stochastic process across subsequent time points. The transition probabilities from one state to any other state, including itself, are determined by the probability distribution.
- 马尔科夫链用于描绘一组离散状态在不同时间点之间的转移模式。
- 这里的转移模式无需完全唯一,并可通过单一概率分布进行有效刻画。
- 唯一的要求是:t时刻的状态概率分布仅依赖于之前有限个m时刻的状态的概率分布,并被称作m阶马尔科夫链。
- 一阶马尔科夫链仅涉及当前的状态与前一个状态的关系。

转移概率:

从k状态转移到l状态、以及从l状态转移到k状态的过程中、转移概率矩阵呈现出不对称性。通常情况下、这种转移过程假设独立于时间变量t、属于齐次马尔可夫链的情形、即对应于线性组合罚分的情况。
有限自动机:
M对应的是成功匹配。
X对应的是前向配对。
Y对应的是后向配对。
当出现连续的前向配对时,在当前位置形成一个开放缺口(该缺口即由未配对的碱基组成)。
当连续出现两个以上的未配对碱基时,在最后一位形成一个延伸缺口(该缺口即由这些未配对的碱基组成)。

隐马尔可夫模型 Hidden Markov Model HMM
Such as the observable symbols ("tokens", y(t)) are produced based on their corresponding states (x(t)).
隐马尔科夫模型 introduced the concept of observable symbols by building upon the foundation of underlying states.
Each state can introduce a set of observable symbols with different probabilities, denoted as p.

Each state in the HMM model is subject to having a probability distribution across its possible output tokens, which includes both the Transition Probabilities and Emission Probabilities (generative probabilities).
- Thus, an HMM is composed of two distinct sequences of information.
– The state sequence 状态序列
– The token sequence (emitted sequence). 发射序列

I cannot directly observe the state sequence within a Hidden Markov Model (HMM). Its underlying nature remains hidden.
Rather than directly observing it, we must infer the latent state path from the derived observable token path.

Given a HMM, a sequence of tokens could be generated as
following:
Each time the system visits a given state, it generates a token according to that state's emission probability distribution.
Then, we select the next state to visit based on the state's transition probability distribution.
然后,根据状态的转移概率分布,我们选择下一个要访问的状态。

Each unit of the HMM constitutes a pair of successfully aligned residues in the M state, or a single residue with an insertion/deletion gap in either X or Y states.
HMM中的每个单位都对应着一对成功配对的残基(M状态),或者是一个残基与一个插入/删除间隙(X或Y状态)。
The transition and emission probabilities determine the likelihood of each alignment between two sequence units.
转移和发射概率决定了两单位之间配对的可能性。
- Under the basis of HMM, every alignment of two sequences is assigned to a specific probability.
– We search for an optimal alignment between two input sequences that possesses the greatest probability.
- We determine an optimal alignment between two input sequences that possesses the greatest probability.


Probabilistic interpretation概率解释:

δ:在生物演化过程中发生DNA片段插入与删除的可能性以及产生一个空位的可能性。
M状态生成的可能性,在生物演化过程中对应替代发生的频率。
基于概率论的知识进行进一步分析以计算两序列之间最大的相似性。
Given the nature of HMM, many different state paths can give rise to the same token sequence.
HMM同一个观察序列可以来自许多不同的状态路径,在概率求和的过程中所有序列之和的最大值即为两序列之间的最大可能相似性。

By aggregating all the tokens together, we can obtain the full probability of a given token sequence.



用隐马尔科夫模型建立预测模型
符号->状态路径
对每一个可能的状态序列评估其生成观测符号串的概率,在这些序列中具有最高生成概率的那个特定状态-观测序列对应关系即为目标需要识别的目标状态-观测序列。
- 例子-The Minimal Gene Prediction System (MGPS)最小的基因预测系统:
识别一段基因序列中的编码区与非编码区。

状态转换图:

训练模型:
训练什么?
- 状态之间的转移几率
- 各状态下生成的可能性
- 根据已知的“训练集”来推断其发生的可能性
- 具有注释标记的基因组片段,并标示出其中区分编码与非编码序列的具体区域。

从而填写转移概率矩阵 和生成概率矩阵 :

对基因组序列的位置进行推导分析后可得到最可能的状态路径(即概率最大化的状态序列):


由于要做很多乘法运算而导致其效率较低,并且当连乘的次数增多时计算结果容易变得过小以至于发生下溢现象在这种情况中引入了对数树的方法

Testing Sequence:CGAAAAAATCG

0.62+0.097+0.523=1.24
-2.32比-1.24小,所以不会被保留。
回溯:从最大的开始

结果:

应用:5’剪切位点的预测
