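Peking University Bioinformatics, Week 3: Sequence Databases and BLAST
Sequence databases
- GenBank is a genetic sequence database maintained by NCBI that collects publicly available DNA sequences.
- SRA (Sequence Read Archive) stores raw datasets produced by next-generation sequencing platforms (including 454, Illumina, SOLiD, IonTorrent, Helicos, and CompleteGenomics). Besides the raw sequence data, NCBI's SRA also holds alignment information for raw reads mapped to a reference genome.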
A first look at the BLAST algorithm
BLAST idea: seeding-and-extending
- Identify initial correspondences (seeds) between the query and the subject;
- Extend the seeds into high-scoring segment pairs (HSPs);
  – Run the Smith-Waterman algorithm only within the designated region.
- Assess the reliability of the alignment.

Seeding:
Given a word length w (typically 3 for proteins and 11 for nucleotides), split the query sequence into overlapping seed words of w consecutive letters.

Speed-up: index the database
The database is indexed in advance so that all positions of a given seed can be located quickly.
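
A pre-built seed index can be sketched as a dictionary from each w-letter word to the places it occurs. This is only a minimal illustration: the function name `build_seed_index`, the toy sequences, and the in-memory dictionary are assumptions made for the example, not how BLAST stores its databases.

```python
from collections import defaultdict

def build_seed_index(database, w):
    """Map every w-letter word to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for offset in range(len(seq) - w + 1):
            index[seq[offset:offset + w]].append((seq_id, offset))
    return index

# Toy database (hypothetical sequences, for illustration only).
database = {"s1": "ACGTACGT", "s2": "TTACGA"}
index = build_seed_index(database, w=3)

# One dictionary lookup returns every position of the seed.
print(index["ACG"])  # [('s1', 0), ('s1', 4), ('s2', 2)]
```

Building the index costs one pass over the database, after which each seed lookup no longer depends on the database size.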

E-value: how likely a match is to occur by chance
• The expected number of alignments scoring at least a given score in the searched database
– for instance, E=10 means that 10 matches scoring this high are expected to arise by chance

BLAST in detail
Why BLAST?
Homology is one of the fundamental concepts of biology.
BLAST detects sequence similarity by searching a query against a sequence database.
- If you work with one or a few proteins or genes, it can tell you about their conservation, active sites, structure, regulation, and so on in other organisms.
What does BLAST do?
Identity: the occurrence of identical nucleotides or amino acids at the same positions in aligned sequences.
Similarity: a measure of how alike or different two sequences are.
- Homology is defined in terms of common ancestry. Homologous sequences are usually similar, and homologous sequence regions are also called conserved regions.
- Dynamic programming gives the most accurate alignments. FASTA and BLAST are slightly less accurate than dynamic programming but much faster.

0 Filtering
Filtering avoids a flood of hits that are statistically significant but biologically uninteresting.
Low-complexity regions and repeats, e.g.
(CA)n
KLKLKLKLKLKL
These regions are masked with the following letters:
Ns (for nucleotide residues)
Xs (for amino acid residues)
- -F flag: filter the query sequence
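
To make the masking step concrete, here is a toy sketch that replaces short tandem repeats such as (CA)n or KLKLKL with a mask letter. It is not the low-complexity filter BLAST actually runs (BLAST relies on dedicated programs such as DUST and SEG); the function name `mask_simple_repeats` and the defaults `max_unit` and `min_copies` are assumptions chosen for the example.

```python
import re

def mask_simple_repeats(seq, mask_char="N", max_unit=2, min_copies=4):
    """Replace tandem repeats (unit of 1..max_unit letters, at least
    min_copies copies in a row) with mask_char, keeping the sequence length."""
    pattern = re.compile(r"(.{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    return pattern.sub(lambda m: mask_char * len(m.group(0)), seq)

# Nucleotide repeat masked with N, protein repeat masked with X.
print(mask_simple_repeats("ACGTCACACACACAGGT"))             # ACGTNNNNNNNNNNGGT
print(mask_simple_repeats("KLKLKLKLKLKL", mask_char="X"))   # XXXXXXXXXXXX
```

Because the masked letters no longer match any word, these regions cannot seed hits in the later steps.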
1 Seeding
- Build the list of w-letter words in the query sequence.
- Words (k-mers) are usually 3 letters long for protein sequences and 11 letters long for nucleic acid sequences.
- A query sequence of n letters contains n - w + 1 words.
- The -W flag sets the word size.
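
The word count n - w + 1 follows directly from sliding a window of width w one letter at a time. A minimal sketch (the helper name `query_words` and the example peptide are made up for illustration):

```python
def query_words(query, w):
    """Return (position, word) for every overlapping w-letter word in the query."""
    return [(i, query[i:i + w]) for i in range(len(query) - w + 1)]

words = query_words("MKVLAAG", w=3)
print(len(words))  # 5, i.e. n - w + 1 = 7 - 3 + 1
print(words)       # [(0, 'MKV'), (1, 'KVL'), (2, 'VLA'), (3, 'LAA'), (4, 'AAG')]
```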

2 Search word hits
- Scoring matrix
  – For amino acids, BLOSUM or PAM matrices are used
  – For DNA words, a match scores +5 and a mismatch -4, or alternatively +2 and -3
- No gaps are allowed
- Words scoring above a threshold T are kept in the list of candidate matching words
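
The thresholding step can be illustrated with a toy example that enumerates every word of the same length and keeps those whose ungapped score against a query word reaches T. The sketch uses the +5/-4 DNA scoring from the notes and a 3-letter word so the enumeration stays small; treating the threshold as inclusive, and the names `word_score` and `neighborhood`, are assumptions for the example. A protein version would look up BLOSUM or PAM entries instead of the match/mismatch rule.

```python
from itertools import product

def word_score(a, b, match=5, mismatch=-4):
    """Ungapped score of two equal-length words, position by position."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))

def neighborhood(word, threshold, alphabet="ACGT"):
    """All words of the same length scoring at least `threshold` against `word`."""
    return [
        "".join(cand)
        for cand in product(alphabet, repeat=len(word))
        if word_score(word, cand) >= threshold
    ]

# Exact match scores 15, one mismatch scores 6, two mismatches score -3.
print(len(neighborhood("ACG", threshold=6)))   # 10: the word itself plus 9 one-mismatch words
print(neighborhood("ACG", threshold=15))       # ['ACG']
```

Raising T shrinks the candidate list and speeds up the search at the cost of sensitivity.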

3 Scanning
- Hash table: a lookup method based on direct addressing
- Deterministic finite automaton (finite-state machine): faster
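
As a rough sketch of the direct-addressing idea, each DNA word can be packed into an integer (2 bits per base) that serves directly as an index into a table of query positions; scanning a subject sequence then needs one table access per window. The names `encode`, `build_query_table`, and `scan` are hypothetical, and the sketch assumes sequences contain only A, C, G, and T and that only exact word matches count as hits.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(word):
    """Pack a DNA word into an integer, 2 bits per base."""
    value = 0
    for base in word:
        value = value * 4 + CODE[base]
    return value

def build_query_table(query, w):
    """table[code] lists the query positions where the word with that code starts."""
    table = [[] for _ in range(4 ** w)]
    for i in range(len(query) - w + 1):
        table[encode(query[i:i + w])].append(i)
    return table

def scan(subject, table, w):
    """Slide over the subject and return (query_pos, subject_pos) for every word hit."""
    hits = []
    for j in range(len(subject) - w + 1):
        for i in table[encode(subject[j:j + w])]:
            hits.append((i, j))
    return hits

table = build_query_table("ACGTAC", w=3)
print(scan("TTACGTT", table, w=3))  # [(3, 1), (0, 2), (1, 3)]
```

The table has 4^w slots, so this layout is only practical for small w; a finite-state machine avoids re-encoding every window from scratch.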

4 Extending to HSPs
- Cutoff score S
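
The extension step might look like the following sketch: a word hit is extended without gaps in both directions, and the resulting segment pair is kept only if its score reaches the cutoff S. The +5/-4 scoring comes from the DNA scoring in these notes; the `x_drop` stopping rule (give up once the running score falls too far below the best seen) and all function names are assumptions added for the example.

```python
def score_pair(a, b, match=5, mismatch=-4):
    return match if a == b else mismatch

def extend_one_way(query, subject, qi, si, step, x_drop):
    """Extend from (qi, si) in direction step (+1 or -1).
    Returns (best score gained, number of letters that achieved it)."""
    best = running = 0
    best_len = length = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        running += score_pair(query[qi], subject[si])
        length += 1
        if running > best:
            best, best_len = running, length
        elif best - running > x_drop:
            break  # score dropped too far below the best: stop extending
        qi += step
        si += step
    return best, best_len

def extend_hit(query, subject, qi, si, w, cutoff_s, x_drop=10):
    """Extend a w-letter hit starting at query[qi] and subject[si].
    Returns (score, query_start, subject_start, length), or None if score < S."""
    seed = sum(score_pair(query[qi + k], subject[si + k]) for k in range(w))
    right, right_len = extend_one_way(query, subject, qi + w, si + w, +1, x_drop)
    left, left_len = extend_one_way(query, subject, qi - 1, si - 1, -1, x_drop)
    score = seed + left + right
    if score < cutoff_s:
        return None
    return score, qi - left_len, si - left_len, w + left_len + right_len

# Seed "ACG" at position 2 of both sequences extends three letters to the right.
print(extend_hit("TTACGTACCA", "GGACGTACTT", 2, 2, w=3, cutoff_s=20))  # (30, 2, 2, 6)
print(extend_hit("TTACGTACCA", "GGACGTACTT", 2, 2, w=3, cutoff_s=40))  # None
```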

5 Significance evaluation
- Raw scores: have no clear meaning without detailed knowledge of the scoring system used.
- Bit scores: raw scores normalized with a specific formula.
- E-values: the expected number of chance alignments corresponding to a given bit score.
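
The normalization is commonly written with the Karlin-Altschul parameters λ and K: the bit score is S' = (λS - ln K) / ln 2, and the E-value for a query of length m searched against a database of length n is E = m n 2^(-S'). The sketch below applies these two formulas; the parameter values and sequence lengths are placeholders chosen for illustration, since the real values depend on the scoring system and the database.

```python
import math

def bit_score(raw_score, lam, k):
    """Normalize a raw score S into a bit score S' = (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def e_value(bits, query_len, db_len):
    """Expected number of chance alignments: E = m * n * 2 ** (-S')."""
    return query_len * db_len * 2 ** (-bits)

# Placeholder parameters (assumptions for this example, not values from the notes).
lam, k = 0.318, 0.134
m, n = 250, 50_000_000

for raw in (40, 60, 80):
    bits = bit_score(raw, lam, k)
    print(raw, round(bits, 1), e_value(bits, m, n))
```

Because the bit score already absorbs λ and K, two searches run with different scoring systems can be compared on bit scores, while the E-value additionally accounts for the size of the search space.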


