Advertisement

GTDB-Tk基因组物种注释

阅读量:

文章目录

  • 安装程序
  • 最新版GTDB R220及其相关组件
  • GTDB-Tk软件组件运行于版本号v2.3.3,并配套有gtdbtk_r214_data.tar.gz文件
  • GTDB-Tk软件组件运行于版本号v2.1.1,并配套有gtdbtk_r207_v2_data.tar.gz文件
  • GTDB-Tk软件组件运行于旧版v1.3.0

数据库
采用单步分类标记法进行物种注释
逐步应用分步分类标记方法
使用 batchfile 输入基因组路径并压缩基因组数据(gzip)
批量处理 Rep-seq 数据与 GTDB 注释过程
查询当前已安装的数据库列表
运用 iTOL 进行生物信息学可视化分析
使用 pplacer 构建并插入进化树结构
时间参数设置相关问题
数据库配置参数设置问题
时间管理建议
数据库优化建议

安装

最新版GTDB R220

参考https://ecogenomics.github.io/GTDBTk/installing/index.html

在这里插入图片描述
  • gtdbtk 2.3.2是兼容上述最新版R220数据库的

GTDB-Tk v2.3.3 and gtdbtk_r214_data.tar.gz

复制代码
    https://data.gtdb.ecogenomic.org/releases/release214/214.0/auxillary_files/ 
    (gtdbtk-2.1.1) [yut@node01 02_Glacier_new_taxa]$ conda env config vars set GTDBTK_DATA_PATH="/datanode02/yut/Database/GTDBTK_r214_data/" # 配置数据库环境变量
    To make your changes take effect please reactivate your environment
    
    
    
      
      
      
      
    
    代码解读

GTDB-Tk v2.1.1 and gtdbtk_r207_v2_data.tar.gz

复制代码
    Executing transaction: /
    GTDB-Tk v2.1.1 requires ~63G of external data which needs to be downloaded
    and extracted. This can be done automatically, or manually.
    
    Automatic:
    
        1. Run the command "download-db.sh" to automatically download and extract to:
            /datanode02/yut/Software/Miniconda3/envs/gtdbtk-2.1.1/share/gtdbtk-2.1.1/db/
    
    Manual:
    
        1. Manually download the latest reference data:
            wget https://data.gtdb.ecogenomic.org/releases/release207/207.0/auxillary_files/gtdbtk_r207_v2_data.tar.gz
    
        2. Extract the archive to a target directory:
            tar -xvzf gtdbtk_r207_v2_data.tar.gz -C "/path/to/target/db" --strip 1 > /dev/null
            rm gtdbtk_r207_v2_data.tar.gz
    
        3. Set the GTDBTK_DATA_PATH environment variable by running:
            conda env config vars set GTDBTK_DATA_PATH="/path/to/target/db"
    
    
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
    代码解读
  • 问题
复制代码
    EXCEPTION: ProdigalException
    MESSAGE: An exception was caught while running Prodigal: Prodigal returned a non-zero exit code.
    AttributeError: module 'numpy' has no attribute 'bool'.
    
    
      
      
      
    
    代码解读

解决方法:

复制代码
    (gtdbtk-2.1.1) [yut@node01 02_Glacier_new_taxa]$ mamba install numpy=1.23.1
    
    
      
    
    代码解读

GTDB-Tk 1.3.0

复制代码
    conda install -c bioconda gtdbtk
    
    
      
    
    代码解读

GTDB-Tk v1.3.0 needs approximately 25 gigabytes of external data, which must be downloaded and extracted. This operation can either be automated or performed manually.

复制代码
 Run the command download-db.sh to automatically download to:
    /home/yutao/miniconda3/share/gtdbtk-1.3.0/db/
    
     2. Manually download the latest reference data:
    https://github.com/Ecogenomics/GTDBTk#gtdb-tk-reference-data
    
     2b. Set the GTDBTK_DATA_PATH environment variable in the file:
     /home/yutao/miniconda3/etc/conda/activate.d
    
    
      
      
      
      
      
      
      
      
    
    代码解读

说明

数据库

复制代码
    wget https://data.ace.uq.edu.au/public/gtdb/data/releases/release95/95.0/auxillary_files/gtdbtk_r95_data.tar.gz
    tar xvzf gtdbtk_r95_data.tar.gz
    #GTDB-Tk requires an environment variable named GTDBTK_DATA_PATH to be set to the directory containing the unarchived GTDB-Tk reference data.
    export GTDBTK_DATA_PATH=/path/to/release/package/
    
    # 最新版R202
    https://data.gtdb.ecogenomic.org/releases/release202/202.0/auxillary_files/gtdbtk_r202_data.tar.gz
    # 参考
    https://ecogenomics.github.io/GTDBTk/installing/index.html
    
    
      
      
      
      
      
      
      
      
      
    
    代码解读

推荐使用flash download下载到本地,在通过filezila上传到服务器

classify_wf 物种注释一步使用

复制代码
    # 输入基因组目录
    gtdbtk classify_wf --cpus 20 --genome_dir HTR9_maxbin_genome -x fasta --out_dir classify 
    
    
      
      
    
    代码解读

–genome_dir 包含多个基因组的目录
-x 基组文件后缀名

  • 建议将GTDB-Tk v0.3.2的进程数量限制在30个以内
  • 请确保不过滤基因组数据
  • 最终结果即物种注释表
复制代码
    (gtdbtk-2.1.1) [yut@io02 TestData]$ cat  gtdb_out/gtdbtk.bac120.summary.tsv
    user_genome     classification  fastani_reference       fastani_reference_radius        fastani_taxonomy        fastani_ani     fastani_af      closest_placement_reference     closest_placement_radius        closest_placement_taxonomy      closest_placement_ani   closest_placement_af      pplacer_taxonomy        classification_method   note    other_related_references(genome_id,species_name,radius,ANI,AF)  msa_percent     translation_table       red_value       warnings
    COVID19.fa      Unclassified    N/A     N/A     N/A     N/A     N/A     N/A     N/A     N/A     N/A     N/A     N/A     N/A     N/A     N/A     N/A     N/A     N/A     No bacterial or archaeal marker
    DRR331386_ma_bin.12.fa  d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Burkholderiales;f__Burkholderiaceae;g__Rhodoferax;s__   N/A     N/A     N/A     N/A     N/A     N/A     N/A     N/A     N/A     N/A     d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Burkholderiales;f__Burkholderiaceae;g__Rhodoferax;s__     taxonomic classification defined by topology and ANI    classification based on placement in class-level tree   GCA_010365055.1, s__Rhodoferax sp010365055, 95.0, 86.38, 0.64; GCF_000013605.1, s__Rhodoferax ferrireducens, 95.0, 86.12, 0.68; GCA_013791765.1, s__Rhodoferax sp013791765, 95.0, 85.54, 0.65; GCA_002413825.1, s__Rhodoferax sp002413825, 95.0, 80.4, 0.48; GCA_012927445.1, s__Rhodoferax sp012927445, 95.0, 80.38, 0.47; GCF_003347515.1, s__Rhodoferax sp003347515, 95.0, 80.21, 0.5; GCA_017880985.1, s__Rhodoferax sp017880985, 95.0, 80.07, 0.46; GCA_002422455.1, s__Rhodoferax sp002422455, 95.0, 79.42, 0.47; GCA_903828665.1, s__Rhodoferax sp903828665, 95.0, 79.26, 0.38; GCA_002842265.1, s__Rhodoferax sp002842265, 95.0, 79.26, 0.37; GCA_014859475.1, s__Rhodoferax sp014859475, 95.0, 79.23, 0.42; GCA_903839085.1, s__Rhodoferax sp903839085, 95.0, 79.18, 0.33; GCA_011391645.1, s__Rhodoferax sp011391645, 95.0, 79.12, 0.41; GCA_903876735.1, s__Rhodoferax sp903876735, 95.0, 79.09, 0.39; GCA_903920695.1, s__Rhodoferax sp903920695, 95.0, 78.98, 0.39; GCA_011391545.1, s__Rhodoferax sp011391545, 95.0, 78.98, 0.34; GCA_903914175.1, s__Rhodoferax sp903914175, 95.0, 78.95, 0.35; GCA_018780565.1, s__Rhodoferax sp018780565, 95.0, 78.94, 0.39; GCA_002789915.1, s__Rhodoferax sp002789915, 95.0, 78.86, 0.38; GCA_003133245.1, s__Rhodoferax sp003133245, 95.0, 78.84, 0.37; GCA_016708345.1, s__Rhodoferax sp016708345, 95.0, 78.81, 0.36; GCA_903905735.1, s__Rhodoferax sp903905735, 95.0, 78.81, 0.36; GCA_007280205.1, s__Rhodoferax sp007280205, 95.0, 78.8, 0.28; GCA_002083775.1, s__Rhodoferax ferrireducens_A, 95.0, 78.78, 0.37; GCA_903938955.1, s__Rhodoferax sp903938955, 95.0, 78.76, 0.29; GCA_001770935.1, s__Rhodoferax sp001770935, 95.0, 78.66, 0.35; GCA_903960915.1, s__Rhodoferax sp903960915, 95.0, 78.66, 0.32; GCA_903863395.1, s__Rhodoferax sp903863395, 95.0, 78.63, 0.32; GCA_001771065.1, s__Rhodoferax sp001771065, 95.0, 78.6, 0.32; GCF_001938565.1, s__Rhodoferax antarcticus, 95.0, 78.6, 0.32; GCA_903896425.1, s__Rhodoferax sp903896425, 95.0, 78.6, 0.3; GCF_018619435.1, s__Rhodoferax sp018619435, 95.0, 78.58, 0.34; GCA_002786915.1, s__Rhodoferax sp002786915, 95.0, 78.55, 0.35; GCF_013416775.1, s__Rhodoferax sp013416775, 95.0, 78.53, 0.34; GCA_002784245.1, s__Rhodoferax sp002784245, 95.0, 78.52, 0.4; GCA_016718155.1, s__Rhodoferax sp016718155, 95.0, 78.51, 0.35; GCF_002017865.1, s__Rhodoferax fermentans, 95.0, 78.5, 0.34; GCA_903823025.1, s__Rhodoferax sp903823025, 95.0, 78.47, 0.31; GCA_903873485.1, s__Rhodoferax sp903873485, 95.0, 78.47, 0.35; GCA_903897005.1, s__Rhodoferax sp903897005, 95.0, 78.4, 0.28; GCF_001955715.1, s__Rhodoferax saidenbachensis, 95.0, 78.36, 0.3; GCA_001795455.1, s__Rhodoferax sp001795455, 95.0, 78.31, 0.32; GCF_003415675.1, s__Rhodoferax sp003415675, 95.0, 78.3, 0.35; GCA_001871515.1, s__Rhodoferax sp001871515, 95.0, 78.3, 0.32; GCA_008080595.1, s__Rhodoferax sp008080595, 95.0, 78.3, 0.32; GCA_016703945.1, s__Rhodoferax sp016703945, 95.0, 78.26, 0.32; GCA_903894475.1, s__Rhodoferax sp903894475, 95.0, 78.23, 0.3; GCA_009693325.1, s__Rhodoferax sp009693325, 95.0, 78.19, 0.25; GCA_903845935.1, s__Rhodoferax sp903845935, 95.0, 78.15, 0.3; GCA_016718405.1, s__Rhodoferax sp016718405, 95.0, 78.14, 0.33; GCA_008015235.1, s__Rhodoferax sp008015235, 95.0, 78.13, 0.28; GCA_016719055.1, s__Rhodoferax sp016719055, 95.0, 78.08, 0.33; GCF_002943465.1, s__Rhodoferax sp002943465, 95.0, 78.07, 0.32; GCA_014190675.1, s__Rhodoferax sp014190675, 95.0, 78.06, 0.23; GCA_016719405.1, s__Rhodoferax sp016719405, 95.0, 78.05, 0.28; GCA_903897925.1, s__Rhodoferax sp903897925, 95.0, 78.04, 0.27; GCF_017948265.1, s__Rhodoferax sp017948265, 95.0, 78.02, 0.28; GCF_017798165.1, s__Rhodoferax sp017798165, 95.0, 78.01, 0.31; GCA_903853895.1, s__Rhodoferax sp903853895, 95.0, 78.01, 0.26; GCF_002251855.1, s__Rhodoferax sp002251855, 95.0, 77.99, 0.28; GCA_018063565.1, s__Rhodoferax sp018063565, 95.0, 77.97, 0.26; GCA_903886165.1, s__Rhodoferax sp903886165, 95.0, 77.97, 0.32; GCA_903840825.1, s__Rhodoferax sp903840825, 95.0, 77.95, 0.24; GCA_903906695.1, s__Rhodoferax sp903906695, 95.0, 77.95, 0.32; GCA_016705575.1, s__Rhodoferax sp016705575, 95.0, 77.9, 0.28; GCA_903927295.1, s__Rhodoferax sp903927295, 95.0, 77.85, 0.27; GCA_009925925.1, s__Rhodoferax sp009925925, 95.0, 77.84, 0.21; GCA_018061265.1, s__Rhodoferax sp018061265, 95.0, 77.83, 0.22; GCA_903928635.1, s__Rhodoferax sp903928635, 95.0, 77.83, 0.27; GCA_002256115.1, s__Rhodoferax sp002256115, 95.0, 77.82, 0.26; GCA_903945325.1, s__Rhodoferax sp903945325, 95.0, 77.81, 0.27; GCA_017997975.1, s__Rhodoferax sp017997975, 95.0, 77.79, 0.23; GCA_013297935.1, s__Rhodoferax sp013297935, 95.0, 77.79, 0.23; GCA_903864505.1, s__Rhodoferax sp903864505, 95.0, 77.78, 0.2; GCA_009922295.1, s__Rhodoferax sp009922295, 95.0, 77.78, 0.21; GCF_005876985.1, s__Rhodoferax bucti, 95.0, 77.77, 0.29; GCF_002163715.1, s__Rhodoferax sp002163715, 95.0, 77.77, 0.23; GCA_903925725.1, s__Rhodoferax sp903925725, 95.0, 77.7, 0.21; GCA_903944555.1, s__Rhodoferax sp903944555, 95.0, 77.64, 0.18; GCA_903948045.1, s__Rhodoferax sp903948045, 95.0, 77.61, 0.24; GCA_903856025.1, s__Rhodoferax sp903856025, 95.0, 77.61, 0.25; GCA_903930255.1, s__Rhodoferax sp903930255, 95.0, 77.57, 0.19; GCA_903924955.1, s__Rhodoferax sp903924955, 95.0, 77.54, 0.24; GCA_903876045.1, s__Rhodoferax sp903876045, 95.0, 77.49, 0.21; GCA_018882485.1, s__Rhodoferax sp018882485, 95.0, 77.47, 0.22; GCA_903954045.1, s__Rhodoferax sp903954045, 95.0, 77.45, 0.2; GCA_903958915.1, s__Rhodoferax sp903958915, 95.0, 77.43, 0.19; GCA_903869595.1, s__Rhodoferax sp903869595, 95.0, 77.43, 0.21; GCA_009924455.1, s__Rhodoferax sp009924455, 95.0, 77.41, 0.23; GCA_903875275.1, s__Rhodoferax sp903875275, 95.0, 77.34, 0.23; GCA_903859265.1, s__Rhodoferax sp903859265, 95.0, 77.32, 0.19; GCA_903928325.1, s__Rhodoferax sp903928325, 95.0, 77.31, 0.21; GCA_002390825.1, s__Rhodoferax sp002390825, 95.0, 77.3, 0.19; GCF_006974105.1, s__Rhodoferax sp006974105, 95.0, 77.2, 0.19; GCA_002381045.1, s__Rhodoferax sp002381045, 95.0, 77.19, 0.22; GCA_903870955.1, s__Rhodoferax sp903870955, 95.0, 77.02, 0.18; GCA_007279715.1, s__Rhodoferax sp007279715, 95.0, 77.01, 0.11; GCA_903914985.1, s__Rhodoferax sp903914985, 95.0, 76.98, 0.22; GCA_903870065.1, s__Rhodoferax sp903870065, 95.0, 76.94, 0.19; GCA_005793295.1, s__Rhodoferax sp005793295, 95.0, 75.99, 0.1      81.04   11      0.9887676016165975      Genome not assigned to closest species as it falls outside its pre-defined ANI radius;Genome has more than 11.7% of markers with multiple hits
    DRR331386_ma_bin.18.fa  d__Bacteria;p__Patescibacteria;c__Saccharimonadia;o__UBA4664;f__UBA4664;g__UBA4664;s__  N/A     N/A     N/A     N/A     N/A     GCA_002415345.1 95.0    d__Bacteria;p__Patescibacteria;c__Saccharimonadia;o__UBA4664;f__UBA4664;g__UBA4664;s__UBA4664 sp00241534582.82    0.8     d__Bacteria;p__Patescibacteria;c__Saccharimonadia;o__UBA4664;f__UBA4664;g__UBA4664;s__  taxonomic classification defined by topology and ANI    classification based on placement in class-level tree   N/A     65.55   11      0.9673951391491065      Genome not assigned to closest species as it falls outside its pre-defined ANI radius
    
    
      
      
      
      
      
    
    代码解读
  • 时间:3179个基因组28核耗时30h
    • 标记基因蛋白多序列文件及其他结果
复制代码
    (gtdbtk-2.1.1) [yut@io02 TestData]$ tree gtdb_out
    gtdb_out
    ├── align
    │   ├── gtdbtk.bac120.filtered.tsv
    │   ├── gtdbtk.bac120.msa.fasta.gz # gtdb所有物种细菌120个标记
    │   └── gtdbtk.bac120.user_msa.fasta.gz # 输入基因组细菌120个标记
    ├── classify
    │   ├── gtdbtk.bac120.classify.tree.1.tree
    │   ├── gtdbtk.bac120.classify.tree.6.tree
    │   ├── gtdbtk.bac120.summary.tsv
    │   ├── gtdbtk.bac120.tree.mapping.tsv
    │   └── gtdbtk.backbone.bac120.classify.tree
    ├── gtdbtk.bac120.summary.tsv -> classify/gtdbtk.bac120.summary.tsv
    ├── gtdbtk.log
    ├── gtdbtk.warnings.log
    └── identify
    ├── gtdbtk.ar53.markers_summary.tsv
    ├── gtdbtk.bac120.markers_summary.tsv
    ├── gtdbtk.failed_genomes.tsv
    └── gtdbtk.translation_table_summary.tsv
    
    
    
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
    代码解读

分步使用

原始样本数据

复制代码
    # Create the directory.
    mkdir -p /tmp/gtdbtk && cd /tmp/gtdbtk
    
    # Obtain the genomes.
    mkdir -p /tmp/gtdbtk/genomes
    wget -q https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/947/435/GCF_003947435.1_ASM394743v1/GCF_003947435.1_ASM394743v1_genomic.fna.gz -O /tmp/gtdbtk/genomes/genome_a.fna.gz
    wget -q https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002/011/125/GCA_002011125.1_ASM201112v1/GCA_002011125.1_ASM201112v1_genomic.fna.gz -O /tmp/gtdbtk/genomes/genome_b.fna.gz
    
    
      
      
      
      
      
      
      
    
    代码解读

Step 2: Gene calling (identify) *通过单步处理:分类标记为 classify_wf;然而,分部处理更适合处理大规模数据。

列出指定目录中的文件及子目录下的文件信息

总计共有877,598个基因组记录

  • r w - r w - ., 用户uqamussi拥有此基因组序列的读取、写入及删除权限;该基因组序列由用户uqamussi创建于20世纪9年代末。

/tmp/gtdbtk/genomes/genome_a.fna.gz

/tmp/gtdbtk/genomes/genome_b.fna.gz

复制代码
    gtdbtk identify --genome_dir /tmp/gtdbtk/genomes --out_dir /tmp/gtdbtk/identify --extension gz --cpus 2
    
    
      
    
    代码解读

genome_dir 包含若干基因组文件,并如 binning 过程后产生的多个 bins 所示。
extension 作为基因组文件的后缀名.

[2020-08-03 11:03:42] INFO: GTDB-Tk v1.3.0
[2020-08-03 11:03:42] INFO: gtdbtk identify --genome_dir /tmp/gtdbtk/genomes --out_dir /tmp/gtdbtk/identify --extension gz --cpus 2
[2020-08-03 11:03:42] INFO: Using GTDB-Tk reference data version r95: /srv/db/gtdbtk/official/release95
[2020-08-03 11:03:42] INFO: Identifying markers in 2 genomes with 2 threads.
[2020-08-03 11:03:42] INFO: Running Prodigal V2.6.3 to identify genes.
==> Processed 2/2 (100%) genomes [ 6.28s/it, ETA 00:00]
[2020-08-03 11:03:54] INFO: Identifying TIGRFAM protein families.
==> Processed 2/2 (100%) genomes [ 3.79s/it, ETA 00:00]
[2020-08-03 11:04:02] INFO: Identifying Pfam protein families.
==> Processed 2/2 (100%) genomes [ 1.62it/s, ETA 00:00]
[2020-08-03 11:04:03] INFO: Annotations done using HMMER 3.1b2 (February 2015).
[2020-08-03 11:04:03] INFO: Done.

在每一个基因组对应的中间文件目录中获取被调用的基因及标记信息

genome_a.fnapfamtophit.tsv and genome_a.fnaprotein.gff with their sha256 hashes are generated.
The computed hashes for both files are stored in genome_a.fnaprotptein.gff.sha256.
The prodigal translation table is used to annotate the sequences, with annotations saved as prodigal_translation_table.ts.
The genomic features are exported as FASTA format in genome_a.fnaprotptein.faa.
The annotation results are stored alongside their hash values in genome_a.fnaprotptein.faa.sha256.
The genomic annotation information is preserved in the tigrfam database as genome_a.fnaprotptein.tigrfam.out.
The protein-coding regions are identified and stored in genome_a.fnaprotptein.tigrfam.out.sha256.
Functional annotation of genomic regions is provided in the tigrfam database as genome_a.fnaprotptein.tigrfam_tophit.tsv.
Annotation results for top hits are stored with their hash values in genome_a.fnaprotptein.tigrfam_tophit.tsv.sha256.

然而,在某些情况下,单独浏览概要文件可能更为高效。该概要文件具体说明了从古细菌122或细菌120标记集中识别出的所有标记。(archaeal 122, or bacterial 120 marker set.)

cat /tmp/gtdbtk/identify/gtdbtk.ar122.markers_summary.tsv
Name count of unique genes count of multiple genes count of multiple unique genes count of missing genes List of unique genes List of multiple genes List of multiple unique genes List of missing genes
The genome_a.fna file contains 1 total entries with the following features:

  • A total number_of_entries: 47
  • A total number_of_unique_entries: 47
  • A total number_of_duplicate_entries: 9
  • A total number_of_unmapped_entries: 7
    The genome_b.fna file contains 48 total entries with the following features:
  • A total number_of_entries: 59
  • A total number_of_unique_entries: 58
  • A total number_of_duplicate_entries: 9
  • A total number_of_unmapped_entries: 7
  • Step 3: Performing sequence alignment (sequence alignment process) on genetic sequences.
    This step is responsible for performing sequence alignment on all identified markers, identifying the most probable functional domain, and generating a concatenated multiple sequence alignment (MSA).
复制代码
    gtdbtk align --identify_dir /tmp/gtdbtk/identify --out_dir /tmp/gtdbtk/align --cpus 2
    
    
      
    
    代码解读

[2023-XX-Xmas XX:XX] INFO: GTDB-Tk vX.Y.Z
[2023-XX-Xmas XX:XX] INFO: gtdbtk align --identify_dir /tmp/gtdbtk/identify --out_dir /tmp/gtdbtk/align --cpus N
[2023-XX-Xmas XX:XX] INFO: Employing GTDB-Tk reference data version rN/A from path /srv/db/gtdbtk/reference/rN/A
[2023-XX-Xmas XX:XX] INFO: Processing genome alignment using N threads
[2023-XX-Xmas XX:XX] INFO: Completed processing of M genomes from the GTDB database
==> Achieved X.XX% completion in Y.YY seconds with Z.ZZ remaining estimated
[Year-Month-Day HH:mm:ss] INFO: Masked columns of archaeal multiple sequence alignment using standard masking technique
==> Successfully masked L.LL% of all alignments with an average rate of P.PP items per second
[Year-Month-Day HH:mm:ss] INFO: Generated concatenated alignment matrix for X.XX archaeal and Y.YY user genomes

  • Results
    应高度重视输出结果, 当某个基因组在本阶段分析中未能检测到足够的标记物时, 系统将触发相应警告。

When specifying a domain, a file prefixed with either ar122 or bac120 will be generated. This file contains alignments for both user genomes and GTDB genomes, or only contains alignments for user genomes (e.g., gtdbtk.ar123.msa.fasta and gtdbtk.ar134.user_msa.fasta).

列出目录 /tmp/gtdbtk/align 中的内容
对 FASTA 格式的基因序列文件 gtdbtk.ar122.user_msa.fasta 进行动态对齐
将处理后的 FASTA 格式比对结果文件 gtdbtk.ar122.filtered.tsv 对应到 log 文件 gtdbtk.log
生成多序列对齐(MSA)后的 FASTA 格式文件 gtdbtk.ar122.msa.fasta 并将相关信息记录到 warnings.log 文件中

  • 第4步:对基因组进行分类(分类)
    该步骤将基因组归纳至参考树中,并确定其最有可能的类别。
复制代码
    gtdbtk classify --genome_dir /tmp/gtdbtk/genomes --align_dir /tmp/gtdbtk/align --out_dir /tmp/gtdbtk/classify -x gz --cpus 2
    
    
      
    
    代码解读

[2020-08-03 11:04:10] INFO: GTDB-Tk v1.3.0
[2020-08-03 11:04:10] INFO: gtdbtk classify --genome_dir /tmp/gtdbtk/genomes --align_dir /tmp/gtdbtk/align --out_dir /tmp/gtdbtk/classify -x gz --cpus 2
[2020-08-03 11:04:10] INFO: Using GTDB-Tk reference data version r95: /srv/db/gtdbtk/official/release95
[2020-08-03 11:04:10] INFO: Placing 2 archaeal genomes into reference tree with pplacer using 2 cpus (be patient).
==> Step 9 of 9: placing genome 2 of 2 (100.00%).
[2020-08-03 11:05:22] INFO: pplacer version: v1.1.alpha19-0-g807f6f3
[2020-08-03 11:05:22] INFO: Calculating RED values based on reference tree.
[2020-08-03 11:05:22] INFO: Calculating average nucleotide identity using FastANI.
[2020-08-03 11:05:22] INFO: fastANI version: 1.31
==> Processed 4/4 (100%) comparisons [ 2.49it/s, ETA 00:00]
[2020-08-03 11:05:24] INFO: 2 genome(s) have been classified using FastANI and pplacer.
[2020-08-03 11:05:24] INFO: Done.

Results
Two main files are generated (based on their domain) comprising the summary file and the reference tree which contains those genomes (gtdbtk.ar122.summary.tsv and gtdbtk.ar122.classify.tree respectively). The classification of genomes is reflected in the summary file.

列举该目录下的所有文件及子目录。
通过指定路径访问该临时存储目录。
执行分类处理任务以生成结构化的分类结果文件。
分析报告已保存至gtdbtk项目第122次分析的总结表。 处理错误并记录至错误日志记录。 输出分类树状结构至gtdbtk项目第122次分析的分类结果。
记录最终日志至日志记录。

batchfile输入基因组路径及gzip基因组

基因组的位置也可以通过使用 --batchfile标志的批处理文件来指定。

  • 基因组应遵循FASTA格式(可接受扩展名为.gzip的gzip格式文件)。
  • batchfile文件应包含两个字段:第一个字段用于标明每个基因组的路径信息,第二个字段则用于标识所需基因组特征(需与Newick兼容的字母数字字符串),这些字段值将分别以newick和summary ID的形式体现出来。
  • 这些字段需使用制表符进行严格分隔。
复制代码
    # 输入基因组路径,第一列是基因组路径,第二列是基因组名称,便于在后面结果显示,一二列必选;第三列是translation table可选
    (gtdbtk-2.1.1) [yut@io02 TestData]$ jobs
    [1]+  运行中               nohup time gtdbtk classify_wf --batchfile test_genome.path --out_dir gtdb_out --cpus 10 &>gtdb.log &
    
    (gtdbtk-2.1.1) [yut@io02 TestData]$ head test_genome.path
    /Database/GEM_DataSets/fna/OTU-1.fna.gz  OTU-1
    /datanode02/yut/Software/TestData/COVID19.fa    COVID19
    /datanode02/yut/Software/TestData/DRR331386_ma_bin.12.fa        DRR331386_ma_bin.12
    /datanode02/yut/Software/TestData/DRR331386_ma_bin.18.fa        DRR331386_ma_bin.18
    
    
      
      
      
      
      
      
      
      
      
    
    代码解读

批量dRep以及GTDB注释

复制代码
     awk 'NR==1{print "method,"$0}$1!~/genome/{split($1,a,"_");print a[2]","$0}' GTDB_for_matbat2_maxbin2/dRep_out/data_tables/genomeInfo.csv  GTDB_for_vamb/dRep_out/data_tables/genomeInfo.csv >Stats/vamb_maxbin2_metabat2_all_bins_genomeInfo.csv
     awk -F, 'NR==1{print}$3>=50 && $4<=5 || ($3-5*$4)>=50' vamb_maxbin2_metabat2_all_bins_genomeInfo.csv >vamb_maxbin2_metabat2_Hight_quality_bins_genomeInfo.csv ```
    # 注意
    [FAQ](https://ecogenomics.github.io/GTDBTk/faq.html)
    * Validating species assignments with average nucleotide identity
    GTDB-Tk uses FastANI to estimate the ANI between genomes. We recommend you have FastANI >= 1.32 as this version introduces a fix that makes the results deterministic. A query genome is only classified as belonging to the same species as a reference genome if the ANI between the genomes is within the species ANI circumscription radius **(typically, 95%) and the alignment fraction (AF) is >=0.65**. In some circumstances, the phylogenetic placement of a query genome may not support the species assignment. GTDB r89 strictly uses ANI to circumscribe species and GTDB-Tk follows this methodology. The species-specific ANI circumscription radii are available from the GTDB website.
    
    # 报错
    ## 由于基因组存在重复id导致Pfam报错
    Pfam Search step is stopping with the following error ( Sequence identifiers must be unique. Your fasta file contains two sequences with the same id (Ga0456356_000001_1) ).
    Looking at your input file , it seems that every header and sequence is duplicated in the file ( This assembly file has 2 copies of the original assembly).
    ```bash
    grep '>' 3300049427_3.fna | sort -n | uniq -c | sort -n
      2 >Ga0456356_000001
      2 >Ga0456356_000002
      2 >Ga0456356_000003
    ......
    
    
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
    代码解读

解决:

复制代码
    for i in *fa;do printf "${i}\t";grep -F ">" $i|sort |uniq -d ;echo;done |awk 'NF==2' >../all_err_id.stats  &
    
    
      
    
    代码解读

查看已经安装的数据库

复制代码
    (gtdbtk) [yutao@myosin release202]$ gtdbtk check_install  # Checking integrity of reference package:即为数据库路径
    # 物种名和GCA号对应表
    (gtdbtk) [yutao@myosin release202]$ ls $PWD/taxonomy/gtdb_taxonomy.tsv
    /share/pasteur/yutao/DATABASE/GTDB_R202/release202/taxonomy/gtdb_taxonomy.tsv
    (gtdbtk) [yutao@myosin release202]$ head -n3 $PWD/taxonomy/gtdb_taxonomy.tsv
    RS_GCF_000970205.1      d__Archaea;p__Halobacteriota;c__Methanosarcinia;o__Methanosarcinales;f__Methanosarcinaceae;g__Methanosarcina;s__Methanosarcina mazei
    RS_GCF_000012285.1      d__Archaea;p__Thermoproteota;c__Thermoproteia;o__Sulfolobales;f__Sulfolobaceae;g__Sulfolobus;s__Sulfolobus acidocaldarius
    GB_GCA_003162175.1      d__Archaea;p__Halobacteriota;c__Bog-38;o__Bog-38;f__Bog-38;g__Bog-38;s__Bog-38 sp003162175     
    
    # 所有reference的位置
    (gtdbtk) [yutao@myosin Two_new_classes]$ find /share/pasteur/yutao/DATABASE/GTDB_R202/release202/fastani/database/ -type f -iname "*.gz" >/share/pasteur/yutao/DATABASE/GTDB_R202/release202/47894_GTDB_GCA_path.tsv                                    
    
    
      
      
      
      
      
      
      
      
      
      
      
    
    代码解读
  • 例如这里提取所有Eisenbacteria和Krumholzibacteriota门的基因组
复制代码
    (gtdbtk) [yutao@myosin Two_new_classes]$ awk '$2~/p__Eisenbacteria/||$2~/p__Krumholzibacteriota/' /share/pasteur/yutao/DATABASE/GTDB_R202/release202/taxonomy/gtdb_taxonomy.tsv >31gtdb.tsv
    # 提取基因组
    (reverse-i-search)`fin': find /share/pasteur/yutao/DATABASE/GTDB_R202/release202/fastani/database/ -type f -iname "*.gz" >/share/pasteur/yutao/DATABASE/GTDB_R202/release202/47894_GTDB_GCA_path.tsv
    (reverse-i-search)`grep': grep -f 15Krumholzibacteriota_gca.id /share/pasteur/yutao/DATABASE/GTDB_R202/release202/47894_GTDB_GCA_path.tsv|while read i;do ln -s $i Krumholzibacteriota/;done
    
    
    
      
      
      
      
      
    
    代码解读

iTOL可视化问题

gtdb生成的新ick格式的树直接导入iTOL可能导致丢失叶子节点的问题, 是因为叶子的名称包含特定newick使用的标志字符

gtdb-tk导出生成的.tree文件即为nwk格式。
但其中一些标注包含";"符号,在nwk中该符号被视为进化树结束标志。
因此,在处理时需将这些带有分号的内容进行特殊处理。
具体来说:

  1. 可删除单引号内用于标记clade的数值与冒号。

  2. 使用notepad++进行正则表达式替换如下:

‘[0,1].([0-9]{0,5}):替换为’

或者通过以下方式执行:
一是删除所有涉及bootstrap的内容;
二是清理inter node节点。

pplacer只是插入树的问题

gtdb中的序列主要集中在clade的根部位置。是典型的pplacer插入操作所产生的结果。该结果仅作为参考使用。必须全部重新构建这些树结构。即使处理超过2000个序列的数据量时也不会出现显著性能问题。

在这里插入图片描述

时间

由于需对每一个基因组进行比对以确定其在数据库中的进化位置, 因此并非呈线性相关关系; 即使仅有两个样本数据也需要耗时达七小时, 而当数量增加至两万时时间控制在15至20小时之间

问题

  • 因为基因组中存在重复的ID而导致Pfam出现错误
    在执行 Pfam 搜索时遇到了错误(序列标识符必须唯一。您的 FASTA 文件包含两个具有相同 ID(Ga0456356_000001_1)的序列)
    查看输入文件时发现每个头目和序列都重复了(这个合成都有两个原始版本)
复制代码
    grep '>' 3300049427_3.fna | sort -n | uniq -c | sort -n
      2 >Ga0456356_000001
      2 >Ga0456356_000002
      2 >Ga0456356_000003
    ......
    
    
      
      
      
      
      
    
    代码解读

解决:

复制代码
    for i in *fa;do printf "${i}\t";grep -F ">" $i|sort |uniq -d ;echo;done |awk 'NR==2' >../all_err_id.stats  &
    
    
      
    
    代码解读

参考

gtdbtk

全部评论 (0)

还没有任何评论哟~