
Extracting gene sequences with Python: batch-downloading gene sequences with Biopython


To build your own BLAST database, you first need to download every gene sequence you want in it. Biopython makes this quick.

1. First, use the R package "org.Hs.eg.db" to convert your gene symbols to RefSeq accession IDs.

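library(org.Hs.eg.db)
# Gene symbols to convert
symbol = c("PDCD1", "CD274", "IL4", "IL7")
# Map each symbol to its first RefSeq accession
accession = mapIds(org.Hs.eg.db, keys = symbol, column = "REFSEQ", keytype = "SYMBOL", multiVals = "first")
accession = as.matrix(accession)
# Write one "symbol<TAB>accession" row per gene
write.table(accession, "accession.txt", row.names = T, col.names = F, quote = F, sep = "\t")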

The resulting accession.txt looks like this:

PDCD1	NM_005018
CD274	NM_001267706
IL4	NM_000589
IL7	NM_000880
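When a symbol has no RefSeq mapping, `mapIds` returns `NA`, and `write.table` writes it out as the literal text NA. The download step would then try to fetch "NA" as an accession. The sketch below is one way to read the file defensively. The function name `read_accessions` and the sample row `FAKE1` are hypothetical and only illustrate the idea.

```python
def read_accessions(lines):
    """Return {symbol: accession} from 'symbol<TAB>accession' lines.

    Rows that are blank, have no second column, or whose accession
    is the literal string "NA" are skipped.
    """
    mapping = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # blank or malformed row
        symbol, accession = fields[0].strip(), fields[1].strip()
        if not accession or accession == "NA":
            continue  # symbol had no RefSeq mapping
        mapping[symbol] = accession
    return mapping


# Sample rows; "FAKE1" is a made-up symbol standing in for an unmapped gene.
sample = ["PDCD1\tNM_005018\n", "CD274\tNM_001267706\n", "FAKE1\tNA\n", "\n"]
accessions = read_accessions(sample)
print(accessions)
```

With a real file, the same function can be called as `read_accessions(open("accession.txt"))`.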

2. Download the sequences with Biopython.

from Bio import Entrez
from Bio import SeqIO

file_in_name = "accession.txt"
file_out_name = "result.fasta"
Entrez.email = 'xxxx@xx.com'  # your email address (NCBI asks for one)

input_file = open(file_in_name, "r")
output_file = open(file_out_name, "a")  # append mode: rerunning adds to the existing file
for record_id in input_file:
    # The accession is the second tab-separated column
    record_id = record_id.strip().split("\t")[1]
    # Fetch the GenBank record for this accession
    result_handle = Entrez.efetch(db="nucleotide", rettype="gb", retmode="text", id=record_id)
    seqRecord = SeqIO.read(result_handle, format='gb')
    result_handle.close()
    # Write the record out in FASTA format
    output_file.write(seqRecord.format('fasta'))
output_file.close()
input_file.close()
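The loop above issues one request per gene. For longer gene lists, a possible variation is to request several accessions per `Entrez.efetch` call and ask for FASTA directly, which reduces the number of requests sent to NCBI. The sketch below assumes that approach. The helper names `chunked` and `fetch_fasta` and the default batch size of 100 are illustrative choices, not part of the original script.

```python
def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_fasta(accessions, email, batch_size=100):
    """Return FASTA text for all accessions, one efetch call per batch.

    batch_size=100 is an arbitrary illustrative default.
    """
    # Imported here so that chunked() can be used without Biopython installed.
    from Bio import Entrez

    Entrez.email = email
    parts = []
    for batch in chunked(accessions, batch_size):
        # efetch accepts a comma-separated list of IDs
        handle = Entrez.efetch(db="nucleotide", id=",".join(batch),
                               rettype="fasta", retmode="text")
        try:
            parts.append(handle.read())
        finally:
            handle.close()
    return "".join(parts)


# Show how four accessions split into batches of three (no network access).
batches = list(chunked(["NM_005018", "NM_001267706", "NM_000589", "NM_000880"], 3))
print(batches)
```

A call such as `fetch_fasta(ids, "xxxx@xx.com")` returns text that can be written straight to result.fasta. Requesting FASTA skips the GenBank parsing step, so it fits only when the annotations are not needed.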

The resulting result.fasta (sequences truncated here):

>NM_005018.3 Homo sapiens programmed cell death 1 (PDCD1), mRNA
GCTCACCTCCGCCTGAGCAGTGGAGAAGGCGGCACTCTGGTGGGGCTGCTCCAGGCATGCAGATCCCACAGGCGCCCTGGCCAGTCGTCTGGGCGGTGCTACAACTGGGCTGGCGGCCAG....

>NM_001267706.1 Homo sapiens CD274 molecule (CD274), transcript variant 2, mRNA
GGCGCAACGCTGAGCAGCTGGCGCGTCCCGCGCGGCCCCAGTTCTGCGCAGCTTCCCGAGGCTCCGCACCAGCCGCGCTTCTGTCCGCCTGCAGGGCATTCCAGAAAGATGAGGATATTT...

>NM_000589.4 Homo sapiens interleukin 4 (IL4), transcript variant 1, mRNA
ATCGTTAGCTTCTCCTGATAAACTAATTGCCTCACATTGTCACTGCAAATCGACACCTATTAATGGGTCTCACCTCCCAACTGCTTCCCCCTCTGTTCTTCCTGCTAGCATGTGCCGGCA...

>NM_000880.4 Homo sapiens interleukin 7 (IL7), transcript variant 1, mRNA
ACACTTGTGGCTTCCGTGCACACATTAACAACTCATGGTTCTAGCTCCCAGTCGCCAAGCGTTGCCAAGGCGTTGAGAGATCATCTGGGAAGTCTTTTACCCAGAATTGCTTTGATTCAG...
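Before building the BLAST database, it can be worth confirming that every requested accession actually made it into the FASTA file. The returned record IDs carry a version suffix (for example `.3`) that the accession list lacks, so the sketch below strips it before comparing. The helper names `fasta_ids` and `missing_accessions` are hypothetical, and the sample sequences are cut to ten bases for brevity.

```python
def fasta_ids(lines):
    """Return the set of record IDs from FASTA header lines, version suffix removed."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                # "NM_005018.3" -> "NM_005018"
                ids.add(fields[0].split(".")[0])
    return ids


def missing_accessions(requested, lines):
    """Return the requested accessions that have no record in the FASTA lines."""
    found = fasta_ids(lines)
    return [acc for acc in requested if acc not in found]


# Shortened sample records; a real run would pass open("result.fasta").
sample_fasta = [
    ">NM_005018.3 Homo sapiens programmed cell death 1 (PDCD1), mRNA\n",
    "GCTCACCTCC\n",
    ">NM_000589.4 Homo sapiens interleukin 4 (IL4), transcript variant 1, mRNA\n",
    "ATCGTTAGCT\n",
]
missing = missing_accessions(["NM_005018", "NM_000589", "NM_000880"], sample_fasta)
print(missing)
```

An empty list means every requested accession has a matching record.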

Done.
