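Python gene sequence extraction: batch-downloading gene sequences with Biopython
To build a custom BLAST database, you first need to download the sequences of every gene you want to include. Biopython makes this quick.
1. First, use "org.Hs.eg.db" in R to convert your gene symbols to RefSeq accession IDs.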
library(org.Hs.eg.db)
symbol = c("PDCD1", "CD274", "IL4", "IL7")
# Map each gene symbol to its first RefSeq accession
accession = mapIds(org.Hs.eg.db, keys = symbol, column = "REFSEQ", keytype = "SYMBOL", multiVals = "first")
accession = as.matrix(accession)
# Write "symbol<TAB>accession" pairs, one per line
write.table(accession, "accession.txt", row.names = T, col.names = F, quote = F, sep = "\t")
The resulting accession.txt looks like this:
PDCD1	NM_005018
CD274	NM_001267706
IL4	NM_000589
IL7	NM_000880
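Before downloading anything, it can help to load this file in Python and drop symbols that had no RefSeq match, which mapIds typically reports as NA. The following is a minimal sketch, assuming the two-column tab-separated layout shown above; the function name read_accessions is made up for illustration.

```python
import io


def read_accessions(handle):
    """Parse 'SYMBOL<TAB>ACCESSION' lines into a dict, skipping blanks and NA."""
    mapping = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # blank or malformed line
        symbol, accession = fields[0].strip(), fields[1].strip()
        if not accession or accession == "NA":
            continue  # symbol without a RefSeq match
        mapping[symbol] = accession
    return mapping


# In-memory stand-in with the same content as accession.txt
example = io.StringIO(
    "PDCD1\tNM_005018\nCD274\tNM_001267706\nIL4\tNM_000589\nIL7\tNM_000880\n"
)
accessions = read_accessions(example)
print(accessions)
```

With a real file, the same function could be called as read_accessions(open("accession.txt")).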
2. Download the sequences with Biopython.
from Bio import Entrez
from Bio import SeqIO

file_in_name = "accession.txt"
file_out_name = "result.fasta"
Entrez.email = 'xxxx@xx.com'  # your email address

input_file = open(file_in_name, "r")
output_file = open(file_out_name, "a")  # append mode: rerunning the script adds duplicate records
for record_id in input_file:
    # The second tab-separated column holds the RefSeq accession
    record_id = record_id.strip().split("\t")[1]
    result_handle = Entrez.efetch(db="nucleotide", rettype="gb", id=record_id)
    seqRecord = SeqIO.read(result_handle, format='gb')
    result_handle.close()
    output_file.write(seqRecord.format('fasta'))
output_file.close()
input_file.close()
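The loop above sends one request per accession. For longer gene lists, one option is to group the IDs into batches, since Entrez.efetch accepts several comma-separated IDs in a single call, and to pause between requests to stay within NCBI's rate limit (about three requests per second without an API key). The sketch below shows only the batching logic. The names batches, fetch_all and fake_fetch are hypothetical, and the download itself is passed in as a callable so the code can be tried without network access.

```python
import time


def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_all(ids, fetch, batch_size=50, pause=0.4):
    """Call `fetch` once per batch of ids and concatenate the returned text.

    `fetch` is any callable that takes a comma-separated id string and
    returns FASTA text. It is injected here (an assumption of this sketch)
    so the batching can be exercised offline.
    """
    chunks = []
    for i, batch in enumerate(batches(ids, batch_size)):
        if i > 0:
            time.sleep(pause)  # wait between requests to respect the rate limit
        chunks.append(fetch(",".join(batch)))
    return "".join(chunks)


def fake_fetch(id_string):
    # Stand-in for the real download: one dummy record per id
    return "".join(f">{acc}\nACGT\n" for acc in id_string.split(","))


ids = ["NM_005018", "NM_001267706", "NM_000589", "NM_000880"]
fasta_text = fetch_all(ids, fake_fetch, batch_size=3, pause=0)
print(fasta_text)
```

With Biopython, the real callable could be something like lambda s: Entrez.efetch(db="nucleotide", id=s, rettype="fasta", retmode="text").read(), which returns FASTA text directly and skips the GenBank-to-FASTA conversion.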
The resulting result.fasta (sequences truncated here):
>NM_005018.3 Homo sapiens programmed cell death 1 (PDCD1), mRNA
GCTCACCTCCGCCTGAGCAGTGGAGAAGGCGGCACTCTGGTGGGGCTGCTCCAGGCATGCAGATCCCACAGGCGCCCTGGCCAGTCGTCTGGGCGGTGCTACAACTGGGCTGGCGGCCAG....
>NM_001267706.1 Homo sapiens CD274 molecule (CD274), transcript variant 2, mRNA
GGCGCAACGCTGAGCAGCTGGCGCGTCCCGCGCGGCCCCAGTTCTGCGCAGCTTCCCGAGGCTCCGCACCAGCCGCGCTTCTGTCCGCCTGCAGGGCATTCCAGAAAGATGAGGATATTT...
>NM_000589.4 Homo sapiens interleukin 4 (IL4), transcript variant 1, mRNA
ATCGTTAGCTTCTCCTGATAAACTAATTGCCTCACATTGTCACTGCAAATCGACACCTATTAATGGGTCTCACCTCCCAACTGCTTCCCCCTCTGTTCTTCCTGCTAGCATGTGCCGGCA...
>NM_000880.4 Homo sapiens interleukin 7 (IL7), transcript variant 1, mRNA
ACACTTGTGGCTTCCGTGCACACATTAACAACTCATGGTTCTAGCTCCCAGTCGCCAAGCGTTGCCAAGGCGTTGAGAGATCATCTGGGAAGTCTTTTACCCAGAATTGCTTTGATTCAG...
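A quick sanity check is to confirm that every requested accession actually made it into the output. In FASTA, each record starts with a ">" header line whose first word is the record ID, and the downloaded IDs carry a version suffix (for example ".3") that the input list lacks. The sketch below compares the two while ignoring that suffix. The helper names and the shortened sample text are illustrative assumptions, not part of the original script.

```python
def fasta_ids(text):
    """Return the record ids (first word of each '>' header line)."""
    return [
        line[1:].split()[0]
        for line in text.splitlines()
        if line.startswith(">") and line[1:].strip()
    ]


def missing_accessions(expected, text):
    """List expected accessions absent from the FASTA text, ignoring versions."""
    found = {record_id.split(".")[0] for record_id in fasta_ids(text)}
    return [acc for acc in expected if acc.split(".")[0] not in found]


# Shortened sample: two of the four records, sequences cut to a few bases
sample = (
    ">NM_005018.3 Homo sapiens programmed cell death 1 (PDCD1), mRNA\n"
    "GCTCACCTCC\n"
    ">NM_000589.4 Homo sapiens interleukin 4 (IL4), transcript variant 1, mRNA\n"
    "ATCGTTAGCT\n"
)
expected = ["NM_005018", "NM_001267706", "NM_000589", "NM_000880"]
missing = missing_accessions(expected, sample)
print(missing)
```

On a real run, the text would come from open("result.fasta").read(), and an empty list would mean every accession was downloaded.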
Done.
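With the FASTA file in hand, the BLAST database mentioned at the start can be built with the makeblastdb tool from NCBI BLAST+. The following is a hedged sketch of how that step might be driven from Python. It assumes makeblastdb is installed and on PATH, and the database name my_genes is a placeholder chosen for illustration.

```python
import shutil
import subprocess


def makeblastdb_command(fasta_path, db_name):
    """Build the argument list for a nucleotide BLAST database."""
    return ["makeblastdb", "-in", fasta_path, "-dbtype", "nucl", "-out", db_name]


def build_database(fasta_path, db_name):
    """Run makeblastdb if it is available; raise a clear error otherwise."""
    if shutil.which("makeblastdb") is None:
        raise RuntimeError("makeblastdb not found; install NCBI BLAST+ first")
    subprocess.run(makeblastdb_command(fasta_path, db_name), check=True)


# Only the command is assembled here; call build_database() to actually run it
cmd = makeblastdb_command("result.fasta", "my_genes")
print(" ".join(cmd))
```

Keeping the command construction separate from the subprocess call makes it easy to inspect or log the exact command before running it.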
