
Predicting Disease with Python: Using Machine Learning to Predict Diseases from Symptoms


I. Programming Environment

Win10

Python3.6

Jupyter Notebook

Graphviz (for an introduction and installation instructions, see https://www.jianshu.com/p/b559dc689b7f)

II. Data Source

III. Cleaning the Data

1 Put each disease and its symptoms into a dictionary, where the key is the disease and the value is the list of its symptoms.

Note that some disease and symptom fields contain the special character '^'. Replace it with '_' first, then split on '_'.

import csv
from collections import defaultdict

def return_list(disease):
    # Normalise '^' to '_', split on '_', and keep every second token.
    disease_list = []
    match = disease.replace('^', '_').split('_')
    ctr = 1
    for group in match:
        if ctr % 2 == 0:
            disease_list.append(group)
        ctr = ctr + 1
    return disease_list

with open("Scraped-Data/dataset_uncleaned.csv") as csvfile:
    reader = csv.reader(csvfile)
    disease = ""
    weight = 0
    disease_list = []
    dict_wt = {}
    dict_ = defaultdict(list)
    for row in reader:
        # A non-empty first column starts a new disease; "\xc2\xa0" is a
        # non-breaking space as it appears under a single-byte decoding.
        if row[0] != "\xc2\xa0" and row[0] != "":
            disease = row[0]
            disease_list = return_list(disease)
            weight = row[1]
        if row[2] != "\xc2\xa0" and row[2] != "":
            symptom_list = return_list(row[2])
            for d in disease_list:
                for s in symptom_list:
                    dict_[d].append(s)
                dict_wt[d] = weight

print(dict_)
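To make the splitting rule concrete, here is a minimal sketch of the same idea applied to a single field. The sample string is invented for illustration: it assumes a raw field alternates a code and a name separated by '_', with '^' joining multiple entries. The actual layout of dataset_uncleaned.csv may differ, and `extract_names` is a hypothetical helper name.

```python
def extract_names(field):
    """Return every second token after normalising '^' to '_'."""
    tokens = field.replace("^", "_").split("_")
    # Keep the 2nd, 4th, ... tokens, mirroring the ctr % 2 == 0 test.
    return tokens[1::2]


# Hypothetical field, not taken from the real dataset.
sample = "CODE1_disease a^CODE2_disease b"
names = extract_names(sample)
print(names)
```

The slice `tokens[1::2]` does the same job as the counter loop, which may be easier to read once the alternating layout is understood.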

2 Write the disease, symptom, and sample count to dataset_clean.csv. Note that each disease has one sample count and multiple symptoms.

with open("Scraped-Data/dataset_clean.csv", "w") as csvfile:
    writer = csv.writer(csvfile)
    for key, values in dict_.items():
        for v in values:
            # One row per (disease, symptom) pair, plus the disease's sample count.
            writer.writerow([key, v, dict_wt[key]])

Note that the resulting CSV has an empty line after every data row. There is no need to fix this now; the next step takes care of it.
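These empty lines are typical on Windows when a CSV file is written through a handle opened in text mode without `newline=""`: `csv.writer` ends each row with "\r\n", and text mode then expands the "\n" again. As an alternative to cleaning up afterwards, the sketch below avoids them at write time. The temporary file and the sample rows are invented for illustration.

```python
import csv
import os
import tempfile

rows = [["flu", "fever", "10"], ["flu", "cough", "10"]]  # invented sample rows

path = os.path.join(tempfile.mkdtemp(), "demo.csv")
# newline="" disables newline translation, so the "\r\n" written by
# csv.writer reaches the file unchanged on every platform.
with open(path, "w", newline="") as f:
    csv.writer(f).writerows(rows)

with open(path, "rb") as f:
    raw = f.read()
print(raw)
```

Reading the file back in binary mode shows one "\r\n" per row and nothing in between.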

3 Add column headers to each column of dataset_clean.csv.

import pandas as pd

columns = ['Source', 'Target', 'Weight']
data = pd.read_csv("Scraped-Data/dataset_clean.csv", names=columns, encoding="ISO-8859-1")
data.head()
data.to_csv("Scraped-Data/dataset_clean.csv", index=False)

At this point the empty lines after each row are gone, because pd.read_csv skips blank lines by default.
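The effect of this round trip can be reproduced in memory. The following is a small sketch with invented CSV text. It relies on `read_csv` skipping blank lines by default (`skip_blank_lines=True`) and on `names=` supplying the header.

```python
import io

import pandas as pd

# Invented CSV text with an empty line after each data row.
text = "flu,fever,10\n\nflu,cough,10\n\n"

demo = pd.read_csv(io.StringIO(text), names=["Source", "Target", "Weight"])
cleaned = demo.to_csv(index=False)
print(cleaned)
```

The re-exported text starts with the header line and contains only the two data rows.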

4 Label the data and save it to nodetable.csv

The data has three columns. The first column, Id, is a disease name or a symptom name. The second column, Label, is identical to Id. The third column, Attribute, states whether that Id or Label is a disease or a symptom.

slist = []
dlist = []
with open("Scraped-Data/nodetable.csv", "w") as csvfile:
    writer = csv.writer(csvfile)
    for key, values in dict_.items():
        for v in values:
            # Write each symptom only once.
            if v not in slist:
                writer.writerow([v, v, "symptom"])
                slist.append(v)
        # Write each disease only once.
        if key not in dlist:
            writer.writerow([key, key, "disease"])
            dlist.append(key)

nt_columns = ['Id', 'Label', 'Attribute']
nt_data = pd.read_csv("Scraped-Data/nodetable.csv", names=nt_columns, encoding="ISO-8859-1")
nt_data.head()
nt_data.to_csv("Scraped-Data/nodetable.csv", index=False)
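The node table logic can also be sketched on a tiny invented mapping, using a set for the "already written" check and writing the header directly. This is an illustrative variant under those assumptions, not the article's code. The `mapping` dictionary stands in for dict_.

```python
import csv
import io

# Invented mapping in the same shape as dict_: disease -> list of symptoms.
mapping = {"flu": ["fever", "cough"], "cold": ["cough", "sneezing"]}

seen = set()
rows = []
for disease, symptoms in mapping.items():
    nodes = [(s, "symptom") for s in symptoms] + [(disease, "disease")]
    for name, kind in nodes:
        if (name, kind) not in seen:  # emit each node once per kind
            seen.add((name, kind))
            rows.append([name, name, kind])

buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
writer.writerow(["Id", "Label", "Attribute"])  # header written up front
writer.writerows(rows)
print(buf.getvalue())
```

Keying the set on (name, kind) keeps the original behaviour in which a name may appear once as a symptom and once as a disease.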

IV. Analyzing the Cleaned Data

data = pd.read_csv("Scraped-Data/dataset_clean.csv", encoding="ISO-8859-1")
len(data['Source'].unique())   # number of distinct diseases
len(data['Target'].unique())   # number of distinct symptoms
df = pd.DataFrame(data)
df_1 = pd.get_dummies(df.Target)   # one indicator column per symptom
df_1
df
df_s = df['Source']
df_pivoted = pd.concat([df_s, df_1], axis=1)
df_pivoted.drop_duplicates(keep='first', inplace=True)
df_pivoted
len(df_pivoted)
cols = df_pivoted.columns[1:]   # symptom columns only; column 0 is 'Source'
print(cols)
df_pivoted = df_pivoted.groupby('Source').sum()   # one row per disease
df_pivoted = df_pivoted.reset_index()
df_pivoted
len(df_pivoted)
df_pivoted.to_csv("Scraped-Data/df_pivoted.csv")

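This code mainly explores the data, for example how many distinct diseases and how many distinct symptoms there are. For each disease, every symptom it has is marked 1 and every symptom it does not have is marked 0. The merged result is saved to df_pivoted.csv.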
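The 0/1 encoding described above can be seen on a handful of rows. This is a minimal sketch on invented disease/symptom pairs; `dtype=int` is passed so the indicator columns are integers regardless of the pandas version.

```python
import pandas as pd

# Invented disease/symptom pairs in the Source/Target layout.
pairs = pd.DataFrame({
    "Source": ["flu", "flu", "cold", "cold"],
    "Target": ["fever", "cough", "cough", "sneezing"],
})

dummies = pd.get_dummies(pairs["Target"], dtype=int)  # one column per symptom
wide = pd.concat([pairs["Source"], dummies], axis=1)
wide = wide.groupby("Source").sum().reset_index()     # one row per disease
print(wide)
```

Each disease ends up as a single row whose columns say which symptoms it has.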

V. Training a Model with Naive Bayes

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.naive_bayes import MultinomialNB
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection.
from sklearn.model_selection import train_test_split

x = df_pivoted[cols]
y = df_pivoted['Source']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)
mnb = MultinomialNB()
mnb = mnb.fit(x_train, y_train)
mnb.score(x_test, y_test)

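The score is 0, which means the model has no predictive power.

The reason is that the 149 rows correspond to 149 distinct diseases, one row each. The diseases in the held-out third therefore never appear in the training set, and the algorithm cannot predict a disease it has never seen.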
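The same effect can be reproduced on synthetic data. In the sketch below, the feature matrix and the label names are invented; the only property that matters is that every label occurs exactly once, as in the disease table.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(12, 8))                    # invented 0/1 symptom matrix
y = np.array(["disease_%d" % i for i in range(12)])   # each label appears once

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
model = MultinomialNB().fit(X_train, y_train)

# No test label was seen during training, so no prediction can be correct.
overlap = set(y_train) & set(y_test)
score = model.score(X_test, y_test)
print(overlap, score)
```

Because the classifier can only output labels it was trained on, the score is 0 no matter how informative the features are.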

Instead, train on all of the data and predict on all of the data:

mnb_tot = MultinomialNB()
mnb_tot = mnb_tot.fit(x, y)
mnb_tot.score(x, y)

The score is 0.8993288590604027.

Print the diseases that were predicted incorrectly:

disease_pred = mnb_tot.predict(x)
disease_real = y.values
for i in range(0, len(disease_real)):
    if disease_pred[i] != disease_real[i]:
        print('Pred: {0} Actual:{1}'.format(disease_pred[i].ljust(30), disease_real[i]))

Output:

Pred: HIV Actual:acquired immuno-deficiency syndrome
Pred: biliary calculus Actual:cholelithiasis
Pred: coronary arteriosclerosis Actual:coronary heart disease
Pred: depression mental Actual:depressive disorder
Pred: HIV Actual:hiv infections
Pred: carcinoma breast Actual:malignant neoplasm of breast
Pred: carcinoma of lung Actual:malignant neoplasm of lung
Pred: carcinoma prostate Actual:malignant neoplasm of prostate
Pred: carcinoma colon Actual:malignant tumor of colon
Pred: candidiasis Actual:oralcandidiasis
Pred: effusion pericardial Actual:pericardial effusion body substance
Pred: malignant neoplasms Actual:primary malignant neoplasm
Pred: sepsis (invertebrate) Actual:septicemia
Pred: sepsis (invertebrate) Actual:systemic infection
Pred: tonic-clonic epilepsy Actual:tonic-clonic seizures

VI. Training a Model with a Decision Tree

from sklearn.tree import DecisionTreeClassifier, export_graphviz

dt = DecisionTreeClassifier()
clf_dt = dt.fit(x, y)
print("Accuracy: ", clf_dt.score(x, y))

The score is 0.8993288590604027, the same as the result obtained with naive Bayes above.
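A fully grown decision tree normally fits its own training data perfectly, so a training score below 1 suggests that some diseases share exactly the same symptom row and cannot be told apart. This is an inference, not something checked in the original article. The sketch below shows how such clashes could be found, using an invented table in place of df_pivoted.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Invented table: 'hiv' and 'aids' share exactly the same symptom row.
toy = pd.DataFrame({
    "Source": ["hiv", "aids", "flu", "cold"],
    "fever": [1, 1, 1, 0],
    "cough": [0, 0, 1, 1],
})
features = toy.drop(columns=["Source"])

# Rows whose feature values also occur under another label.
clashes = toy[features.duplicated(keep=False)]
print(clashes["Source"].tolist())

tree = DecisionTreeClassifier(random_state=0).fit(features, toy["Source"])
tree_score = tree.score(features, toy["Source"])
print(tree_score)
```

With two labels on one identical row, the tree can get at most one of them right, which caps the training score below 1.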

Next, visualize the node layout of the decision tree.

1 Generate tree.dot

from sklearn.tree import export_graphviz

# The DOT-files directory must already exist.
export_graphviz(dt,
                out_file='DOT-files/tree.dot',
                feature_names=cols)

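The generated tree.dot file appears in the DOT-files directory under the project directory.

Open a cmd terminal, change to the directory that contains tree.dot (that is, DOT-files/), and run:

dot -Tpng tree.dot -o ..\tree.png

This produces tree.png.

If tree.dot is too large, however, dot may fail with an out-of-memory error:

dot: failure to create cairo surface: out of memory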
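One thing that may help with an oversized tree.dot is exporting only the top levels of the tree through the `max_depth` parameter of `export_graphviz`. The sketch below uses the Iris dataset purely as a stand-in for the disease table and is not guaranteed to resolve the memory error in every case.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Iris stands in for the disease table here, purely for illustration.
iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# out_file=None returns the DOT source as a string; max_depth limits how
# many levels of the fitted tree are written out.
dot_full = export_graphviz(clf, out_file=None,
                           feature_names=iris.feature_names)
dot_small = export_graphviz(clf, out_file=None,
                            feature_names=iris.feature_names, max_depth=2)
print(len(dot_full), len(dot_small))
```

The truncated export describes fewer nodes, so the image dot has to render is smaller, at the cost of hiding the deeper branches.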

2 Display tree.png in Jupyter Notebook

from IPython.display import Image

Image(filename='tree.png')

VII. Copyright Notice

Author: 海天一树X

Link: https://www.jianshu.com/p/882ee4db4e40
