Bayesian Networks
I. Bayesian Networks
1. The Bayes algorithm: a supervised learning algorithm for classification problems, such as news classification, comment classification, email classification, customer churn prediction, investment screening, and credit rating; it handles both binary and multi-class problems;
2. Bayesian reasoning: when we update our initial belief about something with objective new information, we obtain a new, improved belief;
3. Classical statistics vs. Bayesian statistics:
- Classical statistics: sampling information = population information + sample information;
- Bayesian statistics: population information + sample information + prior information;
4. The core idea of Bayes: choose the decision with the highest probability (compute the probability of each class and pick the class with the highest one);
5. Bayesian inference:
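For reference, the standard statement of Bayes' theorem, which updates a prior belief P(H) into a posterior P(H|D) after observing data D:

$$P(H \mid D) = \frac{P(D \mid H)\,P(H)}{P(D)}$$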

II. Naive Bayes
1. Naive Bayes inference:
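For reference, the naive Bayes decision rule: under the "naive" assumption that the features x_1, ..., x_n are conditionally independent given the class, the predicted class is

$$\hat{y} = \arg\max_{y_k} P(y_k) \prod_{i=1}^{n} P(x_i \mid y_k)$$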

2. Word-vector models:

2.1 One-hot representation:
Each word is represented as a long vector whose dimension equals the vocabulary size; exactly one component is 1 and all the others are 0, and the position of that 1 identifies the word.
Problems: it ignores word distance and word frequency, and it suffers from the curse of dimensionality and data sparsity;
Code:
a. Training data and labels
import numpy as np
import pandas as pd
postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
               ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
               ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
               ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
               ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
               ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
classVec = [0, 1, 0, 1, 0, 1]  # class label vector: 1 = abusive, 0 = not abusive
b. Build the vocabulary and vectorize the tokens
# Convert token lists into token vectors
def createVocabList(postingList):
    vocabSet = set(word for senten in postingList for word in senten)  # set: removes duplicates
    return list(vocabSet)

def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)  # vector of zeros, one slot per vocabulary word
    for word in inputSet:  # iterate over the tokens
        if word in vocabList:  # if the token is in the vocabulary, set its slot to 1
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec

myVocabList = createVocabList(postingList)  # vocabulary: the set of all words that appear
print('myVocabList:\n', myVocabList)
trainMat = []
for postinDoc in postingList:
    trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
print('trainMat:\n', trainMat)
df = pd.DataFrame(trainMat, columns=myVocabList)
df['label'] = classVec
df
c. Prior and conditional probabilities

# Prior probabilities of H0 and H1
ph1 = df['label'].sum() / len(df['label'])  # prior probability of class 1
ph0 = 1 - ph1  # prior probability of class 0
# Conditional probabilities
df1 = df[df['label'] == 1]  # all samples labeled 1
df0 = df[df['label'] == 0]  # all samples labeled 0
p1V = np.array(df1.iloc[:, 0:-1].sum(axis=0)) / len(df1)  # p(x|h1)
p0V = np.array(df0.iloc[:, 0:-1].sum(axis=0)) / len(df0)  # p(x|h0)
pd.DataFrame([p0V, p1V], columns=myVocabList)
d. Algorithm improvements
- When classifying a document with a Bayesian classifier, we multiply many probabilities together to get the probability that the document belongs to a given class, i.e. we compute p(w0|1)p(w1|1)p(w2|1)... If any one of these probabilities is 0, the final product is also 0. To reduce this effect, initialize every word count to 1 and the denominator to 2. This technique is called Laplace smoothing (also known as add-one smoothing) and solves the zero-probability problem;
- Underflow: multiplying many very small numbers can underflow, so the result may round to zero. To avoid this, take the natural logarithm of the product; taking logs prevents underflow and floating-point rounding errors;
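Taking logs works because the logarithm turns the product into a sum while preserving the argmax, so the class comparison below is unchanged:

$$\log\Big(P(h)\prod_i P(w_i \mid h)\Big) = \log P(h) + \sum_i \log P(w_i \mid h)$$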
# Improvement 1: add 1 to the numerator and 2 to the denominator (Laplace smoothing)
p1V = (np.array(df1.iloc[:, 0:-1].sum(axis=0)) + 1) / (len(df1) + 2)
p0V = (np.array(df0.iloc[:, 0:-1].sum(axis=0)) + 1) / (len(df0) + 2)
# Improvement 2: take logs
p1V = np.log(p1V)
p0V = np.log(p0V)
e. Final test
test = ['love', 'my', 'dalmation']
testEntry = setOfWords2Vec(myVocabList, test)
# Absent features are simply ignored
p1 = (p1V * testEntry).sum() + np.log(ph1)  # log-score of class 1
p0 = (p0V * testEntry).sum() + np.log(ph0)  # log-score of class 0
if p1 > p0:
    print('abusive')
else:
    print('not abusive')
f. Full pipeline
# Training set and labels
postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
               ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
               ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
               ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
               ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
               ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
classVec = [0, 1, 0, 1, 0, 1]
# Vectorize
myVocabList = createVocabList(postingList)  # build the vocabulary
trainMat = []
for postinDoc in postingList:
    trainMat.append(setOfWords2Vec(myVocabList, postinDoc))  # vectorize each document
df = pd.DataFrame(trainMat, columns=myVocabList)
df['label'] = classVec
df0 = df[df['label'] == 0]
df1 = df[df['label'] == 1]
# Priors
ph1 = df['label'].sum() / len(df['label'])
ph0 = 1 - ph1
# Improvement 1: add 1 to the numerator and 2 to the denominator (Laplace smoothing)
p1V = (np.array(df1.iloc[:, 0:-1].sum(axis=0)) + 1) / (len(df1) + 2)
p0V = (np.array(df0.iloc[:, 0:-1].sum(axis=0)) + 1) / (len(df0) + 2)
# Improvement 2: take logs
p1V = np.log(p1V)
p0V = np.log(p0V)
# Test
test = ['love', 'my', 'dalmation']
testEntry = setOfWords2Vec(myVocabList, test)
p1 = (p1V * testEntry).sum() + np.log(ph1)
p0 = (p0V * testEntry).sum() + np.log(ph0)
if p1 > p0:
    print('abusive')
else:
    print('not abusive')
2.2 Bag-of-words model: also takes word frequency (counts) into account;
Code:
from sklearn.feature_extraction.text import CountVectorizer
# Corpus of documents
corpus = ["I come to to to China travel travel travel",
          "This is a car to popular in China",
          "I love tea and to Apple ",
          "The work is to write some papers in science"]
vectorizer = CountVectorizer()
vectorizer.fit_transform(corpus)  # returns a sparse matrix; the default tokenizer drops single-character tokens such as "I" and "a"
import pandas as pd
pd.DataFrame(vectorizer.fit_transform(corpus).toarray(),
             columns=vectorizer.get_feature_names_out())  # use get_feature_names() on scikit-learn < 1.0
2.3 TF-IDF model: takes word importance into account;
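For reference, the (smoothed) TF-IDF weighting used by scikit-learn's TfidfVectorizer: with n documents and df(t) the number of documents containing term t,

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \Big(\ln\frac{1 + n}{1 + \text{df}(t)} + 1\Big)$$

and each document vector is then L2-normalized.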

Code:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["I come to to to China travel travel travel",
          "This is a car to popular in China",
          "I love tea and to Apple ",
          "The work is to write some papers in science"]
tfidf = TfidfVectorizer()
res = tfidf.fit_transform(corpus)
import pandas as pd
pd.DataFrame(res.toarray(), columns=tfidf.get_feature_names_out())  # get_feature_names() on scikit-learn < 1.0
The jieba library (Chinese word segmentation):
# pip install jieba
import jieba

# Chinese corpus; the sentences mean: "I like to travel in China", "This car is very
# popular in China", "I like tea and apples", "This job is writing some scientific papers"
corpus = ["我喜欢去中国旅游",
          "这个车在中国很流行",
          "我喜欢茶叶和苹果",
          "这个工作是写一些科技文献"]
newcorpus = [' '.join(jieba.lcut(sen)) for sen in corpus]  # segment each sentence, then join the tokens with spaces
newcorpus
III. Naive Bayes in sklearn
1. Multinomial model
For discrete features; smoothing is applied when computing the prior P(yk) and the conditional probability P(xi|yk);
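For reference, the smoothed conditional probability as documented for sklearn's MultinomialNB, where N_{y_k,i} is the count of feature i in class y_k, N_{y_k} the total feature count of the class, n the number of features, and α the smoothing parameter (α = 1 gives Laplace smoothing):

$$P(x_i \mid y_k) = \frac{N_{y_k, i} + \alpha}{N_{y_k} + \alpha n}$$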


Code:
import numpy as np
from sklearn.naive_bayes import MultinomialNB
np.random.seed(10)  # fix the random seed so the example is reproducible
X = np.random.randint(5, size=(10, 10))
y = np.array([1, 2, 3, 4, 5, 6, 1, 2, 3, 4])
clf = MultinomialNB()
clf.fit(X, y)
print(clf.predict([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]))
clf.predict_proba([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
2. Bernoulli model
Suitable for discrete features where each feature takes only the values 0 and 1 (it adds a binarization step; in text classification, for example, a feature is 1 if the word appears in the document and 0 otherwise);
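For reference, the Bernoulli likelihood as documented for sklearn's BernoulliNB, which also penalizes the absence of a feature:

$$P(x_i \mid y_k) = P(i \mid y_k)\,x_i + \big(1 - P(i \mid y_k)\big)(1 - x_i)$$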


Code:
from sklearn.naive_bayes import BernoulliNB
import numpy as np
X = np.random.randint(2, size=(6, 10))  # binary features
Y = np.array([1, 2, 3, 4, 4, 5])
clf = BernoulliNB()
clf.fit(X, Y)
print(clf.predict(X[2:3]))
3. Gaussian model
- Handles continuous feature variables;
- The Gaussian model assumes that each feature dimension follows a Gaussian (normal) distribution;
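For reference, the per-feature likelihood, with mean μ and variance σ² estimated from the training samples of class y_k:

$$P(x_i \mid y_k) = \frac{1}{\sqrt{2\pi\sigma_{y_k,i}^2}}\,\exp\!\Big(-\frac{(x_i - \mu_{y_k,i})^2}{2\sigma_{y_k,i}^2}\Big)$$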


Code:
# Import packages and the dataset
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.naive_bayes import GaussianNB
# Load the data
iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=10)
# Train / fit
gnb = GaussianNB()
gnb.fit(x_train, y_train)
# Predict
gnb.predict(x_test)
# Model score (accuracy)
gnb.score(x_test, y_test)
# Confusion matrix
print(confusion_matrix(y_test, gnb.predict(x_test)))
# Classification report
print(classification_report(y_test, gnb.predict(x_test)))
IV. English news classification
- The 20 newsgroups dataset contains roughly 18,000 news articles covering 20 topics, hence the name "20 newsgroups text dataset". It is split into a training set and a test set and is commonly used for text classification;
- Parameters of fetch_20newsgroups:
fetch_20newsgroups(
    data_home=None,            # download directory
    subset='train',            # which part of the dataset to load: train/test
    categories=None,           # list of categories to load; defaults to all 20
    shuffle=True,              # shuffle the dataset
    random_state=42,           # random number generator seed
    remove=(),                 # e.g. ('headers', 'footers', 'quotes') to strip parts of the text
    download_if_missing=True,  # download the data if it has not been downloaded before
)
import os
# os.chdir(r'D:\CDA\File')
from sklearn.datasets import fetch_20newsgroups
news = fetch_20newsgroups(data_home=r'D:\CDA\File', subset='all')
news_train = fetch_20newsgroups(data_home=r'D:\CDA\File', subset='train')  # training set
news_test = fetch_20newsgroups(data_home=r'D:\CDA\File', subset='test')    # test set
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
vectors_train = vectorizer.fit_transform(news_train.data)  # extract training-set features
vectors_test = vectorizer.transform(news_test.data)        # extract test-set features
# Multinomial model
from sklearn.naive_bayes import MultinomialNB, BernoulliNB, GaussianNB
nb = MultinomialNB()
nb.fit(vectors_train, news_train.target)
nb.score(vectors_test, news_test.target)
# Hyperparameter tuning: grid search
import numpy as np
from sklearn.model_selection import GridSearchCV
nb = MultinomialNB()
params = {'alpha': np.linspace(0, 0.5, 50)}
grid_search = GridSearchCV(nb, param_grid=params, cv=5, verbose=2, n_jobs=-1)
grid_search.fit(vectors_train, news_train.target)
grid_search.best_params_
grid_search.score(vectors_test, news_test.target)
# Add stop words
def get_stop_words():
    result = set()
    for line in open('stopwords_en.txt', 'r').readlines():
        result.add(line.strip())
    return result

# Load the stop words and re-vectorize
stop_words = get_stop_words()
vectorizer = TfidfVectorizer(stop_words=list(stop_words))
vectors_train = vectorizer.fit_transform(news_train.data)
vectors_test = vectorizer.transform(news_test.data)  # extract test-set features
from sklearn.model_selection import GridSearchCV
nb = MultinomialNB()
params = {'alpha': np.linspace(0, 0.5, 50)}
grid_search = GridSearchCV(nb, param_grid=params, cv=5, verbose=2, n_jobs=-1)
grid_search.fit(vectors_train, news_train.target)
grid_search.best_params_
grid_search.score(vectors_test, news_test.target)
Workflow for Chinese text (a minimal sketch follows the list):
- Segment the text and convert it into strings of space-separated tokens
- Load stop words
- Vectorize
- Build the model
- Tune the hyperparameters
- Test
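A minimal end-to-end sketch of these steps, reusing the toy jieba corpus from above; the labels and the stop-word file stopwords_cn.txt are made up for illustration, so treat this as a template rather than the original post's code:
import jieba
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV

# 1. Segment: toy corpus with made-up labels (1 = leisure, 0 = work)
corpus = ["我喜欢去中国旅游", "这个车在中国很流行", "我喜欢茶叶和苹果", "这个工作是写一些科技文献"]
labels = [1, 0, 1, 0]
docs = [' '.join(jieba.lcut(sen)) for sen in corpus]  # space-separated tokens

# 2. Load stop words (hypothetical file, one word per line)
# stop_words = [line.strip() for line in open('stopwords_cn.txt', encoding='utf-8')]

# 3. Vectorize (pass stop_words=stop_words once the file is available)
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(docs)

# 4. Build the model
nb = MultinomialNB()

# 5. Tune the hyperparameters (cv=2 only because the toy corpus is tiny)
grid = GridSearchCV(nb, param_grid={'alpha': np.linspace(0.1, 1.0, 10)}, cv=2)
grid.fit(vectors, labels)

# 6. Test on a new sentence ("I want to go travelling")
new_doc = [' '.join(jieba.lcut("我想去旅游"))]
print(grid.predict(vectorizer.transform(new_doc)))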