Hypothesis with R and Understanding of P-value and confidence-interval
- Hypothesis with R
- Dataset description
- Data visualization
- Two-sided hypothesis test using the t-test (small samples)
Hypothesis with R
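Dataset description
Using the Galton dataset, we examine how the heights of sons and daughters relate to the heights of their mothers.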
library("AzureML")
ws <- workspace()
galton <- download.datasets(ws, "GaltonFamilies.csv")
head(galton)
The first 6 rows of the data and the columns:

dim(galton)
939 rows and 0 columns (attributes)

Data visualization
Plot histograms comparing the heights of mothers and sons, and of mothers and daughters.
## draw one histogram of column `col` on a shared x-axis range
hist.plot = function(df, col, bw, max, min){
  ggplot(df, aes_string(col)) + geom_histogram( binwidth = bw ) + xlim(min,max)
}
hist.family = function(df, col1, col2, num.bin = 30){
  require(ggplot2)
  require(gridExtra)
  ## compute a common bin width over both columns
  max = max(c(df[, col1], df[, col2]))
  min = min(c(df[, col1], df[, col2]))
  bin.width = (max - min)/num.bin
  ## create the first histogram, with its mean marked in red
  p1 = hist.plot(df, col1, bin.width, max, min)
  p1 = p1 + geom_vline(xintercept = mean(df[, col1]), color = 'red', size = 1)
  ## create the second histogram, with its mean marked in red
  p2 = hist.plot(df, col2, bin.width, max, min)
  p2 = p2 + geom_vline(xintercept = mean(df[, col2]), color = 'red', size = 1)
  ## stack the plots
  grid.arrange(p1,p2, nrow = 2, ncol = 1)
}
sons = galton[galton$gender=='male',]
hist.family(sons,'childHeight','mother')
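In the plots, geom_vline() marks each mean so the two can be compared. The results are as follows:
Sons and mothers

Daughters and mothers

The height distributions of sons and mothers overlap only slightly, whereas the height distribution of daughters closely resembles that of mothers. Based on this, we state the null hypothesis that the mean heights of mothers and children are equal (μ₁ - μ₂ = 0); the alternative hypothesis is μ₁ ≠ μ₂.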
Two-sided hypothesis test using the t-test (small samples)
##H0: there is no significant difference between the means
## note: this helper defaults to a paired test, unlike hist.family.conf below
families.test <- function(df, col1, col2, paired = TRUE){
  t.test(df[,col1],df[,col2],paired=paired)
}
hist.family.conf <- function(df, col1, col2, num.bin = 30, paired=FALSE){
  require(ggplot2)
  require(gridExtra)
  ## compute a common bin width over both columns
  max = max(c(df[,col1], df[,col2]))
  min = min(c(df[,col1], df[,col2]))
  bin.width = (max-min)/num.bin
  mean1 <- mean(df[,col1])
  mean2 <- mean(df[,col2])
  t <- t.test(df[,col1],df[,col2],paired=paired)
  ## conf.int bounds mean1 - mean2; shift by mean2 to place it on the col1 axis
  pv1 <- mean2 + t$conf.int[1]
  pv2 <- mean2 + t$conf.int[2]
  ## histogram of col1 with its mean (solid) and the shifted interval (dashed)
  p1 <- hist.plot(df,col1,bin.width,max,min)
  p1 <- p1 + geom_vline(xintercept = mean1,
                        color = 'red', size = 1) +
    geom_vline(xintercept = pv1,
               color = 'red', size = 1, linetype = 2) +
    geom_vline(xintercept = pv2,
               color = 'red', size = 1, linetype = 2)
  ## histogram of col2 with its mean
  p2 <- hist.plot(df, col2, bin.width, max, min)
  p2 <- p2 + geom_vline(xintercept = mean2,
                        color = 'red', size = 1.5)
  ## Now stack the plots
  grid.arrange(p1, p2, nrow = 2)
  print(t)
}
hist.family.conf(sons,'mother','childHeight')
Result of testing whether the son-mother difference in mean height is 0:

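Understanding the confidence interval and the p-value:
Assume the standardized height difference follows a t distribution with k−1 degrees of freedom (for the unpaired call used here, t.test estimates the degrees of freedom with the Welch approximation). In the son-mother test, the 95% confidence interval for μ₁ - μ₂, the difference between the two group means, is [-5.514, -4.887]. The 95% refers to the procedure: across repeated samples, intervals built this way cover the true difference about 95% of the time. This interval does not contain 0. The p-value is the probability, computed from that t distribution under the assumption μ₁ - μ₂ = 0, of observing a difference at least as extreme as the one in the sample; here it is < 2.2E-16, far below the significance level α = 0.05. We therefore have sufficient evidence to reject the null hypothesis and conclude that the mean heights of sons and mothers differ.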
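The numbers that t.test prints can be reproduced by hand, which may make the roles of the t distribution, the confidence interval, and the p-value easier to see. The following is a minimal sketch on simulated heights (the sample sizes, means, and standard deviations are assumptions chosen for illustration, not the Galton data), using the unpaired Welch form that t.test applies by default.

```r
# Simulated heights in inches. These values are made up for illustration
# and are NOT the Galton data.
set.seed(42)
mother <- rnorm(200, mean = 64, sd = 2.3)
son    <- rnorm(200, mean = 69, sd = 2.6)

# Welch two-sample t statistic for H0: mu1 - mu2 = 0
n1 <- length(mother)
n2 <- length(son)
v1 <- var(mother) / n1            # squared standard error of mean 1
v2 <- var(son) / n2               # squared standard error of mean 2
diff.mean <- mean(mother) - mean(son)
se <- sqrt(v1 + v2)
t.stat <- diff.mean / se

# Welch-Satterthwaite approximation to the degrees of freedom
dof <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# Two-sided p-value and 95% confidence interval for mu1 - mu2
p.value <- 2 * pt(-abs(t.stat), dof)
ci <- diff.mean + c(-1, 1) * qt(0.975, dof) * se

# Compare with t.test (paired = FALSE and var.equal = FALSE are its defaults)
res <- t.test(mother, son)
print(c(manual = t.stat, t.test = unname(res$statistic)))
print(rbind(manual = ci, t.test = as.numeric(res$conf.int)))
print(c(manual = p.value, t.test = res$p.value))
```

The interval is centered on the observed difference and its width is set by the standard error and the t quantile, so an interval that excludes 0 and a two-sided p-value below 0.05 are two views of the same calculation.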

The same method gives the result of the daughter-mother test of mean heights:

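The 95% confidence interval for μ₁ - μ₂ is [-0.25, 0.34], which contains 0. The p-value computed from the t distribution is 0.7701, well above the significance level α = 0.05. In this case we do not have sufficient evidence to reject the null hypothesis: the data are consistent with daughters and mothers having the same mean height.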
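The 95% attached to such an interval can be read as a statement about the procedure over many repeated samples. A small simulation can illustrate this; the sketch below assumes two normal populations with the same mean (hypothetical values, not the Galton data) and counts how often the interval from t.test covers the true difference.

```r
# Repeated-sampling check of the 95% confidence interval.
# Population parameters are assumptions for illustration only.
set.seed(1)
true.diff <- 0      # assumed true value of mu1 - mu2
n.rep <- 2000       # number of simulated studies
covered <- logical(n.rep)

for (i in seq_len(n.rep)) {
  x <- rnorm(50, mean = 64, sd = 2.3)
  y <- rnorm(50, mean = 64 - true.diff, sd = 2.3)
  ci <- t.test(x, y)$conf.int
  # TRUE when this sample's interval contains the true difference
  covered[i] <- ci[1] <= true.diff && true.diff <= ci[2]
}

coverage <- mean(covered)
print(coverage)     # share of intervals that cover the true difference
```

The printed share should be close to 0.95. Any single interval either contains the true difference or does not; the 95% describes how often the method succeeds.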

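The p-value is a probability computed under the assumption that the null hypothesis is true. In statistics it serves only to decide whether to reject that hypothesis; it does not directly reflect how important the resulting conclusion is.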
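To see why a small p-value alone says little about importance, consider a very large sample with a negligible true difference. The sketch below uses simulated data with assumed parameters (a 0.05 inch difference in means, 200000 observations per group); none of these values come from the Galton data.

```r
# A tiny true difference can still produce a very small p-value
# when the sample is large. All parameters here are assumptions.
set.seed(7)
x.big <- rnorm(200000, mean = 64.00, sd = 2.5)
y.big <- rnorm(200000, mean = 64.05, sd = 2.5)
big <- t.test(x.big, y.big)

# Estimated difference in means (inches)
est.diff <- unname(big$estimate[1] - big$estimate[2])

# Standardized effect size (Cohen's d with a pooled standard deviation)
pooled.sd <- sqrt((var(x.big) + var(y.big)) / 2)
cohen.d <- est.diff / pooled.sd

print(big$p.value)   # statistical significance
print(est.diff)      # size of the difference in inches
print(cohen.d)       # size of the difference in standard deviations
```

Reporting the estimated difference or an effect size next to the p-value shows how large the difference actually is, which the p-value by itself does not.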
