Common functions for checking whether data follow a normal distribution
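Common functions for assessing the normality of data:
1. QQ plot (quantile-quantile plot): plots the sample quantiles of the observed data against the theoretical quantiles of a normal distribution, to assess whether the normal model fits the data.
Reference: QQ-PLOT原理详解 (a detailed explanation of how the QQ plot works)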
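import statsmodels.api as sm
from matplotlib import pyplot as plt
# line='q' draws the reference line through the quartiles of the data
sm.qqplot(diff_result_data_df.loc[:,'clicks_diff'], line='q')
plt.title('click diff QQ-plot')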

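The closer the data points lie to the red reference line, which represents the theoretical normal distribution, the better the observed data match a normal distribution.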
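The "closeness to the line" can also be summarized numerically. The following is a minimal sketch, assuming synthetic data in place of `diff_result_data_df` (the names `normal_sample`, `skewed_sample`, and `qq_line_fit` are made up for illustration). It uses `scipy.stats.probplot`, which fits a least-squares line through the QQ points rather than the quartile line drawn by `line='q'`, and returns the correlation coefficient `r` of that fit.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical stand-ins for the real column: one normal, one right-skewed sample
normal_sample = rng.normal(loc=6, scale=4, size=2000)
skewed_sample = rng.exponential(scale=4, size=2000)


def qq_line_fit(sample):
    """Return (slope, intercept, r) of the least-squares line through the QQ points."""
    # probplot returns the theoretical quantiles, the ordered sample values,
    # and the parameters of the fitted line
    (theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(
        sample, dist="norm"
    )
    return slope, intercept, r


_, _, r_normal = qq_line_fit(normal_sample)
_, _, r_skewed = qq_line_fit(skewed_sample)

# r close to 1 means the QQ points hug the line; skewed data give a lower r
print(f"normal sample: r = {r_normal:.4f}")
print(f"skewed sample: r = {r_skewed:.4f}")
```

A value of `r` very close to 1 corresponds to points that follow the reference line; it is a rough summary and does not replace looking at the plot itself.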
2. plt.hist(): judge normality from the shape of the histogram of the data.
from matplotlib import pyplot as plt
plt.hist(diff_result_data_df.loc[:,'clicks_diff'],50)
plt.title('diff value distribution, daily')
# For comparison: histogram of a simulated normal sample (mean 6, standard deviation 4)
import numpy as np
sample_num=1995
mu=6
sigma=4
data=np.random.normal(mu,sigma,sample_num)
plt.hist(data,100)

![Figure](https://ad.itadn.com/c/weblog/blog-img/images/2025-04-04/6r7MaZAEKXzvPCi8F9stIVYo5U4G.png)
Inspect the shape of the data histogram and compare it with the bell shape of a normal distribution.
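The visual comparison can be made a little more concrete by setting the histogram against a normal density with the same mean and standard deviation. This is only a sketch under assumptions: it reuses the simulated parameters from the snippet above (`mu=6`, `sigma=4`, 1995 points) rather than the real `clicks_diff` column, and `max_gap` is an ad hoc summary, not a formal test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
mu, sigma, sample_num = 6, 4, 1995
data = rng.normal(mu, sigma, sample_num)

# density=True rescales the bar heights so that the histogram integrates to 1,
# which makes it directly comparable to a probability density
heights, edges = np.histogram(data, bins=50, density=True)
centers = (edges[:-1] + edges[1:]) / 2

# Normal density evaluated at the bin centers, using the sample's own
# mean and standard deviation
fitted_pdf = stats.norm.pdf(centers, loc=data.mean(), scale=data.std(ddof=1))

# Largest vertical distance between the histogram and the fitted density
max_gap = np.max(np.abs(heights - fitted_pdf))
print(f"largest gap between histogram and fitted normal density: {max_gap:.4f}")
```

For plotting, the same idea amounts to calling `plt.hist(data, 50, density=True)` and drawing `fitted_pdf` over it with `plt.plot(centers, fitted_pdf)`.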
3. shapiro() (from scipy.stats): Shapiro-Wilk test for normality. The null hypothesis is that the data follow a normal distribution.
In[1]: shapiro(diff_result_data_df.loc[:,'clicks_diff'])
Out[1]: ShapiroResult(statistic=0.9608051180839539, pvalue=2.1187931537573753e-19)
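Here, statistic is the Shapiro-Wilk W statistic, not a probability; values close to 1 are consistent with normality. pvalue is the probability of obtaining a W at least this extreme if the null hypothesis were true. In this output the p-value (about 2.1e-19) is far below 0.05, so the null hypothesis of normality is rejected.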
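To show how the result is typically read, here is a minimal sketch that runs `shapiro()` on two synthetic samples and applies the usual decision rule. The samples, the `alpha = 0.05` threshold, and the `results` dictionary are assumptions made for illustration and are not part of the original analysis.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical samples: one drawn from a normal distribution, one clearly not
samples = {
    "normal": rng.normal(loc=0, scale=1, size=500),
    "exponential": rng.exponential(scale=1, size=500),
}

alpha = 0.05  # significance level chosen for this illustration
results = {}
for name, sample in samples.items():
    stat, p_value = shapiro(sample)  # W statistic and its p-value
    results[name] = (stat, p_value)
    # A small p-value is evidence against the null hypothesis of normality
    verdict = "reject normality" if p_value < alpha else "no evidence against normality"
    print(f"{name}: W = {stat:.4f}, p = {p_value:.3g} -> {verdict}")
```

Failing to reject does not prove the data are normal; it only means this test found no evidence against normality at the chosen level.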
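4. normaltest() (from scipy.stats): D'Agostino-Pearson test, which combines skewness and kurtosis. The null hypothesis is that the data follow a normal distribution.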
In[2]: normaltest(diff_result_data_df.loc[:,'clicks_diff'])
Out[2]: NormaltestResult(statistic=114.93947230372301, pvalue=1.099539189052048e-25)
Here, the statistic is s^2 + k^2, where s is the z-score returned by the skewness test (skewtest) and k is the z-score returned by the kurtosis test (kurtosistest). The p-value in this output (about 1.1e-25) is far below 0.05, so normality is rejected. Skewness measures the asymmetry of the data. Positive skewness means the distribution is right-skewed: its tail extends to the right.
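The relationship between the three SciPy functions can be checked directly. The sketch below assumes a synthetic right-skewed sample (the variable names are illustrative) and recomputes the `normaltest()` statistic from `skewtest()` and `kurtosistest()`, then obtains the p-value from a chi-squared distribution with 2 degrees of freedom, which is how SciPy documents the test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical right-skewed sample standing in for the real data
sample = rng.exponential(scale=1.0, size=1000)

s, _ = stats.skewtest(sample)      # z-score of the skewness test
k, _ = stats.kurtosistest(sample)  # z-score of the kurtosis test
k2, p_value = stats.normaltest(sample)

# Under the null hypothesis, s^2 + k^2 follows a chi-squared
# distribution with 2 degrees of freedom
p_from_chi2 = stats.chi2.sf(k2, df=2)

print(f"s^2 + k^2            = {s**2 + k**2:.4f}")
print(f"normaltest statistic = {k2:.4f}")
print(f"p-value = {p_value:.3g}, chi2.sf(statistic, 2) = {p_from_chi2:.3g}")
```

Because the statistic adds the two squared z-scores, a large value can come from asymmetry, from unusual tails, or from both; looking at `s` and `k` separately shows which one drives the rejection.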

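Kurtosis describes how peaked the distribution is. A distribution with positive kurtosis has heavier tails than a normal distribution.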

In the figure, the solid line shows a normal distribution and the dashed line shows a distribution with positive kurtosis.
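Both quantities can be computed directly with `scipy.stats.skew` and `scipy.stats.kurtosis`. The following is a rough sketch on synthetic samples chosen for illustration (normal, Laplace as a heavy-tailed example, exponential as a right-skewed example). Note that `kurtosis()` returns excess kurtosis by default, so a normal distribution gives a value near 0.

```python
import numpy as np
from scipy.stats import kurtosis, skew

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 100_000

# Hypothetical samples chosen to illustrate each property
normal_sample = rng.normal(size=n)            # symmetric, reference tails
laplace_sample = rng.laplace(size=n)          # symmetric, heavier tails
exponential_sample = rng.exponential(size=n)  # right-skewed

for name, sample in [
    ("normal", normal_sample),
    ("laplace", laplace_sample),
    ("exponential", exponential_sample),
]:
    # kurtosis() uses the Fisher definition by default (normal -> about 0)
    print(f"{name}: skewness = {skew(sample):.3f}, "
          f"excess kurtosis = {kurtosis(sample):.3f}")
```

The Laplace sample shows clearly positive excess kurtosis with skewness near 0, while the exponential sample shows clearly positive skewness.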
