
Machine Learning: Interpretability of Linear Regression Models

We use the Boston housing data from scikit-learn as an example.
    import numpy as np
    from sklearn import datasets
    
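    # Load the Boston housing data from sklearn.datasets
    # (load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this needs an older release)
    boston = datasets.load_boston()
    
    X = boston.data    # all feature data
    y = boston.target  # target values
    
    # A 2-D scatter plot shows some bad data points (those with y = 50.0, the maximum); remove them
    X = X[y < 50.0]
    y = y[y < 50.0]
    
    # Import the linear regression estimator from sklearn
    from sklearn.linear_model import LinearRegression
    
    lin_reg = LinearRegression()  # instantiate
    lin_reg.fit(X, y)  # fit the model
    
    # Now inspect the model's coef_ attribute
    # For what the model's coef_ and intercept_ attributes mean, see: https://www.jianshu.com/p/6a818b53a37e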
    In[1]: lin_reg.coef_
    Out[1]:array([-1.06715912e-01,  3.53133180e-02, -4.38830943e-02,  4.52209315e-01,
       -1.23981083e+01,  3.75945346e+00, -2.36790549e-02, -1.21096549e+00,
        2.51301879e-01, -1.37774382e-02, -8.38180086e-01,  7.85316354e-03,
       -3.50107918e-01])
       
    In[2]:np.argsort(lin_reg.coef_)   # indices that sort coef_ in ascending order
    Out[2]:array([ 4,  7, 10, 12,  0,  2,  6,  9, 11,  1,  8,  3,  5])
    
    In[3]:boston.feature_names  # names of the features in the Boston housing data
    Out[3]:array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')
       
    In[4]:boston.feature_names[np.argsort(lin_reg.coef_)]  # reorder the feature names with the same sorted indices
    Out[4]:array(['NOX', 'DIS', 'PTRATIO', 'LSTAT', 'CRIM', 'INDUS', 'AGE', 'TAX',
       'B', 'ZN', 'RAD', 'CHAS', 'RM'], dtype='<U7')
       
    In[5]:print(boston.DESCR)  # print the dataset description to see what each feature name means
    Out[5]:.. _boston_dataset:
    
    Boston house prices dataset
    ---------------------------
    
    **Data Set Characteristics:**  
    
    :Number of Instances: 506 
    
    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
    
    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's
    
    :Missing Attribute Values: None
    
    :Creator: Harrison, D. and Rubinfeld, D.L.
    
    This is a copy of UCI ML housing dataset.
    https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
    
    
    This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
    
    The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
    prices and the demand for clean air', J. Environ. Economics & Management,
    vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
    ...', Wiley, 1980.   N.B. Various transformations are used in the table on
    pages 244-261 of the latter.
    
    The Boston house-price data has been used in many machine learning papers that address regression
    problems.   
     
    .. topic:: References
    
       - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
       - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

The feature names, sorted by coefficient in ascending order, are:
['NOX', 'DIS', 'PTRATIO', 'LSTAT', 'CRIM', 'INDUS', 'AGE', 'TAX', 'B', 'ZN', 'RAD', 'CHAS', 'RM']
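From the dataset description, RM, the feature with the largest coefficient, is the average number of rooms per dwelling. Its coefficient is positive, so with the other features held fixed, the model predicts a higher price for homes with more rooms.
Next comes CHAS, the Charles River dummy variable (1 if the tract bounds the river, 0 otherwise). It has the second-largest coefficient, right after RM, so the model predicts higher prices for homes on tracts bordering the Charles River and comparatively lower prices otherwise.
The remaining positive coefficients can be read the same way.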
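To make "a positive coefficient means a higher predicted price" concrete, here is a minimal sketch that reuses the coefficients printed in Out[1]. The baseline house `x_base` and the helper `predicted_change` are hypothetical, introduced only for illustration. The intercept is not needed, because it cancels when two predictions are subtracted.

```python
import numpy as np

feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']

# Coefficients copied from Out[1] above.
coef = np.array([-1.06715912e-01,  3.53133180e-02, -4.38830943e-02,  4.52209315e-01,
                 -1.23981083e+01,  3.75945346e+00, -2.36790549e-02, -1.21096549e+00,
                  2.51301879e-01, -1.37774382e-02, -8.38180086e-01,  7.85316354e-03,
                 -3.50107918e-01])

# Hypothetical baseline house; the values are made up for illustration only.
x_base = np.array([0.1, 0.0, 5.0, 0.0, 0.5, 6.0, 50.0,
                   4.0, 4.0, 300.0, 18.0, 390.0, 10.0])

def predicted_change(name, step):
    """Change in predicted MEDV when one feature moves by `step`, all others fixed."""
    x_new = x_base.copy()
    x_new[feature_names.index(name)] += step
    # (intercept + coef @ x_new) - (intercept + coef @ x_base): the intercept cancels.
    return float(coef @ (x_new - x_base))

delta_rm = predicted_change('RM', 1.0)      # one more room
delta_chas = predicted_change('CHAS', 1.0)  # tract bounds the river vs. not

print(f"RM   +1 -> predicted MEDV changes by {delta_rm:.3f} (in $1000s)")
print(f"CHAS +1 -> predicted MEDV changes by {delta_chas:.3f} (in $1000s)")
```

Because the model is linear, each change equals the corresponding coefficient regardless of which baseline house is chosen. MEDV is measured in $1000s, so one extra room corresponds to roughly $3,760 in predicted median value under this fitted model.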
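Now look at the negative end:
NOX: nitric oxides concentration (parts per 10 million). Nitric oxide is a toxic gas, so a higher NOX value corresponds to a lower predicted price.
The remaining negative coefficients can be read the same way.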
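The reading of both ends can be sketched in a few lines by pairing each feature name with its coefficient and splitting the ranking by sign. The arrays below are copied from Out[1] and Out[3], so the sketch runs without the dataset. The variable names `ranked`, `negative`, and `positive` are illustrative choices, not part of the original code.

```python
import numpy as np

# Feature names from Out[3] and coefficients from Out[1].
feature_names = np.array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                          'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT'])
coef = np.array([-1.06715912e-01,  3.53133180e-02, -4.38830943e-02,  4.52209315e-01,
                 -1.23981083e+01,  3.75945346e+00, -2.36790549e-02, -1.21096549e+00,
                  2.51301879e-01, -1.37774382e-02, -8.38180086e-01,  7.85316354e-03,
                 -3.50107918e-01])

# argsort gives the indices that order the coefficients from most negative to most positive.
order = np.argsort(coef)
ranked = [(str(feature_names[i]), float(coef[i])) for i in order]

# Split by sign: negative coefficients lower the predicted price, positive ones raise it.
negative = [name for name, c in ranked if c < 0]
positive = [name for name, c in ranked if c > 0]

for name, c in ranked:
    print(f"{name:8s} {c:+.4f}")
print("lower the predicted price:", negative)
print("raise the predicted price:", positive)
```

Printing the name next to the value avoids having to cross-reference Out[2] and Out[3] by hand.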
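This shows that a linear regression model is highly interpretable: the fitted coefficients suggest which additional features are worth collecting to describe house prices better. For example, the number of rooms is positively correlated with price, and more rooms may mean a larger house, more floors, and so on, so we could collect more features of that kind and see whether they yield a more accurate model of Boston house prices. Likewise, NOX is negatively correlated with price, so we could investigate whether there are factories nearby that emit nitric oxide, in the hope of improving the predictions.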
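One caveat when ranking raw coefficients: their size depends on each feature's units. The sketch below is not from the original post. It uses synthetic data and NumPy's least-squares solver as a stand-in for LinearRegression, and the names `fit_linear`, `x_small`, and `x_large` are hypothetical. It suggests that standardizing the features before fitting can change which coefficient looks largest, even though the direction of each effect stays the same.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns (intercept, coefficients)."""
    A = np.column_stack([np.ones(len(X)), X])
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return theta[0], theta[1:]

# Synthetic data (an assumption for illustration, not the Boston dataset):
# one feature on a small numeric scale, one on a large numeric scale.
rng = np.random.default_rng(0)
n = 500
x_small = rng.uniform(0.4, 0.9, n)      # small-range feature
x_large = rng.uniform(200.0, 700.0, n)  # large-range feature
X = np.column_stack([x_small, x_large])
y = 30.0 - 10.0 * x_small - 0.02 * x_large + rng.normal(0.0, 0.1, n)

# Raw fit: coefficients are per unit of each feature.
_, raw_coef = fit_linear(X, y)

# Standardized fit: coefficients are per standard deviation of each feature.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
_, std_coef = fit_linear(X_std, y)

print("raw coefficients:         ", raw_coef)
print("standardized coefficients:", std_coef)
```

In this synthetic setup the small-scale feature has the larger raw coefficient, while the large-scale feature has the larger standardized one. The sign of a coefficient is therefore a safer guide than its raw magnitude when comparing features measured in different units.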
