
How to Perform Feature Selection for Regression Problems


1. Introduction

What is feature selection?

Feature selection is the procedure of selecting a subset (some out of all available) of the input variables that are most relevant to the target variable (that we wish to predict).


The target variable here refers to the variable that we wish to predict.

For this article we will assume that we only have numerical input variables and a numerical target for regression predictive modeling. Under that assumption, we can easily estimate the relationship between each input variable and the target variable. This relationship can be established by calculating a metric such as the correlation value.
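
Before introducing any library functions, here is a minimal sketch of that idea, assuming a hypothetical feature matrix X and target y, that computes the Pearson correlation of every input variable with the target using plain NumPy:

    import numpy as np

    # hypothetical data: 100 samples, 3 numerical input variables
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = 2 * X[:, 0] + rng.normal(size=100)  # target driven mostly by feature 0

    # Pearson correlation between each input variable and the target
    for i in range(X.shape[1]):
        r = np.corrcoef(X[:, i], y)[0, 1]
        print(f"feature {i}: correlation = {r:.3f}")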

2. The main numerical feature selection methods

The two best-known feature selection techniques that can be used for numerical input data and a numerical target variable are the following:

  • Correlation (Pearson, Spearman)

  • Mutual Information (MI, normalized MI)


Correlation is a measure of how two variables change together. The most widely used correlation measure is Pearson's correlation, which assumes a Gaussian distribution of each variable and detects linear relationships between numerical variables.
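
Pearson's correlation only captures linear trends, whereas Spearman's correlation is rank-based and therefore also detects monotonic nonlinear relationships. The toy sketch below (my own illustration, not from the original article) contrasts the two on an exponential relationship:

    import numpy as np
    from scipy.stats import pearsonr, spearmanr

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 5, size=200)
    y = np.exp(x)  # monotonic but strongly nonlinear

    print(pearsonr(x, y))   # noticeably below 1: the relationship is not linear
    print(spearmanr(x, y))  # 1.0: the relationship is perfectly monotonic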

This is done in two steps:

  1. The correlation between each regressor and the target is computed, that is, ((X[:, i] - mean(X[:, i])) * (y - mean_y)) / (std(X[:, i]) * std(y)).

  2. It is converted to an F score and then to a p-value (a rough sketch of this computation follows the list).
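
The sketch below is a rough reconstruction of those two steps with NumPy and SciPy (my own approximation of what f_regression computes; the library internals may differ in detail):

    import numpy as np
    from scipy import stats

    def correlation_to_f(X, y):
        n = X.shape[0]
        # step 1: Pearson correlation between each column of X and y
        Xc = X - X.mean(axis=0)
        yc = y - y.mean()
        r = (Xc * yc[:, None]).sum(axis=0) / np.sqrt(
            (Xc ** 2).sum(axis=0) * (yc ** 2).sum())
        # step 2: convert the correlation to an F score, then to a p-value
        dof = n - 2                    # degrees of freedom
        f = r ** 2 / (1 - r ** 2) * dof
        p = stats.f.sf(f, 1, dof)      # right tail of the F(1, dof) distribution
        return f, p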

Mutual information originates from the field of information theory. The idea is that information gain (typically used in the construction of decision trees) is applied in order to perform the feature selection. Mutual information is calculated between two variables and measures the reduction in uncertainty for one variable given a known value of the other variable.
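
A toy sketch of why this matters (my own illustration): for a purely quadratic relationship, the Pearson correlation is close to zero while the estimated mutual information is clearly positive:

    import numpy as np
    from scipy.stats import pearsonr
    from sklearn.feature_selection import mutual_info_regression

    rng = np.random.default_rng(0)
    x = rng.uniform(-3, 3, size=500)
    y = x ** 2 + rng.normal(scale=0.1, size=500)  # nonlinear and non-monotonic

    print(pearsonr(x, y)[0])  # close to 0: no linear trend
    print(mutual_info_regression(x.reshape(-1, 1), y, random_state=0))  # clearly > 0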

3. The dataset

We will use the Boston house-prices dataset. This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Mass. The dataset consists of the following variables:

  1. CRIM - per capita crime rate by town

  2. ZN - proportion of residential land zoned for lots over 25,000 sq.ft.

  3. INDUS - proportion of non-retail business acres per town

  4. CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)

  5. NOX - nitric oxides concentration (parts per 10 million)

  6. RM - average number of rooms per dwelling

  7. AGE - proportion of owner-occupied units built prior to 1940

  8. DIS - weighted distances to five Boston employment centres

  9. RAD - index of accessibility to radial highways

  10. TAX - full-value property-tax rate per $10,000

  11. PTRATIO - pupil-teacher ratio by town

  12. B - 1000(Bk - 0.63)², where Bk is the proportion of blacks by town

  13. LSTAT - % lower status of the population

  14. MEDV - median value of owner-occupied homes in $1000's

4. Python Code & Working Example

Let’s load and split the dataset into training (70%) and test (30%) sets.


    from sklearn.datasets import load_boston
    from sklearn.model_selection import train_test_split
    from sklearn.feature_selection import SelectKBest
    from sklearn.feature_selection import f_regression
    from sklearn.feature_selection import mutual_info_regression
    import matplotlib.pyplot as plt

    # load the data (note: load_boston was removed in scikit-learn 1.2,
    # so this snippet requires an older scikit-learn version)
    X, y = load_boston(return_X_y=True)

    # split into train (70%) and test (30%) sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=0)

We will use the well-known scikit-learn machine learning library.

Case 1: Feature selection using the Correlation metric

For the correlation statistic we will use the f_regression() function. This function can be used in a feature selection strategy, such as selecting the top k most relevant features (those with the largest values) via the SelectKBest class.

    # feature selection
    f_selector = SelectKBest(score_func=f_regression, k='all')
    # learn relationship from training data
    f_selector.fit(X_train, y_train)
    # transform train input data
    X_train_fs = f_selector.transform(X_train)
    # transform test input data
    X_test_fs = f_selector.transform(X_test)
    # plot the scores for the features
    plt.bar([i for i in range(len(f_selector.scores_))], f_selector.scores_)
    plt.xlabel("feature index")
    plt.ylabel("F-value (transformed from the correlation values)")
    plt.show()

Reminder: for the correlation statistic case:

  1. The correlation between each regressor and the target is computed, that is, ((X[:, i] - mean(X[:, i])) * (y - mean_y)) / (std(X[:, i]) * std(y)).

  2. It is converted to an F score and then to a p-value.
[Image: Feature Importance plot]

The plot above shows that features 6 and 13 are more important than the other features. The y-axis represents the F-values that were estimated from the correlation values.
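
To relate those indices back to the variables listed in section 3, we can print the scores next to the feature names and keep only the top k features. A short sketch (the feature_names list is typed out from the dataset description above; it is not returned by load_boston(return_X_y=True)):

    # map scores to the Boston feature names (the target MEDV is excluded)
    feature_names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                     "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]
    for name, score in sorted(zip(feature_names, f_selector.scores_),
                              key=lambda t: t[1], reverse=True):
        print(f"{name}: F-value = {score:.2f}")

    # keep only the k most relevant features instead of all of them
    fs = SelectKBest(score_func=f_regression, k=2)
    X_train_top2 = fs.fit_transform(X_train, y_train)
    X_test_top2 = fs.transform(X_test)
    print(fs.get_support(indices=True))  # indices of the selected features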

Case 2: Feature selection using the Mutual Information metric

The scikit-learn machine learning library provides an implementation of mutual information for feature selection with numeric input and output variables via the mutual_info_regression() function.


    # feature selection
    f_selector = SelectKBest(score_func=mutual_info_regression, k='all')
    # learn relationship from training data
    f_selector.fit(X_train, y_train)
    # transform train input data
    X_train_fs = f_selector.transform(X_train)
    # transform test input data
    X_test_fs = f_selector.transform(X_test)
    # plot the scores for the features
    plt.bar([i for i in range(len(f_selector.scores_))], f_selector.scores_)
    plt.xlabel("feature index")
    plt.ylabel("Estimated MI value")
    plt.show()
[Image: Feature Importance plot]

The y-axis represents the estimated mutual information between each feature and the target variable. Compared to the correlation feature selection method, we can clearly see that many more features are scored as relevant. This may be because of statistical noise that might exist in the dataset.
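
To compare the two rankings directly, we can call the two scoring functions themselves on the training data. A minimal sketch (my own addition, reusing X_train and y_train from above):

    import numpy as np
    from sklearn.feature_selection import f_regression, mutual_info_regression

    # score every feature under both metrics and compare the orderings
    f_scores, _ = f_regression(X_train, y_train)
    mi_scores = mutual_info_regression(X_train, y_train, random_state=0)

    print("top 5 features by F-score:", np.argsort(f_scores)[::-1][:5])
    print("top 5 features by MI:     ", np.argsort(mi_scores)[::-1][:5])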

5. Conclusion

In this article I have presented two ways to perform feature selection. Feature selection is the procedure of selecting a subset (some out of all available) of the input variables that are most relevant to the target variable. The target variable here refers to the variable that we wish to predict.

Using either the Correlation metric or the Mutual Information metric, we can easily estimate the relationship between each input variable and the target variable.

Correlation vs Mutual Information: compared to the correlation feature selection method, the mutual information method scored many more features as relevant. This may be because of statistical noise that might exist in the dataset.


Stay tuned & support this effort

If you liked this article and found it useful, follow me to be able to see all my new posts.

Questions? Post them as a comment and I will reply as soon as possible.



Get in touch with me

LinkedIn: https://www.linkedin.com/in/serafeim-loukas/

ResearchGate: https://www.researchgate.net/profile/Serafeim_Loukas

EPFL profile: https://people.epfl.ch/serafeim.loukas

Stack Overflow: https://stackoverflow.com/users/5025009/seralouk

Translated from: https://towardsdatascience.com/how-to-perform-feature-selection-for-regression-problems-c928e527bbfa
