
How to Perform Feature Selection for Regression Problems


1. Introduction

What is feature selection?

Feature selection is the procedure of selecting a subset (some out of all available) of the input variables that are most relevant to the target variable (that we wish to predict).


The target variable here refers to the variable that we wish to predict.

For this article we will assume that we only have numerical input variables and a numerical target for regression predictive modeling. Under that assumption, we can easily estimate the relationship between each input variable and the target variable. This relationship can be established by calculating a metric such as the correlation value.
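
Before introducing any library functions, here is a minimal sketch of that idea, assuming a hypothetical feature matrix X and target y, that computes the Pearson correlation of every input variable with the target using plain NumPy:

    import numpy as np

    # hypothetical data: 100 samples, 3 numerical input variables
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = 2 * X[:, 0] + rng.normal(size=100)  # target driven mostly by feature 0

    # Pearson correlation between each input variable and the target
    for i in range(X.shape[1]):
        r = np.corrcoef(X[:, i], y)[0, 1]
        print(f"feature {i}: correlation = {r:.3f}")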

2. The main numerical feature selection methods

The two best-known feature selection techniques that can be used for numerical input data and a numerical target variable are the following:

  • Correlation (Pearson, Spearman)

  • Mutual Information (MI, normalized MI)


Correlation is a measure of how two variables change together. The most widely used correlation measure is Pearson's correlation, which assumes a Gaussian distribution of each variable and detects linear relationships between numerical variables.
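
Pearson's correlation only captures linear trends, whereas Spearman's correlation is rank-based and therefore also detects monotonic nonlinear relationships. The toy sketch below (my own illustration, not from the original article) contrasts the two on an exponential relationship:

    import numpy as np
    from scipy.stats import pearsonr, spearmanr

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 5, size=200)
    y = np.exp(x)  # monotonic but strongly nonlinear

    print(pearsonr(x, y))   # noticeably below 1: the relationship is not linear
    print(spearmanr(x, y))  # 1.0: the relationship is perfectly monotonic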

This is done in two steps:

  1. The correlation between each regressor and the target is computed, that is, ((X[:, i] - mean(X[:, i])) * (y - mean_y)) / (std(X[:, i]) * std(y)).

  2. It is converted to an F score and then to a p-value (a rough sketch of this computation follows the list).
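
The sketch below is a rough reconstruction of those two steps with NumPy and SciPy (my own approximation of what f_regression computes; the library internals may differ in detail):

    import numpy as np
    from scipy import stats

    def correlation_to_f(X, y):
        n = X.shape[0]
        # step 1: Pearson correlation between each column of X and y
        Xc = X - X.mean(axis=0)
        yc = y - y.mean()
        r = (Xc * yc[:, None]).sum(axis=0) / np.sqrt(
            (Xc ** 2).sum(axis=0) * (yc ** 2).sum())
        # step 2: convert the correlation to an F score, then to a p-value
        dof = n - 2                    # degrees of freedom
        f = r ** 2 / (1 - r ** 2) * dof
        p = stats.f.sf(f, 1, dof)      # right tail of the F(1, dof) distribution
        return f, p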

Mutual information originates from the field of information theory. The idea is that information gain (typically used in the construction of decision trees) is applied in order to perform the feature selection. Mutual information is calculated between two variables and measures the reduction in uncertainty for one variable given a known value of the other variable.
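
A toy sketch of why this matters (my own illustration): for a purely quadratic relationship, the Pearson correlation is close to zero while the estimated mutual information is clearly positive:

    import numpy as np
    from scipy.stats import pearsonr
    from sklearn.feature_selection import mutual_info_regression

    rng = np.random.default_rng(0)
    x = rng.uniform(-3, 3, size=500)
    y = x ** 2 + rng.normal(scale=0.1, size=500)  # nonlinear and non-monotonic

    print(pearsonr(x, y)[0])  # close to 0: no linear trend
    print(mutual_info_regression(x.reshape(-1, 1), y, random_state=0))  # clearly > 0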

3. The dataset

We will use the Boston house-prices dataset. This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Mass. The dataset consists of the following variables:

  1. CRIM - per capita crime rate by town

  2. ZN - proportion of residential land zoned for lots over 25,000 sq.ft.

  3. INDUS - proportion of non-retail business acres per town

  4. CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)

  5. NOX - nitric oxides concentration (parts per 10 million)

  6. RM - average number of rooms per dwelling

  7. AGE - proportion of owner-occupied units built prior to 1940

  8. DIS - weighted distances to five Boston employment centres

  9. RAD - index of accessibility to radial highways

  10. TAX - full-value property-tax rate per $10,000

  11. PTRATIO - pupil-teacher ratio by town

  12. B - 1000(Bk - 0.63)², where Bk is the proportion of blacks by town

  13. LSTAT - % lower status of the population

  14. MEDV - median value of owner-occupied homes in $1000's

4. Python Code & Working Example

Let’s load and split the dataset into training (70%) and test (30%) sets.


    from sklearn.datasets import load_boston
    from sklearn.model_selection import train_test_split
    from sklearn.feature_selection import SelectKBest
    from sklearn.feature_selection import f_regression
    from sklearn.feature_selection import mutual_info_regression
    import matplotlib.pyplot as plt

    # load the data (note: load_boston was removed in scikit-learn 1.2,
    # so this snippet requires an older scikit-learn version)
    X, y = load_boston(return_X_y=True)

    # split into train (70%) and test (30%) sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=0)

We will use the well-known scikit-learn machine learning library.

Case 1: Feature selection using the Correlation metric

For the correlation statistic we will use the f_regression() function. This function can be used in a feature selection strategy, such as selecting the top k most relevant features (those with the largest values) via the SelectKBest class.

    # feature selection
    f_selector = SelectKBest(score_func=f_regression, k='all')
    # learn relationship from training data
    f_selector.fit(X_train, y_train)
    # transform train input data
    X_train_fs = f_selector.transform(X_train)
    # transform test input data
    X_test_fs = f_selector.transform(X_test)
    # plot the scores for the features
    plt.bar([i for i in range(len(f_selector.scores_))], f_selector.scores_)
    plt.xlabel("feature index")
    plt.ylabel("F-value (transformed from the correlation values)")
    plt.show()

Reminder: for the correlation statistic case:

  1. The correlation between each regressor and the target is computed, that is, ((X[:, i] - mean(X[:, i])) * (y - mean_y)) / (std(X[:, i]) * std(y)).

  2. It is converted to an F score and then to a p-value.
[Image: Feature Importance plot]

The plot above shows that features 6 and 13 are more important than the other features. The y-axis represents the F-values that were estimated from the correlation values.
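
To relate those indices back to the variables listed in section 3, we can print the scores next to the feature names and keep only the top k features. A short sketch (the feature_names list is typed out from the dataset description above; it is not returned by load_boston(return_X_y=True)):

    # map scores to the Boston feature names (the target MEDV is excluded)
    feature_names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                     "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]
    for name, score in sorted(zip(feature_names, f_selector.scores_),
                              key=lambda t: t[1], reverse=True):
        print(f"{name}: F-value = {score:.2f}")

    # keep only the k most relevant features instead of all of them
    fs = SelectKBest(score_func=f_regression, k=2)
    X_train_top2 = fs.fit_transform(X_train, y_train)
    X_test_top2 = fs.transform(X_test)
    print(fs.get_support(indices=True))  # indices of the selected features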

Case 2: Feature selection using the Mutual Information metric

The scikit-learn machine learning library provides an implementation of mutual information for feature selection with numeric input and output variables via the mutual_info_regression() function.


    # feature selection
    f_selector = SelectKBest(score_func=mutual_info_regression, k='all')
    # learn relationship from training data
    f_selector.fit(X_train, y_train)
    # transform train input data
    X_train_fs = f_selector.transform(X_train)
    # transform test input data
    X_test_fs = f_selector.transform(X_test)
    # plot the scores for the features
    plt.bar([i for i in range(len(f_selector.scores_))], f_selector.scores_)
    plt.xlabel("feature index")
    plt.ylabel("Estimated MI value")
    plt.show()
[Image: Feature Importance plot]

The y-axis represents the estimated mutual information between each feature and the target variable. Compared to the correlation feature selection method, we can clearly see that many more features are scored as relevant. This may be because of statistical noise that might exist in the dataset.
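
To compare the two rankings directly, we can call the two scoring functions themselves on the training data. A minimal sketch (my own addition, reusing X_train and y_train from above):

    import numpy as np
    from sklearn.feature_selection import f_regression, mutual_info_regression

    # score every feature under both metrics and compare the orderings
    f_scores, _ = f_regression(X_train, y_train)
    mi_scores = mutual_info_regression(X_train, y_train, random_state=0)

    print("top 5 features by F-score:", np.argsort(f_scores)[::-1][:5])
    print("top 5 features by MI:     ", np.argsort(mi_scores)[::-1][:5])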

5. Conclusion

In this article I have presented two ways to perform feature selection. Feature selection is the procedure of selecting a subset (some out of all available) of the input variables that are most relevant to the target variable. The target variable here refers to the variable that we wish to predict.

Using either the Correlation metric or the Mutual Information metric, we can easily estimate the relationship between each input variable and the target variable.

Correlation vs Mutual Information: compared to the correlation feature selection method, the mutual information method scored many more features as relevant. This may be because of statistical noise that might exist in the dataset.


Stay tuned & support this effort

If you liked this article and found it useful, follow me to be able to see all my new posts.

Questions? Post them as a comment and I will reply as soon as possible.



Get in touch with me

LinkedIn: https://www.linkedin.com/in/serafeim-loukas/

ResearchGate: https://www.researchgate.net/profile/Serafeim_Loukas

EPFL profile: https://people.epfl.ch/serafeim.loukas

Stack Overflow: https://stackoverflow.com/users/5025009/seralouk

Translated from: https://towardsdatascience.com/how-to-perform-feature-selection-for-regression-problems-c928e527bbfa
