Introduction to Ensembling/Stacking in Python

Titanic

https://www.kaggle.com/arthurtok/introduction-to-ensembling-stacking-in-python

Understanding the code

1. Preprocessing

1. Loading libraries

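    # pd, np, sns and plt are used by the snippets further down
    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    import xgboost as xgb

    import plotly.offline as py
    py.init_notebook_mode(connected=True)  # enable inline Plotly output in the notebook
    import plotly.graph_objs as go
    import plotly.tools as tls

    import warnings
    warnings.filterwarnings('ignore')  # silence library warnings in the notebook output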

Going to use these 5 base models for the stacking

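From sklearn.ensemble, import the ensemble classifiers (random forest, AdaBoost, gradient boosting, and extremely randomized trees); from sklearn.svm, import the support vector classifier; and from sklearn.model_selection, import the K-fold cross-validation splitter.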
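The imports described above are not shown in these notes. A minimal sketch of what they might look like, assuming a scikit-learn version that provides `sklearn.model_selection` (the list name `base_model_classes` is only for illustration):

```python
# Five base classifiers for the first level, plus the K-fold splitter
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              GradientBoostingClassifier, ExtraTreesClassifier)
from sklearn.svm import SVC
from sklearn.model_selection import KFold

# Hypothetical helper list, just to show the five classes side by side
base_model_classes = [RandomForestClassifier, ExtraTreesClassifier,
                      AdaBoostClassifier, GradientBoostingClassifier, SVC]
```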

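When a feature has a high proportion of missing values (here Cabin), encode whether a value is present instead of the value itself: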

     train['Has_Cabin'] = train["Cabin"].apply(lambda x: 0 if type(x) == float else 1)
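The `type(x) == float` test works because a missing Cabin is stored as NaN, which is a float, while real cabin values are strings. A small sketch on made-up data, showing that `notna()` gives the same indicator more directly:

```python
import numpy as np
import pandas as pd

# Toy data: two known cabins and two missing ones (values are made up)
toy = pd.DataFrame({"Cabin": ["C85", np.nan, "E46", np.nan]})

# The approach used in the notes: NaN is a float, strings are not
by_type = toy["Cabin"].apply(lambda x: 0 if type(x) == float else 1)

# Equivalent and arguably clearer: test for missingness explicitly
by_notna = toy["Cabin"].notna().astype(int)
```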

Binning

     train['CategoricalFare'] = pd.qcut(train['Fare'], 4)

     dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] = 0
     dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
     dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare'] = 2
     dataset.loc[ dataset['Fare'] > 31, 'Fare'] = 3
     dataset['Fare'] = dataset['Fare'].astype(int)
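As an alternative to the four `.loc` assignments, the same bucketing can be sketched with `pd.cut`. This is a swapped-in technique, not what the notes use; the bin edges are taken from the snippet above and the fares are made up:

```python
import pandas as pd

# Made-up fares, including values that sit exactly on the bin edges
fares = pd.Series([5.0, 7.91, 10.0, 14.454, 20.0, 31.0, 50.0, 512.33])

# Right-closed bins: (-inf, 7.91], (7.91, 14.454], (14.454, 31], (31, inf)
edges = [-float("inf"), 7.91, 14.454, 31, float("inf")]
fare_code = pd.cut(fares, bins=edges, labels=[0, 1, 2, 3]).astype(int)
```

Because `pd.cut` closes intervals on the right by default, a fare of exactly 7.91 lands in bucket 0, matching the `<=` comparisons in the original code.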

Filling missing values (with random integers drawn within one standard deviation of the mean)

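     for dataset in full_data:
         age_avg = dataset['Age'].mean()
         age_std = dataset['Age'].std()
         age_null_count = dataset['Age'].isnull().sum()
         # Random integer ages in [mean - std, mean + std); bounds cast to int for randint
         age_null_random_list = np.random.randint(int(age_avg - age_std), int(age_avg + age_std), size=age_null_count)
         # Use .loc instead of chained indexing so the assignment hits the frame itself
         dataset.loc[dataset['Age'].isnull(), 'Age'] = age_null_random_list
         dataset['Age'] = dataset['Age'].astype(int)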
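The imputation step can be tried in isolation. The sketch below is an illustration on made-up ages, not the notebook's code: the function name `fill_age_randomly` is hypothetical, and it uses a seeded NumPy `Generator` so the result is reproducible.

```python
import numpy as np
import pandas as pd

def fill_age_randomly(df, rng):
    """Fill missing Age values with random integers in [mean - std, mean + std)."""
    df = df.copy()
    missing = df["Age"].isnull()
    low = int(df["Age"].mean() - df["Age"].std())
    high = int(df["Age"].mean() + df["Age"].std())
    df.loc[missing, "Age"] = rng.integers(low, high, size=int(missing.sum()))
    df["Age"] = df["Age"].astype(int)
    return df, (low, high)

rng = np.random.default_rng(0)  # seeded for reproducibility
toy = pd.DataFrame({"Age": [22.0, np.nan, 38.0, np.nan, 54.0, 30.0]})
filled, (low, high) = fill_age_randomly(toy, rng)
```

Note that the mean and standard deviation are computed from the known ages only, since pandas skips NaN by default.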

Replacing values

    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')

Encoding with a mapping

    dataset['Sex'] = dataset['Sex'].map( {'female': 0, 'male': 1} ).astype(int)
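The notes do not show where the Title column comes from. One plausible way, sketched here as an assumption, is to pull the word before the first period out of the passenger name with a regular expression, then apply the replace and map steps shown above. The helper `get_title` and the sample names are illustrative only.

```python
import re
import pandas as pd

def get_title(name):
    """Return the title (the word followed by a period) from a passenger name."""
    match = re.search(r" ([A-Za-z]+)\.", name)
    return match.group(1) if match else ""

toy = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina", "Sagesser, Mlle. Emma"],
    "Sex": ["male", "female", "female"],
})
toy["Title"] = toy["Name"].apply(get_title)
toy["Title"] = toy["Title"].replace("Mlle", "Miss")              # merge a rare spelling
toy["Sex"] = toy["Sex"].map({"female": 0, "male": 1}).astype(int)  # encode as integers
```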

Heatmap (to spot redundant variables)

     colormap = plt.cm.RdBu

     plt.figure(figsize=(14,12))
     plt.title('Pearson Correlation of Features', y=1.05, size=15)
     sns.heatmap(train.astype(float).corr(),linewidths=0.1,vmax=1.0, 
             square=True, cmap=colormap, linecolor='white', annot=True)
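The heatmap is read by eye. As a complement, the same correlation matrix can be scanned for strongly correlated pairs in code. This is a minimal sketch on made-up data; the function name `correlated_pairs` and the 0.9 threshold are assumptions, not part of the notebook.

```python
import pandas as pd

def correlated_pairs(df, threshold=0.9):
    """List column pairs whose absolute Pearson correlation reaches the threshold."""
    corr = df.astype(float).corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] >= threshold:
                pairs.append((cols[i], cols[j]))
    return pairs

# Toy data: FamilySize is derived from SibSp and Parch, so it is partly redundant
toy = pd.DataFrame({"SibSp": [1, 0, 3, 0, 1, 2], "Parch": [0, 0, 2, 1, 1, 0]})
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1
toy["IsAlone"] = (toy["FamilySize"] == 1).astype(int)
pairs = correlated_pairs(toy, threshold=0.9)
```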

2. Ensembling & Stacking models

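    # Some useful parameters which will come in handy later on
    ntrain = train.shape[0]
    ntest = test.shape[0]
    SEED = 0 # for reproducibility
    NFOLDS = 5 # set folds for out-of-fold prediction
    # sklearn.model_selection API: materialize the (train_index, test_index) pairs
    # so the folds can be iterated directly; no shuffling, as in the original setup
    kf = list(KFold(n_splits=NFOLDS).split(np.arange(ntrain)))
    
    # Class to extend the Sklearn classifier
    class SklearnHelper(object):
        def __init__(self, clf, seed=0, params=None):
            params = dict(params or {})  # copy so the caller's dict is not mutated
            params['random_state'] = seed
            self.clf = clf(**params)
    
        def train(self, x_train, y_train):
            self.clf.fit(x_train, y_train)
    
        def predict(self, x):
            return self.clf.predict(x)
    
        def fit(self, x, y):
            return self.clf.fit(x, y)
    
        def feature_importances(self, x, y):
            # Refit on the given data, then print and return the importances
            importances = self.clf.fit(x, y).feature_importances_
            print(importances)
            return importances
    
    # Class to extend the XGBoost classifier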

Note: this sets up 5-fold cross-validation and defines a wrapper class that exposes train, predict, fit, and feature-importance methods.

Out-of-Fold Predictions

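The reason for out-of-fold predictions is that in stacking the base classifiers' predictions become the training data for the second-level model. One therefore cannot simply train the base models on the full training set, generate predictions on the full test set, and pass these on for second-level training: this risks the base-model predictions having already "seen" the test set, so the second-level model overfits when fed these predictions. Out-of-fold prediction avoids this by predicting each training row with a model that was fitted without that row.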

    def get_oof(clf, x_train, y_train, x_test):
        oof_train = np.zeros((ntrain,))
        oof_test = np.zeros((ntest,))
        oof_test_skf = np.empty((NFOLDS, ntest))
    
        for i, (train_index, test_index) in enumerate(kf):
            x_tr = x_train[train_index]
            y_tr = y_train[train_index]
            x_te = x_train[test_index]
    
            clf.train(x_tr, y_tr)
    
            oof_train[test_index] = clf.predict(x_te)
            oof_test_skf[i, :] = clf.predict(x_test)
    
        oof_test[:] = oof_test_skf.mean(axis=0)
        return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1)
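To see the out-of-fold idea run end to end, here is a self-contained sketch on synthetic data. It is an illustration under assumptions rather than the notebook's code: it uses `KFold.split` from `sklearn.model_selection`, a decision tree as a stand-in base model, and the hypothetical name `get_oof_modern`.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def get_oof_modern(model, x_train, y_train, x_test, n_splits=5):
    """Out-of-fold predictions for the training set, fold-averaged for the test set."""
    kf = KFold(n_splits=n_splits)
    oof_train = np.zeros(x_train.shape[0])
    oof_test_folds = np.empty((n_splits, x_test.shape[0]))
    for i, (tr_idx, val_idx) in enumerate(kf.split(x_train)):
        model.fit(x_train[tr_idx], y_train[tr_idx])             # fit on the other folds
        oof_train[val_idx] = model.predict(x_train[val_idx])    # predict the held-out fold
        oof_test_folds[i, :] = model.predict(x_test)            # predict the whole test set
    oof_test = oof_test_folds.mean(axis=0)                      # average over the folds
    return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1)

# Synthetic data: the label depends only on the sign of the first feature
rng = np.random.default_rng(0)
x_tr = rng.normal(size=(50, 3))
y_tr = (x_tr[:, 0] > 0).astype(int)
x_te = rng.normal(size=(20, 3))

oof_tr, oof_te = get_oof_modern(DecisionTreeClassifier(random_state=0), x_tr, y_tr, x_te)
```

Every training row gets exactly one prediction, made by a model that never saw it, while each test row receives the mean of five predictions and can therefore be fractional.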

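First, three arrays are defined: oof_train, oof_test, and oof_test_skf, which hold the predictions from the corresponding stages. oof_test_skf stores the test-set predictions from each of the 5 folds, one row per fold. With K-fold cross-validation (k=5), each iteration trains the model on four folds, predicts the held-out fold to fill the matching entries of oof_train, and predicts the full test set; the five test-set predictions are then averaged into oof_test.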

Generating our Base First-Level Models

Five models are used as the first-level classifiers.

      # Put in our parameters for said classifiers

      # Random Forest parameters
      rf_params = {
          'n_jobs': -1,
          'n_estimators': 500,
          # Caution: with warm_start=True, repeated fit() calls keep the trees
          # already built, so later folds may not be refit from scratch
          'warm_start': True,
          #'max_features': 0.2,
          'max_depth': 6,
          'min_samples_leaf': 2,
          'max_features': 'sqrt',
          'verbose': 0
      }

      # Extra Trees Parameters
      et_params = {
          'n_jobs': -1,
          'n_estimators': 500,
          #'max_features': 0.5,
          'max_depth': 8,
          'min_samples_leaf': 2,
          'verbose': 0
      }

      # AdaBoost parameters
      ada_params = {
          'n_estimators': 500,
          'learning_rate': 0.75
      }

      # Gradient Boosting parameters
      gb_params = {
          'n_estimators': 500,
          #'max_features': 0.2,
          'max_depth': 5,
          'min_samples_leaf': 2,
          'verbose': 0
      }

      # Support Vector Classifier parameters
      svc_params = {
          'kernel': 'linear',
          'C': 0.025
      }

Next, create five objects, one for each base model.

      rf = SklearnHelper(clf=RandomForestClassifier, seed=SEED, params=rf_params)

      et = SklearnHelper(clf=ExtraTreesClassifier, seed=SEED, params=et_params)
      ada = SklearnHelper(clf=AdaBoostClassifier, seed=SEED, params=ada_params)
      gb = SklearnHelper(clf=GradientBoostingClassifier, seed=SEED, params=gb_params)
      svc = SklearnHelper(clf=SVC, seed=SEED, params=svc_params)

With the first-level models created, prepare the training and test data.

      # Create Numpy arrays of train, test and target (Survived) dataframes to feed into our models
      y_train = train['Survived'].values
      train = train.drop(['Survived'], axis=1)
      x_train = train.values # Creates an array of the train data
      x_test = test.values # Creates an array of the test data

First-level predictions

      # Create our OOF train and test predictions. These base results will be used as new features

      et_oof_train, et_oof_test = get_oof(et, x_train, y_train, x_test) # Extra Trees
      rf_oof_train, rf_oof_test = get_oof(rf,x_train, y_train, x_test) # Random Forest
      ada_oof_train, ada_oof_test = get_oof(ada, x_train, y_train, x_test) # AdaBoost 
      gb_oof_train, gb_oof_test = get_oof(gb,x_train, y_train, x_test) # Gradient Boost
      svc_oof_train, svc_oof_test = get_oof(svc,x_train, y_train, x_test) # Support Vector Classifier
      
      print("Training is complete")

Inspecting feature importances

    rf_feature = rf.feature_importances(x_train,y_train)
    et_feature = et.feature_importances(x_train, y_train)
    ada_feature = ada.feature_importances(x_train, y_train)
    gb_feature = gb.feature_importances(x_train,y_train)

The recorded output:

      rf_features = [0.10474135, 0.21837029, 0.04432652, 0.02249159, 0.05432591, 0.02854371,
                     0.07570305, 0.01088129, 0.24247496, 0.13685733, 0.06128402]
      et_features = [0.12165657, 0.37098307, 0.03129623, 0.01591611, 0.05525811, 0.028157,
                     0.04589793, 0.02030357, 0.17289562, 0.04853517, 0.08910063]
      ada_features = [0.028, 0.008, 0.012, 0.05866667, 0.032, 0.008,
                      0.04666667, 0., 0.05733333, 0.73866667, 0.01066667]
      gb_features = [0.06796144, 0.03889349, 0.07237845, 0.02628645, 0.11194395, 0.04778854,
                     0.05965792, 0.02774745, 0.07462718, 0.4593142, 0.01340093]

Put them into a DataFrame

      cols = train.columns.values

      # Create a dataframe with features
      feature_dataframe = pd.DataFrame( {'features': cols,
       'Random Forest feature importances': rf_features,
       'Extra Trees  feature importances': et_features,
        'AdaBoost feature importances': ada_features,
      'Gradient Boost feature importances': gb_features
      })

Next, plot the importance of every feature for each model.

      # Scatter plot

      trace = go.Scatter(
      y = feature_dataframe['Random Forest feature importances'].values,
      x = feature_dataframe['features'].values,
      mode='markers',
      marker=dict(
          sizemode = 'diameter',
          sizeref = 1,
          size = 25,
      #       size= feature_dataframe['AdaBoost feature importances'].values,
          #color = np.random.randn(500), #set color equal to a variable
          color = feature_dataframe['Random Forest feature importances'].values,
          colorscale='Portland',
          showscale=True
      ),
      text = feature_dataframe['features'].values
      )
      data = [trace]
      
      layout= go.Layout(
      autosize= True,
      title= 'Random Forest Feature Importance',
      hovermode= 'closest',
      #     xaxis= dict(
      #         title= 'Pop',
      #         ticklen= 5,
      #         zeroline= False,
      #         gridwidth= 2,
      #     ),
      yaxis=dict(
          title= 'Feature Importance',
          ticklen= 5,
          gridwidth= 2
      ),
      showlegend= False
      )
      fig = go.Figure(data=data, layout=layout)
      py.iplot(fig,filename='scatter2010')
      
      # Scatter plot 
      trace = go.Scatter(
      y = feature_dataframe['Extra Trees  feature importances'].values,
      x = feature_dataframe['features'].values,
      mode='markers',
      marker=dict(
          sizemode = 'diameter',
          sizeref = 1,
          size = 25,
      #       size= feature_dataframe['AdaBoost feature importances'].values,
          #color = np.random.randn(500), #set color equal to a variable
          color = feature_dataframe['Extra Trees  feature importances'].values,
          colorscale='Portland',
          showscale=True
      ),
      text = feature_dataframe['features'].values
      )
      data = [trace]
      
      layout= go.Layout(
      autosize= True,
      title= 'Extra Trees Feature Importance',
      hovermode= 'closest',
      #     xaxis= dict(
      #         title= 'Pop',
      #         ticklen= 5,
      #         zeroline= False,
      #         gridwidth= 2,
      #     ),
      yaxis=dict(
          title= 'Feature Importance',
          ticklen= 5,
          gridwidth= 2
      ),
      showlegend= False
      )
      fig = go.Figure(data=data, layout=layout)
      py.iplot(fig,filename='scatter2010')
      
      # Scatter plot 
      trace = go.Scatter(
      y = feature_dataframe['AdaBoost feature importances'].values,
      x = feature_dataframe['features'].values,
      mode='markers',
      marker=dict(
          sizemode = 'diameter',
          sizeref = 1,
          size = 25,
      #       size= feature_dataframe['AdaBoost feature importances'].values,
          #color = np.random.randn(500), #set color equal to a variable
          color = feature_dataframe['AdaBoost feature importances'].values,
          colorscale='Portland',
          showscale=True
      ),
      text = feature_dataframe['features'].values
      )
      data = [trace]
      
      layout= go.Layout(
      autosize= True,
      title= 'AdaBoost Feature Importance',
      hovermode= 'closest',
      #     xaxis= dict(
      #         title= 'Pop',
      #         ticklen= 5,
      #         zeroline= False,
      #         gridwidth= 2,
      #     ),
      yaxis=dict(
          title= 'Feature Importance',
          ticklen= 5,
          gridwidth= 2
      ),
      showlegend= False
      )
      fig = go.Figure(data=data, layout=layout)
      py.iplot(fig,filename='scatter2010')
      
      # Scatter plot 
      trace = go.Scatter(
      y = feature_dataframe['Gradient Boost feature importances'].values,
      x = feature_dataframe['features'].values,
      mode='markers',
      marker=dict(
          sizemode = 'diameter',
          sizeref = 1,
          size = 25,
      #       size= feature_dataframe['AdaBoost feature importances'].values,
          #color = np.random.randn(500), #set color equal to a variable
          color = feature_dataframe['Gradient Boost feature importances'].values,
          colorscale='Portland',
          showscale=True
      ),
      text = feature_dataframe['features'].values
      )
      data = [trace]
      
      layout= go.Layout(
      autosize= True,
      title= 'Gradient Boosting Feature Importance',
      hovermode= 'closest',
      #     xaxis= dict(
      #         title= 'Pop',
      #         ticklen= 5,
      #         zeroline= False,
      #         gridwidth= 2,
      #     ),
      yaxis=dict(
          title= 'Feature Importance',
          ticklen= 5,
          gridwidth= 2
      ),
      showlegend= False
      )
      fig = go.Figure(data=data, layout=layout)
      py.iplot(fig,filename='scatter2010')
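The four plotting blocks above differ only in the column they read and the title they show. A possible way to factor that out is sketched below. It builds the figure as plain dicts so that it runs without Plotly installed; the helper name `importance_scatter_spec` and the toy importance values are assumptions for illustration.

```python
import pandas as pd

def importance_scatter_spec(df, column, title):
    """Build a Plotly-style figure spec (plain dicts) for one importance column."""
    trace = dict(
        type="scatter",
        x=df["features"].tolist(),
        y=df[column].tolist(),
        mode="markers",
        marker=dict(sizemode="diameter", sizeref=1, size=25,
                    color=df[column].tolist(), colorscale="Portland", showscale=True),
        text=df["features"].tolist(),
    )
    layout = dict(autosize=True, title=title, hovermode="closest",
                  yaxis=dict(title="Feature Importance", ticklen=5, gridwidth=2),
                  showlegend=False)
    return dict(data=[trace], layout=layout)

# Toy frame with made-up importances, using the column names from the notes
toy = pd.DataFrame({
    "features": ["Sex", "Age", "Fare"],
    "Random Forest feature importances": [0.10, 0.22, 0.04],
    "Extra Trees  feature importances": [0.12, 0.37, 0.03],
})
titles = {
    "Random Forest feature importances": "Random Forest Feature Importance",
    "Extra Trees  feature importances": "Extra Trees Feature Importance",
}
specs = [importance_scatter_spec(toy, col, title) for col, title in titles.items()]
```

In the notebook, each spec should be usable in place of the `go.Figure` object, for example `py.iplot(spec)`, since Plotly's offline plotting functions also take a figure described as a dict.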

Then take the mean across the models to get an overall importance for each feature.

      # Create the new column containing the average of values
      # numeric_only=True skips the string 'features' column
      feature_dataframe['mean'] = feature_dataframe.mean(axis=1, numeric_only=True) # axis = 1 computes the mean row-wise
      feature_dataframe.head(3)

Plotting

    y = feature_dataframe['mean'].values
    x = feature_dataframe['features'].values
    data = [go.Bar(
            x= x,
             y= y,
            width = 0.5,
            marker=dict(
               color = feature_dataframe['mean'].values,
            colorscale='Portland',
            showscale=True,
            reversescale = False
            ),
            opacity=0.6
        )]
    
    layout= go.Layout(
    autosize= True,
    title= 'Barplots of Mean Feature Importance',
    hovermode= 'closest',
    #     xaxis= dict(
    #         title= 'Pop',
    #         ticklen= 5,
    #         zeroline= False,
    #         gridwidth= 2,
    #     ),
    yaxis=dict(
        title= 'Feature Importance',
        ticklen= 5,
        gridwidth= 2
    ),
    showlegend= False
    )
    fig = go.Figure(data=data, layout=layout)
    py.iplot(fig, filename='bar-direct-labels')

Second-level input from the first-level output

The first-level outputs serve as the features for the second level.

    base_predictions_train = pd.DataFrame( {'RandomForest': rf_oof_train.ravel(),

       'ExtraTrees': et_oof_train.ravel(),
       'AdaBoost': ada_oof_train.ravel(),
        'GradientBoost': gb_oof_train.ravel()
      })
      base_predictions_train.head()

Check the correlations among the training-set predictions

      data = [

      go.Heatmap(
          z= base_predictions_train.astype(float).corr().values ,
          x=base_predictions_train.columns.values,
          y= base_predictions_train.columns.values,
            colorscale='Viridis',
              showscale=True,
              reversescale = True
      )
      ]
      py.iplot(data, filename='labelled-heatmap')

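Note: strongly correlated features make the data redundant, which hurts the final result; stacking tends to benefit more from base models whose predictions are less correlated with one another.

Build the new training and test sets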

      x_train = np.concatenate(( et_oof_train, rf_oof_train, ada_oof_train, gb_oof_train, svc_oof_train), axis=1)

      x_test = np.concatenate(( et_oof_test, rf_oof_test, ada_oof_test, gb_oof_test, svc_oof_test), axis=1)
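The concatenation places each base model's predictions in its own column, so the second-level model sees one feature per base model. A tiny sketch with made-up predictions for three hypothetical models and four rows:

```python
import numpy as np

# Made-up out-of-fold predictions, one (n_rows, 1) column per base model
et_col = np.array([0, 1, 1, 0]).reshape(-1, 1)
rf_col = np.array([0, 1, 0, 0]).reshape(-1, 1)
ada_col = np.array([1, 1, 1, 0]).reshape(-1, 1)

# axis=1 stacks the columns side by side: rows stay aligned with the passengers
stacked = np.concatenate((et_col, rf_col, ada_col), axis=1)
```

With the five base models in the notes, the resulting x_train has shape (ntrain, 5) and x_test has shape (ntest, 5).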

Build the second-level model with XGBoost

    gbm = xgb.XGBClassifier(
    #learning_rate = 0.02,
     n_estimators= 2000,
     max_depth= 4,
     min_child_weight= 2,
     #gamma=1,
     gamma=0.9,                        
     subsample=0.8,
     colsample_bytree=0.8,
     objective= 'binary:logistic',
     nthread= -1,
     scale_pos_weight=1).fit(x_train, y_train)
    predictions = gbm.predict(x_test)

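Parameter notes

max_depth: the maximum depth of each tree; too large a value can make the model overly complex and prone to overfitting
gamma: the minimum loss reduction required to split a leaf node further; the larger the value, the more conservative the algorithm
eta: the learning rate (exposed as learning_rate in XGBClassifier, commented out in the code above); smaller steps help prevent overfitting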

Submission

      StackingSubmission = pd.DataFrame({ 'PassengerId': PassengerId,

                              'Survived': predictions })
      StackingSubmission.to_csv("StackingSubmission.csv", index=False)
