
Introduction to Ensembling/Stacking in Python


Titanic

https://www.kaggle.com/arthurtok/introduction-to-ensembling-stacking-in-python

Code walkthrough

1. Preprocessing

1. Loading libraries

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    import plotly.offline as py
    py.init_notebook_mode(connected=True)
    import plotly.graph_objs as go
    import plotly.tools as tls

    import warnings
    warnings.filterwarnings('ignore')
    import xgboost as xgb

Going to use these 5 base models for the stacking

From sklearn.ensemble we import the ensemble classifiers (Random Forest, AdaBoost, Gradient Boosting, and Extra Trees), from sklearn.svm the support vector machine, and from sklearn.model_selection the K-fold cross-validation utilities.
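These imports are not shown in the snippet above; for reference, they look like this (matching the classes used later in this post):

    from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                                  GradientBoostingClassifier, ExtraTreesClassifier)
    from sklearn.svm import SVC
    from sklearn.model_selection import KFold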

When a feature has a high proportion of missing values (here, Cabin), it can be collapsed into a binary has/has-not indicator:

    # NaN is of type float, so a missing Cabin maps to 0, a present one to 1
    train['Has_Cabin'] = train["Cabin"].apply(lambda x: 0 if type(x) == float else 1)

Binning: pd.qcut splits Fare into four quantile buckets, whose boundaries are then hard-coded into ordinal codes:

    # Quantile boundaries come from qcut on the training set...
    train['CategoricalFare'] = pd.qcut(train['Fare'], 4)

    # ...and are then applied as explicit thresholds to each dataset
    dataset.loc[ dataset['Fare'] <= 7.91, 'Fare']                               = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare']   = 2
    dataset.loc[ dataset['Fare'] > 31, 'Fare']                                  = 3
    dataset['Fare'] = dataset['Fare'].astype(int)
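For reference, pd.qcut can produce the integer codes directly via its labels argument; a minimal sketch (the kernel keeps the explicit thresholds above so the same cut-points can be reused on the test set):

    # Sketch: one-step quantile coding (0-3) on the training data
    train['Fare'] = pd.qcut(train['Fare'], 4, labels=False)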

Filling missing values (random draws centered on the mean, spread by the standard deviation):

    for dataset in full_data:
        age_avg = dataset['Age'].mean()
        age_std = dataset['Age'].std()
        age_null_count = dataset['Age'].isnull().sum()
        # One random integer in [mean - std, mean + std) per missing entry
        age_null_random_list = np.random.randint(age_avg - age_std, age_avg + age_std, size=age_null_count)
        # .loc avoids pandas' chained-assignment pitfall
        dataset.loc[np.isnan(dataset['Age']), 'Age'] = age_null_random_list
        dataset['Age'] = dataset['Age'].astype(int)
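A lower-variance alternative, not used by this kernel but common in Titanic notebooks, is a plain median fill:

    # Sketch: deterministic median imputation instead of random draws
    dataset['Age'] = dataset['Age'].fillna(dataset['Age'].median()).astype(int)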

Replacing title variants:

    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
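This is one of several such substitutions; the original kernel also pools the rare titles into a single 'Rare' bucket, along these lines:

    # Normalise French titles and pool rare ones (as in the original kernel)
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess', 'Capt', 'Col',
        'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')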

Encoding: mapping categories to integers:

    dataset['Sex'] = dataset['Sex'].map( {'female': 0, 'male': 1} ).astype(int)
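The kernel applies the same pattern to the other categoricals; e.g. for Title, a sketch along the lines of the original:

    # Map titles to integers; titles missing from the mapping are coded 0
    title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
    dataset['Title'] = dataset['Title'].map(title_mapping).fillna(0).astype(int)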

Correlation heatmap (to spot redundant variables):

    colormap = plt.cm.RdBu
    plt.figure(figsize=(14, 12))
    plt.title('Pearson Correlation of Features', y=1.05, size=15)
    sns.heatmap(train.astype(float).corr(), linewidths=0.1, vmax=1.0,
                square=True, cmap=colormap, linecolor='white', annot=True)

2. Ensembling & Stacking models

    # Some useful parameters which will come in handy later on
    ntrain = train.shape[0]
    ntest = test.shape[0]
    SEED = 0 # for reproducibility
    NFOLDS = 5 # set folds for out-of-fold prediction
    # sklearn.model_selection API; the original kernel used the now-removed
    # sklearn.cross_validation.KFold(ntrain, n_folds=NFOLDS, random_state=SEED)
    kf = KFold(n_splits=NFOLDS, shuffle=True, random_state=SEED)

    # Class to extend the Sklearn classifier
    class SklearnHelper(object):
        def __init__(self, clf, seed=0, params=None):
            params['random_state'] = seed
            self.clf = clf(**params)

        def train(self, x_train, y_train):
            self.clf.fit(x_train, y_train)

        def predict(self, x):
            return self.clf.predict(x)

        def fit(self, x, y):
            return self.clf.fit(x, y)

        def feature_importances(self, x, y):
            importances = self.clf.fit(x, y).feature_importances_
            print(importances)
            return importances  # returned as well, so the values can be reused below

    # Class to extend the XGBoost classifier (hinted at but never defined in the kernel)

Note: 5-fold cross-validation is used, and a helper class wraps the training, prediction, fitting, and feature-importance methods so all five sklearn models can be driven uniformly.
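The stray comment at the end of the block hints at an XGBoost wrapper the kernel never defines; a minimal sketch of what it could look like (hypothetical, not part of the original kernel):

    # Hypothetical sketch: wrapping XGBoost the same way as SklearnHelper
    class XgbHelper(object):
        def __init__(self, seed=0, params=None):
            params = dict(params or {})
            params['random_state'] = seed
            self.clf = xgb.XGBClassifier(**params)

        def train(self, x_train, y_train):
            self.clf.fit(x_train, y_train)

        def predict(self, x):
            return self.clf.predict(x)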

Out-of-Fold Predictions

The reason is that in stacking, the base classifiers' predictions are used as the training data for the second-level model. Two things must therefore be avoided in practice: training the base models on the complete training set and predicting on that same data; and generating test-set predictions and feeding them back as second-level training input. Either way, the base-model predictions would have already "seen" the data they are supposed to generalise to, causing overfitting.

    def get_oof(clf, x_train, y_train, x_test):
        oof_train = np.zeros((ntrain,))
        oof_test = np.zeros((ntest,))
        oof_test_skf = np.empty((NFOLDS, ntest))

        # kf.split replaces iterating the KFold object directly (old sklearn API)
        for i, (train_index, test_index) in enumerate(kf.split(x_train)):
            x_tr = x_train[train_index]
            y_tr = y_train[train_index]
            x_te = x_train[test_index]

            clf.train(x_tr, y_tr)

            # Each training row is predicted exactly once, by the model
            # that never saw it during training
            oof_train[test_index] = clf.predict(x_te)
            # Every fold-model predicts the whole test set
            oof_test_skf[i, :] = clf.predict(x_test)

        # Average the NFOLDS test-set predictions into a single column
        oof_test[:] = oof_test_skf.mean(axis=0)
        return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1)

Three arrays are defined first: oof_train, oof_test, and oof_test_skf. oof_train holds the out-of-fold predictions for the training set; oof_test_skf holds one row of test-set predictions per fold; and oof_test is their fold-wise mean. Concretely, with 5-fold cross-validation each iteration trains on four folds, predicts the held-out fold (filling the matching slice of oof_train), and predicts the full test set (filling one row of oof_test_skf).
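A self-contained toy run (hypothetical data, a logistic regression standing in for the base models) makes the shapes concrete:

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.linear_model import LogisticRegression

    rng = np.random.RandomState(0)
    X, y = rng.randn(10, 3), np.tile([0, 1], 5)   # 10 "training" rows
    X_test = rng.randn(4, 3)                      # 4 "test" rows

    kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)
    oof_train = np.zeros(10)
    oof_test_skf = np.empty((5, 4))
    for i, (tr, te) in enumerate(kf_demo.split(X)):
        clf = LogisticRegression().fit(X[tr], y[tr])
        oof_train[te] = clf.predict(X[te])        # each train row predicted once
        oof_test_skf[i, :] = clf.predict(X_test)  # all 5 models predict the test set
    oof_test = oof_test_skf.mean(axis=0)          # one averaged column per base model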

Generating our Base First-Level Models

Five models are used as the first-level classifiers:

    # Put in our parameters for said classifiers

    # Random Forest parameters
    rf_params = {
        'n_jobs': -1,
        'n_estimators': 500,
        'warm_start': True,
        # 'max_features': 0.2,
        'max_depth': 6,
        'min_samples_leaf': 2,
        'max_features': 'sqrt',
        'verbose': 0
    }

    # Extra Trees parameters
    et_params = {
        'n_jobs': -1,
        'n_estimators': 500,
        # 'max_features': 0.5,
        'max_depth': 8,
        'min_samples_leaf': 2,
        'verbose': 0
    }

    # AdaBoost parameters
    ada_params = {
        'n_estimators': 500,
        'learning_rate': 0.75
    }

    # Gradient Boosting parameters
    gb_params = {
        'n_estimators': 500,
        # 'max_features': 0.2,
        'max_depth': 5,
        'min_samples_leaf': 2,
        'verbose': 0
    }

    # Support Vector Classifier parameters
    svc_params = {
        'kernel': 'linear',
        'C': 0.025
    }

Next, create five objects representing the five learners:

    rf = SklearnHelper(clf=RandomForestClassifier, seed=SEED, params=rf_params)
    et = SklearnHelper(clf=ExtraTreesClassifier, seed=SEED, params=et_params)
    ada = SklearnHelper(clf=AdaBoostClassifier, seed=SEED, params=ada_params)
    gb = SklearnHelper(clf=GradientBoostingClassifier, seed=SEED, params=gb_params)
    svc = SklearnHelper(clf=SVC, seed=SEED, params=svc_params)

With the first-level models defined, prepare the training and test data:

    # Create Numpy arrays of train, test and target (Survived) dataframes to feed into our models
    y_train = train['Survived'].ravel()
    train = train.drop(['Survived'], axis=1)
    x_train = train.values  # Creates an array of the train data
    x_test = test.values    # Creates an array of the test data

First-level out-of-fold predictions:

    # Create our OOF train and test predictions. These base results will be used as new features
    et_oof_train, et_oof_test = get_oof(et, x_train, y_train, x_test)     # Extra Trees
    rf_oof_train, rf_oof_test = get_oof(rf, x_train, y_train, x_test)     # Random Forest
    ada_oof_train, ada_oof_test = get_oof(ada, x_train, y_train, x_test)  # AdaBoost
    gb_oof_train, gb_oof_test = get_oof(gb, x_train, y_train, x_test)     # Gradient Boost
    svc_oof_train, svc_oof_test = get_oof(svc, x_train, y_train, x_test)  # Support Vector Classifier

    print("Training is complete")

Inspecting feature importances:

    rf_feature = rf.feature_importances(x_train, y_train)
    et_feature = et.feature_importances(x_train, y_train)
    ada_feature = ada.feature_importances(x_train, y_train)
    gb_feature = gb.feature_importances(x_train, y_train)

For the record, the importance values printed by the original notebook run:

    rf_features = [0.10474135, 0.21837029, 0.04432652, 0.02249159, 0.05432591, 0.02854371,
                   0.07570305, 0.01088129, 0.24247496, 0.13685733, 0.06128402]
    et_features = [0.12165657, 0.37098307, 0.03129623, 0.01591611, 0.05525811, 0.028157,
                   0.04589793, 0.02030357, 0.17289562, 0.04853517, 0.08910063]
    ada_features = [0.028, 0.008, 0.012, 0.05866667, 0.032, 0.008,
                    0.04666667, 0.0, 0.05733333, 0.73866667, 0.01066667]
    gb_features = [0.06796144, 0.03889349, 0.07237845, 0.02628645, 0.11194395, 0.04778854,
                   0.05965792, 0.02774745, 0.07462718, 0.4593142, 0.01340093]

Assemble them into a DataFrame:

    cols = train.columns.values

    # Create a dataframe with features
    feature_dataframe = pd.DataFrame({
        'features': cols,
        'Random Forest feature importances': rf_features,
        'Extra Trees feature importances': et_features,
        'AdaBoost feature importances': ada_features,
        'Gradient Boost feature importances': gb_features
    })

Plot each model's per-feature importances:

    # The original notebook repeats the same scatter plot four times, once per
    # model; the blocks differ only in which column is plotted, so they are
    # collapsed into a loop here (behavior unchanged).
    for col, plot_title in [
        ('Random Forest feature importances', 'Random Forest Feature Importance'),
        ('Extra Trees feature importances', 'Extra Trees Feature Importance'),
        ('AdaBoost feature importances', 'AdaBoost Feature Importance'),
        ('Gradient Boost feature importances', 'Gradient Boosting Feature Importance'),
    ]:
        # Scatter plot
        trace = go.Scatter(
            y=feature_dataframe[col].values,
            x=feature_dataframe['features'].values,
            mode='markers',
            marker=dict(
                sizemode='diameter',
                sizeref=1,
                size=25,
                color=feature_dataframe[col].values,  # colour by importance
                colorscale='Portland',
                showscale=True
            ),
            text=feature_dataframe['features'].values
        )
        layout = go.Layout(
            autosize=True,
            title=plot_title,
            hovermode='closest',
            yaxis=dict(
                title='Feature Importance',
                ticklen=5,
                gridwidth=2
            ),
            showlegend=False
        )
        fig = go.Figure(data=[trace], layout=layout)
        py.iplot(fig, filename='scatter2010')

Then average across models to get each feature's overall importance:

    # Create the new column containing the average of values.
    # numeric_only skips the string 'features' column (newer pandas would
    # otherwise raise when averaging mixed dtypes)
    feature_dataframe['mean'] = feature_dataframe.mean(axis=1, numeric_only=True)  # row-wise mean
    feature_dataframe.head(3)

Plot the averaged importances as a bar chart:

    y = feature_dataframe['mean'].values
    x = feature_dataframe['features'].values
    data = [go.Bar(
        x=x,
        y=y,
        width=0.5,
        marker=dict(
            color=feature_dataframe['mean'].values,
            colorscale='Portland',
            showscale=True,
            reversescale=False
        ),
        opacity=0.6
    )]

    layout = go.Layout(
        autosize=True,
        title='Barplots of Mean Feature Importance',
        hovermode='closest',
        yaxis=dict(
            title='Feature Importance',
            ticklen=5,
            gridwidth=2
        ),
        showlegend=False
    )
    fig = go.Figure(data=data, layout=layout)
    py.iplot(fig, filename='bar-direct-labels')

The input to the second level is the output of the first: the first-level predictions become the second-level features:

    base_predictions_train = pd.DataFrame({
        'RandomForest': rf_oof_train.ravel(),
        'ExtraTrees': et_oof_train.ravel(),
        'AdaBoost': ada_oof_train.ravel(),
        'GradientBoost': gb_oof_train.ravel()
    })
    base_predictions_train.head()

Check the correlations among the first-level training predictions:

    data = [
        go.Heatmap(
            z=base_predictions_train.astype(float).corr().values,
            x=base_predictions_train.columns.values,
            y=base_predictions_train.columns.values,
            colorscale='Viridis',
            showscale=True,
            reversescale=True
        )
    ]
    py.iplot(data, filename='labelled-heatmap')

Note: strong correlation between these first-level features means they carry redundant information, which limits what the second-level model can gain; less correlated base models tend to stack better. The correlations can also be read off numerically, as below.
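A quick numeric check (a sketch; values close to 1.0 flag near-duplicate base models that add little to the stack):

    print(base_predictions_train.corr())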

Build the new (second-level) training and test sets:

    x_train = np.concatenate((et_oof_train, rf_oof_train, ada_oof_train, gb_oof_train, svc_oof_train), axis=1)
    x_test = np.concatenate((et_oof_test, rf_oof_test, ada_oof_test, gb_oof_test, svc_oof_test), axis=1)
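A sanity check on the stacked shapes (a sketch; one column per base model is expected):

    print(x_train.shape, x_test.shape)  # for Titanic: (891, 5) (418, 5)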

Build the second-level model with XGBoost:

    gbm = xgb.XGBClassifier(
        # learning_rate = 0.02,
        n_estimators=2000,
        max_depth=4,
        min_child_weight=2,
        # gamma=1,
        gamma=0.9,
        subsample=0.8,
        colsample_bytree=0.8,
        objective='binary:logistic',
        nthread=-1,
        scale_pos_weight=1).fit(x_train, y_train)
    predictions = gbm.predict(x_test)

Parameter notes (see the sketch after this list):

  1. max_depth: maximum tree depth; too large a value can make the model overly complex.
  2. gamma: the minimum loss reduction required to make a further split on a leaf node; the larger it is, the more conservative the algorithm.
  3. eta: step-size shrinkage (the learning rate); it helps prevent overfitting.
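eta appears in the list but is left commented out in the snippet above; in the sklearn wrapper it is exposed as learning_rate. A sketch of enabling it (same data assumed; nthread omitted, as recent xgboost deprecates it in favour of n_jobs):

    # Sketch: same model with the learning rate ('eta') switched on
    gbm = xgb.XGBClassifier(
        learning_rate=0.02,  # 'eta' in native XGBoost naming
        n_estimators=2000,
        max_depth=4,
        min_child_weight=2,
        gamma=0.9,
        subsample=0.8,
        colsample_bytree=0.8,
        objective='binary:logistic',
        scale_pos_weight=1).fit(x_train, y_train)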

Submitting the predictions:

    StackingSubmission = pd.DataFrame({'PassengerId': PassengerId,
                                       'Survived': predictions})
    StackingSubmission.to_csv("StackingSubmission.csv", index=False)
