
Research on a Personal Credit Assessment Model


  • Data Exploration and Visualization

    • Introduction

      • General Statistics
      • Data Distribution
    • Good Loans vs. Bad Loans

      • Loan Types
      • Loans Issued by Region
      • A Closer Look at Bad Loans
    • Business Perspective

      • Understanding the Operational Side of the Business
      • Analysis by Income Category
    • Assessing Risk

      • Understanding the Risk Side of the Business
      • The Importance of Credit Scores
      • Determinants of Bad Loans
      • Risk by Loan Purpose
  • Data Cleaning

    • Data Filtering

      • Dropping Features with Many Missing Values
      • Dropping Constant-Value Features
      • Filtering Data by Domain Knowledge
      • Converting Data Types
    • Handling Missing Values

      • Handling Missing Values: Categorical Variables
      • Handling Missing Values: Numerical Variables
  • Feature Engineering

    • Feature Derivation
    • Feature Abstraction
    • Binning
    • Feature Scaling
    • Feature Selection
  • Classifier Selection

  • Validation Methods

Data Exploration and Visualization

This part focuses on visual exploration of the data, using common sense and domain knowledge to find rough relationships between the key features and the target variable. The main things practiced here are everyday pandas usage, visualization with seaborn and matplotlib, and the general workflow of exploratory data analysis.

Introduction

The data comes from the Lending Club platform, and the main goal is to assess each customer's credit status. The possible loan statuses are listed in the table below:
[Table: the seven loan_status values]
The seven statuses are manually regrouped into two classes, good and bad. The main tools are pandas, sklearn, keras, seaborn, and matplotlib: pandas for data cleaning and wrangling, sklearn for feature engineering, keras for classification, and seaborn/matplotlib for visualization. The required packages are imported below.

    # Import our libraries we are going to use for our data analysis.
    import keras 
    import pandas as pd
    import seaborn as sns
    import numpy as np
    import matplotlib.pyplot as plt
    
    # Plotly visualizations
    from plotly import tools
    import plotly.plotly as py
    import plotly.figure_factory as ff
    import plotly.graph_objs as go
    from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
    init_notebook_mode(connected=True)
    # plotly.tools.set_credentials_file(username='AlexanderBach', api_key='o4fx6i1MtEIJQxfWYvU1')
    
    
    # For oversampling Library (Dealing with Imbalanced Datasets)
    from imblearn.over_sampling import SMOTE
    from collections import Counter
    
    # Other Libraries
    import time
    
    

General Statistics

First, read the data with pandas and take a look at its basic information.

    %matplotlib inline
    
    df = pd.read_csv('../input/loan.csv', low_memory=False)
    
    # Copy of the dataframe
    original_df = df.copy()
    # Look at a few sample rows and the overall information of the table
    df.head()
    df.info()
    
    

Then rename some columns for readability (these names are used throughout the analysis) and drop columns that are not useful.

    df = df.rename(columns={"loan_amnt": "loan_amount", "funded_amnt": "funded_amount",
                            "funded_amnt_inv": "investor_funds", "int_rate": "interest_rate",
                            "annual_inc": "annual_income"})  # renamed columns referenced later in the analysis
    df.drop(['emp_title', 'zip_code', 'title'], axis=1, inplace=True)  # inplace=True modifies df in place
    
    

Data Distribution

Histograms
Let's look at the distributions of a few variables, drawn with seaborn's distplot function.

    fig, ax = plt.subplots(1, 3, figsize=(16,5))
    
    loan_amount = df["loan_amount"].values
    funded_amount = df["funded_amount"].values
    investor_funds = df["investor_funds"].values
    
    sns.distplot(loan_amount, ax=ax[0], color="#F7522F")
    ax[0].set_title("Loan Applied by the Borrower", fontsize=14)
    sns.distplot(funded_amount, ax=ax[1], color="#2F8FF7")
    ax[1].set_title("Amount Funded by the Lender", fontsize=14)
    sns.distplot(investor_funds, ax=ax[2], color="#2EAD46")
    ax[2].set_title("Total committed by Investors", fontsize=14)
    
    

Pie chart
First, regroup the loan_status feature into two classes:

    bad_loan = ["Charged Off", "Default", "Does not meet the credit policy. Status:Charged Off", "In Grace Period",
                "Late (16-30 days)", "Late (31-120 days)"]
    
    
    df['loan_condition'] = np.nan
    
    def loan_condition(status):
        if status in bad_loan:
            return 'Bad Loan'
        else:
            return 'Good Loan'
    
    
    df['loan_condition'] = df['loan_status'].apply(loan_condition)

Draw the pie chart with pandas' plot:

    colors = ["#3791D7", "#D72626"]
    labels = "Good Loans", "Bad Loans"
    df["loan_condition"].value_counts().plot.pie(explode=[0, 0.25],
                                                 autopct='%1.2f%%',
                                                 shadow=True,
                                                 colors=colors,
                                                 labels=labels,
                                                 fontsize=12, startangle=70)
    # x       : the size of each wedge; if sum(x) > 1 the values are normalized by sum(x)
    # labels  : the text shown outside each wedge
    # explode : how far each wedge is offset from the center
    # shadow  : draw a shadow beneath the pie (default: False, no shadow)
    # autopct : controls the percentage labels inside the pie; accepts a format string or format function,
    #           e.g. '%1.1f' sets the digits before/after the decimal point (without padding)

Bar chart
Convert the issue date information into a time variable:

    # Let's transform the issue dates by year.
    df['issue_d'].head()
    dt_series = pd.to_datetime(df['issue_d'])
    df['year'] = dt_series.dt.year
    
    

Plot the loan amount by year, using seaborn's barplot:

    plt.figure(figsize=(12,8))
    # A very convenient calling convention: visualize two columns of a DataFrame directly.
    # There is also a "hue" parameter for a third dimension, splitting each year by that variable.
    sns.barplot('year', 'loan_amount', data=df, palette='tab10')
    plt.title('Issuance of Loans', fontsize=16)
    plt.xlabel('Year', fontsize=14)
    plt.ylabel('Average loan amount issued', fontsize=14)
    
    

Good Loans vs. Bad Loans

Loan Types

Use pandas to look at the counts of each value in the column:

    df['loan_status'].value_counts()
    
    
    
    

Loans Issued by Region

Group the states into regions and recombine them; this is a good place to get a feel for pandas' apply function.

    df['addr_state'].unique()  # look at the distinct values
    
    # Make a list with each of the regions by state.
    
    west = ['CA', 'OR', 'UT','WA', 'CO', 'NV', 'AK', 'MT', 'HI', 'WY', 'ID']
    south_west = ['AZ', 'TX', 'NM', 'OK']
    south_east = ['GA', 'NC', 'VA', 'FL', 'KY', 'SC', 'LA', 'AL', 'WV', 'DC', 'AR', 'DE', 'MS', 'TN' ]
    mid_west = ['IL', 'MO', 'MN', 'OH', 'WI', 'KS', 'MI', 'SD', 'IA', 'NE', 'IN', 'ND']
    north_east = ['CT', 'NY', 'PA', 'NJ', 'RI','MA', 'MD', 'VT', 'NH', 'ME']
    
    df['region'] = np.nan
    
    def finding_regions(state):
        if state in west:
            return 'West'
        elif state in south_west:
            return 'SouthWest'
        elif state in south_east:
            return 'SouthEast'
        elif state in mid_west:
            return 'MidWest'
        elif state in north_east:
            return 'NorthEast'
    
    df['region'] = df['addr_state'].apply(finding_regions)

A Closer Look at Bad Loans

Count the number of loans per region, broken down by bad-loan status.
First select the bad loans, then group them by region.
Key point: pd.crosstab(badloans_df['region'], badloans_df['loan_status']).apply(lambda x: x/x.sum() * 100)
pd.crosstab() builds a cross-tabulation: the first argument becomes the row index and the second the column index. The apply that follows operates column by column, so x is each column of the crosstab and the lambda turns the counts into percentages.

    badloans_df = df.loc[df["loan_condition"] == "Bad Loan"]
    
    # loan_status cross
    loan_status_cross = pd.crosstab(badloans_df['region'], badloans_df['loan_status']).apply(lambda x: x/x.sum() * 100)
    number_of_loanstatus = pd.crosstab(badloans_df['region'], badloans_df['loan_status'])
    
    
    # Round our values
    loan_status_cross['Charged Off'] = loan_status_cross['Charged Off'].apply(lambda x: round(x, 2))
    loan_status_cross['Default'] = loan_status_cross['Default'].apply(lambda x: round(x, 2))
    # loan_status_cross['Does not meet the credit policy. Status:Charged Off'] = loan_status_cross['Does not meet the credit policy. Status:Charged Off'].apply(lambda x: round(x, 2))
    loan_status_cross['In Grace Period'] = loan_status_cross['In Grace Period'].apply(lambda x: round(x, 2))
    loan_status_cross['Late (16-30 days)'] = loan_status_cross['Late (16-30 days)'].apply(lambda x: round(x, 2))
    loan_status_cross['Late (31-120 days)'] = loan_status_cross['Late (31-120 days)'].apply(lambda x: round(x, 2))
    
    # Sum across each row to get the total per region
    number_of_loanstatus['Total'] = number_of_loanstatus.sum(axis=1)
    # number_of_badloans
    number_of_loanstatus

Then visualize it.
First convert each Series to a list, which is very handy:

    charged_off = loan_status_cross['Charged Off'].values.tolist()
    default = loan_status_cross['Default'].values.tolist()
    # not_meet_credit = loan_status_cross['Does not meet the credit policy. Status:Charged Off'].values.tolist()
    grace_period = loan_status_cross['In Grace Period'].values.tolist()
    short_pay = loan_status_cross['Late (16-30 days)'] .values.tolist()
    long_pay = loan_status_cross['Late (31-120 days)'].values.tolist()
    
    
    
    charged = go.Bar(
    x=['MidWest', 'NorthEast', 'SouthEast', 'SouthWest', 'West'],
    y= charged_off,
    name='Charged Off',
    marker=dict(
        color='rgb(192, 148, 246)'
    ),
    text = '%'
    )
    
    defaults = go.Bar(
    x=['MidWest', 'NorthEast', 'SouthEast', 'SouthWest', 'West'],
    y=default,
    name='Defaults',
    marker=dict(
        color='rgb(176, 26, 26)'
    ),
    text = '%'
    )
    
    # credit_policy = go.Bar(
    #     x=['MidWest', 'NorthEast', 'SouthEast', 'SouthWest', 'West'],
    #     y= not_meet_credit,
    #     name='Does not meet Credit Policy',
    #     marker = dict(
    #         color='rgb(229, 121, 36)'
    #     ),
    #     text = '%'
    # )
    
    grace = go.Bar(
    x=['MidWest', 'NorthEast', 'SouthEast', 'SouthWest', 'West'],
    y= grace_period,
    name='Grace Period',
    marker = dict(
        color='rgb(147, 147, 147)'
    ),
    text = '%'
    )
    
    short_pays = go.Bar(
    x=['MidWest', 'NorthEast', 'SouthEast', 'SouthWest', 'West'],
    y= short_pay,
    name='Late Payment (16-30 days)', 
    marker = dict(
        color='rgb(246, 157, 135)'
    ),
    text = '%'
    )
    
    long_pays = go.Bar(
    x=['MidWest', 'NorthEast', 'SouthEast', 'SouthWest', 'West'],
    y= long_pay,
    name='Late Payment (31-120 days)',
    marker = dict(
        color = 'rgb(238, 76, 73)'
        ),
    text = '%'
    )
    
    
    
    
    data = [charged, defaults,  grace, short_pays, long_pays]
    layout = go.Layout(
        barmode='stack',  # one of 'stack', 'group', 'overlay', 'relative'
        title = '% of Bad Loan Status by Region',
        xaxis=dict(title='US Regions')
    )
    
    fig = go.Figure(data=data, layout=layout)
    iplot(fig, filename='stacked-bar')
    
    

Business Perspective

Understanding the Operational Side of the Business

We will focus on three key metrics: the total loan amount issued per state, the average interest rate charged to customers, and the average annual income of all customers in each state.

    # Plot by state
    
    # Group by our metrics
    # First Plotly graph (evaluating the operational side of the business)
    by_loan_amount = df.groupby(['region','addr_state'], as_index=False).loan_amount.sum()
    by_interest_rate = df.groupby(['region', 'addr_state'], as_index=False).interest_rate.mean()
    by_income = df.groupby(['region', 'addr_state'], as_index=False).annual_income.mean()
    
    
    # Take the values to a list for visualization purposes.
    states = by_loan_amount['addr_state'].values.tolist()
    average_loan_amounts = by_loan_amount['loan_amount'].values.tolist()
    average_interest_rates = by_interest_rate['interest_rate'].values.tolist()
    average_annual_income = by_income['annual_income'].values.tolist()
    
    
    from collections import OrderedDict
    
    # Create an ordered dictionary
    metrics_data = OrderedDict([('state_codes', states),
                                ('issued_loans', average_loan_amounts),
                                ('interest_rate', average_interest_rates),
                                ('annual_income', average_annual_income)])
    
    metrics_df = pd.DataFrame.from_dict(metrics_data)
    metrics_df = metrics_df.round(decimals=2)
    metrics_df.head()

Visualize this on a map. The code below is a reusable template for a United States choropleth:

    # Now comes the part where we plot our plotly United States map
    import plotly.plotly as py
    import plotly.graph_objs as go
    
    # metrics_df is indexed by state; convert every column to string for the hover text
    for col in metrics_df.columns:
        metrics_df[col] = metrics_df[col].astype(str)
    
    scl = [[0.0, 'rgb(210, 241, 198)'],[0.2, 'rgb(188, 236, 169)'],[0.4, 'rgb(171, 235, 145)'],\
           [0.6, 'rgb(140, 227, 105)'],[0.8, 'rgb(105, 201, 67)'],[1.0, 'rgb(59, 159, 19)']]
    
    metrics_df['text'] = metrics_df['state_codes'] + '<br>' +\
        'Average loan interest rate: ' + metrics_df['interest_rate'] + '<br>'+\
        'Average annual income: ' + metrics_df['annual_income']
    
    
    data = [ dict(
        type='choropleth',
        colorscale = scl,
        autocolorscale = False,
        locations = metrics_df['state_codes'],
        z = metrics_df['issued_loans'],
        locationmode = 'USA-states',
        text = metrics_df['text'],
        marker = dict(
            line = dict (
                color = 'rgb(255,255,255)',
                width = 2
            ) ),
        colorbar = dict(
            title = "$s USD")
        ) ]
    
    
    layout = dict(
        title = 'Lending Clubs Issued Loans <br> (A Perspective for the Business Operations)',
        geo = dict(
            scope = 'usa',
            projection=dict(type='albers usa'),
            showlakes = True,
            lakecolor = 'rgb(255, 255, 255)')
    )
    
    fig = dict(data=data, layout=layout)
    iplot(fig, filename='d3-cloropleth-map')

Analysis by Income Category

We will create income categories in order to detect important patterns and dig deeper into the analysis.
Low income category: borrowers with an annual income of $100,000 or less.
Medium income category: borrowers with an annual income above $100,000 and up to $200,000.
High income category: borrowers with an annual income above $200,000.
Borrowers in the high income category take out larger loans than those in the medium and low income categories; naturally, people with higher annual incomes are more likely to repay larger loans (first row, left subplot).
Loans taken out by low-income borrowers are slightly more likely to become bad loans (first row, right subplot).
Borrowers with high and medium annual incomes have been employed longer than those with lower incomes (second row, left subplot).
Low-income borrowers pay higher interest rates on average, while borrowers with higher annual incomes get lower rates (second row, right subplot).

    # Create categories for annual_income, since most bad loans sit below 100k
    
    df['income_category'] = np.nan
    lst = [df]
    # A handy way to update the whole dataframe in a loop; apply would also work
    for col in lst:
        col.loc[col['annual_income'] <= 100000, 'income_category'] = 'Low'
        col.loc[(col['annual_income'] > 100000) & (col['annual_income'] <= 200000), 'income_category'] = 'Medium'
        col.loc[col['annual_income'] > 200000, 'income_category'] = 'High'

The same logic written with apply instead:

    df['income_category'] = np.nan
    
    def income_category(income):
        if income <= 100000:
            return 'Low'
        elif (income > 100000) & (income <= 200000):
            return 'Medium'
        elif income > 200000:
            return 'High'
    
    df['income_category'] = df['annual_income'].apply(income_category)

For a binary variable like good/bad loan the approach above works; when an object column has only a few distinct values, an ordered feature mapping can also be used. Here we stick with the loop approach.

    # Let's transform the column loan_condition into integers.
    
    lst = [df]
    df['loan_condition_int'] = np.nan
    
    for col in lst:
        col.loc[df['loan_condition'] == 'Good Loan', 'loan_condition_int'] = 0  # Negative (Good Loan)
        col.loc[df['loan_condition'] == 'Bad Loan', 'loan_condition_int'] = 1   # Positive (Bad Loan)
    
    # Convert the column from float to int (this is our label)
    df['loan_condition_int'] = df['loan_condition_int'].astype(int)
    
    employment_length = ['10+ years', '< 1 year', '1 year', '3 years', '8 years', '9 years',
                         '4 years', '5 years', '6 years', '2 years', '7 years', 'n/a']
    
    # Create a new column and convert emp_length to integers.
    
    lst = [df]
    df['emp_length_int'] = np.nan
    
    for col in lst:
        col.loc[col['emp_length'] == '10+ years', "emp_length_int"] = 10
        col.loc[col['emp_length'] == '9 years', "emp_length_int"] = 9
        col.loc[col['emp_length'] == '8 years', "emp_length_int"] = 8
        col.loc[col['emp_length'] == '7 years', "emp_length_int"] = 7
        col.loc[col['emp_length'] == '6 years', "emp_length_int"] = 6
        col.loc[col['emp_length'] == '5 years', "emp_length_int"] = 5
        col.loc[col['emp_length'] == '4 years', "emp_length_int"] = 4
        col.loc[col['emp_length'] == '3 years', "emp_length_int"] = 3
        col.loc[col['emp_length'] == '2 years', "emp_length_int"] = 2
        col.loc[col['emp_length'] == '1 year', "emp_length_int"] = 1
        col.loc[col['emp_length'] == '< 1 year', "emp_length_int"] = 0.5
        col.loc[col['emp_length'] == 'n/a', "emp_length_int"] = 0

Visualize income category against loan amount, loan condition, average employment length, and interest rate level, using violin plots and box plots.

    fig, ((ax1, ax2), (ax3, ax4))= plt.subplots(nrows=2, ncols=2, figsize=(14,6))
    
    # Change the Palette types tomorrow!
    
    sns.violinplot(x="income_category", y="loan_amount", data=df, palette="Set2", ax=ax1 )
    sns.violinplot(x="income_category", y="loan_condition_int", data=df, palette="Set2", ax=ax2)
    sns.boxplot(x="income_category", y="emp_length_int", data=df, palette="Set2", ax=ax3)
    sns.boxplot(x="income_category", y="interest_rate", data=df, palette="Set2", ax=ax4)
    plt.savefig('plot2.png', format='png')
    
    
    
    

Assessing Risk

Understanding the Risk Side of the Business

While the operational side of the business is important, we also have to analyze the level of risk in each state. The credit score is a key indicator of an individual customer's risk, but there are other metrics that, in one way or another, estimate the risk level of each state.
Let's look at the relationship between defaults and regions.

    by_condition = df.groupby('addr_state')['loan_condition'].value_counts() / df.groupby('addr_state')['loan_condition'].count()
    by_emp_length = df.groupby(['region', 'addr_state'], as_index=False).emp_length_int.mean().sort_values(by="addr_state")
    
    loan_condition_bystate = pd.crosstab(df['addr_state'], df['loan_condition'])
    
    cross_condition = pd.crosstab(df["addr_state"], df["loan_condition"])
    # Percentage of condition of loan
    percentage_loan_contributor = pd.crosstab(df['addr_state'], df['loan_condition']).apply(lambda x: x/x.sum() * 100)
    condition_ratio = cross_condition["Bad Loan"]/cross_condition["Good Loan"]
    by_dti = df.groupby(['region', 'addr_state'], as_index=False).dti.mean()
    state_codes = sorted(states)
    
    
    # Take to a list
    default_ratio = condition_ratio.values.tolist()
    average_dti = by_dti['dti'].values.tolist()
    average_emp_length = by_emp_length["emp_length_int"].values.tolist()
    number_of_badloans = loan_condition_bystate['Bad Loan'].values.tolist()
    percentage_ofall_badloans = percentage_loan_contributor['Bad Loan'].values.tolist()
    
    
    # Figure Number 2
    risk_data = OrderedDict([('state_codes', state_codes),
                             ('default_ratio', default_ratio),
                             ('badloans_amount', number_of_badloans),
                             ('percentage_of_badloans', percentage_ofall_badloans),
                             ('average_dti', average_dti),
                             ('average_emp_length', average_emp_length)])
    
    
    # Figure 2 Dataframe
    risk_df = pd.DataFrame.from_dict(risk_data)
    risk_df = risk_df.round(decimals=3)
    risk_df.head()

Then visualize 'default_ratio' (the ratio of bad loans to good loans in each state):

    # Now comes the part where we plot our plotly United States map
    import plotly.plotly as py
    import plotly.graph_objs as go
    
    
    # Convert every column to string for the hover text
    for col in risk_df.columns:
        risk_df[col] = risk_df[col].astype(str)
    
    scl = [[0.0, 'rgb(202, 202, 202)'],[0.2, 'rgb(253, 205, 200)'],[0.4, 'rgb(252, 169, 161)'],\
           [0.6, 'rgb(247, 121, 108)'],[0.8, 'rgb(232, 70, 54)'],[1.0, 'rgb(212, 31, 13)']]
    
    risk_df['text'] = risk_df['state_codes'] + '<br>' +\
        'Number of Bad Loans: ' + risk_df['badloans_amount'] + '<br>' + \
        'Percentage of all Bad Loans: ' + risk_df['percentage_of_badloans'] + '%' + '<br>' + \
        'Average Debt-to-Income Ratio: ' + risk_df['average_dti'] + '<br>'+\
        'Average Length of Employment: ' + risk_df['average_emp_length']
    
    
    data = [ dict(
        type='choropleth',
        colorscale = scl,
        autocolorscale = False,
        locations = risk_df['state_codes'],
        z = risk_df['default_ratio'],
        locationmode = 'USA-states',
        text = risk_df['text'],
        marker = dict(
            line = dict (
                color = 'rgb(255,255,255)',
                width = 2
            ) ),
        colorbar = dict(
            title = "%")
        ) ]
    
    
    layout = dict(
        title = 'Lending Clubs Default Rates <br> (Analyzing Risks)',
        geo = dict(
            scope = 'usa',
            projection=dict(type='albers usa'),
            showlakes = True,
            lakecolor = 'rgb(255, 255, 255)')
    )
    
    fig = dict(data=data, layout=layout)
    iplot(fig, filename='d3-cloropleth-map')

The Importance of Credit Scores

Credit scores are an important indicator of the overall level of risk. In this section we analyze the overall risk level and the number of bad loans according to the grade each customer received. The plots below show how loan amount and interest rate vary with credit grade over time. Note that unstack, applied after the groupby, pivots a level of the row index into columns.

    # Let's visualize how many loans were issued by creditscore
    f, ((ax1, ax2)) = plt.subplots(1, 2)
    cmap = plt.cm.coolwarm
    
    by_credit_score = df.groupby(['year', 'grade']).loan_amount.mean()
    by_credit_score.unstack().plot(legend=False, ax=ax1, figsize=(14, 4), colormap=cmap)
    ax1.set_title('Loans issued by Credit Score', fontsize=14)
    
    by_inc = df.groupby(['year', 'grade']).interest_rate.mean()
    by_inc.unstack().plot(ax=ax2, figsize=(14, 4), colormap=cmap)
    ax2.set_title('Interest Rates by Credit Score', fontsize=14)
    
    ax2.legend(bbox_to_anchor=(-1.0, -0.3, 1.7, 0.1), loc=5, prop={'size':12},
           ncol=7, mode="expand", borderaxespad=0.)
    
    

Next, compare the numbers of good and bad loans across the different grades and sub-grades.

    fig = plt.figure(figsize=(16,12))
    
    ax1 = fig.add_subplot(221)
    ax2 = fig.add_subplot(222)
    ax3 = fig.add_subplot(212)
    
    cmap = plt.cm.coolwarm_r
    
    loans_by_region = df.groupby(['grade', 'loan_condition']).size()
    # stacked=True stacks the two bars on top of each other, which makes them easier to compare
    loans_by_region.unstack().plot(kind='bar', stacked=True, colormap=cmap, ax=ax1, grid=False)
    ax1.set_title('Type of Loans by Grade', fontsize=14)
    
    loans_by_grade = df.groupby(['sub_grade', 'loan_condition']).size()
    loans_by_grade.unstack().plot(kind='bar', stacked=True, colormap=cmap, ax=ax2, grid=False)
    ax2.set_title('Type of Loans by Sub-Grade', fontsize=14)
    
    by_interest = df.groupby(['year', 'loan_condition']).interest_rate.mean()
    by_interest.unstack().plot(ax=ax3, colormap=cmap)
    ax3.set_title('Average Interest rate by Loan Condition', fontsize=14)
    ax3.set_ylabel('Interest Rate (%)', fontsize=12)

Determinants of Bad Loans

Plot the relationship between home ownership and the loan amount of bad loans:

    import seaborn as sns
    
    plt.figure(figsize=(18,18))
    
    # Create a dataframe for bad loans
    bad_df = df.loc[df['loan_condition'] == 'Bad Loan']
    
    plt.subplot(211)
    g = sns.boxplot(x='home_ownership', y='loan_amount', hue='loan_condition',
               data=bad_df, color='r')
    
    g.set_xticklabels(g.get_xticklabels(),rotation=45)
    g.set_xlabel("Type of Home Ownership", fontsize=12)
    g.set_ylabel("Loan Amount", fontsize=12)
    g.set_title("Distribution of Amount Borrowed \n by Home Ownership", fontsize=16)
    
    
    plt.subplot(212)
    g1 = sns.boxplot(x='year', y='loan_amount', hue='home_ownership',
               data=bad_df, palette="Set3")
    g1.set_xticklabels(g1.get_xticklabels(),rotation=45)
    g1.set_xlabel("Type of Home Ownership", fontsize=12)
    g1.set_ylabel("Loan Amount", fontsize=12)
    g1.set_title("Distribution of Amount Borrowed \n through the years", fontsize=16)
    
    
    plt.subplots_adjust(hspace = 0.6, top = 0.8)
    
    plt.show()
    
    

Plot the relationship between the level of interest payments and loan condition, and between interest payments and loan term. Here the interest rate is binned into two classes, high and low.
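
The plotting code below uses an interest_payments column that is not constructed anywhere in the snippets shown. A minimal sketch of one way it could be built, assuming we simply cut at the mean interest rate (the threshold used in the original analysis may differ):

    # Illustrative construction of 'interest_payments': label loans at or below the
    # mean interest rate as 'Low' and the rest as 'High' (the threshold is an assumption)
    rate_threshold = df['interest_rate'].mean()
    df['interest_payments'] = np.where(df['interest_rate'] <= rate_threshold, 'Low', 'High')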

    from scipy.stats import norm
    
    plt.figure(figsize=(20,10))
    
    palette = ['#009393', '#930000']
    plt.subplot(221)
    ax = sns.countplot(x='interest_payments', data=df,
                       palette=palette, hue='loan_condition')
    
    ax.set_title('The impact of interest rate \n on the condition of the loan', fontsize=14)
    ax.set_xlabel('Level of Interest Payments', fontsize=12)
    ax.set_ylabel('Count')
    
    plt.subplot(222)
    ax1 = sns.countplot(x='interest_payments', data=df,
                        palette=palette, hue='term')
    
    ax1.set_title('The impact of maturity date \n on interest rates', fontsize=14)
    ax1.set_xlabel('Level of Interest Payments', fontsize=12)
    ax1.set_ylabel('Count')
    
    
    plt.subplot(212)
    low = df['loan_amount'].loc[df['interest_payments'] == 'Low'].values
    high = df['loan_amount'].loc[df['interest_payments'] == 'High'].values
    
    # There are four curves because fit=norm adds a fitted normal curve on top of each histogram
    ax2 = sns.distplot(low, color='#009393', label='Low Interest Payments', fit=norm, fit_kws={"color":"#483d8b"}, kde=False)   # dark blue norm curve
    ax3 = sns.distplot(high, color='#930000', label='High Interest Payments', fit=norm, fit_kws={"color":"#c71585"}, kde=False) # red norm curve
    plt.axis([0, 36000, 0, 0.00016])
    plt.legend()
    
    plt.savefig('plot5.png', format='png')  # save before plt.show(), otherwise the saved figure is blank
    plt.show()

Risk by Loan Purpose

Explore the relationship between loan purpose and loan status; a sketch of one way to do this is shown below.
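
The original stops at this remark without showing code. A small sketch of one way to look at it, reusing the purpose column and the loan_condition label built earlier (an illustrative addition, not part of the original notebook):

    # Share of bad loans for each loan purpose, as a percentage of loans with that purpose
    purpose_condition = pd.crosstab(df['purpose'], df['loan_condition'])
    purpose_condition['bad_loan_rate'] = (purpose_condition['Bad Loan'] /
                                          purpose_condition.sum(axis=1) * 100).round(2)
    print(purpose_condition.sort_values('bad_loan_rate', ascending=False))
    
    # Visualize the counts by purpose, split by loan condition
    plt.figure(figsize=(14, 6))
    ax = sns.countplot(x='purpose', data=df, hue='loan_condition')
    ax.set_xticklabels(ax.get_xticklabels(), rotation=45)
    ax.set_title('Loan Condition by Purpose', fontsize=14)
    plt.tight_layout()
    plt.show()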

Data Cleaning

Data Filtering

Dropping Features with Many Missing Values

First check the fraction of missing values in each column, using pandas as follows:

    # Read the training file
    df = pd.read_csv('Train_set.csv', low_memory=False)
    # Fraction of missing values per feature, sorted in descending order
    check_null = df.isnull().sum().sort_values(ascending=False)/float(len(df))
    # Show the features with more than 20% missing values
    print(check_null[check_null > 0.2])
    
    

Then drop the features whose share of missing values exceeds a threshold:

    # Set the threshold
    thresh_count = len(df)*0.4
    # dropna with thresh keeps only the columns that have at least thresh_count non-missing values,
    # i.e. it drops any column with more than 60% missing values
    df = df.dropna(thresh=thresh_count, axis=1)
    
    

Dropping Constant-Value Features

If most observations of a variable take the same value, that feature cannot be used to distinguish the target.
We kick out such variables; the one-liner below removes strictly constant columns, and a stricter filter is sketched right after it.

    # Keep only the columns that have more than one distinct value
    loans = df.loc[:, df.apply(pd.Series.nunique) != 1]
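
The one-liner above only drops strictly constant columns, while the text speaks of features whose observations are mostly identical. A stricter filter could look like the sketch below, using an illustrative 95% dominance threshold:

    # Find columns dominated by a single value (illustrative 95% threshold)
    dominant_share = loans.apply(lambda s: s.value_counts(normalize=True, dropna=False).iloc[0])
    quasi_constant_cols = dominant_share[dominant_share > 0.95].index.tolist()
    print(quasi_constant_cols)
    # loans = loans.drop(columns=quasi_constant_cols)  # uncomment to actually drop them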
    
    

Filtering Data by Domain Knowledge

Use common sense to remove data that is irrelevant to the classification task. For example, when the task is scoring personal credit, information such as phone numbers, birthdays, and zip codes is certainly useless.

    drop_list = ['sub_grade', 'emp_title',  'title', 'zip_code', 'addr_state', 'earliest_cr_line',
       'last_pymnt_d', 'next_pymnt_d', 'last_credit_pull_d', 'disbursement_method','debt_settlement_flag','pymnt_plan',
             'revol_util', 'initial_list_status', 'hardship_flag']
    
    loans.drop(drop_list,axis=1,inplace=True)
    
    
    
    

Converting Data Types

Percentages, dates, and similar fields need to be converted to numeric types, e.g. 5% becomes 0.05; a date example is sketched after the snippet below.

    # Strip the percent sign and convert to a fraction, e.g. "5%" -> 0.05
    loans['int_rate'] = loans['int_rate'].astype(str).str.strip("%").astype(float)/100
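
Only the percentage case is shown above. For date-like fields, a sketch along the same lines, assuming a month-year string column such as issue_d is still present in the training set (the column name and derived feature are illustrative):

    # Example: parse a "Dec-2015"-style string column and derive a numeric year feature
    issue_dates = pd.to_datetime(loans['issue_d'], errors='coerce')
    loans['issue_year'] = issue_dates.dt.year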
    
    

Handling Missing Values

Handling Missing Values: Categorical Variables

For categorical variables we fill missing values with 'Unknown'. First visualize the missing values in the categorical columns:

    import missingno as msno  # library for visualizing missing values
    
    objectColumns = loans.select_dtypes(include=["object"]).columns
    msno.matrix(loans[objectColumns])  # visualize the missing values
    
    

Then fill them with 'Unknown':

    loans[objectColumns] = loans[objectColumns].fillna("Unknown") 
    
    
    
    

Handling Missing Values: Numerical Variables

Here we use sklearn's preprocessing module with strategy='most_frequent', i.e. mode imputation, to fill the missing values.

    numColumns = loans.select_dtypes(include=[np.number]).columns
    # Show all columns when printing
    pd.set_option('display.max_columns', len(numColumns))
    # Fill missing values by mode imputation
    from sklearn.preprocessing import Imputer
    imr = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)  # axis=0 imputes column-wise
    imr = imr.fit(loans[numColumns])
    loans[numColumns] = imr.transform(loans[numColumns])  # apply the imputation to the numeric columns
    
    

Feature Engineering

Feature Derivation

'installment' is the monthly installment amount of the loan. Dividing 'annual_inc' by 12 gives the applicant's monthly income; dividing 'installment' (monthly debt payment) by 'annual_inc'/12 (monthly income) creates a new feature, 'installment_feat', which represents the share of monthly income spent on loan repayment. The larger 'installment_feat' is, the heavier the borrower's repayment burden and the more likely a default.

    # Monthly repayment as a share of monthly income (the +1 guards against zero income)
    loans['installment_feat'] = loans['installment'] / ((loans['annual_inc']+1) / 12)
    
    

Feature Abstraction

Regroup loan_status into two classes, normal and default (it originally had 7 classes):

    def coding(col, codeDict):
        colCoded = pd.Series(col, copy=True)
        for key, value in codeDict.items():
            colCoded.replace(key, value, inplace=True)
        return colCoded
    
    # Encode loan_status as default = 1, normal = 0:
    loans["loan_status"] = coding(loans["loan_status"], {'Current':0,'Fully Paid':0,'In Grace Period':1,'Late (31-120 days)':1,'Late (16-30 days)':1,'Default':1,'Charged Off':1})
    
    print('\nAfter Coding:')
    
    pd.value_counts(loans["loan_status"])

Then abstract 'emp_length' (years of employment) and 'grade' (credit grade) into numbers:

    # Mapping for ordered features
    mapping_dict = {
        "emp_length": {
            "10+ years": 10,
            "9 years": 9,
            "8 years": 8,
            "7 years": 7,
            "6 years": 6,
            "5 years": 5,
            "4 years": 4,
            "3 years": 3,
            "2 years": 2,
            "1 year": 1,
            "< 1 year": 0,
            "Unknown": 0
        },
        "grade": {
            "A": 1,
            "B": 2,
            "C": 3,
            "D": 4,
            "E": 5,
            "F": 6,
            "G": 7
        }
    }
    
    loans = loans.replace(mapping_dict)
    loans[['emp_length','grade']].head()

Then one-hot encode the remaining categorical features:

    n_columns = ["home_ownership", "verification_status", "application_type", "purpose", "term"]
    dummy_df = pd.get_dummies(loans[n_columns])  # one-hot encode with get_dummies
    loans = pd.concat([loans, dummy_df], axis=1)  # axis=1 aligns on rows and appends the new columns
    # Drop the original categorical columns
    loans = loans.drop(n_columns, axis=1)
    
    

Binning

Discretizing continuous variables (or merging a many-valued discrete variable into fewer states) smooths out meaningless fluctuations in a feature and limits the influence of extreme values, which makes the resulting score more stable; missing values can also be kept as their own bin, and all variables end up on a comparable scale. Common approaches are supervised binning (e.g. best-KS and chi-square binning) and unsupervised binning (equal-frequency, equal-width, clustering). Here an unsupervised method is used, as sketched below.
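
A minimal sketch of the two unsupervised options using pandas, assuming we bin the derived installment_feat column (the column and the number of bins are illustrative):

    # Equal-width binning: split the value range into 5 intervals of equal width
    loans['installment_feat_cut'] = pd.cut(loans['installment_feat'], bins=5, labels=False)
    # Equal-frequency binning: split into 5 bins holding roughly the same number of rows
    loans['installment_feat_qcut'] = pd.qcut(loans['installment_feat'], q=5, labels=False, duplicates='drop')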

Feature Scaling

We use standardization, via the StandardScaler class from scikit-learn's preprocessing module.

    from sklearn.preprocessing import StandardScaler
    
    sc = StandardScaler()  # initialize the scaler
    # loans_ml_df is the feature dataframe from the preceding steps; col is the list of continuous columns to scale
    loans_ml_df[col] = sc.fit_transform(loans_ml_df[col])  # standardize the data
    
    

Feature Selection

There are three main families of feature-selection methods; sklearn's feature_selection module documents them in detail. Here we use Recursive Feature Elimination (RFE), which can wrap different estimators (SVM, logistic regression, gradient-boosted trees, and so on); we use it with logistic regression.

    from sklearn.linear_model import LogisticRegression
    from sklearn.feature_selection import RFE
    # Build the logistic regression classifier
    model = LogisticRegression()
    # Build the recursive feature elimination selector, keeping 40 features
    rfe = RFE(model, 40)
    rfe = rfe.fit(x_val, y_val)
    # Print the selection results
    print(rfe.n_features_)
    print(rfe.estimator_)
    print(rfe.support_)
    print(rfe.ranking_)  # a ranking of 1 means the feature was selected; any other value means it was not
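
To see which columns survived, the boolean mask in rfe.support_ can be mapped back to the feature names, assuming x_val is a DataFrame (a small illustrative follow-up):

    # Names of the 40 features kept by RFE
    selected_features = x_val.columns[rfe.support_].tolist()
    print(selected_features)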
    
    

Classifier Selection

Classifier selection is fairly conventional at this point: a neural network or XGBoost, for example, would both work. A minimal Keras sketch is given below.
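
Since the introduction mentions Keras for the classification step, here is a minimal sketch of a feed-forward binary classifier. X_train and y_train stand for the scaled feature matrix and the 0/1 label from the previous steps; the layer sizes and hyperparameters are illustrative, not the author's exact setup:

    from keras.models import Sequential
    from keras.layers import Dense, Dropout
    
    # A small feed-forward network for the binary good/bad loan label
    model = Sequential()
    model.add(Dense(64, activation='relu', input_dim=X_train.shape[1]))
    model.add(Dropout(0.3))
    model.add(Dense(32, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))  # probability of a bad loan
    
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    model.fit(X_train, y_train, epochs=10, batch_size=256, validation_split=0.2)

Given the class imbalance hinted at by the SMOTE import at the top, oversampling the training split before fitting would be a natural complement.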

Validation Methods

Mainly the confusion matrix and the ROC curve; a sketch of both is shown below.
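
A minimal sketch of both checks with scikit-learn and matplotlib, assuming X_test/y_test are a held-out split and model is the classifier fitted above (the names are illustrative):

    from sklearn.metrics import confusion_matrix, roc_curve, auc
    
    # Predicted probabilities and hard labels on the held-out set
    y_prob = model.predict(X_test).ravel()
    y_pred = (y_prob > 0.5).astype(int)
    
    # Confusion matrix: rows are true classes, columns are predicted classes
    print(confusion_matrix(y_test, y_pred))
    
    # ROC curve and the area under it
    fpr, tpr, thresholds = roc_curve(y_test, y_prob)
    print('AUC:', auc(fpr, tpr))
    
    plt.plot(fpr, tpr, label='ROC curve')
    plt.plot([0, 1], [0, 1], linestyle='--', label='random guess')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.legend()
    plt.show()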

Reference: Kaggle
