[Data Science] Tianchi Financial Risk Control: Loan Default Prediction, Data Analysis


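1. Learning objectives
2. Understanding the data
- 2.1 Importing third-party libraries
- 2.2 Reading the files
- - 2.2.1 Further notes
- 2.3 Overview
- - 2.3.1 Number of samples and feature dimensions
3. Data analysis and visualization
- 3.1 Missing values and unique values
- 3.2 Numerical and object-type features
- - 3.2.1 Numerical variables
- - 3.2.2 Object-type variables
- 3.3 Visualizing variable distributions
- - 3.3.1 Distribution of a single variable
- - 3.3.2 Distribution of a feature x by value of y
- 3.4 Processing and visualizing time data
- 3.5 Pivot tables
- 3.6 Generating a data report
4. Summary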

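1. Learning objectives

The previous article covered the problem analysis for Tianchi's "Financial Risk Control: Loan Default Prediction" competition.

The competition applies machine learning to risk assessment, and the task is to predict whether a loan will default.

This article is the second chapter, data analysis: explore the data in depth, build a thorough picture of it, and lay the groundwork for feature engineering.

Goals:

- Use EDA to understand the dataset as a whole, including missing values and outliers, and to check that it is usable for the machine learning or deep learning modeling that follows.
- Examine the relationships among the variables and between each variable and the prediction target.
- Prepare for feature engineering.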

Competition page: https://tianchi.aliyun.com/competition/entrance/531830/introduction

2. Understanding the data

2.1 Importing third-party libraries

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    import datetime
    
    
2.2 Reading the files

    data_train = pd.read_csv('./train.csv')
    data_test_a = pd.read_csv('./testA.csv')
    
    
2.2.1 Further notes

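If pandas fails to read a file given by a relative path, check the current working directory; os.getcwd() returns it.
TSV versus CSV, and Python's support for TSV:
- TSV separates field values with a tab ('\t'); CSV separates them with a comma (',').
- Python's csv module really handles DSV (delimiter-separated values) files in general; its delimiter parameter defaults to a comma, so files are treated as CSV by default.
- To read tab-separated data, set delimiter='\t' explicitly (with pandas.read_csv, pass sep='\t').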
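
A minimal sketch of the points above, assuming a tiny in-memory TSV sample (the values are made up for illustration and are not the competition data):

```python
import csv
import io
import os

import pandas as pd

# Relative paths such as './train.csv' are resolved against this directory.
cwd = os.getcwd()
print(cwd)

# Hypothetical tab-separated sample.
tsv_text = "id\tloanAmnt\tgrade\n0\t35000.0\tE\n1\t18000.0\tD\n"

# pandas: pass sep='\t' to read_csv.
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# csv module: the default delimiter is ',', so set delimiter='\t' explicitly.
rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
```

Reading the same text without sep='\t' would leave each line as a single column, which is usually the first sign that the delimiter is wrong.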

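For very large files, read only part of the data or read it in chunks.
The nrows parameter reads only the first nrows rows of the file; it must be a non-negative integer.
The chunksize parameter makes read_csv return an iterator that yields DataFrames of chunksize rows each, so the whole file never has to fit in memory at once.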

    # Read only the first 5 rows
    data_train_sample = pd.read_csv("./train.csv", nrows=5)
    # Set chunksize to control how many rows each iteration returns
    chunker = pd.read_csv("./train.csv", chunksize=5)
    for item in chunker:
        print(type(item))
        # <class 'pandas.core.frame.DataFrame'>
        print(len(item))
        # 5
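
Chunked reading pays off when each chunk is reduced to a small summary before the next one is loaded. The sketch below is one way to do that, assuming a synthetic in-memory CSV with an isDefault column (the data and the chunk size of 6 are illustrative choices):

```python
import io

import pandas as pd

# Synthetic data: 20 rows, every fifth row marked as a default.
rows = ["id,isDefault"] + [f"{i},{int(i % 5 == 0)}" for i in range(20)]
csv_text = "\n".join(rows)

total_rows = 0
total_defaults = 0
# Each chunk is an ordinary DataFrame; only running totals are kept in memory.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=6):
    total_rows += len(chunk)
    total_defaults += int(chunk["isDefault"].sum())

default_rate = total_defaults / total_rows
print(total_rows, total_defaults, default_rate)
```

The last chunk may be shorter than chunksize, so totals should be accumulated from len(chunk) rather than assumed.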
    
    

2.3 Overview

2.3.1 Number of samples and feature dimensions

    data_test_a.shape
    
    (200000, 48)
    
    

    data_train.shape
    
    (800000, 47)
    
    

    data_train.columns
    
    Index(['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'grade',
       'subGrade', 'employmentTitle', 'employmentLength', 'homeOwnership',
       'annualIncome', 'verificationStatus', 'issueDate', 'isDefault',
       'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years',
       'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec',
       'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc',
       'initialListStatus', 'applicationType', 'earliesCreditLine', 'title',
       'policyCode', 'n0', 'n1', 'n2', 'n2.1', 'n4', 'n5', 'n6', 'n7', 'n8',
       'n9', 'n10', 'n11', 'n12', 'n13', 'n14'],
      dtype='object')
    
    
Review the column names. The meaning of each feature was covered in the problem-understanding article; the table is repeated here for reference.

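Field	Description
id	Unique identifier assigned to the loan listing
loanAmnt	Loan amount
term	Loan term (years)
interestRate	Loan interest rate
installment	Installment payment amount
grade	Loan grade
subGrade	Loan sub-grade
employmentTitle	Job title
employmentLength	Length of employment (years)
homeOwnership	Home ownership status provided by the borrower at registration
annualIncome	Annual income
verificationStatus	Verification status
issueDate	Month in which the loan was issued
purpose	Loan purpose category given by the borrower in the application
postCode	First 3 digits of the postal code given by the borrower in the application
regionCode	Region code
dti	Debt-to-income ratio
delinquency_2years	Number of delinquencies of more than 30 days past due in the borrower's credit file over the past 2 years
ficoRangeLow	Lower bound of the borrower's FICO range at loan issuance
ficoRangeHigh	Upper bound of the borrower's FICO range at loan issuance
openAcc	Number of open credit lines in the borrower's credit file
pubRec	Number of derogatory public records
pubRecBankruptcies	Number of public records cleared
revolBal	Total revolving credit balance
revolUtil	Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit
totalAcc	Total number of credit lines currently in the borrower's credit file
initialListStatus	Initial listing status of the loan
applicationType	Whether the loan is an individual application or a joint application with two co-borrowers
earliesCreditLine	Month in which the borrower's earliest reported credit line was opened
title	Loan title provided by the borrower
policyCode	policy_code=1 for publicly available products; policy_code=2 for new products not publicly available
n-series anonymous features	Anonymous features n0-n14, derived from counts of borrower behavior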

Use info() to check the data types:

    data_train.info()
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 800000 entries, 0 to 799999
    Data columns (total 47 columns):
     #   Column              Non-Null Count   Dtype  
    ---  ------              --------------   -----  
     0   id                  800000 non-null  int64  
     1   loanAmnt            800000 non-null  float64
     2   term                800000 non-null  int64  
     3   interestRate        800000 non-null  float64
     4   installment         800000 non-null  float64
     5   grade               800000 non-null  object 
     6   subGrade            800000 non-null  object 
     7   employmentTitle     799999 non-null  float64
     8   employmentLength    753201 non-null  object 
     9   homeOwnership       800000 non-null  int64  
     10  annualIncome        800000 non-null  float64
     11  verificationStatus  800000 non-null  int64  
     12  issueDate           800000 non-null  object 
     13  isDefault           800000 non-null  int64  
     14  purpose             800000 non-null  int64  
     15  postCode            799999 non-null  float64
     16  regionCode          800000 non-null  int64  
     17  dti                 799761 non-null  float64
     18  delinquency_2years  800000 non-null  float64
     19  ficoRangeLow        800000 non-null  float64
     20  ficoRangeHigh       800000 non-null  float64
     21  openAcc             800000 non-null  float64
     22  pubRec              800000 non-null  float64
     23  pubRecBankruptcies  799595 non-null  float64
     24  revolBal            800000 non-null  float64
     25  revolUtil           799469 non-null  float64
     26  totalAcc            800000 non-null  float64
     27  initialListStatus   800000 non-null  int64  
     28  applicationType     800000 non-null  int64  
     29  earliesCreditLine   800000 non-null  object 
     30  title               799999 non-null  float64
     31  policyCode          800000 non-null  float64
     32  n0                  759730 non-null  float64
     33  n1                  759730 non-null  float64
     34  n2                  759730 non-null  float64
     35  n2.1                759730 non-null  float64
     36  n4                  766761 non-null  float64
     37  n5                  759730 non-null  float64
     38  n6                  759730 non-null  float64
     39  n7                  759730 non-null  float64
     40  n8                  759729 non-null  float64
     41  n9                  759730 non-null  float64
     42  n10                 766761 non-null  float64
     43  n11                 730248 non-null  float64
     44  n12                 759730 non-null  float64
     45  n13                 759730 non-null  float64
     46  n14                 759730 non-null  float64
    dtypes: float64(33), int64(9), object(5)
    memory usage: 286.9+ MB
    
    
A rough look at the basic statistics of each feature:

    data_train.describe()
    
    

id	loanAmnt	term	interestRate	installment	employmentTitle	homeOwnership	annualIncome	verificationStatus	isDefault	…	n5	n6	n7	n8	n9	n10	n11	n12	n13	n14
count	800000.000000	800000.000000	800000.000000	800000.000000	800000.000000	799999.000000	800000.000000	8.000000e+05	800000.000000	800000.000000	…	759730.000000	759730.000000	759730.000000	759729.000000	759730.000000	766761.000000	730248.000000	759730.000000	759730.000000	759730.000000
mean	399999.500000	14416.818875	3.482745	13.238391	437.947723	72005.351714	0.614213	7.613391e+04	1.009683	0.199513	…	8.107937	8.575994	8.282953	14.622488	5.592345	11.643896	0.000815	0.003384	0.089366	2.178606
std	230940.252015	8716.086178	0.855832	4.765757	261.460393	106585.640204	0.675749	6.894751e+04	0.782716	0.399634	…	4.799210	7.400536	4.561689	8.124610	3.216184	5.484104	0.030075	0.062041	0.509069	1.844377
min	0.000000	500.000000	3.000000	5.310000	15.690000	0.000000	0.000000	0.000000e+00	0.000000	0.000000	…	0.000000	0.000000	0.000000	1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	199999.750000	8000.000000	3.000000	9.750000	248.450000	427.000000	0.000000	4.560000e+04	0.000000	0.000000	…	5.000000	4.000000	5.000000	9.000000	3.000000	8.000000	0.000000	0.000000	0.000000	1.000000
50%	399999.500000	12000.000000	3.000000	12.740000	375.135000	7755.000000	1.000000	6.500000e+04	1.000000	0.000000	…	7.000000	7.000000	7.000000	13.000000	5.000000	11.000000	0.000000	0.000000	0.000000	2.000000
75%	599999.250000	20000.000000	3.000000	15.990000	580.710000	117663.500000	1.000000	9.000000e+04	2.000000	0.000000	…	11.000000	11.000000	10.000000	19.000000	7.000000	14.000000	0.000000	0.000000	0.000000	3.000000
max	799999.000000	40000.000000	5.000000	30.990000	1715.420000	378351.000000	5.000000	1.099920e+07	2.000000	1.000000	…	70.000000	132.000000	79.000000	128.000000	45.000000	82.000000	4.000000	4.000000	39.000000	30.000000

8 rows × 42 columns

    # DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
    pd.concat([data_train.head(3), data_train.tail(3)])

id	loanAmnt	term	interestRate	installment	grade	subGrade	employmentTitle	employmentLength	homeOwnership	…	n5	n6	n7	n8	n9	n10	n11	n12	n13	n14
0	0	35000.0	5	19.52	917.97	E	E2	320.0	2 years	2	…	9.0	8.0	4.0	12.0	2.0	7.0	0.0	0.0	0.0	2.0
1	1	18000.0	5	18.49	461.90	D	D2	219843.0	5 years	0	…	NaN	NaN	NaN	NaN	NaN	13.0	NaN	NaN	NaN	NaN
2	2	12000.0	5	16.99	298.17	D	D3	31698.0	8 years	0	…	0.0	21.0	4.0	5.0	3.0	11.0	0.0	0.0	0.0	4.0
799997	799997	6000.0	3	13.33	203.12	C	C3	2582.0	10+ years	1	…	4.0	26.0	4.0	10.0	4.0	5.0	0.0	0.0	1.0	4.0
799998	799998	19200.0	3	6.92	592.14	A	A4	151.0	10+ years	0	…	10.0	6.0	12.0	22.0	8.0	16.0	0.0	0.0	0.0	5.0
799999	799999	9000.0	3	11.06	294.91	B	B3	13.0	5 years	0	…	3.0	4.0	4.0	8.0	3.0	7.0	0.0	0.0	0.0	2.0

6 rows × 47 columns

3. Data analysis and visualization

3.1 Missing values and unique values

Check for missing values:

    print(f'There are {data_train.isnull().any().sum()} columns in train dataset with missing values.')
    
    

    There are 22 columns in train dataset with missing values.
    
    
So 22 features in the training set have missing values. Next, check whether any feature has a missing rate above 50%:

    # Missing rate per column, keeping only the columns that are more than half empty
    have_null_fea_dict = (data_train.isnull().sum()/len(data_train)).to_dict()
    fea_null_moreThanHalf = {}
    for key, value in have_null_fea_dict.items():
        if value > 0.5:
            fea_null_moreThanHalf[key] = value

    fea_null_moreThanHalf

    {}
    
    
Look at which features have missing values and what their missing rates are:

    # Visualize the missing rates
    missing = data_train.isnull().sum()/len(data_train)
    missing = missing[missing > 0]
    missing.sort_values(inplace=True)
    missing.plot.bar()

    <matplotlib.axes._subplots.AxesSubplot at 0x1229ab890>
    
    

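The first step is to see which columns contain NaN values and how many, mainly to find columns whose missing rate is very high. A column that is mostly empty can be dropped; if only a small share is missing, filling the gaps is usually the better choice, since it avoids losing data.
The same check can be made across rows: if some samples are missing most of their fields and the dataset is large enough, consider dropping those rows.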
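
The column-wise and row-wise checks can be sketched as follows. This is only an illustration on a toy DataFrame: the 0.5 thresholds and the median fill are assumptions, not choices made in the article.

```python
import numpy as np
import pandas as pd

# Toy data with deliberately uneven missingness.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0],
    "b": [np.nan, np.nan, np.nan, np.nan, 1.0],
    "c": [1.0, 2.0, 3.0, np.nan, 5.0],
})

col_threshold = 0.5  # assumed cut-off for dropping a column
row_threshold = 0.5  # assumed cut-off for dropping a row

# Column-wise: share of missing values per column.
col_missing = df.isnull().mean()
cols_to_drop = col_missing[col_missing > col_threshold].index.tolist()
reduced = df.drop(columns=cols_to_drop)

# Row-wise: share of missing fields per sample, computed on the remaining columns.
row_missing = reduced.isnull().mean(axis=1)
kept = reduced.loc[row_missing <= row_threshold]

# Fill what is left; the median is one common choice for numerical columns.
filled = kept.fillna(kept.median())
```

Dropping columns before rows matters here: a nearly empty column would otherwise inflate every row's missing share.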

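Tip: LightGBM (lgb), a staple of competitions, can handle missing values automatically.

Find the features that take only one value in the training set and the test set: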

    one_value_fea = [col for col in data_train.columns if data_train[col].nunique() <= 1]
    
    

    one_value_fea_test = [col for col in data_test_a.columns if data_test_a[col].nunique() <= 1]
    
    

    one_value_fea
    
    

    ['policyCode']
    
    

    one_value_fea_test
    
    

    ['policyCode']
    
    

    print(f'There are {len(one_value_fea)} columns in train dataset with one unique value.')
    print(f'There are {len(one_value_fea_test)} columns in test dataset with one unique value.')
    
    

    There are 1 columns in train dataset with one unique value.
    There are 1 columns in test dataset with one unique value.
    
    
Summary:

22 of the 47 columns have missing data, which is normal for real-world data. 'policyCode' takes a single unique value (or is entirely missing). There are many continuous variables and a few categorical ones.

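3.2 Numerical and object-type features

Features are generally either categorical or numerical, and numerical features are further divided into continuous and discrete.
- Categorical features sometimes carry no numeric relationship and sometimes do. Take 'grade': whether the levels A, B, C are just labels or are ordered, with A ranking above the others, has to be judged from the business context.
- Numerical features can be fed to a model directly, but risk-control practitioners often bin them and convert the bins to WOE (weight of evidence) encoding before building scorecards and similar models. Binning reduces the effect of noise in the variable and tends to strengthen the relationship between the independent variable and the target, which makes the resulting model more stable.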
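
As a rough illustration of binning followed by WOE encoding, the sketch below uses toy data and two equal-frequency bins. It assumes the convention WOE = ln(share of good loans in the bin / share of bad loans in the bin); some references use the inverse ratio, and the article does not say which one it intends.

```python
import numpy as np
import pandas as pd

# Toy data; isDefault = 1 marks a bad loan.
df = pd.DataFrame({
    "interestRate": [6.0, 7.5, 9.0, 11.0, 12.5, 14.0, 16.0, 18.5, 21.0, 24.0],
    "isDefault":    [0,   0,   0,   1,    0,    0,    1,    0,    1,    1],
})

# Equal-frequency binning into 2 bins (bin count chosen only for the example).
df["rate_bin"] = pd.qcut(df["interestRate"], q=2, labels=False)

# Count bad and good loans per bin.
grouped = df.groupby("rate_bin")["isDefault"].agg(bad="sum", total="count")
grouped["good"] = grouped["total"] - grouped["bad"]

# WOE per bin under the assumed convention.
good_share = grouped["good"] / grouped["good"].sum()
bad_share = grouped["bad"] / grouped["bad"].sum()
grouped["woe"] = np.log(good_share / bad_share)

# Replace each raw value with the WOE of its bin.
df["rate_woe"] = df["rate_bin"].map(grouped["woe"])
print(grouped)
```

A bin with no good or no bad loans makes the ratio zero or infinite, so real pipelines usually merge sparse bins or add a small smoothing constant first.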


    numerical_fea = list(data_train.select_dtypes(exclude=['object']).columns)
    category_fea = list(filter(lambda x: x not in numerical_fea,list(data_train.columns)))
    # category_fea = [i for i in data_train.columns if i not in numerical_fea]
    
    

    numerical_fea
    
    

    ['id',
     'loanAmnt',
     'term',
     'interestRate',
     'installment',
     'employmentTitle',
     'homeOwnership',
     'annualIncome',
     'verificationStatus',
     'isDefault',
     'purpose',
     'postCode',
     'regionCode',
     'dti',
     'delinquency_2years',
     'ficoRangeLow',
     'ficoRangeHigh',
     'openAcc',
     'pubRec',
     'pubRecBankruptcies',
     'revolBal',
     'revolUtil',
     'totalAcc',
     'initialListStatus',
     'applicationType',
     'title',
     'policyCode',
     'n0',
     'n1',
     'n2',
     'n2.1',
     'n4',
     'n5',
     'n6',
     'n7',
     'n8',
     'n9',
     'n10',
     'n11',
     'n12',
     'n13',
     'n14']
    
    

    category_fea
    
    

    ['grade', 'subGrade', 'employmentLength', 'issueDate', 'earliesCreditLine']
    
    

    data_train.grade
    
    

    0         E
    1         D
    2         D
    3         A
    4         C
         ..
    799995    C
    799996    A
    799997    C
    799998    A
    799999    B
    Name: grade, Length: 800000, dtype: object
    
    
3.2.1 Numerical variables

Split the numerical variables into continuous and discrete ones:

    # Split numerical features into continuous and discrete ones:
    # a feature with at most 10 distinct values is treated as discrete
    def get_numerical_serial_fea(data, feas):
        numerical_serial_fea = []
        numerical_noserial_fea = []
        for fea in feas:
            temp = data[fea].nunique()
            if temp <= 10:
                numerical_noserial_fea.append(fea)
                continue
            numerical_serial_fea.append(fea)
        return numerical_serial_fea, numerical_noserial_fea

    numerical_serial_fea, numerical_noserial_fea = get_numerical_serial_fea(data_train, numerical_fea)

    numerical_serial_fea
    
    

    ['id',
     'loanAmnt',
     'interestRate',
     'installment',
     'employmentTitle',
     'annualIncome',
     'purpose',
     'postCode',
     'regionCode',
     'dti',
     'delinquency_2years',
     'ficoRangeLow',
     'ficoRangeHigh',
     'openAcc',
     'pubRec',
     'pubRecBankruptcies',
     'revolBal',
     'revolUtil',
     'totalAcc',
     'title',
     'n0',
     'n1',
     'n2',
     'n2.1',
     'n4',
     'n5',
     'n6',
     'n7',
     'n8',
     'n9',
     'n10',
     'n13',
     'n14']
    
    

    numerical_noserial_fea
    
    

    ['term',
     'homeOwnership',
     'verificationStatus',
     'isDefault',
     'initialListStatus',
     'applicationType',
     'policyCode',
     'n11',
     'n12']
    
    
Discrete numerical variables

    for i in numerical_noserial_fea:
        print(data_train[i].value_counts())
        print()

    3    606902
    5    193098
    Name: term, dtype: int64
    
    0    395732
    1    317660
    2     86309
    3       185
    5        81
    4        33
    Name: homeOwnership, dtype: int64
    
    1    309810
    2    248968
    0    241222
    Name: verificationStatus, dtype: int64
    
    0    640390
    1    159610
    Name: isDefault, dtype: int64
    
    0    466438
    1    333562
    Name: initialListStatus, dtype: int64
    
    0    784586
    1     15414
    Name: applicationType, dtype: int64
    
    1.0    800000
    Name: policyCode, dtype: int64
    
    0.0    729682
    1.0       540
    2.0        24
    4.0         1
    3.0         1
    Name: n11, dtype: int64
    
    0.0    757315
    1.0      2281
    2.0       115
    3.0        16
    4.0         3
    Name: n12, dtype: int64
    
    
Continuous numerical variables

    # Visualize the distribution of every continuous numerical feature
    # (sns.distplot is deprecated in recent seaborn releases; histplot with kde=True replaces it)
    f = pd.melt(data_train, value_vars=numerical_serial_fea)
    g = sns.FacetGrid(f, col="variable",  col_wrap=3, sharex=False, sharey=False)
    g = g.map(sns.histplot, "value", kde=True, stat="density")

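Look at the distribution of each numerical variable and check whether it is roughly normal. If it is clearly not, take the log of the variable and check again.
If you plan to standardize a batch of variables in one pass, set aside the ones that have already been transformed this way first.
Reasons for normalizing: in some cases it helps the model converge faster; some models expect roughly normal inputs (for example GMM and KNN); and heavily skewed data can distort predictions, so the aim is simply to keep the data from being too skewed.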
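
One quick numeric check to go with the plots is sample skewness before and after the log transform. The sketch below assumes synthetic log-normal values standing in for a right-skewed column such as annualIncome; the parameters are arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed values (not the competition data).
income = pd.Series(rng.lognormal(mean=11.0, sigma=0.6, size=5000), name="income")

skew_raw = income.skew()
# log1p(x) = log(1 + x) stays defined when a column contains zeros.
skew_log = np.log1p(income).skew()

print(f"skew before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

A skewness near zero after the transform suggests the log was a reasonable choice; if it stays large, a different transform or binning may be a better fit.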

    # Plot the distribution of loanAmnt before and after a log transform.
    # distplot is deprecated in recent seaborn releases; histplot with kde=True replaces it.
    plt.figure(figsize=(16,12))
    plt.suptitle('Loan Amount Distribution', fontsize=22)
    plt.subplot(221)
    sub_plot_1 = sns.histplot(data_train['loanAmnt'], kde=True, stat="density")
    sub_plot_1.set_title("loanAmnt Distribution", fontsize=18)
    sub_plot_1.set_xlabel("")
    sub_plot_1.set_ylabel("Probability", fontsize=15)

    plt.subplot(222)
    sub_plot_2 = sns.histplot(np.log(data_train['loanAmnt']), kde=True, stat="density")
    sub_plot_2.set_title("loanAmnt (Log) Distribution", fontsize=18)
    sub_plot_2.set_xlabel("")
    sub_plot_2.set_ylabel("Probability", fontsize=15)

    Text(0, 0.5, 'Probability')

3.2.2 Object-type variables

    category_fea
    
    

    ['grade', 'subGrade', 'employmentLength', 'issueDate', 'earliesCreditLine']
    
    
    for i in category_fea:
        print(data_train[i].value_counts())
        print()

    B    233690
    C    227118
    A    139661
    D    119453
    E     55661
    F     19053
    G      5364
    Name: grade, dtype: int64
    
    C1    50763
    B4    49516
    B5    48965
    B3    48600
    C2    47068
    C3    44751
    C4    44272
    B2    44227
    B1    42382
    C5    40264
    A5    38045
    A4    30928
    D1    30538
    D2    26528
    A1    25909
    D3    23410
    A3    22655
    A2    22124
    D4    21139
    D5    17838
    E1    14064
    E2    12746
    E3    10925
    E4     9273
    E5     8653
    F1     5925
    F2     4340
    F3     3577
    F4     2859
    F5     2352
    G1     1759
    G2     1231
    G3      978
    G4      751
    G5      645
    Name: subGrade, dtype: int64
    
    10+ years    262753
    2 years       72358
    < 1 year      64237
    3 years       64152
    1 year        52489
    5 years       50102
    4 years       47985
    6 years       37254
    8 years       36192
    7 years       35407
    9 years       30272
    Name: employmentLength, dtype: int64
    
    2016-03-01    29066
    2015-10-01    25525
    2015-07-01    24496
    2015-12-01    23245
    2014-10-01    21461
              ...  
    2007-08-01       23
    2007-07-01       21
    2008-09-01       19
    2007-09-01        7
    2007-06-01        1
    Name: issueDate, Length: 139, dtype: int64
    
    Aug-2001    5567
    Sep-2003    5403
    Aug-2002    5403
    Oct-2001    5258
    Aug-2000    5246
            ... 
    Nov-1954       1
    Jan-1944       1
    Oct-1954       1
    Nov-1953       1
    Mar-1958       1
    Name: earliesCreditLine, Length: 720, dtype: int64
    
    
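Summary:

We used value_counts() and similar functions above to look at how feature values are distributed, but charts are the quickest way to summarize raw information.
Numbers without a picture offer little intuition.
The same dataset plotted on different scales can suggest different patterns, and the defaults Python uses when turning data into charts may not match what you need.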

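3.3 Visualizing variable distributions

3.3.1 Distribution of a single variable

    plt.figure(figsize=(8, 8))
    # Recent seaborn versions require x and y to be passed as keyword arguments
    emp_counts = data_train["employmentLength"].value_counts(dropna=False)[:20]
    sns.barplot(x=emp_counts.values, y=emp_counts.index.astype(str))
    plt.show()

3.3.2 Distribution of a feature x by value of y

First, the distribution of categorical variables for each value of y: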

    train_loan_fr = data_train.loc[data_train['isDefault'] == 1]
    train_loan_nofr = data_train.loc[data_train['isDefault'] == 0]
    
    

    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 8))
    train_loan_fr.groupby('grade')['grade'].count().plot(kind='barh', ax=ax1, title='Count of grade fraud')
    train_loan_nofr.groupby('grade')['grade'].count().plot(kind='barh', ax=ax2, title='Count of grade non-fraud')
    train_loan_fr.groupby('employmentLength')['employmentLength'].count().plot(kind='barh', ax=ax3, title='Count of employmentLength fraud')
    train_loan_nofr.groupby('employmentLength')['employmentLength'].count().plot(kind='barh', ax=ax4, title='Count of employmentLength non-fraud')
    plt.show()
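
Raw counts per class are hard to compare when the two classes differ a lot in size, so a rate per category can be a useful complement. A minimal sketch on toy data (the values are invented; only the column names follow the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "grade": ["A", "A", "A", "B", "B", "C", "C", "C", "C"],
    "isDefault": [0, 0, 1, 0, 1, 1, 1, 0, 1],
})

# The mean of a 0/1 target within each category is that category's default rate.
rate_by_grade = toy.groupby("grade")["isDefault"].mean()

# Counts of each (grade, isDefault) pair, for reference.
counts = pd.crosstab(toy["grade"], toy["isDefault"])
print(rate_by_grade)
```

On the real data the same two lines would run on data_train, and rate_by_grade.plot(kind='barh') would show the rates in the style of the charts above.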
    
    
Next, the distribution of a continuous variable for each value of y:

    fig, ((ax1, ax2)) = plt.subplots(1, 2, figsize=(15, 6))
    data_train.loc[data_train['isDefault'] == 1] \
    ['loanAmnt'].apply(np.log) \
    .plot(kind='hist',
          bins=100,
          title='Log Loan Amt - Fraud',
          color='r',
          xlim=(-3, 10),
         ax= ax1)
    data_train.loc[data_train['isDefault'] == 0] \
    ['loanAmnt'].apply(np.log) \
    .plot(kind='hist',
          bins=100,
          title='Log Loan Amt - Not Fraud',
          color='b',
          xlim=(-3, 10),
         ax=ax2)
    
    

    <matplotlib.axes._subplots.AxesSubplot at 0x126a44b50>
    
    

    total = len(data_train)
    total_amt = data_train.groupby(['isDefault'])['loanAmnt'].sum().sum()
    plt.figure(figsize=(12,5))
    plt.subplot(121)  # 1 row, 2 columns, so two plots in total; this draws the first
    plot_tr = sns.countplot(x='isDefault', data=data_train)  # number of rows in each isDefault class
    plot_tr.set_title("Fraud Loan Distribution \n 0: good user | 1: bad user", fontsize=14)
    plot_tr.set_xlabel("Is fraud by count", fontsize=16)
    plot_tr.set_ylabel('Count', fontsize=16)
    # Label each bar with its share of all rows
    for p in plot_tr.patches:
        height = p.get_height()
        plot_tr.text(p.get_x()+p.get_width()/2.,
                height + 3,
                '{:1.2f}%'.format(height/total*100),
                ha="center", fontsize=15)

    percent_amt = (data_train.groupby(['isDefault'])['loanAmnt'].sum())
    percent_amt = percent_amt.reset_index()
    plt.subplot(122)
    plot_tr_2 = sns.barplot(x='isDefault', y='loanAmnt',  dodge=True, data=percent_amt)
    plot_tr_2.set_title("Total Amount in loanAmnt  \n 0: good user | 1: bad user", fontsize=14)
    plot_tr_2.set_xlabel("Is fraud by percent", fontsize=16)
    plot_tr_2.set_ylabel('Total Loan Amount Scalar', fontsize=16)
    # Label each bar with its share of the total loan amount
    for p in plot_tr_2.patches:
        height = p.get_height()
        plot_tr_2.text(p.get_x()+p.get_width()/2.,
                height + 3,
                '{:1.2f}%'.format(height/total_amt * 100),
                ha="center", fontsize=15)

3.4 Processing and visualizing time data

    # Convert to datetime. issueDateDT is the number of days between the issue date
    # and the earliest date in the dataset (2007-06-01).
    data_train['issueDate'] = pd.to_datetime(data_train['issueDate'],format='%Y-%m-%d')
    startdate = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
    data_train['issueDateDT'] = data_train['issueDate'].apply(lambda x: x-startdate).dt.days
    
    
    # Convert to datetime (parse the test set's own issueDate column, not the training set's)
    data_test_a['issueDate'] = pd.to_datetime(data_test_a['issueDate'],format='%Y-%m-%d')
    startdate = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
    data_test_a['issueDateDT'] = data_test_a['issueDate'].apply(lambda x: x-startdate).dt.days
    
    

    plt.hist(data_train['issueDateDT'], label='train');
    plt.hist(data_test_a['issueDateDT'], label='test');
    plt.legend();
    plt.title('Distribution of issueDateDT dates');
    # The train and test issueDateDT ranges overlap, so a time-based split is not a sensible validation scheme
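
The overlap seen in the histogram can also be checked numerically. A minimal sketch, assuming two short date series that stand in for the train and test issueDate columns (the dates are hypothetical):

```python
import pandas as pd

# Hypothetical stand-ins for data_train['issueDate'] and data_test_a['issueDate'].
train_dates = pd.to_datetime(pd.Series(["2007-06-01", "2012-03-01", "2018-12-01"]))
test_dates = pd.to_datetime(pd.Series(["2007-07-01", "2013-05-01", "2018-12-01"]))

# The two ranges overlap when the later start is not after the earlier end.
overlap_start = max(train_dates.min(), test_dates.min())
overlap_end = min(train_dates.max(), test_dates.max())
has_overlap = overlap_start <= overlap_end

# Share of test rows whose date falls inside the training date range.
test_in_train_range = test_dates.between(train_dates.min(), train_dates.max()).mean()
print(has_overlap, test_in_train_range)
```

A share close to 1 means the test set is not a later time slice of the training data, which supports using a random or stratified split rather than a time-based one.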
    
    
3.5 Pivot tables

    # Pivot table: index can list several columns, columns is optional, and the
    # aggregation function aggfunc is applied to the columns listed in values.
    pivot = pd.pivot_table(data_train, index=['grade'], columns=['issueDateDT'], values=['loanAmnt'], aggfunc='sum')
    
    

    pivot
    
    

loanAmnt
issueDateDT	0	30	61	92	122	153	183	214	245	274	…	3926	3957	3987	4018	4048	4079	4110	4140	4171	4201
grade
A	NaN	53650.0	42000.0	19500.0	34425.0	63950.0	43500.0	168825.0	85600.0	101825.0	…	13093850.0	11757325.0	11945975.0	9144000.0	7977650.0	6888900.0	5109800.0	3919275.0	2694025.0	2245625.0
B	NaN	13000.0	24000.0	32125.0	7025.0	95750.0	164300.0	303175.0	434425.0	538450.0	…	16863100.0	17275175.0	16217500.0	11431350.0	8967750.0	7572725.0	4884600.0	4329400.0	3922575.0	3257100.0
C	NaN	68750.0	8175.0	10000.0	61800.0	52550.0	175375.0	151100.0	243725.0	393150.0	…	17502375.0	17471500.0	16111225.0	11973675.0	10184450.0	7765000.0	5354450.0	4552600.0	2870050.0	2246250.0
D	NaN	NaN	5500.0	2850.0	28625.0	NaN	167975.0	171325.0	192900.0	269325.0	…	11403075.0	10964150.0	10747675.0	7082050.0	7189625.0	5195700.0	3455175.0	3038500.0	2452375.0	1771750.0
E	7500.0	NaN	10000.0	NaN	17975.0	1500.0	94375.0	116450.0	42000.0	139775.0	…	3983050.0	3410125.0	3107150.0	2341825.0	2225675.0	1643675.0	1091025.0	1131625.0	883950.0	802425.0
F	NaN	NaN	31250.0	2125.0	NaN	NaN	NaN	49000.0	27000.0	43000.0	…	1074175.0	868925.0	761675.0	685325.0	665750.0	685200.0	316700.0	315075.0	72300.0	NaN
G	NaN	NaN	NaN	NaN	NaN	NaN	NaN	24625.0	NaN	NaN	…	56100.0	243275.0	224825.0	64050.0	198575.0	245825.0	53125.0	23750.0	25100.0	1000.0

7 rows × 139 columns
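
To make the roles of index, columns, values and aggfunc concrete, here is a small sketch on toy data (the rows are invented; the column names follow the dataset). It also shows the equivalent groupby form.

```python
import pandas as pd

toy = pd.DataFrame({
    "grade": ["A", "A", "B", "B", "B"],
    "issueDateDT": [0, 30, 0, 0, 30],
    "loanAmnt": [1000.0, 2000.0, 500.0, 1500.0, 3000.0],
})

# One row per grade, one column per issueDateDT value, cells hold summed loanAmnt.
toy_pivot = pd.pivot_table(toy, index="grade", columns="issueDateDT",
                           values="loanAmnt", aggfunc="sum")

# The same table built with groupby + unstack.
toy_grouped = toy.groupby(["grade", "issueDateDT"])["loanAmnt"].sum().unstack()
print(toy_pivot)
```

Combinations that never occur in the data show up as NaN, which is why the real pivot above has NaN cells for the earliest months.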

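3.6 Generating a data report

    # Note: the pandas-profiling package has since been renamed ydata-profiling
    # (import ydata_profiling); the import below assumes the older package is installed.
    import pandas_profiling

    pfr = pandas_profiling.ProfileReport(data_train)
    pfr.to_file("./example.html")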

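4. Summary

Exploratory data analysis (EDA) is how we first get to know the data and prepare for feature engineering. In many cases the features surfaced during EDA can be used directly as rules or as model inputs, which is why a thorough EDA is worth the effort. The job at this stage is to use basic statistics and simple models to understand the overall distribution of the data and the relationships among the variables, and to present them in suitable charts that support the later mining and analysis work.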
