How Individuals Can Use Python for Finance: A Python Financial Data Analysis Case Study


This article presents a worked example of applying Python to data analysis in the financial sector, in the hope of helping readers better understand this kind of application.

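Background

On customer retention, a common view in the industry is that reducing the churn rate by 5% can raise a company's profits by 25% to 85%. Rising customer acquisition costs have left telecom operators facing a bottleneck in acquiring new users. As market competition intensifies, telecommunications operators increasingly need to address the challenge of enhancing customer retention, so analyzing and predicting churn among telecom customers is of real practical importance.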

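This article covers the following:

1. Background
2. Posing the questions
3. Understanding the data
4. Data cleaning
5. Visual analysis
6. Churn prediction
7. Conclusions and recommendations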

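Posing the Questions

Explore the relationship between customer features and churn.
Overall, what characteristics do churned customers have in common?
Find a model that identifies churning customers effectively.
Propose recommendations for improving customer loyalty and reducing the likelihood of churn.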

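Understanding the Data

According to its description, the dataset has 21 fields and 7043 records. Each record holds the details of one customer. The goal is to analyze how the first 20 columns relate to customer churn, which is recorded in the last column.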

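Data Cleaning

Data cleaning follows four checks (completeness, comprehensiveness, validity, uniqueness):

Completeness: make sure no single record has empty values and that all expected fields are present.
Comprehensiveness: inspect all the values in a column and use common sense to judge whether the column has potential problems.
Validity: make sure the data types and the values themselves are legitimate.
Uniqueness: make sure there are no duplicate records.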
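The four checks above can be sketched as a small helper. This is a minimal illustration on toy data, not the article's own code: the function name `quality_report`, its return keys, and the toy rows are all hypothetical.

```python
import pandas as pd

def quality_report(df: pd.DataFrame, key: str) -> dict:
    """Summarize the four cleaning checks for a DataFrame (hypothetical helper)."""
    return {
        # Completeness: number of missing values in each column
        "missing_per_column": df.isnull().sum().to_dict(),
        # Comprehensiveness: number of distinct values per column, to spot oddities
        "distinct_per_column": df.nunique().to_dict(),
        # Validity: columns that are not stored as numbers and may hide numeric text
        "non_numeric_columns": df.select_dtypes(exclude="number").columns.tolist(),
        # Uniqueness: number of rows whose key repeats an earlier row
        "duplicate_keys": int(df[key].duplicated().sum()),
    }

# Toy rows (assumed): one duplicated ID and one blank string in a numeric field
toy = pd.DataFrame({
    "customerID": ["a", "b", "b"],
    "TotalCharges": ["29.85", " ", "108.15"],
    "tenure": [1, 0, 2],
})
report = quality_report(toy, key="customerID")
print(report)
```

Note that the blank string in `TotalCharges` is not counted as missing by `isnull()`; it only shows up because the column is reported as non-numeric, which is the same situation the real dataset runs into further down.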
Import the packages needed for the steps that follow and load the data.

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns

    customerDF = pd.read_csv('/home/kesci/input/yidong4170/WA_Fn-UseC_-Telco-Customer-Churn.csv')

    # Check the size of the dataset
    customerDF.shape
    # Result: (7043, 21)

    # Show all columns instead of truncating them
    pd.set_option('display.max_columns', None)

    # Look at the first 10 rows
    customerDF.head(10)

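    # Count nulls per column
    pd.isnull(customerDF).sum()

    # Check the data types
    customerDF.info()
    # customerDF.dtypes

    # Converting 'TotalCharges' (total charges) to float fails at this point
    # with an error: some strings cannot be converted to numbers.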

Checking each field's data type, contents, and row count one by one shows that the 'TotalCharges' column has no recorded value for 11 customers.

    # Inspect the values in every column
    for x in customerDF.columns:
        test = customerDF.loc[:, x].value_counts()
        print('{0}: number of rows: {1}'.format(x, test.sum()))
        print('{0}: data type: {1}'.format(x, customerDF[x].dtypes))
        print('{0}: values:\n{1}\n'.format(x, test))

Coerce 'TotalCharges' (total charges) to a floating-point type.

    # Coerce to numeric; values that cannot be converted become NaN
    customerDF['TotalCharges'] = pd.to_numeric(customerDF['TotalCharges'], errors='coerce')
    customerDF['TotalCharges'] = customerDF['TotalCharges'].astype('float64')
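To make the coercion step concrete, here is a minimal sketch on a toy Series. The three values are made up for illustration; the blank string stands in for the unrecorded totals described above.

```python
import pandas as pd

# Toy column (assumed values): the blank string cannot be parsed as a number
raw = pd.Series(["29.85", " ", "1889.5"], name="TotalCharges")

# errors="coerce" turns unparseable entries into NaN instead of raising
numeric = pd.to_numeric(raw, errors="coerce")

print(numeric.isna().sum())  # number of values that could not be converted
print(numeric.dtype)
```

Coercing first and then counting NaN values is what makes the unrecorded rows visible, since they were not nulls in the raw file.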

    test = customerDF.loc[:, 'TotalCharges'].value_counts().sort_index()
    print(test.sum())
    # Result: 7032

    # tenure of the rows where TotalCharges is missing
    print(customerDF.tenure[customerDF['TotalCharges'].isnull()])
    # Result: 11 rows

    # Fill the missing total charges with the monthly charges
    customerDF['TotalCharges'] = customerDF['TotalCharges'].fillna(customerDF['MonthlyCharges'])
    # Check that the replacement worked
    print(customerDF[customerDF['tenure'] == 0][['tenure', 'MonthlyCharges', 'TotalCharges']])

    # Change 'tenure' (months on the network) from 0 to 1
    customerDF['tenure'] = customerDF['tenure'].replace(0, 1)
    print(pd.isnull(customerDF['TotalCharges']).sum())
    print(customerDF['TotalCharges'].dtypes)

Review the descriptive statistics; judging by general experience, all values look normal.

    customerDF.describe()

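Visual Analysis

As a rule of thumb, customer features can be grouped into three categories: customer attributes, service attributes, and contract attributes. The visual analysis below follows these three dimensions.

First, look at the number and share of churned customers.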

    plt.rcParams['figure.figsize'] = 6, 6
    plt.pie(customerDF['Churn'].value_counts(), labels=customerDF['Churn'].value_counts().index, autopct='%1.2f%%', explode=(0.1, 0))
    plt.title('Churn(Yes/No) Ratio')
    plt.show()

    # value_counts() returns a Series indexed by the Churn labels
    churn_counts = customerDF['Churn'].value_counts()
    x = churn_counts.index
    y = churn_counts.values
    plt.bar(x, y, width=0.5, color='c')
    plt.title('Churn(Yes/No) Num')
    plt.show()

The dataset is imbalanced: churned customers account for 26.54% of the total.
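One consequence of this imbalance is worth quantifying: a model that always predicts "No" is already right for every non-churned customer. The sketch below works that out. The class counts are reconstructed from the figures in the text (26.54% of 7043 records) and should be treated as an assumption.

```python
import pandas as pd

# Counts reconstructed from the text: about 26.54% of 7043 records churned
churn = pd.Series(["Yes"] * 1869 + ["No"] * 5174, name="Churn")

# Share of each class
shares = churn.value_counts(normalize=True)

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = shares.max()

print(round(shares["Yes"], 4), round(baseline_accuracy, 4))
```

Any classifier evaluated later on accuracy alone has to be read against this majority-class baseline of roughly 73%.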

(1) Customer attribute analysis

    from matplotlib.ticker import PercentFormatter

    def barplot_percentages(feature, orient='v', axis_name="percentage of customers"):
        # Share of all customers in each (feature value, Churn) combination
        g = (customerDF.groupby(feature)["Churn"].value_counts() / len(customerDF)).rename(axis_name).reset_index()
        if orient == 'v':
            ax = sns.barplot(x=feature, y=axis_name, hue='Churn', data=g, orient=orient)
            # Show the value axis as percentages
            ax.yaxis.set_major_formatter(PercentFormatter(1.0, decimals=0))
            plt.rcParams.update({'font.size': 13})
        else:
            ax = sns.barplot(x=axis_name, y=feature, hue='Churn', data=g, orient=orient)
            ax.xaxis.set_major_formatter(PercentFormatter(1.0, decimals=0))
            plt.legend(fontsize=10)
        plt.title('Churn(Yes/No) Ratio as {0}'.format(feature))
        plt.show()

    barplot_percentages("SeniorCitizen")
    barplot_percentages("gender")

    # Encode Churn as 0/1 so that its mean is the churn rate
    customerDF['churn_rate'] = customerDF['Churn'].map({'No': 0, 'Yes': 1})
    g = sns.FacetGrid(customerDF, col="SeniorCitizen", height=4, aspect=.9)
    ax = g.map(sns.barplot, "gender", "churn_rate", palette="Blues_d", order=['Female', 'Male'])
    plt.rcParams.update({'font.size': 13})
    plt.show()

Summary:
Churn is essentially unrelated to gender;
senior customers churn at a markedly higher rate than younger customers.
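The bar charts above plot each group's share of all customers, which mixes group size with churn behaviour. A within-group churn rate makes the comparison between groups more direct. The sketch below contrasts the two on toy data; the rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data (assumed): four non-senior customers, two senior customers
toy = pd.DataFrame({
    "SeniorCitizen": [0, 0, 0, 0, 1, 1],
    "Churn": ["No", "No", "No", "Yes", "Yes", "No"],
})

# Share of ALL customers in each (group, Churn) cell, as in the bar charts
share_of_all = toy.groupby("SeniorCitizen")["Churn"].value_counts() / len(toy)

# Churn rate WITHIN each group: mean of the boolean "churned" flag
churn_rate = toy["Churn"].eq("Yes").groupby(toy["SeniorCitizen"]).mean()

print(share_of_all)
print(churn_rate)
```

In this toy example the two groups contribute the same share of churned customers overall, yet the smaller group has twice the within-group churn rate.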

    fig, axis = plt.subplots(1, 2, figsize=(12, 4))
    axis[0].set_title("Has Partner")
    axis[1].set_title("Has Dependents")
    axis_y = "percentage of customers"

    # Plot Partner column
    gp_partner = (customerDF.groupby('Partner')["Churn"].value_counts() / len(customerDF)).rename(axis_y).reset_index()
    ax1 = sns.barplot(x='Partner', y=axis_y, hue='Churn', data=gp_partner, ax=axis[0])
    ax1.legend(fontsize=10)
    # ax1.set_xlabel('Partner')

    # Plot Dependents column
    gp_dep = (customerDF.groupby('Dependents')["Churn"].value_counts() / len(customerDF)).rename(axis_y).reset_index()
    ax2 = sns.barplot(x='Dependents', y=axis_y, hue='Churn', data=gp_dep, ax=axis[1])
    # ax2.set_xlabel('Dependents')

    # Set the font size
    plt.rcParams.update({'font.size': 20})
    ax2.legend(fontsize=10)

    plt.show()

    # Kernel density estimation
    def kdeplot(feature, xlabel):
        plt.figure(figsize=(9, 4))
        plt.title("KDE for {0}".format(feature))
        ax0 = sns.kdeplot(customerDF[customerDF['Churn'] == 'No'][feature].dropna(), color='navy', label='Churn: No', fill=True)
        ax1 = sns.kdeplot(customerDF[customerDF['Churn'] == 'Yes'][feature].dropna(), color='orange', label='Churn: Yes', fill=True)
        plt.xlabel(xlabel)
        # Set the font size
        plt.rcParams.update({'font.size': 20})
        plt.legend(fontsize=10)

    kdeplot('tenure', 'tenure')
    plt.show()

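Summary:
Customers with a partner churn at a lower rate than customers without one;
relatively few customers have dependents;
customers with dependents also churn at a lower rate than customers without them;
the longer the tenure, the lower the churn rate, which matches everyday experience;
once tenure reaches three months, churned customers become the smaller share among customers of that tenure, which suggests that the customer stabilization period is typically about three months.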
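The tenure finding can also be checked numerically by bucketing tenure and computing the churn rate per bucket. The following is a sketch on toy data; the bin edges, labels, and rows are assumptions chosen for illustration rather than values from the article.

```python
import pandas as pd

# Toy data (assumed): tenure in months and the churn label
toy = pd.DataFrame({
    "tenure": [1, 2, 3, 5, 10, 20, 40, 70],
    "Churn": ["Yes", "Yes", "No", "Yes", "No", "No", "No", "No"],
})

# Hypothetical tenure bands; each interval is closed on the right, e.g. (0, 3]
bins = [0, 3, 6, 12, float("inf")]
labels = ["1-3", "4-6", "7-12", "13+"]
toy["tenure_band"] = pd.cut(toy["tenure"], bins=bins, labels=labels)

# Churn rate within each band
rate_by_band = toy["Churn"].eq("Yes").groupby(toy["tenure_band"], observed=True).mean()
print(rate_by_band)
```

Run on the real `customerDF`, the same few lines would give a table to read alongside the KDE plot, where the crossover point is otherwise judged by eye.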

(2) Service attribute analysis

    plt.figure(figsize=(9, 4.5))
    barplot_percentages("MultipleLines", orient='h')

    plt.figure(figsize=(9, 4.5))
    barplot_percentages("InternetService", orient="h")

    cols = ["PhoneService", "MultipleLines", "OnlineSecurity", "OnlineBackup", "DeviceProtection", "TechSupport", "StreamingTV", "StreamingMovies"]
    # Keep only customers with internet service and reshape to long format
    df1 = pd.melt(customerDF[customerDF["InternetService"] != "No"][cols])
    df1.rename(columns={'value': 'Has service'}, inplace=True)
    plt.figure(figsize=(20, 8))
    # Fix the hue order so that the legend labels below match the bars
    ax = sns.countplot(data=df1, x='variable', hue='Has service', hue_order=['No', 'Yes'])
    ax.set(xlabel='Internet Additional service', ylabel='Num of customers')
    plt.rcParams.update({'font.size': 20})
    plt.legend(labels=['No Service', 'Has Service'], fontsize=15)
    plt.title('Num of Customers as Internet Additional Service')
    plt.show()

    plt.figure(figsize=(20, 8))
    df1 = customerDF[(customerDF.InternetService != "No") & (customerDF.Churn == "Yes")]
    df1 = pd.melt(df1[cols])
    df1.rename(columns={'value': 'Has service'}, inplace=True)
    ax = sns.countplot(data=df1, x='variable', hue='Has service', hue_order=['No', 'Yes'])
    ax.set(xlabel='Internet Additional service', ylabel='Churn Num')
    plt.rcParams.update({'font.size': 20})
    plt.legend(labels=['No Service', 'Has Service'], fontsize=15)
    plt.title('Num of Churn Customers as Internet Additional Service')
    plt.show()

Overall, phone service has little effect on customer retention.

(3) Contract attribute analysis

    plt.figure(figsize=(9, 4.5))
    barplot_percentages("PaymentMethod", orient='h')

    g = sns.FacetGrid(customerDF, col="PaperlessBilling", height=6, aspect=.9)
    ax = g.map(sns.barplot, "Contract", "churn_rate", palette="Blues_d", order=['Month-to-month', 'One year', 'Two year'])
    plt.rcParams.update({'font.size': 18})
    plt.show()

    kdeplot('MonthlyCharges', 'MonthlyCharges')
    kdeplot('TotalCharges', 'TotalCharges')
    plt.show()

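Summary:

Customers who pay by electronic check have the highest churn rate, which suggests that the experience with this payment method is mediocre;
monthly charges between 70 and 110 are the range in which churn is concentrated;
over the long term, customers with higher total charges tend to be retained at a higher rate.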
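A row-normalized cross-tabulation is one compact way to put numbers on statements such as "electronic check has the highest churn rate". The sketch below uses toy rows that are made up for illustration; only the column names follow the dataset.

```python
import pandas as pd

# Toy data (assumed): two payment methods with different churn behaviour
toy = pd.DataFrame({
    "PaymentMethod": ["Electronic check"] * 4 + ["Mailed check"] * 4,
    "Churn": ["Yes", "Yes", "No", "No", "Yes", "No", "No", "No"],
})

# normalize="index" divides each row by its own total,
# so every row shows the churn split within one payment method
rates = pd.crosstab(toy["PaymentMethod"], toy["Churn"], normalize="index")
print(rates)
```

The "Yes" column of `rates` is then the churn rate per payment method and can be sorted to rank the methods.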

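Churn Prediction

The dataset is cleaned further and features are extracted; feature selection reduces the dimensionality; the chosen machine learning models are then fitted and applied to the test set, and the accuracy of each classifier is evaluated.

(1) Data cleaning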

    customerID = customerDF['customerID']
    customerDF.drop(['customerID'], axis=1, inplace=True)
    cateCols = [c for c in customerDF.columns if customerDF[c].dtype == 'object' or c == 'SeniorCitizen']
    dfCate = customerDF[cateCols].copy()
    dfCate.head(3)

    # Encode the features: two-valued columns become 0/1 codes, the rest are one-hot encoded
    for col in cateCols:
        if dfCate[col].nunique() == 2:
            dfCate[col] = pd.factorize(dfCate[col])[0]
        else:
            dfCate = pd.get_dummies(dfCate, columns=[col], dtype=int)
    dfCate['tenure'] = customerDF['tenure']
    dfCate['MonthlyCharges'] = customerDF['MonthlyCharges']
    dfCate['TotalCharges'] = customerDF['TotalCharges']

    # Look at the correlation of each feature with Churn
    plt.figure(figsize=(16, 8))
    dfCate.corr()['Churn'].sort_values(ascending=False).plot(kind='bar')
    plt.show()
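The two encodings used in the loop above behave differently, which a toy example makes easier to see. This is a minimal sketch on made-up rows; only the column names `Partner` and `Contract` come from the dataset.

```python
import pandas as pd

# Toy data (assumed): one two-valued column and one three-valued column
toy = pd.DataFrame({
    "Partner": ["Yes", "No", "No"],
    "Contract": ["Month-to-month", "One year", "Two year"],
})

# factorize assigns integer codes in order of first appearance,
# so here "Yes" becomes 0 and "No" becomes 1
codes, uniques = pd.factorize(toy["Partner"])
toy["Partner"] = codes

# get_dummies expands a multi-valued column into one 0/1 column per value
encoded = pd.get_dummies(toy, columns=["Contract"], dtype=int)
print(encoded.columns.tolist())
```

Because `factorize` codes follow the order in which values first appear, the meaning of 0 and 1 depends on the data; it is worth checking `uniques` before interpreting the sign of a correlation.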

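(2) Feature selection

    from sklearn.model_selection import train_test_split

    # Features to drop
    dropFea = ['gender', 'PhoneService',
               'OnlineSecurity_No internet service', 'OnlineBackup_No internet service',
               'DeviceProtection_No internet service', 'TechSupport_No internet service',
               'StreamingTV_No internet service', 'StreamingMovies_No internet service',
               # 'OnlineSecurity_No', 'OnlineBackup_No',
               # 'DeviceProtection_No', 'TechSupport_No',
               # 'StreamingTV_No', 'StreamingMovies_No',
               ]
    dfCate.drop(dropFea, inplace=True, axis=1)
    # The Churn column is the label
    target = dfCate['Churn'].values
    # All remaining columns are features
    columns = dfCate.columns.tolist()
    columns.remove('Churn')
    features = dfCate[columns].values
    # The split is not shown in the article; a stratified 70/30 split is assumed here
    train_x, test_x, train_y, test_y = train_test_split(features, target, test_size=0.30, stratify=target, random_state=1)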

(3) Building the models

    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
    from sklearn.neighbors import KNeighborsClassifier

    # Candidate classifiers
    classifiers = [
        SVC(random_state=1, kernel='rbf'),
        DecisionTreeClassifier(random_state=1, criterion='gini'),
        RandomForestClassifier(random_state=1, criterion='gini'),
        KNeighborsClassifier(metric='minkowski'),
        AdaBoostClassifier(random_state=1),
    ]

    # Classifier names (used as pipeline step names)
    classifier_names = [
        'svc',
        'decisiontreeclassifier',
        'randomforestclassifier',
        'kneighborsclassifier',
        'adaboostclassifier',
    ]

    # Parameter grids.
    # Note the key format: GridSearchCV expects "step name" + "__" + "parameter name"
    classifier_param_grid = [
        {'svc__C': [0.1], 'svc__gamma': [0.01]},
        {'decisiontreeclassifier__max_depth': [6, 9, 11]},
        {'randomforestclassifier__n_estimators': range(1, 11)},
        {'kneighborsclassifier__n_neighbors': [4, 6, 8]},
        {'adaboostclassifier__n_estimators': [70, 80, 90]},
    ]
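The naming rule for the parameter grids is easiest to see in a small, self-contained run. The sketch below uses synthetic data from `make_classification` rather than the churn dataset, and the grid values are arbitrary; it only illustrates how the pipeline step name and the `__` separator fit together.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumed), standing in for the encoded churn features
X, y = make_classification(n_samples=200, n_features=8, random_state=1)

# The first element of each tuple is the step name
pipeline = Pipeline([("decisiontreeclassifier", DecisionTreeClassifier(random_state=1))])

# Grid keys are "<step name>__<parameter name>"
param_grid = {"decisiontreeclassifier__max_depth": [2, 4, 6]}

search = GridSearchCV(pipeline, param_grid=param_grid, cv=3, scoring="accuracy")
search.fit(X, y)
print(search.best_params_)
```

If a key does not match a step name and parameter, `GridSearchCV` raises an error when fitting, so keeping the names list and the grid keys in sync matters.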

(4) Parameter tuning and model evaluation

The classifiers are tuned and evaluated; in this experiment, AdaBoostClassifier(n_estimators=80) performed best.

    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.metrics import accuracy_score

    # Tune one classifier with GridSearchCV
    def GridSearchCV_work(pipeline, train_x, train_y, test_x, test_y, param_grid, score='accuracy'):
        response = {}
        gridsearch = GridSearchCV(estimator=pipeline, param_grid=param_grid, cv=3, scoring=score)
        # Find the best parameters and the best cross-validated score
        search = gridsearch.fit(train_x, train_y)
        print("GridSearch best parameters:", search.best_params_)
        print("GridSearch best score: %0.4f" % search.best_score_)
        # Predict the test-set labels; the prediction uses the best parameters found above
        predict_y = gridsearch.predict(test_x)
        print("Accuracy: %0.4f" % accuracy_score(test_y, predict_y))
        response['predict_y'] = predict_y
        response['accuracy_score'] = accuracy_score(test_y, predict_y)
        return response

    for model, model_name, model_param_grid in zip(classifiers, classifier_names, classifier_param_grid):
        # StandardScaler would rescale each feature to zero mean and unit variance;
        # that step and PCA are both disabled in this run
        pipeline = Pipeline([
            # ('scaler', StandardScaler()),
            # ('pca', PCA()),
            (model_name, model)
        ])
        result = GridSearchCV_work(pipeline, train_x, train_y, test_x, test_y, model_param_grid, score='accuracy')
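Since the dataset is imbalanced, accuracy on its own can look good while many churners are missed. As a hedged illustration (not part of the article's evaluation), the sketch below compares accuracy with recall and the confusion matrix on a small made-up set of labels, where 1 stands for a churned customer.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Toy labels (assumed): 7 retained customers (0) and 3 churned customers (1)
test_y = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
# A model that finds only one of the three churners
predict_y = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

acc = accuracy_score(test_y, predict_y)   # share of all predictions that are correct
rec = recall_score(test_y, predict_y)     # share of actual churners that were found
cm = confusion_matrix(test_y, predict_y)  # rows: true class, columns: predicted class

print(acc, rec)
print(cm)
```

Here the accuracy is high even though two of the three churners go undetected, which is why recall or the confusion matrix is a useful companion metric for a churn model.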

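Conclusions and Recommendations

The analysis above points to the following characteristics of customers with a high churn rate:

In terms of customer attributes:
senior citizens, customers without a partner, and customers without dependents are more likely to churn.
In terms of services:
customers with a tenure of less than half a year, customers with phone service, and fiber-optic customers who subscribe to streaming TV or streaming movie services are more likely to churn.
In terms of contracts:
customers on short contract terms, and customers who pay by electronic check and use paperless billing, are more likely to churn when their monthly charges fall between 70 and 110.
Based on these characteristics, the following improvements are recommended: improve the management of existing customers; strengthen the acquisition of new customers; and apply differentiated marketing strategies to different customer segments, raising overall satisfaction and thereby lowering the churn rate.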

Use the prediction model to build a list of customers at high risk of churning. Use customer research to define a minimum viable feature set and invite early pilot users to test it. For senior customers, offer personalized and tailored service plans, for example high-quality dedicated plans such as family plans and "warm care" plans.
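Building the high-risk list from a fitted model could look roughly like the sketch below. It is an assumption-laden illustration: the data is synthetic (`make_classification`), the customer IDs are invented, and ranking by `predict_proba` with a top-10 cutoff is one possible choice rather than the article's stated method.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic stand-in for the encoded features and churn labels (assumed)
X, y = make_classification(n_samples=300, n_features=6, random_state=1)
ids = pd.Series([f"cust-{i:04d}" for i in range(len(y))], name="customerID")

# Fit on the first 200 rows, score the remaining 100 as "current customers"
model = AdaBoostClassifier(n_estimators=80, random_state=1).fit(X[:200], y[:200])

# Column 1 of predict_proba is the estimated probability of the positive class
proba = model.predict_proba(X[200:])[:, 1]

risk = pd.DataFrame({"customerID": ids[200:].values, "churn_probability": proba})
# Hypothetical cutoff: keep the ten customers with the highest estimated risk
high_risk = risk.sort_values("churn_probability", ascending=False).head(10)
print(high_risk)
```

In practice the cutoff would be set by how many customers the retention team can actually contact, not by a fixed number.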
