
Intelligent Supply Chain Analysis


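Contents

  • Project background
    • Project scope
      • Supply chain data exploration
      • Customer segmentation
      • Fraudulent order prediction
    • Building models for training and prediction
    • Conclusion 1
    • Late-delivery order prediction
    • Sales forecasting
    • Order quantity forecasting
    • Loading the data
    • Data exploration and cleaning
      • Counting null values
      • Feature correlation (Pearson coefficient)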

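Project background

Project scope

Supply chain data exploration

Handling null values:
missing values are set to 0
Feature similarity:
measured with the Pearson correlation coefficient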
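
The two steps above can be sketched on a toy DataFrame. This is only an illustration: the column names and values below are made up for the example and are not taken from the project data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the supply-chain data (illustrative columns and values).
df = pd.DataFrame({
    "Sales": [100.0, 200.0, np.nan, 400.0],
    "Order Item Total": [90.0, 180.0, 270.0, np.nan],
    "Order Item Quantity": [1, 2, 3, 4],
})

# Count the missing values in each column.
null_counts = df.isnull().sum()

# Replace missing values with 0, as described above.
filled = df.fillna(0)

# Pairwise Pearson correlation between the numeric columns.
pearson = filled.corr(method="pearson")

print(null_counts)
print(pearson.round(2))
```

Filling with 0 is simple, but it also shifts the correlations when 0 is not a plausible value for the column, so it is worth checking `null_counts` before deciding.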


Initial exploration of the data:
by market and sales region
by product category
by trend over different time dimensions
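
A minimal sketch of these three views using `groupby`. The sample rows are hypothetical, and the column names `Market`, `Order Region`, `Category Name` and `Sales` are assumptions made for the illustration.

```python
import pandas as pd

# Hypothetical sample rows; the column names are assumptions for this sketch.
orders = pd.DataFrame({
    "Market": ["Europe", "Europe", "LATAM", "LATAM"],
    "Order Region": ["Western Europe", "Northern Europe",
                     "Central America", "Central America"],
    "Category Name": ["Cleats", "Fishing", "Cleats", "Cleats"],
    "order date (DateOrders)": pd.to_datetime(
        ["2017-01-05", "2017-02-10", "2017-02-11", "2017-03-01"]),
    "Sales": [100.0, 250.0, 80.0, 120.0],
})

# Sales by market, and by region within each market.
sales_by_market = orders.groupby("Market")["Sales"].sum()
sales_by_region = orders.groupby(["Market", "Order Region"])["Sales"].sum()

# Sales by product category.
sales_by_category = orders.groupby("Category Name")["Sales"].sum()

# Monthly sales trend.
month = orders["order date (DateOrders)"].dt.to_period("M")
sales_by_month = orders.groupby(month)["Sales"].sum()

print(sales_by_market)
print(sales_by_month)
```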

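Customer segmentation

Customers are segmented with the RFM model:
R_value: the time since the customer's most recent order
M_value: the customer's total purchase amount
F_value: the customer's purchase frequency

    # `prent` is the reference ("present") date used to measure recency; it must be defined before this line runs
    cust_seg = data.groupby('Order Customer Id').agg({'order date (DateOrders)':lambda x: (prent - x.max()).days,'Order Id':lambda x: len(x),'Total Price':lambda x:x.sum()})
    cust_seg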
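
The aggregation depends on a reference date that the snippet does not define. The sketch below shows one way it could be set up on toy data; the name `present` and the choice of "one day after the latest order" are assumptions for illustration, not the author's definition.

```python
import pandas as pd

# Toy order rows (illustrative values only).
orders = pd.DataFrame({
    "Order Customer Id": [1, 1, 2],
    "Order Id": [10, 11, 12],
    "order date (DateOrders)": pd.to_datetime(
        ["2018-01-01", "2018-01-20", "2018-01-10"]),
    "Total Price": [50.0, 70.0, 30.0],
})

# Assumed reference date: one day after the latest order in the data.
present = orders["order date (DateOrders)"].max() + pd.Timedelta(days=1)

# Named aggregation produces the R, F and M columns directly.
rfm = orders.groupby("Order Customer Id").agg(
    R=("order date (DateOrders)", lambda x: (present - x.max()).days),
    F=("Order Id", "count"),
    M=("Total Price", "sum"),
)
print(rfm)
```

Customer 1 ordered most recently (R = 1 day), most often (F = 2) and spent the most (M = 120), so that customer should end up with the highest scores in each dimension.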

Rename the aggregated columns to R, F and M; these values are then converted to R_SCORE, F_SCORE and M_SCORE

    cust_seg.rename(columns = {'order date (DateOrders)':'R','Order Id':'F','Total Price':'M'},inplace = True)
    cust_seg

Split each metric into four levels using its quartiles

    quan = cust_seg.quantile([0.25,0.5,0.75])
    quan

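Note: how the quantiles are computed
The quantile position is position = 1 + (n - 1) * p, where n is the number of values and p is the quantile.
This formula first locates the split point among the sorted values.

Position: 1 2 3 4
Value:    1 10 100 100
For example, take p = 0.25.
The position row gives 1 + (4 - 1) × 0.25 = 1.75,
so the 25% split point lies at position 1.75, between the 1st and 2nd values.
The fractional part is 1.75 - 1 = 0.75,
so the 25% quantile of the value row is 1 + 0.75 × (10 - 1) = 7.75.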
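
The arithmetic in this note can be checked against pandas, whose `quantile` uses linear interpolation by default. A minimal sketch, with variable names chosen only for the illustration:

```python
import pandas as pd

values = pd.Series([1, 10, 100, 100])   # already sorted
p = 0.25
n = len(values)

position = 1 + (n - 1) * p               # 1-based position of the split point
lower = int(position)                    # whole part of the position
fraction = position - lower              # fractional part of the position

# Interpolate between the value at `lower` and the next value (1-based positions).
below = values.iloc[lower - 1]
above = values.iloc[lower]
manual = below + fraction * (above - below)

print(position, fraction)    # 1.75 0.75
print(manual)                # 7.75
print(values.quantile(p))    # 7.75
```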

    def R_VALUE(a,b,c):
        # Recency: a smaller value (a more recent order) earns a higher score
        if a <= c[b][0.25]:
            return 4
        elif a <= c[b][0.5]:
            return 3
        elif a <= c[b][0.75]:
            return 2
        else:
            return 1
    
    def F_M_VALUE(a,b,c):
        # Frequency / monetary value: a larger value earns a higher score
        if a <= c[b][0.25]:
            return 1
        elif a <= c[b][0.5]:
            return 2
        elif a <= c[b][0.75]:
            return 3
        else:
            return 4

    cust_seg['R_SCORE'] = cust_seg['R'].apply(R_VALUE,args=('R',quan))
    cust_seg['F_SCORE'] = cust_seg['F'].apply(F_M_VALUE,args=('F',quan))
    cust_seg['M_SCORE'] = cust_seg['M'].apply(F_M_VALUE,args=('M',quan))

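Note: how apply is used here
The custom function R_VALUE takes three parameters. apply passes each element of cust_seg['R'] as the first argument, and the values in args, ('R', quan), are passed in order as the second and third arguments.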
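
A small stand-alone illustration of how `Series.apply` forwards `args`. The function `label` and the threshold dictionary are hypothetical and exist only for this example.

```python
import pandas as pd

def label(value, column, thresholds):
    # `value` is one element of the Series; `column` and `thresholds` come from `args`.
    return "high" if value > thresholds[column] else "low"

scores = pd.Series([5, 20, 35])
thresholds = {"R": 15}   # hypothetical cut-off for the illustration

# Each element is passed as the first argument, followed by the items of `args` in order.
labels = scores.apply(label, args=("R", thresholds))
print(labels.tolist())   # ['low', 'high', 'high']
```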

Segment customers by their R, F and M scores with the following function:

    def rfm_score(data):
        # A score above 2 counts as "high" in each dimension
        if data['M_SCORE'] > 2 and data['F_SCORE'] >2 and data['R_SCORE'] > 2:
            return 'Important value customer'
        if data['M_SCORE'] > 2 and data['F_SCORE'] <=2 and data['R_SCORE'] > 2:
            return 'Important development customer'
        if data['M_SCORE'] > 2 and data['F_SCORE'] >2 and data['R_SCORE'] <= 2:
            return 'Important retention customer'
        if data['M_SCORE'] > 2 and data['F_SCORE'] <=2 and data['R_SCORE'] <= 2:
            return 'Important win-back customer'
        if data['M_SCORE'] <= 2 and data['F_SCORE'] >2 and data['R_SCORE'] > 2:
            return 'General value customer'
        if data['M_SCORE'] <= 2 and data['F_SCORE'] <=2 and data['R_SCORE'] > 2:
            return 'General development customer'
        if data['M_SCORE'] <= 2 and data['F_SCORE'] >2 and data['R_SCORE'] <= 2:
            return 'General retention customer'
        if data['M_SCORE'] <= 2 and data['F_SCORE'] <=2 and data['R_SCORE'] <= 2:
            return 'General win-back customer'
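
The function is defined here but not applied. A row-wise function like this is normally used with `DataFrame.apply(..., axis=1)`. The sketch below restates the eight rules as a lookup table so that it is self-contained; the table, the English segment names, the `segment` column and the toy scores are assumptions for illustration.

```python
import pandas as pd

# Keys are (M high, F high, R high), where "high" means a score above 2.
SEGMENTS = {
    (True, True, True): "Important value customer",
    (True, False, True): "Important development customer",
    (True, True, False): "Important retention customer",
    (True, False, False): "Important win-back customer",
    (False, True, True): "General value customer",
    (False, False, True): "General development customer",
    (False, True, False): "General retention customer",
    (False, False, False): "General win-back customer",
}

def rfm_segment(row):
    key = (bool(row["M_SCORE"] > 2), bool(row["F_SCORE"] > 2), bool(row["R_SCORE"] > 2))
    return SEGMENTS[key]

# Toy scores for three customers (illustrative values).
cust = pd.DataFrame(
    {"R_SCORE": [4, 1, 3], "F_SCORE": [4, 1, 1], "M_SCORE": [4, 1, 3]},
    index=[101, 102, 103],
)

# axis=1 passes one row at a time to the function.
cust["segment"] = cust.apply(rfm_segment, axis=1)
print(cust["segment"].value_counts())
```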

Fraudulent order prediction

Import the required packages

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.metrics import accuracy_score, mean_absolute_error, mean_squared_error
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import LabelEncoder, MinMaxScaler

Load the data

    import pickle
    with open('/data2.pkl', 'rb') as file:
      train_data = pickle.load(file)

Process the data

    train_data.columns

Mark the rows of interest in the target columns as 1

    train_data['fraud'] = np.where(train_data['Order Status'] == 'SUSPECTED_FRAUD', 1, 0)
    train_data['late delivery'] = np.where(train_data['Delivery Status'] == 'Late delivery', 1, 0)

    # The data has many columns, so set pandas' display.max_columns option to show all of them
    pd.set_option('display.max_columns', None)
    train_data.head()

Inspecting the data shows that many columns have no direct bearing on the prediction target and can be dropped, such as customer name, customer password, product name and product ID.

    del_col = ['Customer Lname', 'Customer Fname', 'Customer Password', 'Customer Zipcode',
               'Order Status', 'Delivery Status', 'Customer Full Name',
               'Product Image', 'Product Description', 'Product Name',
               'Product Status', 'Customer Street', 'Order Id',
               'Order Item Id', 'Customer Email', 'Customer Id',
               'Order Customer Id', 'order date (DateOrders)',
               'Order Zipcode', 'Product Card Id', 'shipping date (DateOrders)',
               'Category Id', 'Department Id', 'Order Item Cardprod Id'
               ]

    train_data_1 = train_data.drop(del_col, axis=1)

The data contains both numeric and string features, so use DataFrame.select_dtypes() to separate them

    categorical_cols = train_data_1.select_dtypes(include='object').columns
    numerical_cols = train_data_1.select_dtypes(exclude='object').columns

For the numeric features, inspect the pairwise Pearson coefficients. A very high coefficient means the two features are strongly correlated, so one of them can be dropped

    temp = train_data_1[numerical_cols].corr()

    ## Show the heatmap
    plt.figure(figsize=(20,10))
    sns.heatmap(temp, annot=True, linewidths=0.5, fmt='.1g', cmap='coolwarm')
    plt.show()

The comparison shows that the following pairs of features are strongly correlated:
'Order Profit Per Order' and 'Benefit per order'
'Sales' and 'Sales per customer'
'Order Item Total' and 'Sales per customer'
'Sales' and 'Order Item Total'
Consider dropping the features that appear in the lower triangle of the correlation matrix:
'Sales', 'Order Item Total', 'Order Profit Per Order', 'Product Price', 'TotalPrice', 'Late_delivery_risk'
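
Reading pairs off a large heatmap is error-prone, so the same information can also be listed programmatically. A minimal sketch: the helper `high_corr_pairs`, the 0.9 threshold and the toy data are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

def high_corr_pairs(df, threshold=0.9):
    """Return (col_a, col_b, r) for column pairs whose |Pearson r| >= threshold."""
    corr = df.corr(method="pearson")
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):   # visit each pair once (upper triangle)
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs

# Toy data: the second column is exactly proportional to the first.
demo = pd.DataFrame({
    "Sales": [10.0, 20.0, 30.0, 40.0],
    "Order Item Total": [9.0, 18.0, 27.0, 36.0],
    "Order Item Quantity": [3, 1, 4, 2],
})

pairs = high_corr_pairs(demo)
print(pairs)
```

For each reported pair, one of the two columns is a candidate for removal.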

    train_data_1.columns

    drop_cols = ['Sales', 'Order Item Total', 'Order Profit Per Order', 'Product Price', 'TotalPrice', 'Late_delivery_risk']

    train_data_1 = train_data_1.drop(
        drop_cols,
        axis=1
    )

    numerical_cols = [i for i in numerical_cols if i not in drop_cols]

After dropping these features, re-check the correlations among the remaining ones

    ## Re-check the heatmap
    temp = train_data_1[train_data_1.select_dtypes(exclude='object').columns].corr()
    plt.figure(figsize=(20,10))
    sns.heatmap(temp, annot=True, cmap='coolwarm')
    plt.show()

Handling string features

Some categorical features have a very large number of categories. To avoid an explosion in dimensionality, the following features are dropped:
'Customer City', 'Customer State', 'Order City',
'Order Country', 'Order State'

    drop_cat = ['Customer City', 'Customer State', 'Order City', 'Order Country', 'Order State']

    categorical_cols = [i for i in categorical_cols if i not in drop_cat]

    # Also exclude numeric columns whose names contain 'order_'
    drop_num = [i for i in numerical_cols if 'order_' in i]

    numerical_cols = [i for i in numerical_cols if i not in drop_num]

Apply one-hot encoding and label encoding to the data
A separate function is defined for each approach

    ## Feature engineering function (one-hot encoding)
    
    def fe(data, numerical_cols, categorical_cols):
      ## Standardize the numeric features (zero mean, unit variance)
      from sklearn.preprocessing import StandardScaler
      sc = StandardScaler()
      ## Keep the original index so that concat aligns rows correctly
      temp_num = pd.DataFrame(sc.fit_transform(data[numerical_cols]), columns=numerical_cols, index=data.index)
    
      ## One-hot encode the categorical features
      temp_cat = pd.get_dummies(data[categorical_cols])
      
      fe_data = pd.concat([temp_num, temp_cat], axis=1)
    
      return fe_data

    ## Feature engineering function (label encoding)
    def fe_2(data, numerical_cols, categorical_cols):
      from sklearn.preprocessing import StandardScaler, LabelEncoder
      data = data.copy()  # avoid modifying the caller's DataFrame
      sc = StandardScaler()
      temp_num = pd.DataFrame(sc.fit_transform(data[numerical_cols]), columns=numerical_cols, index=data.index)
    
      ## Label-encode each categorical feature
      le = LabelEncoder()
      for col in categorical_cols:
        data[col] = le.fit_transform(data[col])
      
      fe_data = pd.concat([temp_num, data[categorical_cols]], axis=1)
    
      return fe_data
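
To make the difference between the two encodings concrete, here is a small sketch on a single toy column. The column name and values are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# One toy categorical column (illustrative values).
toy = pd.DataFrame({
    "Shipping Mode": ["Standard Class", "First Class", "Same Day", "First Class"]
})

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(toy["Shipping Mode"])

# Label encoding: one integer per category, assigned in sorted order of the labels.
label_encoded = LabelEncoder().fit_transform(toy["Shipping Mode"])

print(list(one_hot.columns))    # ['First Class', 'Same Day', 'Standard Class']
print(one_hot.shape)            # (4, 3)
print(label_encoded.tolist())   # [2, 0, 1, 0]
```

One-hot encoding widens the data but implies no ordering between categories. Label encoding keeps a single column but introduces an arbitrary numeric order, which linear models may misread while tree models usually tolerate it.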
    ## fe1: one-hot encoding
    ## Build the features and target for fraud prediction
    x_fraud = train_data_1.loc[:, train_data_1.columns != 'fraud']
    y_fraud = train_data_1['fraud']
    
    ## Feature engineering on x
    ## Build the categorical and numeric feature lists for x_fraud
    x_fraud_cat_feat = [i for i in categorical_cols if i != 'fraud']
    x_fraud_num_feat = [i for i in numerical_cols if i != 'fraud']
    x_fraud = fe(x_fraud, x_fraud_num_feat, x_fraud_cat_feat)
    
    ## Split into training and test sets
    x_fraud_train, x_fraud_test, y_fraud_train, y_fraud_test = train_test_split(
        x_fraud, y_fraud, test_size=0.2
    )
    
    ## Build the features and target for late-delivery prediction
    x_late = train_data_1.loc[:, train_data_1.columns != 'late delivery']
    y_late = train_data_1['late delivery']
    
    ## Build the categorical and numeric feature lists for x_late
    x_late_cat_feat = [i for i in categorical_cols if i != 'late delivery']
    x_late_num_feat = [i for i in numerical_cols if i != 'late delivery']
    x_late = fe(x_late, x_late_num_feat, x_late_cat_feat)
    
    
    ## Split into training and test sets
    x_late_train, x_late_test, y_late_train, y_late_test = train_test_split(
        x_late, y_late, test_size=0.2
    )
    ## fe2: label encoding
    ## This overwrites the train/test variables, so run it before each "fe2" model run
    ## Build the features and target for fraud prediction
    from sklearn.model_selection import train_test_split
    x_fraud = train_data_1.loc[:, train_data_1.columns != 'fraud']
    y_fraud = train_data_1['fraud']
    
    ## Feature engineering on x
    ## Build the categorical and numeric feature lists for x_fraud
    x_fraud_cat_feat = [i for i in categorical_cols if i != 'fraud']
    x_fraud_num_feat = [i for i in numerical_cols if i != 'fraud' and i != 'order_month_year']
    x_fraud = fe_2(x_fraud, x_fraud_num_feat, x_fraud_cat_feat)
    
    ## Split into training and test sets
    x_fraud_train, x_fraud_test, y_fraud_train, y_fraud_test = train_test_split(
        x_fraud, y_fraud, test_size=0.2
    )
    
    ## Build the features and target for late-delivery prediction
    x_late = train_data_1.loc[:, train_data_1.columns != 'late delivery']
    y_late = train_data_1['late delivery']
    
    ## Build the categorical and numeric feature lists for x_late
    x_late_cat_feat = [i for i in categorical_cols if i != 'late delivery']
    x_late_num_feat = [i for i in numerical_cols if i != 'late delivery' and i != 'order_month_year']
    x_late = fe_2(x_late, x_late_num_feat, x_late_cat_feat)
    
    
    ## Split into training and test sets
    x_late_train, x_late_test, y_late_train, y_late_test = train_test_split(
        x_late, y_late, test_size=0.2
    )
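
The splits above are purely random. Because fraud is a rare class, an optional variant is to stratify the split so that the training and test sets keep the same class ratio, and to fix `random_state` so runs are comparable. This is a suggestion rather than what the original code does; the synthetic data below is made up for the sketch.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced toy data: 20 positives out of 1000 rows (2%).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 20 + [0] * 980)

# stratify=y keeps the 2% positive rate in both parts of the split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(len(y_test), int(y_test.sum()))     # 200 4
print(len(y_train), int(y_train.sum()))   # 800 16
```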

Building models for training and prediction

    from sklearn.metrics import accuracy_score, recall_score, f1_score, confusion_matrix
    ## Train a model and evaluate it on the fraud or late-delivery target
    def model_stats(model, x_train, x_test, y_train, y_test, name='Fraud'):
      model = model.fit(x_train, y_train)
      y_pred = model.predict(x_test)
      ## scikit-learn metrics expect (y_true, y_pred) in that order
      accuracy = accuracy_score(y_test, y_pred)
      recall = recall_score(y_test, y_pred)
      confusion = confusion_matrix(y_test, y_pred)
      f1 = f1_score(y_test, y_pred)
      print('Model used:', model)
      print('{} Accuracy: {}%'.format(name, accuracy*100))
      print('{} Recall: {}%'.format(name, recall*100))
      print('{} Confusion Matrix: \n{}'.format(name, confusion))
      print('{} F1 Score: {}%'.format(name, f1*100))
      return accuracy, recall, f1
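
Argument order matters for `recall_score`: scikit-learn metrics take `(y_true, y_pred)`, and swapping the two turns recall into precision. A small sketch with made-up labels:

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels: 4 true positives in the data, only 1 of them predicted.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

recall = recall_score(y_true, y_pred)         # TP / (TP + FN) = 1 / 4
precision = precision_score(y_true, y_pred)   # TP / (TP + FP) = 1 / 1
swapped = recall_score(y_pred, y_true)        # arguments swapped: equals precision

print(recall, precision, swapped)   # 0.25 1.0 1.0
```

Accuracy is symmetric in its arguments, so the order only changes the result for asymmetric metrics such as recall and precision.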

Prediction with logistic regression (LR)

    from sklearn.linear_model import LogisticRegression
    ## fe1 (one-hot encoded features)
    ## Logistic regression models
    model_fraud = LogisticRegression()
    model_late = LogisticRegression()
    
    ## Train and evaluate the models
    model_stats(model_fraud, x_fraud_train, x_fraud_test, y_fraud_train, y_fraud_test, name='Fraud')
    model_stats(model_late, x_late_train, x_late_test, y_late_train, y_late_test, name='Late')

    ## fe2 (label-encoded features; rebuild the train/test split with fe_2 first)
    ## Logistic regression models
    model_fraud = LogisticRegression()
    model_late = LogisticRegression()
    
    ## Train and evaluate the models
    model_stats(model_fraud, x_fraud_train, x_fraud_test, y_fraud_train, y_fraud_test, name='Fraud')
    model_stats(model_late, x_late_train, x_late_test, y_late_train, y_late_test, name='Late')

    from sklearn.naive_bayes import GaussianNB
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier

Using the GaussianNB model

    ## fe1
    ## GaussianNB models
    model_fraud = GaussianNB()
    model_late = GaussianNB()
    
    ## Train and evaluate the models
    model_stats(model_fraud, x_fraud_train, x_fraud_test, y_fraud_train, y_fraud_test, name='Fraud')
    model_stats(model_late, x_late_train, x_late_test, y_late_train, y_late_test, name='Late')

    ## fe2
    ## GaussianNB models
    model_fraud = GaussianNB()
    model_late = GaussianNB()
    
    ## Train and evaluate the models
    model_stats(model_fraud, x_fraud_train, x_fraud_test, y_fraud_train, y_fraud_test, name='Fraud')
    model_stats(model_late, x_late_train, x_late_test, y_late_train, y_late_test, name='Late')

Using the SVM model

    from sklearn.svm import LinearSVC
    ## LinearSVC
    model_fraud = LinearSVC()
    model_late = LinearSVC()
    
    ## Train and evaluate the models
    model_stats(model_fraud, x_fraud_train, x_fraud_test, y_fraud_train, y_fraud_test, name='Fraud')
    model_stats(model_late, x_late_train, x_late_test, y_late_train, y_late_test, name='Late')

    ## fe2
    ## LinearSVC
    model_fraud = LinearSVC()
    model_late = LinearSVC()
    
    ## Train and evaluate the models
    model_stats(model_fraud, x_fraud_train, x_fraud_test, y_fraud_train, y_fraud_test, name='Fraud')
    model_stats(model_late, x_late_train, x_late_test, y_late_train, y_late_test, name='Late')

    ## KNeighborsClassifier
    model_fraud = KNeighborsClassifier()
    model_late = KNeighborsClassifier()
    
    ## Train and evaluate the models
    model_stats(model_fraud, x_fraud_train, x_fraud_test, y_fraud_train, y_fraud_test, name='Fraud')
    model_stats(model_late, x_late_train, x_late_test, y_late_train, y_late_test, name='Late')

    ## LinearDiscriminantAnalysis
    model_fraud = LinearDiscriminantAnalysis()
    model_late = LinearDiscriminantAnalysis()
    
    ## Train and evaluate the models
    model_stats(model_fraud, x_fraud_train, x_fraud_test, y_fraud_train, y_fraud_test, name='Fraud')
    model_stats(model_late, x_late_train, x_late_test, y_late_train, y_late_test, name='Late')

    ## fe2
    ## LinearDiscriminantAnalysis
    model_fraud = LinearDiscriminantAnalysis()
    model_late = LinearDiscriminantAnalysis()
    
    ## Train and evaluate the models
    model_stats(model_fraud, x_fraud_train, x_fraud_test, y_fraud_train, y_fraud_test, name='Fraud')
    model_stats(model_late, x_late_train, x_late_test, y_late_train, y_late_test, name='Late')

    ## DecisionTreeClassifier
    model_fraud = DecisionTreeClassifier()
    model_late = DecisionTreeClassifier()
    
    ## Train and evaluate the models
    model_stats(model_fraud, x_fraud_train, x_fraud_test, y_fraud_train, y_fraud_test, name='Fraud')
    model_stats(model_late, x_late_train, x_late_test, y_late_train, y_late_test, name='Late')

    ## fe2
    ## DecisionTreeClassifier
    model_fraud = DecisionTreeClassifier()
    model_late = DecisionTreeClassifier()
    
    ## Train and evaluate the models
    model_stats(model_fraud, x_fraud_train, x_fraud_test, y_fraud_train, y_fraud_test, name='Fraud')
    model_stats(model_late, x_late_train, x_late_test, y_late_train, y_late_test, name='Late')
    ## RandomForestClassifier
    model_fraud = RandomForestClassifier()
    model_late = RandomForestClassifier()
    
    ## Train and evaluate the models
    model_stats(model_fraud, x_fraud_train, x_fraud_test, y_fraud_train, y_fraud_test, name='Fraud')
    model_stats(model_late, x_late_train, x_late_test, y_late_train, y_late_test, name='Late')

    ## fe2
    ## RandomForestClassifier
    model_fraud = RandomForestClassifier()
    model_late = RandomForestClassifier()
    
    ## Train and evaluate the models
    model_stats(model_fraud, x_fraud_train, x_fraud_test, y_fraud_train, y_fraud_test, name='Fraud')
    model_stats(model_late, x_late_train, x_late_test, y_late_train, y_late_test, name='Late')

    ## Column order sorted by importance (ascending) for the fitted random forest
    important_col = model_fraud.feature_importances_.argsort()

    ## Feature importances for the fraud model
    feat_importance = pd.DataFrame({'Variables': x_fraud.columns[important_col], 'importance':model_fraud.feature_importances_[important_col]})
    plt.figure(figsize=(20,10))
    sns.catplot(x='Variables', y='importance', data=feat_importance, height=5, aspect=2, kind='bar')
    plt.xticks(rotation=90)
    plt.show()

    ## Feature importances for the late-delivery model
    important_col = model_late.feature_importances_.argsort()
    feat_importance = pd.DataFrame({'Variables': x_late.columns[important_col], 'importance':model_late.feature_importances_[important_col]})
    plt.figure(figsize=(20,10))
    sns.catplot(x='Variables', y='importance', data=feat_importance, height=5, aspect=2, kind='bar')
    plt.xticks(rotation=90)
    plt.show()

    import xgboost as xgb

    ## XGBClassifier
    model_fraud = xgb.XGBClassifier()
    model_late = xgb.XGBClassifier()
    
    ## Train and evaluate the models
    model_stats(model_fraud, x_fraud_train, x_fraud_test, y_fraud_train, y_fraud_test, name='Fraud')
    model_stats(model_late, x_late_train, x_late_test, y_late_train, y_late_test, name='Late')

    ## fe2
    
    ## XGBClassifier
    model_fraud = xgb.XGBClassifier()
    model_late = xgb.XGBClassifier()
    
    ## Train and evaluate the models
    model_stats(model_fraud, x_fraud_train, x_fraud_test, y_fraud_train, y_fraud_test, name='Fraud')
    model_stats(model_late, x_late_train, x_late_test, y_late_train, y_late_test, name='Late')
    ## Classification with a neural network
    from tensorflow import keras
    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import Dense

    ## A BatchNormalization layer re-normalizes the previous layer's activations on each batch
    ## so that the output has mean close to 0 and standard deviation close to 1.
    ## NOTE: the call below only creates a layer and discards it; it is never added to the model.
    keras.layers.BatchNormalization()
    classifier = Sequential()
    
    ## Build the network
    classifier.add(Dense(1024, activation='relu', input_dim=x_fraud_train.shape[1]))
    classifier.add(Dense(512, activation='relu'))
    classifier.add(Dense(256, activation='relu'))
    classifier.add(Dense(128, activation='relu'))
    classifier.add(Dense(64, activation='relu'))
    classifier.add(Dense(32, activation='relu'))
    classifier.add(Dense(1, activation='sigmoid'))
    classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

    classifier.fit(x_fraud_train, y_fraud_train, batch_size=1024, epochs=10)

    train_evaluate = classifier.evaluate(x_fraud_train,y_fraud_train)
    test_evaluate = classifier.evaluate(x_fraud_test, y_fraud_test)
    print('Train Evaluation: ', train_evaluate)
    print('Test Evaluation: ', test_evaluate)

    y_fraud_pred_temp = classifier.predict(x_fraud_test, batch_size=512, verbose=1)
    ## The single sigmoid output has shape (n, 1), so np.argmax(..., axis=1) would always return 0.
    ## Threshold the predicted probability at 0.5 instead.
    y_fraud_pred = (y_fraud_pred_temp > 0.5).astype(int).ravel()
    print(accuracy_score(y_fraud_test, y_fraud_pred))
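
A network with a single sigmoid output returns an array of shape (n, 1). Taking `np.argmax` along axis 1 of such an array always yields 0, whatever the probabilities are, so every sample is predicted as "not fraud". The NumPy-only sketch below shows the difference from thresholding; the probabilities are invented for the illustration.

```python
import numpy as np

# Hypothetical sigmoid outputs with shape (n_samples, 1).
probs = np.array([[0.05], [0.92], [0.40], [0.77]])

# argmax over a single column can only ever return index 0.
argmax_pred = np.argmax(probs, axis=1)

# Thresholding the probability at 0.5 gives the intended 0/1 labels.
threshold_pred = (probs > 0.5).astype(int).ravel()

print(argmax_pred.tolist())      # [0, 0, 0, 0]
print(threshold_pred.tolist())   # [0, 1, 0, 1]
```

`np.argmax` is the right tool when the output layer has one unit per class (softmax); with one sigmoid unit, a threshold is needed.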
    ## fe2: neural network
    ## A BatchNormalization layer re-normalizes the previous layer's activations on each batch
    ## so that the output has mean close to 0 and standard deviation close to 1.
    ## NOTE: the call below only creates a layer and discards it; it is never added to the model.
    keras.layers.BatchNormalization()
    classifier = Sequential()
    
    ## Build the network
    classifier.add(Dense(1024, activation='relu', input_dim=x_fraud_train.shape[1]))
    classifier.add(Dense(512, activation='relu'))
    classifier.add(Dense(256, activation='relu'))
    classifier.add(Dense(128, activation='relu'))
    classifier.add(Dense(64, activation='relu'))
    classifier.add(Dense(32, activation='relu'))
    classifier.add(Dense(1, activation='sigmoid'))
    classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    classifier.fit(x_fraud_train, y_fraud_train, batch_size=1024, epochs=10)

    train_evaluate = classifier.evaluate(x_fraud_train,y_fraud_train)
    test_evaluate = classifier.evaluate(x_fraud_test, y_fraud_test)
    print('Train Evaluation: ', train_evaluate)
    print('Test Evaluation: ', test_evaluate)

    y_fraud_pred_temp = classifier.predict(x_fraud_test, batch_size=512, verbose=1)
    ## The single sigmoid output has shape (n, 1), so np.argmax(..., axis=1) would always return 0.
    ## Threshold the predicted probability at 0.5 instead.
    y_fraud_pred = (y_fraud_pred_temp > 0.5).astype(int).ravel()
    print(accuracy_score(y_fraud_test, y_fraud_pred))

    ## scikit-learn metrics expect (y_true, y_pred) in that order
    print('{} Accuracy: {}%'.format('Fraud', accuracy_score(y_fraud_test, y_fraud_pred)*100))
    print('{} Recall: {}%'.format('Fraud', recall_score(y_fraud_test, y_fraud_pred)*100))
    print('{} Confusion Matrix: \n{}'.format('Fraud', confusion_matrix(y_fraud_test, y_fraud_pred)))
    print('{} F1 Score: {}%'.format('Fraud', f1_score(y_fraud_test, y_fraud_pred)*100))

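Conclusion 1:

1) The neural network can become biased when the positive class is small, giving F1 = 0. Note that taking np.argmax over a single sigmoid output column always predicts class 0, which by itself produces F1 = 0

2) Tree models perform better on label-encoded data than on one-hot encoded data

3) In terms of accuracy, the neural network does better with one-hot encoded data

4) Logistic regression performs better with one-hot encoded data

5) In general, models respond differently to different feature engineering

6) On imbalanced samples, tree models perform better than the other models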

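Late-delivery order prediction

Sales forecasting

Order quantity forecasting

Loading the data

Data exploration and cleaning

Counting null values

Feature correlation (Pearson coefficient)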
