
Shenzhen Second-Hand Housing Price Analysis and Prediction

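Analysis goals:

  1. Use the cleaned price data to identify the feature variables that significantly affect housing prices.
  2. Settle on the feature variables, build a price prediction model for Shenzhen, and simulate a hypothetical scenario.

Data preprocessing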

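 import os

 import pandas as pd

 # Raw string, so the backslashes in the Windows path are not read as escapes
 file_path = r"D:\Python数据分析与挖掘实战\深圳二手房价分析\data"

 # List every file in the file_path directory
 file_name = os.listdir(file_path)

 # Read each Excel file and collect the frames in a list
 lis = []
 for i in file_name:
     file = pd.read_excel(os.path.join(file_path, i))
     lis.append(file)

 # Combine the frames in one call; DataFrame.append was removed in pandas 2.0.
 # Each file keeps its own row labels here.
 df = pd.concat(lis)

 # Rename the first column
 df = df.rename(columns={'Unnamed: 0': 'house_id'})

 # Summary statistics
 print(df.describe())
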
 out:

        house_id       roomnum  ...        subway     per_price
 count  1.851400e+04  18514.000000  ...  18514.000000  18514.000000
 mean   2.280900e+08      2.873339  ...      0.504159      6.118192
 std    3.031648e+06      1.040839  ...      0.499996      3.050218
 min    4.217338e+06      1.000000  ...      0.000000      1.010100
 25%    2.276957e+08      2.000000  ...      0.000000      4.052600
 50%    2.284619e+08      3.000000  ...      1.000000      5.246300
 75%    2.288307e+08      3.000000  ...      1.000000      7.357400
 max    2.289965e+08      9.000000  ...      1.000000     26.396800

 [8 rows x 8 columns]

 # Inspect the columns and check for missing values. There are 10 columns in total;
 # house_id carries no useful information, so we drop it.
 df = df.drop(columns='house_id')

 print(df.info())

 out:

  #   Column     Non-Null Count  Dtype
 ---  ------     --------------  -----
  0   district   18514 non-null  object
  1   roomnum    18514 non-null  int64
  2   hall       18514 non-null  int64
  3   AREA       18514 non-null  float64
  4   C_floor    18514 non-null  object
  5   floor_num  18514 non-null  int64
  6   school     18514 non-null  int64
  7   subway     18514 non-null  int64
  8   per_price  18514 non-null  float64
 dtypes: float64(2), int64(5), object(2)

 # The data only has the price per square metre. Adding a total price column gives
 # one more dimension to analyse: total price = area * price per square metre.
 df['total_price'] = df['AREA'] * df['per_price']
 print(df['total_price'])

 out:

 0       632.002890
 1       879.995700
 2       110.000800
 3        93.990400
 4       395.998200
        ...
 1487    116.000040
 1488    119.999383
 1489    145.001298
 1490    128.999772
 1491     80.999928
 Name: total_price, Length: 18514, dtype: float64

 # Check for duplicate rows
 print(df.duplicated().sum())

 out:

 0

 # Map the pinyin district codes to the Chinese district names
 area_map = {'baoan': '宝安', 'dapengxinqu': '大鹏新区', 'futian': '福田', 'guangming': '光明',
             'longhua': '龙华', 'luohu': '罗湖', 'nanshan': '南山', 'pingshan': '坪山',
             'yantian': '盐田', 'longgang': '龙岗'}
 df['district'] = df['district'].map(area_map)
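
The `total_price` printout above ends at row label 1491 even though the frame has 18514 rows. The most likely reason is that each Excel file keeps its own row labels when the frames are combined. The sketch below illustrates this with two small made-up frames (`part_a` and `part_b` are hypothetical stand-ins for the per-file data) and shows how `ignore_index=True` renumbers the rows instead.

```python
import pandas as pd

# Two toy frames standing in for two of the per-file Excel sheets (made-up values).
part_a = pd.DataFrame({"AREA": [50.0, 80.0], "per_price": [6.0, 7.5]})
part_b = pd.DataFrame({"AREA": [100.0, 120.0, 60.0], "per_price": [5.0, 9.0, 4.0]})

# By default concat keeps each frame's own row labels, so labels repeat.
kept = pd.concat([part_a, part_b])

# ignore_index=True renumbers the rows 0..n-1 across all frames.
renumbered = pd.concat([part_a, part_b], ignore_index=True)

print(kept.index.tolist())        # [0, 1, 0, 1, 2]
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4]
```

Repeated labels are harmless for the column-wise steps in this article, but a unique index is the safer choice if rows are later selected by label.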

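Feature analysis

1. Analysis of the district feature

From the figure above:

  1. Nanshan has the highest average price per square metre, and Dapeng New District the lowest.
  2. Nanshan also has the highest average total price, and Pingshan the lowest.
  3. There are 18,514 second-hand listings in total; Luohu has the most, close to 18%.
  4. In the box plot the centre of the box shifts clearly from district to district, which indicates that price is related to district.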
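
The article does not show the code behind the district figures. A minimal sketch of how such a per-district summary could be computed is given below; the `toy` frame and its values are made up, and only the column names follow the article.

```python
import pandas as pd

# Toy stand-in for the listing table (made-up values, article's column names).
toy = pd.DataFrame({
    "district": ["南山", "南山", "罗湖", "罗湖", "罗湖", "坪山"],
    "per_price": [10.0, 12.0, 6.0, 7.0, 5.0, 3.0],
    "total_price": [900.0, 1100.0, 400.0, 500.0, 300.0, 250.0],
})

# Mean unit price, mean total price and listing count for each district.
summary = toy.groupby("district").agg(
    mean_per_price=("per_price", "mean"),
    mean_total_price=("total_price", "mean"),
    listings=("per_price", "size"),
)

# Share of all listings that falls in each district.
summary["share"] = summary["listings"] / len(toy)

# Rank districts from the most to the least expensive per square metre.
summary = summary.sort_values("mean_per_price", ascending=False)
print(summary)
```

Applied to the real `df`, the first and last rows of `summary` would correspond to the highest and lowest average unit price, and the `share` column to the percentage quoted for each district.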

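2. Analysis of the roomnum feature

From the figure above:

  1. Homes with 6 rooms have the highest average unit price.
  2. The number of bedrooms has no obvious effect on the average unit price.

3. Analysis of the hall feature

From the figure above:

  1. Homes with 3 halls have the highest average unit price.
  2. The number of halls has some effect on the average unit price.

4. Analysis of the C_floor feature

From the figure above:

  1. The floor level has little effect on the average price.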

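5. Analysis of the school and subway features

From the figure above:

  1. Second-hand homes near a subway station have a clearly higher average price than those that are not.
  2. Schools affect the price less than subway stations do.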
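
One way to put numbers on this comparison is to take the gap in mean unit price between the two values of each 0/1 flag. The sketch below does this on a small made-up frame; the figures are illustrative only and are not taken from the article's data.

```python
import pandas as pd

# Made-up listings with the article's 0/1 flags.
toy = pd.DataFrame({
    "subway": [1, 1, 1, 0, 0, 0, 1, 0],
    "school": [1, 0, 1, 0, 1, 0, 0, 1],
    "per_price": [8.0, 7.0, 9.0, 4.0, 5.0, 4.5, 7.5, 5.5],
})

# Mean unit price for each value of each flag, taken on its own.
by_subway = toy.groupby("subway")["per_price"].mean()
by_school = toy.groupby("school")["per_price"].mean()

# Gap between flagged and unflagged homes; a larger gap suggests a stronger effect.
subway_gap = by_subway.loc[1] - by_subway.loc[0]
school_gap = by_school.loc[1] - by_school.loc[0]

# Cross-tabulated means: rows are subway, columns are school.
cross = toy.pivot_table(values="per_price", index="subway",
                        columns="school", aggfunc="mean")

print(subway_gap, school_gap)
print(cross)
```

A gap in means is only descriptive. Deciding whether it is statistically significant would need a test on the real data.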

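6. Analysis of the AREA feature

 import matplotlib.pyplot as plt

 plt.scatter(df.AREA, df.per_price, marker='x', color='b', alpha=0.5)
 plt.title('Scatter plot of AREA against per_price')
 plt.ylabel('Price per square metre')
 plt.xlabel('Area (square metres)')
 plt.show()

From the figure above:

  1. Smaller homes of 68-102 m² dominate, making up almost half of all listings.
  2. The unit price fluctuates noticeably as the area changes, so area has some effect on the unit price.
  3. Smaller homes are clearly more popular than larger ones.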
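
The share of listings in an area band can be checked by binning `AREA`. The sketch below uses `pd.cut` on a few made-up areas; the bin edges are an assumption chosen so that one band matches the 68-102 m² range mentioned above, since the article does not say how its bands were derived.

```python
import pandas as pd

# Made-up areas in square metres.
areas = pd.Series([45, 70, 75, 88, 95, 101, 120, 150, 60, 99], name="AREA")

# Hypothetical bin edges; intervals are closed on the right, e.g. (68, 102].
bins = [0, 68, 102, 136, 1000]
labels = ["<=68", "68-102", "102-136", ">136"]

band = pd.cut(areas, bins=bins, labels=labels)

# Fraction of listings in each band, in band order.
share = band.value_counts(normalize=True).sort_index()
print(share)
```

On the real data the same call would be `pd.cut(df['AREA'], ...)`, and the `68-102` entry of `share` would show whether that band really holds close to half of the listings.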

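7. Analysis of the floor_num feature

From the figure above:

  1. The average unit price fluctuates considerably as the number of floors changes, so floor count affects the unit price.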

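Predicting prices with machine learning

The analysis above suggests that seven features affect the price: district, number of rooms, school, number of floors, subway proximity, area, and number of halls. These features are used as the input to the machine learning algorithms, which output a predicted price after training.

First, one-hot encoding turns the categorical variables (district, number of rooms, number of halls) into numeric columns. School and subway are already encoded as 0/1 and need no conversion, and the continuous variables need no encoding either.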
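
As a compact illustration of one-hot encoding, `pd.get_dummies` can encode several columns in one call through its `columns` argument, prefixing each new column with the source column name. The frame below is made up; only the column names follow the article.

```python
import pandas as pd

# Made-up rows using the article's column names.
toy = pd.DataFrame({
    "district": ["南山", "福田", "南山"],
    "roomnum": [2, 3, 3],
    "hall": [1, 2, 1],
    "AREA": [60.0, 89.0, 120.0],
})

# Encode the three categorical columns; AREA is left untouched.
# dtype=int yields 0/1 columns instead of booleans.
encoded = pd.get_dummies(toy, columns=["district", "roomnum", "hall"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Each category becomes its own indicator column, such as `district_南山` or `roomnum_3`. Note that with this call the untouched columns come first, so the column order differs from a frame built by concatenating separately encoded pieces.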

 from sklearn.linear_model import LinearRegression
 from sklearn.metrics import r2_score
 from sklearn.svm import SVR
 import xgboost as xgb
 from sklearn.preprocessing import StandardScaler

 # school and subway are already 0/1 and need no encoding
 Roomnum = pd.get_dummies(df['roomnum'])
 Roomnum.rename(columns={i: 'roomnum_' + str(i) for i in Roomnum.columns}, inplace=True)

 District = pd.get_dummies(df['district'])
 District.rename(columns={i: 'district_' + str(i) for i in District.columns}, inplace=True)

 Hall = pd.get_dummies(df['hall'])
 Hall.rename(columns={i: 'hall_' + str(i) for i in Hall.columns}, inplace=True)

 # Put the encoded columns next to the original ones, then drop the columns
 # that were encoded or are not used as features
 data_new = pd.concat([Roomnum, District, Hall, df], axis=1)
 data_new = data_new.drop(columns=['district', 'hall', 'roomnum', 'C_floor', 'total_price'])

 # Separate the features from the label
 x = data_new.loc[:, data_new.columns != "per_price"]
 fea_imp = x.columns
 y = data_new.loc[:, 'per_price']
 # Split the data: a random 30% becomes the test set, the rest the training set
 from sklearn.model_selection import train_test_split

 x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=10, test_size=0.3)
 # print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
 # reshape(-1, 1) means any number of rows and one column
 y_train = y_train.values.reshape(-1, 1)
 y_test = y_test.values.reshape(-1, 1)

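 # Standardize the data
 import numpy as np

 ss_x = StandardScaler()
 ss_y = StandardScaler()
 # fit_transform combines fit and transform: it learns the scaling parameters and applies them.
 # transform() and fit_transform() both apply one uniform treatment to the data
 # (for example standardization to mean 0 and variance 1, or scaling to a fixed interval).
 x_train = ss_x.fit_transform(x_train)
 x_test = ss_x.transform(x_test)

 # Mean and standard deviation of y_train, needed later to undo the standardization.
 # StandardScaler divides by the standard deviation, so np.std is used, not np.var.
 mean_y = np.mean(y_train)
 s_y = np.std(y_train)

 y_train = ss_y.fit_transform(y_train)
 y_test = ss_y.transform(y_test)
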
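 # Linear regression
 lr = LinearRegression()
 # Support vector regression
 svr = SVR(kernel="rbf")
 # Parameters for the native xgboost training API
 param = {'max_depth': 3,
      'learning_rate': 0.1,
      'objective': 'reg:squarederror',  # regression objective (replaces the deprecated 'reg:linear'); differs from the XGBClassifier default
      'booster': 'gbtree',
      'gamma': 0,
      'min_child_weight': 1,
      'subsample': 1,
      'colsample_bytree': 1,
      'reg_alpha': 0,
      'reg_lambda': 1,
      'seed': 2}
 dtrain = xgb.DMatrix(x_train, label=y_train, feature_names=list(fea_imp))
 dtest = xgb.DMatrix(x_test, label=y_test, feature_names=list(fea_imp))
 # xgb.train takes the number of boosting rounds from num_round,
 # so an 'n_estimators' entry in param is not needed
 num_round = 100
 watchlist = [(dtrain, 'train'), (dtest, 'test')]

 # Fit on the training split only; the test split is kept for evaluation
 lr.fit(x_train, y_train)
 svr.fit(x_train, y_train.ravel())
 xg = xgb.train(param, dtrain, num_round, evals=watchlist, early_stopping_rounds=10)
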
 out:

 [0]	train-rmse:1.04640	test-rmse:1.04475
 [1]	train-rmse:0.98451	test-rmse:0.98363
 [2]	train-rmse:0.93073	test-rmse:0.93074
 [3]	train-rmse:0.88439	test-rmse:0.88517
 [4]	train-rmse:0.84244	test-rmse:0.84438
 [5]	train-rmse:0.80710	test-rmse:0.81022
 [6]	train-rmse:0.77707	test-rmse:0.78133
 [7]	train-rmse:0.75142	test-rmse:0.75731
 [8]	train-rmse:0.72903	test-rmse:0.73546
 [9]	train-rmse:0.71017	test-rmse:0.71765
 [10]	train-rmse:0.69146	test-rmse:0.69960
 [11]	train-rmse:0.67692	test-rmse:0.68635
 [12]	train-rmse:0.66298	test-rmse:0.67298
 [13]	train-rmse:0.65210	test-rmse:0.66320
 [14]	train-rmse:0.64173	test-rmse:0.65345
 [15]	train-rmse:0.63378	test-rmse:0.64664
 [16]	train-rmse:0.62579	test-rmse:0.63936
 [17]	train-rmse:0.61993	test-rmse:0.63410
 [18]	train-rmse:0.61405	test-rmse:0.62849
 [19]	train-rmse:0.60906	test-rmse:0.62385
 [20]	train-rmse:0.60482	test-rmse:0.62023
 [21]	train-rmse:0.60097	test-rmse:0.61680
 [22]	train-rmse:0.59790	test-rmse:0.61411
 [23]	train-rmse:0.59470	test-rmse:0.61117
 [24]	train-rmse:0.59193	test-rmse:0.60857
 [25]	train-rmse:0.58939	test-rmse:0.60670
 [26]	train-rmse:0.58688	test-rmse:0.60459
 [27]	train-rmse:0.58516	test-rmse:0.60302
 [28]	train-rmse:0.58327	test-rmse:0.60164
 [29]	train-rmse:0.58163	test-rmse:0.60073
 [30]	train-rmse:0.58036	test-rmse:0.59952
 [31]	train-rmse:0.57869	test-rmse:0.59794
 [32]	train-rmse:0.57696	test-rmse:0.59644
 [33]	train-rmse:0.57596	test-rmse:0.59559
 [34]	train-rmse:0.57463	test-rmse:0.59455
 [35]	train-rmse:0.57330	test-rmse:0.59346
 [36]	train-rmse:0.57212	test-rmse:0.59236
 [37]	train-rmse:0.57119	test-rmse:0.59174
 [38]	train-rmse:0.57041	test-rmse:0.59106
 [39]	train-rmse:0.56929	test-rmse:0.59012
 [40]	train-rmse:0.56834	test-rmse:0.58927
 [41]	train-rmse:0.56733	test-rmse:0.58862
 [42]	train-rmse:0.56628	test-rmse:0.58777
 [43]	train-rmse:0.56491	test-rmse:0.58683
 [44]	train-rmse:0.56391	test-rmse:0.58618
 [45]	train-rmse:0.56332	test-rmse:0.58569
 [46]	train-rmse:0.56253	test-rmse:0.58515
 [47]	train-rmse:0.56154	test-rmse:0.58434
 [48]	train-rmse:0.56082	test-rmse:0.58371
 [49]	train-rmse:0.56017	test-rmse:0.58302
 [50]	train-rmse:0.55971	test-rmse:0.58265
 [51]	train-rmse:0.55917	test-rmse:0.58233
 [52]	train-rmse:0.55806	test-rmse:0.58155
 [53]	train-rmse:0.55745	test-rmse:0.58125
 [54]	train-rmse:0.55672	test-rmse:0.58080
 [55]	train-rmse:0.55569	test-rmse:0.57971
 [56]	train-rmse:0.55514	test-rmse:0.57927
 [57]	train-rmse:0.55455	test-rmse:0.57894
 [58]	train-rmse:0.55408	test-rmse:0.57853
 [59]	train-rmse:0.55361	test-rmse:0.57818
 [60]	train-rmse:0.55305	test-rmse:0.57795
 [61]	train-rmse:0.55269	test-rmse:0.57773
 [62]	train-rmse:0.55180	test-rmse:0.57714
 [63]	train-rmse:0.55151	test-rmse:0.57698
 [64]	train-rmse:0.55115	test-rmse:0.57673
 [65]	train-rmse:0.55049	test-rmse:0.57638
 [66]	train-rmse:0.54969	test-rmse:0.57585
 [67]	train-rmse:0.54928	test-rmse:0.57555
 [68]	train-rmse:0.54904	test-rmse:0.57539
 [69]	train-rmse:0.54829	test-rmse:0.57457
 [70]	train-rmse:0.54804	test-rmse:0.57442
 [71]	train-rmse:0.54737	test-rmse:0.57405
 [72]	train-rmse:0.54685	test-rmse:0.57380
 [73]	train-rmse:0.54622	test-rmse:0.57343
 [74]	train-rmse:0.54584	test-rmse:0.57330
 [75]	train-rmse:0.54572	test-rmse:0.57320
 [76]	train-rmse:0.54557	test-rmse:0.57312
 [77]	train-rmse:0.54502	test-rmse:0.57257
 [78]	train-rmse:0.54446	test-rmse:0.57215
 [79]	train-rmse:0.54392	test-rmse:0.57191
 [80]	train-rmse:0.54342	test-rmse:0.57153
 [81]	train-rmse:0.54309	test-rmse:0.57132
 [82]	train-rmse:0.54299	test-rmse:0.57130
 [83]	train-rmse:0.54251	test-rmse:0.57103
 [84]	train-rmse:0.54239	test-rmse:0.57095
 [85]	train-rmse:0.54197	test-rmse:0.57077
 [86]	train-rmse:0.54146	test-rmse:0.57042
 [87]	train-rmse:0.54137	test-rmse:0.57035
 [88]	train-rmse:0.54091	test-rmse:0.57010
 [89]	train-rmse:0.54067	test-rmse:0.56994
 [90]	train-rmse:0.54059	test-rmse:0.56993
 [91]	train-rmse:0.54031	test-rmse:0.56971
 [92]	train-rmse:0.54009	test-rmse:0.56960
 [93]	train-rmse:0.53972	test-rmse:0.56913
 [94]	train-rmse:0.53932	test-rmse:0.56885
 [95]	train-rmse:0.53903	test-rmse:0.56882
 [96]	train-rmse:0.53881	test-rmse:0.56873
 [97]	train-rmse:0.53849	test-rmse:0.56849
 [98]	train-rmse:0.53814	test-rmse:0.56835
 [99]	train-rmse:0.53806	test-rmse:0.56828

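 # r2_score expects (y_true, y_pred), in that order
 print('R2 score of the linear regression model:', r2_score(y_test, lr.predict(x_test)))
 print('R2 score of the support vector regression model:', r2_score(y_test, svr.predict(x_test)))
 print('R2 score of the xgboost model:', r2_score(y_test, xg.predict(dtest)))

 out:
 (from the original run, which fit lr and svr on the test split and passed the predictions as the first argument of r2_score; the corrected code will print different values)

 R2 score of the linear regression model: 0.3927518839179279
 R2 score of the support vector regression model: 0.47270739927209093
 R2 score of the xgboost model: 0.5097990013836289
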
 # get_fscore() on the native Booster returns how often each feature is used in a split;
 # feature_importances_ belongs to the scikit-learn wrapper
 fscore = xg.get_fscore()
 im = pd.DataFrame({'importance': list(fscore.values()), 'var': list(fscore.keys())})
 im = im.sort_values(by='importance', ascending=False)
 print(im.head(10))

 out:

     importance            var
 22       201.0           AREA
 23       168.0      floor_num
 24        36.0         school
 9         34.0    district_南山
 14        29.0    district_福田
 25        28.0         subway
 12        25.0    district_宝安
 17        25.0    district_龙岗
 10        20.0    district_坪山
 11        15.0  district_大鹏新区

 # Plot the top 10 features ranked by gain (the table above ranks by split count)
 import matplotlib.pyplot as plt

 xgb.plot_importance(xg, max_num_features=10, importance_type='gain')
 plt.show()

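 # (3) Hypothetical scenario: predict for a new observation, x_new1
 '''
 Estimate the approximate cost of a home that:
 1. is in Nanshan district
 2. has 3 rooms
 3. has an area of about 80 m²
 4. is near a subway station
 5. is in a school district
 '''
 # Take one already-encoded row for each categorical value.
 # The scenario also fixes 3 halls and floor_num = 3.
 room = Roomnum.loc[Roomnum['roomnum_3'] == 1].head(1).reset_index(drop=True)
 dis = District.loc[District['district_南山'] == 1].head(1).reset_index(drop=True)
 hal = Hall.loc[Hall['hall_3'] == 1].head(1).reset_index(drop=True)
 x_new1 = pd.concat([room, dis, hal], axis=1)
 x_new1['AREA'] = 80
 x_new1['floor_num'] = 3
 x_new1['school'] = 1
 x_new1['subway'] = 1

 # Apply the scaler fitted on the training features, then predict
 x_new1_scale = ss_x.transform(x_new1)
 dtt = xgb.DMatrix(x_new1_scale, feature_names=list(fea_imp))
 p = xg.predict(dtt)
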
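 # Undo the standardization of y to get the price back in its original unit
 per_price = ss_y.inverse_transform(p.reshape(-1, 1)).ravel()
 print("Price per square metre:", per_price)
 print("Total price:", per_price * 80)

 out:
 (from the original run, which rescaled with the variance of y_train rather than its standard deviation; the corrected code will print different values)

 Price per square metre: [15.402145]
 Total price: [1232.1716]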
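
The evaluation and rescaling steps are easy to get wrong, so here is a minimal end-to-end sketch on synthetic data. It is not the article's model: the data and coefficients are made up, and plain linear regression stands in for all three models. It shows fitting on the training split only, calling `r2_score(y_true, y_pred)` in that order, and undoing the standardization with the standard deviation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (made up): the target depends linearly on two features plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 6.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=10)

# Fit the scalers on the training split and reuse them on the test split.
ss_x, ss_y = StandardScaler(), StandardScaler()
X_train_s = ss_x.fit_transform(X_train)
X_test_s = ss_x.transform(X_test)
y_train_s = ss_y.fit_transform(y_train.reshape(-1, 1)).ravel()
y_test_s = ss_y.transform(y_test.reshape(-1, 1)).ravel()

# Fit on the training split only, then predict the held-out test split.
model = LinearRegression().fit(X_train_s, y_train_s)
pred_s = model.predict(X_test_s)

# r2_score takes the true values first and the predictions second.
score = r2_score(y_test_s, pred_s)

# Undo the standardization: prediction * std + mean.
pred = ss_y.inverse_transform(pred_s.reshape(-1, 1)).ravel()
manual = pred_s * np.std(y_train) + np.mean(y_train)

print(score)
print(np.allclose(pred, manual))
```

`inverse_transform` and the manual formula agree because `StandardScaler` divides by the population standard deviation, which is what `np.std` returns by default. Using `np.var` in the manual formula would give a different result whenever the variance is not 1.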
