365-Day Deep Learning Training Camp, Week R3: Weather Prediction
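- 🍨 This post is a study log from the 🔗365天深度学习训练营 (365-Day Deep Learning Training Camp)
- 🍖 Original author: K同学啊
My environment:
- Language: Python 3.11.2
- IDE: PyCharm Community Edition 2022.3
- Deep learning framework: TensorFlow 2
Goal for this week:
- Understand exploratory data analysis (EDA)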
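1. Importing the Data
Make a copy of the data right after loading it.
import pandas

# Raw string so the backslash in the Windows path is taken literally
data = pandas.read_csv(r'F:\weatherAUS.csv')
df = data.copy()  # untouched copy, used later for plotting and modeling
print(data.head())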

Summarize the numeric columns with descriptive statistics.
print(data.describe())

Check the data type of each column.
print(data.dtypes)

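Extract the year, month, and day from the date. dt is the pandas accessor for operations on datetime-typed Series (dt.*).
data['Date'] = pandas.to_datetime(data['Date'])
print(data['Date'])
# Add the date parts to both frames; column names are kept as in the original
data['year'] = data['Date'].dt.year
data['Month'] = data['Date'].dt.month
data['day'] = data['Date'].dt.day
df['year'] = data['Date'].dt.year
df['Month'] = data['Date'].dt.month
df['day'] = data['Date'].dt.day
# The raw Date column is no longer needed in data
data.drop('Date', axis=1, inplace=True)
print(data.head())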
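
To see what the dt accessor returns in isolation, here is a minimal sketch on a three-row toy frame. The dates are made up for illustration and are not taken from weatherAUS.csv; the column names simply mirror the ones used above.

```python
import pandas

# Toy dates for illustration only (not taken from weatherAUS.csv)
toy = pandas.DataFrame({'Date': ['2008-12-01', '2009-01-15', '2010-07-04']})

# Strings become datetime64 values, which unlocks the .dt accessor
toy['Date'] = pandas.to_datetime(toy['Date'])

toy['year'] = toy['Date'].dt.year
toy['Month'] = toy['Date'].dt.month
toy['day'] = toy['Date'].dt.day

print(toy)
```

Calling .dt on a column that still holds plain strings raises an AttributeError, which is why the to_datetime conversion comes first.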

2. Exploratory Data Analysis (EDA)
2.1 Exploring correlations
seaborn is a Python data visualization library built on matplotlib that offers additional statistical plot types.
import matplotlib.pyplot as plt
import seaborn as sns

# Drop the text columns so that only numeric columns are passed to corr()
data.drop(columns=['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm',
                   'RainToday', 'RainTomorrow'], inplace=True)
print(data.head())
print(data.columns)

plt.figure(figsize=(15, 13))
ax = sns.heatmap(data.corr(), square=True, annot=True, fmt='.2f')
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
plt.show()

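data.corr() computes the pairwise correlation coefficient (Pearson by default) between the columns of data; the heatmap shows one coefficient per pair of columns.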
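
As a small illustration of what corr() returns, the sketch below uses three synthetic columns (a, b, and c are invented names, not columns of the weather data). Column b rises with a, and c falls as a rises, so the coefficients come out at the two extremes.

```python
import pandas

# Synthetic columns for illustration: b rises with a, c falls as a rises
toy = pandas.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [2.0, 4.0, 6.0, 8.0],
    'c': [4.0, 3.0, 2.0, 1.0],
})

corr = toy.corr()  # Pearson correlation by default
print(corr)
```

The result is a square, symmetric table with 1 on the diagonal. In recent pandas versions corr() fails on frames that still contain text columns unless numeric_only=True is passed, which is one reason the text columns are dropped before plotting the heatmap.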
2.2 Will it rain?
sns.set(style="darkgrid")
# Class balance of the target column
plt.figure(figsize=(4, 3))
sns.countplot(x='RainTomorrow', data=df)
plt.show()

plt.figure(figsize=(4, 3))
sns.countplot(x='RainToday', data=df)
plt.show()

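Cross-tabulate whether it rains tomorrow against whether it rains today.
x = pandas.crosstab(df['RainTomorrow'], df['RainToday'])
print(x)
# Divide each row by its total to turn counts into row percentages
y = x / x.transpose().sum().values.reshape(2, 1) * 100
print(y)
y.plot(kind='bar', figsize=(4, 3), color=['green', 'blue'])
plt.show()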

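The first table shows raw counts: for example, there are 92728 days on which it rained neither today nor tomorrow, and the other three cells read the same way. The second table converts these counts into percentages of each row. Because the rows are indexed by RainTomorrow, each row is conditioned on tomorrow's outcome: among days followed by a dry day, 84.616648% were dry themselves and 15.383352% were rainy. The other row reads the same way.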
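
The table y is normalized per row, so its percentages are conditional on the RainTomorrow value. pandas.crosstab can produce either direction through its normalize argument. The following is a minimal sketch on synthetic Yes/No pairs (invented for illustration, not weather data) that shows both directions and checks that the manual formula used above matches normalize='index'.

```python
import pandas

# Synthetic Yes/No pairs for illustration only
toy = pandas.DataFrame({
    'RainToday':    ['No', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No'],
    'RainTomorrow': ['No', 'No', 'Yes', 'Yes', 'No', 'No', 'Yes', 'No'],
})

counts = pandas.crosstab(toy['RainTomorrow'], toy['RainToday'])

# Same formula as in the text: divide each row by its total
manual = counts / counts.transpose().sum().values.reshape(2, 1) * 100

# Each row sums to 100: distribution of RainToday for a given RainTomorrow
by_tomorrow = pandas.crosstab(toy['RainTomorrow'], toy['RainToday'],
                              normalize='index') * 100

# Each column sums to 100: distribution of RainTomorrow for a given RainToday
by_today = pandas.crosstab(toy['RainTomorrow'], toy['RainToday'],
                           normalize='columns') * 100

print(counts)
print(by_tomorrow)
print(by_today)
```

For a forecasting question ("given today's weather, how likely is rain tomorrow?"), the column-normalized table by_today is the one to read.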

2.3 Location and rainfall
x = pandas.crosstab(df['Location'], df['RainToday'])
# Percentage of rainy and dry days per location (each row sums to 100)
y = x / x.transpose().sum().values.reshape((-1, 1)) * 100
y = y.sort_values(by='Yes', ascending=True)
color = ['#cc6699', '#006666', '#862d86', '#ff9966']
y.Yes.plot(kind='barh', figsize=(15, 20), color=color)
plt.show()

2.4 Effect of humidity and pressure on rain
print(data.columns)
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='Pressure9am', y='Pressure3pm', hue='RainTomorrow')
plt.show()
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='Humidity9am', y='Humidity3pm', hue='RainTomorrow')
plt.show()


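Low pressure and high humidity go with a higher chance of rain the next day, especially for the 3 pm readings.
2.5 Effect of temperature on rain
plt.figure(figsize=(8, 6))
sns.scatterplot(x='MaxTemp', y='MinTemp', data=df, hue='RainTomorrow')
plt.show()

Conclusion: when a day's maximum and minimum temperatures are close, the chance of rain the next day increases.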
3. Data Preprocessing
Check the percentage of missing values in each column.
import pandas
import numpy
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam
from sklearn.metrics import classification_report, confusion_matrix, r2_score, mean_absolute_error
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error
# Percentage of missing values per column
d = df.isnull().sum() / df.shape[0] * 100
print(d)

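Fill in the missing values.
# These columns are filled by sampling at random from their observed values
lst = ['Evaporation', 'Sunshine', 'Cloud9am', 'Cloud3pm']
for col in lst:
    fill_list = df[col].dropna()
    df[col] = df[col].fillna(pandas.Series(numpy.random.choice(fill_list, size=len(df.index))))

# Text (object) columns: fill with the most frequent value
s = (df.dtypes == 'object')
print(s)
print(s[s])
object_cols = list(s[s].index)
print(object_cols)
for i in object_cols:
    df[i] = df[i].fillna(df[i].mode()[0])

# Assumed step, not shown in the original: encode the text columns as integers
# with the imported LabelEncoder so that MinMaxScaler and the network accept them
for i in object_cols:
    df[i] = LabelEncoder().fit_transform(df[i])

# float64 columns: fill with the median
t = (df.dtypes == 'float64')
num_cols = list(t[t].index)
print(num_cols)
for i in num_cols:
    df[i] = df[i].fillna(df[i].median())
print(df.isnull().sum())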
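
The random-sample fill used for Evaporation, Sunshine, Cloud9am, and Cloud3pm can be tried on a tiny Series first. The sketch below is an illustration under assumptions: the values are synthetic, and the seeded generator rng is an addition for repeatability that the original code does not use.

```python
import numpy
import pandas

# Fixed seed so the fill is repeatable (an addition, not in the original)
rng = numpy.random.default_rng(0)

# Synthetic column with gaps, for illustration only
s = pandas.Series([1.0, numpy.nan, 3.0, numpy.nan, 5.0])

observed = s.dropna()
# Draw one candidate per row; fillna only uses the candidates at missing positions
candidates = pandas.Series(rng.choice(observed.to_numpy(), size=len(s)),
                           index=s.index)
filled = s.fillna(candidates)
print(filled)
```

fillna aligns the two Series by index, so giving the candidates the same index as the column keeps the fill correct even when the frame no longer has the default 0..n-1 index. Sampling from observed values preserves the column's distribution better than a constant fill, at the cost of adding noise.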

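Build the dataset.
# Features: every column except the target and day; target: RainTomorrow
X = df.drop(['RainTomorrow', 'day'], axis=1).values
y = df['RainTomorrow'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=101)
# Fit the scaler on the training split only, then apply it to both splits
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)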
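
With its default range of 0 to 1, MinMaxScaler rescales each column as (x - min) / (max - min), using the minimum and maximum seen during fit. The sketch below reproduces that arithmetic with plain numpy on synthetic arrays (the values are invented for illustration) to show why the scaler is fitted on the training split only.

```python
import numpy

# Synthetic feature matrices for illustration (rows are samples, columns are features)
X_train = numpy.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
X_test = numpy.array([[2.5, 40.0]])

# "fit": learn the per-column min and max from the training split only
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)

# "transform": apply the same statistics to both splits
X_train_scaled = (X_train - col_min) / (col_max - col_min)
X_test_scaled = (X_test - col_min) / (col_max - col_min)

print(X_train_scaled)
print(X_test_scaled)  # the second value exceeds 1 because 40 is above the training max
```

Test values can land outside the 0 to 1 range, and that is expected: the test split is treated as unseen data, so its statistics never influence the scaling.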
4. Predicting Whether It Will Rain
4.1 Build, compile, and train the neural network
model = Sequential()
model.add(Dense(units=24, activation='tanh'))
model.add(Dense(units=18, activation='tanh'))
model.add(Dense(units=23, activation='tanh'))
model.add(Dropout(0.5))
model.add(Dense(units=12, activation='tanh'))
model.add(Dropout(0.2))
# A single sigmoid unit outputs the probability of rain tomorrow
model.add(Dense(units=1, activation='sigmoid'))
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)
# metrics is passed as a list
model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
early_stop = EarlyStopping(monitor='val_loss', mode='min', min_delta=0.001, verbose=1,
                           patience=25, restore_best_weights=True)
# Note: with epochs=10, a patience of 25 can never trigger an early stop
model.fit(x=X_train, y=y_train,
          validation_data=(X_test, y_test),
          verbose=1,
          callbacks=[early_stop],
          epochs=10,
          batch_size=32
          )
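
The interaction between patience and epochs is easier to see in a stripped-down model of the stopping rule. The helper epochs_run below is hypothetical and simplified: it is not Keras's implementation, and the validation losses are invented for illustration. It only counts how many epochs run before a patience-based stop would fire.

```python
def epochs_run(val_losses, patience, min_delta=0.0):
    """Return how many epochs run before a patience-based stop (simplified sketch)."""
    best = float('inf')
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # Improvement by more than min_delta: remember it and reset the counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    # Patience was never exhausted, so every epoch ran
    return len(val_losses)


# Invented validation losses for a 10-epoch run: best at epoch 3, then worsening
losses = [0.50, 0.45, 0.44, 0.46, 0.47, 0.48, 0.49, 0.50, 0.51, 0.52]

print(epochs_run(losses, patience=25, min_delta=0.001))
print(epochs_run(losses, patience=3, min_delta=0.001))
```

With a patience larger than the number of epochs, the rule never fires and training always runs to the end. To make early stopping effective, either raise epochs well above patience or lower patience.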

4.2 Visualizing the results
acc = model.history.history['accuracy']
val_acc = model.history.history['val_accuracy']
loss = model.history.history['loss']
val_loss = model.history.history['val_loss']
# Use the number of recorded epochs so the plot still works if training stops early
epochs_range = range(len(acc))
plt.figure(figsize=(14, 4))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')
plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.show()

