GitHub link
Modeling time series data.
Preparing the Data
Implement a custom time-series dataset by subclassing torch.utils.data.Dataset.
torch.utils.data.Dataset is an abstract class. To load custom data, simply subclass it and override the __len__ method (returning the dataset size) and the __getitem__ method (returning the i-th sample).
```python
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader, TensorDataset

WINDOW_SIZE = 8

class Covid19Dataset(Dataset):

    def __len__(self):
        return len(dfdiff) - WINDOW_SIZE

    def __getitem__(self, i):
        # Feature: an 8-day window of daily increments
        x = dfdiff.loc[i:i+WINDOW_SIZE-1, :]
        feature = torch.tensor(x.values)
        # Label: the increments of the day right after the window
        y = dfdiff.loc[i+WINDOW_SIZE, :]
        label = torch.tensor(y.values)
        return (feature, label)

ds_train = Covid19Dataset()

# The dataset has only 38 samples, so a single batch covers all of them
dl_train = DataLoader(ds_train, batch_size=38)
for features, labels in dl_train:
    break

# The data is too scarce for a real split; reuse the training loader for validation
dl_val = dl_train
```
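To see what the sliding window actually yields, the dataset logic can be exercised with a synthetic stand-in for dfdiff (made-up numbers with the same three columns; only the shapes matter here):

```python
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader

WINDOW_SIZE = 8

# Synthetic stand-in for dfdiff: 46 days x 3 features of fake counts
dfdiff = pd.DataFrame({"confirmed_num": range(46),
                       "cured_num": range(46),
                       "dead_num": range(46)}, dtype=float)

class Covid19Dataset(Dataset):
    def __len__(self):
        return len(dfdiff) - WINDOW_SIZE
    def __getitem__(self, i):
        # .loc slicing is inclusive, so this grabs exactly 8 rows
        feature = torch.tensor(dfdiff.loc[i:i+WINDOW_SIZE-1, :].values)
        label = torch.tensor(dfdiff.loc[i+WINDOW_SIZE, :].values)
        return feature, label

dl = DataLoader(Covid19Dataset(), batch_size=38)
features, labels = next(iter(dl))
print(features.shape)  # torch.Size([38, 8, 3])
print(labels.shape)    # torch.Size([38, 3])
```

Each sample pairs an 8-day window with the day that follows it, which is exactly what the next-day forecasting model below expects.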
Building the Model
Build a custom model by subclassing the nn.Module base class.
```python
import torch
from torch import nn
import importlib
import torchkeras

torch.random.seed()

class Block(nn.Module):

    def __init__(self):
        super(Block, self).__init__()

    def forward(self, x, x_input):
        # Treat x as a relative change rate and clamp the result at 0,
        # since daily counts cannot be negative
        x_out = torch.max((1+x)*x_input[:, -1, :], torch.tensor(0.0))
        return x_out

class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        # 3 input features, 3 hidden units, 5 stacked LSTM layers
        self.lstm = nn.LSTM(input_size=3, hidden_size=3,
                            num_layers=5, batch_first=True)
        self.linear = nn.Linear(3, 3)
        self.block = Block()

    def forward(self, x_input):
        # Keep only the LSTM output at the last time step
        x = self.lstm(x_input)[0][:, -1, :]
        x = self.linear(x)
        y = self.block(x, x_input)
        return y

net = Net()
print(net)
```
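To make the Block concrete: the linear layer's output is treated as a per-feature relative change rate, so the prediction is (1 + rate) times the last observed day, floored at zero. A standalone numeric check with made-up values:

```python
import torch

# Block logic from the model above: predict (1 + rate) * last day's values,
# clamped at 0 so counts never go negative
x_input = torch.tensor([[[10.0, 5.0, 1.0],
                         [12.0, 6.0, 1.0]]])   # (batch=1, seq=2, features=3)
rate = torch.tensor([[0.5, -0.5, -2.0]])       # hypothetical linear-layer output
out = torch.max((1 + rate) * x_input[:, -1, :], torch.tensor(0.0))
print(out.tolist())  # [[18.0, 3.0, 0.0]]
```

The third feature would be -1.0 without the clamp; the max() with 0 keeps predicted counts physically plausible.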
RNN: recurrent neural network. Each step's output depends on all previous steps. Although it can in theory handle sequences of arbitrary length, in practice it suffers from vanishing gradients and poor computational efficiency, and struggles to remember long histories.
LSTM: long short-term memory network. It introduces an input gate, a forget gate, an output gate, and a memory cell.
- Input gate: decides how much new information to take in
- Forget gate: decides how much old information to discard
- Output gate: decides which part of the memory to use
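The three gates can be sketched as a single hand-written LSTM step (illustrative weight names, bias terms omitted for brevity; PyTorch packs these matrices into weight_ih_l*/weight_hh_l*):

```python
import torch

# One LSTM step written out by hand to make the gates concrete
def lstm_step(x, h, c, W_xi, W_hi, W_xf, W_hf, W_xg, W_hg, W_xo, W_ho):
    i = torch.sigmoid(x @ W_xi + h @ W_hi)  # input gate: how much new info to admit
    f = torch.sigmoid(x @ W_xf + h @ W_hf)  # forget gate: how much old memory to keep
    g = torch.tanh(x @ W_xg + h @ W_hg)     # candidate memory content
    o = torch.sigmoid(x @ W_xo + h @ W_ho)  # output gate: which part of memory to emit
    c_new = f * c + i * g                   # memory cell update
    h_new = o * torch.tanh(c_new)           # hidden state
    return h_new, c_new

d = 3
ws = [torch.randn(d, d) for _ in range(8)]
h, c = lstm_step(torch.randn(1, d), torch.zeros(1, d), torch.zeros(1, d), *ws)
print(h.shape, c.shape)  # torch.Size([1, 3]) torch.Size([1, 3])
```

The additive cell update `f * c + i * g` is what lets gradients flow over long ranges better than a plain RNN's repeated matrix multiplication.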
The model summary is as follows.
```
--------------------------------------------------------------------------
Layer (type)                        Output Shape              Param #
==========================================================================
LSTM-1                              [-1, 8, 3]                    480
Linear-2                            [-1, 3]                        12
Block-3                             [-1, 3]                         0
==========================================================================
Total params: 492
Trainable params: 492
Non-trainable params: 0
--------------------------------------------------------------------------
Input size (MB): 0.000076
Forward/backward pass size (MB): 0.000229
Params size (MB): 0.001877
Estimated Total Size (MB): 0.002182
--------------------------------------------------------------------------
```
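The parameter counts can be verified by hand: each LSTM layer holds 4 gates, each with an input-to-hidden matrix, a hidden-to-hidden matrix, and two bias vectors. A standalone check:

```python
import torch
from torch import nn

# Reproduce the parameter counts from the summary above
lstm = nn.LSTM(input_size=3, hidden_size=3, num_layers=5, batch_first=True)
linear = nn.Linear(3, 3)

lstm_params = sum(p.numel() for p in lstm.parameters())
linear_params = sum(p.numel() for p in linear.parameters())

# Per layer: 4 gate matrices of shape (hidden, input) and (hidden, hidden),
# plus two bias vectors of length 4*hidden
per_layer = 4*3*3 + 4*3*3 + 4*3 + 4*3   # = 96
print(lstm_params)    # 5 * 96 = 480
print(linear_params)  # 3*3 + 3 = 12
```

Because hidden_size equals input_size here, every layer has the same 96 parameters, giving 5 × 96 = 480 for the LSTM and 492 in total.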
Training the Model
Use the KerasModel utility from torchkeras to train the model, so no custom training loop is needed.
Note: recurrent neural networks can be difficult to tune; try several different learning rates to get good results.
```python
from torchmetrics.regression import MeanAbsolutePercentageError

def mspe(y_pred, y_true):
    # Mean squared percentage error; the max() guards against division by zero
    err_percent = (y_true - y_pred)**2 / torch.max(y_true**2, torch.tensor(1e-7))
    return torch.mean(err_percent)

net = Net()
loss_fn = mspe
metric_dict = {"mape": MeanAbsolutePercentageError()}

optimizer = torch.optim.Adam(net.parameters(), lr=0.01)
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.0001)

from torchkeras import KerasModel
keras_model = KerasModel(net,
                         loss_fn=loss_fn,
                         metrics_dict=metric_dict,
                         optimizer=optimizer,
                         lr_scheduler=lr_scheduler)

dfhistory = keras_model.fit(train_data=dl_train,
                            val_data=dl_val,
                            epochs=100,
                            ckpt_path='checkpoint',
                            patience=10,
                            monitor='val_loss',
                            mode='min',
                            callbacks=None,
                            plot=True,
                            cpu=True)
```
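The custom mspe loss is worth sanity-checking on a tiny tensor before trusting it in training (made-up numbers):

```python
import torch

def mspe(y_pred, y_true):
    # Squared percentage error, averaged; max() prevents division by zero
    err_percent = (y_true - y_pred)**2 / torch.max(y_true**2, torch.tensor(1e-7))
    return torch.mean(err_percent)

y_true = torch.tensor([100.0, 200.0])
y_pred = torch.tensor([110.0, 180.0])
# ((10/100)^2 + (20/200)^2) / 2 = (0.01 + 0.01) / 2 = 0.01
print(mspe(y_pred, y_true).item())  # ~0.01
```

Both predictions are off by 10% relative to the true values, so the loss is 0.1² = 0.01, confirming the percentage scaling behaves as intended.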
After changing gamma to 1, the loss dropped from 0.5983 to around 0.37. This is less mysterious than it first seems: StepLR multiplies the learning rate by gamma every step_size epochs, so with gamma=0.0001 the learning rate collapses to nearly zero after 10 epochs and training effectively stalls, while gamma=1 leaves the learning rate constant at 0.01.
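The StepLR behavior is easy to confirm in isolation (a standalone sketch, unrelated to the actual training run):

```python
import torch
from torch import nn

param = nn.Parameter(torch.zeros(1))
opt = torch.optim.Adam([param], lr=0.01)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=10, gamma=0.0001)

for epoch in range(20):
    opt.step()
    sched.step()

# The lr has been multiplied by gamma once after 10 epochs and again after 20
print(opt.param_groups[0]["lr"])  # 0.01 * 0.0001**2 = 1e-10
```

With gamma=0.0001 the effective learning rate is 1e-10 by epoch 20, far too small to move the weights; with gamma=1 the same loop would leave the learning rate at 0.01 throughout.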
Evaluating and Using the Model
Since the data here is so limited, we only visualize how the loss evolved on the training set.
Use the model to predict when the epidemic will end, i.e., the date when new confirmed cases drop to 0.
```python
import pandas as pd

dfresult = dfdiff[["confirmed_num", "cured_num", "dead_num"]].copy()
dfresult.tail()

# Roll the model forward autoregressively: feed the most recent history,
# predict the next day, append it, and repeat
for i in range(1000):
    arr_input = torch.unsqueeze(torch.from_numpy(dfresult.values[-38:, :]), axis=0)
    arr_predict = keras_model.forward(arr_input)
    dfpredict = pd.DataFrame(torch.floor(arr_predict).data.numpy(),
                             columns=dfresult.columns)
    dfresult = pd.concat([dfresult, dfpredict], ignore_index=True)

# Find the first predicted day with 0 new confirmed / 0 new cured cases
dfresult.query("confirmed_num==0").head()
dfresult.query("cured_num==0").head()
```
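The append-and-slide pattern used above can be isolated with a toy stand-in model (a hypothetical rollout helper, just to show the mechanics):

```python
import torch

# The rollout pattern, isolated: repeatedly feed the last `window` steps,
# predict one step, append it, and slide forward
def rollout(model, history, steps, window):
    seq = history.clone()
    for _ in range(steps):
        x = seq[-window:].unsqueeze(0)                    # (1, window, features)
        y = model(x)                                      # (1, features)
        seq = torch.cat([seq, y.squeeze(0).unsqueeze(0)], dim=0)
    return seq

# Toy "model": halves the last observed value each step (hypothetical)
toy = lambda x: x[:, -1, :] * 0.5
hist = torch.ones(5, 1) * 64.0
out = rollout(toy, hist, steps=3, window=5)
print(out[-3:].flatten().tolist())  # [32.0, 16.0, 8.0]
```

Because each predicted day is fed back as input, errors compound over the horizon, which is one reason the long-range forecasts below are so unstable.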
On closer inspection, not only does the loss come out different on every run with identical hyperparameters, the predictions are also wildly off: sometimes the virus is eradicated within two months, sometimes it still isn't resolved after three years, and sometimes new confirmed cases hit 0 while new cured cases go to inf — "treating the illness before it exists," apparently.