
Transformer Model and TST are not converging. #634

Open
HasnainKhanNiazi opened this issue Dec 5, 2022 · 6 comments
Labels: question (Further information is requested), under review (Waiting for clarification, confirmation, etc.)

Comments


HasnainKhanNiazi commented Dec 5, 2022

I am working on a regression problem where I am using TransformerModel and TST for training. My dataset and model config can be seen below.

Dataset (both models)
Window length = 100
Features per time step = 94
I am using batch_tfms=TSStandardize(by_var=True), as shown in the original paper.

Model Config (TransformerModel)
d_model=768
n_head=12
n_layers=12
loss=MSELossFlat

Model Config (TST)
n_layers=12
d_model=768
n_heads=12
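
For context, here is a minimal sketch of this setup in tsai (not my exact pipeline: the dummy data, c_out=1, and the TSRegression transform are assumptions for illustration):

```python
# Minimal sketch of the setup above (assumptions: random dummy data in place of
# the real multi-file data; c_out=1 for a single regression target).
import numpy as np
from tsai.all import *

X = np.random.randn(1024, 94, 100).astype(np.float32)  # (samples, 94 features, window=100)
y = np.random.randn(1024).astype(np.float32)            # float regression target
splits = get_splits(y, valid_size=0.2, stratify=False)

dls = get_ts_dls(X, y, splits=splits, tfms=[None, TSRegression()],
                 batch_tfms=TSStandardize(by_var=True))

# BERT-base-sized configs, as listed above
transformer = TransformerModel(dls.vars, 1, d_model=768, n_head=12, n_layers=12)
tst = TST(dls.vars, 1, dls.len, n_layers=12, d_model=768, n_heads=12)

learn = Learner(dls, tst, loss_func=MSELossFlat(), metrics=rmse)
learn.fit_one_cycle(1, 1e-4)
```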

TransformerModel is taking around 3 hours per epoch, and the 34th epoch is currently training. The lowest validation loss I got for both TransformerModel and TST was at the 9th epoch; since then, neither model has converged.

My dataset looks like this:

| A | B | C | D | E | F | G | H |
|---|---|---|---|---|---|---|---|
| 34 | 19.5 | 19.5 | 1 | 0.1 | 0 | -35.7742 | -2.25 |
| 34 | 19.5 | 19.5 | 1 | -0.1 | 0 | -39.1072 | -2.25 |
| 34 | 19.5 | 19.5 | 1 | 0 | 0 | -38.885 | -2.5 |
| 34 | 19.5 | 19.5 | 1 | 1 | 0 | -38.6628 | -2.5 |

For obvious reasons, I am not able to post the whole dataset. Any help will be appreciated. Thanks

oguiza commented Dec 5, 2022

Hi @HasnainKhanNiazi,
Here are a few comments:

> TransformerModel is taking around 3 hours per epoch, and the 34th epoch is currently training. The lowest validation loss I got for both TransformerModel and TST was at the 9th epoch; since then, neither model has converged.

Are you using a GPU? This is a really long time per epoch, unless your dataset is huge.
Do you mean converging or diverging? Based on what you say, it seems your models may be overfitting (your key metric, MSE, grows after some time). In case of overfitting, Jeremy Howard (fastai) recommends the following steps (in order):

  • Get more data
  • Use data augmentation
  • Use a more generalizable architecture (one that uses batchnorm, etc. - you are already doing this)
  • Use regularization: weight decay, a lower learning rate, ... (see the sketch after this list)
  • Reduce architecture complexity: the model you are using is really big, with 30M+ parameters, compared to the models generally used with time series. This is fine only if you have a large number of samples.
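
To make the last two bullets concrete, here is a sketch (not tested against your data; `dls` as built in your setup, dropout values picked arbitrarily):

```python
# Illustrative sketch: a default-sized TST plus dropout and weight decay,
# instead of the 30M+ parameter BERT-sized config. `dls` as built earlier.
from tsai.all import *

model = TST(dls.vars, 1, dls.len,
            n_layers=3, d_model=128, n_heads=16,  # tsai defaults: far fewer parameters
            dropout=0.3, fc_dropout=0.5)          # TST exposes both dropout arguments
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(20, lr_max=1e-4, wd=1e-2)     # wd applies weight decay
```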

oguiza added the "question" label on Dec 5, 2022

HasnainKhanNiazi commented Dec 5, 2022

Hi @oguiza, thanks for your insights. Yes, I am using a GPU (an Nvidia A100) for training; one epoch takes 3 hours because the dataset is really huge. I don't think the model is overfitting: the training loss itself is quite large, and with overfitting the training loss should be low, not huge.

I will change the model architecture for sure. I was trying to recreate BERT for this regression problem, since BERT uses the same config I listed (d_model=768, 12 layers, 12 heads).

I am attaching an image of the training run; it may help identify the core problem.

[Screenshot from 2022-12-05 11-55-17: training and validation losses]

EDIT: I am also using MetaDataSet, as my data is distributed across multiple files. len(mdset) is 4,083,010.
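
For reference, the multi-file wiring looks roughly like this (a sketch: file_arrays is a hypothetical placeholder for my per-file loading, and the meta-dataset names follow tsai's tutorial notebook):

```python
# Rough sketch of the multi-file setup (file_arrays is a hypothetical placeholder
# for per-file (X_i, y_i) arrays; actual loading code omitted).
from tsai.all import *

dsets = [TSDatasets(X_i, y_i, tfms=[None, TSRegression()]) for X_i, y_i in file_arrays]
mdset = TSMetaDataset(dsets)                      # len(mdset) == total samples across files
splits = TimeSplitter(show_plot=False)(mdset)
mdsets = TSMetaDatasets(mdset, splits=splits)
dls = TSDataLoaders.from_dsets(mdsets.train, mdsets.valid,
                               batch_tfms=TSStandardize(by_var=True))
```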


oguiza commented Dec 5, 2022

Hi @HasnainKhanNiazi,
Looking at the losses, the model is not learning anything.
Something I'd recommend is training on a small dataset first. That way you can run multiple quick iterations until you see it start learning, and then scale up.
Looking at the large loss, it seems the issue is related to how you are scaling the data.
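
A sketch of that debugging loop (the subset size is arbitrary; X and y stand for your full arrays):

```python
# Sanity-check training on a small random subset before scaling up.
# X, y are the full arrays (shape (n, 94, 100) and (n,) here).
import numpy as np
from tsai.all import *

idx = np.random.choice(len(X), 10_000, replace=False)  # ~10k of the ~4M samples
splits = get_splits(y[idx], valid_size=0.2, stratify=False)
dls = get_ts_dls(X[idx], y[idx], splits=splits,
                 tfms=[None, TSRegression()],
                 batch_tfms=TSStandardize(by_var=True))
learn = ts_learner(dls, TSTPlus, loss_func=MSELossFlat())
learn.lr_find()                 # pick a sensible learning rate first
learn.fit_one_cycle(5, 1e-4)    # the loss should visibly drop within a few epochs
```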

HasnainKhanNiazi commented

Thanks @oguiza, I will train on smaller chunks. I will keep this issue open for now and close it once I reach a conclusion. I will also post an update here. Thanks.

HasnainKhanNiazi commented

Hi @oguiza, I have been doing some experiments with transformers, and I have implemented some basic architectures such as:

  1. CNN + Vanilla Encoders + CNN
  2. Vanilla Encoders + CNN
  3. Vanilla Encoders + MLP

All of these models are learning and the validation loss is decreasing, but when I use the same data with the TST and TSTPlus architectures, the models don't learn anything. I am not sure what could be wrong, since I do the same data preprocessing in both cases.

oguiza added the "under review" label on Feb 16, 2023

oguiza commented Mar 16, 2023

Hi @HasnainKhanNiazi,
Sorry for the late reply. Could you please paste a code snippet to reproduce the issue? I have not been able to reproduce it.
Have you tried using your approach with any of the datasets available in tsai?
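
For example, a minimal check on a bundled regression dataset could look like this (a sketch; the dataset name and get_Monash_regression_data are assumed from tsai's data utilities):

```python
# Hedged sketch: run the same pipeline on a tsai-bundled regression dataset.
from tsai.all import *

X, y, splits = get_Monash_regression_data('AppliancesEnergy', split_data=False)
dls = get_ts_dls(X, y, splits=splits, tfms=[None, TSRegression()],
                 batch_tfms=TSStandardize(by_var=True))
learn = ts_learner(dls, TSTPlus, loss_func=MSELossFlat(), metrics=rmse)
learn.fit_one_cycle(10, 1e-3)   # if this learns, the issue is in the custom data pipeline
```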
