Learning Rate Scheduler
WarmUp
Cosine Annealing
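Warmup and cosine annealing are often combined into a single schedule. The following is a minimal sketch of one common variant (linear warmup, then a half-cosine decay), not a reference implementation. The function name and the parameters `warmup_steps`, `base_lr`, and `min_lr` are illustrative assumptions, not taken from these notes.

```python
import math

def warmup_cosine_lr(step, total_steps, warmup_steps, base_lr, min_lr=0.0):
    """Learning rate at `step`: linear warmup followed by cosine annealing.

    All names here are hypothetical and chosen for illustration only.
    """
    if step < warmup_steps:
        # Linear ramp: base_lr / warmup_steps at step 0, base_lr at the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    # Half-cosine decay from base_lr down to min_lr.
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

if __name__ == "__main__":
    for s in (0, 5, 10, 55, 100):
        print(s, round(warmup_cosine_lr(s, total_steps=100, warmup_steps=10, base_lr=1e-3), 6))
```

The warmup phase avoids large early updates while the model is still far from a reasonable region; the cosine phase then reduces the step size smoothly rather than in discrete drops.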
Large Batch-Size
16, 32, 64, 128, 512, 1024, 2048
Fine-Tuning
Generic training -> Adam
Fine-tuning -> L-BFGS
Transfer Learning
Temporal Causality
Architecture
Temporal Encoders
Tanh, Sine+Cosine, Fourier
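As an illustration of the Sine+Cosine and Fourier options, here is a minimal sketch of a time encoder that maps a scalar time to sine/cosine pairs at integer multiples of a base frequency. With one frequency it reduces to a plain sine+cosine encoding. The function name and the `num_frequencies` and `period` parameters are assumptions made for this example; the Tanh variant is not covered here.

```python
import math

def fourier_time_features(t, num_frequencies=4, period=1.0):
    """Encode scalar time t as [sin(k*w*t), cos(k*w*t)] for k = 1..num_frequencies.

    Hypothetical helper for illustration; w = 2*pi / period is the base frequency.
    """
    omega = 2.0 * math.pi / period
    features = []
    for k in range(1, num_frequencies + 1):
        features.append(math.sin(k * omega * t))
        features.append(math.cos(k * omega * t))
    return features

if __name__ == "__main__":
    print(fourier_time_features(0.25, num_frequencies=2))
```

Because every feature is periodic with period `period`, this encoding makes the downstream network periodic in time by construction, which is only appropriate when the target is expected to be periodic or when `period` covers the whole time domain.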
Training Perspective
Sequence-to-Sequence Training
Causal Learning Loss
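One common way to build a causal learning loss, assuming that is what these notes mean, is to split the time domain into ordered segments and down-weight each segment's loss by how poorly the earlier segments are currently fit: w_i = exp(-epsilon * sum of the losses of all earlier segments). The sketch below shows only the weighting step. The function name and the `epsilon` parameter are illustrative assumptions.

```python
import math

def causal_weighted_loss(segment_losses, epsilon=1.0):
    """Return (total_loss, weights) for per-segment losses ordered in time.

    Hypothetical helper: weight i is exp(-epsilon * sum of losses before i),
    so later segments contribute only once earlier ones are small.
    """
    weights = []
    cumulative = 0.0  # running sum of losses from all earlier segments
    for loss in segment_losses:
        weights.append(math.exp(-epsilon * cumulative))
        cumulative += loss
    total = sum(w * l for w, l in zip(weights, segment_losses)) / len(segment_losses)
    return total, weights

if __name__ == "__main__":
    total, weights = causal_weighted_loss([0.5, 0.2, 0.1], epsilon=2.0)
    print(total, weights)
```

In a gradient-based setup the weights are usually treated as constants (no gradient flows through them), and `epsilon` controls how strictly the temporal ordering is enforced: a larger value suppresses later segments until the earlier ones are well resolved.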