Learning Rate Scheduler

  • WarmUp
  • Cosine Annealing
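
The two items above are often combined into a single schedule. The following is a minimal sketch, assuming a linear warmup followed by cosine annealing; the function name `lr_at_step` and all default values are illustrative choices, not taken from these notes.

```python
import math


def lr_at_step(step, base_lr=1e-3, warmup_steps=100, total_steps=1000, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine annealing.

    All defaults are hypothetical values chosen for illustration.
    """
    if step < warmup_steps:
        # Linear ramp that reaches base_lr on the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the annealing phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    # Cosine decay from base_lr (progress = 0) down to min_lr (progress = 1).
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Sample the schedule at a few steps.
schedule = [lr_at_step(s) for s in (0, 50, 99, 100, 550, 1000)]
```

The warmup keeps early updates small while the optimizer statistics are still noisy; the cosine phase then lowers the rate smoothly rather than in discrete drops.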

Large Batch-Size

  • 16, 32, 64, 128, 512, 1024, 2048

Fine-Tuning

  1. Generic -> Adam
  2. Fine-Tune -> LBFGS
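
The two-stage recipe above can be sketched on a toy problem. This is only an illustration under stated assumptions: the ill-conditioned quadratic `loss_and_grad` stands in for a real network loss, and `adam` and `lbfgs` are small hand-written versions (L-BFGS with a two-loop recursion and a backtracking line search), not a library API.

```python
import math


def loss_and_grad(x):
    # Hypothetical stand-in for a network loss: a quadratic with minimum at (1, -2).
    scales, target = (1.0, 100.0), (1.0, -2.0)
    diff = [xi - ti for xi, ti in zip(x, target)]
    loss = 0.5 * sum(s * d * d for s, d in zip(scales, diff))
    return loss, [s * d for s, d in zip(scales, diff)]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def adam(x, steps=200, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    """Stage 1: generic first-order training."""
    m, v = [0.0] * len(x), [0.0] * len(x)
    for t in range(1, steps + 1):
        _, g = loss_and_grad(x)
        m = [b1 * mi + (1 - b1) * gi for mi, gi in zip(m, g)]
        v = [b2 * vi + (1 - b2) * gi * gi for vi, gi in zip(v, g)]
        # Bias-corrected moment estimates.
        m_hat = [mi / (1 - b1 ** t) for mi in m]
        v_hat = [vi / (1 - b2 ** t) for vi in v]
        x = [xi - lr * mh / (math.sqrt(vh) + eps)
             for xi, mh, vh in zip(x, m_hat, v_hat)]
    return x


def lbfgs(x, steps=50, history=5, tol=1e-10):
    """Stage 2: quasi-Newton fine-tuning from the Adam solution."""
    s_hist, y_hist = [], []
    loss, g = loss_and_grad(x)
    for _ in range(steps):
        if math.sqrt(dot(g, g)) < tol:
            break
        # Two-loop recursion: r approximates (inverse Hessian) times g.
        q, alphas = list(g), []
        for s, y in zip(reversed(s_hist), reversed(y_hist)):
            rho = 1.0 / dot(y, s)
            a = rho * dot(s, q)
            q = [qi - a * yi for qi, yi in zip(q, y)]
            alphas.append((a, rho))
        gamma = (dot(s_hist[-1], y_hist[-1]) / dot(y_hist[-1], y_hist[-1])
                 if s_hist else 1.0)
        r = [gamma * qi for qi in q]
        for (a, rho), s, y in zip(reversed(alphas), s_hist, y_hist):
            b = rho * dot(y, r)
            r = [ri + (a - b) * si for ri, si in zip(r, s)]
        d = [-ri for ri in r]
        # Backtracking line search with the Armijo sufficient-decrease test.
        t, slope = 1.0, dot(g, d)
        while True:
            x_new = [xi + t * di for xi, di in zip(x, d)]
            new_loss, g_new = loss_and_grad(x_new)
            if new_loss <= loss + 1e-4 * t * slope or t < 1e-12:
                break
            t *= 0.5
        s = [xn - xo for xn, xo in zip(x_new, x)]
        y = [gn - go for gn, go in zip(g_new, g)]
        if dot(y, s) > 1e-12:  # keep only pairs with positive curvature
            s_hist.append(s)
            y_hist.append(y)
            s_hist, y_hist = s_hist[-history:], y_hist[-history:]
        x, loss, g = x_new, new_loss, g_new
    return x


x_adam = adam([0.0, 0.0])   # coarse solution
x_final = lbfgs(x_adam)     # refined solution
```

Adam is robust far from the optimum, while L-BFGS uses curvature information and tends to converge much faster once the iterate is already in a well-behaved region. In a deep-learning framework the same pattern would swap the optimizer object after the first stage rather than reimplement either method.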

Transfer Learning

Temporal Causality

Architecture

  • Temporal Encoders
  • Tanh, Sine+Cosine, Fourier
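
As one possible reading of the sine+cosine and Fourier encoders listed above, a scalar time `t` can be mapped to a vector of periodic features before it enters the network. The sketch below assumes integer harmonics of a known `period`; the function name and defaults are hypothetical.

```python
import math


def fourier_time_features(t, num_frequencies=4, period=1.0):
    """Encode scalar time t as [sin(w_1 t), cos(w_1 t), ..., sin(w_K t), cos(w_K t)].

    Assumes integer harmonics w_k = 2 * pi * k / period (an illustrative choice).
    """
    features = []
    for k in range(1, num_frequencies + 1):
        angle = 2.0 * math.pi * k * t / period
        features.append(math.sin(angle))
        features.append(math.cos(angle))
    return features


# Encode a short sequence of time stamps.
encoded = [fourier_time_features(t) for t in (0.0, 0.25, 0.5)]
```

With `num_frequencies=1` this reduces to a plain sine+cosine pair; more harmonics let the downstream network represent higher-frequency temporal structure. The encoding is exactly periodic in `t`, which only suits problems where that is a valid assumption.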

Training Perspective

  • Sequence-to-Sequence Training
  • Causal Learning Loss
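
One common form of a causal loss splits the time domain into ordered segments and down-weights later segments until earlier ones are well fit. The sketch below assumes that form, with weights w_i = exp(-epsilon * sum of earlier segment losses); whether these notes intend exactly this formulation is an assumption, and `causal_weighted_loss` and `epsilon` are illustrative names.

```python
import math


def causal_weighted_loss(segment_losses, epsilon=1.0):
    """Combine per-segment losses (ordered in time) with causal weights.

    Assumed weighting: w_i = exp(-epsilon * sum_{k < i} L_k).
    Returns the weighted mean loss and the list of weights.
    """
    weights = []
    cumulative = 0.0  # running sum of losses from earlier segments
    for loss_i in segment_losses:
        weights.append(math.exp(-epsilon * cumulative))
        cumulative += loss_i
    total = sum(w * l for w, l in zip(weights, segment_losses)) / len(segment_losses)
    return total, weights


# Early segments poorly fit: later segments receive small weights.
total, weights = causal_weighted_loss([2.0, 1.0, 1.0], epsilon=1.0)
```

The first segment always has weight 1, and later weights approach 1 only once the preceding losses become small. In a framework with automatic differentiation, the weights would typically be treated as constants so that no gradient flows through them.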

Curriculum Learning