Underflowing learning rates

Sometimes early stopping helps to regularize models; other times it seems to be masking a numerical problem. Learning rate schedules like linear, polynomial, or cosine decay let the rate sit very close to zero for a while. It seems that sometimes this underflows the update for only some parameters while other parameter updates remain nonzero. The result is that an inaccurate effective gradient is applied and the model drops in performance. It is detectable, of course, but one can probably just snapshot models periodically and fall back to an earlier checkpoint when performance starts going south.
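To make the effect concrete, here is a minimal sketch, not the actual training setup from this post, assuming weights stored in float16 with no float32 master copy and a cosine schedule; all names and numbers are illustrative. Late in the decay, the same-sized update is lost for a large-magnitude parameter while a small-magnitude parameter still moves:

```python
# Sketch (illustrative, not the post's setup): with a cosine-decayed learning rate
# and weights stored in float16, a late-training update can vanish for a
# large-magnitude parameter while a small-magnitude parameter still moves.
import numpy as np

def cosine_lr(base_lr: float, step: int, total_steps: int) -> float:
    """Cosine decay from base_lr toward zero."""
    return 0.5 * base_lr * (1.0 + np.cos(np.pi * step / total_steps))

params = np.array([1.0, 1e-3], dtype=np.float16)   # one large, one small parameter
grads = np.array([1.0, 1.0], dtype=np.float32)     # identical gradients for both

lr = cosine_lr(base_lr=0.1, step=990, total_steps=1000)   # ~2.5e-5 near the end
update = lr * grads                                       # ~2.5e-5 for each parameter

new_params = (params.astype(np.float32) - update).astype(np.float16)
print(new_params == params)   # [ True False ]: the update is lost for the large
                              # parameter (below half a float16 ULP at 1.0) but
                              # survives for the small one
```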

Ah, okay. Some time has passed, and I found a slew of papers from several years ago pointing out that the problem is with the \epsilon used in Adam: it looks like an underflow because the second-moment term fell below \epsilon, at which point \epsilon dominates the denominator and the update all but vanishes. Kudos to the people who found this obvious problem.
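For reference, a small sketch of that mechanism as I understand it from those papers, using the textbook Adam step lr * m_hat / (sqrt(v_hat) + \epsilon); the numbers are made up. Once sqrt(v_hat) drops below \epsilon, the step collapses by orders of magnitude, which looks just like an underflowing update; lowering \epsilon restores it:

```python
# Sketch of the mechanism: in Adam the step is lr * m_hat / (sqrt(v_hat) + eps).
# Once sqrt(v_hat) drops below eps, eps dominates the denominator and the step
# collapses, which looks just like an underflowing update. Numbers are illustrative.
import math

def adam_step(lr, m_hat, v_hat, eps):
    return lr * m_hat / (math.sqrt(v_hat) + eps)

lr, eps = 1e-3, 1e-8

# Parameter with "healthy" gradients: sqrt(v_hat) >> eps, step is ~lr.
print(adam_step(lr, m_hat=1e-2, v_hat=1e-4, eps=eps))      # ~1e-3

# Parameter with tiny gradients: sqrt(v_hat) < eps, so eps takes over and the
# step shrinks by roughly 100x instead of staying near lr.
print(adam_step(lr, m_hat=1e-10, v_hat=1e-20, eps=eps))    # ~1e-5

# Lowering eps (the "fix" mentioned below) restores the expected step size.
print(adam_step(lr, m_hat=1e-10, v_hat=1e-20, eps=1e-16))  # ~1e-3
```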

P.S. And extra kudos to the people who actually just “fixed it” for me by lowering the default \epsilon inside their software package.
