Sometimes early stopping helps to regularize models; other times the benefit seems to be numerical. Learning-rate decay schedules like linear, polynomial, or cosine let the rate sit very close to zero for a while. It seems that sometimes this underflows the update for only some parameters while other parameters' updates remain nonzero. The result is that an inaccurate update is applied and the model drops in performance. It is detectable, of course, but one can probably just snapshot models periodically and roll back to an earlier one when performance starts going south.
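Here is a minimal sketch of what I mean, my own illustration rather than anything from a specific framework, and it assumes the updates are computed in float16: with a decayed learning rate near zero, the per-parameter update lr * grad can underflow to exactly zero for small-gradient parameters while larger-gradient parameters still move.

```python
import numpy as np

# Minimal sketch (assumption: float16 arithmetic for the update).
lr = np.float16(1e-6)                   # learning rate late in a linear/cosine decay
grads = np.float16([1.0, 1e-2, 1e-3])   # gradients of three hypothetical parameters

updates = lr * grads                    # computed entirely in float16
print(updates)
# -> roughly [1e-06, 0.0, 0.0]: the two small-gradient updates underflow to
#    exactly zero, so only part of the parameter vector keeps changing.
```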
…
Ah, okay, and some time passes, and I found a slew of papers from several years ago pointing out that the problem is with the epsilon used in Adam: it looks like an underflow because the second-moment term fell below it.
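To make that concrete, here is a rough numeric sketch of how I read the issue (the values are made up, and treating m̂ ≈ g, v̂ ≈ g² is a simplification): Adam's step is lr · m̂ / (√v̂ + ε), and once √v̂ drops below ε the denominator is dominated by ε, so the step collapses instead of staying adaptively scaled. Lowering ε restores the intended behavior.

```python
import numpy as np

# Rough sketch of the epsilon-dominates-the-denominator issue (made-up values).
lr, eps = 1e-3, 1e-8
for g in (1e-2, 1e-5, 1e-10):          # pretend m_hat ~ g and v_hat ~ g**2
    m_hat, v_hat = g, g**2
    step = lr * m_hat / (np.sqrt(v_hat) + eps)
    print(f"g={g:.0e}  sqrt(v)={np.sqrt(v_hat):.0e}  step={step:.2e}")
# While sqrt(v) >> eps the step stays near lr (~1e-3), as Adam intends.
# Once sqrt(v) falls below eps, the step collapses toward lr * g / eps,
# i.e. orders of magnitude smaller than the adaptively scaled step.
```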
Kudos to the people who found this obvious problem.
P.S. And extra kudos to the people who actually just "fixed it" for me by lowering the value inside their software package.