Artificial Neural Networks forget when not attentive

Attention mechanisms such as softmax exhibit an interesting behavior when trained with weight decay. When a parameter used to produce the signal line becomes useless, as evidenced by a reduction in the attention paid to it, its gradient tends toward zero. Weight decay will then shrink this parameter to zero. Meanwhile, the attention-line parameter associated with this unimportant feature will continue to decrease into negative values. However, weight decay also pulls that negative parameter back toward zero, weakening the belief that the feature is unimportant as the signal line decreases in amplitude.
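
Here is a minimal numerical sketch of that dynamic, under a toy setup of my own (one useless feature whose signal parameter `w` is gated by a softmax attention logit `a`, competing against a fixed useful logit `b`; none of these names come from an actual model):

```python
import numpy as np

# Toy setup (names illustrative): parameter `w` produces a signal line
# that attention logit `a` gates; `b` is a fixed competing logit.
# Once attention collapses, the task gradient reaching `w` and `a` is
# scaled by the near-zero attention, so weight decay dominates: it
# shrinks `w` and drags the negative logit `a` back toward 0.
rng = np.random.default_rng(0)
w, a, b = 1.0, -5.0, 3.0    # `a` already very negative: feature judged useless
lr, wd = 0.1, 0.1

for step in range(500):
    x = rng.normal()
    attn = np.exp(a) / (np.exp(a) + np.exp(b))     # softmax attention
    out = attn * w * x
    # stand-in loss out**2: the feature's contribution is pure noise
    grad_w = 2.0 * out * attn * x                  # vanishes as attn -> 0
    grad_a = 2.0 * out * w * x * attn * (1.0 - attn)
    w -= lr * (grad_w + wd * w)                    # decay dominates the update
    a -= lr * (grad_a + wd * a)                    # belief relaxes toward 0

print(f"attn={attn:.2e}  w={w:.4f}  a={a:.3f}")
# `w` is wiped nearly to zero while `a` drifts back toward 0,
# leaving the feature in a fresh state from which it can be relearned.
```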

In short, the bad feature gets relearned from fresh, and the model can update its belief about the usefulness of the signal line.

One immediate idea following from this view of attention mechanisms is to modify weight decay so that it decays toward an ideal “initial naive state” that need not be zero. Alternatively, one could alter weight decay to inject noise as the parameter approaches the naive state. As an example, the bias term of a softmax numerator could target shrinkage to -7. Shrinking to 0 biases the exponentiated result toward 1, whereas shrinking the bias to -7 permits the probability of some classes to decrease to almost zero (exp(-7) ≈ 1e-3). This reduces the label smoothing implicit in softmaxes with bias trained with weight decay. The same could be done during an exploration stage: weight-decaying the bias toward 1.0 or 2.0 would force the model to reconsider and explore those probabilities more.
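
A sketch of what such targeted decay might look like; `targeted_weight_decay`, `target`, and `noise_scale` are hypothetical names of my own, not an existing API:

```python
import numpy as np

rng = np.random.default_rng(0)

def targeted_weight_decay(param, lr, wd, target=0.0, noise_scale=0.0):
    """Decay `param` toward `target` instead of 0; optionally add noise
    that grows as the parameter approaches the naive state."""
    gap = param - target
    param = param - lr * wd * gap                  # pull toward `target`, not 0
    if noise_scale > 0.0:
        # noise strength peaks when the parameter sits at the naive state
        param = param + noise_scale * np.exp(-np.abs(gap)) * rng.normal(size=param.shape)
    return param

# usage: a 10-class softmax numerator bias decayed toward -7, so classes
# the task gradient never touches settle near exp(-7) ~ 1e-3 weight
bias = np.zeros(10)
for _ in range(1000):
    bias = targeted_weight_decay(bias, lr=0.1, wd=0.1, target=-7.0)
print(bias.round(3))    # all entries close to -7.0
```

For the exploration variant described above, the same function would simply be called with `target=1.0` or `target=2.0`.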
