Some months ago I wrote of a discovery regarding the training of deep regressors using SGD. I have since come to realize that exponentiating the raw parameters before using them is beneficial, but only sometimes. It would appear that for approximate second-order optimizers, like the Adam I actually used instead of the SGD I thought I was using, the exponential has the effect of modulating the variance aspect of the optimization. The signal-to-noise ratio of the gradients, for identically distributed A, differs between the two cases where the weight used is A itself and where it is exp(A), and it depends largely on A. If A is small, the SNR is stronger for exp(A); if A is large, the SNR is stronger for A.
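A minimal sketch of how one might measure this, assuming a toy one-weight least-squares problem (the data, the batch sizes, and the grad_snr helper are illustrative assumptions, not my original experiment); whether the small-A versus large-A pattern shows up will depend on the model and the noise:

```python
# Sketch: estimate the mini-batch gradient signal-to-noise ratio of a single
# weight under the two parameterizations w = A (raw) and w = exp(A),
# on a toy least-squares problem.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = 0.5 * x + 0.1 * rng.normal(size=x.size)  # targets from a true weight of 0.5

def grad_snr(a, use_exp, batch_size=32, n_batches=2_000):
    """SNR = |mean| / std of the mini-batch gradient with respect to the raw parameter a."""
    grads = []
    for _ in range(n_batches):
        idx = rng.integers(0, x.size, size=batch_size)
        xb, yb = x[idx], y[idx]
        w = np.exp(a) if use_exp else a                   # effective weight
        dL_dw = np.mean(2.0 * (w * xb - yb) * xb)         # mean-squared-error gradient
        dL_da = dL_dw * (np.exp(a) if use_exp else 1.0)   # chain rule back to a
        grads.append(dL_da)
    grads = np.asarray(grads)
    return np.abs(grads.mean()) / (grads.std() + 1e-12)

for a in (-3.0, 0.0, 3.0):  # small, medium, large raw parameter
    print(f"A={a:+.1f}  raw SNR={grad_snr(a, False):.3f}  exp SNR={grad_snr(a, True):.3f}")
```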
The combination of the two methods tends to follow the gradient more eagerly for neurons whose activations depend less on their inputs than for neurons with a larger dependence. This appears to run counter to my originally documented intuition that the larger the dependence of a neuron's output on its input, the larger the gradient step, a discrepancy due mainly to the use of an approximate second-order optimizer in place of SGD. My modification, as you would expect, allows the gradients to move more freely even when they are not normally distributed. The exponentiation of weights is akin to the log transform used in linear regression analysis: it lets us apply linear methods that rely on normality of the errors to systems with non-normally distributed, often heavy-tailed, errors.
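As a rough illustration of that analogy, the sketch below (the synthetic multiplicative-noise data and the ols_residual_skew helper are assumptions of mine, not anything from my experiments) fits ordinary least squares before and after a log transform and compares the skewness of the residuals:

```python
# Sketch: with multiplicative (log-normal) noise, OLS on log(y) has roughly
# Gaussian residuals, while OLS on the raw y does not.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(1.0, 10.0, size=5_000)
y = 2.0 * x ** 1.5 * np.exp(0.4 * rng.normal(size=x.size))  # heavy-tailed in y

def ols_residual_skew(features, targets):
    """Fit targets ~ 1 + features by least squares and return residual skewness."""
    X = np.column_stack([np.ones_like(features), features])
    beta, *_ = np.linalg.lstsq(X, targets, rcond=None)
    r = targets - X @ beta
    return np.mean(((r - r.mean()) / r.std()) ** 3)

print("residual skew, raw y :", ols_residual_skew(x, y))
print("residual skew, log y :", ols_residual_skew(np.log(x), np.log(y)))
```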
Therefore, although my success with this method stems more from the fortuitous conditioning of my problem than from the universality of the approach, it can still make sense for a very large, albeit non-universal, set of problems. Further, any exponent p on the raw parameter is equivalent to the corresponding inverse power transform of power 1/p.
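Spelled out, under my reading of "inverse power transform" in the Box–Cox sense (an assumption on my part, since the original formula did not survive):

```latex
w = A^{p} \;\Longleftrightarrow\; A = w^{1/p},
\qquad\text{and in the limit } \lambda = 1/p \to 0:\quad
\frac{w^{\lambda} - 1}{\lambda} \;\to\; \log w,
```

so the exponential case, w = exp(A), sits at the log-transform end of the power-transform family, matching the analogy above.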
Stay tuned, more to come as I remove more bugs from the experiments.