I just chanced upon a fascinating article called “Neural Additive Models: Interpretable Machine Learning with Neural Nets” (FAMX.3 for me due to my interest, but others may feel this draft is a 2 or 3 due to brevity). The proposed ExU is a layer that has a foreactivated parameter (see my own blog discussions on the need for nonlinear rather than raw parameters here, here, and here, etc.).
I’m very excited that people like Geoffrey Hinton and Rich Caruana are thinking and writing about the same things I am, at about the same time, and arriving at similar solutions. In this case they performed foreactivation on a weight matrix. The paper is, of course, backed by a massive amount of experimentation, far more than I had the resources to accomplish. These smart folks also solved the problem of sign that I had struggled with a bit: the sign is washed out by having multiple layers (64 in their successful examples).
Oh! That was obvious, now that they say it. The tanh-autoactivated sign I wanted to multiply on the front was not necessary after all. As long as there is at least one “linear” layer at the output of the subnetwork that does not use the ExU or another sign-restricting foreactivation on its parameters, the output can cover the full range of R irrespective of input, and the subnetwork can therefore be a universal regressor.
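To make the sign argument concrete, here is a minimal sketch in plain Python. It assumes the ExU form given in the paper, a capped ReLU applied to exp(w) * (x - b), for a single scalar feature. The helper names (`relu_n`, `exu_layer`, `feature_net`) and the toy weights are my own for illustration, not the authors' code.

```python
import math

def relu_n(z, n=1.0):
    """ReLU capped at n: clamps z into [0, n]."""
    return min(max(z, 0.0), n)

def exu_layer(x, weights, biases, n=1.0):
    """ExU hidden units for a scalar input x.

    Each unit computes relu_n(exp(w) * (x - b)). Because exp(w) > 0,
    the effective weight can never be negative (the "sign problem").
    """
    return [relu_n(math.exp(w) * (x - b), n) for w, b in zip(weights, biases)]

def feature_net(x, weights, biases, out_weights, out_bias=0.0):
    """ExU hidden layer followed by a plain linear output layer.

    The output weights are raw (not foreactivated), so they may be
    negative and the output can land anywhere on the real line.
    """
    hidden = exu_layer(x, weights, biases)
    return sum(v * h for v, h in zip(out_weights, hidden)) + out_bias

# Toy parameters (hypothetical values chosen for illustration).
w = [0.0, 2.0]
b = [0.0, 0.5]

hidden = exu_layer(1.0, w, b)  # every activation is non-negative
y_neg = feature_net(1.0, w, b, out_weights=[-1.5, 0.5])
y_pos = feature_net(1.0, w, b, out_weights=[1.5, 0.5])
print(hidden, y_neg, y_pos)
```

With the same sign-restricted hidden layer, flipping the sign of one raw output weight is enough to move the output from positive to negative, which is why no explicit sign term is needed on the ExU weights themselves.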
My only concern is the effort required to arrive at their awesome results: no fewer than four hyperparameters had to be tuned using Bayesian optimization. My own laziness demands that there be a way to tune a model using only learning rate warmup and decay; the dynamical nature of a model and its data should be taken care of entirely by the model and the automated training process. Foreactivation is one such mechanism.
Of course, I only have access to the initial draft posted on 2020-04-29. I am very hopeful that in subsequent revisions and sequels this highly flexible and highly interpretable modeling technique can be made easier to use.