The Deep Universal Regressor II

Several months ago I wrote about the Deep Universal Regressor. Hopefully you have discovered that it is full of bugs, as was typical of the late 2010s and the Interweb AI blogging conventions of the day. That isn’t because I’m time traveling and writing this from the far past or far future; it’s just the way we wrote in those days.

The most notable bug: if you tried to regress $Y = 2X$ using my suggestion, you found that it does not extrapolate well. After some inspection, one realizes the problem. One way to fix it is to use a decision list to make progressively smaller estimates: $y = 2^{unsquash(average(squash(A_0)))} + \sum_{i=1}^{k}squash(A_i)2^{unsquash(average(squash(A_0)))-i}$

The $A_0...A_k$ are linear activations from previous layers calculated from $X$.
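To make the formula concrete, here is a minimal sketch in plain Python. It assumes, purely for illustration, that `squash` is the logistic sigmoid and `unsquash` is its inverse (the logit); this post does not pin those down, so substitute whatever pair you actually use. The function name `decision_list_estimate` and the choice to pass $A_0$ as a list are also my own assumptions.

```python
import math

def squash(a):
    # Assumption: logistic sigmoid, mapping any real number into (0, 1).
    return 1.0 / (1.0 + math.exp(-a))

def unsquash(p):
    # Assumption: logit, the inverse of the sigmoid above.
    return math.log(p / (1.0 - p))

def decision_list_estimate(a0, a_rest):
    """a0: list of activations that vote on the exponent.
    a_rest: [A_1, ..., A_k], one activation per correction term."""
    exponent = unsquash(sum(squash(a) for a in a0) / len(a0))
    y = 2.0 ** exponent  # leading term
    for i, a in enumerate(a_rest, start=1):
        # Each later term can contribute at most half as much as the one before.
        y += squash(a) * 2.0 ** (exponent - i)
    return y

print(decision_list_estimate([0.0], [0.0]))
```

Under these assumptions every correction term lies in $(0, 2^{-i})$ times the leading term, so the estimate stays between $2^{u}$ and $2^{u+1}$ where $u$ is the exponent, much like an exponent followed by mantissa bits.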

But of course, this approach also has some fatal flaws that need to be fixed. Some effort has to be made to ensure the right term is adjusted at the right moment: it needs regularization. It also needs some experimentation around the choice of $k$.

After that, the $floatx$ is the limit.

P.s. It’s kind of interesting that this range-preserving combination approach is a slightly more general version of the attention mechanism. Recall that attention is applied as a softmax combination of the items under consideration. The softmax is just a positively weighted average, which in turn is a convex combination.
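A small sketch of that last claim, in plain Python: softmax weights are positive and sum to one, so the combined output cannot leave the range of the items being combined. The helper names `softmax` and `attend` and the sample numbers are made up for illustration.

```python
import math

def softmax(scores):
    # Subtracting the max is only for numerical stability; the weights are unchanged.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(scores, values):
    # Convex combination of the values, weighted by softmax(scores).
    weights = softmax(scores)
    return sum(w * v for w, v in zip(weights, values))

values = [1.0, 4.0, 10.0]
out = attend([0.3, -2.0, 5.0], values)
print(min(values) <= out <= max(values))
```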

P.p.s. On the reverse side of this approach, perhaps we can apply an activation at an earlier place to create a similar effect. This approach is often taken to restrict a parameter to a range, for example using $relu(W)x$ to produce a dense layer that applies only non-negative multiples to its inputs. The approach I have tried with good success is the exponentiation activation, $e^{W}x$. At first blush, you might judge this absurd. When we take a step, would the change in $e^{W}x$ not be the same as the change in $Wx$? This parameter activation contributes nothing to the complexity of the model. If you look at it as $A=e^W$ and then compute the output of the neuron as $Ax$, the derivative of the output with respect to $A$ is simply $x$. What difference does it make how you arrive at $A$?

But I have observed that applying this activation makes linear regression converge faster under SGD. Looking at it more closely, $\partial A/ \partial W = e^W = A$: the parameter’s value is its own derivative. In other words, the gradient with respect to the parameter is linear in the size of the neuron’s output, since $\partial (Ax)/ \partial W = Ax$. If $A$ were just a raw tensor, this derivative would be the constant $1$. Another consideration is that perhaps all we did was initialize $A$ from a log-normal or log-uniform distribution. Lastly, the effect of exponentiation may simply come through an effectively increased or decreased learning rate. For this blog’s purposes, I tried to implement a few of these alternative explanations directly with a linear model: using log-normal initialization on a simple dense layer, and trying various learning rates. These attempts did not produce the same quickened convergence as exponentiating the raw parameters.
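Here is a minimal sketch of the gradient argument for a single scalar weight, in plain Python. The squared-error loss and the function names are my own choices for illustration. It only shows the chain rule at work and makes no claim about convergence speed.

```python
import math

def loss(a, x, t):
    # Squared error of the one-weight linear model y = a * x against target t.
    return 0.5 * (a * x - t) ** 2

def grad_raw(a, x, t):
    # dL/dA when A itself is the trainable parameter.
    return (a * x - t) * x

def grad_exp(w, x, t):
    # dL/dW when A = exp(W): the chain rule multiplies by dA/dW = A.
    a = math.exp(w)
    return grad_raw(a, x, t) * a

def sgd_step_exp(w, x, t, lr):
    # One plain SGD step on the raw parameter W; returns the new W.
    return w - lr * grad_exp(w, x, t)

w, x, t, lr = 0.3, 2.0, 5.0, 0.01
print(grad_raw(math.exp(w), x, t), grad_exp(w, x, t))
```

One way to read the result: an additive step on $W$ is a multiplicative step on $A$, because $e^{W - \eta g} = A\,e^{-\eta g}$, so large weights move in proportionally larger absolute steps than small ones.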

This meandering is of course a middle way between many. For example, if $W$ were not a raw tensor but the output of some tensor function of the input, and we further normalized, $A=e^W / reducesum(e^W)$, this would be an attention mechanism like those used in transformers, although its derivative is neither linear nor constant.
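For reference, the derivative of the normalized version can be written out explicitly. This is the standard softmax Jacobian, with index notation ($A_i$, $W_j$, $\delta_{ij}$) that I am introducing here and that the text above does not use:

$$A_i = \frac{e^{W_i}}{\sum_k e^{W_k}}, \qquad \frac{\partial A_i}{\partial W_j} = A_i\,(\delta_{ij} - A_j)$$

The diagonal entries $A_i(1 - A_i)$ are quadratic in $A$, and the off-diagonal entries $-A_iA_j$ couple every parameter to every other one.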

Here is a list of parameter activations and their derivatives. Expressing them in terms of $A$ gives us a view of the scale of the gradient.

| $A$ | $\partial A/ \partial W$ | $\partial A/ \partial W$ in $A$ |
|---|---|---|
| $W$ | $1$ | $A^0$ |
| $e^W$ | $e^W$ | $A$ |
| $W^2$ | $2W$ | $2\sqrt{A}$ |
| $\ln(W)$ | $1 / W$ | $e^{-A}$ |
| $W^{-1}$ | $-W^{-2}$ | $-A^2$ |
| $e^{e^{W}}$ | $e^{e^{W}}e^W$ | $A \ln(A)$ |
| $W^3$ | $3W^2$ | $3\sqrt[3]{A^2}$ |
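The rows listed above can be checked numerically. The sketch below compares a central finite difference of each activation against its derivative written in terms of $A$, reading the logarithm as the natural log and assuming $W > 0$ so that the square root, cube root, and log are well defined. The names `ROWS`, `numeric_derivative`, and `max_table_error` are my own.

```python
import math

# Each row: (label, A as a function of W, dA/dW expressed in terms of A).
ROWS = [
    ("W",       lambda w: w,                      lambda a: 1.0),
    ("e^W",     math.exp,                         lambda a: a),
    ("W^2",     lambda w: w ** 2,                 lambda a: 2.0 * math.sqrt(a)),
    ("ln W",    math.log,                         lambda a: math.exp(-a)),
    ("1/W",     lambda w: 1.0 / w,                lambda a: -(a ** 2)),
    ("e^(e^W)", lambda w: math.exp(math.exp(w)),  lambda a: a * math.log(a)),
    ("W^3",     lambda w: w ** 3,                 lambda a: 3.0 * a ** (2.0 / 3.0)),
]

def numeric_derivative(f, w, eps=1e-6):
    # Central finite difference approximation of df/dw.
    return (f(w + eps) - f(w - eps)) / (2.0 * eps)

def max_table_error(w=1.3):
    # Largest gap between the numeric derivative and the "in A" column at this W.
    worst = 0.0
    for _label, f, d_in_a in ROWS:
        worst = max(worst, abs(numeric_derivative(f, w) - d_in_a(f(w))))
    return worst

print(max_table_error() < 1e-5)
```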

Clearly, it’s hard to boost the progress of training any further this way.

P.p.p.s. Ah, yes, astute you are: I have assumed positive data. The negative of $A$ can be concatenated as a parameter, or learned using a second layer, or even handled with a simple softened sign parameter. Adapting this to your own problem is left as an exercise for the reader.
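Two of those options might look like the sketch below. Both are my own guesses at what such a fix could be, not something I have tested in this post: `signed_weight` pairs a positive and a negative exponentiated parameter, and `soft_sign_weight` multiplies the exponentiated magnitude by a `tanh` sign. Both names are hypothetical.

```python
import math

def signed_weight(w_pos, w_neg):
    # Assumption: the effective weight is the difference of two positive parts,
    # each obtained by exponentiating its own raw parameter.
    return math.exp(w_pos) - math.exp(w_neg)

def soft_sign_weight(s, w):
    # Assumption: tanh(s) acts as a softened, learnable sign in (-1, 1)
    # applied to the positive magnitude exp(w).
    return math.tanh(s) * math.exp(w)

print(signed_weight(1.0, 0.0), soft_sign_weight(-2.0, 0.5))
```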

P.p.p.p.s. Ah, this works so well that I want to name it: let’s call it autoactivation. Similar to the auto- in autoimmune diseases, autoactivation means that the parameters of the network activate themselves; they are autoactivating. By the same convention, self-attention models are autoattentive.

Log-normal Parameters

Has anybody given a serious go at using a log-normal initializer for deep neural network parameters? It seems likely to make sense. Additionally, one could also do batch and layer log-normalization on layer activations.
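For anyone who wants to try it, here is a rough sketch using only the Python standard library. `lognormal_init` draws a weight matrix with `random.lognormvariate`. `log_normalize` is my guess at what log-normalization could mean: standardize strictly positive activations in log space, then map back. The function names, the defaults for `mu`, `sigma`, and `seed`, and that interpretation are all assumptions.

```python
import math
import random

def lognormal_init(n_in, n_out, mu=0.0, sigma=0.5, seed=0):
    # n_out rows of n_in strictly positive weights, each drawn log-normally.
    rng = random.Random(seed)
    return [[rng.lognormvariate(mu, sigma) for _ in range(n_in)]
            for _ in range(n_out)]

def log_normalize(activations, eps=1e-12):
    # Assumption: inputs are strictly positive. Standardize their logs to
    # zero mean and unit variance, then exponentiate back.
    logs = [math.log(a) for a in activations]
    mean = sum(logs) / len(logs)
    var = sum((v - mean) ** 2 for v in logs) / len(logs)
    scale = math.sqrt(var + eps)
    return [math.exp((v - mean) / scale) for v in logs]

weights = lognormal_init(3, 2)
normalized = log_normalize([0.5, 2.0, 8.0])
print(len(weights), len(weights[0]), normalized)
```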

Accessible Rights

I wonder how the world will turn out one day. Once long ago, I was in the company of some intellectual humans discussing political ethics… And it occurred to me that I was struggling to understand a lot of what was being said.

What will happen one day when our political practices advance to a place where most people cannot hold the concepts in their heads due to cognitive limits? What will happen if truth, happiness, and right and wrong are not comprehensible to most people? These incomprehensible thoughts about things that are fundamentally human (good/happy/free…), let’s call them truer truths…

It’s kind of weird that from intellectually enlightened discussion comes the thought that we may be unable to comprehend what is right or good. But from the conception of such truer truths we deduce knowledge of truer truths. It would seem that, if there be gods who comprehend these truer truths, they let slip a hint of their existence, like a black hole leaking radiation.

This of course also comes home to the origin of the discussion with my companions: what if AI became more cognitively advanced than humans, in that it knew more about the truer truths than us? It is imaginable today that we have enough computers, under the control of a single entity, to out-compute the human brain at some task. The AI may be closer to the gods of religion than any of us.

Scary stuff.

Let’s come back down to earth. What if I cannot understand what you say is right? Am I in deficit to all that is good and mighty for not understanding you? There is a slight chance that you are making it up and that you don’t actually make any sense at all. What if we had a constitutional amendment that says “gdjiDkdberish”? Or a law that says “only non-alien citizens can live and non-aliens can understand e=mc^2” or “affirmative action credits shall be calculated by the surface derivative of the Mao-Zhou balance equations of said population as measured by Kim’s protocol”?

Hey, I mean, it’s not like I’m inventing a human thought here. Among those of us who are subject to it, how many really understand the US tax code? In theory, that’s something we chose to do of our own free will. Like really, how is anything we have today different from the examples above?

And on the other side of this argument, how simple does our value system have to be for it to be practical? How universal does our value system have to be for it to be right? For example, can we demand that citizens remember to follow the laws? Can we require people to have enough math skills to pay taxes? Can we require them to be literate enough to read candidate names to participate in an election? Can we require citizens to read the candidates’ write-ups and pass a test before casting their ballots?

What faculties, cognitive or otherwise, do we require of our citizens to access their rights?