One time, at religion camp (non-pc, skip to avoid vulgarity)

I told a blind woman: hey you are stomping into a shit puddle and splashing me with turd juice!

She says back after some deliberation: who are you? Jesus Christ? You are just as blind as me, you shouldn’t judge me. Truly you probably put that puddle there to set a trap for me.

I say to her: wait!! What?? You are blind, not anosmic! Why are we even having this conversation? Why would I do that to you!?!? Why am I even at this camp!? And now you are just shitting and pissing into the puddle standing upright, is this incontinence camp too?

She says: you are just jealous that I thought of it first…

Oh no, now you are douching yourself with it.. I cannot watch this any more…

(Oh wait, that’s actually kind of smart, because now I, a man, cannot, within my imagination, copy the idea you presently practice. Sorry, despite present absurdities, I still have a little residual sense of wonderment about the products of people, conditions and processes.)

And the day continues this way, and day after day..

Seriously, this seriously happened, come and smell me. (Now, a few years later, the odor should still be palpable.)

Ps, as always, I wholly and vehemently deny and disavow any connection with or even similarity to real-world people, things or events of past, present or future.

The Deep Universal Regressor II

Several months ago I wrote about the Deep Universal Regressor. Hopefully you have discovered that it is full of bugs, fashioned typically for the late 2010s, following Interweb AI blogging conventions of the day. It isn’t that I’m time traveling and writing this from the far past or far future, just that this is the way we wrote in those days.

The most notable bug: if you tried to regress Y = 2X using my suggestion, you found that it does not extrapolate well. After some inspection, one realizes the problem. One way to fix it is to use a decision list to make progressively smaller estimates:

y = 2^{unsquash(average(squash(A_0)))} + \sum_{i=1}^{k}squash(A_i)2^{unsquash(average(squash(A_0)))-i}

The A_0...A_k are linear activations from previous layers calculated from X.
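As a rough illustration, the formula can be evaluated directly. This is a minimal sketch under loud assumptions: `squash` is taken to be the logistic sigmoid and `unsquash` its inverse (the logit), which may differ from the definitions in the earlier post. `A_0` is treated as a list of activations to be averaged, and `A_1...A_k` as scalars. The function name `decision_list_output` is made up for this example.

```python
import math

def squash(a):
    # Assumption: logistic sigmoid, mapping any real number into (0, 1).
    return 1.0 / (1.0 + math.exp(-a))

def unsquash(p):
    # Assumption: logit, the inverse of the sigmoid above.
    return math.log(p / (1.0 - p))

def decision_list_output(a0, a_rest):
    """Evaluate y for activations a0 (list, feeds the exponent)
    and a_rest = [A_1, ..., A_k] (one scalar per refinement term)."""
    exponent = unsquash(sum(squash(a) for a in a0) / len(a0))
    y = 2.0 ** exponent  # leading term sets the scale
    for i, a_i in enumerate(a_rest, start=1):
        # each later term can add at most 2^(exponent - i)
        y += squash(a_i) * 2.0 ** (exponent - i)
    return y
```

With every activation at zero and k = 3, the exponent is 0 and each squashed term is 0.5, so `decision_list_output([0.0, 0.0], [0.0, 0.0, 0.0])` gives 1 + 0.5 * (1/2 + 1/4 + 1/8) = 1.4375.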

But of course, this approach also has some fatal flaws that need to be fixed. Some effort has to be made to ensure the right term is adjusted at the right moment: it needs regularization. It also needs some experimentation around what k should be.

After that, the floatx is the limit.

Ps, it’s kind of interesting that this range-preserving combination approach is a slightly more general version of the attention mechanism. Recall that attention is applied as a softmax combination of the items under consideration. The softmax is just a positively weighted average, which in turn is a convex combination.
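The convex-combination claim is easy to check numerically. The sketch below is plain Python, with `softmax` and `attend` as names chosen for this example. It shows that softmax weights are positive and sum to one, so the combined value can never leave the range of the items.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(scores, values):
    # Softmax-weighted average of scalar values: a convex combination.
    weights = softmax(scores)
    return sum(w * v for w, v in zip(weights, values))
```

For example, `attend([0.0, 0.0], [1.0, 3.0])` weights both items equally and returns 2.0. Whatever the scores, the result stays between the smallest and largest value.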

P.p.s. On the reverse side of this approach, perhaps we can apply the activation in an earlier place to create a similar effect. This approach is often taken to restrict a parameter to a range, for example using relu(W)x to produce a dense layer that applies only positive multiples to its inputs. The approach I have tried with good success is the exponentiation activation, e^{W}x.

At first blush, you might judge this absurd. When we take a step, would the change in e^Wx not be the same as the change in Wx? This parameter activation adds nothing to the complexity of the model. If you look at it as A=e^W and then compute the output of the neuron as Ax, the derivative of the output with respect to A is linear in x. What difference does it make how you arrive at A?

But I have observed that applying this activation makes linear regression converge faster under SGD. Looking at it more specifically, \partial A/ \partial W is e^W=A: the parameter’s value is its own derivative. In other words, the sensitivity of the output to the raw parameter is linear in the size of the neuron’s output, since \partial (Ax)/ \partial W = Ax. If A were just a raw tensor, \partial A/ \partial W would be the constant 1.

Another consideration is that perhaps all we did was to initialize A using a log-normal or log-uniform distribution. Lastly, the effect of exponentiation may come through increased or decreased learning rates. For this blog’s purpose, I tried to implement a few of these alternative explanations directly with a linear model: using log-normal initialization on a simple dense layer, and trying various learning rates. These attempts did not produce the same quickened convergence as exponentiating the raw parameters.
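To make the mechanics concrete, here is a minimal sketch of fitting a single exponentiated weight. It is not the experiment described above, and it makes no claim about convergence speed. It uses full-batch gradient descent rather than SGD, a one-dimensional model y = e^W x, and a hypothetical function name `fit_exp_weight`. The only point is where the extra factor of A enters the update.

```python
import math

def fit_exp_weight(xs, ys, lr=0.01, steps=500):
    """Fit y = e^W * x by gradient descent on the raw parameter W."""
    w = 0.0  # raw parameter; the effective weight starts at e^0 = 1
    for _ in range(steps):
        a = math.exp(w)
        # dL/dA for the mean squared error L = mean((a*x - y)^2)
        grad_a = sum(2.0 * (a * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        # chain rule: dA/dW = e^W = A, so the step scales with A itself
        grad_w = grad_a * a
        w -= lr * grad_w
    return math.exp(w)
```

On the Y = 2X example, `fit_exp_weight([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])` recovers an effective weight very close to 2. Because the effective weight is always positive, this sketch can only fit positive slopes.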

This meandering is, of course, a middle way between many. For example, if W were not a raw tensor but the output of some tensor function of the input, and we further normalized, A=e^W / reducesum(e^W), this would be an attention mechanism like those used in transformers, although the derivative of this is neither linear nor constant.
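For reference, the derivative of the normalized form can be written per component. This is the standard softmax Jacobian; the indices i, j and the Kronecker delta \delta_{ij} are notation introduced here, not in the text above.

```latex
A_i = \frac{e^{W_i}}{\sum_{j} e^{W_j}},
\qquad
\frac{\partial A_i}{\partial W_j} = A_i \left( \delta_{ij} - A_j \right)
```

The diagonal entries are A_i(1 - A_i) and the off-diagonal entries are -A_i A_j, so every entry is quadratic in A.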

Here is a list of parameter activations and their derivatives. Expressing them in terms of A gives us a view of the scale of the gradient.

| A | \partial A/ \partial W | \partial A/ \partial W in terms of A |
| --- | --- | --- |
| W | 1 | A^0 |
| e^W | e^W | A |
| W^2 | 2W | 2\sqrt{A} |
| \ln(W) | 1 / W | e^{-A} |
| W^{-1} | -W^{-2} | -A^2 |
| e^{e^{W}} | e^{e^{W}}e^W | A \ln(A) |
| W^3 | 3W^2 | 3\sqrt[3]{A^2} |
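The table entries can be sanity-checked with central finite differences. The sketch below assumes W > 0 (needed for the square-root and logarithm rows) and reads the logarithm as the natural log. `ROWS` and `max_table_error` are names made up for this check.

```python
import math

# Each row: (label, A as a function of W, dA/dW written in terms of A).
ROWS = [
    ("W", lambda w: w, lambda a: 1.0),
    ("e^W", math.exp, lambda a: a),
    ("W^2", lambda w: w * w, lambda a: 2.0 * math.sqrt(a)),
    ("ln W", math.log, lambda a: math.exp(-a)),
    ("1/W", lambda w: 1.0 / w, lambda a: -a * a),
    ("e^(e^W)", lambda w: math.exp(math.exp(w)), lambda a: a * math.log(a)),
    ("W^3", lambda w: w ** 3, lambda a: 3.0 * a ** (2.0 / 3.0)),
]

def max_table_error(w=1.3, h=1e-6):
    """Largest gap between a numeric derivative and the table's formula."""
    worst = 0.0
    for _label, f, deriv_in_a in ROWS:
        numeric = (f(w + h) - f(w - h)) / (2.0 * h)  # central difference
        worst = max(worst, abs(numeric - deriv_in_a(f(w))))
    return worst
```

If a row were wrong, `max_table_error()` would return a gap far above finite-difference noise.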

Clearly, it’s hard to boost the progress of training any further this way.

Ppps, ah yes, astute you are: I have assumed positive data. The negative of A can be concatenated as a parameter, or learned using a second layer, or even handled with a simple softened sign parameter. Adapting this to their own problems is left as an exercise for the reader.
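One possible reading of the "softened sign parameter" idea, purely as an assumption, is to multiply the exponentiated weight by tanh of a second raw parameter. The function `signed_weight` and the parameter `s` are hypothetical names for this sketch.

```python
import math

def signed_weight(s, w):
    # Hypothetical construction: tanh(s) lies in (-1, 1) and acts as a
    # soft sign, while e^w stays positive and carries the magnitude.
    return math.tanh(s) * math.exp(w)
```

A strongly negative `s` gives roughly -e^w, and a strongly positive `s` gives roughly +e^w.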

Pppps, this works so well that I want to name it: let’s call it autoactivation. In the tradition of autoencoders, and similar to the auto- in autoimmune diseases, autoactivation means that the parameters of the network activate themselves; they are autoactivating. By the same convention, self-attention models are autoattentive.