The Deep Universal Regressor

There’s this idea in Deep Learning that Neural Networks are universal function approximators: they can approximate any function you can provide data for.

It confounded me for a long time exactly how this works for continuous-valued output, but recently, through the grapevine that is the Deep Learning community, I finally discovered one answer to this question.

Consider some deep neural network taking in X and producing some kind of penultimate layer of activations, A. We want to write a formula for producing a Y that approximates \hat{Y}.

Oh boy, who are we kidding, let’s just drop down to tensorflow code…

You want to do

Y = inverse_sigmoid(tf.reduce_mean(tf.sigmoid(A), axis=-1))

Be careful, of course, to compute the pooling per input rather than across the batch, and not to double sigmoid-activate A; the last activation just has to be sigmoid-compatible. Note that since the sigmoid produces numbers strictly between zero and one, the mean, or any convex combination, of a bunch of such numbers also lies exactly in that range, making it a suitable input to the inverse_sigmoid. And of course, if you need to, A could first be activated with the likes of tf.exp or tf.square and then filtered through the tf.sigmoid.
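For concreteness, here is a minimal sketch of this head in TensorFlow, assuming A has shape [batch, features]; inverse_sigmoid (the logit) is hand-rolled here since the name above is shorthand rather than a built-in:

import tensorflow as tf

def inverse_sigmoid(p, eps=1e-7):
    # Logit, the inverse of tf.sigmoid; clip away from 0 and 1 for numerical safety.
    p = tf.clip_by_value(p, eps, 1.0 - eps)
    return tf.math.log(p) - tf.math.log(1.0 - p)

def sigmoid_pool_regressor(A):
    # Pool per input (axis=-1), not across the batch; A must not already be sigmoid-activated.
    pooled = tf.reduce_mean(tf.sigmoid(A), axis=-1)  # each pooled value lies in (0, 1)
    return inverse_sigmoid(pooled)                   # map back to an unbounded real Y

A = tf.random.normal([4, 8])    # stand-in penultimate activations
Y = sigmoid_pool_regressor(A)   # shape [4], one regression value per example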

For example, if you think \hat{Y} ultimately grows with tf.log(A), and you have already made sure A is positive, then you can simplify out the exponential inside the sigmoid and compute

Y = inverse_sigmoid(tf.reduce_mean(A / (A + 1), axis=-1))
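A quick sketch of why this is the same head, reusing the inverse_sigmoid above; for positive A, tf.sigmoid(tf.math.log(A)) reduces elementwise to A / (A + 1):

A = tf.constant([[0.5, 2.0, 7.0]])                 # positive activations
lhs = tf.sigmoid(tf.math.log(A))                   # sigmoid applied to log-growth features
rhs = A / (A + 1.0)                                # simplified form; equal to lhs elementwise
Y = inverse_sigmoid(tf.reduce_mean(rhs, axis=-1))  # the log-growth regressor head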

The sigmoid / sigmoid^{-1} pair can also be replaced with other bounded activations such as tanh or \frac{x}{(1+|x|^k)^{1/k}} and their respective closed-form inverses. It can also be replaced with unbounded constricting activations such as the x^{\frac{1}{2k+1}} and x^{2k+1} pair for a chosen whole number k.
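As one illustrative sketch of the unbounded pair, here is the k = 1 case (cube root in, cube out); the root is written as sign(A) * |A|^(1/3) so that negative activations are handled, since tf.pow with a fractional exponent is not defined for negative bases:

def cuberoot_pool_regressor(A):
    # Constrict with the real cube root, pool per input, then invert by cubing.
    cbrt = tf.sign(A) * tf.pow(tf.abs(A), 1.0 / 3.0)
    pooled = tf.reduce_mean(cbrt, axis=-1)
    return tf.pow(pooled, 3.0)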

Tada!

This solves several problems at once: your deep neural network needing constricting nonlinearities like the sigmoids, your need to produce a continuous output that may grow at non-linear rates relative to the activations, your limited computational resources, and your having a lucid hunch as to how they are related.

Hopefully this helps you and saves a significant amount of brain activity and experimentation. Your problem will probably need a special architecture built around some configuration of this pooling layer.

P.S. The use of sigmoidal functions beckons toward a probabilistic interpretation. The desigmoid, that is, the inverse sigmoid, can be interpreted as an inverse lookup into the CDF of a random variable: given an accumulation, it returns the value at which the CDF achieves it. Essentially, in the most basic configuration, this regressor uses each element of A in the penultimate layer to support (or to reflect evidence) that the desired Y is larger. In a human brain, this positive-only thinking seems overly restrictive. What if a field of A is a positive signal that strictly means a smaller Y? One way is to use a second FCL to cancel the effect of one sigmoid against another. A second, more intuitive idea would be to do the following:

Y = tf.atanh(tf.reduce_mean(tf.tanh(A), axis=-1))

In one step, this regressor can consider support for both larger and smaller values of Y.
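In practice, a sketch like the following keeps the atanh finite when the pooled value drifts toward ±1 (the clipping threshold here is an arbitrary choice, not anything canonical):

def tanh_pool_regressor(A, eps=1e-7):
    # Two-sided pooling: positive activations vote Y up, negative ones vote it down.
    pooled = tf.reduce_mean(tf.tanh(A), axis=-1)              # in (-1, 1) per example
    pooled = tf.clip_by_value(pooled, -1.0 + eps, 1.0 - eps)  # keep atanh finite
    return tf.math.atanh(pooled)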

P.P.S. I also want to put in a plug for our wonderful democracy. Computing the mean explicitly mixes the votes of the activations in the penultimate layer equally: each neuron gets an equal vote on the result. Politics aside, and in addition to convex combinations, all other range-preserving combinations are fair game, e.g. the geometric mean, a softmax-weighted mean, etc., depending on the relationship between X and Y and the network that produced A.
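For instance, a geometric mean of the sigmoid-activated values also stays strictly inside (0, 1); a minimal sketch, reusing inverse_sigmoid from above:

def geometric_mean_pool_regressor(A, eps=1e-7):
    # Geometric mean in sigmoid space: exp of the mean log, still strictly inside (0, 1).
    p = tf.clip_by_value(tf.sigmoid(A), eps, 1.0 - eps)
    geo = tf.exp(tf.reduce_mean(tf.math.log(p), axis=-1))
    return inverse_sigmoid(geo)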
