# The Deep Universal Regressor

There’s this idea in Deep Learning that Neural Networks are universal function approximators. They can approximate any function you can provide data for.

It has confounded me for a long time exactly how they do this for continuous-valued output, but recently, through the grapevine that is the Deep Learning community, I finally discovered one answer to this question.

Consider some deep neural network taking in $X$ and producing some kind of penultimate layer of activations, $A$. We want to write a formula for producing a $Y$ that approximates $\hat{Y}$.

Oh boy, who are we kidding, let’s just drop down to tensorflow code…

You want to do

$Y = inverse\_sigmoid(tf.reduce\_mean(tf.sigmoid(A)))$

Being careful, of course, to compute the pooling per input rather than across the batch, and not to double sigmoid-activate $A$; whatever activation produced $A$ must be sigmoid-compatible. Note that since sigmoid produces numbers strictly between zero and one, the mean, or any convex combination, of a bunch of such numbers stays in that range, making it a valid input to the $inverse\_sigmoid$. And of course, if you need to, $A$ could first have been activated with the likes of tf.exp or tf.square and then filtered through tf.sigmoid.
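As a concrete illustration, here is a minimal NumPy sketch of this pooling head (NumPy rather than TensorFlow so it stands alone; `inverse_sigmoid` here is just the logit function, and the function names are mine):

```python
import numpy as np

def sigmoid(x):
    # Plain logistic function, maps R into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def inverse_sigmoid(p):
    # The logit: sigmoid's closed-form inverse on (0, 1).
    return np.log(p / (1.0 - p))

def regress(A):
    # A has shape (batch, units): pool per input, not across the batch.
    return inverse_sigmoid(np.mean(sigmoid(A), axis=-1))

A = np.array([[0.5, 0.5, 0.5],
              [-1.0, 0.0, 1.0]])
Y = regress(A)
# Row 0: all activations agree, so Y recovers 0.5 exactly.
# Row 1: sigmoid(-1) + sigmoid(1) = 1 by symmetry, so the pooled vote is 0.5 and Y is 0.
```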

For example, if you think $\hat{Y}$ ultimately grows with $tf.log(A)$, and you have already made sure $A$ is positive, then you can simplify out the exponential and compute

$Y = inverse\_sigmoid(tf.reduce\_mean(\frac{A}{A+1}))$
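The simplification is easy to check: for positive $a$, $sigmoid(\log a) = \frac{1}{1 + e^{-\log a}} = \frac{a}{a+1}$. A quick numerical sanity check:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

for a in [0.1, 1.0, 3.0, 42.0]:
    # sigmoid(log a) should equal a / (a + 1) up to float error.
    assert abs(sigmoid(math.log(a)) - a / (a + 1.0)) < 1e-12
```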

The $sigmoid$/$sigmoid^{-1}$ pair can also be replaced with other bounded activations, like $tanh$ or $\frac{x}{\sqrt[k]{1+x^k}}$ for even $k$, along with their respective closed-form inverses. It can also be replaced with unbounded range-compressing pairs such as $x^{\frac{1}{2k+1}}$ and $x^{2k+1}$ for a chosen whole number $k$.
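To make the unbounded pair concrete, here is a NumPy sketch of the odd-root variant (the helper name is mine). Because $x \mapsto x^{1/(2k+1)}$ is odd and invertible on all of $\mathbb{R}$, no boundedness argument is needed: the mean of the compressed values is always a valid input to the inverse $x \mapsto x^{2k+1}$.

```python
import numpy as np

def regress_odd_root(A, k=1):
    # Compress with the odd (2k+1)-th root, pool, then invert by
    # raising to the (2k+1)-th power.
    p = 2 * k + 1
    # sign * |x|**(1/p) gives the real odd root, correct for negatives.
    compressed = np.sign(A) * np.abs(A) ** (1.0 / p)
    return np.mean(compressed, axis=-1) ** p

A = np.array([[8.0, 8.0],
              [-8.0, 8.0]])
Y = regress_odd_root(A)
# Row 0: both cube roots are 2, so the pooled result inverts back to 8.
# Row 1: the votes cancel, so Y is 0.
```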

P.s. the use of sigmoidal functions seems to beckon a probabilistic interpretation. The desigmoid, that is, the inverse sigmoid, can be read as a lookup into the CDF of a random variable: the value at which it achieves that accumulation. Essentially, in the most basic configuration, this regressor uses each element of $A$ in the penultimate layer to support (or to reflect evidence) that the desired $Y$ is larger. In a human brain, this positive-only thinking seems overly restrictive. What if a field of $A$ is a positive signal that strictly means a smaller $Y$? One way is to use a second FCL to subtract the effect of one sigmoid from another. A second, more intuitive idea would be to do the following:
$Y = tf.atanh(tf.reduce\_mean(tf.tanh(A)))$
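In the same NumPy style as before, a sketch of this tanh variant (function name mine). Because tanh is odd and spans $(-1, 1)$, a strongly negative activation now pulls the pooled vote, and hence $Y$, downward:

```python
import numpy as np

def regress_tanh(A):
    # tanh maps into (-1, 1); the mean stays in (-1, 1), so atanh applies.
    return np.arctanh(np.mean(np.tanh(A), axis=-1))

# One activation votes "larger", one votes "smaller", one abstains:
A = np.array([[2.0, -2.0, 0.0]])
Y = regress_tanh(A)
# The opposing votes cancel exactly, so Y is 0.
```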
P.p.s. Want to also put in a plug for our wonderful democracy. The computation of the mean explicitly mixes the votes of the activations in the penultimate layer equally: each neuron gets an equal vote on the result. Politics aside, and in addition to convex combinations, all other range-preserving combinations are fair game, e.g. the geometric mean, a softmax-weighted average, etc., depending on the relationship between $X$ and $Y$ and the network that produced $A$.
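For instance, a softmax-weighted average is one such range-preserving combination: the weights are nonnegative and sum to one, so the combination of numbers in $(0, 1)$ stays in $(0, 1)$. A NumPy sketch, with the scoring scheme purely illustrative (in practice the scores could come from another head of the network):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def inverse_sigmoid(p):
    return np.log(p / (1.0 - p))

def softmax(x, axis=-1):
    # Shift by the max for numerical stability.
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def regress_weighted(A, scores):
    # Any convex combination of numbers in (0, 1) stays in (0, 1),
    # so the result is still a valid input to inverse_sigmoid.
    w = softmax(scores, axis=-1)
    return inverse_sigmoid(np.sum(w * sigmoid(A), axis=-1))

A = np.array([[1.0, -1.0]])
scores = np.zeros_like(A)  # uniform scores recover the plain mean
Y = regress_weighted(A, scores)
# sigmoid(1) + sigmoid(-1) = 1, so the uniform-weighted vote is 0.5 and Y is 0.
```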