Keep it up

I want to write and advocate for keeping historically important datasets in the public domain for historical reference. Recently MIT removed a dataset that had been in use since 2006 from its website (news article, announcement letter), citing a draft of a paper challenging the quality of the data with respect to biases against races, sexes, etc.

While this knee-jerk reaction is like an instinctive wipe with the sleeve when you feel snot and booger stuck on your face post-sneeze, it really is not necessary. Everyone sneezes and knows what I’m talking about. There is no shame in a mistake like this. But removing the dataset removes people’s ability to verify the claims in the draft paper challenging the dataset. ATM, it would appear to me that keeping the dirty data available may help us learn about the process and bias in our world today. Perhaps someone wants to create an automated data-repairing algorithm. Perhaps someone will want to verify a calibration algorithm that can use a large model trained on such a biased dataset, so that we understand the relationship between data and model better.

So, I know it’s kind of painful for a prestigious school, and for professors and researchers who need to keep their funding, etc. But in the name of science, I hope they recant and republish the “bad” dataset. It is history that factually happened, and denying history in public is not a good habit for leading American institutions to promulgate. MIT can be a leader here, doing something good for society by taking thoughtful and scientifically sophisticated actions. MIT doesn’t need to, and cannot afford to, throw very simpleminded slogans of the social justice movement around and call it a day.

And overall, AI scientists can really band together and make this all work.

And to begin, may I suggest that a simple step be taken. Each measurement, be it an image taken, a sound recorded, or a human opinion recorded for supervised training, just needs a bit of metadata: starting with a UUID and a timestamp for each recorded measurement. Another way is to keep the data on a blockchain, for the purposes of traceability, immutability and persistence, to ensure that they do not get corrupted or destroyed. This way, the data gains useful digital integrity. They can be accessed and discussed in very precise ways.

Another idea is to place cryptographic watermarks on the dataset, right inside images to the concern of pyrrhic writers, with these properties: 1) they cannot be removed by someone without the private key, 2) they can be easily verified using a public key to confirm that the source is an image from an ML dataset, so that “reverse image search engines” can refuse to service these requests, 3) the cryptographic watermarking algorithm used is implemented in open source software so that its effect on ML systems can be evaluated. (Think seemingly random and very small perturbations of pixels that have no useful pattern but are reliably verifiable.)

Again, if each data element had an identifier, we could actually produce a list of bad examples of especial concern when discussing them. Right now it’s just a bunch of image file names sitting in a text file on researchers’ workstations. There is no easy and public way to discuss the quality of their analysis. Taking some of these steps in addition to those gathered for Datasheets for Datasets (Gebru et al., 2018) will be doing a solid for our industry.
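To make the UUID-and-timestamp suggestion concrete, here is a minimal sketch in Python. The field names (`id`, `recorded_at`, `sha256`, `label`) and the helper functions are hypothetical, made up for illustration. The content hash is my own assumption about one cheap way to get the "digital integrity" mentioned above; it is not a blockchain and not a watermark.

```python
import hashlib
import uuid
from datetime import datetime, timezone


def make_record(payload: bytes, label: str) -> dict:
    """Wrap one measurement with an ID, a UTC timestamp, and a content hash."""
    return {
        "id": str(uuid.uuid4()),                               # random UUID
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(payload).hexdigest(),         # assumed integrity check
        "label": label,
    }


def is_intact(record: dict, payload: bytes) -> bool:
    """Return True if the payload still matches the hash stored in the record."""
    return hashlib.sha256(payload).hexdigest() == record["sha256"]


# Stand-in bytes; a real dataset would read the image file here.
image_bytes = b"stand-in for real image bytes"
record = make_record(image_bytes, label="example")

# A reviewer can now flag a specific example by ID instead of by file name.
flagged_ids = [record["id"]]
```

With something like this in place, a critique of the dataset could publish `flagged_ids` and anyone holding the data could check exactly which examples are meant, and whether their copies are unmodified.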

(By the by, the paper also ties in the idea of paying for data; in this case they may mean data like pictures of private parts that people would normally not want to make public. A more general interpretation is taken by Andrew Yang in his UBI proposals, which are partially motivated by the fact that many large companies, Internet and otherwise, are starting to leverage our data in very intimate ways, and that it would be reasonable for them to do that, with our consent, by paying us for our intimate data.)

P.s. after reading TFA, I guess the news decided to leave out one of the facts of the challenge, which is that the dataset had easily identifiable private parts of people (easy to identify to whom said privacy belongs, IMO). I suppose these would be a terrible thing to keep on the Internet. If considered PII and nonconsensual, then it would be illegal to keep any of them posted. TFA also mentions the existence of child pornography and some other ominous and unspeakable issues that they only revealed to the dataset creators. In all these cases, there should probably be some legalese in place of the anti-bias letter, which cites scientific and moral concerns while hiding legal and financial ones. It would be nice if MIT did not mislead and mis-lead the community.

P.p.s. It’s okay, it’s just chunky booger wrapped with slushy snot; everyone makes it. Instead of secretively wiping it off, then realizing it’s stuck on the sleeve and wiping it on our pants, or a wall, or under the classroom desk… let’s flush it into specially designed paper with hot air. Let’s then put it in a jar and examine it more carefully through the lens of time.🤢🤮

P.p.p.s. what does “offensive” mean? This is the top criterion for assessing model and dataset propriety besides non-objectivity and damaging effects (financial and physical hurt). From top Google search results, it seems that being offended is mostly an emotional response. Even if one’s intelligence is insulted, it is the emotional feeling that we seek to comfort, i.e. an apology is required. Based on media reports of increasing violence and anger against Asians in very casual daily situations (example; or consider myself finding the Stanford Asian Liver Center offensive despite their intentions, or a lot of Asians finding Trump’s “Kung Flu” and “China Virus” offensive), I am to believe that many white people are very much genuinely offended if I walk on the street. Given that, I definitely do not want “offensive” to be a criterion of scientific moral propriety. Our desirable social norms should be expressed in quantifiable terms: racial parity, equality of opportunity, equality of benefit, equality of outcome, equality of progress, equality of arrival, equality of representation, equality of contribution, fairness of service in efficiency and effort, as measured by QIM, etc. These arguably subjective metrics are actually the most objective way to shape our technology. They resonate deeply with both our rational and emotional minds, as well as our bodies and all their extensions in our language, culture, technology and society, which all have an innate design that can be expressed in terms of these metrics.

The Right-Sizing of Humanity

I was playing with a deep neural network tutorial recently. The fun of deep learning is starting to wear out after three years or so of continued exposure. Adjusting learning rates, batch sizes, filters, penalties and regularizations. Trying out algorithms that promise to perform without undue experimentation with these hyper-parameters… It used to be so fun, so exciting to make even the smallest improvements. But today, it’s quite tedious and quite boring.

A quick meta-thought brings to mind a training procedure: every time I want to change the training, either interrupting SGD mid-stride or tinkering with a hyper-parameter, I could write the change as Python code, and then ask a model to learn to write these changes for me based on my supervision.

The outcome, in the limit, is that the model will be able to autonomously make the changes that I would want to make, as if I were watching it. It would just do what I want to do, with the same patience as myself, and maybe the same typo rate, even. Note that it doesn’t optimize the final metric; it just mimics what I would do.
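Here is a rough sketch of what "writing the change as Python code" could look like, so that every manual tweak also becomes a supervised example. The class names `Intervention` and `TrainingRun`, and the idea of logging an `observed` snapshot next to each change, are my own hypothetical choices for illustration; no mimicking model is actually trained here.

```python
from dataclasses import dataclass, field


@dataclass
class Intervention:
    step: int        # training step at which I intervened
    observed: dict   # what I was looking at (e.g. the current loss)
    change: dict     # hyper-parameter overrides I applied


@dataclass
class TrainingRun:
    config: dict
    log: list = field(default_factory=list)

    def intervene(self, step, observed, **change):
        """Apply a manual change and record it as a supervised example."""
        self.log.append(Intervention(step, dict(observed), dict(change)))
        self.config.update(change)


run = TrainingRun(config={"learning_rate": 0.1, "batch_size": 32})
run.intervene(step=1200, observed={"loss": 2.31}, learning_rate=0.01)
run.intervene(step=5000, observed={"loss": 0.87}, batch_size=64)

# (observed, change) pairs are what a mimicking model would be trained on:
# given what I saw, predict what I did.
examples = [(i.observed, i.change) for i in run.log]
```

The target here is my behavior, not the validation metric, which is exactly the distinction made above.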

ATM, I feel that experience will be gratifying. There are various interpretations of that event.

We can say that the program has removed human desire. Not in the sense that we cannot or need not desire, but desires that are quickly satisfied are really not desires.

We can also say that humans will have achieved idleness. We do nothing and anything and yet everything is done. We’ve achieved nil-activity; we accomplish all through inaction.

We can also say that the work to achieve said model tuning automation is a human-minimizing activity. If optimizing human resource consumption, the model will remove the need for humans. And if the model is still imperfect, it will tend to minimize our usefulness. This is by far the most horrifying interpretation of the event. Working on that model literally is an effort to minimize human involvement. (That has the MDL for my desires at the moment.)

So here we have arrived at one way to inspect the “future of jobs” “situation.”

Since we are still in control, we actually have the ability to set where we want to go. In the most pathetic case, we will institute Affirmative Action to affirm Humanity: the Law shall favor by race, the Human race, requiring all AI to have 1bps of entropy added to their actions to facilitate the need for humans. In a less pathetic approach, we are squeezed out of the skills-for-hire arena but humans still engage each other in socializing and networking and things that only humans do. That’s still quite, quite pathetic though, IMMHO today.

Can we ponder the question: “what is the right size of humanity?” What is the amount of involvement we really want? Right now I hate tuning parameters, but I can certainly see a situation where, say, a robot lies dying in the middle of the road, his loving and beloved human child crying.

I can say, “Move aside and let me through, I can help! The 2019 model year bots use algorithms that have been in the public domain since 2018 (and they don’t work! cd /dev ; they are bastard of random with zero and properly homed in null). I have SGD training! I can save that bot!!” The child’s watery eyes are now filled with hope, meeting my fiercely determined and confident eyes at the midpoint on the edge linking us two humans. (Think Goldblum cross Eastwood cross Moore.)

“But a bot could do that better than you!” my annoyingly observant reader will quickly point out, before moving on to another, more interesting blog.

But for me, there seems to be something I care about in that moment. There’s something I care for in that moment. And one can easily achieve consensus that there is something humanity cares for in that moment. Is it hope? Is it kindness? Is it sympathy? Is it the desire to decrease perceived entropy? Is it the interdependence of humans that is really of note? Is it my usefulness that I really care about? To the bot or to the child? Is respect all I want, even when that’s only payable in arrears? Is it…??? What is it? Can we quantify it? Or does its identity and essence rely on its lack of computerized representation?

Perhaps an AI can be made to tell us this idea that it cannot describe within its domain? An AI to give humanity its best meaning and purpose. Any progress in characterizing it seems like a truly imaginative and inventive step forward, be it taken by humans or by computers.

To be continued…

P.s. I realize it was more than 25 years ago when I first wrote about this matter. I dreamt in high school of making AI. I wrote for my 11th grade Advanced Social Science class to take the position that a symbiosis is not only acceptable but also a desirable and inevitable outcome. We should co-evolve, I wrote. Somehow, that position still echoes in the FAM Blog. It would be fascinating CS work to integrate with philosophy, perhaps name it Computational Philosophy, a field of philosophical endeavor, Human kind and Computer kind, together, hands-on-keyboards…

But that’s an interesting question in itself. Because we, as a kind, do ingest a lot of very intimate things from our surroundings: water, air, viral DNA/RNA, etc. Things like antibiotics we take as part of humanity because enough people use them that, on average, there’s some non-zero amount of antibiotic in everyone. Then there are vaccines. Such significant resources have been devoted to the continued injection of antibiotics or vaccines into people that they are us. Computers are us, part of us. They have both physical presence and biological and social functionality as part of us personally and as part of our society.

While there are some who object to mass enforcement of mandatory vaccination, their effects are limited. One would imagine that the people yammering against computers becoming an irreplaceable part of our lives… they are vaccine opposers. They are the people who ask questions like “Who will buy drugs if a vaccine prevents a disease?” and “Are you still you if a computer does the shopping for you?”

Don’t care, and yes.

Refactor Autoactivations

I’ve been thinking about autoactivations recently. This is one of those great innovations that has stood the test of time: it still works after a lot of debugging and exposure to new data and models.

I find that I have been referring to autoactivations as pre-activation because they are applied in deep neural nets before the parameters are actually mixed with input data (or previous layers’ activations). But look at the two expressions:

  1. To pre-activate a parameter means to apply a nonlinearity before it is used, e.g. preheating the oven. The stem is a verb and happens before something else.
  2. But pre-activation is actually an adjective meaning before any activations, e.g. pre-trial motions. Its stem is a noun and becomes the thing to be preceded.

And actually a similar problem applies to the ante- prefix. So, to avoid confusion, we should probably refer to autoactivations as foreactivations, and say that we foreactivate the layer. This prefix also means before, and it works both for nouns (foresight, foreknowledge, forethought, forerunner, foreword, foreman) and for verbs (forecast, foreshadow, foredone, foreshorten, forewarn, forestall, foredoom). In each case the stem is always the thing that comes before, never the thing preceded by another.

So, let us all try out foreactivations and related approaches. The speedup in training will surely be a good thing for humanity; at the least, we won’t be consuming as much energy as we do training models without foreactivations.
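For anyone who wants a picture of the ordering being named here, below is a tiny sketch built only on the one-line description in this post: the nonlinearity is applied to the parameters before they are mixed with the input. The function names, the use of `tanh`, and the toy numbers are all my assumptions for illustration; this is not a full statement of the method and says nothing about training speed.

```python
import math


def matvec(weights, x):
    """Plain matrix-vector product: mix parameters with the input."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def conventional_layer(weights, x):
    """Usual ordering: mix first, then apply the nonlinearity to the result."""
    return [math.tanh(v) for v in matvec(weights, x)]


def foreactivated_layer(weights, x):
    """Ordering described here: apply the nonlinearity to the parameters
    first (tanh is an assumed choice), then mix them with the input."""
    activated = [[math.tanh(w) for w in row] for row in weights]
    return matvec(activated, x)


weights = [[0.5, -2.0], [3.0, 0.1]]
x = [1.0, 2.0]

conventional_out = conventional_layer(weights, x)
foreactivated_out = foreactivated_layer(weights, x)
```

The two orderings give different outputs for the same weights and input, which is the whole reason the naming matters.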

😀