I want to advocate for keeping historically important datasets in the public domain for historical reference. Recently, MIT removed a dataset that had been in use since 2006 from its websites (news article, announcement letter), citing a draft paper challenging the quality of the data with respect to biases around race, sex, etc.
While this knee-jerk reaction is like an instinctive wipe with the sleeve when you feel snot and booger stuck on your face post-sneeze, it really is not necessary. Everyone sneezes and knows what I’m talking about. There is no shame in a mistake like this. But removing the dataset removes people’s ability to verify the claims in the draft paper challenging it. At the moment, it appears to me that keeping the dirty data available may help us learn about the process and about bias in our world today. Perhaps someone wants to create an automated data-repair algorithm. Perhaps someone will want to verify a calibration algorithm against a large model trained on such a biased dataset, so that we understand the relationship between data and model better.
So, I know it’s kind of painful for a prestigious school, and for professors and researchers who need to keep their funding, etc. But in the name of science, I hope they reconsider and republish the “bad” dataset. It is history that factually happened, and publicly denying history is not a good habit for leading American institutions to promulgate. MIT can be a leader here by doing something good for society: taking thoughtful and scientifically sophisticated action. MIT doesn’t need to, and cannot afford to, throw around simpleminded slogans of the social justice movement and call it a day.
And overall, AI scientists can really band together and make this all work the same way. There will be another data mistake. Let’s develop procedures for handling newly discovered errors in popular datasets, or even planned obsolescence via a predetermined retirement date, in a way that maximizes societal benefit through the advancement of science without undue sacrifice from individual researchers.
And to begin, may I suggest a simple first step. Each measurement, be it an image taken, a sound recorded, or a human opinion collected for supervised training, needs just a bit of metadata: starting with a UUID and a timestamp for each recorded measurement (a minimal sketch follows below). Going further, the data could be kept on a blockchain so that it cannot be corrupted or destroyed, for the sake of traceability, immutability, and persistence. This way, the data gains useful digital integrity; it can be accessed and discussed in very precise ways. Another idea is to place cryptographic watermarks on the dataset, right inside the images (to the consternation of purists, no doubt), with these properties: 1) the mark cannot be removed by someone without the private key; 2) it can be easily verified with a public key as marking an image from an ML dataset, so that “reverse image search engines” can refuse to service these requests; 3) the watermarking algorithm is implemented in open-source software so that its effect on ML systems can be evaluated. (Think seemingly random, very small perturbations of pixels that have no useful pattern but are reliably verifiable; a toy sketch follows below.) Again, if each data element had an identifier, we could actually produce a list of the bad examples of especial concern when discussing them. Right now it’s just a bunch of file names sitting in a text file on researchers’ workstations, and there is no easy, public way to discuss the quality of their analysis. Taking some of these steps, in addition to those gathered in Datasheets for Datasets (Gebru et al., 2018), would be doing our industry a solid.
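To make the UUID-and-timestamp idea concrete, here is a minimal sketch (the record fields and function name are my own invention, not any standard): one metadata record per measurement, with a content hash chained to the previous record so that tampering is detectable, blockchain-style, without needing an actual blockchain.

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone

def make_record(payload: bytes, prev_hash: str) -> dict:
    """One metadata record per measurement (image, audio clip, label).

    The UUID gives the example a stable, citable identity; the SHA-256
    hash pins its exact content; chaining each record to the previous
    one makes after-the-fact tampering detectable.
    """
    record = {
        "uuid": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(payload).hexdigest(),
        "prev": prev_hash,
    }
    # Hash the record itself so the next record can chain to it.
    record["record_hash"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()
    return record
```

With records like these, a “list of bad examples” becomes a list of UUIDs that anyone can resolve and audit publicly, instead of file names sitting on someone’s workstation.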
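And for the watermark idea, here is a toy sketch (assuming numpy and the pyca/cryptography package): an Ed25519 signature hidden in the least-significant bits of the pixels. Be warned that this demonstrates only property 2, public verifiability, and property 3, open source; it would not survive someone simply rewriting the LSBs or re-compressing the image as JPEG, so property 1, robust removal-resistance, remains an open research problem that this sketch does not solve.

```python
import numpy as np
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
    Ed25519PublicKey,
)

SIG_BITS = 64 * 8  # an Ed25519 signature is 64 bytes = 512 bits

def embed_watermark(img: np.ndarray, key: Ed25519PrivateKey) -> np.ndarray:
    """Sign the image (with carrier LSBs cleared) and hide the signature
    in the least-significant bits of the first 512 pixel values."""
    flat = img.astype(np.uint8).reshape(-1).copy()
    assert flat.size >= SIG_BITS, "image too small to carry a signature"
    flat[:SIG_BITS] &= 0xFE                 # canonical form: carrier bits zeroed
    sig = key.sign(flat.tobytes())          # sign the canonical image
    bits = np.unpackbits(np.frombuffer(sig, dtype=np.uint8))
    flat[:SIG_BITS] |= bits                 # tiny, pattern-free perturbation
    return flat.reshape(img.shape)

def verify_watermark(img: np.ndarray, public_key: Ed25519PublicKey) -> bool:
    """Anyone holding the public key can check provenance, e.g. a reverse
    image search engine deciding whether to refuse the request."""
    flat = img.astype(np.uint8).reshape(-1).copy()
    sig = np.packbits(flat[:SIG_BITS] & 1).tobytes()  # extract signature
    flat[:SIG_BITS] &= 0xFE                 # restore the canonical form
    try:
        public_key.verify(sig, flat.tobytes())
        return True
    except Exception:
        return False
```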
(By the by, the paper also ties in the idea of paying for data; in this case they may mean data like pictures of private parts that people would normally not want made public. A more general interpretation is taken by Andrew Yang in his UBI proposals, which are partially motivated by the fact that many large companies, Internet and otherwise, are starting to leverage our data in very intimate ways, and that it would be reasonable for them to do so, with our consent, by paying us for our intimate data.)
P.S. After reading TFA, I guess the news decided to leave out one of the facts of the challenge, which is that the dataset contained easily identifiable private parts of people (easy to identify to whom said privacy belongs, IMO). I suppose these would be a terrible thing to keep on the Internet. If considered PII and nonconsensual, it would be illegal to keep any of them posted. TFA also mentions the existence of child pornography and some other ominous, unspeakable issues that the authors revealed only to the dataset creators. In all these cases, there should probably be some legalese in place of the anti-bias letter, which cites scientific and moral concerns while hiding legal and financial ones. It would be nice if MIT did not mislead, and mis-lead, the community.
P.P.S. It’s okay, it’s just a chunky booger wrapped in slushy snot; everyone makes them. Instead of secretively wiping it off, then realizing it’s stuck on our sleeve and wiping it on our pants, or a wall, or under the classroom desk… let’s blow it into specially designed paper with hot air. Let’s then put it in a jar and examine it more carefully through the lens of time. 🤢🤮
P.P.P.S. What does “offensive” mean? It is the top criterion for assessing model and dataset propriety, besides non-objectivity and damaging effects (financial and physical hurt). From top Google search results, it seems being offended is mostly an emotional response. Even if one’s intelligence is insulted, it is the emotional feeling that we seek to comfort; i.e., an apology is required. Based on media reports of increasing violence and anger against Asians in very casual daily situations (example; or myself finding the Stanford Asian Liver Center offensive despite its intentions; or many Asians finding Trump’s “Kung Flu” and “China Virus” offensive), I am led to believe that many white people are very much genuinely offended if I simply walk down the street. Given that, I definitely do not want “offensive” to be a criterion of scientific or moral propriety. Our desirable social norms should be expressed in quantifiable terms: racial parity, equality of opportunity, equality of benefit, equality of outcome, equality of progress, equality of arrival, equality of representation, equality of contribution, fairness of service in efficiency and effort, as measured by QIM, etc. (a minimal sketch of one such metric follows below). These arguably subjective metrics are actually the most objective way we have to shape our technology. They resonate deeply with our rational and emotional minds, with our bodies, and with all their extensions in our language, culture, technology, and society, all of which are innately designed to express themselves in such terms.
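To show what “quantifiable” could look like, here is a minimal sketch of one such metric, the demographic parity gap (the function name and interface are mine, chosen for illustration):

```python
import numpy as np

def demographic_parity_gap(y_pred: np.ndarray, group: np.ndarray) -> float:
    """Largest gap in positive-prediction rates across groups.

    0.0 means every group receives positive predictions at the same
    rate; the larger the gap, the further the system is from parity.
    """
    rates = [float(y_pred[group == g].mean()) for g in np.unique(group)]
    return max(rates) - min(rates)

# Example: binary predictions for six people from two groups.
y_pred = np.array([1, 0, 1, 1, 0, 0])
group = np.array(["a", "a", "a", "b", "b", "b"])
print(demographic_parity_gap(y_pred, group))  # 2/3 - 1/3 ≈ 0.333
```

A number like this can be argued over, tracked over time, and set as a release criterion; “someone felt offended” cannot.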