A while back, I talked about computing IG (information gain) obtained by clandestine means from an otherwise secret (personal) email. Some earlier blog entries cover what we can reasonably consider private, and some reasons why I think this kind of spying is bad (chiefly because it removes competition).
The basic challenge is this: if your competitor can spy on what you do (unilaterally), then they will never be motivated to innovate. Their key strength will be their ability to hack your secrets, and they will work hard on that, but not on how to build a better product, cure a disease, or solve a new problem. If you can both spy on each other with perfect information, then there is no need to innovate; just calculate the equilibrium and aim for that. If you can disinform your opponent, then all your effort will go into disinformation instead of innovation. Basically, it is much easier to do something sneaky and cheat than to do the right thing and innovate. This is why the government, a non-competing body whose interest is to make sure everyone competes (at least in the American government this is the case), should provide for information security.
I realize in retrospect that IG may not make sense to most people based on the formulation I laid out. Let's review: IG is the change in entropy from a state without additional knowledge to a state with that knowledge,
IG = H(secret) – H(secret | private email)
This measurement seems to be of a quite abstract concept, entropy: a unitless quantity. Why would I think it useful for any reason other than that it is called "Information Gain"? Well, truth be told, what I had in mind was more the IG of the machine learning literature: class purity after conditioning on some piece of private information. There it is used more as a measure of how correctly we can predict a discrete output than as an abstract change in the entropy of a distribution after conditioning. I refer the reader to the excellent introductory books on "classification" algorithms.
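To make that classification flavor of IG concrete, here is a minimal sketch in Python (the toy labels and the `is_weekend` feature are invented for illustration, not from the original example): it computes the IG of splitting a small labeled dataset on one feature, the way a decision tree would score a candidate split.

```python
from math import log2

def entropy(labels):
    # Shannon entropy (in bits) of the class labels
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(labels, feature):
    # IG of a split = H(labels) - weighted entropy of each branch
    h_before = entropy(labels)
    h_after = 0.0
    for v in set(feature):
        branch = [l for l, f in zip(labels, feature) if f == v]
        h_after += len(branch) / len(labels) * entropy(branch)
    return h_before - h_after

# Toy data: does a person order takeout, split on whether it is a weekend
labels     = ["yes", "yes", "no", "no", "no", "yes"]
is_weekend = [True,  True,  False, False, False, True]

print(information_gain(labels, is_weekend))  # 1.0 bit: the split is perfectly pure
```

Here the feature separates the classes perfectly, so the gain equals the full prior entropy of one bit; an uninformative feature would score zero.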
… Some days pass and the books will hopefully have arrived on your desks…
So here is the example: my secret is the probability that I will have Chinese food tonight. Let's throw in a couple more classes, say Italian and Mexican, so that together these cover 99.9% of all possibilities. This probability may be internal to me, or it may be an externalizable model: say I toss a three-sided die and figure out what I will eat tonight.
Actually, this system forces us to think of a new class. I will call this new class the innovation class. It covers all cases where something new might happen, such as tonight, when I went off on a tangent and forgot to eat dinner completely. Or I might be abducted by aliens for demanding privacy, by a Japanese paramilitary for blogging, or by God for thinking all these awful things. The fact is, I do not know what will happen, but what I do know is that things I don't know will happen. So the class is called IC, the Innovation Class. Now we have a four-sided die: Chinese, Mexican, Italian, IC. Let's write naively that the probability for each class is: Chinese 33%, Mexican 33%, Italian 33%, IC 1%.
The formula for the entropy of these classes is written as:
-H(Dinner)= p(Chinese) * log(p(Chinese)) + p(Mexican) * log(p(Mexican)) + p(Italian) * log(p(Italian)) + p(IC)*log(p(IC))
With logs taken base 2, the above evaluates to H(Dinner) = 1.6499060116098556 bits: just above the maximum possible entropy of a three-class situation (log2(3) ≈ 1.585), with the small IC class nudging it over.
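As a sanity check, a few lines of Python reproduce this number, assuming the naive distribution of 33%, 33%, 33%, 1% over Chinese, Mexican, Italian, and IC (the one that yields the stated entropy):

```python
from math import log2

def entropy(probs):
    # Shannon entropy in bits: H = -sum p * log2(p)
    return -sum(p * log2(p) for p in probs if p > 0)

# Assumed naive prior: Chinese 33%, Mexican 33%, Italian 33%, IC 1%
prior = [0.33, 0.33, 0.33, 0.01]
h_dinner = entropy(prior)
print(h_dinner)  # ≈ 1.6499 bits
```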
That's it. That's the formula for calculating entropy, and we will use it repeatedly. Now, suppose that you have read my email to my wife saying "oh man, look at this great deal on Groupon, 50% off on Indian food right near our home." What is the right thing to think about the distribution of my dinner?
Indian food is not Chinese or Mexican or Italian, but we have thought of that and put in IC to account for it.
-H(Dinner|private email to wife) = p(Chinese|private email to wife) * log(p(Chinese|private email to wife)) + p(Mexican|private email to wife) * log(p(Mexican|private email to wife)) + p(Italian|private email to wife) * log(p(Italian|private email to wife)) + p(IC|private email to wife)*log(p(IC|private email to wife))
gives us the conditional entropy of my dinner distribution after you have read my private email. Say the conditional distribution is now Chinese 0.33%, Mexican 0.33%, Italian 0.33%, IC 99% (Indian falls into IC). Then H(Dinner|private email to wife) = 0.09596342477405478.
IG(Dinner; private email to wife) = H(Dinner) – H(Dinner|private email to wife) = 1.6499060116098556 – 0.09596342477405478 = 1.5539425868358008. This corresponds to an IGR of 1619.31%: the information gained is roughly 16X the entropy that remains after you saw the email.
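The whole calculation fits in a few lines of Python. The prior and posterior below are the distributions that reproduce the entropies quoted above (33%/33%/33%/1% before the email, 0.33%/0.33%/0.33%/99% after), and IGR is computed here as the ratio of the gain to the remaining conditional entropy:

```python
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# Chinese, Mexican, Italian, IC -- before and after reading the email
prior = [0.33, 0.33, 0.33, 0.01]
posterior = [0.0033, 0.0033, 0.0033, 0.99]

ig = entropy(prior) - entropy(posterior)  # ≈ 1.554 bits gained
igr = ig / entropy(posterior)             # ≈ 16.19, i.e. an IGR of ≈ 1619%
print(ig, igr)
```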
Great! So now we know how much information is gained by reading that one private email of mine. This number, I think, quantifies my loss of privacy.
By the way, this innocent example contains some hand-waving. H(Dinner), for example, is something we may or may not know; most people have trouble writing down a distribution over their dinner choices. Also, P(Dinner|private email to wife), written here as a table, contains assumed values. What if, after reading my private email, you feel that P(IC) = 85%? Who is to say what the reality of this probability is? This is why I felt this model will not make it into the mainstream legal system: the link between the private email and the actual secret is not so obvious. You might take naive Bayes as the definition of reality (refer to the chapters in those books, or the wiki), or logistic regression, or decision trees, or you might use something else... You might even use a margin-based method like SVM or, god forbid, a rule-based system...
If you understand the computation above, then it will be easy for you to understand the continuous version. Let dinner be a continuous variable; we can still write the same expression
IG(Dinner; private email to wife) = H(Dinner) – H(Dinner|private email to wife)
and it would have the same meaning: how far are we from the truth? This idea, by the way, is indeed partially inspired by the name Information Gain, which also goes by Kullback–Leibler divergence when computed over distributions. The formulation above is exactly that, with the exception that "private email to wife" is itself a distribution; say, perhaps, my emails are generated randomly:
KL( Dinner|private email || Dinner )
But KL divergence does point us to some other interesting characterizations. It is a divergence: a distance without some properties of a distance. Namely, it is not a metric:
* Non-negativity, D(x,y) >= 0: yes
* Identity of indiscernibles, D(x,y) = 0 iff x == y: yes
* Symmetry, D(x,y) == D(y,x): NO
* Triangle inequality, D(x,y) + D(y,z) >= D(x,z): NO
This has some serious implications for this formulation of privacy. Some things that we naturally think should make sense do not.
Let’s say I have two emails, e1 and e2, and let’s say dinner is still the subject of intense TLA investigation:
KL(d;e1) + KL(d;e2) != KL(d;e1,e2)
All private information must be considered together, because considering the pieces separately yields an inconsistent measurement of privacy loss.
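Here is a tiny worked example of why the pieces cannot simply be added (the XOR setup is my own illustration, not from the original example): let e1 and e2 be two fair coin flips and let the dinner bit d be their XOR. Each email alone reveals nothing about d, yet together they reveal it completely:

```python
from math import log2
from itertools import product
from collections import defaultdict

def entropy(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def marginal(joint, idx):
    # Marginal distribution over the given tuple positions
    m = defaultdict(float)
    for outcome, p in joint.items():
        m[tuple(outcome[i] for i in idx)] += p
    return dict(m)

def mutual_info(joint, a_idx, b_idx):
    # I(A;B) = H(A) + H(B) - H(A,B)
    return (entropy(marginal(joint, a_idx)) + entropy(marginal(joint, b_idx))
            - entropy(marginal(joint, a_idx + b_idx)))

# Joint distribution over (d, e1, e2): e1, e2 fair coin flips, d = e1 XOR e2
joint = {}
for e1, e2 in product([0, 1], repeat=2):
    joint[(e1 ^ e2, e1, e2)] = 0.25

i_d_e1 = mutual_info(joint, (0,), (1,))      # I(d; e1) = 0 bits
i_d_e2 = mutual_info(joint, (0,), (2,))      # I(d; e2) = 0 bits
i_d_e1e2 = mutual_info(joint, (0,), (1, 2))  # I(d; e1, e2) = 1 bit
print(i_d_e1 + i_d_e2, "!=", i_d_e1e2)
```

So an eavesdropper who scores each email separately would conclude zero privacy loss, while the pair of emails in fact gives away the secret entirely.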
Let's say there are two secrets: d1 is my dinner choice and d2 is my wife's dinner choice.
KL(d1;e1,e2) + KL(d2;e1,e2) != KL(d1,d2; e1,e2)
All secrets must be computed together, because computing each IG separately and adding them up does not equal the total information gain.
Let’s say we have an intermediate decision called Mode of Transportation (mt), and it is a secret just like my dinner choice.
KL(mt;e1,e2) + KL(d;mt) != KL(d;e1,e2)
The intermediate secret can be accounted for, but again, it must be handled carefully, not by additively accumulating IG.
Bummer, but fascinating!! Still, we must make some choices about how to proceed. Knowledge about the nature of information (and especially electronic information), I believe, should inform how we make choices in our privacy laws:
- Should the whole data set be analyzed all at once?
- Or should we only allow each individual's data to be processed all at once?
- Or should we only allow everyone's daily data to be processed together?
- Or should we only allow each individual's daily data to be processed separately?
Each of these choices (and many others) impacts the private-information loss due to clandestine activities.