A discussion of the skewed join

I never learned database stuff in CS classes… But I recently had to use Pig’s Skewed Join and thought it was interesting:

The skewed join samples the data (by running a full MR pass over it), splits the data belonging to the larger keys across several reducers, and replicates the smaller table to all of those reducers.

My initial thought when exposed to this was: wow, that’s so cool, but so dumb. Why do a full MR to sample the data? Why not use HDFS to sample records directly from the input to approximate the split?

Secondly, the optimization to split only the large key into several partitions is unnecessary… Consider this instead: approximately compute the mode hash partition (i.e. hash each key into the hash space and find the hash partition that would receive the most records, possibly exceeding that reducer’s capabilities). In reality, the records in that partition will often share not just a hash partition, but the identical key.
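Here is a rough sketch of estimating the mode hash partition from a sample instead of a full pass. Everything in it is an assumption made for illustration: the function name `estimate_mode_partition`, the in-memory list of keys, and the use of `zlib.crc32` as a stand-in for whatever partitioner the real job would use.

```python
import random
import zlib
from collections import Counter


def estimate_mode_partition(keys, num_partitions, sample_size, seed=0):
    """Estimate which hash partition receives the most records, from a sample.

    Returns (partition_id, estimated_fraction_of_all_records).
    """
    rng = random.Random(seed)
    sample = rng.sample(keys, min(sample_size, len(keys)))
    # crc32 is a stable stand-in for the job's real partitioner (assumption).
    counts = Counter(
        zlib.crc32(key.encode("utf-8")) % num_partitions for key in sample
    )
    partition, hits = counts.most_common(1)[0]
    return partition, hits / len(sample)


# Hypothetical skewed input: 90% of the records share one key.
keys = ["hot"] * 9000 + [f"user{i}" for i in range(1000)]
partition, fraction = estimate_mode_partition(
    keys, num_partitions=10, sample_size=500
)
print(partition, fraction)
```

The estimated fraction, multiplied by the total record count, gives a rough size for the busiest partition without reading the whole input.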

Conservatively compute the number of reducers that will be needed to process that hash partition; say this number is k. Then, in the mapper stage, add to each record’s key a random integer between 0 and k-1. (Add as in attach a separate key component, not arithmetic addition.) Cross the smaller table with the numbers [0,k-1] and use the integer as part of its key as well. Join on the reducer side.
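The salting scheme can be sketched in plain Python, with dictionaries standing in for the shuffle. This is only an illustration under assumptions: `salted_join`, the tuple-based tables, and the fixed seed are made up here, and nothing about it reflects how Pig actually implements its joins.

```python
import random
from collections import defaultdict


def salted_join(big, small, k, seed=0):
    """Join big and small on their key, spreading every key over k buckets.

    big, small: lists of (key, value) tuples (hypothetical in-memory tables).
    k: number of salt values, i.e. how many reducers share each key.
    """
    rng = random.Random(seed)

    # "Map" stage for the big table: attach a random salt in [0, k-1] to each
    # record's key as a separate key component (not arithmetic addition).
    big_buckets = defaultdict(list)
    for key, value in big:
        big_buckets[(key, rng.randrange(k))].append(value)

    # Cross the small table with [0, k-1] so a copy of each small row is
    # waiting in every salted bucket.
    small_buckets = defaultdict(list)
    for key, value in small:
        for salt in range(k):
            small_buckets[(key, salt)].append(value)

    # "Reduce" stage: an ordinary join inside each (key, salt) bucket.
    joined = []
    for bucket, big_values in big_buckets.items():
        for b in big_values:
            for s in small_buckets.get(bucket, []):
                joined.append((bucket[0], b, s))
    return joined


big = [("hot", i) for i in range(8)] + [("cold", 100)]
small = [("hot", "H"), ("cold", "C")]
result = salted_join(big, small, k=4)
```

Whatever salts are drawn, every big-table record still meets its matching small-table rows, because the small table was copied into all k buckets.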

I argue that this doesn’t excessively increase runtime/resource consumption compared to the implementation that splits only the largest keys across several reducers. By obviousness. The small keys will be sent to several reducers, but the smaller table will be waiting for them there, so no biggie.

It reduces cost by one MR, because we can sample, or, if the data distribution is already known, we can just let the language specify the number of splits:

r= join x by a, y by b using “skewed 100”;

Finally, to push this to an extreme: why bother even generating the second key at all? Similar to Pig’s map-side join called “replicated”, the replicated join can actually happen on the reducer side. Just send the small table to each reducer, randomly throw big-table records at the reducers (w/o even putting them into a bucket), and perform the join there.
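A minimal sketch of that reducer-side replicated join, again simulated in one Python process. The function name `reducer_side_replicated_join` and the data are hypothetical, and a single shared index stands in for the copy of the small table that each real reducer would hold.

```python
import random
from collections import defaultdict


def reducer_side_replicated_join(big, small, num_reducers, seed=0):
    """Scatter big-table records at random and join against a replicated small table."""
    rng = random.Random(seed)

    # The small table, indexed by key. In a real job every reducer would hold
    # its own copy; here one shared index stands in for all of them.
    small_index = defaultdict(list)
    for key, value in small:
        small_index[key].append(value)

    # Big-table records are thrown at reducers at random: no salt, no
    # key-based partitioning at all.
    shards = [[] for _ in range(num_reducers)]
    for record in big:
        shards[rng.randrange(num_reducers)].append(record)

    # Each reducer joins whatever it received against the small table.
    joined = []
    for shard in shards:
        for key, b in shard:
            for s in small_index.get(key, []):
                joined.append((key, b, s))
    return joined, [len(shard) for shard in shards]


big = [("hot", i) for i in range(8)] + [("cold", 100)]
small = [("hot", "H"), ("cold", "C")]
joined, loads = reducer_side_replicated_join(big, small, num_reducers=3)
```

The per-reducer load no longer depends on the key distribution at all, which is the point; the price is that each reducer must fit the whole small table.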

The problems only occur when the “small table” is large, in which case too many different hash keys on one reducer may overwhelm the system. But the system proposed here is no worse than the system that is implemented right now.

I guess the complication comes in when there are other operations and joins involved. If, for instance, an operation on the join reduces the size of the data significantly (e.g. a filter), then putting it on the mapper side is worthwhile. Because, obviously.

But if that is not the case, it seems always more worthwhile to replicate to the reducer than to the mapper. Because there is a chance that the small table won’t have to be replicated as much (big hash partitions will only receive one row of the small table). AND, the cost of communicating that extra bit of random integer is almost surely smaller than copying the smaller table to the mappers, joining, and then taking big table row + small table row and sending it to the reducer. Right?? Send the smaller table once to the reducer and be done with it. Send a small integer along with each big row, but that’s probably compressed away because it’ll be the same integer in most rows. (Recall that the data is skewed, so most data is in the “mode” key and that key all goes to a few reducers, so the “mode” key and its random number all get compressed to nil.)

So, I guess the solution is not only to allow us to specify the skewness but also to specify whether

Using “replicated”

is map side or reducer side.

Using “replicated to mapper”

vs

Using “replicated to reducer”

I want to explore the exact situations under which the features proposed above are superior. I’d also like to extend the analysis to multi-table situations.

wlog:
Join A by key, B by key, C by key…

with sizes A>B>C….

what kind of key distribution will…..

What Does a Marketing Company do?

I wish I had spent more time in biz school learning the trade. When you have a b2b company (my former employer BitTorrent, for instance), how do you perform market analysis? Who’s the target market?

Well, it seems that for a b2b company, the target market is sometimes better defined by the target market of the target market.

Using BitTorrent again as an example, the target markets are numerous. (The product is DNA, Distributed Network Accelerator.)

The target market is anybody who wants to transfer the same bits over the network.

But this is made easier if we analyze the target market of the target market.

Who needs to transfer the same large files to many people? (A priori, not streaming, as is the case with YouTube, Google Earth, … hmm, actually, maybe that’s another target market.)

The users who can withstand some latency:

  • Movie watchers downloading HD/Blu-ray movies.
  • Game players downloading game graphics
  • Sysops downloading linux DVD images
  • Scientific data sharing
  • Intra-company transfer of data files.

From there, the target market suddenly becomes clear

  • Media companies: youtube, hulu, porno sites, netflix, if p2p streaming is supported then all the better
  • Game designers or Game distribution channels
  • Linux distro companies or organizations (redhat, etc., etc.)
  • Scientific data sharing is usually between university labs (so market to the universities directly and get a share of the NSF/NIH funding)
  • Intra-company is harder, there are of course products in this area from companies such as Microsoft.

So because we began to investigate the target market of our target market, the target market analysis suddenly became easy. And the sales prospects just became really, really obvious.

Of course, most successful marketers will have done this subconsciously and arrived at the final analysis directly, as if by pure genius. For me, I’ll just have to square that target market analysis process.

Cosmetic Surgery

For quite some time now, I’ve been considering getting myself cosmetic surgery. Today, while at work, something subtle finally pushed me over the edge and I spent a good lunch hour googling and daydreaming about what to get myself.

I made a list of big stars that I might want to look like. The girls at the office thought I was just being cute and making conversation, but I hadn’t the heart to tell them that I really feel that I need it.

So here’s the list… We started with

George Clooney,
then

Gerard Butler
Russell Crowe (btw, does Gerard Butler look like a younger version of Russell Crowe? And they both acted in a bunch of historic movies about gladiators?)
Owen Wilson
Jude Law
Ashton Kutcher (Maybe Ashton kutcher knows something you don’t?)
Nicolas Cage (wowa!)
Jon Hamm (total madness)
Leonardo DiCaprio
Will Smith
Mike Myers ( Do I make you feel horny?)
Alec Baldwin (-30 yrs)
Matt Damon (-5 yrs)
Kevin Spacey
Hugh Jackman (yeah!!)
Mel Gibson

Simon Cowell
Marlon Brando
Pierce Brosnan
Tim Allen
Daniel Radcliffe (Abracadabra)
Brendan Fraser
John Malkovich (am)
John Travolta
BD Wong (wait, does his name sound like beady? hehe, like beady eyes?)
Jack Nicholson (-60 yrs)
Clint Eastwood (-70 yrs)
Keanu Reeves
Hugh Laurie (wohoo!)
Anyways, still working on my list of potential ppl to look like.
I think this will really change my life.

oh, heheee, also learned a new word today… “phalloplasty” heheee… ouch… didn’t know this was possible… not that I need it.

What is the impact of the $10k new home buyer credit?

Recently the governator announced that he will sign AB 183 into law. I mentioned to my co-worker that this will create a micro-bubble and raise house prices by $10k for the duration of the tax credit (May 1 to Dec 31, 2010).

Today, he commented to me that the bubble might actually push prices more than $10k higher. Let’s make the simplifying assumption that instead of a tax credit spread over 3 years, the tax credit is actually a cash payment that the government gives you at the point of closing. And also, let us assume, without serious loss of generality, that the home buying reward is given to all home buyers instead of new home buyers.

Then, his argument is: if you put the $10k that the government gives you into the down payment, you actually save more than $10k, because you won’t have to pay interest on that $10k. So the total savings will be

10000*(1.05)^30 = 43219.42

(assuming 5% mortgage interest), so a dumb person may be persuaded to buy a house priced as much as $43219.42 above what its market price would be without the incentive.

Now, let’s take this a step further. Say the interest rate is zero, and the lowest down payment a person can make in order to buy a house is 10%; then this $10k incentive allows him to buy a house that is $100k more expensive than the market price without the incentive. So that gives us a bubble of $100k above normal.
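Both back-of-the-envelope numbers can be checked with a few lines of Python. This is just a sketch of the arithmetic under the same simplifying assumptions (annual compounding at a flat rate, a fixed minimum down payment fraction); the function names are made up for illustration.

```python
def compounded_value(principal, annual_rate, years):
    """Value of principal after compounding once a year at a flat rate."""
    return principal * (1 + annual_rate) ** years


def max_price_increase(extra_down_payment, min_down_fraction):
    """Extra house price that an extra down payment can support."""
    return extra_down_payment / min_down_fraction


# $10k compounded at 5% over a 30-year mortgage.
interest_case = round(compounded_value(10_000, 0.05, 30), 2)

# $10k treated as a 10% down payment at zero interest.
leverage_case = max_price_increase(10_000, 0.10)

print(interest_case, leverage_case)
```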

The psychology of this is well known. A person would probably be satisfied with keeping $5k of the incentive and allowing the seller to take the other $5k, but he would probably not buy the house if he was only able to keep $1k while the seller takes $9k.

Anyways, the true size of the bubble will be known soon enough…

Alternative Great Leap Forward

Often, children of Chinese parents will hear them speak of a mythical Great Leap Forward. This was a gigantic effort to catch China up with America and exceed the United Kingdom. According to the documentary from the Discovery Channel (along with film from the era), the people were asked to cut down forests and melt their pots and pans to create steel. When the forests were exhausted, they went on and burned furniture.

Some people suggested that the Chinese were creating some kind of weapon, but had no defence against American spy planes and satellites. So they created a massive social movement to mask the secretive weapons labs and factories. (heat signature…)

So, let us suppose that this was not the case, and that the whole Great Leap Forward (1958-1961) and Cultural Revolution were a necessary evil. (There could be many reasons. For instance, because the closed-minded West was insanely scared of communism and could not possibly be open to any kind of communication or exchange of technology. Without this kind of exchange, there could not be much progress in China. So… they had to wear down communism’s political power (but without losing to the ROC leadership).) If it was necessary for China to do nothing for a decade or two, what could the 1.3 billion people (or 750 million at that time) have done?

Well, let’s suppose that they can maintain agricultural output, then the people won’t starve.

But let’s suppose that they were indeed stuck in a rut, and that without external stimulus (war, capital, knowledge) it would take many years for things to improve. What could all those people do?

Hmmm, well, they could spend the time popularizing Kung Fu. People could practice it and the Chinese people would become physically strong. In this arena, the Chinese were still fairly advanced, if not the most advanced. Why not make more people practice the martial arts? Strengthen the bodies.

Chinese culture and history are certainly filled with ideals: ethics, morals, virtues, “cultivation”, “self improvement”, “self-sufficiency”, “well rounded man”, “solidarity”, “brotherhood/fraternity”, “orderly society”. They thought about the relationship between the people and the government, their responsibilities and obligations to each other; they levied taxes, wrote laws; they explored man and his relationship with the world and the other creatures of the world.

As much as any western educated person would like to believe in the originality and uniqueness of these ideals, they all existed and were common knowledge in Chinese historical culture, language, and practice.

You don’t believe me?

Think about this then: Do you think the Chinese had sex? Did they have oral sex? Did they have threesomes? Did they have orgies? Did they have gay sex? Did they do it and then write about it? At the risk that some kid in the PRC will be blocked from reading my blog, think about this: Did they do it anally in China five millennia ago? Do they do it today??




Given this commonality in our human history, let us further imagine the Communist government calling upon some of these powerful ideals, fully embedded in the population of China at the time, and saying: “let us strengthen ourselves and practice Kung Fu.” The Communist party was very powerful. It was fully capable of broadly improving its population by leveraging this powerful pre-existing conditioning in the Chinese people.



Second thing is the arts. Granted, the great majority of the population was undereducated, but they still had their local forms of art. Establish art institutions. The dances of each region of China are so beautiful and so different. The songs of each race are so endearing, even if we don’t understand the words. The paintings, the pottery, the architecture of their buildings… the poetry… This was apparently how the Renaissance was started… Some city state (Florence) had a benevolent leader who liked art. He commissioned many painters and sculptors to make art. And that led to all those things that we know to be great about the Renaissance. Mao loved the arts; he himself was a great poet, and surely had great appreciation for the Classics (Chinese Classics), the instruments, chess, calligraphy, and painting. The documentary that I saw put serious emphasis on his dance parties (social, ballroom type, it would appear, or some Chinese dances), and that he often “rested with young girls in his private room.” He clearly had great appreciation for dancing, he clearly appreciated life, and all its beautiful blessings! He could have encouraged the practice of arts beyond the martial arts by giving massive amounts of money to those causes (top down economics…)

Wait, while we’re at it, why didn’t they build roads like America did when it faced severe depression? Build massive road and rail systems. Gain experience by doing; through doing we find our weaknesses. By doing we discover what we need, and necessity is the mother of invention. Why didn’t they build roads? They couldn’t build rails because they didn’t have iron/steel. But they could build roads, right?

Let’s take a turn for the worse…

What else could they have done?

They could have gotten mad.

Build an army twice the size of the Japanese population.
Trained in advanced martial arts.
Invade Japan.

And get even with them.

Why the fuck not?

They didn’t have guns or boats? Learn to build them! I mean, China had been a great ocean-faring power just a few centuries before. It’s not like they couldn’t build big boats. Just get the troops over there, and take over Japan. Why not??

Okay, I know, you are rolling your eyes into your skull.

“The point of the cultural revolution was to get rid of the old stuff” you say, “encouraging the arts and kung fu is the last thing they’d think of”

well… I don’t have to be responsible about this. This is just a thought experiment… had Mao known the outcome, would he have done something different? If he was going to waste money/time, why not waste it on something beautiful and empowering?

Okay, I know, you are still rolling your eyes into your skull for the second time having completed the first 360 rotation.

“They’d die starving, or being invaded. Kung Fu is no match for machine guns and air raids”


sigh… I need to learn more history. No matter how closed the West was, they did teach the Japanese how to make steel, how to make cars. What was it about the Chinese that prevented them from doing the same???


It’s not the language. There are plenty of talented Chinese people who can learn English, French, or Russian. Certainly the leaders of China were almost all students of the West…. I do not believe that language was truly the problem…


And some years from now, if I were to think about this issue, would I think myself silly? OR would I feel the same way?

I guess only time will tell.


Or…. if I die from a Chinese bomb some day, and my suspicion that China had secretly developed weapons and was using the great leap forward to hide heat signatures of weapons factory…

well, then

HA!

carve “I knew that!!” on my tombstone.

The 6-2-10 system, part II

Okay, I’ve had a chance to discuss this with a kindred spirit mate of mine. So, there are some concerns:


Q: what happens to families in this system?
A: Well, the plan is to have the families belong to the same day-cycle. If dad goes to work 6-2, then the plan is for the wife to go to work 6-2 and for the child to go to school 6-2. The schools will multiplex between the shifts. And in fact, this is a most important feature of the 6-2-10 system. It is precisely because the schools are shared between two or three shifts that we can increase the throughput of society. The school building is a limiting resource. We save money/resources by building one building and serving two or three times as many students as the 9-5 system can in the same building.

Another important feature of the 6-2-10 system, highlighted by this question, is that in this system we penalize overtime. The company (and possibly each worker who consumes a multiplexed resource) is charged for overtime. This is necessary to keep spaces and other resources maximally available for the people who have reserved that space or resource for their shift.

If I’m in the 6-2 shift, I am incentivized to leave work at 2pm. If I don’t, both my company and my income may be charged. Company and work culture will change to allocate resource and plan work in such a way that it does not exceed the time allotted. This incentive means people will be more strict with their work time and will be able to spend the extra time with their family.

Q: How do you incentivize the system?
A: Agreed, this is a tough problem. We are in a predominantly 9-5 system. How do we migrate into a 6-2-10 system? Well, because we have doubled the throughput of society, we can actually afford to give an extra weekend day. Three day weekend for all in the system.

To maximize the utility of the space, we may want to shift the weekend days by enforcing a phased weekend.

Fri., Sat., Sun. is weekend
Sat., Sun., Mon. is weekend

This setup allows a family to share Sat. and Sun., but adds an extra day (unshared if desired) so that the additional free time can be spent alone or with family.

Q: Can I work two shifts to make more money?
A: If you or your company finds that this is worth the money then yes. Since we charge for space and other shared resources, you would only do this if you can guarantee high work efficiency for the entire shifts.

Q: Don’t some big companies already use this?
A: Yeah, manufacturing companies do this. The “sweat shops” in China do. Actually their state-run companies also do. Iron smelting, oil refineries, pilots and flight attendants, taxi drivers, …; many industries already apply a similar system, though not as thorough and systematic as what I’m proposing here.


Also, the navies of the world probably operate on shifts. This is also likely the inspiration for the shift system on the Starship Enterprise…


Q: How do you say 6-2-10?
A: The two systems sound alike: some people work “9 to 5” while others work the “6 2 10”. Morning shift, afternoon shift and the graveyard shift.

I wish I knew game theory

I wish I knew game theory.

Here is the problem:

There are conventional companies, secret intelligence agencies, and countries.

Companies belong to one country, and companies can use secret intelligence agencies to spy on any other company, and the country has the ability to regulate how its companies can use its secret intelligence agencies to spy (as in, how much spying it allows its companies to use).

The players are countries and the reward is the net corporate fitness. What is the equilibrium state of this system?

Here are the factors to consider. If a company does not spy, it will obviously fail to anticipate other companies’ products and pricing, and therefore will fail to compete effectively.

If all the company does is to spy, then it loses the ability to be productive and create original products or form independent product lines. When the company loses enough capability as it spends more and more time copying and coping with competitors, then it dies.

Here are some other possibilities: What if the company only spied on foreign companies? Or the country directs the spying activities to manipulate which company succeeds and which one doesn’t?

What are the optimal strategies for countries?

What if companies can choose to accept spying offered by the country or not, what would be their optimal strategies?

…. damn it, wish I had paid more attention in my game theory class in college.

书到用时方恨少 ya. (“Only when you need to use what you’ve read do you regret having read so little.”)

The 6-2-10 system–How to Deal with the Chinese Population Problem

Star Trek VII is on Hulu right now… I got very excited… watched for a while at work, and remembered that I hate Star Trek. For some reason, through four decades of the franchise, there has never been a Chinese person on the show. Given that the population is a quarter of humanity now, …, sigh, …, maybe the sci-fi writers of Star Trek decided that WW III wiped the Chinese out?

And the Chinese space program continues separately from the “International Effort”…. sad..

Anyways, back to my main point. In my youth, while I was still dreaming of a peaceful, advanced future where humans can face all challenges because of our ingenuity and humanity…, I once had an idea.

So, China is “over populated” and most people are forced to retire around the age of 50… just when they have accumulated experience.

So, a fairly naive idea is to take the day, split it into three 8 hour segments.

6am-2pm
2pm-10pm
10pm-6am

and have people put into phased days. Some people have a living schedule such that they work 6-2, others 2-10, and a few 10pm to 6am. Let’s refer to this as the 6-2-10 system, as compared to the 9-5 system.
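As a tiny sketch of the three phases, here is a lookup from hour of day to the shift on duty. The names `SHIFTS` and `shift_for_hour` are made up for illustration; the only real input is the 6am/2pm/10pm boundaries above.

```python
# (name, start hour, end hour) on a 24-hour clock; the last shift wraps midnight.
SHIFTS = [("6-2", 6, 14), ("2-10", 14, 22), ("10-6", 22, 6)]


def shift_for_hour(hour):
    """Return the name of the shift on duty at the given hour (0-23)."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    for name, start, end in SHIFTS:
        if start < end:
            if start <= hour < end:
                return name
        elif hour >= start or hour < end:
            # Overnight shift: covers 22-23 and 0-5.
            return name
    raise AssertionError("shifts should cover all 24 hours")


print([shift_for_hour(h) for h in (6, 13, 14, 22, 3)])
```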

(And this goes for universities, most companies, manufacturing… the only thing that it wouldn’t work for is probably farmers…, which, well, that’s for a separate blog)

They would have to build office space (and possibly living quarters) such that people can share the same office space during the three phases. But once this is done, the only thing left is to coordinate traffic so that people commuting to and from work can pass freely. And this should also be easy, because fewer people will be going from home to work and from work to home at any one time than in a 9-5 system. And when they share roads, the roads will be utilized better because both directions, to and from work, will be used simultaneously, instead of everyone going to work and then everyone coming from work.

And Japan, and Korea, anywhere there are large, dense populations, this system can be applied, and suddenly we lessen the crowd, and people will be allowed to work past their retirement age in China.

Also, in the 6-2-10 system, there won’t be a need for daylight saving time. The entire daylight will be used. Granted, some extra resources will be consumed to light offices and schools to allow people to see from 6-8ish am, and from 6ish pm to 10pm. But that is a solvable problem.

I guess, grudgingly, I should credit the show Star Trek for making me aware of this possibility. There was actually one episode where they switch from a 3-cycle day to a 4-cycle day to increase the efficiency of those working. (because they work shorter but more intensely) Even though they don’t like the Chinese, they may have inadvertently solved a problem for them.

What are some problems with the 6-2-10 system?

I guess one obvious one is that the society will be segregated into three segments. The effect on society is unfathomable.

Also, how would you either assign or choose which period you worked in? (What would be fair? What would work well?)

How would you deal with the extra resources (electricity) needed to make this happen? How would you transition into this system from a standard 9-5 system?

When would you collect the garbage if an office space is occupied 24 hours a day?

We don’t have this problem in America of course… But Asia certainly faces this problem.

And, I’d like to claim that the 6-2-10 system is far superior to the 1-child policy as a means to deal with the “Chinese Population Problem”.

That foreign concept of prioritization

In my earlier youth, I once heard my childhood friend mention to me that he knows the solution to all my problems:

“You have to learn to prioritize, Huan!”

he said, full of confidence…. I heard this for the first time in my life. He was speaking of my not having a gf in college. He’s much more westernized than I am, dating multiple women at once, clubbing with a different one every night…

“What’s important in your life? You have to prioritize your life.”

Later, I started to realize that prioritization should not be so foreign to me… Even if he is more westernized in his womanizing ways, he has a firmer grasp of the Chinese culture than I. The Chinese culture is full of hierarchies and prioritizations.

One obvious one is this list of things one is to do in sequence:

…修身 齐家 治国 平天下…

from 《礼记·大学》 (the “Great Learning” chapter of the Book of Rites). It says one is to exercise the body, get a wife, serve the country, and bring peace to the world, in that order. The ancient scholars believed that this was the right approach to life, and that each preceding action is a prerequisite for the action following it. I.e., you must have a good body in order to get a wife, and you must have a wife before you can serve the country.

Of course, one can microscopically argue that this is not exactly true. But, in the large, this statement of dependency and lifestyle is what people (western or eastern) actually live.

… What made me realize this today was when I explained to co-workers how Chinese men will take their wife home on Chinese New Year’s Eve, then go to her home the next day, then go to other relatives’ homes in order of age over the following fifteen days to bid them new year wishes of fortune and health…

It suddenly dawned on me. I think this way very explicitly… despite my not having realized the fact explicitly…

Of course, I’m not arguing that this is for the best or not. My childhood friend is doing well in life, better than myself. But ultimately, I guess, self awareness is what I have gained.

What should the PRC do with Tibet?

I once heard a very interesting discussion (prc folks as it will become very obvious), about how the PRC can advance Tibet’s economy.

It is known to many people that Tibet is located at a very high elevation. There is really nothing there except for mountains, snow, and lots of UV rays. The people discussing the subject suddenly came upon a suggestion

“We should legalize gambling and install casinos in Tibet”

and it suddenly dawned on me… wait! that’s such a brilliant idea. Certainly the United States was able to create an artificial economy in Nevada, which is similarly barren….

“We’d have to ship in some prettier women…”
“Yeah! I’ve never seen a pretty Tibetan woman before…”
“…”

the room exploded in discussions like a bag of popcorn in a microwave.

“We can also put lots of solar panels up there, because it’s higher elevation, there’s less air to block the sun’s rays… we can generate lots of electricity up there”

“… And google can put computers up there, so they don’t have to spend money on cooling…”

(obviously many google employees amongst the crowd)

“… and we could make a nuclear waste dump, and charge a per annum fee to the US and EU to store their nuclear waste…, until the material is no longer dangerously radioactive”

Personally I think these are great ideas. But as an educated person, I have to pause and consider the Tibetan people. President Obama is about to meet the Dalai Lama (despite PRC protest).

In this time of continued economic crisis, it’s surprising that he continues to experience pressure to meet the Dalai Lama. Are there lobbyists who have spare money to make this happen right now?

Okay, so, motivation aside, what is an outcome to the PRC-Dalai-Lama relationship that would be favorable to the United States of America?

A.)
The Dalai Lama incites civil war in the PRC. China spends lots of money fighting an internal war… thus creating the Chinese military-industrial complex.

Yikes!

B.)
America goes in with commandos and installs a military to support the Tibetan independence war. America gets its stimulus, and the economy recovers.

The PRC is sure to respond… this is a war that we probably want to avoid forever.

C.)
The Dalai Lama is shamed into “kowtow”ing to the PRC, and becomes imprisoned in Beijing.

Big deal. The Tibetan religion won’t die. The Tibetan people will live on, and another religious leader will be born in their religious process. One person suffers a little bit.

D.)
Everything remains status quo. Obama meets the Dalai Lama, but neither says anything material, and the event evaporates quickly.

The Chinese may harbor ill-will toward Obama. Or worse, they will probably attempt to mess with the American economy within their currently limited ways… delaying “the American economic recovery” by at most 6 months. Not so bad an outcome… Obama may still win re-election if the economy recovers.

E.)
America proactively creates a peaceful resolution to the conflict between the Dalai Lama and the PRC. This may include Obama giving financial incentive (maybe indirectly) for the Dalai Lama to engage in dialog and to be conciliatory in his activities…

This would be great for the PRC. And Obama may win a second Nobel Peace prize…

F.)
Same as E.) but Obama gets the $ from the PRC.

Best solution yet. The United States looks great: solves an unsolvable problem, promotes human rights, equality, the American way of life, AND gets paid for doing it.

Giddyup!!!!