map reduce is so repigged

Okay, so here’s a problem I ran into today. In Pig Latin, I needed to calculate the following:

A = group TABLE by (f1,f2,f3);
B = foreach A generate group, SUM(f4), MIN(f5), MAX(f6), (f7 is null)?1:0;

My problem is that in the data I have (about 100 GB) there are actually only about a dozen distinct combinations of (f1,f2,f3), so each reducer receives a huge share of the rows and the execution crashes with the reducer running out of memory.

The question is: is it possible to perform this calculation even with such sparse keys?
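One direction worth sketching: SUM, MIN and MAX can each be computed incrementally, so in principle no process ever needs to hold all the rows for a key at once. The plain-Python sketch below (not Pig, and not a tested fix for the job above) illustrates that idea, which is the one behind a MapReduce combiner: each input split is folded into one small accumulator per key, and only those accumulators are merged at the end. The `Row` layout, the function names, and the reading of the `f7` expression as "count the rows where f7 is null" are all assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical row layout mirroring f1..f7 from the Pig script above.
Row = namedtuple("Row", "f1 f2 f3 f4 f5 f6 f7")


def partial_aggregate(rows):
    """Map side: fold one split into a small accumulator per (f1, f2, f3) key.

    Each accumulator is (sum of f4, min of f5, max of f6, count of null f7).
    The null count is an assumed interpretation of the f7 expression.
    """
    acc = {}
    for r in rows:
        key = (r.f1, r.f2, r.f3)
        is_null = 1 if r.f7 is None else 0
        if key not in acc:
            acc[key] = (r.f4, r.f5, r.f6, is_null)
        else:
            total, lo, hi, nulls = acc[key]
            acc[key] = (total + r.f4, min(lo, r.f5), max(hi, r.f6), nulls + is_null)
    return acc


def merge(partials):
    """Reduce side: combine the per-split accumulators, never the raw rows."""
    out = {}
    for partial in partials:
        for key, (total, lo, hi, nulls) in partial.items():
            if key not in out:
                out[key] = (total, lo, hi, nulls)
            else:
                t0, lo0, hi0, n0 = out[key]
                out[key] = (t0 + total, min(lo0, lo), max(hi0, hi), n0 + nulls)
    return out
```

With only a dozen keys, each split contributes at most a dozen accumulators, so the memory needed at the merge step depends on the number of keys and splits rather than on the 100 GB of input.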
