Okay, so here’s a problem I ran into today. In Pig Latin, I needed to calculate the following:
A = group TABLE by (f1,f2,f3);
B = foreach A generate group, SUM(f4), MIN(f5), MAX(f6), (f7 is null)?1:0;
My problem is that in my data (about 100 GB) there are only about a dozen distinct combinations of (f1,f2,f3), so the execution crashes with the reducer running out of memory.
The question is: is it possible to make this calculation even with such sparse keys?