re-repigged

So, here’s the solution to the question in my previous post.

The most obvious way is to perform a partial aggregate and then merge the results.

A = group TABLE by (f1,f2,f3, ((long)(random()*100));
B = foreach A generate FLATTEN(group), SUM(f4) as sf4, MIN(f5) as mf5, MAX(f6) as mf6;
C = group B by (group::f1, group::f2, group::f3);
D = foreach C generate FLATTEN(group), SUM(f4) as sum_of_f4, MIN(mf5) as min_of_f5, MAX(mf6) as max_of_f6;

simple and effective. Does not require extra programming in java at all. This algorithm can be applied to all associative reducing functions. But associativity is not a necessary condition. The Algebraic functions in Pig actually allows for more general implementation where by an operation is divided into two stages. The first stage is required to produce tuples, which the second stage then processes.

You will probably never meet a problem that produces a bag larger than one compute node can handle. I mean I work for one of the fastest growing internet eCommerce servicing companies and our processing never even come close to seeing this problem.

But, in case in some distant future, when you encounter bags in computation that overflow one or even two compute nodes, and even at that time Pig/Hadoop still does not have a built in mechanism to deal with this automatically. So the above is an workaround.

Hope this helps you in your work or play when I’m long gone and dead.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s