
Hadoop Streaming in Practice: aggregate

2013-10-27

1. Overview of aggregate
aggregate is a package that ships with Hadoop for common computation and aggregation tasks. The package documentation describes it as follows:
Generally speaking, in order to implement an application using the Map/Reduce model, the developer needs to implement Map and Reduce functions (and possibly a Combine function). However, for a lot of applications related to counting and statistics computing, these functions have very similar characteristics. This package implements those patterns. In particular, the package provides a generic mapper class, a reducer class and a combiner class, and a set of built-in value aggregators. It also provides a generic utility class, ValueAggregatorJob, that offers a static function that creates map/reduce jobs.
In Streaming, the Aggregate package is typically used as the reducer to compute aggregate statistics.

2. aggregate class summary

DoubleValueSum: This class implements a value aggregator that sums up a sequence of double values.
LongValueMax: This class implements a value aggregator that maintains the maximum of a sequence of long values.
LongValueMin: This class implements a value aggregator that maintains the minimum of a sequence of long values.
LongValueSum: This class implements a value aggregator that sums up a sequence of long values.
StringValueMax: This class implements a value aggregator that maintains the biggest of a sequence of strings.
StringValueMin: This class implements a value aggregator that maintains the smallest of a sequence of strings.
UniqValueCount: This class implements a value aggregator that dedupes a sequence of objects.
UserDefinedValueAggregatorDescriptor: This class implements a wrapper for a user defined value aggregator descriptor.
ValueAggregatorBaseDescriptor: This class implements the common functionalities of the subclasses of ValueAggregatorDescriptor class.
ValueAggregatorCombiner<K1 extends WritableComparable,V1 extends Writable>: This class implements the generic combiner of Aggregate.
ValueAggregatorJob: This is the main class for creating a map/reduce job using Aggregate framework.
ValueAggregatorJobBase<K1 extends WritableComparable,V1 extends Writable>: This abstract class implements some common functionalities of the generic mapper, reducer and combiner classes of Aggregate.
ValueAggregatorMapper<K1 extends WritableComparable,V1 extends Writable>: This class implements the generic mapper of Aggregate.
ValueAggregatorReducer<K1 extends WritableComparable,V1 extends Writable>: This class implements the generic reducer of Aggregate.
ValueHistogram: This class implements a value aggregator that computes the histogram of a sequence of strings.

 

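3. Using aggregate in Streaming
Prefix each line of the mapper output with the name of the aggregator function, in this form (\t is a tab character):
function:key\tvalue
For example:
LongValueSum:key\tvalue
In addition, set -reducer aggregate. The reducer then applies the aggregate class named by function to the values that share a key. For example, with function set to LongValueSum, the values of each key are summed.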
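The record format described above (aggregator name, a colon, the key, a tab, then the value) can be sketched as a pair of helpers. This is only an illustration of the line layout, not code from Hadoop: the names AggregateRecord, formatRecord and parseRecord are hypothetical and exist only for this example.

```cpp
#include <cassert>
#include <string>

// One mapper output record in the "function:key<TAB>value" layout.
// (Hypothetical helper type for illustration; not part of Hadoop.)
struct AggregateRecord {
    std::string function;  // aggregator name, e.g. "LongValueSum"
    std::string key;
    std::string value;
};

// Builds the line a streaming mapper would print for the aggregate reducer.
std::string formatRecord(const AggregateRecord& r) {
    // The separators must not appear inside the fields they delimit.
    assert(r.function.find(':') == std::string::npos);
    assert(r.key.find('\t') == std::string::npos);
    return r.function + ":" + r.key + "\t" + r.value;
}

// Splits a line back into its three parts. Returns false if the ':' or
// the tab separator is missing, leaving `out` untouched.
bool parseRecord(const std::string& line, AggregateRecord& out) {
    const std::string::size_type colon = line.find(':');
    if (colon == std::string::npos) return false;
    const std::string::size_type tab = line.find('\t', colon + 1);
    if (tab == std::string::npos) return false;
    out.function = line.substr(0, colon);
    out.key = line.substr(colon + 1, tab - colon - 1);
    out.value = line.substr(tab + 1);
    return true;
}
```

A mapper would print formatRecord(...) once per input row; parseRecord only mirrors how such a line splits back into function, key and value.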

4. Example 1 (summing values)
Test file test.txt:

a       15      1
a       17      1
a       18      1
a       19      1
a       19      1
a       19      1
a       19      1
b       20      1
c       15      1
c       15      1
d       16      1
a       16      1

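Mapper program:

#include <iostream>
#include <string>

using namespace std;

int main(int argc, char** argv)
{
        string a, b, c;
        // Each input line has three whitespace-separated columns.
        while (cin >> a >> b >> c)
        {
                // Emit "LongValueSum:key<TAB>value"; the prefix tells the
                // aggregate reducer to sum the values of each key.
                cout << "LongValueSum:" << a << "\t" << b << endl;
        }
        return 0;
}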

Run:
$ hadoop streaming -input /app/test.txt -output /app/test -mapper ./mapper -reducer aggregate -file mapper -jobconf mapred.reduce.tasks=1 -jobconf mapred.job.name="test"
Output:
a       142
b       20
c       30
d       16
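The sums above can be checked without a cluster. The following is a minimal local sketch, assuming the three-column layout of test.txt; it mimics what the mapper plus LongValueSum compute together and is not Hadoop code. The names kTestInput and sumByKey are made up for this example.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// The rows of test.txt from this example, one "key value flag" row per line.
const char* const kTestInput =
    "a 15 1\na 17 1\na 18 1\na 19 1\na 19 1\na 19 1\n"
    "a 19 1\nb 20 1\nc 15 1\nc 15 1\nd 16 1\na 16 1\n";

// Sums the second column per first-column key, which is the result the
// streaming job produces when the mapper emits "LongValueSum:key\tvalue".
std::map<std::string, long long> sumByKey(const std::string& input) {
    std::map<std::string, long long> sums;
    std::istringstream in(input);
    std::string key, flag;
    long long value = 0;
    while (in >> key >> value >> flag) {
        sums[key] += value;
    }
    // Stopping before end of input would mean a malformed row.
    assert(in.eof());
    return sums;
}
```

Calling sumByKey(kTestInput) yields one entry per key, in sorted key order, which matches the layout of the job output.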

 

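5. Example 2 (the powerful ValueHistogram)
ValueHistogram is the most powerful class in the aggregate package. For each key, it counts how many times each distinct value occurs and reports the following statistics over those counts:
1) number of unique values
2) minimum count
3) median count
4) maximum count
5) average count
6) standard deviation
Building on the example above, change mapper.cpp to: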

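#include <iostream>
#include <string>

using namespace std;

int main(int argc, char** argv)
{
        string a, b, c;
        // Each input line has three whitespace-separated columns.
        while (cin >> a >> b >> c)
        {
                // Emit "ValueHistogram:key<TAB>value"; the prefix selects
                // the histogram aggregator for the values of each key.
                cout << "ValueHistogram:" << a << "\t" << b << endl;
        }
        return 0;
}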

Run with the same command as above.
Result:
a       5       1       1       4       1.6     1.2
b       1       1       1       1       1.0     0.0
c       1       2       2       2       2.0     0.0
d       1       1       1       1       1.0     0.0
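To see where these six columns come from, here is a minimal sketch that recomputes them for the values of one key. It is a local illustration, not Hadoop's implementation. Two points are assumptions chosen to agree with the output shown here: the median is taken as the middle element of the sorted counts, and the standard deviation divides by the number of distinct values (population form). HistogramReport and summarize are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// The six figures reported per key (hypothetical struct for illustration).
struct HistogramReport {
    std::size_t uniqueValues;
    long minCount;
    long medianCount;
    long maxCount;
    double averageCount;
    double stdDevCount;
};

// Counts how often each distinct value occurs, then summarizes the counts.
HistogramReport summarize(const std::vector<std::string>& values) {
    assert(!values.empty());  // a key always has at least one value

    std::map<std::string, long> freq;
    for (const std::string& v : values) ++freq[v];

    std::vector<long> counts;
    for (const auto& kv : freq) counts.push_back(kv.second);
    std::sort(counts.begin(), counts.end());

    double sum = 0.0;
    for (long c : counts) sum += c;
    const double avg = sum / counts.size();

    double squares = 0.0;
    for (long c : counts) squares += (c - avg) * (c - avg);

    HistogramReport r;
    r.uniqueValues = counts.size();
    r.minCount = counts.front();
    r.medianCount = counts[counts.size() / 2];  // assumption: middle element
    r.maxCount = counts.back();
    r.averageCount = avg;
    r.stdDevCount = std::sqrt(squares / counts.size());  // assumption: population form
    return r;
}
```

For key a the values are 15, 17, 18, 19, 19, 19, 19, 16, so the per-value counts are 1, 1, 1, 1, 4. That gives 5 unique values, counts ranging from 1 to 4 with median 1, an average of 8/5 = 1.6, and a standard deviation of sqrt(7.2/5) = 1.2, which matches the first output row.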

 

References:
http://hadoop.apache.org/common/docs/r0.15.2/api/index.html?org/apache/hadoop/mapred/lib/aggregate/package-summary.html
Book: Hadoop实战
