Saturday, April 21, 2012

Implement Word Count in Accumulo

It seems like the first map-reduce program that everyone tries is counting words. That first program reads a piece of text, uses the mapper to tokenize it, and outputs a "1" for each token. The reducer then adds up the "1" values to produce the word counts.
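For comparison, here is a minimal single-process sketch of that classic word count in Python. It is not a real map-reduce job; the function names `map_words` and `reduce_counts` are made up for illustration, and tokenizing on whitespace is an assumption.

```python
from collections import defaultdict

def map_words(text):
    """Mapper: tokenize on whitespace and emit a (word, 1) pair per token."""
    return [(token, 1) for token in text.split()]

def reduce_counts(pairs):
    """Reducer: add up the 1s emitted for each word."""
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

# Hypothetical input text, chosen to match the names used later in this post.
word_counts = reduce_counts(map_words("Robert Parker Parker Robert Parker"))
print(word_counts)
```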

Accumulo provides the same functionality without needing to write a single line of code by using a SummingCombiner iterator. Below is a complete example.

Actually, this example is more powerful, because the same setup can be used to sum across any time dimension: the column family holds the time bucket, which is a day in this case.

This example shows how to sum across days. First, start the Accumulo shell. Then follow these steps:

> createtable --no-default-iterators wordtrack
wordtrack> setiter -t wordtrack -p 10 -scan -minc -majc -class org.apache.accumulo.core.iterators.user.SummingCombiner
SummingCombiner interprets Values as Longs and adds them together.  A variety of encodings (variable length, fixed length, or string) are available
----------> set SummingCombiner parameter all, set to true to apply Combiner to every column, otherwise leave blank. if true, columns option will be ignored.: true
----------> set SummingCombiner parameter columns, <col fam>[:<col qual>]{,<col fam>[:<col qual>]} escape non-alphanum chars using %<hex>.: 
----------> set SummingCombiner parameter lossy, if true, failed decodes are ignored. Otherwise combiner will error on failed decodes (default false): <TRUE|FALSE>: 
----------> set SummingCombiner parameter type, <VARLEN|FIXEDLEN|STRING|fullClassName>: STRING

Insert records for a daily rollup.

wordtrack> insert "Robert" "2011.Nov.12" "" 1
wordtrack> insert "Robert" "2011.Nov.12" "" 1
wordtrack> insert "Parker" "2011.Nov.12" "" 1
wordtrack> insert "Parker" "2011.Nov.12" "" 1
wordtrack> insert "Parker" "2011.Nov.12" "" 1
wordtrack> insert "Parker" "2011.Nov.23" "" 1
wordtrack> scan
Parker 2011.Nov.12: []    3
Parker 2011.Nov.23: []    1
Robert 2011.Nov.12: []    2

Get all counts for a given day:

wordtrack> scan -c 2011.Nov.12
Parker 2011.Nov.12: []    3
Robert 2011.Nov.12: []    2
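To make the scans above concrete, here is a rough Python model of what the combiner is doing for this table: values that share the same row, column family, and column qualifier are decoded from strings, added, and re-encoded. This is only a sketch of the observable behavior, not Accumulo code; `summing_scan` and its `column_family` argument are hypothetical names standing in for `scan` and `scan -c`.

```python
from collections import defaultdict

# The six inserts from the shell session:
# (row, column family, column qualifier, value).
inserts = [
    ("Robert", "2011.Nov.12", "", "1"),
    ("Robert", "2011.Nov.12", "", "1"),
    ("Parker", "2011.Nov.12", "", "1"),
    ("Parker", "2011.Nov.12", "", "1"),
    ("Parker", "2011.Nov.12", "", "1"),
    ("Parker", "2011.Nov.23", "", "1"),
]

def summing_scan(entries, column_family=None):
    """Sum STRING-encoded values per (row, family, qualifier) key.

    If column_family is given, only entries in that family are kept,
    loosely mirroring `scan -c <family>`.
    """
    totals = defaultdict(int)
    for row, cf, cq, value in entries:
        if column_family is not None and cf != column_family:
            continue
        totals[(row, cf, cq)] += int(value)
    # Return entries in sorted key order, with values re-encoded as strings.
    return [(key, str(total)) for key, total in sorted(totals.items())]

all_counts = summing_scan(inserts)
nov12_counts = summing_scan(inserts, column_family="2011.Nov.12")

for (row, cf, cq), total in all_counts:
    print(row, cf, total)
```

Because the day lives in the column family, switching to weekly or monthly rollups in this model only means writing a different string into that field.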

Let's talk about that "--no-default-iterators" parameter for a moment. Normally, Accumulo configures a default iterator (the VersioningIterator) on a new table, which keeps only one value, the one with the latest timestamp, for each unique row/column family/column qualifier combination. If you leave that iterator in place, your counters will essentially get reset to one each time a compaction is done.
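The difference between the two behaviors can be shown with a toy model in Python. This is only an illustration of "keep the latest version" versus "sum all versions" for a single key; it does not model Accumulo's iterator priorities or compaction scopes, and the timestamps and function names are made up.

```python
# Three versions of the same key, as (timestamp, value) pairs.
# The timestamps are invented for this example.
versions = [(100, "1"), (101, "1"), (102, "1")]

def keep_latest(entries):
    """Keep-one-version behavior: only the newest value survives."""
    newest = max(entries, key=lambda entry: entry[0])
    return newest[1]

def sum_all(entries):
    """Summing behavior: decode every version, add them, re-encode."""
    return str(sum(int(value) for _, value in entries))

latest_only = keep_latest(versions)
summed = sum_all(versions)
print("latest only:", latest_only)
print("summed:", summed)
```

In this model, three inserts of "1" collapse to a single "1" under the keep-latest rule, while the summing rule preserves the count of 3.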