Saturday, November 23, 2013

Reading Accumulo Metadata Table to Learn How Many Entries Are In Each Tablet.

After compacting the table, you can run the following program to learn how many entries are in each tablet. Accumulo does a nice job of splitting tables by byte size, but if you have small records then it's fairly easy to run into the "Curse of the Last Reducer!" I've run into situations where some tablets have 50K entries and others have 50M.

package com.affy;

import java.io.IOException;
import java.io.InputStream;
import java.util.Map.Entry;
import java.util.Properties;
import org.apache.accumulo.core.Constants;
import org.apache.accumulo.core.client.AccumuloException;
import org.apache.accumulo.core.client.AccumuloSecurityException;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Instance;
import org.apache.accumulo.core.client.IsolatedScanner;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.TableNotFoundException;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.impl.Tables;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.KeyExtent;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

public class GetEntryCountForTable {

    public static void main(String[] args) throws IOException, AccumuloException, AccumuloSecurityException, TableNotFoundException {

        String accumuloTable = "tableA";

        // Load the connection settings from a properties file on the classpath.
        Properties prop = new Properties();
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        InputStream in = loader.getResourceAsStream("");
        prop.load(in);
        in.close();

        String user = prop.getProperty("accumulo.user");
        String password = prop.getProperty("accumulo.password");
        String instanceInfo = prop.getProperty("accumulo.instance");
        String zookeepers = prop.getProperty("accumulo.zookeepers");

        Instance instance = new ZooKeeperInstance(instanceInfo, zookeepers);

        Connector connector = instance.getConnector(user, new PasswordToken(password));

        String tableId = Tables.getNameToIdMap(instance).get(accumuloTable);

        Scanner scanner = new IsolatedScanner(connector.createScanner(Constants.METADATA_TABLE_NAME, Constants.NO_AUTHS));
        // Only the data file entries carry "size,entries" values.
        scanner.fetchColumnFamily(Constants.METADATA_DATAFILE_COLUMN_FAMILY);
        scanner.setRange(new KeyExtent(new Text(tableId), null, null).toMetadataRange());

        // Use long totals so large tables do not overflow an int.
        long fileSize = 0;
        long numEntries = 0;
        // After a full compaction each tablet has one file, so one file entry is one tablet.
        int numSplits = 0;
        for (Entry<Key, Value> entry : scanner) {
            String value = entry.getValue().toString();
            String[] components = value.split(",");
            fileSize += Long.parseLong(components[0]);
            numEntries += Long.parseLong(components[1]);
            numSplits++;
        }

        long average = numSplits == 0 ? 0 : numEntries / numSplits;

        System.out.println(String.format("fileSize: %,d", fileSize));
        System.out.println(String.format("numEntries: %,d", numEntries));
        System.out.println(String.format("average: %,d", average));
    }
}
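
The totals and the average hide the skew I was complaining about, so here is a minimal sketch of how the same "size,entries" values could be summarized as a minimum, maximum, and average entry count. It runs without a cluster: the class name TabletSkew and the sample values in main are made up for illustration, and it assumes one file per tablet (true right after a full compaction).

```java
import java.util.Arrays;
import java.util.List;

/** Summarizes entry counts parsed from metadata file values of the form "size,entries". */
final class TabletSkew {

    /** Returns the entry count from a "size,entries" value; any later fields are ignored. */
    static long parseEntries(String metadataValue) {
        String[] components = metadataValue.split(",");
        return Long.parseLong(components[1].trim());
    }

    /** Returns {min, max, average} of the entry counts; the list must not be empty. */
    static long[] summarize(List<String> metadataValues) {
        if (metadataValues.isEmpty()) {
            throw new IllegalArgumentException("no metadata values given");
        }
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        long total = 0;
        for (String value : metadataValues) {
            long entries = parseEntries(value);
            min = Math.min(min, entries);
            max = Math.max(max, entries);
            total += entries;
        }
        return new long[] {min, max, total / metadataValues.size()};
    }

    public static void main(String[] args) {
        // Made-up sample values standing in for what the metadata scanner returns.
        List<String> values = Arrays.asList(
                "1200000,50000", "900000000,50000000", "4800000,200000");
        long[] summary = summarize(values);
        System.out.println(String.format("min: %,d max: %,d average: %,d",
                summary[0], summary[1], summary[2]));
    }
}
```

A large gap between the minimum and the maximum is the sign that a few tablets are doing most of the work.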


Friday, November 22, 2013

How to Run Accumulo Continuous Testing (well ... some of them)

Accumulo comes with a lot of tests. This note is about the scripts in the test/system/continuous directory. The README is very descriptive so there is no need for me to discuss what the tests do. I'm just doing a show and tell.

After creating an Accumulo cluster, you'll ssh to the master node to install Parallel SSH (pssh).
  1. Start an Accumulo cluster using
  2. vagrant ssh master
  3. cd ~/accumulo_home/software
  4. git clone
  5. cd parallel-ssh
  6. sudo python install
Now you can run the continuous programs. I've created the editable files so you can just copy my versions (step two below). The script starts ingest processes on the slave nodes, which was not immediately obvious to me. Watch http://affy-master:50095/ to see the ingest rate. When you've got enough entries, run the
  1. cd ~/accumulo_home/bin/accumulo/test/system/continuous
  2. cp /vagrant/files/config/accumulo/continuous/* .
  3. ./
  4. ./
The figure below shows the ingest rate running two nodes on my MacBook Pro inside VirtualBox. My setup won't win any speed awards!

The next scripts we'll run are the walker scripts. They walk the entries produced by the ingest script. The output from the walker scripts is found on the slave nodes in the /home/vagrant/accumulo_home/bin/accumulo/test/system/continuous/logs directory. Watch http://affy-master:50095/ to see the scan rate.
  1. ./
  2. ./
Below is an example of the scan rate.

And finally there is the verify script, which took about 15 minutes to run on my setup. You can visit http://affy-master:50030/jobtracker.jsp to see the job running.

  1. ./

Saturday, November 16, 2013

Using Pig with Accumulo (building on Jason Trost's work) shows how to use Accumulo as a simple source for Apache Pig. (By the way, I haven't tested writing to Accumulo yet.)

After the quick setup, you'll be able to read from Accumulo using a script like the following. The '\' character represents a line continuation. Be careful with your security because the password is sent in plain text and is stored in your history buffer. This should probably be changed to use a property file.

register /home/vagrant/accumulo_home/bin/accumulo/lib/accumulo-core.jar
register /home/vagrant/accumulo_home/bin/accumulo/lib/accumulo-fate.jar
register /home/vagrant/accumulo_home/bin/accumulo/lib/accumulo-trace.jar
register /home/vagrant/accumulo_home/bin/accumulo/lib/libthrift.jar
register /home/vagrant/accumulo_home/bin/zookeeper/zookeeper-3.4.5.jar
register /vagrant/accumulo-pig/target/accumulo-pig-1.4.0.jar

DATA = LOAD 'accumulo://people?instance=instance&user=root&password=secret\
&zookeepers=affy-master:2181&columns=attribute' \
using org.apache.accumulo.pig.AccumuloStorage() AS (row, cf, cq, cv, ts, val);
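
As a rough sketch of the property file idea, the LOAD location could be assembled from the same accumulo.* properties used by my Java programs instead of being typed by hand. The class name PigLoadUri is made up for illustration, and the code assumes the property values contain no '&' or '=' characters because it does no escaping.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

/** Builds the accumulo:// location string for the Pig LOAD statement from properties. */
final class PigLoadUri {

    /** Returns the named property, failing loudly if it is missing. */
    static String required(Properties prop, String key) {
        String value = prop.getProperty(key);
        if (value == null) {
            throw new IllegalArgumentException("missing property: " + key);
        }
        return value;
    }

    /** Assembles the location; values are assumed to contain no '&' or '=' characters. */
    static String build(String table, String columns, Properties prop) {
        return "accumulo://" + table
                + "?instance=" + required(prop, "accumulo.instance")
                + "&user=" + required(prop, "accumulo.user")
                + "&password=" + required(prop, "accumulo.password")
                + "&zookeepers=" + required(prop, "accumulo.zookeepers")
                + "&columns=" + columns;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to a properties file holding the accumulo.* settings.
        Properties prop = new Properties();
        try (InputStream in = new FileInputStream(args[0])) {
            prop.load(in);
        }
        System.out.println(build("people", "attribute", prop));
    }
}
```

The password still ends up in the generated location string, so this only keeps it out of the script file and the shell history. The printed string could then be handed to the script through Pig's parameter substitution rather than pasted in.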


Tuesday, November 12, 2013

Using Accumulo Proxy From Python

Start the Proxy Server
  1. Start an Accumulo cluster using
  2. vagrant ssh master
  3. cd /home/vagrant/accumulo_home/bin/accumulo/proxy
  4. edit so that instance=instance and zookeepers=affy-master:2181
  5. accumulo proxy -p
Install Thrift
  1. cd /home/vagrant/software
  2. Download the thrift gz from
  3. sudo apt-get install -y libboost-dev libboost-test-dev libboost-program-options-dev libevent-dev automake libtool flex bison pkg-config g++ libssl-dev
  4. sudo apt-get install -y ruby-full ruby-dev librspec-ruby rake rubygems libdaemons-ruby libgemplugin-ruby mongrel
  5. sudo apt-get install -y python-dev python-twisted
  6. sudo apt-get install -y libbit-vector-perl
  7. tar xvfz thrift-0.9.1.tar.gz
  8. cd thrift-0.9.1
  9. ./configure
  10. make
  11. sudo make install
  12. thrift -version
  13. cd lib/py
  14. sudo python install
  15. cd /home/vagrant/software
  16. thrift --gen py $ACCUMULO_HOME/proxy/thrift/proxy.thrift
  17. cd /home/vagrant/accumulo_home/software/accumulo
  18. export PYTHONPATH=/home/vagrant/accumulo_home/gen-py
  19. python proxy/examples/python/   


Monday, November 11, 2013

Watching Accumulo Heal Itself

1. Start a cluster with a master and two nodes using

2. Visit http://affy-master:50095/ to verify that two tablet servers are running.

3. Enable auto-refresh.

4. Run 'vagrant destroy slave1'

5. Visit http://affy-master:50095/. In a minute or two you should see one Dead Tablet Server. You'll also see messages on the Recent Logs page.

6. Run 'vagrant up slave1'

7. Run 'vagrant ssh slave1 -c /vagrant/files/' to re-establish SSH public keys.

8. Run 'vagrant ssh master -c "accumulo_home/bin/accumulo/bin/"' to re-start the Accumulo processes on slave1.

9. Visit http://affy-master:50095/. In a minute or two you should have both Tablet Servers alive and responding to requests.
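
If you would rather not sit and refresh the monitor page, the wait in steps 5 and 9 can be sketched as a small polling loop. This is only an illustration: the class name TabletServerWatch and the host names in main are made up, and the list of servers comes from a Supplier so the sketch runs without a cluster. In a real client I believe the supplier could wrap connector.instanceOperations().getTabletServers(), but I haven't tried that here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

/** Polls a list of live tablet servers until it reaches an expected size. */
final class TabletServerWatch {

    /**
     * Calls the lister up to 'attempts' times, sleeping between calls.
     * Returns true as soon as the list has exactly 'expected' servers,
     * or false if that never happens or the thread is interrupted.
     */
    static boolean awaitCount(Supplier<List<String>> lister, int expected,
                              int attempts, long sleepMillis) {
        for (int i = 0; i < attempts; i++) {
            if (lister.get().size() == expected) {
                return true;
            }
            if (i < attempts - 1) {
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException e) {
                    // Preserve the interrupt flag and give up.
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Stub lister with hypothetical host names; it always reports two servers.
        Supplier<List<String>> stub =
                () -> Arrays.asList("affy-slave1:9997", "affy-slave2:9997");
        System.out.println(awaitCount(stub, 2, 3, 1000L));
    }
}
```

Waiting for a count of one after step 4 and a count of two after step 8 would mirror what the monitor page shows.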