Wednesday, May 29, 2013

Using Apache Accumulo as the backing store for Apache Gora - a tutorial

Apache Gora (http://gora.apache.org/) provides an abstraction layer to work with various
data storage engines. In this tutorial, we'll see how to use Gora with Apache Accumulo
as the storage engine.

I like to start projects with the Maven `pom.xml` file. So here is mine. It's important to
use Accumulo 1.4.3 instead of the newly released 1.5.0 because of an API incompatibility.
Otherwise, the `pom.xml` file is straightforward.

  <project ...>
      <modelVersion>4.0.0</modelVersion>
  
      <groupId>com.affy</groupId>
      <artifactId>pojos-in-accumulo</artifactId>
      <version>0.0.1-SNAPSHOT</version>
      <packaging>jar</packaging>
  
      <name>POJOs in Accumulo</name>
      <url>http://affy.com</url>
      
      <properties>
          <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
          <!-- Dependency Versions -->
          <accumulo.version>1.4.3</accumulo.version>
          <gora.version>0.3</gora.version>
          <slf4j.version>1.7.5</slf4j.version>
          <!-- Maven Plugin Dependencies -->
          <maven-compiler-plugin.version>2.3.2</maven-compiler-plugin.version>
          <maven-jar-plugin.version>2.4</maven-jar-plugin.version>
          <maven-dependency-plugin.version>2.4</maven-dependency-plugin.version>
          <maven-clean-plugin.version>2.4.1</maven-clean-plugin.version>
      </properties>    
        
      <dependencies>
          <dependency>
              <groupId>org.apache.accumulo</groupId>
              <artifactId>accumulo-core</artifactId>
              <version>${accumulo.version}</version>
              <type>jar</type>
          </dependency>
          <dependency>
              <groupId>org.apache.accumulo</groupId>
              <artifactId>accumulo-server</artifactId>
              <version>${accumulo.version}</version>
              <type>jar</type>
          </dependency>
          <dependency>
              <groupId>org.apache.gora</groupId>
              <artifactId>gora-core</artifactId>
              <version>${gora.version}</version>
          </dependency>
          <dependency>
              <groupId>org.apache.gora</groupId>
              <artifactId>gora-accumulo</artifactId>
              <version>${gora.version}</version>
          </dependency>
          <dependency>
              <groupId>org.slf4j</groupId>
              <artifactId>slf4j-api</artifactId>
              <version>${slf4j.version}</version>
          </dependency>
          <dependency>
              <groupId>org.slf4j</groupId>
              <artifactId>slf4j-log4j12</artifactId>
              <version>${slf4j.version}</version>
          </dependency>
          <!--
          TEST
          -->
          <dependency>
              <groupId>junit</groupId>
              <artifactId>junit</artifactId>
              <version>4.8.2</version>
              <scope>test</scope>
          </dependency>
      </dependencies>
      
  </project>

Now create a `src/main/resources/gora.properties` file that configures Gora by
specifying how to find Accumulo.

    gora.datastore.default=org.apache.gora.accumulo.store.AccumuloStore
    gora.datastore.accumulo.mock=true
    gora.datastore.accumulo.instance=instance
    gora.datastore.accumulo.zookeepers=localhost
    gora.datastore.accumulo.user=root
    gora.datastore.accumulo.password=

There are some important items to note. Firstly, we'll be using the MockInstance of
Accumulo (via `gora.datastore.accumulo.mock=true`) so that you don't actually need to
have Accumulo installed. Secondly, the password needs to be blank if you are depending
on Accumulo 1.4.3; change the password to `secret` if using an earlier version.
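These are ordinary Java properties that Gora reads from the classpath. As a quick sanity check that the blank password really parses as an empty string (the class below is my own illustration, not part of Gora), you can load the same keys with just the JDK:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class GoraPropertiesCheck {

    // Parse the same key/value pairs that gora.properties contains.
    static Properties load() {
        String text = "gora.datastore.default=org.apache.gora.accumulo.store.AccumuloStore\n"
                + "gora.datastore.accumulo.mock=true\n"
                + "gora.datastore.accumulo.instance=instance\n"
                + "gora.datastore.accumulo.zookeepers=localhost\n"
                + "gora.datastore.accumulo.user=root\n"
                + "gora.datastore.accumulo.password=\n";
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new RuntimeException(e); // cannot happen when reading from a String
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = load();
        // A blank value parses as the empty string, which is what Accumulo 1.4.3 expects here.
        System.out.println("mock = " + props.getProperty("gora.datastore.accumulo.mock"));
        System.out.println("password = [" + props.getProperty("gora.datastore.accumulo.password") + "]");
    }
}
```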

That's all it takes to configure Gora. Now let's define a very simple object: a Person
with just a first name. Gora uses Avro record schemas for this, so create a
`src/main/avro/person.json` file with the following:

    {
        "type": "record",
        "name": "Person",
        "namespace": "com.affy.generated",
        "fields": [
            {"name": "first", "type": "string"}
        ]
    }

This is the simplest object I could think of. Not very useful for real applications, but
great for a simple proof-of-concept project.
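Since the file is an ordinary Avro record schema, richer objects are just additional entries in the fields array. For example, this hypothetical extension (not used in the rest of this tutorial) adds a last name and an age:

```json
{
    "type": "record",
    "name": "Person",
    "namespace": "com.affy.generated",
    "fields": [
        {"name": "first", "type": "string"},
        {"name": "last", "type": "string"},
        {"name": "age", "type": "int"}
    ]
}
```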

The JSON file needs to be compiled into a Java class with the Gora compiler. Hopefully, you
have installed Gora and put its `bin` directory on your path. Run the following to
generate the Java code:

    gora goracompiler src/main/avro/person.json src/main/java

One last bit of setup is needed. Create a `src/main/resources/gora-accumulo-mapping.xml`
file with the following:

    <gora-orm>
        <class table="people" keyClass="java.lang.String" name="com.affy.generated.Person">
            <field name="first" family="f" qualifier="q" />
        </class>
    </gora-orm>
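To make the mapping concrete: the datastore key becomes the Accumulo row, and each mapped field lands in the listed family and qualifier. After the driver below stores a Person under key 001, scanning the people table would show something like this (illustrative only; exact shell formatting varies):

```
001 f:q []    David
```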

Finally we get to the fun part: actually writing a Java program to create, save, and
read a Person object. The code is straightforward, so I won't explain it, just show it. Create
a `src/main/java/com/affy/Create_Save_Read_Person_Driver.java` file like this:

    package com.affy;
    
    import com.affy.generated.Person;
    import org.apache.avro.util.Utf8;
    import org.apache.gora.store.DataStore;
    import org.apache.gora.store.DataStoreFactory;
    import org.apache.gora.util.GoraException;
    import org.apache.hadoop.conf.Configuration;
    
    public class Create_Save_Read_Person_Driver {
    
        private void process() throws GoraException {
            Person person = new Person();
            person.setFirst(new Utf8("David"));
            System.out.println("Person written: " + person);
        
            DataStore<String, Person> datastore = DataStoreFactory.getDataStore(String.class, Person.class, new Configuration());
            if (!datastore.schemaExists()) {
                datastore.createSchema();
            }
            datastore.put("001", person);
            
            Person p = datastore.get("001");
            System.out.println("Person read: " + p);
        }
    
        public static void main(String[] args) throws GoraException {
            Create_Save_Read_Person_Driver driver = new Create_Save_Read_Person_Driver();
            driver.process();
        }
    }
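Beyond `get()`, Gora can read back a range of keys through its Query API. The following is an untested sketch against the Gora 0.3 query interfaces (the key range is just an example); it would print every stored Person whose key falls between the start and end keys:

```java
import org.apache.gora.query.Query;
import org.apache.gora.query.Result;

// Sketch: scan a range of keys from the same datastore used above.
private void listPeople(DataStore<String, Person> datastore) throws Exception {
    Query<String, Person> query = datastore.newQuery();
    query.setStartKey("000");
    query.setEndKey("999");
    Result<String, Person> result = datastore.execute(query);
    while (result.next()) {
        System.out.println(result.getKey() + " -> " + result.get());
    }
    result.close();
}
```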

Running this program produces the following output:

    Person written: com.affy.generated.Person@20c {
      "first":"David"
    }
    Person read: com.affy.generated.Person@20c {
      "first":"David"
    }

Hopefully, I'll be able to post more complex examples in the future.

Thursday, May 23, 2013

Pointer to "Jump Point Search Explained" Page

http://zerowidth.com/2013/05/05/jump-point-search-explained.html - This explanation is very well done.

Tuesday, May 14, 2013

Understanding Markov Networks

http://izbicki.me/blog/markov-networks-monoids-and-futurama - I just found a very readable introduction to Markov networks. Don't be scared by the Haskell code, the concepts are all well-explained.

Sunday, May 12, 2013

Adding a tee command to Accumulo Shell

An afternoon project. There are hacks involved. Do not use in production!

Some of the iterators that I've been writing are designed to create a problem-oriented dataset: a limited view into the larger dataset. Once the iterators are put into place in the shell, there isn't a way to easily materialize that subset of the data. I'm not even sure it makes sense to materialize it, but it was interesting to experiment with the code.

Because this project was so specific to my whim, I don't feel it's right to add to the official code base.

My first step was an update to the Shell.java file. I added `new TeeCommand()` to the `external` Command array. Then I added a private String attribute called `teeTableName`. The last change was to the printRecords method. This change was a hack:

Formatter formatter = FormatterFactory.getFormatter(formatterClass, scanner, printTimestamps);
if (formatter instanceof TeeFormatter) {
    ((TeeFormatter)formatter).setConnector(connector);
    ((TeeFormatter)formatter).setTeeTableName(teeTableName);
}

The TeeCommand class is fairly simple; the only interesting part is the execute() method. You'll note that the tee table can't be the same as the current table in the shell, and it is automatically created if it does not exist. Another point to note is that the formatter for the current table is changed *globally*. Another hack, and a dangerous one, but I don't see a cleaner way to assign the formatter without larger changes to the Shell class.

@Override
public int execute(String fullCommand, CommandLine cl, Shell shellState) throws AccumuloException, AccumuloSecurityException, TableNotFoundException, TableExistsException {
    String tableName = cl.getArgs()[0];
    String currentTableName = shellState.getTableName();
    if (currentTableName.equals(tableName)) {
        throw new RuntimeException("You can't tee to the current table.");
    }
    if (!shellState.getConnector().tableOperations().exists(tableName)) {
        shellState.getConnector().tableOperations().create(tableName);
    }

    String subcommand = cl.getArgs()[1];
    if ("on".equals(subcommand)) {
        shellState.setTeeTableName(tableName);
        shellState.getConnector().tableOperations().setProperty(shellState.getTableName(), Property.TABLE_FORMATTER_CLASS.toString(), TeeFormatter.class.getName());
    } else if ("off".equals(subcommand)) {
        shellState.setTeeTableName(null);
        shellState.getConnector().tableOperations().removeProperty(shellState.getTableName(), Property.TABLE_FORMATTER_CLASS.toString());
    }

    return 0;
}
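For context, here is how the command is meant to be used (a hypothetical session; `archive` is just an example table name). While tee is on, every entry printed by a scan is also written to the tee table:

```
root@instance mytable> tee archive on
root@instance mytable> scan
row1 f:q []    value1
root@instance mytable> tee archive off
```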

The last change was to develop the TeeFormatter class. It's a copy of the DefaultFormatter except for the addition of a copyEntry method, which is inefficient in the extreme because it opens a BatchWriter for *every* row in the scan. I'll leave it to the reader to develop a more efficient approach. Note that I chose arbitrary numbers for the createBatchWriter call. More hackery!

private void copyEntry(Entry entry) {
    BatchWriter wr = null;
    try {
        wr = connector.createBatchWriter(teeTableName, 10000000, 10000, 5);
        Key key = entry.getKey();
        Value value = entry.getValue();
        Mutation m = new Mutation(key.getRow());
        m.put(key.getColumnFamily(), key.getColumnQualifier(), new ColumnVisibility(key.getColumnVisibility().toString()), key.getTimestamp(), value);
        wr.addMutation(m);
    } catch (TableNotFoundException e) {
        throw new RuntimeException("Unable to find table " + teeTableName, e);
    } catch (MutationsRejectedException e) {
        throw new RuntimeException("Mutation rejected while copying entry to tee table.", e);
    } finally {
        if (wr != null) {
            try {
                wr.close();
            } catch (MutationsRejectedException e) {
                throw new RuntimeException("Mutation rejected while closing writer to tee table.", e);
            }
        }
    }
}
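One obvious improvement, sketched below but untested, is to create the BatchWriter once and reuse it, closing it only when tee is turned off, instead of opening one per entry:

```java
// Sketch only: keep one BatchWriter per TeeFormatter instead of one per entry.
// Assumes TeeFormatter holds 'connector' and 'teeTableName' as fields, as above.
private BatchWriter wr;

private void copyEntry(Entry entry) {
    try {
        if (wr == null) {
            // Same arbitrary memory/latency/thread settings as before.
            wr = connector.createBatchWriter(teeTableName, 10000000, 10000, 5);
        }
        Key key = entry.getKey();
        Mutation m = new Mutation(key.getRow());
        m.put(key.getColumnFamily(), key.getColumnQualifier(),
                new ColumnVisibility(key.getColumnVisibility().toString()),
                key.getTimestamp(), entry.getValue());
        wr.addMutation(m);
    } catch (TableNotFoundException e) {
        throw new RuntimeException("Unable to find table " + teeTableName, e);
    } catch (MutationsRejectedException e) {
        throw new RuntimeException("Mutation rejected while copying entry to tee table.", e);
    }
}

// The shell would need to call this when 'tee off' runs, so buffered
// mutations are flushed before the writer is discarded.
public void close() throws MutationsRejectedException {
    if (wr != null) {
        wr.close();
        wr = null;
    }
}
```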

I did not include the whole solution in this post because of length, hackery, and criminal inefficiency. However, if you want this tee command, the clues above should let you write your own.

In order to help Accumulo hacking, I've updated my https://github.com/medined/accumulo-at-home project to include a 1.4.3/update-accumulo-1.4.3.sh script.