Monday, September 23, 2013

Technologies Used In Developing Applications Using Apache Accumulo

I was recently talking about how people train themselves for Big Data projects. The technology stack is fairly daunting. Below are the technologies that I find helpful. I'll add to the list as I remember more:

Systems Administration Technologies

System administrators are absolutely essential to successful projects. They ensure that software is installed and configured correctly. More importantly, they ensure repeatable builds and deployments. Oh, and they really need to understand the widely varying failure modes. Read the excellently written Aphyr blog to learn about some of them.

OpenStack - The technology consists of a series of interrelated projects that control pools of processing, storage, and networking resources throughout a datacenter, all managed through a dashboard that gives administrators control while empowering its users to provision resources through a web interface.

Puppet - Puppet Open Source is a flexible, customizable framework available under the Apache 2.0 license designed to help system administrators automate the many repetitive tasks they regularly perform.

Bash - The command-line shell for Unix-based operating systems. While you learn the Bash shell, make sure you also become proficient in a scripting language such as Perl, Ruby, or Python.

Jenkins - An application that monitors executions of repeated jobs, such as building a software project or jobs run by cron.

Ganglia - (optional) A scalable distributed monitoring system for high-performance computing systems such as clusters and grids.

Gitorious - (optional) An infrastructure for hosting open source projects that use Git. It can be used inside your firewalls to provide secure access to Git repositories.

Jira - (optional) Issue Tracker software. If you're using the Agile methodology, make sure to get the Jira Agile version.

Application Developer Technologies

Vim - While you may prefer to use a nice graphical IDE like NetBeans, Eclipse or IntelliJ, you'll be totally lost if you don't understand Vim. It can easily open large files that utterly crush any IDE.

Java - The programming language for Accumulo. Actually you can probably use most JVM-based languages.

Ant - While knowledge of Ant is not required, I use it to run both Java and MapReduce jobs. Its ability to orchestrate multiple targets can prove valuable.

Git - A distributed version control system designed to handle everything from small to very large projects with speed and efficiency. This is the version control system used by Accumulo. It seems fairly safe to say that you need at least a basic knowledge of Git to excel.

Maven - A software project management and comprehension tool. Based on the concept of a project object model (POM), Maven can manage a project's build, reporting and documentation from a central piece of information. Apache Accumulo uses Maven to compile, test and build jar files.

Tomcat or Jetty - (optional) Web applications are the main way to interact with users and these two web servers are good to develop with.

Hadoop - The Hadoop project develops open-source software for reliable, scalable, distributed computing. There are several distributions of the Hadoop stack (MapR, Cloudera, etc.) that you can use.

ZooKeeper - A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Accumulo depends on ZooKeeper.

Accumulo - A sorted, distributed key/value store that provides robust, scalable, high-performance data storage and retrieval.

Jenkins - An application that monitors executions of repeated jobs, such as building a software project or jobs run by cron.

Solr - Includes powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geo-spatial search. Solr is extremely useful when integrating between Big Data and applications. You can analyze the heck out of your data and then store the results in Solr.

Gitorious - (optional) An infrastructure for hosting open source projects that use Git. It can be used inside your firewalls to provide secure access to Git repositories.

Jira - (optional) Issue Tracker software. If you're using the Agile methodology, make sure to get the Jira Agile version.

Gnuplot - (optional) plotting software

Graphviz - (optional) diagramming software
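
To make the phrase "sorted, distributed key/value store" from the Accumulo entry a little more concrete, here is a minimal sketch of the "sorted" half only. It uses a plain JDK TreeMap in a single process. It is not the Accumulo API, and it leaves out distribution, column families, visibility labels, and timestamps. The class name SortedKeyValueSketch and the row IDs are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical illustration class; not part of Accumulo.
public class SortedKeyValueSketch {
    // TreeMap keeps its keys in sorted (lexicographic) order.
    private final TreeMap<String, String> entries = new TreeMap<>();

    public void put(String rowId, String value) {
        entries.put(rowId, value);
    }

    // Returns the values whose keys fall in [startRow, endRow), in key order.
    public List<String> scan(String startRow, String endRow) {
        return new ArrayList<>(entries.subMap(startRow, true, endRow, false).values());
    }

    public static void main(String[] args) {
        SortedKeyValueSketch store = new SortedKeyValueSketch();
        // Insert out of order; the map keeps the keys sorted.
        store.put("user_003", "carol");
        store.put("user_001", "alice");
        store.put("user_002", "bob");
        store.put("zebra", "unrelated");
        // Range scan over a contiguous slice of the sorted keys.
        System.out.println(store.scan("user_001", "user_003"));
    }
}
```

Because the keys are kept in order, a range scan only touches a contiguous slice of the data. That is why choosing good row IDs matters so much when you design tables for a sorted key/value store.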