Column databases store data as a mapping from a key to a set of columns, and physically, columns of the same family are stored together. Of all NoSQL databases, they are the easiest to compare against relational ones: a column family can be viewed as a table, its key thought of as a primary key, and its columns seen as table columns.
This article covers two distinct column-based databases: Apache Cassandra and Apache HBase. While the first is classified as a standard column family store, the second is labeled a super column family store. It is very important to notice that, despite being part of the same NoSQL family, the two products fit very distinct purposes and will hardly ever compete when designing a solution.
Thanks to its ready-to-use installation, Apache Cassandra is a complex product that is nevertheless very simple to use. It exposes the Cassandra Query Language (CQL), a SQL-like language that allows database administrators and applications to communicate with the database over an efficient, easy-to-use text protocol. One of its main features is its support for cluster configurations, which in Cassandra's own terminology are composed of data centers and racks that dictate how data is replicated among all configured nodes.
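A minimal CQL sketch gives the flavor of the language; the keyspace, table, and data center names here are invented for illustration. Note how the replication policy is declared per keyspace, per data center:

```sql
-- Hypothetical keyspace replicated across two data centers
CREATE KEYSPACE shop
  WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 2};

CREATE TABLE shop.users (
  user_id uuid PRIMARY KEY,
  name    text,
  email   text
);

SELECT name, email FROM shop.users
  WHERE user_id = 62c36092-82a1-3a00-93d1-46196ee77204;
```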
Behind the scenes, Cassandra persists every write operation into a commit log that can later be used for crash recovery; every write is also buffered in an in-memory table, which is flushed from time to time into SSTable files. In the same way, delete operations are not performed right away; instead, records are marked for later deletion during compaction, or expire under a configured time to live that defines how long a record should exist.
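This write path can be sketched in plain Java as a conceptual toy (the class name and flush threshold are invented; the real storage engine is far more sophisticated):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy model of Cassandra's write path: append to a commit log,
// buffer in a sorted memtable, and flush to an immutable "SSTable"
// once the memtable reaches a size threshold.
class ToyStore {
    private final List<String> commitLog = new ArrayList<>();       // crash-recovery log
    private final TreeMap<String, String> memtable = new TreeMap<>();
    private final List<TreeMap<String, String>> sstables = new ArrayList<>();
    private final int flushThreshold;

    ToyStore(int flushThreshold) { this.flushThreshold = flushThreshold; }

    void write(String key, String value) {
        commitLog.add(key + "=" + value);   // logged first, for durability
        memtable.put(key, value);           // then buffered in memory
        if (memtable.size() >= flushThreshold) flush();
    }

    private void flush() {
        sstables.add(new TreeMap<>(memtable)); // an immutable sorted "file"
        memtable.clear();
        commitLog.clear();                     // flushed segments can be recycled
    }

    String read(String key) {
        if (memtable.containsKey(key)) return memtable.get(key);
        // Newest SSTable wins, mimicking timestamp-based reconciliation.
        for (int i = sstables.size() - 1; i >= 0; i--) {
            String v = sstables.get(i).get(key);
            if (v != null) return v;
        }
        return null;
    }

    int sstableCount() { return sstables.size(); }
}
```

Note that a write never touches existing SSTables; this append-only design is what makes Cassandra's write path so fast.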
Since most relational databases provide ACID capabilities, Cassandra also supports a form of transactions; however, avoid them in applications that require high performance during write operations. Still similarly to relational databases, a set of tools is available to administrators to perform operations such as backup, or recovery after a possible schema disagreement between nodes in the cluster.
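Cassandra's transactional support is expressed in CQL through conditional statements, known as lightweight transactions (the table below is hypothetical). Each one requires extra coordination rounds between replicas, which is why they hurt write-heavy workloads:

```sql
-- Insert only if the row does not already exist
INSERT INTO shop.users (user_id, name)
  VALUES (62c36092-82a1-3a00-93d1-46196ee77204, 'Alice')
  IF NOT EXISTS;

-- Conditional update: applied only when the condition holds
UPDATE shop.users SET name = 'Bob'
  WHERE user_id = 62c36092-82a1-3a00-93d1-46196ee77204
  IF name = 'Alice';
```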
Cassandra installation is very simple, and its completion should not take more than a dozen minutes. Its configuration follows the 80-20 Pareto principle: finishing 80% of the work should not require touching more than 20% of the attributes. Extra features such as security mechanisms are not complicated either; both authorization and authentication are supported, and the same is valid for configuring encrypted internode communication.
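As a sketch of how little configuration is involved, the security-related attributes live in `cassandra.yaml`; the keystore paths and passwords below are placeholders:

```yaml
# cassandra.yaml excerpt: switch from the permissive defaults
# (AllowAllAuthenticator / AllowAllAuthorizer) to password-based
# authentication and role-based authorization.
authenticator: PasswordAuthenticator
authorizer: CassandraAuthorizer

# Encrypted internode communication
server_encryption_options:
  internode_encryption: all
  keystore: conf/.keystore
  keystore_password: cassandra
  truststore: conf/.truststore
  truststore_password: cassandra
```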
Developed over a high-performance I/O channel, its Java client API requires only one endpoint per cluster configuration. Communication with the database uses session objects; since their creation and destruction are costly, a good idea is to make those sessions long-lived objects inside your program.
In a style familiar from JDBC, it supports PreparedStatement and batch operations in order to increase performance, and programmatic queries can use the available QueryBuilder API. In addition, metadata changes can be performed on the fly, reducing update downtime.
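A minimal sketch with the DataStax Java driver (3.x-era API) shows the long-lived session and a prepared statement reused across writes; the contact point, keyspace, and table names are assumptions, and the code presumes a reachable cluster:

```java
import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class CassandraClientSketch {
    public static void main(String[] args) {
        // One contact point is enough; the driver discovers the rest of the cluster.
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        // Sessions are expensive to create: build one and keep it
        // for the application's whole lifetime.
        Session session = cluster.connect("shop");

        // Prepared once, bound and executed many times.
        PreparedStatement insert = session.prepare(
            "INSERT INTO users (user_id, name) VALUES (?, ?)");
        BoundStatement bound = insert.bind(java.util.UUID.randomUUID(), "Alice");
        session.execute(bound);

        session.close();
        cluster.close();
    }
}
```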
Despite its fit for several applications, I would recommend Cassandra mainly for solving problems where a huge amount of denormalized data is necessary to improve query performance, combined with the need for high availability through geographical redundancy. The figure below exemplifies a deployment scenario that summarizes Cassandra's main features.
Definitely, Apache HBase is not a simple-to-use NoSQL database. Installing it from scratch requires reading a couple of books and getting under some minor stress; that is probably because it requires an understanding of Apache Hadoop, which is, from my perspective, the core knowledge behind it, and therefore the most valuable thing to learn.
Hadoop is an alias for a more complex set of technologies involving HDFS and MapReduce. The first stands for the Hadoop Distributed File System, a file system that can be installed on any Linux machine. The second stands for a technique that allows building highly scalable algorithms by mapping chunks of data in parallel and then reducing them to the set of expected results.
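The map/reduce idea can be illustrated in plain Java, far removed from the real Hadoop API: the map step emits one record per word, and the reduce step sums the occurrences per word:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Minimal word count in the MapReduce style: map each line into
// words, group by word, and reduce each group by counting it.
class WordCount {
    static Map<String, Long> count(List<String> lines) {
        return lines.stream()
            .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+"))) // map
            .filter(w -> !w.isEmpty())
            .collect(Collectors.groupingBy(w -> w, Collectors.counting()));   // reduce
    }
}
```

In real Hadoop, the map and reduce steps run on different machines, each processing only the chunk of data stored near it; this snippet only captures the shape of the computation.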
The Apache HDFS file system is also useful for creating highly capable storage systems using just a bunch of disks (JBOD) and commodity hardware. In other words, it allows an application to enjoy a virtually unlimited capacity of data persistence without spending millions of dollars on limited, vertically scalable hardware.
From a simplified perspective, the Hadoop architecture consists of two distributed components. The first is the namenode, which is responsible for storing file system metadata in an in-memory data structure (high availability can be achieved through an active/standby architecture or a clustered configuration, both supported by a persistent crash recovery model). The second is the datanode, responsible for writing the actual data down as file system blocks. Datanodes are usually in far greater number than namenodes; in fact, in a common deployment there are dozens or even hundreds of datanodes for each configured namenode.
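A toy namenode can be sketched in plain Java (the class names and round-robin placement are invented for illustration; real HDFS placement is rack-aware): the metadata is essentially an in-memory map from blocks to the datanodes holding their replicas, while the datanodes hold the actual bytes:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy namenode: keeps only metadata (block -> replica locations) in memory.
class ToyNameNode {
    private final List<String> datanodes;
    private final Map<String, List<String>> blockLocations = new HashMap<>();
    private final int replication;
    private int next = 0; // simplistic round-robin placement

    ToyNameNode(List<String> datanodes, int replication) {
        this.datanodes = datanodes;
        this.replication = replication;
    }

    // Choose `replication` distinct datanodes for a new block.
    List<String> allocateBlock(String blockId) {
        List<String> replicas = new ArrayList<>();
        for (int i = 0; i < replication; i++) {
            replicas.add(datanodes.get((next + i) % datanodes.size()));
        }
        next = (next + 1) % datanodes.size();
        blockLocations.put(blockId, replicas);
        return replicas;
    }

    List<String> locate(String blockId) { return blockLocations.get(blockId); }
}
```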
The Apache HBase architecture is very similar to the Apache Hadoop one. It is composed of two distributed systems: the master server is responsible for handling the schemas (or metadata), while the region servers are responsible for handling all read and write operations. A third component, named ZooKeeper, can be added to improve availability for both Apache HBase and Hadoop; its main role is monitoring the nodes in order to guarantee reliable message distribution even under crashes or failures.
Now that we have acquired basic knowledge of Hadoop and HBase, let us focus on the HBase NoSQL features. Differently from Apache Cassandra, HBase is an explicit multidimensional map where the tuple (Table, RowKey, Family, Column Qualifier, Timestamp) maps to a specific value; in other words, what Cassandra does behind the scenes, the developer has to model and develop by himself when using HBase.
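That multidimensional map can be made concrete in plain Java (a conceptual model only, not the HBase client API): each lookup walks row key, column family, qualifier, and timestamp, with timestamps sorted newest-first so that the latest version is found immediately:

```java
import java.util.Comparator;
import java.util.TreeMap;

// HBase's logical data model as nested sorted maps:
// rowKey -> family -> qualifier -> timestamp -> value
class ToyHTable {
    private final TreeMap<String, TreeMap<String, TreeMap<String, TreeMap<Long, String>>>>
        rows = new TreeMap<>();

    void put(String row, String family, String qualifier, long ts, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(family, f -> new TreeMap<>())
            .computeIfAbsent(qualifier, q -> new TreeMap<>(Comparator.reverseOrder()))
            .put(ts, value);
    }

    // The latest version wins, like a default HBase Get.
    String get(String row, String family, String qualifier) {
        TreeMap<String, TreeMap<String, TreeMap<Long, String>>> fams = rows.get(row);
        if (fams == null) return null;
        TreeMap<String, TreeMap<Long, String>> quals = fams.get(family);
        if (quals == null) return null;
        TreeMap<Long, String> versions = quals.get(qualifier);
        if (versions == null || versions.isEmpty()) return null;
        return versions.firstEntry().getValue(); // newest timestamp first
    }
}
```

Old versions are kept rather than overwritten, which is how HBase provides time-travel reads over previous cell values.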
Similarly to Cassandra, HBase provides a client API that allows Java applications (and a dozen more languages) to communicate with the database. However, this API does not support any query language like CQL; instead, every single operation invokes one of a set of pre-defined methods, each with its own specific parameters. The connection object is named HTable, which should have one instance per thread, or a long-lived pool that matches the application life cycle; a pooled resource is the more appropriate solution for multithreaded applications.
The client API also supports the manual usage of locks to guarantee consistency, as well as batch operations (List&lt;Put&gt; and List&lt;Get&gt;). Although not considered a good practice, full scan operations over tables are allowed (similarly to cursors in relational databases), and filters are available to improve queries over billions of records, saving hardware resources and application wall time. Additionally, HBase supports observer and endpoint coprocessors, features comparable to triggers and stored procedures in relational databases.
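A hedged sketch of that method-based API, using the classic HBase client of the same era as HTable (the table and column names are invented, and the code assumes a reachable cluster configured via hbase-site.xml):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "users"); // one instance per thread

        // No query language: a write is a Put method call...
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
        table.put(put);

        // ...and a read is a Get method call.
        Get get = new Get(Bytes.toBytes("row1"));
        Result result = table.get(get);
        byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
        System.out.println(Bytes.toString(name));

        table.close();
    }
}
```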
Despite its fit for several applications, I would recommend HBase mainly for solving problems where a very, very huge amount of data is necessary to improve query performance, the purchase of vertically scalable storage is prohibitive, and parallel processing of the data is a common operation. The figure below exemplifies a deployment scenario that summarizes HBase's main features.