Document databases are commonly confused with databases capable of storing binary documents such as PDF or DOCX files, similarly to an Electronic Document Management (EDM) system. However, the document database is one of the simplest NoSQL types and can be summarized as a storage engine that persists data in a very flexible format, minimizing metadata configuration dependencies and complexity.
Document-based databases have always existed; however, the NoSQL movement has given them more visibility, and consequently their adoption has rocketed. Commonly, these databases store documents in a standard format such as JSON or XML, although there is no restriction against using other formats as well. One of their main disadvantages, at least for those that use a text format, is the high volume of disk space required to store a fairly small amount of data, since all metadata is usually replicated for every single record.
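The metadata-replication overhead is easy to see with plain JSON: every record carries its own copy of the field names. The sketch below (record contents are made up for illustration) compares the size of two records stored as standalone documents against the same data in a tabular layout where the field names appear only once:

```python
import json

# Two records stored as standalone JSON documents: the field names
# ("name", "email", "age") are repeated inside every record, which is
# where the extra disk usage of text-based document stores comes from.
records = [
    {"name": "Alice", "email": "alice@example.com", "age": 34},
    {"name": "Bob", "email": "bob@example.com", "age": 41},
]

as_documents = sum(len(json.dumps(r)) for r in records)

# The same data in a tabular layout stores the field names only once.
header = json.dumps(list(records[0]))
rows = sum(len(json.dumps(list(r.values()))) for r in records)
as_table = len(header) + rows

print(as_documents, as_table)  # the document form is larger
```

The gap grows linearly with the number of records, which is why text-format document stores trade disk space for schema flexibility.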
In this article, we will cover two distinct document databases. Both MongoDB and CouchDB are open source products that store their documents in a JSON-like format, yet their similarities stop there. While the former has a far more complex architecture, which fits well for applications looking for clustered behavior, the latter follows the "less is more" rule by providing an easy-to-use product with ready-to-use replication capability.
A great MongoDB feature consists of its read and write preferences, which allow tuning the database for better performance. Read preferences can be set to allow retrieving data from both primary and replica nodes. Write preferences can be set to consider a write finished without requiring any kind of response; more robust applications may require an acknowledgement or a full disk flush before completion; there is even a preference that requires a replication to succeed before considering a write done, which, of course, should be reserved for highly sensitive applications.
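These preferences map to small option documents that MongoDB accepts alongside each operation. The sketch below shows them as plain Python dicts (field names follow the MongoDB manual; the timeout value is an arbitrary example):

```python
# Write-concern documents, from weakest to strongest guarantee,
# sketched as plain Python dicts.
fire_and_forget = {"w": 0}             # no response required at all
acknowledged = {"w": 1}                # the primary acknowledges the write
journaled = {"w": 1, "j": True}        # the primary flushes its journal to disk
replicated = {"w": "majority",         # a majority of nodes must confirm
              "wtimeout": 5000}        # give up waiting after 5 seconds

# Read-preference modes decide which member serves a query: the
# primary only, a secondary only, or whichever is preferred/nearest.
read_modes = ["primary", "primaryPreferred", "secondary",
              "secondaryPreferred", "nearest"]
```

The stronger the write concern, the higher the latency per write, which is the tuning knob the paragraph above describes.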
MongoDB's strategy takes into account that every write to a single document is an atomic operation; for cases where a couple of documents must remain consistent, it is possible to use the isolation operator, but keep in mind that it will increase costs and consequently decrease performance. MongoDB supports bulk operations to provide high throughput during massive write operations; additionally, a time-to-live (TTL) feature allows discarding out-of-date data, preventing it from causing conflicts later.
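Both features are expressed as small documents. TTL is implemented as a special index on a date field, and the isolation operator of that MongoDB generation is a flag added to the update filter. The sketch below uses made-up collection and field names ("sessions", "createdAt", "qty"):

```python
# A TTL index is declared on a date field with an expireAfterSeconds
# option; a background task deletes documents once they pass the
# threshold. Field and value are illustrative.
ttl_index_keys = {"createdAt": 1}                 # ascending index on a date field
ttl_index_options = {"expireAfterSeconds": 3600}  # expire one hour after createdAt

# The $isolated operator (available in the MongoDB versions of this
# era) keeps a multi-document update from interleaving with other
# writes; it does not make the update transactional.
isolated_update_filter = {"qty": {"$lt": 10}, "$isolated": 1}
```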
The product has a suite of administration tools that allow performing data import, export, backup, and restore; its monitoring allows retrieving database usage statistics and resource usage through either a REST interface or an HTTP console. Its security features support authorization and authentication, a fine-grained access control level, and the possibility of encrypting communication.
Its architecture defines primary and secondary nodes; the primary ones receive all write operations, while the secondary ones store replicated data and optionally serve read operations. If a primary node goes down, a secondary one can take its place through a voting system that relies on an odd number of voters to avoid tied elections; it is possible to configure a secondary node not to stand as a candidate in elections (just set its priority to 0) or even to create arbiter nodes that perform no operation other than voting during elections.
Another pretty cool feature is the ability to create hidden and delayed members: while the former is simply a backup that guarantees a replica of the primary's data, the latter applies writes after a pre-configured delay, making it very useful for upgrade procedures that may require a rollback after a failure.
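All of these member roles are declared in a single replica-set configuration document. The sketch below shows one as a Python dict, using the field names of the MongoDB generation discussed here (host names are placeholders): member 1 cannot be elected primary, member 2 is a hidden backup, member 3 applies writes one hour late, and member 4 only votes in elections.

```python
# Hidden and delayed members must also carry priority 0, since
# neither should ever be elected primary.
replica_set_config = {
    "_id": "rs0",
    "members": [
        {"_id": 0, "host": "db0.example.net:27017"},                       # regular member
        {"_id": 1, "host": "db1.example.net:27017", "priority": 0},        # never a candidate
        {"_id": 2, "host": "db2.example.net:27017", "priority": 0,
         "hidden": True},                                                  # hidden backup
        {"_id": 3, "host": "db3.example.net:27017", "priority": 0,
         "slaveDelay": 3600},                                              # delayed by one hour
        {"_id": 4, "host": "db4.example.net:27017", "arbiterOnly": True},  # votes only
    ],
}
```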
The recommended deployment scenario is one primary node followed by two secondary ones; however, the product supports a limit of up to twelve nodes, of which only seven have voting capabilities. In addition, there is a rollback limit of no more than 300MB and a document size limit of 16MB per record. Keep in mind that such numbers may change in the near future.
Up until now, you might be asking: if a primary node contains the data while the secondary ones contain its replicas, where in the heavens are the sharding and cluster features? The answer is very simple: those features were not covered yet. To enable sharding and cluster operations, it is necessary to add two extra components: the query server is the load balancer responsible for finding the appropriate node to handle each write and read operation, while the config servers are responsible for holding all the metadata used by every node in the whole cluster.
If you noticed that query server was referenced in the singular while config servers were in the plural, here is the reason: MongoDB requires three config servers in order to guarantee metadata availability, yet one query server has proven to be enough to route a massive number of requests. It may look very complex to configure all those nodes for an application deployment; however, it does not take more than half an hour to do it from scratch. Additionally, remember that clusters should not be used for solutions that have no requirement to handle high volumes of data.
Following the HBase and Cassandra ideology, MongoDB also has a Java client API with full support for all database operations. The API has an easy-to-use configuration that allows the application to select one or a list of query servers, all with an embedded connection pool to be used in multithreaded environments.
Despite fitting several kinds of applications, I would recommend MongoDB mainly for solving problems where a considerable amount of denormalized data is necessary to improve query performance, combined with the need for high availability through local or geographical redundancy. Differently from Cassandra, which has stricter column types, MongoDB is more flexible and should be preferred for applications whose data is flexible, changes often, and requires no database-side validation. The figure below exemplifies a deployment scenario over the MongoDB architecture.
Apache CouchDB saves all its internal data in a JSON-like format, similarly to MongoDB. However, it is easy to realize that both were developed and architected to achieve very different goals: CouchDB focuses on usability and a lightweight architecture that allows it to be deployed even on mobile devices. Despite its simplicity, authentication and authorization are available through the use of roles and admin users.
Due to its simplicity, a clustered environment is not a natural choice, even though it is possible through extension modules such as the one provided by CouchDB Lounge. Its focus resides on providing an easy-to-configure replication mechanism that guarantees high availability, plus a full-featured REST interface that allows handling all database operations.
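Since every CouchDB operation is a plain HTTP call, a short sketch shows what that REST interface looks like. The snippet below only builds the requests with Python's standard library (database and document names are made up; nothing is sent, since that would require a running CouchDB on its default port 5984):

```python
import json
import urllib.request

BASE = "http://localhost:5984"  # CouchDB's default port; no request is sent here

# Creating a database is a PUT on its name.
create_db = urllib.request.Request(f"{BASE}/articles", method="PUT")

# Storing a document is a PUT of its JSON body under a chosen id.
doc = json.dumps({"title": "NoSQL overview"}).encode()
put_doc = urllib.request.Request(
    f"{BASE}/articles/doc-1", data=doc,
    headers={"Content-Type": "application/json"}, method="PUT")

# Even replication is triggered through REST, by POSTing a
# source/target pair to the _replicate endpoint.
replicate = urllib.request.Request(
    f"{BASE}/_replicate",
    data=json.dumps({"source": "articles",
                     "target": "http://replica.example.net:5984/articles"}).encode(),
    headers={"Content-Type": "application/json"}, method="POST")

# urllib.request.urlopen(create_db) would execute the call against a
# live CouchDB instance.
print(create_db.get_method(), create_db.full_url)
```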
Differently from other NoSQL databases, CouchDB does not deliver a Java client API; instead, its team has focused on providing a complete and easy-to-use REST interface. Anyway, there are a couple of open source APIs, such as LightCouch, which delivers a simple and powerful interface that even covers POJO serialization through the use of annotations and inheritance.
Despite fitting several kinds of applications, I would recommend CouchDB mainly for solving problems where a huge amount of data is not a requirement and there is no need to guarantee query performance. Since it has no embedded clustering features, I see no reason to use CouchDB instead of MongoDB, except for the fact that its simplicity and lightweight architecture make it a perfect fit for mobile caches or for delivering an offline database for rich clients that require an unplugged capability. The figure below exemplifies a deployment scenario and a summary of CouchDB features.