Taking a Look into the NoSQL World (Part 4)

When I started this series, I assumed that key-value NoSQL databases were simply those that use a plain Map interface to store object data across different servers. These databases are quite a bit more than that, however: they are associative arrays that support a whole set of additional dictionaries and collections, and they can perform parallel operations on them.

Because of this nature, a large share of key-value databases has adopted an in-memory approach in order to guarantee very high write throughput and very fast reads. Such databases give applications that require huge amounts of physical memory the illusion of almost infinite memory, which makes them a natural fit for solving performance problems through caching.

This article covers two distinct key-value, in-memory databases. The first is Hazelcast, whose community edition alone offers a product that literally jumps off the page, and whose commercial version adds valuable features such as off-heap ("big memory") storage and seamless Tomcat session replication. The second is Redis, which provides a rich set of dictionaries and collections as well as a ready-to-use persistence capability.

Hazelcast

Hazelcast identifies itself as an in-memory data grid built on peer-to-peer communication that requires no master-slave or primary-secondary configuration; its default settings fit both small test cases and large production scenarios. A great feature is its ability to run either as a separate server or embedded in the application, which makes distribution very smooth and easy. Installation takes less than a couple of minutes, and the full library distribution occupies less than 5 MB.

Focused on providing data structures that can be used in a distributed fashion, it implements all the main Java collection interfaces, so the developer can use clustered Maps, Queues, MultiMaps, Sets, Lists and Topics with almost no extra effort. If a specific piece of data must be replicated to every node in order to guarantee safety, the ReplicatedMap object offers weak consistency and does all the dirty work behind the scenes.
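As a rough sketch of how these clustered collections look in code (the map and queue names here are purely illustrative, and the example assumes the classic Hazelcast 3.x API plus a reachable cluster):

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

import java.util.concurrent.BlockingQueue;

public class DistributedCollections {
    public static void main(String[] args) {
        // Start an embedded node; a second JVM running the same code
        // would automatically join the cluster and share these structures.
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        // A clustered Map: entries are partitioned across all nodes.
        IMap<String, Integer> scores = hz.getMap("scores");
        scores.put("student-1", 87);

        // A clustered queue (IQueue extends BlockingQueue),
        // usable as a lightweight distributed work queue.
        BlockingQueue<String> tasks = hz.getQueue("tasks");
        tasks.offer("recalculate-averages");

        hz.shutdown();
    }
}
```

The point is that the types returned by the instance are the familiar `java.util` contracts, so existing code can often be clustered by swapping only the object that creates the collection.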

To control object manipulation and provide consistent operations, it supports a Lock API with conditions, semaphores and countdown latches. Not that surprising, but still useful, are the objects supporting atomic operations, such as AtomicLong, AtomicReference and IdGenerator. Another great feature is support for distributed events and massive event processing through MembershipListeners and ExecutorServices, respectively. In addition, parallel queries can retrieve data from all nodes at once, giving the application even better performance.
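A short sketch of the coordination primitives, again assuming the Hazelcast 3.x API, where locks and atomic longs are obtained directly from the HazelcastInstance (the names "page-hits" and "report-generation" are made up for the example):

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IAtomicLong;
import com.hazelcast.core.ILock;

public class ClusterWideCounter {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        // Cluster-wide atomic counter: increments are visible on every node.
        IAtomicLong counter = hz.getAtomicLong("page-hits");
        long hits = counter.incrementAndGet();

        // Distributed lock: only one node at a time enters the critical section.
        ILock lock = hz.getLock("report-generation");
        lock.lock();
        try {
            System.out.println("generating report, hit count = " + hits);
        } finally {
            lock.unlock();
        }
        hz.shutdown();
    }
}
```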

Since Hazelcast is focused on the Java platform, it also implements JEE transactions, JCache and the Hibernate second-level cache. There is no native persistence function, but the MapStore and QueueStore interfaces allow programmers to implement one themselves and reduce the application's risk of data loss from eventual crashes. Requiring the developer to implement the persistence mechanism may seem like a constraint; however, I strongly believe open implementations will appear soon.
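A minimal MapStore sketch is shown below. The in-memory backing map merely stands in for a real database (a JDBC DAO would go there instead), and the method signatures assume the Hazelcast 3.x MapStore interface:

```java
import com.hazelcast.core.MapStore;

import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

// Write-through store wired to a map via hazelcast.xml's <map-store> element.
// The HashMap below is a stand-in for a real table; replace it with JDBC calls.
public class StudentStore implements MapStore<String, Integer> {
    private final Map<String, Integer> backingTable = new HashMap<>();

    @Override public void store(String key, Integer value) { backingTable.put(key, value); }
    @Override public void storeAll(Map<String, Integer> map) { backingTable.putAll(map); }
    @Override public void delete(String key) { backingTable.remove(key); }
    @Override public void deleteAll(Collection<String> keys) { keys.forEach(backingTable::remove); }
    @Override public Integer load(String key) { return backingTable.get(key); }
    @Override public Map<String, Integer> loadAll(Collection<String> keys) {
        Map<String, Integer> result = new HashMap<>();
        keys.forEach(k -> result.put(k, backingTable.get(k)));
        return result;
    }
    @Override public Iterable<String> loadAllKeys() { return backingTable.keySet(); }
}
```

Once registered in the map's configuration, Hazelcast calls `store` on every put and `load` on a cache miss, so the persistence logic stays entirely out of the application code.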

By now the reader is probably guessing that, with all this coming and going of parallel Java objects, a high-performance serialization interface is indispensable. To make serialization easier and faster, several mechanisms are provided through interfaces such as Externalizable, DataSerializable, IdentifiedDataSerializable and Portable.
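Of these, Externalizable is plain java.io and needs no Hazelcast classes at all, so it makes a convenient self-contained illustration; the class and field names below are invented for the example:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;

// Hand-rolled serialization: only the listed fields travel over the wire,
// avoiding the reflective overhead of Java's default serialization.
public class Score implements Externalizable {
    private String student;
    private int points;

    public Score() { }  // a public no-arg constructor is required
    public Score(String student, int points) { this.student = student; this.points = points; }

    @Override public void writeExternal(ObjectOutput out) throws IOException {
        out.writeUTF(student);
        out.writeInt(points);
    }

    @Override public void readExternal(ObjectInput in) throws IOException {
        student = in.readUTF();
        points = in.readInt();
    }

    public static void main(String[] args) throws Exception {
        // Round-trip one object through a byte array.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new Score("student-1", 87));
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            Score copy = (Score) in.readObject();
            System.out.println(copy.student + "=" + copy.points);
        }
    }
}
```

The Hazelcast-specific interfaces follow the same write-fields/read-fields pattern but avoid even the class-name header that standard Java serialization emits.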

Finally, a client API is available for accessing a Hazelcast server, so even embedded deployments need no extra client. In addition, a rich web-based management cockpit allows monitoring of data activity and exhibits several statistics, which are also available through the product API. I almost forgot: adding or removing a node is easy like Sunday morning and requires no configuration; even data rebalancing is performed as if by magic by the lightweight API itself.

Although it fits many kinds of applications, I would recommend Hazelcast mainly for caching solutions and for seamless manipulation of huge amounts of in-memory data. Its commercial version can be very useful when a single node must store more than 8 GB of data, since the off-heap feature prevents the Java virtual machine from being brought to its knees by garbage collection. The figure below illustrates a deployment scenario and summarizes the main features of the Hazelcast Community Edition.

Hazelcast Deployment Architecture

Redis

Redis also identifies itself as an in-memory database; unlike Hazelcast, however, it provides a persistence mechanism for crash recovery and is fully based on a master-slave configuration, which is used to guarantee distributed, high-performance operations. Its command-line interface supports the most common data structures: String, List, Set, Hash and ZSet.

Each supported data structure comes with a set of operations for handling its inner data; these range from basic manipulation such as add and subtract to more complex operations like persist, watch and several time-based ones. To provide consistent operations, transactions are available through the MULTI and EXEC commands; additionally, several types of locks, as well as non-transactional pipelines for batch operations, are at the developer's disposal.
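A sketch of a MULTI/EXEC transaction driven from Java through the Jedis client (the key names are illustrative, and a Redis server on localhost:6379 is assumed):

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.Transaction;

public class TransferPoints {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // WATCH makes EXEC abort if this key changes before the
            // transaction runs (optimistic locking).
            jedis.watch("score:student-1");

            Transaction tx = jedis.multi();   // MULTI: start queuing commands
            tx.decrBy("score:student-1", 10);
            tx.incrBy("score:student-2", 10);
            tx.exec();                        // EXEC: run the queue atomically
        }
    }
}
```

Note that queued commands are not executed until EXEC, so their return values are only available afterwards, which is the main practical difference from running commands one by one.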

The persistence and replication mechanisms can be either configured or programmed. On the configuration path, persistence is executed after a pre-configured number of writes or an elapsed time (or both); on the programmatic path, a synchronous or asynchronous BGSAVE command must be sent to the server. An important note about the configuration approach is that it can slow down crash recovery for high-throughput applications with long intervals between persistence runs, since these lead to huge log files.
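The configuration path boils down to a few redis.conf directives; the thresholds below are illustrative examples, not recommendations:

```
# Snapshot if at least 1 write happened in 900 s,
# 10 writes in 300 s, or 10000 writes in 60 s.
save 900 1
save 300 10
save 60 10000

# Append-only log as an alternative, fsync'd once per second.
appendonly yes
appendfsync everysec
```

The programmatic alternative is simply issuing BGSAVE (or SAVE for a blocking snapshot) from a client whenever the application decides it is a good moment to persist.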

The architecture relies on master nodes to receive all write operations and on slave nodes to speed up data retrieval through read operations. Since the replication strategy is based on a raw copy of data from master to slaves, it is always a good idea to make sure that not too many slaves connect to the master and demand sync operations. A common workaround is to leave the master node at least 30% to 45% of its total memory available for performing sync operations, which can be painful for applications with huge amounts of in-memory data.
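Pointing a replica at its upstream is a one-line redis.conf directive per node; the address below is, of course, an example. Since a slave can itself replicate from another slave, this is also how an intermediate tier can be built to take sync load off the master:

```
# On each replica, point at the node it should copy from
# (a master, or an intermediate slave).
slaveof 192.168.0.10 6379

# Optionally refuse writes on replicas to keep them read-only.
slave-read-only yes
```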

To guarantee better performance and scaling, a few considerations should be taken into account when using Redis. The first is to keep key names as short as you can, followed by a preference for larger ziplists to reduce the size of linked-node data structures; bitsets are also highly recommended to minimize memory usage. A golden rule is to avoid using slave nodes for writes and to build a slave tree hierarchy so the master does not have to serve too many sync operations itself. Finally, using sentinels to provide a watchdog mechanism that promotes a slave node to master after a crash is a very important step to take care of.
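The ziplist thresholds are themselves plain redis.conf settings; the values below assume a pre-3.2 Redis of the kind current when this series was written, and are illustrative only:

```
# Keep small hashes, lists and sorted sets in the compact
# ziplist encoding instead of linked structures.
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
list-max-ziplist-entries 512
list-max-ziplist-value 64
zset-max-ziplist-entries 128
zset-max-ziplist-value 64
```

Above these limits Redis silently converts the structure to its pointer-based encoding, which is faster for big collections but considerably heavier in memory.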

Unfortunately, Redis has no official Java client API; during my tests I used Jedis, an open-source client that seems to be the best choice, even though it does not support every operation provided by the command-line interface. The API also supports concurrent, multi-threaded usage through the JedisPool class, which hands out connection objects to master-slave nodes and can shard keys across instances.
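A sketch of pooled access, assuming a Jedis version in which Jedis and JedisPool implement Closeable, so a borrowed connection is returned to the pool when closed (host, port and key are illustrative):

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;
import redis.clients.jedis.JedisPoolConfig;

public class PooledAccess {
    public static void main(String[] args) {
        JedisPoolConfig config = new JedisPoolConfig();
        config.setMaxTotal(32);   // cap concurrent connections

        // One pool shared by all application threads; each thread
        // borrows a connection and returns it when done.
        try (JedisPool pool = new JedisPool(config, "localhost", 6379);
             Jedis jedis = pool.getResource()) {
            jedis.set("score:student-1", "87");
            System.out.println(jedis.get("score:student-1"));
        }
    }
}
```

For client-side sharding across several Redis instances, the library also ships a ShardedJedisPool variant built on the same borrowing pattern.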

Although it fits many kinds of applications, I would recommend Redis mainly for A/B testing, logging, storing statistics, building scoring mechanisms and serving high-volume caches that would never fit into commodity hardware memory. Despite all the limitations mentioned, Redis achieved the best throughput of all the NoSQL databases I tested against my five-million-student score database. The figure below illustrates a Redis deployment scenario.

Redis Deployment Architecture
