Tuesday, May 24, 2016

Hazelcast: In-Memory NoSQL Solution

If you are evaluating high-performance NoSQL solutions such as Redis, Riak, Couchbase, MongoDB, or Cassandra, or in rarer cases caching solutions such as Memcached or Ehcache, your best choice may actually be Hazelcast. Hazelcast takes a considerably different approach from any of the above projects, and yet for some classes of users looking for a key-value store, it may be the best option.




So let's look at Hazelcast and why it can be a better alternative to the systems mentioned above. Hazelcast is an In-Memory Data Grid, not a NoSQL database.

First, consider the advantages and disadvantages of an in-memory solution. As a key-value store held in memory, Hazelcast has the natural advantages of speed and read efficiency, but it also inherits two natural disadvantages of in-memory storage: scalability and volatility. The size of RAM is always far less than the total available disk space, because RAM is much more expensive than disk, so space is a constant constraint for an in-memory store. And because RAM is volatile, its contents are lost on process restart, resulting in data loss. Hazelcast addresses these shortcomings of in-memory stores with efficient and convenient solutions.
In Hazelcast, the scalability issue is addressed by clustering: by joining hundreds of nodes into a cluster, we can aggregate terabytes of in-memory space to accommodate Hazelcast maps. Of course, this cannot compare with disk space, which is typically orders of magnitude larger, but depending on the use case a few terabytes is sufficient for in-memory operations, or the cluster can be paired with a backend data store.
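As a minimal sketch of how a cluster pools memory, assuming the Hazelcast 3.x Java API (the map name "demo" is made up):

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.IMap;

    public class ClusterDemo {
        public static void main(String[] args) {
            // Each newHazelcastInstance() starts a member; members discover
            // each other (multicast by default) and form a single cluster.
            HazelcastInstance member1 = Hazelcast.newHazelcastInstance();
            HazelcastInstance member2 = Hazelcast.newHazelcastInstance();

            IMap<String, String> map = member1.getMap("demo");
            map.put("key", "value");

            // The entry is visible from any member: the map's partitions are
            // spread across the cluster, so adding members adds usable memory.
            System.out.println(member2.getMap("demo").get("key"));

            Hazelcast.shutdownAll();
        }
    }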

Volatility is handled in Hazelcast by peer-to-peer data distribution: every block of data has multiple copies (replicas) spread across the cluster, so if a node or rack goes down, the data can be recovered from the copies held elsewhere. The number of backup copies is configurable depending on how critical the data is; too many copies, however, reduce the memory available for other operations. Hazelcast provides the tuning knobs to keep the cluster available and reliable, for example configuring servers on one rack to back up onto another rack, so the failure of an entire rack can be handled gracefully.
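For instance, the backup count might be configured as below; a sketch assuming the Hazelcast 3.x API, with the map name "orders" and the count of 2 picked only for illustration:

    import com.hazelcast.config.Config;
    import com.hazelcast.config.MapConfig;
    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;

    public class BackupDemo {
        public static void main(String[] args) {
            // Keep two synchronous backup copies of every partition of this
            // map; higher counts survive more simultaneous failures but
            // consume proportionally more cluster memory.
            Config config = new Config();
            MapConfig mapConfig = new MapConfig("orders");
            mapConfig.setBackupCount(2);
            config.addMapConfig(mapConfig);

            HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
            hz.getMap("orders").put(1L, "pending");
        }
    }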
Hazelcast also addresses the rebalancing problem: whenever a node is added to or removed from the cluster, data may be moved across the cluster to rebalance it. If a node crashes, the primary copies it owned must be taken over by the nodes holding their replicas (the secondary copies become primary) and then backed up onto other nodes to make the cluster fail-safe again. This whole process consumes cluster resources such as CPU, RAM, and network bandwidth, and can introduce latency while it runs.
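The rebalancing is observable through a migration listener; a sketch, again assuming the Hazelcast 3.x API:

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.MigrationEvent;
    import com.hazelcast.core.MigrationListener;

    public class RebalanceDemo {
        public static void main(String[] args) {
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();
            // Fires as partitions move when members join, leave or crash.
            hz.getPartitionService().addMigrationListener(new MigrationListener() {
                public void migrationStarted(MigrationEvent e) {
                    System.out.println("moving partition " + e.getPartitionId());
                }
                public void migrationCompleted(MigrationEvent e) {
                    System.out.println("partition " + e.getPartitionId() + " rebalanced");
                }
                public void migrationFailed(MigrationEvent e) {
                    System.out.println("migration failed for partition " + e.getPartitionId());
                }
            });
        }
    }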
In addition to the benefits above, Hazelcast also makes sure that Java garbage collection does not affect terabytes of data stored in memory. With everything on the heap, the bigger the heap gets, the more garbage collection pauses can delay your application's response time. Hazelcast therefore offers native (off-heap) storage, keeping the data out of the garbage collector's reach and yielding better efficiency and throughput.
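This native storage is Hazelcast's High-Density Memory Store, an enterprise feature. A configuration sketch, assuming the Hazelcast 3.x API (the map name and the 4 GB size are made up):

    import com.hazelcast.config.Config;
    import com.hazelcast.config.InMemoryFormat;
    import com.hazelcast.config.MapConfig;
    import com.hazelcast.config.NativeMemoryConfig;
    import com.hazelcast.memory.MemorySize;
    import com.hazelcast.memory.MemoryUnit;

    public class NativeMemoryDemo {
        public static void main(String[] args) {
            Config config = new Config();
            // Reserve off-heap memory so entries live outside the GC-managed heap.
            config.setNativeMemoryConfig(new NativeMemoryConfig()
                    .setEnabled(true)
                    .setSize(new MemorySize(4, MemoryUnit.GIGABYTES)));

            // Store this map's entries in native (off-heap) format.
            MapConfig mapConfig = new MapConfig("bigMap");
            mapConfig.setInMemoryFormat(InMemoryFormat.NATIVE);
            config.addMapConfig(mapConfig);
        }
    }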

Sunday, May 1, 2016

Serialization Frameworks (Protocols): When to Use What?

Technically, there are two important differences among the protocols below:

- Statically typed or dynamically typed
- Type mapping between a language's type system and the serializer's type system (note: these serializers are cross-language)

The most easily understood difference is "statically typed" vs. "dynamically typed". It affects how you manage compatibility between data and programs. Statically typed serializers don't store detailed type information of objects in the serialized data, because it is described in source code or an IDL. Dynamically typed serializers store type information alongside the values.

- Statically typed: Protocol Buffers, Thrift
- Dynamically typed: JSON, Avro, MessagePack, BSON

Generally speaking, statically typed serializers can store objects in fewer bytes. But they can't detect errors in the IDL (i.e., a mismatch of data and IDL): they must trust that the IDL is correct, since the data doesn't include type information. This means statically typed serializers are high-performance, but you must take great care about compatibility between data and programs.
Note that some serializers include original improvements for these problems. Protocol Buffers stores some (not detailed) type information in the data, so it can detect a mismatch between IDL and data. MessagePack stores type information in an efficient format, so its data size can be smaller than Protocol Buffers or Thrift (depending on the data).
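To see what "type information alongside the values" means in practice, here is a sketch using the msgpack-java core API (version 0.8.x assumed): the bytes can be decoded without any schema, because every value carries its own type tag.

    import org.msgpack.core.MessageBufferPacker;
    import org.msgpack.core.MessagePack;
    import org.msgpack.core.MessageUnpacker;
    import org.msgpack.value.Value;

    public class DynamicTypingDemo {
        public static void main(String[] args) throws Exception {
            // Pack a one-entry map: {"id": 42}
            MessageBufferPacker packer = MessagePack.newDefaultBufferPacker();
            packer.packMapHeader(1);
            packer.packString("id");
            packer.packInt(42);
            byte[] bytes = packer.toByteArray();

            // No IDL needed to decode: the type tags are in the data itself.
            MessageUnpacker unpacker = MessagePack.newDefaultUnpacker(bytes);
            Value value = unpacker.unpackValue();
            System.out.println(value.getValueType() + ": " + value); // MAP: {"id":42}
        }
    }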

Type systems are also an important difference. The following list compares the type systems of Protocol Buffers, Avro, and MessagePack:

- Protocol Buffers: int32, int64, uint32, uint64, sint32, sint64, fixed32, fixed64, sfixed32, sfixed64, double, float, bool, string, bytes, repeated, message
- Avro: null, boolean, int, long, float, double, bytes, fixed, string, enum, array, map, record
- MessagePack: Integer, Float, Boolean, Nil, Raw, Array, Map (=same as JSON)

Serializers must map these types to and from the language's types to achieve cross-language compatibility. This means some types supported by your favorite language can't be stored by some serializers, while too many types may cause interoperability problems. For example, Protocol Buffers (at least proto2) has no map (dictionary) type. Avro doesn't distinguish unsigned integers from signed integers, while Protocol Buffers does. Avro and Protocol Buffers have an enum type, while MessagePack doesn't.

These choices were natural for their designers: Protocol Buffers was initially designed for C++, Avro for Java, and MessagePack aims for interoperability with JSON.

Some advantages and disadvantages of the serialization frameworks:

1. XML
The advantages of XML are that it is human-readable/editable, extensible, and interoperable. XML gives data a structure, so it is richer in information, and it is easily processed because that structure is simple and standard. There is a wide range of reusable software available to programmers for handling XML, so they don't have to reinvent code. XML provides many views of the same data and separates the presentation of data from its structure. It is the standard for SOAP and similar protocols.

XML uses Unicode encoding for the data, and it can be parsed without knowing the schema in advance.
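For instance, a schema-free parse with the JDK's built-in DOM parser might look like this (the sample document is made up):

    import java.io.ByteArrayInputStream;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;

    public class XmlParseDemo {
        public static void main(String[] args) throws Exception {
            String xml = "<user><id>42</id><name>alice</name></user>";
            // The structure is discovered from the markup itself; no schema required.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
            System.out.println(doc.getDocumentElement()
                    .getElementsByTagName("name").item(0).getTextContent()); // alice
        }
    }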


2. JSON
JSON is much easier for humans to read than XML, and easier to write too. It is also easier for machines to read and write. Like XML, JSON gives data a structure, making it richer in information and easy to process. JSON is a good data-exchange format and performs well for the right use case. JSON schemas and structures are based on arrays and records within JSON objects. JSON has excellent browser support and is less verbose than XML.

Like XML, JSON uses Unicode encoding for the data. The advantages of JSON over XML are that the message size is much smaller and that it is arguably even easier to read and edit. JSON can also be parsed without knowing the schema in advance.
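A quick sketch of the same schema-free parsing for JSON, assuming the Jackson library (the sample document is made up):

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class JsonParseDemo {
        public static void main(String[] args) throws Exception {
            String json = "{\"id\":42,\"name\":\"alice\"}";
            // readTree needs no predefined class or schema; types are in the data.
            JsonNode node = new ObjectMapper().readTree(json);
            System.out.println(node.get("name").asText()); // alice
        }
    }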

JSON is just beginning to become widely known. Its simplicity and the ease of converting XML to JSON make JSON ultimately more adoptable.


3. Protocol Buffers
Protocol Buffers has a small output size (very dense data) and very fast processing, but it is hard to decode robustly without knowing the schema: the data format is internally ambiguous and needs the schema to clarify it. Only machines can understand the dense binary; it is not intended for human eyes.

Protocol Buffers has full backward compatibility and requires less boilerplate code for parsing compared to JSON or XML.
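A usage sketch: the Person class below is hypothetical, standing in for whatever protoc generates from your .proto file, but the builder/parseFrom calls are the standard generated-class API.

    // Assumes protoc has generated Person from a hypothetical person.proto:
    //   message Person { required string name = 1; optional int32 id = 2; }
    public class ProtoDemo {
        public static void main(String[] args) throws Exception {
            Person person = Person.newBuilder().setName("alice").setId(42).build();

            byte[] bytes = person.toByteArray();     // dense binary, no field names
            Person parsed = Person.parseFrom(bytes); // the schema lives in the class
            System.out.println(parsed.getName());
        }
    }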


4. BSON (Binary JSON)
BSON can be compared to binary interchange formats like Protocol Buffers. BSON is more "schema-less" than Protocol Buffers, which gives it an advantage in flexibility but also a slight disadvantage in space efficiency (BSON carries overhead for field names within the serialized data).

BSON is lightweight: keeping spatial overhead to a minimum is important for any data-representation format, especially when used over the network. BSON is also designed to be traversed easily, which is a vital property in its role as the primary data representation for MongoDB.

It is also efficient: encoding data to BSON and decoding from BSON can be performed very quickly in most languages due to the use of C data types.
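A small sketch using the org.bson classes that ship with the MongoDB Java driver (assumed here); RawBsonDocument keeps the raw bytes, whose length-prefixed elements are what make cheap traversal possible:

    import org.bson.Document;
    import org.bson.RawBsonDocument;

    public class BsonDemo {
        public static void main(String[] args) {
            Document doc = new Document("id", 42).append("name", "alice");

            // Encode to raw BSON bytes; note the field names are in the payload.
            RawBsonDocument raw = RawBsonDocument.parse(doc.toJson());
            System.out.println(raw.getByteBuffer().remaining() + " bytes");
            System.out.println(raw.getString("name").getValue()); // alice
        }
    }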


5. Apache Thrift
Apache Thrift provides:

- Cross-language serialization with lower overhead than alternatives such as SOAP, thanks to its binary format.
- A lean and clean library: no framework to code against, no XML configuration files.
- Language bindings that feel natural; for example, Java uses ArrayList<String> and C++ uses std::vector<std::string>.
- A clean separation between the application-level wire format and the serialization-level wire format, so the two can be modified independently.
- Predefined serialization styles, including binary, HTTP-friendly, and compact binary.
- Doubling as cross-language file serialization.

Thrift does not require a centralized and explicit mechanism like major/minor versioning, so loosely coupled teams can freely evolve RPC calls. There are no build dependencies or non-standard software, and no mix of incompatible software licenses.
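A serialization sketch: the User struct is hypothetical (standing in for code generated from a .thrift file), while the TSerializer/TDeserializer calls assume the libthrift Java library.

    import org.apache.thrift.TDeserializer;
    import org.apache.thrift.TSerializer;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.protocol.TCompactProtocol;

    // Assumes a generated struct from a hypothetical user.thrift:
    //   struct User { 1: string name, 2: i32 id }
    public class ThriftDemo {
        public static void main(String[] args) throws Exception {
            User user = new User().setName("alice").setId(42);

            // The serialization style is pluggable: swap the protocol factory.
            byte[] binary  = new TSerializer(new TBinaryProtocol.Factory()).serialize(user);
            byte[] compact = new TSerializer(new TCompactProtocol.Factory()).serialize(user);

            User decoded = new User();
            new TDeserializer(new TCompactProtocol.Factory()).deserialize(decoded, compact);
            System.out.println(decoded.getName()
                    + " (binary " + binary.length + " vs compact " + compact.length + " bytes)");
        }
    }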


6. MessagePack
The tagline of MessagePack says: "It's like JSON. but fast and small." MessagePack is not human-readable, as it stores data in a binary format.
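A rough way to see the "small" part, assuming Jackson plus the jackson-dataformat-msgpack binding (exact byte counts will vary with the data):

    import com.fasterxml.jackson.databind.ObjectMapper;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import org.msgpack.jackson.dataformat.MessagePackFactory;

    public class SizeDemo {
        public static void main(String[] args) throws Exception {
            Map<String, Object> user = new LinkedHashMap<>();
            user.put("id", 42);
            user.put("name", "alice");

            // Same structure, same field names, fewer bytes on the wire.
            byte[] json    = new ObjectMapper().writeValueAsBytes(user);
            byte[] msgpack = new ObjectMapper(new MessagePackFactory()).writeValueAsBytes(user);
            System.out.println("JSON: " + json.length
                    + " bytes, MessagePack: " + msgpack.length + " bytes");
        }
    }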


7. Apache Avro
Nowadays Apache Avro is becoming popular in the industry because of its small message size and its support for evolving schemas.

- Schema evolution: Avro requires schemas when data is written or read. Most interestingly, you can use different schemas for serialization and deserialization, and Avro will handle missing/extra/modified fields.
- Untagged data: providing a schema with the binary data allows each datum to be written without per-field overhead. The result is a more compact data encoding and faster data processing.
- Dynamic typing: serialization and deserialization without code generation. This complements the code generation that Avro offers for statically typed languages as an optional optimization.
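A schema-evolution sketch using Avro's GenericRecord API; the two inline schemas are made up for illustration:

    import java.io.ByteArrayOutputStream;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryDecoder;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.avro.io.EncoderFactory;

    public class AvroEvolutionDemo {
        public static void main(String[] args) throws Exception {
            Schema writer = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"}]}");
            // The reader schema adds an 'age' field with a default value.
            Schema reader = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\",\"default\":0}]}");

            GenericRecord record = new GenericData.Record(writer);
            record.put("name", "alice");

            // Untagged binary: no field names or type tags in the output.
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(writer).write(record, encoder);
            encoder.flush();

            // Decode old data with the new schema; the missing field gets its default.
            BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
            GenericRecord decoded =
                new GenericDatumReader<GenericRecord>(writer, reader).read(null, decoder);
            System.out.println(decoded); // {"name": "alice", "age": 0}
        }
    }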

Please let me know your comments and experiences with these frameworks.

Thanks.
