The Rise of Column-based Data Stores Part I

Column-based data stores are becoming in an important trend today

If you read my post about Real-Time Analytics, you should be excited like me about this trend. Did you remember the phrase: “Time is equal to money”? Time is the main cause behind all innovations in the Database world: we want faster solutions; quick ways to gather huge quantity of data; faster ways to query billions of records; faster ways to adapt our infrastructure, etc; and many have tried to give clean and useful solutions to this problem.

One of the first organizations that had this kind of problem was Google, and everybody knows very well that this company is synonym of innovation: Google’s team put to its best minds to work on this and the dynamic duo of BigTable and GFS came to play a key role in the global search infrastructure.

Where is the innovation part? BigTable is a highly distributed column-based data store on top of GFS (Google File System), a highly distributed file system, taking advantage of the large servers farm of the company.

But what means that BigTable is a column-based data store? I will use the simple definition that used Ricki Ho on this blog:

Unlike traditional RDBMS implementation where each “row” is stored contiguous on disk, BigTable, on the other hand, store each column contiguously on disk. The underlying assumption is that in most cases not all columns are needed for data access, column oriented layout allows more records sitting in a disk block and hence can reduce the disk I/O.

But if you want to know more about this topic, see the work research by the Dr. Daniel Abadi, CS Assistant Professor at Yale University and co-founder of Hadapt:

Many teams saw in BigTable an answer to their problems, so they worked very hard and developed BigTable’s clones.

But, they two trends here: Disk-based columnar databases, and in-memory column-based databases.

In the first category: the best examples are: Apache HBase, Apache Cassandra, Vertica; and in the second, there is a new movement behind this: Druid, VoltDB, and SAP HANA One.

In this part 1, I will talk about the first category, focused on HBase and Vertica, because I’ve not used very much Cassandra.

Disk-based columnar databases

Apache HBase

HBase is one of my favorite projects right now. I’ve followed the technology for two years right now, and I’ve seen the evolution of the platform in this time, and I have to say that the interest for HBase is growing with time. If you have subscribed to the HBase User mailing list, you will see that the quantity of messages grows everyday.

But, What is HBase? I will put here the original definition of its site: is a distributed, versioned, column-oriented database built on top of Apache Hadoop and Apache ZooKeeper.

HBase is developed in Java, so, you must have in your system a Java installation, particularly, the Oracle Java 1.6.x serie, to run the platform.

HBase use the Hadoop Distributed File System(HDFS) to store its data in a special type of slave node called Region Servers, that run in the Hadoop’s DataNodes of your cluster , and Zookeeper for distributed coordination service.

HBase can run in several modes:

Standalone: which uses the local filesystem to run, and all HBase deamons and a local Zookeeper runs on the same JVM process
Pseudo-distributed: where all daemons runs on a single server, including a local instance of HDFS (You can use this mode for development process; never use this in production)
Distributed: which each daemon runs on separate servers: a NameNode for HDFS and the HBase Master, two or more nodes acting like RegionServers; and a Zookeeper ensemble (in production, the recommended size of the ensemble is three, five or seven machines, to be more tolerant to failures)

How it works?

In HBase, the basic unit is the column, which are grouped together in a column family. All columns in a column family are stored together in the same low level storage file called HFile.

This improves very much the I/O performance. All rows are stored in a byte-lexographic order, and the tables are automatically splitted to regions, which each one of them, contains row values from a row start-key to an row end-key.

The regions are stored in a particular RegionServer node. You must monitor constantly the growing of your reginons, because a large quantity of regions hosted in a single RegionServer is not recommended, and several experts recommend to do the region split process manually, to take the control of this process.

Who uses HBase today?

There are a lot of companies that uses HBase today like a key part of its infrastructure.

A great list is posted in the HBase wiki here, but you really want to see how many organizations are using this amazing platform, you should take a look to the passed HBaseCon, hosted by Cloudera, where companies like Tumblr, Salesforce, StumbleUpon (who have several HBase commiters), Adobe, Jive Software, eBay, Facebook are using HBase for its operations. You can see the Cloudera’s profile at Slideshare to see all the content with the hbasecon tag. Highly recommended:

HBase and HDFS: Past, Present and Future, by Todd Lipcon, one of the Big Data rockstars of the Apache Hadoop project and a proud Cloudera employee.
Case Study of HBase Operations at Facebook, by Ryan Thiessen, Technical Lead at Facebook
HBase Schema Design, by Ian Varley, Principal Member of Technical Staff, Salesforce.com

So, keep an eye above HBase because they are a lot of people working to do of the platform one of the best choices to choose in the NoSQL market today.

If you want to know more about HBase, the great Lars George’s “HBase: The Definitive Guide” book and the mailing lists are the main resources for this.

Vertica

If you are in the topic of Databases, you should know who is Dr. Michael Stonebraker, who is right now an adjunct professor in the Computer Science and Artificial Intelligence Laboratory at the Massachusetts Institute of Technology, considered like one of the world experts in this field.

Why I began in that way? Because Dr. Stonebraker co-founded Vertica Systems, seeing the innovation behind this amazing product.

But, What is Vertica?

Vertica Analytic Database is a high performance MPP (Massive Parallel Processing) columnar engine optimized to deliver faster query results in the shortest time.

I said optimized because this is a keyword inside the Vertica team: every piece of code in Vertica has a lot of research and innovation, which I will discuss later.

I heard abut this database when I was writing a research paper for my organization about MPP systems, and I found that Vertica was one of the good players in this Big Data Analytics game (the other good players are the Greenplum Database and Teradata’s Aster Data Platform).

Then, HP saw the great opportunity that this product represented for the Big Data business and acquired the company in 2011.

OK, let’s talk now about some of the Vertica’s features

Column-based storage:
Vertica use a patented architecture called FlexStoreTM, created based on three principles: the grouping of multiples columns in a single file, the selection of disk storage format based on data load patterns automatically, and the ability to differentiate storage media by their performance characteristics and to enable intelligent placement of data based on usage patterns
Advanced Data compression:
Based on the choosed architecture by Vertica team of grouping columns in a single file; the data compression follows the same principle: Vertica organizes values of similar data types contiguously in memory and on disk, enabling to select the best compression algorithm depending of the data type. This improves dramatically the query execution and parallel load times
Built-in Analytics functions:
Vertica comes with a completed packages of useful functions for Analytics, divided by topics like Natural Language Processing, Data Mining, Logistic Regression, etc. This is called User-Defined Extensions. You can read more about this here in this whitepaper
Automatic High Availability:
Vertica allows to scale your data almost without limits, with remarkable features like automatic failover and redundancy, fast recovery, and fast query performance, executing queries 50x-1000x faster eliminating costly disk I/O.
Native integration with Hadoop, BI and ETL tools:
Seamless integration with a robust and ever growing ecosystem of analytics solutions.

You can read deeply about all these features here. The last version of the platform is Vertica 6, and here you can find some of the new features and improvements in this version, or you can view this video, where Luis Maldonado, Director of Product Management at Vertica, explaining a quick overview of this version.

Ok, it’s a great platform, but who are using it today?

There are a lot of companies that are trusting in Vertica today: Twitter, Zynga (th number # 1 company in the Social Gaming industry), Groupon, JPMorgan Chase, Mozilla, AT&T, Verizon, Diio, Capital IQ, Guess Inc,and many more. Read its testimonials here.

Final Thoughts

So, I let the decision to you, selecting the best choice for you. Stay tuned to the second part of the article, focused on in-memory column-based data stores. Thanks a lot for using your precious time to read my articles. If you like, share it in Twitter(@marcosluis2186), Facebook(marcosluis2186) and wherever you want.

Happy Hacking !!!Marcos Luis Ortíz Valmaseda about.me/marcosortiz @marcosluis2186

The Rise of Column-based data stores Part 1