Yes, I have to say this: I admire MapR’s management team for the great job they have done in the few years since the company was founded. In two years, MapR has become a respected leader in the Big Data rush by following a simple mantra: “Build a great product and create a business around it”.
So, what is the product behind MapR’s business? This amazing team has built one of the best Hadoop distributions to date, available in two editions: M3, which is the free edition, and M5, which is the commercial edition, with a lot of great features that I will talk about later. These two editions are the key reason why MapR has been selected as an important partner in the Cloud market: first by the Amazon Web Services team, to offer M3 and M5 in the Amazon Marketplace, and then by Google’s new Cloud service, Google Compute Engine, proving that MapR has much more to offer to the world. But what about other Cloud service providers? What about Rackspace, HP Cloud Services or Microsoft Azure? OK, first, let’s talk about MapR.
The MapR distributions
But what makes MapR’s distributions so remarkable and different? Let’s begin by identifying the key features of this distribution and how they can be beneficial, whether you are a developer or a sysadmin working with Hadoop. The key features of the MapR distribution are:
- It includes a complete Hadoop stack: Hive, Pig, HBase, Oozie, Zookeeper, Sqoop, Flume, and an entire MapReduce layer 100% compatible with Apache Hadoop
- It provides MapR Direct Access NFS, which enables real-time read/write data flows based on the NFS (Network File System) protocol. This is a huge improvement over Apache Hadoop, because it can lead to shorter processing times, which is very important these days. In addition, the MapR Lockless Storage Services enable multiple concurrent reads and writes on any file.
- Built-in compression: the MapR distribution provides automatic compression, which offers top-notch performance acceleration and great benefits for storage, because it can save a lot of space in the cluster.
- MapR Volumes: This is one of the key features why I personally recommend this Hadoop distribution. Anyone who has played with Apache Hadoop knows that managing user permissions, space, and so on in a cluster is often a painful process that can bring a lot of headaches for sysadmins. Well, what are Volumes? They make cluster data both easy to access and easy to manage by grouping related files and directories into a single tree structure so they can be easily organized, managed, and secured.
- It provides enterprise-level features and advanced data management functionality, which allow sysadmins and developers to do capacity planning for the cluster.
- MapR Control System (MCS): a complete cluster management solution with which anyone can control the operations of the cluster and monitor the activity of each of its components, using the MapR HeatMap
- Multi-cluster support and management: With the MapR Control System, you can control several clusters at the same time, and a user can easily switch between clusters. This is another of my favorite features, because it allows you to create separate clusters for several departments in an organization while still letting them share their data through NFS. Awesome!!!
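To make the Direct Access NFS point in the list above a bit more concrete: once a cluster is exposed over NFS, any program can read and write cluster data with ordinary file calls instead of a Hadoop-specific client. The Python sketch below only illustrates that idea. The mount point, the `CLUSTER_NFS_MOUNT` variable and the helper names are assumptions of mine, not part of MapR, and a temporary directory stands in for the mount so the code runs anywhere.

```python
import os
import tempfile

# Hypothetical mount point: on a real cluster this would be wherever the
# administrator mounted the cluster's NFS export. A temporary directory
# stands in for it here so the sketch runs without a cluster.
MOUNT_ROOT = os.environ.get("CLUSTER_NFS_MOUNT") or tempfile.mkdtemp()


def append_event(relative_path, line, root=MOUNT_ROOT):
    """Append one line to a file under the mount using ordinary file I/O."""
    full_path = os.path.join(root, relative_path)
    os.makedirs(os.path.dirname(full_path), exist_ok=True)
    with open(full_path, "a", encoding="utf-8") as handle:
        handle.write(line.rstrip("\n") + "\n")
    return full_path


def read_events(relative_path, root=MOUNT_ROOT):
    """Read the file back as a list of lines, again with plain open()."""
    with open(os.path.join(root, relative_path), encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]
```

Nothing in this code knows about Hadoop, and that is the whole point: existing scripts and tools that work with files can work with cluster data the same way.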
But there is more. MapR Technologies made a great play here: they created two editions. Keep reading.
M3 is the free version of the MapR distribution: you can use it at no cost, it is ready for production, and its support is community-based, through the MapR Answers portal (something similar to the Stack Overflow site) and the Hadoop mailing lists.
M5 is the commercial version of the MapR distribution. It has 24x7 support and email support, and the cost of implementing a cluster depends on its size. It is a subscription software offering which delivers the best of Hadoop and much more. Some of its features? Look carefully at this list:
- Snapshots:
- For me, this is definitely one of the key features of this Hadoop distribution. With Snapshots, any organization can plan with confidence at every stage of development, because they let you protect your cluster from human errors (the most common kind of error today) and application errors: the kind of errors that you can’t solve with replication. This feature works in conjunction with Volumes, and snapshots can be taken using the MapR Control System (MCS).
- Mirroring:
- What happens if you want to run multiple research studies on the same dataset, but don’t want to duplicate the information by hand? OK, then you want to use this feature, which is only available in this version. With it activated, many teams can share a given dataset simply by mirroring the data to another cluster. This is great.
- Jobtracker HA:
- This is one of my main headaches when I install a new Hadoop cluster, and one of the key reasons why the Hadoop development group is working on the 2.x series (by the way, today Arun created an RC2 for 2.0.2). Putting the Jobtracker in a high availability environment can be a painful and stressful process, but the MapR guys solved this by creating a battle-tested, distributed Jobtracker with a single purpose in mind: everything inside a MapR distribution is logged and journaled, which makes it possible to recover quickly and easily from disasters. Are you worried about losing MapReduce jobs? Don’t be. This is precisely why the “distributed” part of the Jobtracker was built: you don’t have to think anymore about losing another job, because if the Jobtracker fails on one node, it automatically restarts on another node in the cluster. Do you know how many headaches you save with this feature? A lot, believe me.
- Distributed NameNode HA
- If you are a Hadoop sysadmin or developer, you know that a NameNode running on a single node is a potential source of trouble: in a large cluster this node can fail and, as you know, it keeps track of where the data lives across the cluster, so its failure would be a serious problem for you. Now add another problem to this: the “small files” problem. When you have a single NameNode in a large cluster, you have to constantly monitor the number of files that this node is managing, and many times the sysadmins have to become wizards to keep the cluster running. But failures happen all the time, and when the NameNode fails, the whole cluster can become unavailable for a short or a long time. With this feature, MapR’s distribution is a gem for your infrastructure, because the NameNode is distributed across the whole cluster and, like the Jobtracker, is self-healing. Give it a try.
- Data Placement control:
- With this feature, you can control where a given volume stores its data: in the entire cluster or in a subset of nodes. For example, if you want to use HBase as a key part of your infrastructure, and you know that there are three nodes in the cluster that have SSD drives (HBase is an I/O-intensive platform), you could create one or several volumes for your RegionServers on top of these drives to obtain better performance.
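The data placement idea can be pictured with a toy model: every node carries a topology label, every volume is pinned to a topology path, and only the nodes under that path may hold the volume’s data. The Python sketch below is just my illustration of that reasoning. The `Node` class, the labels and the `eligible_nodes` helper are hypothetical and are not the MapR API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    topology: str  # hypothetical label, e.g. "/data/ssd" or "/data/sata"


def eligible_nodes(nodes, volume_topology):
    """Return the names of the nodes allowed to store a volume's data.

    A node qualifies when its topology label equals the volume's topology
    path or sits underneath it, so "/data" covers every node while
    "/data/ssd" covers only the SSD-backed ones.
    """
    prefix = volume_topology.rstrip("/")
    return [
        node.name
        for node in nodes
        if node.topology == prefix or node.topology.startswith(prefix + "/")
    ]


# Example cluster: three SSD nodes and two ordinary ones (made-up names).
CLUSTER = [
    Node("node1", "/data/ssd"),
    Node("node2", "/data/ssd"),
    Node("node3", "/data/ssd"),
    Node("node4", "/data/sata"),
    Node("node5", "/data/sata"),
]
```

In the HBase example, a volume pinned to `/data/ssd` would land only on the three SSD nodes, while a volume pinned to `/data` could use the entire cluster.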
Another Cloud partner? Joyent is my first option
As you saw, Amazon and Google have trusted MapR by offering its Hadoop distribution as a service on their platforms, but if MapR wants to become a leader in this field, it has to offer its products through more cloud service providers, and this is where Joyent comes out as my first choice.
Why Joyent? There are a lot of reasons for me to recommend the Joyent Cloud as another partner for MapR Technologies. This is not a trivial decision for MapR, but I will explain here why I think this.
People at Joyent
First, who are these guys with “orange ties” (a distinctive mark of the Joyent people at events)? If you follow today’s technology trends, you will have noticed that Node.js has gained a lot of popularity, and Ryan Dahl, the creator of this amazing project, is a full-time Joyent employee. But the greatness doesn’t end here: Bryan Cantrill (VP of Engineering at Joyent), one of the original creators of DTrace (the dynamic tracing tool built for the Solaris OS, considered one of the key reasons why many organizations use it today), and Brendan Gregg (Senior Software Engineer at Joyent) are on the payroll too.
I have to say it: Joyent Cloud is one of the best cloud services that I’ve ever used. I’ve used AWS and Rackspace Cloud, and these are amazing platforms too, but the performance that I’ve seen in a Joyent SmartDataCenter, together with the incredible Joyent SmartOS, which is based on illumos, are two of the key reasons why I say this.
But what is SmartOS? In the words of Joyent’s engineers, it is a comprehensive, secure operating system combining best of breed technologies that deliver complete hardware and operating system virtualization, enterprise and carrier grade storage, and analytics. This is the gem behind the performance of Joyent Cloud, and if you run into any trouble on your Cloud platform, Joyent Cloud Analytics is your answer. This application is a web monitoring platform based on DTrace with great 4D heatmaps, where you can see the actual state of the infrastructure in real time. You can see more of this on the Joyeur blog. So, what do you think about creating a Joyent SmartMachine Appliance focused on MapR M3 and MapR M5, like 10gen did with MongoDB and Basho did with Riak?
Joyent’s customers
Joyent has a great range of customers, but which of them use Hadoop for their business applications? LinkedIn, eBay, Cocktail (a division of Yahoo!) and Gilt Groupe are some of the companies in this range that I know use Hadoop as a key part of their infrastructure. So, how many of you want to see a handshake between Joyent’s management team and MapR Technologies’ management team? If you do this, many companies would use the Joyent Cloud to host their highly intensive Hadoop applications, relying with full confidence on the MapR distribution for its performance and great features, with SmartOS optimized for it underneath. Everybody happy: customers, Joyent and MapR.
As I said before, MapR and Joyent, you should talk and work together to raise the bar for your current and potential customers, focused on one keyword: “PERFORMANCE”. Yes, if we join the features of the MapR distribution with the remarkable features of SmartOS, MapR and Joyent can change the Big Data hosting business forever; I strongly believe that. But there are many minds in this highly technical world that don’t share my thoughts; if you are one of them, I encourage you to leave a comment to enrich the post. Thank you for taking the time to read this.