Why the new Amazon Linux AMI release matters for Modern Data Analytics platforms

A few weeks ago, I received an email from Jeff Barr (Chief Evangelist at Amazon Web Services) explaining the new features in the latest Amazon Linux AMI, and when I finished reading the post on the AWS blog, I discussed all the changes with my team, focusing on the impact these features could have on AWS-based Big Data Analytics platforms.

The complete list of changes is here. When you begin to analyze all the improvements inside Linux kernel 3.14.19, you should wonder what this release could do for High Performance Analytics platforms like Amazon Elastic MapReduce (EMR), for your Amazon Redshift cluster, or for your own Hadoop cluster running on top of Amazon EC2 with this Linux AMI. I will comment on some of my favorite features in this post. Keep reading.

Memory Management improvements with zram, zswap and zcache

This is one of the best things in this release, because distributed analytics platforms like Apache Hadoop and Apache Spark, and high performance data storage platforms like Redshift and Apache Cassandra, rely heavily on the speed and optimization of RAM, together with kernel tuning that better distributes the available resources among the nodes; and as the Linux kernel developers know, the zprojects (zram, zswap and zcache) could help a lot here.

It would be nice to see future posts from Airbnb or Netflix about how they use these platforms with this new AMI.

But what are zswap and zram? If you visit the zswap documentation, you will see this:

zswap is a lightweight compressed cache for swap pages. It takes pages that are in the process of being swapped out and attempts to compress them into a dynamically allocated RAM-based memory pool. zswap basically trades CPU cycles for potentially reduced swap I/O. This trade-off can also result in a significant performance improvement if reads from the compressed cache are faster than reads from a swap device.
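To see whether your instance actually has zswap available, you can peek at sysfs. A minimal Python sketch, assuming a kernel that exposes the module parameters under /sys/module/zswap/parameters (illustrative, not an AWS-provided tool):

```python
# zswap_status.py -- report zswap module parameters, if present.
from pathlib import Path

PARAMS = Path("/sys/module/zswap/parameters")

def zswap_parameters() -> dict:
    """Return a dict of zswap parameters, or an empty dict if absent."""
    if not PARAMS.is_dir():
        return {}
    return {p.name: p.read_text().strip() for p in PARAMS.iterdir()}

if __name__ == "__main__":
    params = zswap_parameters()
    if not params:
        print("zswap is not available on this kernel")
    else:
        for name, value in sorted(params.items()):
            print(f"{name} = {value}")
```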

Now, talking about zram, I think this is where the Linux kernel could improve all distributed applications that use RAM as one of their main resources.

Think, for example, of Spark’s Resilient Distributed Datasets (RDDs), Cassandra memtables, Hadoop NameNode filesystem metadata, or the HDFS NameNode High Availability features that arrived alongside YARN.
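On a running node, you can also see what zram is actually buying you. Here is a small sketch that reads the per-device counters; the device name zram0 and the individual counter files are assumptions based on the 3.14-era sysfs layout (newer kernels consolidate them into a single mm_stat file):

```python
# zram_stats.py -- read the compression counters of a zram device.
from pathlib import Path

ZRAM = Path("/sys/block/zram0")
COUNTERS = ("disksize", "orig_data_size", "compr_data_size", "mem_used_total")

def read_counter(name: str):
    """Read one sysfs counter as an int, or None if it does not exist."""
    f = ZRAM / name
    return int(f.read_text()) if f.exists() else None

if __name__ == "__main__":
    if not ZRAM.exists():
        print("no zram0 device is configured")
    else:
        stats = {name: read_counter(name) for name in COUNTERS}
        for name, value in stats.items():
            print(f"{name}: {value}")
        if stats["orig_data_size"] and stats["compr_data_size"]:
            ratio = stats["orig_data_size"] / stats["compr_data_size"]
            print(f"compression ratio: {ratio:.2f}x")
```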

Let me dig a little deeper into YARN (MapReduce NextGen), where the main idea is to split up the main tasks of the JobTracker (from MapReduce v1) into two different daemons:

  • the ResourceManager (RM), in charge of resource management, and
  • the ApplicationMaster (AM), which runs per application (in YARN, you can host several different applications on the same Hadoop cluster, simply by using a different AM for each application).

I must clarify that there is another component on every slave node, the NodeManager (NM), and the NMs are centrally controlled by the RM. All these new kernel features could dramatically improve the efficient use of RAM on the RM and NM hosts, which could increase the amount of information kept in memory about the status of the nodes and MR jobs (in the case of the RM) and about the monitoring and status of the tasks (in the case of the NM).

One of the components of the RM is the Scheduler, which is responsible for allocating resources to the various running applications, subject to the familiar constraints of capacities and queues. Its Resource Model is based on the idea that every node in the system is composed of multiple containers of a minimum memory size (the recommended value is 1 GB, but this depends on the use of the Hadoop cluster, the available resources, etc.).
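To make the Resource Model concrete, here is a back-of-the-envelope Python sketch. The property names in the comments mirror the real yarn-site.xml settings (yarn.nodemanager.resource.memory-mb and yarn.scheduler.minimum-allocation-mb), but the node sizes are illustrative numbers, not a recommendation:

```python
# containers.py -- back-of-the-envelope view of the YARN Resource Model:
# a node's RAM is carved into containers that are multiples of a minimum
# allocation (yarn.scheduler.minimum-allocation-mb); the RAM a node
# offers to YARN is yarn.nodemanager.resource.memory-mb.

def max_containers(node_memory_mb: int, min_allocation_mb: int = 1024) -> int:
    """How many minimum-size (1 GB) containers fit on one node."""
    return node_memory_mb // min_allocation_mb

if __name__ == "__main__":
    # Illustrative: a node with ~12 GB left for YARN after reserving
    # memory for the OS and the NodeManager daemon itself.
    node_memory_mb = 12 * 1024
    print(f"{max_containers(node_memory_mb)} containers of 1 GB each")
```

The arithmetic is trivial, but it shows why every megabyte of RAM the kernel frees up through compression translates directly into more schedulable containers per node.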

So you should note that the Resource Model is completely based on RAM; for that reason, a more efficient way to store a particular sequence of bytes in the kernel using the zprojects could, in theory, improve the work of a YARN cluster. Of course, the impact of these changes must be tested heavily, in Amazon Elastic MapReduce for example, or in your own Hadoop cluster.

TCP Fast Open is enabled by default

First, we have to discuss what this is. TCP Fast Open (TFO) is an extension to speed up the opening of successive TCP connections between two endpoints.

If you read its Wikipedia page, the Details section describes it like this:

It works by using a TFO cookie (a TCP option) in the initial SYN packet to authenticate a previously connected client. If successful, it may start sending data to the client before the receipt of the final ACK packet of the three way handshake is received, skipping a round trip and lowering the latency in the start of transmission of data. This cryptographic cookie is stored on the client side and is set upon the initial connection. It is then repeated back whenever the client reconnects. The cookie is generated by applying a block cipher keyed on a key held secret by the server to the client’s IP address, generating a MAC tag that cannot be forged. The proposal was originally presented in 2011 and is, as of February 2012, an IETF Internet draft.
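Here is a minimal Python sketch of what TFO looks like from userspace on Linux: it reads the kernel knob and issues a client-side TFO send. The MSG_FASTOPEN fallback value is the Linux constant, and example.com:80 is just a placeholder endpoint:

```python
# tfo_check.py -- minimal sketch around TCP Fast Open on Linux.
import socket
from pathlib import Path

# Fall back to the Linux value when this Python build lacks the constant.
MSG_FASTOPEN = getattr(socket, "MSG_FASTOPEN", 0x20000000)

def tfo_mode() -> int:
    """Read net.ipv4.tcp_fastopen (1 = client, 2 = server, 3 = both)."""
    return int(Path("/proc/sys/net/ipv4/tcp_fastopen").read_text())

def tfo_get(host: str, port: int, payload: bytes) -> bytes:
    """Connect and send in one step, carrying data in the SYN if possible."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # sendto() with MSG_FASTOPEN performs the connect itself; with a
    # cached TFO cookie the payload travels inside the SYN packet.
    sock.sendto(payload, MSG_FASTOPEN, (host, port))
    response = sock.recv(4096)
    sock.close()
    return response

if __name__ == "__main__":
    print(f"net.ipv4.tcp_fastopen = {tfo_mode()}")
    print(tfo_get("example.com", 80, b"HEAD / HTTP/1.0\r\nHost: example.com\r\n\r\n"))
```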

All these distributed systems, like Apache Hadoop, Apache Cassandra and Amazon Redshift, rely heavily on the efficiency of the networking stack in Unix systems, and with the right kernel tuning, great things can be achieved; I think a series of posts is coming from companies that use all these services, explaining how this particular feature could improve them.

File system improvements

If you are using Amazon Elastic MapReduce or Amazon Redshift (or both), you should be using Amazon S3 for your Input/Output stuff. But if you have your own Hadoop cluster or your own Apache Cassandra cluster hosted in AWS, you should be interested in the file system improvements in this release. One of my favorite things here is the new inode properties for Btrfs, where the docs say:

This release adds infrastructure in Btrfs to attach name/value pairs to inodes as xattrs. The purpose of these pairs is to store properties for inodes, such as compression. These properties can be inherited, this means when a directory inode has inheritable properties set, these are added to new inodes created under that directory. Subvolumes can also have properties associated with them, and they can be inherited from their parent subvolume. This release adds one specific property implementation, named "compression", whose values can be "lzo" or "zlib" and it's an inheritable property.
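Since the property is stored as an extended attribute, you can experiment with it from userspace without any special tooling. A minimal sketch, assuming a Btrfs mount; /data/hdfs is a placeholder directory:

```python
# btrfs_props.py -- sketch of the inode property mechanism described
# above: Btrfs exposes the "compression" property as the
# btrfs.compression extended attribute (Linux-only, needs a Btrfs mount).
import os

def set_compression(path: str, algo: str = "lzo") -> None:
    """Mark a directory so new files created under it inherit compression."""
    if algo not in ("lzo", "zlib"):
        raise ValueError("kernel 3.14 supports only lzo and zlib here")
    os.setxattr(path, "btrfs.compression", algo.encode())

def get_compression(path: str) -> str:
    """Read the property back; an empty string means it is unset."""
    try:
        return os.getxattr(path, "btrfs.compression").decode()
    except OSError:
        return ""

if __name__ == "__main__":
    set_compression("/data/hdfs", "lzo")   # placeholder HDFS data directory
    print(get_compression("/data/hdfs"))
```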

There are other performance improvements, which you can find here, where every commit is well explained. All this is turning Btrfs into a more serious file system every day, and I think you should consider running heavy tests hosting your HDFS data directories, or your SSTables in the case of Cassandra, on modern file systems like this one or ext4, which received some improvements too.

I say this because, with Cassandra for example, the first recommendation related to disks is always to keep your SSTables on a disk formatted with XFS, which has a practically unlimited maximum file size on 64-bit systems, but I think this could change in the upcoming years.

The watchword is: TEST, TEST AND TEST!!!

Final Thoughts

The conclusion here is simple: this Linux AMI could dramatically change the way we build distributed analytics applications in the cloud. Thanks to Jeff for the information, and to the whole AWS Linux engineering team for this incredible work. Keep it up, guys, and you will win the cloud war for modern analytics applications.

Marcos Ortiz Valmaseda

Editor at The Panda Way, where I help companies earn more income through #investing. Cloud Data Engineer in the mornings at Grupo Intercorp.
Lima, Peru