Yesterday, I was reading the Docker newsletter from March 12, and I saw a lot of interesting links about the many ways people use Docker. One of the writeups that piqued my curiosity was a post by Kenny Bastani called “Getting Started with Apache Spark and Neo4j using Docker Compose”. It describes, in a simple way, how to run Apache Spark and Neo4j together from a Docker-based image. That sent me searching for information about Apache Spark in the enterprise, and in the process I landed on the Databricks site. This incredible team has developed an easy way to create an Apache Spark cluster with Databricks Cloud, which lets you deploy Spark as a Service on a unified, cloud-hosted data platform.
Apache Spark aims to become the platform for Big Data thanks to its incredible performance, its ease of use, and its focus on delivering results fast. I spent almost a day reading everything I could about Apache Spark, and I found a great article on LinkedIn by Kavitha Mariappan (VP of Marketing at Databricks), where she describes why any company interested in extracting value from its data with Apache Spark could use Databricks Cloud to do it.
Then I began to think about the long term for Databricks Cloud, asking questions like: what about the new development trend to “Dockerize” everything? Docker containers are an amazing way to deploy enterprise applications quickly, securely, and cleanly; so what if Databricks and Docker worked together on a Docker-based image to deploy Apache Spark in no time? Beyond that, all of this needs a strong base, and the recent release of Red Hat Enterprise Linux 7.1 is perfect for the job: its new Atomic Host offering is a version of the battle-tested Enterprise Linux distribution optimized for running container-based applications, and combined with the Real-Time variant it could be a great complement for an Apache Spark cluster, whose distributed architecture demands low-latency response times. But what benefits could a collaboration among these great companies bring to the world? Keep reading to find out.
All three companies use open-source-based business models
You may be wondering why this is important. I will give just three examples:
- Imagine a development project where the main developers of Apache Spark work with Red Hat’s Performance Engineering team to extract extreme performance from the new Real-Time offering for Apache Spark clusters.
- Or think of another project where Apache Spark follows Red Hat’s best packaging guidelines to build optimized RPMs for Fedora, CentOS, and RHEL alike.
- Or think of another project where the Apache Spark dev team works with Docker to provide an optimized Docker-based image for Apache Spark and makes it available through Docker Hub; a sketch of what that could feel like for a developer follows this list. Think of this as “free marketing”: any interested developer could test and deploy an Apache Spark cluster in a matter of minutes. Going even further, Databricks could work on bringing Docker-based containers to Databricks Cloud as another of its offerings.
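To give an idea of how short the path from “pull the image” to “running job” could be, here is a minimal PySpark sketch. It assumes, purely for illustration, that such an image is already running locally as a standalone Spark master on the default port 7077; no such official image exists today, so the whole setup is hypothetical.

```python
from pyspark import SparkConf, SparkContext

# Point the driver at a Spark standalone master assumed to be running
# inside a local container (standalone masters listen on 7077 by default).
conf = (SparkConf()
        .setMaster("spark://localhost:7077")
        .setAppName("docker-spark-smoke-test"))
sc = SparkContext(conf=conf)

# Tiny smoke test: distribute the numbers 1..100 and sum them on the cluster.
total = sc.parallelize(range(1, 101)).reduce(lambda a, b: a + b)
print(total)  # prints 5050 if the containerized cluster is reachable

sc.stop()
```

If pulling one image and running those ten lines were the whole onboarding story, that really would be marketing money can’t buy.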
Security provided through SELinux
One of the constant worries of large companies that want to adopt cloud-based technologies is the security of their data. If Databricks worked with Red Hat on this front line, it could bring a lot of benefits by taking Apache Spark into the SELinux world. With the new release of RHEL 7.1, Red Hat explained how it uses a secure Linux container stack that provides isolation and security for the container host. So, if Databricks used RHEL Atomic Host as the foundation for Apache Spark cluster deployments, a secure environment would be guaranteed, which is fundamental to delivering this solution to Fortune 500 companies.
A Fedora-based Docker image
Fedora is the foundation of RHEL, so if Databricks, Red Hat, and Docker worked on a Fedora-based Docker image where anyone could try everything new in Spark in a simple way, for example the new DataFrame API, it could bring a lot of benefits to new developers interested in the platform.
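To make that concrete, here is the kind of five-minute experiment such an image would be perfect for: a minimal DataFrame API sketch in PySpark, run in local mode with invented sample data, written against the Spark 1.3-era SQLContext API.

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

# Local mode is enough for kicking the tires on the DataFrame API.
sc = SparkContext("local[*]", "dataframe-demo")
sqlContext = SQLContext(sc)

# Build a small DataFrame from in-memory rows (the sample data is made up).
people = sqlContext.createDataFrame([
    Row(name="Alice", age=34),
    Row(name="Bob", age=23),
])

# Declarative, optimizer-friendly operations instead of hand-written RDD code.
people.filter(people.age > 30).select("name").show()

sc.stop()
```

The point is not the code itself, but that a ready-made image would remove every bit of setup standing between a curious developer and this kind of experiment.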
Conclusions
When you start to think about this more and more, many ideas come to mind, so I strongly believe these three teams could work together to deliver the best, most secure, and fastest way to deploy Apache Spark in the enterprise using Docker-based images and more; this could be just the beginning of the future of cloud-based, near-real-time enterprise applications. Who knows!