Python for developing MapReduce Data Analysis Applications: Part 1

Jun 02, 2011

Have I not told you? Python is my primary programming language. I simply love it: its simplicity, its correctness, and the way it pushes you to write good, readable code.
Really, I love the Python principles. So, when I began to experiment with Hadoop clusters for big data processing, I asked myself: well, how can I do all this
using my favorite language? Hmm, let me search the wiki and, voilà: Hadoop Streaming.

Reading the docs on the wiki, I found: “Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run 
MapReduce jobs with any executable or script as the mapper and/or the reducer”. And all the examples are written in Python! A good start.
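To give you a feel for it, here is a minimal sketch of the classic streaming word count (my own toy version, not an official example). The mapper turns each input line into tab-separated `word<TAB>1` pairs; Hadoop sorts that output by key and pipes it into the reducer, which can then sum the counts for each word in one pass:

```python
from itertools import groupby

def mapper(lines):
    # Emit one "word<TAB>1" line per word; Hadoop Streaming treats
    # everything before the first tab as the key.
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word.lower()

def reducer(lines):
    # Hadoop delivers the mapper output sorted by key, so all the
    # pairs for one word arrive together and can be summed lazily.
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))
```

In the real `mapper.py` and `reducer.py` scripts you would just feed `sys.stdin` to these functions and `print` each result, then submit the job with the streaming jar, roughly: `hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py -input ... -output ...` (the jar path depends on your installation).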

But I continued my search and found Dumbo, a Python module that allows you to easily write
and run Hadoop programs. It’s considered a good Python API for writing MapReduce programs. On the Last.fm blog, they posted a short guide to getting your Hadoop
jobs working with Dumbo. Two main selling points: simplicity and productivity. Of course, one piece of advice: if you are going to develop a really data-intensive
processing job, it may be better to use Java, because Hadoop itself is written in it. Test it, improve it, compare its execution with the Python version, and select the best option
for you.
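With Dumbo, a word count shrinks to two plain functions, in the style of the example its documentation popularizes (this is my sketch of that pattern). Dumbo groups the mapper output by key before calling the reducer, so the reducer just sums an iterator of counts:

```python
def mapper(key, value):
    # For text input, Dumbo calls the mapper with (byte offset, line)
    # pairs; we only care about the line.
    for word in value.split():
        yield word, 1

def reducer(key, values):
    # Dumbo groups the mapper output by key, so `values` is an
    # iterator over every count emitted for this word.
    yield key, sum(values)
```

With Dumbo installed, you would add `import dumbo; dumbo.run(mapper, reducer)` under a `__main__` guard and submit the script with something like `dumbo start wordcount.py -input ... -output ...`.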

Then, I found this amazing blog post
from Michael Noll, explaining in depth how to use Python with Hadoop. Please don’t forget to read it.
Another player is hadoopy, built with the same purpose as Dumbo. Check it out and try it.

mrjob: another player, built by Yelp

The Yelp Engineering Team released its own Python framework for writing Hadoop Streaming jobs, called
mrjob. In a
great post on their engineering blog, they explained
why they developed mrjob and shared their work with the world. Thanks, guys; it looks like a good project for my open source contributions.

mrjob works both with Amazon’s Elastic MapReduce and with your own Hadoop cluster, and it is also available on the
Python Package Index, so you can install it this way:
easy_install mrjob.
The documentation is here.
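In mrjob, a job is a class that overrides `mapper` and `reducer`. Here is a word-count sketch in that style; the `ImportError` fallback is my own addition so the mapper/reducer logic can be read and exercised even on a machine without mrjob installed:

```python
try:
    from mrjob.job import MRJob
except ImportError:
    # mrjob not installed: fall back to a plain base class so the
    # mapper/reducer logic below still works on its own.
    MRJob = object

class MRWordCount(MRJob):
    """Word count as an mrjob job: one class, two overridden methods."""

    def mapper(self, _, line):
        # mrjob feeds the mapper (key, line) pairs; for plain text
        # input the key is None.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # `counts` is an iterator over every 1 the mappers emitted
        # for this word.
        yield word, sum(counts)
```

With mrjob installed, you would add `if __name__ == '__main__': MRWordCount.run()` and run it locally with `python wordcount.py input.txt`, or on Elastic MapReduce by passing `-r emr`.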

Other Python-based players outside the Hadoop ecosystem

But I continued my search, looking for a completely Pythonic solution for writing MapReduce applications, and yes, I found one:
the Disco Project.

The Disco Project

This project is sponsored by Nokia Research and the Disco Project development team, and it is a pure implementation of MapReduce for distributed processing. Disco supports parallel computations
over large data sets, stored on an unreliable cluster of computers, as in the original framework created by Google. This makes it
a perfect tool for analyzing and processing large data sets, without having to worry about difficult technicalities related
to distribution, such as communication protocols, load balancing, locking, job scheduling, and fault tolerance, all of which are handled by Disco.
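A Disco job also boils down to two plain functions. This sketch follows the naming of Disco's own word-count tutorial (where the functions are literally called `map` and `reduce`), but I group the sorted pairs with the standard library's `groupby`; Disco itself ships a `kvgroup` helper in `disco.util` for the same purpose:

```python
from itertools import groupby

def map(line, params):
    # Called once per input line; `params` carries user-defined job
    # parameters (unused here).
    for word in line.split():
        yield word, 1

def reduce(iter, params):
    # The reducer receives an iterable of (word, count) pairs;
    # sorting puts equal words next to each other so groupby can
    # collect them and we can sum each group.
    for word, counts in groupby(sorted(iter), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in counts)
```

You then submit the job from the same script with Disco's `Job` API, roughly `Job().run(input=[...], map=map, reduce=reduce)`, and read the results back with `result_iterator(job.wait())` (both from `disco.core`).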

The basic workflow is:

  • Disco users start Disco jobs from Python scripts
  • Job requests are sent over HTTP to the master, an Erlang process
  • The master launches slaves on each node over SSH
  • Slaves run Disco tasks in worker processes

So, what are you waiting for to try them? And have you considered contributing to them?