The first time I heard about this role was while reading “Beautiful Data: The Stories Behind Elegant Solutions”, edited by Toby Segaran and Jeff Hammerbacher. So I asked myself: “Why is this so important?” OK, let’s go and search for information about it. I began my search and found an interview that Hal Varian, Chief Economist at Google, gave to the McKinsey Quarterly, in which he said that statistician would become one of the most wanted roles of the next 10 years. In his own words: “The sexy job in the next ten years will be statisticians… The ability to take data — to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it — that’s going to be a hugely important skill.”
Umm, let me think for a moment. All of this is strongly related. The Data Analyst has to know many things, but what if we could teach or guide young professionals (like me) along this hard path? In this post, I will try to give my impressions of what is needed to be a good Data Scientist.
Use the right tool for the job: R and Apache Hadoop
R
Every day, when I finish my work, I realize that R and Hadoop are the right tools for me. R is a free software environment for statistical computing and graphics, and an implementation of the S language. For data analysis, it can be highly efficient to use a special-purpose language like S, compared to using a general-purpose language. In a really useful talk at OSCON 2009, called “Open Source Analytics: Visualization and Predictive Modeling of Big Data with R”, Michael E. Driscoll, Ph.D. (http://www.dataspora.com/blog) gave several key points on why we should use R for data analysis, especially with Big Data.
Some of these points are:
- It’s open source. Yes, that’s really cool and important for me
- We can manipulate data
- We can build models based on statistics (the real wow with R)
- We can visualize that data (with many packages: ggplot2, lattice, etc.)
- It’s extensible via packages
“OK,” someone might say, “we can find these things in other languages too.” He answered: “Yes, that’s true, but I give you one language that already has all of these things.” Actually, we can’t compete with that.
R is huge, and thanks to its extensibility you can do a lot with the language. At the time of this writing, there are more than 1000 packages available for free on the CRAN site (http://cran.r-project.org).
My recommended packages for big data:
- plyr: tools for splitting, applying and combining data
- ggplot2: The Grammar of Graphics
- biglm: linear models for data too large to fit in memory
- RApache: R for the Web
- REvoAnalytics, the amazing set of routines developed by Revolution Analytics
- Rcpp: the interface between R and C++
- and other packages for parallel execution of code: Rmpi, papply, snow, multicore, etc.
Apache Hadoop
Many companies and organizations are very interested in this project. But the question is: why? I will try to answer it. Apache Hadoop was created by Doug Cutting and Mike Cafarella. If you don’t know who Cutting is, he is also the creator of Apache Lucene, the widely used search library.
From Tom White’s well-known book “Hadoop: The Definitive Guide, 2nd Edition” (O’Reilly, October 2010):
“This, in a nutshell, is what Hadoop provides: a reliable shared storage and analysis
system. The storage is provided by HDFS and analysis by MapReduce. There are other
parts to Hadoop, but these capabilities are its kernel.”
Yes, it’s cool, I know, and you may be saying it too: “Wow, this is the solution to my problem with big data analysis”.
Hadoop was designed on four axioms:
- System Shall Manage and Heal Itself
- Performance Shall Scale Linearly
- Compute Should Move to Data
- Simple Core, Modular and Extensible
It has other strengths too:
- Can operate with structured and unstructured data
- Has a large community behind it, and an active ecosystem
- Has many use cases for companies of all sizes
- And it’s open source, under the friendly Apache License 2.0
It can be obtained in several forms:
- A complete VMware image, ready to use
- RPM packages for Red Hat-based distributions and SUSE/openSUSE
- deb packages for Debian and Ubuntu distributions
- And of course, source and binary files
MapReduce
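As a rough sketch of the programming model (plain Python, not Hadoop’s actual API), a MapReduce job can be thought of as three phases: a map function turns each input record into key/value pairs, the framework groups the values by key, and a reduce function collapses each group into a result. The function names, the maximum-per-year job and the sample records below are all made up for illustration.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every input record and collect (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle_phase(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical job: the maximum reading per year, from "year,reading" lines.
def year_reading_mapper(line):
    year, reading = line.split(",")
    return [(year, int(reading))]


def max_reducer(key, values):
    return max(values)


def run_job(records):
    pairs = map_phase(records, year_reading_mapper)
    groups = shuffle_phase(pairs)
    return reduce_phase(groups, max_reducer)


# Made-up sample data, only to exercise the three phases.
sample = ["1949,111", "1950,22", "1949,78", "1950,0"]
print(run_job(sample))
```

In a real cluster the map and reduce calls run on many machines and the grouping step moves data across the network; here everything happens in one process, which is enough to see the shape of a job.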
HDFS
- Can store very large files (in the order of GB, TB and PB)
- Separates the filesystem metadata (on a node called the NameNode) from the application data (on one or more nodes called DataNodes)
- Assumes that hardware can fail. For that reason, it replicates the data across multiple machines in a cluster (the default replication factor is 3)
- Each file is broken into chunks (blocks of 64 MB by default, although many users use 128 MB) and stored across multiple DataNodes as local OS files
- It’s based on the Write-Once-Read-Many-Times pattern
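To make the block and replication numbers above more concrete, here is a small back-of-the-envelope calculation in Python. It is only arithmetic, not an HDFS client: the function name `hdfs_footprint` is my own, and it assumes the last block of a file only takes as much space as the data left over.

```python
MB = 1024 * 1024


def hdfs_footprint(file_size_bytes, block_size_bytes=64 * MB, replication=3):
    """Return (blocks, block replicas, raw bytes stored) for a single file.

    Defaults follow the values quoted in the list above:
    64 MB blocks and a replication factor of 3.
    """
    if file_size_bytes < 0 or block_size_bytes <= 0 or replication <= 0:
        raise ValueError("sizes and replication must be positive")
    # Integer ceiling division: a partial last block still counts as a block.
    blocks = (file_size_bytes + block_size_bytes - 1) // block_size_bytes
    replicas = blocks * replication
    # Assumption: a short last block only stores the remaining data.
    raw_bytes = file_size_bytes * replication
    return blocks, replicas, raw_bytes


one_gb = 1024 * MB
print(hdfs_footprint(one_gb))                             # 64 MB blocks
print(hdfs_footprint(one_gb, block_size_bytes=128 * MB))  # 128 MB blocks
```

With these defaults, a 1 GB file becomes 16 blocks, each stored three times, so the cluster holds 48 block replicas and about 3 GB of raw data. Larger blocks mean fewer entries for the NameNode to track.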
Around this core there is an ecosystem of related projects:
- Hive: is a data warehouse infrastructure built on top of Hadoop that provides tools to enable easy data summarization, ad hoc querying and analysis of large datasets stored in Hadoop files
- Pig: is a platform for analyzing large data sets. Pig’s language, Pig Latin, lets you specify a sequence of data transformations such as merging data sets, filtering them, and applying functions to records or groups of records
- Hadoop Streaming: is a utility that comes with the Hadoop distribution. The utility allows you to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer
- Flume: is a framework and conduit for collecting and quickly shipping data records from many sources to one centralized place for storage and processing
- Sqoop: is an open-source tool that allows users to extract data from a relational database into Hadoop for further processing
- Oozie: is a tool developed by Yahoo! for writing workflows of interdependent Hadoop jobs
- HUE: is a user interface framework and SDK for visual Hadoop applications
- Zookeeper: is a coordination service for distributed applications
- HBase: is the Hadoop database for random read/write access
- Cascading: is a Data Processing API, Process Planner, and Process Scheduler used for defining and executing complex, scale-free, and fault tolerant data processing workflows on an Apache Hadoop cluster. All without having to ‘think’ in MapReduce.
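Since Hadoop Streaming accepts any executable or script as mapper and reducer, here is a minimal word-count sketch in Python of what such a pair could look like. The mapper emits tab-separated `key<TAB>value` lines and the reducer expects its input sorted by key. The function names and the `simulate` helper are my own, and the whole thing runs locally without Hadoop, as a stand-in for `cat input | mapper | sort | reducer`.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts per word. The input must already be sorted by key."""
    parsed = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def simulate(lines):
    """Local stand-in for the map, sort and reduce steps of a streaming job."""
    return list(reducer(sorted(mapper(lines))))


# Made-up input lines, only to show the flow of data.
for output_line in simulate(["big data with R", "Big Data with Hadoop"]):
    print(output_line)
```

In a real job, each function would live in its own script that reads `sys.stdin` and prints to standard output, and the scripts would be handed to the streaming utility through its `-mapper` and `-reducer` options, together with `-input` and `-output` paths.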
It’s time to enter the Cloud
- Research, Business, everything is based on numbers: Statistics
- Mining, Mining: Data Mining
- Visualize it: Information Aesthetics