The first time I heard about this role was while reading the book “Beautiful Data: The Stories Behind Elegant Solutions”, edited by Toby Segaran and Jeff Hammerbacher. So I asked myself: “Why is this so important?” OK, let’s go search for information about it. I began my search and found an interview given by Hal Varian, Chief Economist at Google, to the McKinsey Quarterly, in which he described how the role of statistician would become one of the most wanted roles of the next 10 years. In his own words: “The sexy job in the next ten years will be statisticians… The ability to take data — to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it — that’s going to be a hugely important skill.” Hmm, let me think for a moment. All of this is strongly related. Well, a Data Analyst has to know many things, but what if we could teach or guide young professionals (like me) along this hard path? In this post, I will try to give my impressions about what is needed to be a good Data Scientist.
Use the right tool for the job: R and Apache Hadoop
Every single day when I finish my work, I realize that R and Hadoop are the right tools for me. R is a free software environment for statistical computing and graphics; it is an implementation of the S language. For data analysis, using a special-purpose language like S can be highly efficient compared to using a general-purpose language. In a really useful talk at OSCON 2009, called “Open Source Analytics: Visualization and Predictive Modeling of Big Data with R”, Michael E. Driscoll, Ph.D. (http://www.dataspora.com/blog) gave several key points on why we should use R for data analysis, especially when focused on big data.
Some of these points are:
- It’s open source. Yes, that’s really cool and important for me
- We can manipulate data
- We can build models based on statistics (the real “wow” of R)
- We can visualize that data (with many packages: ggplot2, lattice, etc.)
- And it’s extensible via packages
“OK,” you might say, “we can find these kinds of things in other languages.” His answer: “Yes, that’s true, but I give you one language that already has all of these things.” Honestly, we can’t compete with that.
R is huge, and thanks to its extensibility, you can do a lot with the language. At the time of this writing, there are more than 1000 packages available for free on the CRAN site (http://cran.r-project.org).
My recommended packages for big data:
- plyr: tools for the split-apply-combine pattern
- ggplot2: an implementation of the Grammar of Graphics
- biglm: linear regression models for data too large to fit in memory
- RApache: R for the Web
- REvoAnalytics: the amazing set of routines developed by Revolution Analytics
- Rcpp: the interface between R and C++
- and other packages for parallel execution of code: Rmpi, papply, snow, multicore, etc.
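To make the split-apply-combine idea behind plyr concrete, here is a minimal sketch of the same pattern in plain Python (illustrative only, with toy data I made up; plyr does this natively on R data frames):

```python
from collections import defaultdict

# Toy records: (species, petal_length) pairs, a stand-in for a data frame.
records = [
    ("setosa", 1.5), ("setosa", 2.5),
    ("virginica", 5.0), ("virginica", 6.0),
]

# Split: group rows by a key column.
groups = defaultdict(list)
for species, petal in records:
    groups[species].append(petal)

# Apply + combine: compute a summary per group and collect the results.
means = {species: sum(vals) / len(vals) for species, vals in groups.items()}

print(means)  # {'setosa': 2.0, 'virginica': 5.5}
```

In plyr the split, apply, and combine steps collapse into a single call; the point of the sketch is only to show what that pattern does.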
There is a lot of interest from many companies and organizations in Apache Hadoop. But why? I will try to answer that. Apache Hadoop is the creation of Doug Cutting and Mike Cafarella. If you don’t know who Cutting is, you may remember him as the creator of Apache Lucene, the widely used search library.
From Tom White’s well-known book “Hadoop: The Definitive Guide, 2nd Edition”, published by O’Reilly in October 2010:
“This, in a nutshell, is what Hadoop provides: a reliable shared storage and analysis system. The storage is provided by HDFS and analysis by MapReduce. There are other parts to Hadoop, but these capabilities are its kernel.”
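A word count is the canonical illustration of the analysis half. The following toy sketch (plain Python, all names mine; real Hadoop distributes this work across a cluster) shows what the map, shuffle, and reduce steps each do:

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (key, value) pair for every word in one input line.
    for word in line.split():
        yield (word.lower(), 1)

def reduce_phase(key, values):
    # Reduce: aggregate all values emitted for one key.
    return (key, sum(values))

lines = ["Hadoop is storage", "Hadoop is analysis"]

# Shuffle/sort: group every emitted value by its key (the framework does this).
grouped = defaultdict(list)
for line in lines:
    for key, value in map_phase(line):
        grouped[key].append(value)

counts = dict(reduce_phase(k, vs) for k, vs in grouped.items())
print(counts)  # {'hadoop': 2, 'is': 2, 'storage': 1, 'analysis': 1}
```

On a real cluster the mappers and reducers run on many machines at once, and the shuffle moves data between them over the network.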
Yes, it’s cool, I know, and you may be saying it too: “Wow, this is the solution to my big data analysis problem”.
Hadoop was designed based on four axioms:
- System Shall Manage and Heal Itself
- Performance Shall Scale Linearly
- Compute Should Move to Data
- Simple Core, Modular and Extensible
Besides those four axioms, Hadoop has other points in its favor:
- It can operate on structured and unstructured data
- It has a large community behind it, and an active ecosystem
- It has many use cases, for companies of all sizes
- And it’s open source, under the friendly Apache License 2.0
Getting started is easy, since Hadoop is distributed in several forms:
- a complete VMware image ready for use
- RPM packages for Red Hat-based distributions and SUSE/openSUSE
- .deb packages for Debian and Ubuntu distributions
- and of course, source and binary files
HDFS, the storage half, has some distinctive characteristics:
- It can store very large files (on the order of GB, TB, and even PB)
- It separates the filesystem metadata (in a node called the NameNode) from the application data (on one or more nodes called DataNodes)
- It assumes that hardware can fail. For that reason, it replicates the data across multiple machines in a cluster (the default replication factor is 3)
- Each file is broken into chunks (by default, blocks of 64 MB, although many users use 128 MB) and stored across multiple DataNodes as local OS files
- It’s based on the write-once-read-many-times pattern
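The block-and-replication scheme is easy to reason about with a little arithmetic. A quick sketch, assuming the default 64 MB block size and replication factor of 3 mentioned above (function names are mine):

```python
import math

def hdfs_blocks(file_size_mb, block_mb=64):
    # A file is split into fixed-size blocks; the last one may be partial.
    return math.ceil(file_size_mb / block_mb)

def raw_storage_mb(file_size_mb, replication=3):
    # Every block is replicated, so raw cluster usage is roughly size x factor.
    # (A final partial block only occupies its actual size on disk.)
    return file_size_mb * replication

# A 1 TB (1,048,576 MB) file with 64 MB blocks and replication factor 3:
print(hdfs_blocks(1_048_576))     # 16384 blocks
print(raw_storage_mb(1_048_576))  # 3145728 MB of raw storage (~3 TB)
```

That 3x raw cost is the price of surviving the failure of any single machine (or even two) without losing data.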
Around the core, there is a whole ecosystem of related projects:
- Hive: a data warehouse infrastructure built on top of Hadoop that provides tools for easy data summarization, ad hoc querying, and analysis of large datasets stored in Hadoop files
- Pig: a platform for analyzing large data sets. Pig’s language, Pig Latin, lets you specify a sequence of data transformations such as merging data sets, filtering them, and applying functions to records or groups of records
- Hadoop Streaming: a utility that comes with the Hadoop distribution. It allows you to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer
- Flume: a framework and conduit for collecting data records from many sources and quickly shipping them to one centralized place for storage and processing
- Sqoop: an open-source tool that extracts data from a relational database into Hadoop for further processing
- Oozie: a tool developed by Yahoo! for writing workflows of interdependent Hadoop jobs
- HUE: a user interface framework and SDK for visual Hadoop applications
- ZooKeeper: a coordination service for distributed applications
- HBase: the Hadoop database, for random read/write access
- Cascading: a data processing API, process planner, and process scheduler for defining and executing complex, scale-free, and fault-tolerant data processing workflows on an Apache Hadoop cluster, all without having to “think” in MapReduce
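Hadoop Streaming deserves a quick illustration: any script that reads lines from stdin and writes tab-separated key/value lines to stdout can act as a mapper or reducer. A minimal word-count pair might look like this (a sketch with names of my own; locally we chain the pieces with a sort, mimicking Hadoop’s shuffle phase):

```python
from itertools import groupby

def mapper(lines):
    # Mapper: emit one "word<TAB>1" line per word, as Hadoop Streaming expects.
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(lines):
    # Reducer: input arrives sorted by key, so consecutive lines share a word.
    pairs = (line.split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# sorted() stands in for the shuffle/sort that Hadoop performs between phases.
mapped = sorted(mapper(["big data", "big clusters"]))
print(list(reducer(mapped)))  # ['big\t2', 'clusters\t1', 'data\t1']
```

On a real cluster you would put each function in its own script reading sys.stdin and pass them to the streaming jar with its -mapper and -reducer options.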
It’s time to enter the Cloud
- Research, Business, everything is based on numbers: Statistics
- Mining, Mining: Data Mining
- Visualize it: Information Aesthetics