The new era of the Data Scientist

The first time I heard about this role was while I was reading the book “Beautiful Data: The Stories Behind Elegant Solutions”, edited by Toby Segaran and Jeff Hammerbacher.
 So I asked myself: “Why is this so important?” OK, let's go and find out more about it.

I began my search and I found an interview given by Hal Varian, Chief Economist at Google, to the McKinsey Quarterly, in which he described how
 the role of statistician would become one of the most wanted roles in the next ten years. In his own words:

“The sexy job in the next ten years will be statisticians… The ability to take data — to be able to understand it, to process it, to 
 extract value from it, to visualize it, to communicate it — that’s going to be a hugely important skill.”

Hmm, let me think for a moment. All of this is strongly related. The Data Analyst has to know many things,
 but what if we could teach or guide young professionals (like me) along this hard path?

In this post, I will try to give my impressions about what is needed to be a good Data Scientist.

<h2>Use the right tool for the job: R and Apache Hadoop</h2>

<h3>R</h3>

Every single day, when I finish my work, I realize that R and Hadoop are the right tools for me.
 R is a free software environment for statistical computing and graphics, and it is an implementation of the
 S language. For data analysis, using a special-purpose language like S can be far more efficient
 than using a general-purpose language.

In a really useful talk at OSCON 2009, called “Open Source Analytics: Visualization and Predictive Modeling of Big Data with R”,
 Dr. Michael E. Driscoll (http://www.dataspora.com/blog) gave several key points on why we should use R for data analysis,
 especially when focused on Big Data. Some of these points are:

<ul> 
 <li>It's open source. Yes, that's really cool and important for me</li> 
 <li>We can manipulate data</li> 
 <li>We can build models based on statistics (the real wow with R)</li> 
 <li>We can visualize that data (with many packages: ggplot2, lattice, etc.)</li> 
 <li>and it's extensible via packages</li> 
 </ul>

“OK,” you might say, “we can find these kinds of things in other languages.” And he answered: “Yes, that's true, but here you have one language that already has all of these things.”
 Honestly, we can't compete with that.

R is huge, and thanks to its extensibility, you can do a lot of things with the language. At the time of this writing, there are more than 1000 packages available 
 for free on the <a href="http://cran.r-project.org">CRAN</a> site. My recommended packages for big data: 
 <ul> 
 <li><a href="http://had.co.nz/plyr">plyr:</a> tools for splitting, applying and combining data</li> 
 <li><a href="http://had.co.nz/ggplot2">ggplot2:</a> The Grammar of Graphics</li> 
 <li><a href="http://cran.r-project.org/web/packages/biglm/index.html">biglm:</a> bounded-memory linear models for data too large to fit in memory</li> 
 <li><a href="http://cran.r-project.org/web/packages/glm/index.html">glm:</a> generalized linear models</li> 
 <li><a href="http://biostat.mc.vanderbilt.edu/rapache/">RApache: R for the Web</a></li> 
 <li>REvoAnalytics, the amazing set of routines (not free) developed by Revolution Analytics</li> 
 <li><a href="http://cran.r-project.org/web/packages/Rcpp/index.html">Rcpp:</a> the interface between R and C++</li> 
 <li>and other packages for in-parallel execution of code: Rmpi, papply, snow, multicore, etc.</li> 
 </ul>

<h3>Apache Hadoop</h3> 
There is a lot of interest from many companies and organizations in this project. But the question is: why? I will try to 
 answer it. <a href="http://hadoop.apache.org">Apache Hadoop</a> is a creation of Doug Cutting and Mike Cafarella. If you don't know who Cutting is, he is also 
 the creator of Apache Lucene, the widely used search library.

From Tom White's well-known book “Hadoop: The Definitive Guide, 2nd Edition” (O'Reilly, October 2010):

“This, in a nutshell, is what Hadoop provides: a reliable shared storage and analysis 
 system. The storage is provided by HDFS and analysis by MapReduce. There are other 
 parts to Hadoop, but these capabilities are its kernel.”

Yes, it's cool, I know, and you may well say it too: “Wow, this is the solution to my big data analysis problem”.

Hadoop was designed based on four axioms: 
 <ul> 
 <li>System Shall Manage and Heal Itself</li> 
 <li>Performance Shall Scale Linearly</li> 
 <li>Compute Should Move to Data</li> 
 <li>Simple Core, Modular and Extensible</li> 
 </ul>

But there is more: 
 <ul> 
 <li>It can operate with structured and unstructured data</li> 
 <li>It has a large community behind it, and an active ecosystem</li> 
 <li>It has many use cases for companies of all sizes</li> 
 <li>And it's Open Source, under the friendly Apache License 2.0</li> 
 </ul>

You can check on the <a href="http://wiki.apache.org/hadoop/PoweredBy">wiki</a> how many companies use Hadoop today.

So, my friend, give Hadoop a try. You can download it from <a href="http://hadoop.apache.org/releases/">here</a>, 
 or you can use the Cloudera Distribution for Hadoop (CDH). CDH is based on the most recent stable version of Hadoop plus 
 several patches and updates. You can use it in many different ways: 
 <ul> 
 <li>a complete VMware image, ready to use</li> 
 <li>RPM packages for Red Hat-based distributions and SUSE/openSUSE</li> 
 <li>.deb packages for Debian and Ubuntu distributions</li> 
 <li>And of course, source and binary files</li> 
 </ul>

You can download it <a href="http://www.cloudera.com/downloads">here</a>, or you can use the 
 public package repositories for <a href="http://archive.cloudera.com/redhat/cdh/3/">Red Hat</a> 
 and <a href="http://archive.cloudera.com/ubuntu/cdh/3/">Ubuntu</a> too.

<h4>MapReduce</h4>

MapReduce is based on the principles of functional programming. In this programming model, data is explicitly passed between 
 functions as parameters or return values, and can only be changed by the function that is active at that moment. It is a programming 
 model for data processing in which parallelism is inherent. It is organized around a “map” function, which transforms a piece of data 
 into some number of key/value pairs. Each of these pairs is then sorted by its key and routed so that all pairs with the same key reach the same node, where 
 a “reduce” function merges the values of each key into a single result.
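
To make the model concrete, here is a minimal word-count sketch written against the classic Hadoop Java MapReduce API (the class name is hypothetical, and the input/output paths come from the command line): the map step emits a (word, 1) pair for every word it sees, and the reduce step receives all the pairs that share a key and sums them.

<pre>
// WordCount.java -- an illustrative sketch of the MapReduce model, not a tuned job.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // "map": transform one line of text into (word, 1) key/value pairs
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // "reduce": all values for the same key arrive together; sum them
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    // args[0] = input directory in HDFS, args[1] = output directory
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
</pre>

Packaged as a jar, a job like this is typically launched with the hadoop jar command, pointing it at input and output directories in HDFS.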

There are a lot of resources for studying the MapReduce programming model in depth. For example, Google Labs has the 
 <a href="http://labs.google.com/papers/mapreduce.html">original paper</a> describing the model, you can search the 
 wiki <a href="http://wiki.apache.org/hadoop/MapReduce">too</a>, or you can read Tom's book or Jason Venner's 
 “Pro Hadoop: Build scalable, distributed applications in the cloud” from Apress.

Sorry, I forgot to give you this <a href="http://www.cloudera.com/blog/2009/05/10-mapreduce-tips/">link: 10 MapReduce Tips</a> 
 from Tom.

<h4>HDFS</h4> 
 This is the other gem of Hadoop: its distributed filesystem. The architecture of HDFS is described in 
 <a href="http://storageconference.org/2010/Papers/MSST/Shvachko.pdf">“The Hadoop Distributed File System”</a> by 
 Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler, in the proceedings of MSST 2010 (May 2010).

Some of its features: 
 <ul> 
 <li>It can store very large files (on the order of gigabytes, terabytes and petabytes)</li> 
 <li>It separates the filesystem metadata (in a node called the NameNode) from the application data (on one or more nodes called DataNodes)</li> 
 <li>It assumes that hardware can fail. For that reason, it replicates the data across multiple machines in a cluster (the default replication factor is 3)</li> 
 <li>Each file is broken into chunks (by default, blocks of 64 MB, although many users use 128 MB) and stored across multiple DataNodes as local OS files</li> 
 <li>It's based on the write-once, read-many-times pattern</li> 
 </ul>
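
As a tiny illustration of that write-once, read-many idea, here is a minimal sketch of writing and then reading a file through the org.apache.hadoop.fs.FileSystem API. The NameNode address and the file path are hypothetical, and in a real cluster fs.default.name would normally come from core-site.xml rather than being set in code.

<pre>
// HdfsHelloWorld.java -- illustrative only; the NameNode URI and the path are made up.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHelloWorld {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode address; usually picked up from core-site.xml
    conf.set("fs.default.name", "hdfs://namenode:8020");
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/marcos/hello.txt"); // hypothetical path

    // Write once...
    FSDataOutputStream out = fs.create(file);
    out.writeUTF("Hello, HDFS");
    out.close();

    // ...read many times
    FSDataInputStream in = fs.open(file);
    System.out.println(in.readUTF());
    in.close();

    fs.close();
  }
}
</pre>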

But all these components are not the only pieces of the Hadoop ecosystem. There is more: 
 <ul> 
 <li><a href="http://hadoop.apache.org/hive/">Hive:</a> a data warehouse infrastructure built on top of Hadoop that provides tools for 
 easy data summarization, ad-hoc querying and analysis of large datasets stored in Hadoop files</li> 
 <li><a href="http://pig.apache.org/">Pig:</a> a platform for analyzing large data sets. Pig's language, 
 Pig Latin, lets you specify a sequence of data transformations such as merging data sets, filtering them, 
 and applying functions to records or groups of records</li> 
 <li>Hadoop Streaming: a utility that comes with the Hadoop distribution. It allows you to create and run MapReduce jobs 
 with any executable or script as the mapper and/or the reducer</li> 
 <li><a href="http://github.com/cloudera/flume">Flume</a>: a framework and conduit for collecting data records from many sources and quickly shipping them to one centralized 
 place for storage and processing</li> 
 <li><a href="http://github.com/cloudera/sqoop">Sqoop</a>: an open-source tool that 
 allows users to extract data from a relational database into Hadoop for further processing</li> 
 <li><a href="http://github.com/yahoo/oozie">Oozie</a>: a tool developed by Yahoo! for writing workflows of interdependent Hadoop jobs</li> 
 <li><a href="http://github.com/cloudera/hue">HUE</a>: a user interface framework and SDK for visual Hadoop applications</li> 
 <li><a href="http://zookeeper.apache.org">ZooKeeper</a>: a coordination service for distributed applications</li> 
 <li><a href="http://wiki.apache.org/hadoop/Hbase">HBase</a>: the Hadoop database, for random read/write access</li> 
 <li><a href="http://cascading.org">Cascading</a>: a data processing API, process planner, and process scheduler used for defining 
 and executing complex, scale-free, and fault-tolerant data processing workflows on an Apache Hadoop cluster, 
 all without having to ‘think’ in MapReduce</li> 
 </ul> 
There are many problems that have been solved using this piece of technology and its ecosystem.
<h2>It’s time to enter the Cloud</h2>

There are a lot of companies that are currently using cloud services in many of their business processes. GitHub, The New York Times, Hopper.Travel, and 
 Razorfish are examples of this. And there are big players in this movement: Amazon, Google and Microsoft.

There are many companies that use the incredible services behind Amazon's <a href="http://aws.amazon.com/elasticmapreduce/">Elastic MapReduce</a> 
 and <a href="http://aws.amazon.com/s3/">S3</a>. The first, as described on its page, “is a web service that enables businesses, 
 researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework 
 running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).”

And the second is “a data storage service. You are billed monthly for storage and data transfer. Transfer between S3 and Amazon EC2 is free. 
 This makes use of S3 attractive for Hadoop users who run clusters on EC2.”

3. Research, Business, everything is based on numbers: Statistics

4. Mining, Mining: Data Mining

5. Visualize it: Information Aesthetics

--
Marcos Luis Ortiz Valmaseda
Software Engineer (Distributed Systems)
http://uncubanitolinuxero.blogspot.com