My advice for young Data Scientists

Data Scientist is the sexy role for the next 20 years

This phrase was said by Hal Varian, Chief Economist at Google, in an interview with the McKinsey Quarterly. Varian, together with his team, has helped turn Google into one of the most profitable companies in the world, reaching amazing numbers: 29.5 billion dollars in a year.

But these numbers would not be possible without three main roles that Varian names: “Data Analyst”, “Statistician” and “Data Visualization Expert”, described by the executive as the “hot and sexy jobs”. Varian says: “These professionals are and will be the key to the success of companies in the coming years, especially in these difficult times when it is very hard to turn a business into a profitable one”. So many companies are looking for a new kind of professional who can combine these three skills: the “Data Scientist”. If you do a simple search for “Data Scientist” on Indeed.com or Simplyhired.com, you can see the rising interest of companies in this unique kind of professional.

I want to be a Data Scientist. How can I prepare for the role?

This is a question that many young professionals (like me) have in their minds: “I want to be a Data Scientist, but how can I obtain the knowledge required to accomplish this?”. There are a lot of whitepapers, books, articles and blogs, and a lot of techniques and tools. For that reason, when a new professional faces this insane quantity of resources, a new question arises: with so many books and tools, where do I begin? This is the main topic of this post: to help new professionals, from my modest experience in this field, to select good and useful content. Ok, first, my book list:

All these amazing books help me every day, because they are written for practitioners who use the techniques and tools described in them every day. Remember, this is my personal list; you can build your own, adding more books or removing some. I give you a starting point, and you can decide how to follow it. I recommend the order given here, because the first book (Head First Data Analysis) does an amazing job of explaining the tricky and challenging problems that a Data Analyst can face, in a concise and clean way, helped by its outstanding writing style. (Note: all the Head First books are incredibly useful.)

OK, I have the content. What about the tools?

I love Open Source, so all the tools that I recommend to you are developed and improved every day under these principles:

  • Python: It’s an amazing language with a concise and clean syntax, easy to learn and easily extensible, with a lot of useful modules used by scientists, like NumPy and Matplotlib.
  • R: This amazing platform for statistical computing and data visualization has become the “lingua franca of statisticians” today. The reasons are many:
      • It’s free.
      • It runs on Unix, Windows and Mac OS X.
      • It has an amazing built-in help system.
      • It has excellent graphics capabilities.
      • It has a powerful, easy-to-learn syntax with many built-in statistical functions.
      • It has several powerful GUI tools like RStudio, Revolution R Enterprise (which is free for academics and students), etc.
      • And many more.
  • Apache Hadoop and its ecosystem: The popular Open Source implementation of the MapReduce paradigm, based on a research paper published by Google engineers in 2004. This project has become one of today’s major trends, together with “Big Data” and “NoSQL”. Many companies use this amazing platform today for processing large data sets (MapReduce) and for distributed storage (HDFS): Yahoo! for social graph analysis, Rackspace for cross-data-center log processing, The New York Times for converting 4 TB of images from its archives to PDF files, VISA for large-scale transaction analysis, eHarmony for matchmaking, JP Morgan Chase for data processing in financial services, and many more examples that you can find on the Hadoop World 2009 site and on that of the latest edition, in 2011. Many companies offer commercial versions of Apache Hadoop, such as Greenplum, the EMC division, with its Greenplum HD; MapR Technologies with its M3 and M5 editions; and IBM with its BigInsights project. For me, though, the leader in commercial support, training and even certification is Cloudera, the company founded in 2008 by Amr Awadallah, former VP of Engineering for Data Systems at Yahoo! (now Cloudera’s CTO); Jeff Hammerbacher, former manager of the data scientists team at Facebook (now Vice President of Products and Chief Scientist at Cloudera); Christophe Bisciglia; and Mike Olson (currently the company’s CEO), the former CEO of Sleepycat, makers of BerkeleyDB, the open source embedded database engine, who then spent two years at Oracle as Vice President for Embedded Technologies after Oracle’s acquisition of Sleepycat in 2006.
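To make the MapReduce paradigm mentioned above a bit more concrete, here is a minimal sketch of its three steps (map, shuffle, reduce) as a word count in plain Python. This is not Hadoop code and does not use any Hadoop API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical names chosen just for this illustration, and everything runs in a single process.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Shuffle step: group every emitted value under its key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all the values of one key into one result."""
    return key, sum(values)


def word_count(documents):
    """Run the three steps in sequence (hypothetical helper, one process only)."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data is big", "data science needs data"]
    print(word_count(docs))
```

In a real cluster the map and reduce steps would run in parallel on many machines and the framework would do the shuffle over the network; the sketch only shows the shape of the computation.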

Final Thoughts

The rise of the Data Scientist began with Jeff Hammerbacher, when he created and led the Data Team at Facebook. These days, every company and organization is looking for this unique kind of professional to do three key things, as Michael E. Driscoll (co-founder and CEO of Metamarkets) said in his talk “Open Source Analytics Visualization and Predictive Modeling of Big Data with R” at OSCON 2009:

“We need professionals who are able to munge, model and visualize data”
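As a tiny illustration of those three verbs, here is a rough sketch using only the Python standard library. The function names (`munge`, `model`, `visualize`) and the sample data are hypothetical, chosen only for this example; it assumes positive y values and at least two distinct x values. In practice you would probably reach for NumPy, Matplotlib or R instead.

```python
from statistics import mean


def munge(raw_rows):
    """Munge: drop rows with missing or non-numeric fields, convert the rest to floats."""
    clean = []
    for x, y in raw_rows:
        try:
            clean.append((float(x), float(y)))
        except (TypeError, ValueError):
            continue  # skip rows that cannot be parsed
    return clean


def model(points):
    """Model: fit y = slope * x + intercept by ordinary least squares."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def visualize(points, width=40):
    """Visualize: build a crude text bar chart, one line per point."""
    top = max(y for _, y in points)
    return "\n".join(
        f"{x:6.1f} | " + "#" * round(width * y / top) for x, y in points
    )


if __name__ == "__main__":
    raw = [("1", "2"), ("2", "4"), ("3", None), ("x", "5"), ("4", "8")]
    points = munge(raw)
    print(model(points))
    print(visualize(points))
```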

So it’s a great moment to develop these skills and, in that way, to be able to work on challenging problems whose solutions could save your current or future CEO a lot of headaches. I leave it to you to decide how to use this information, and if you have any comments, please just send me an email.

Happy Hacking!!!

Marcos Ortiz
