Hadoop: The Definitive Guide

Big Data is undoubtedly one of the hottest topics in today's IT business, and Apache Hadoop is a natural starting point for early adopters of the technology. Hadoop: The Definitive Guide by Tom White (now in its 3rd edition) is supposed to be a must-read for those who want to take Hadoop and its ecosystem seriously.

A first look at the contents of the book tells one thing: its scope is unexpectedly wide. It covers not only the core of Hadoop and the development of MapReduce software, but also the installation and administration of a Hadoop cluster (which deserves a separate book in itself). It also takes a quick peek at some of Hadoop's complementary projects and describes a few interesting use cases from well-known names in the sector.

While the wide scope is generally an advantage, the depth of coverage is uneven, which is a problem especially for a learner. Some topics are described very verbosely, while others are only skimmed. The best example is probably the in-depth description of HDFS and all the filesystems that Hadoop can work with, compared to the short and partial description of HBase.

Even when a topic is not presented in detail, most of the concepts described come with an example, which is a great companion. Some readers may find it confusing to see code samples in different programming languages, given that Hadoop is considered a Java solution. Luckily, the samples are rather straightforward, and even if one knows nothing about Python or Ruby, there will be no problem understanding them.

The last thing I consider a disadvantage is the lack of a real tutorial. If you are a complete newbie to Big Data, MapReduce and Hadoop, you will not find step-by-step instructions on what you should do in order to run your first Hadoop instance and your first MapReduce job. Fine, you will find the installation description later in the book, and an “introductory” task is presented at the beginning, but having plain instructions to follow helps a lot in the process of learning. I’m not expecting any “Head First: Data Analysis with Hadoop”, but a quick start chapter is a must.

All in all, there is no better or more up-to-date publication on Apache Hadoop right now. Readers should be aware, though, that it is reference material rather than a textbook for learning Hadoop. If you are a developer wanting to know what the MapReduce buzz is all about, I would rather suggest you wait for “Developing MapReduce applications with Hadoop”. I hope such a book will be written some day.

Review author: Mateusz Haligowski