Big data, fast: Avoiding Hadoop performance bottlenecks

by Jack Vaughan, SearchDataManagement

Hadoop shows a lot of promise as a relatively inexpensive landing place for the streams of big data coursing through organizations. The open source technology provides a distributed framework, built around highly scalable clusters of commodity servers, for processing, storing and managing data that fuels advanced analytics applications. But there’s no such thing as a free lunch: In production use, achieving high levels of Hadoop performance can be a challenge.

Despite all the attention it’s getting, Hadoop is still a relatively young technology: it reached Version 1.0 status only in December 2011. As a result, much of the work users are doing with Hadoop remains somewhat experimental, especially outside of the large Internet companies that helped to create it and that are replete with Java programmers and systems administrators versed in deploying the technology.

In addition, the core combination of the Hadoop Distributed File System (HDFS) and MapReduce programming model has been joined by a continually expanding ecosystem of additional components…

… Meanwhile, some organizations are using complex event-processing engines to goose their Hadoop performance. Even Yahoo Inc., the company where Hadoop was first hatched, ran into problems with the technology, which had trouble keeping up with incoming information about the activities of online users that Yahoo wanted to correlate with its inventory of available ads…