Database Technology for the Web: Part 1 – The MapReduce Debate

by Colin White, BI Research

Over the course of the nearly forty years I have been working on database systems, there have been many debates and arguments about which database technology to use for any given application. These arguments have become heated, especially when a new database technology appears that claims to be superior to anything that came before.

When relational systems were first introduced, the hierarchical (IMS) and network (IDMS) database system camps argued that relational systems were inferior and could not provide good performance. Over time this argument proved false, and relational products now provide the database management underpinnings for a vast number of operational and analytical applications. Relational database products have survived similar battles with object-oriented database technology and multidimensional database systems.

Just when I thought the main relational products had become a commodity, several new technologies appeared that caused the debates to start again. Over the course of the next few newsletters, I want to review these new technologies and discuss the pros and cons of each of them. This time I want to look at MapReduce, which Michael Stonebraker, one of the original relational database technology researchers, together with David DeWitt recently described as “a giant step backwards.”

MapReduce has been popularized by Google, which uses it to process many petabytes of data every day. A landmark paper by Jeffrey Dean and Sanjay Ghemawat of Google states that:

“MapReduce is a programming model and an associated implementation for processing and generating large data sets…. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.”
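To make the programming model in the quotation concrete, here is a minimal single-process sketch of a word count written in the map/reduce style. It is only an illustration: the names map_words, reduce_counts, and run_mapreduce are hypothetical, and the tiny driver stands in for the run-time system, which in a real deployment would partition the input, schedule the work across many machines, and recover from failures.

```python
from collections import defaultdict

def map_words(doc_name, text):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        yield word, 1

def reduce_counts(word, counts):
    """Reduce step: combine all values emitted for one key."""
    return word, sum(counts)

def run_mapreduce(inputs, mapper, reducer):
    """Hypothetical single-process driver (an assumption for illustration).

    It only groups intermediate values by key; a real run-time system
    would also distribute the map and reduce calls across a cluster.
    """
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return dict(reducer(k, vs) for k, vs in sorted(groups.items()))

documents = [
    ("a.txt", "the quick brown fox"),
    ("b.txt", "the lazy dog the end"),
]
word_counts = run_mapreduce(documents, map_words, reduce_counts)
print(word_counts)
```

The programmer supplies only the two small functions; everything about grouping and execution lives in the driver, which is the division of labor the quotation describes.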

Michael Stonebraker’s comments on MapReduce explain MapReduce in more detail ….

Neither MapReduce nor SQL is particularly well suited to the dynamic processing of in-flight data such as event data. This is why we are seeing extensions to SQL (such as StreamSQL) and new technologies such as stream and complex event processing to handle this need. MapReduce is, however, useful for filtering and transforming large event files such as web logs. Part 1 of White’s excellent analysis.
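As a rough illustration of the web-log case mentioned above, the sketch below filters a handful of log lines down to "404" responses and counts them per requested path. The sample lines, the field positions (a layout resembling the common web server log format), and the function names are all assumptions made for this example; everything runs in one process rather than on a cluster.

```python
from collections import defaultdict

# Made-up sample data; the field layout is an assumption.
LOG_LINES = [
    '10.0.0.1 - - [01/Jan/2009:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2009:10:00:01 +0000] "GET /missing HTTP/1.1" 404 0',
    '10.0.0.1 - - [01/Jan/2009:10:00:02 +0000] "GET /missing HTTP/1.1" 404 0',
    '10.0.0.3 - - [01/Jan/2009:10:00:03 +0000] "GET /other HTTP/1.1" 404 0',
    'garbage line',
]

def map_not_found(line):
    """Map step: filter to 404 responses and emit (path, 1)."""
    fields = line.split()
    if len(fields) < 9:
        return  # skip malformed lines
    path, status = fields[6], fields[8]
    if status == "404":
        yield path, 1

def reduce_sum(path, counts):
    """Reduce step: total the hits recorded for one path."""
    return path, sum(counts)

# Group the mapper output by key, then reduce each group.
groups = defaultdict(list)
for line in LOG_LINES:
    for path, one in map_not_found(line):
        groups[path].append(one)

not_found = dict(reduce_sum(p, c) for p, c in groups.items())
print(not_found)
```

The filtering happens entirely in the map step and the aggregation in the reduce step, so each log line can be processed independently of the others. Note that this works on a complete file after the fact, not on in-flight events.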
