CCC Colloquium: Andy Pavlo (September 3, 2009)

MapReduce and Parallel DBMSs: A Comparison of Approaches to Large-Scale Data Analysis

Andy Pavlo (Brown University)

Thursday, September 3, 2009
4pm, AVW 3258

Talk slides: PDF (1.73 MB)


The MapReduce (MR) paradigm has been heralded as a revolutionary new platform for large-scale, massively parallel data access. Some proponents claim that the extreme scalability of MR will relegate relational database management systems (DBMSs) to the status of legacy technology. In this talk, however, we discuss results from our recent benchmark study, which suggest that using MR systems to perform tasks that are best suited for DBMSs yields less than satisfactory results. This leads us to conclude that MR is more akin to an Extract-Transform-Load (ETL) system than a DBMS, as it can quickly load and analyze large amounts of data in an ad hoc manner. As such, it is complementary to DBMS technology rather than a competitor. We also discuss the differences in the architectural decisions of MR systems and database systems, and provide insight into how the two systems should complement one another.

About the Speaker

Andrew Pavlo is a third-year Computer Science PhD student in Brown University's Data Management Group, working under the guidance of Dr. Stanley Zdonik.