Algorithms for MapReduce
Lecturer: Professor Jeffrey D. Ullman (Stanford University, USA)
About the lecturer: Jeff Ullman is the Stanford W. Ascherman Professor of Engineering (Emeritus) in the Department of Computer Science at Stanford and CEO of Gradiance Corp. He received his B.S. from Columbia University in 1963 and his Ph.D. from Princeton in 1966. Prior to his appointment at Stanford in 1979, he was a member of the technical staff of Bell Laboratories from 1966 to 1969 and on the faculty of Princeton University from 1969 to 1979. From 1990 to 1994 he was chair of the Stanford Computer Science Department. Ullman was elected to the National Academy of Engineering in 1989 and the American Academy of Arts and Sciences in 2012, and has held Guggenheim and Einstein Fellowships. He has received the SIGMOD Contributions Award (1996), the ACM Karl V. Karlstrom Outstanding Educator Award (1998), the Knuth Prize (2000), the SIGMOD E. F. Codd Innovations Award (2006), and the IEEE John von Neumann Medal (2010). He is the author of 16 books, including books on database systems, compilers, automata theory, and algorithms.
Course summary: We begin with an overview of a modern programming environment, involving massive numbers of inexpensive compute nodes connected by an inexpensive network. The programming basis for such hardware is a "distributed file system," which is designed to store data reliably on cheap, failure-prone hardware. We introduce MapReduce (Hadoop) and other programming systems that are important components of this new environment, including SQL implementations (Pig, Hive) on top of MapReduce, object stores, workflow systems, and graph-processing systems.
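To give a concrete feel for the programming model before the lectures go into detail, here is a minimal sketch of the classic word-count example in Python. The names map_fn, reduce_fn, and run_mapreduce are illustrative assumptions, not any system's API; the local driver merely simulates, in one process, the shuffle that a real MapReduce implementation such as Hadoop performs across the network.

from collections import defaultdict

def map_fn(document):
    # Emit (word, 1) for every word of the input document.
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Sum the counts delivered for one word.
    yield (word, sum(counts))

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Simulate the shuffle: group the mapped values by key,
    # then apply the reduce function to each group.
    groups = defaultdict(list)
    for item in inputs:
        for key, value in map_fn(item):
            groups[key].append(value)
    return [out for key, values in sorted(groups.items())
            for out in reduce_fn(key, values)]

print(run_mapreduce(["to be or not to be"], map_fn, reduce_fn))
# [('be', 2), ('not', 1), ('or', 1), ('to', 2)]

The same run_mapreduce driver is reused in the sketches below.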
We then explain in detail how MapReduce works and give an important example: the join of relations. Then we look at systems that support recursion (often called "graph-processing" systems), such as Pregel or its open-source equivalent Giraph. We take transitive closure as an important example of nontrivial recursion and examine ways to avoid redundancy in its parallel computation.
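Returning to the join example, here is a sketch of the standard reduce-side join of R(A,B) with S(B,C), reusing the run_mapreduce driver above; the 'R'/'S' tags are an illustrative convention for marking each tuple's relation of origin, not part of any MapReduce API.

def join_map(tagged_tuple):
    # Key each tuple by its join attribute B, tagged with its relation.
    tag, t = tagged_tuple
    if tag == 'R':
        a, b = t
        yield (b, ('R', a))
    else:                          # tag == 'S'
        b, c = t
        yield (b, ('S', c))

def join_reduce(b, values):
    # Pair every R-tuple with every S-tuple sharing this value of B.
    r_side = [a for tag, a in values if tag == 'R']
    s_side = [c for tag, c in values if tag == 'S']
    for a in r_side:
        for c in s_side:
            yield (a, b, c)

inputs = [('R', (1, 'x')), ('R', (2, 'x')), ('S', ('x', 9)), ('S', ('y', 7))]
print(run_mapreduce(inputs, join_map, join_reduce))
# [(1, 'x', 9), (2, 'x', 9)]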
Next, we consider a particular problem: an optimal algorithm for computing a multiway join in a single round of MapReduce. The critical resource in this and many other MapReduce algorithms is communication, so we view the problem as minimizing communication between the mappers and reducers. We show that the problem can be set up using Lagrangean multipliers and in most cases solved simply and exactly. We look at the exact solutions for two important classes of joins: chain joins, e.g., R(A,B) JOIN S(B,C) JOIN T(C,D), and star joins. The latter are important in analytic queries, where a very large "fact table" is joined with many smaller, but still large, "dimension tables." For star joins, not only is the optimal MapReduce algorithm easy to find, but it is always better than a cascade of two-way joins. Other multiway joins may or may not be better than a cascade of two-way joins.
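As a hedged illustration of how the Lagrangean setup works for the chain join above, write r, s, and t for the numbers of tuples in R, S, and T, and suppose the k reducers are indexed by pairs of bucket numbers, hashing B into b buckets and C into c buckets with bc = k. Each S-tuple then reaches exactly one reducer, each R-tuple must be replicated to all c buckets for C, and each T-tuple to all b buckets for B, so the communication to minimize (beyond the fixed cost s) is rc + tb:

\[
L = rc + tb - \lambda(bc - k), \qquad
\frac{\partial L}{\partial c} = r - \lambda b = 0, \quad
\frac{\partial L}{\partial b} = t - \lambda c = 0
\;\Longrightarrow\; b = \sqrt{rk/t}, \quad c = \sqrt{tk/r},
\]

for a total communication cost of s + rc + tb = s + 2\sqrt{rtk}.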
We apply the multiway-join solution to the problem of computing "skew joins," where one or a few values appear so frequently that the tuples containing these values must be handled specially, or else the ability to compute in parallel is greatly limited. Interestingly, the implementations of skew join in Pig and Hive are not optimal, and we show how to handle these frequent values optimally.
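The following sketch shows the grid idea behind spreading a heavy-hitter join value over many reducers, in the style of the join example above. The HEAVY set and the X-by-Y grid shape are illustrative assumptions (in an optimal solution the grid shape follows the same square-root analysis as in the derivation above), and this is not how Pig or Hive actually implement skew join.

import random

HEAVY = {'x'}    # hypothetical: join values known to be heavy hitters
X, Y = 3, 2      # illustrative grid of X*Y reducers per heavy value

def skew_join_map(tagged_tuple):
    # For a heavy value: each R-tuple goes to one random row of the
    # grid, replicated across all Y columns; each S-tuple goes to one
    # random column, replicated across all X rows. Every (R, S) pair
    # with that value therefore meets at exactly one grid cell.
    # Ordinary values all go to the single cell (b, 0, 0).
    tag, t = tagged_tuple
    if tag == 'R':
        a, b = t
        if b in HEAVY:
            i = random.randrange(X)
            for j in range(Y):
                yield ((b, i, j), ('R', a))
        else:
            yield ((b, 0, 0), ('R', a))
    else:                                    # tag == 'S'
        b, c = t
        if b in HEAVY:
            j = random.randrange(Y)
            for i in range(X):
                yield ((b, i, j), ('S', c))
        else:
            yield ((b, 0, 0), ('S', c))

def skew_join_reduce(key, values):
    b = key[0]                               # recover the join value
    yield from join_reduce(b, values)        # reuse the reducer above

Because each (R, S) pair sharing a heavy value meets at exactly one cell, the join is still computed exactly once, while the work for that value is spread over X*Y reducers instead of one.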
Then, we develop a theory of MapReduce algorithms that enables us to quantify the tradeoff between computation and communication. We begin with a real example of what went wrong in what appeared to be a simple application of MapReduce, involving a search for drug interactions. We show how to redesign a bad algorithm for this problem, which is, in the abstract, the "all-pairs" problem of comparing every two items in a large set. We offer a family of algorithms, each of which is optimal for some tradeoff factor between computation and communication. Using the theory of "mapping schemas," which capture the way MapReduce algorithms are more specialized than general parallel algorithms, we are able to prove a lower bound on the "replication rate" (communication per input) as a function of "reducer size" (the largest number of inputs one reducer is allowed to receive) that exactly matches the algorithm for the all-pairs problem.
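Here is a sketch of the grouping idea behind the all-pairs algorithm family, in the same style as the earlier examples. The number of groups G is the tradeoff knob (more groups means smaller reducers but more replication); this particular way of handling within-group pairs, via the diagonal reducers (i, i), is one simple variant, not the only one.

G = 4    # number of groups; illustrative tradeoff knob

def group(x):
    return hash(x) % G

def all_pairs_map(x):
    # Send each input to the reducer for every pair of groups that
    # involves its own group, plus reducer (i, i) for within-group pairs.
    i = group(x)
    for j in range(G):
        if j != i:
            yield ((min(i, j), max(i, j)), x)
    yield ((i, i), x)

def all_pairs_reduce(key, values):
    # Compare cross-group pairs at reducer (i, j) with i != j, and
    # within-group pairs only at (i, i), avoiding duplicate comparisons.
    i, j = key
    for a in range(len(values)):
        for b in range(a + 1, len(values)):
            x, y = values[a], values[b]
            if i == j or group(x) != group(y):
                yield (x, y)    # stand-in for the real comparison

With n inputs, each input is replicated to G reducers while each off-diagonal reducer receives about 2n/G inputs, so shrinking the reducer size forces the replication rate up, exactly the tradeoff the lower bound quantifies.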
Finally, we look at similar tradeoffs for some other problems. We are able to provide a family of algorithms and a matching lower bound for (a) the "HD1" problem of finding the pairs of input bit strings that differ in exactly one bit, and (b) matrix multiplication. In the latter case, we are able to show that the two-round MapReduce algorithm for matrix multiplication is superior to any one-round algorithm.
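As a final sketch, here is the two-round matrix-multiplication algorithm P = MN in the same style, reusing the run_mapreduce driver above: round one joins M and N on the shared index j and multiplies; round two sums the products for each output cell (i, k). The sparse-triple representation and tags are illustrative choices.

def mm_round1_map(entry):
    name, (row, col, v) = entry
    if name == 'M':
        yield (col, ('M', row, v))   # join key is M's column index j
    else:                            # name == 'N'
        yield (row, ('N', col, v))   # join key is N's row index j

def mm_round1_reduce(j, values):
    # Multiply every M-entry in column j by every N-entry in row j.
    ms = [(i, v) for tag, i, v in values if tag == 'M']
    ns = [(k, w) for tag, k, w in values if tag == 'N']
    for i, v in ms:
        for k, w in ns:
            yield ((i, k), v * w)

def mm_round2_map(pair):
    yield pair                       # identity: already keyed by (i, k)

def mm_round2_reduce(cell, products):
    yield (cell, sum(products))

M = [('M', (0, 0, 1)), ('M', (0, 1, 2))]   # the 1x2 matrix [[1, 2]]
N = [('N', (0, 0, 3)), ('N', (1, 0, 4))]   # the 2x1 matrix [[3], [4]]
round1 = run_mapreduce(M + N, mm_round1_map, mm_round1_reduce)
print(run_mapreduce(round1, mm_round2_map, mm_round2_reduce))
# [((0, 0), 11)]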
Book to download: Much of this material is found in Chapter 2 of a free online book:
- J. Leskovec, A. Rajaraman, and J. D. Ullman, "Mining of Massive Datasets," Cambridge Univ. Press, 2012.
You can find the problem set here. Please send your solutions (or questions, should you have any) to Bartosz Klin by Sunday, February 2.