Multi-join Query Evaluation on Big Data

Lecturer: Dan Suciu (University of Washington, USA).

About the lecturer | Course Summary | Slides: 1 2 3 4 | Exercises: 1 2 | Assignment

About the lecturer: Dan Suciu is a Professor in Computer Science at the University of Washington. He received his Ph.D. from the University of Pennsylvania in 1995, was a principal member of the technical staff at AT&T Labs and joined the University of Washington in 2000. Suciu is conducting research in data management, with an emphasis on topics related to Big Data and data sharing, such as probabilistic data, data pricing, parallel data processing, data security. He is a co-author of two books Data on the Web: from Relations to Semistructured Data and XML, 1999, and Probabilistic Databases, 2011. He is a Fellow of the ACM, holds twelve US patents, received the best paper award in SIGMOD 2000 and ICDT 2013, the ACM PODS Alberto Mendelzon Test of Time Award in 2010 and in 2012, the 10 Year Most Influential Paper Award in ICDE 2013, the VLDB Ten Year Best Paper Award in 2014, and is a recipient of the NSF Career Award and of an Alfred P. Sloan Fellowship. Suciu serves on the VLDB Board of Trustees, and is an associate editor for the VLDB Journal, ACM TWEB, and Information Systems and is a past associate editor for ACM TODS and ACM TOIS. Suciu's PhD students Gerome Miklau and Christopher Re received the ACM SIGMOD Best Dissertation Award in 2006 and 2010 respectively, and Nilesh Dalvi was a runner up in 2008.

Course summary: Query evaluation and optimization have been studied since the late 70's and are implemented today in all relational database systems. The traditional assumption is that a query is evaluated by computing one operator at a time: first construct a query plan, then evaluate the operators one at a time. But new theoretical results coupled with novel requirements in big data analytics have lead to an entirely new approach to query evaluation, which computes the entire query at once. This course discusses the latest algorithms for multi-join query evaluation, both on a single server and on massively parallel systems.

We start with a basic review of the traditional approach to query evaluation and optimization. Query optimizer are cost-based, using database statistics, while theoretical studies has been structured-based, using the query's hypergraph.

Next, we turn to pure theory, and prove a new, yet quite simple and elementary inequality by Friedgut, which we use to derive the AGM bound (named after Atserias, Grohe, and Marx) on the size of the output of a query. Then, we discuss a novel worst-case optimal query evaluation algorithm, whose runtime is guaranteed to be no larger than the AGM bound. Unlike traditional approaches to query processing, this algorithm does not use database statistics, and does not use a query plan: this makes it more practical for big data analytics (where statistics are unavailable) and provably more efficient by a polynomial factor in the size of the database than any query plan. All techniques in this part of the course will refer to a "fractional edge cover" of the query's hypergraph.

We continue by turning to parallel query evaluation. Large companies today perform big data analytics over thousands of servers, and here the dominating cost is the amount of communication and the number of communication rounds. We briefly review the traditional hash-partitioned parallel join, still used by all modern big data systems to date. Next, we discuss a novel algorithm that computes any query in a single communication round, assuming that the data is not skewed. We will discuss how to minimize the communication cost of this algorithm, then prove it is optimal. Our entire analysis for the parallel framework relies on a "fractional edge packing" of the query's hypergraph.

Finally, we discuss two open ended topics in parallel query evaluation. The first is skew. We show how skew affects the communication cost of a parallel query evaluation algorithm. We then present a simple algorithm to compute the join of two relations which is provably optimal, even in the presence of skew, and discuss how this algorithm could be extended to arbitrary queries. We will also prove a lower bound on the communication cost of any one-round algorithm that takes into account the skew in the data.

The second topic refers to multiple rounds of communication. We discuss two extensions of the single round algorithm to multiple rounds, and prove a lower bound on the amount of communcation that takes into account the number of communication rounds.

Bibliography. The course will cover material described in the following papers:

Assignment: You can find the problem set here. Please send the solutions to the address provided on the problem sheet by Thursday, April 30.