Hashing-based data structures

Gregory Kucherov CNRS & Gustave Eiffel University

Course Summary | About the lecturer | Location and schedule |

Registration form:

Register here

Course summary:

Probabilistic data structures provide very compact representation of data supporting fast query processing with controllable error guarantees. As such, they are widely used in practice in numerous real-life applications. In this course, we will focus on a family of probabilistic data structures based on random hash functions.

We will start with a quick reminder of textbook definitions of hash tables — basic application of hash functions. In particular, we will introduce the idea of random hash functions through universal hashing and then apply it to a classical construction of perfect hash tables. We will then proceed to cuckoo hashing — an elegant and practical variant of perfect hashing — and consider a mathematical model of random graphs explaining the behavior of cuckoo hash tables.

We then turn to another type of data structures, sometimes called “sketches”, based on random hash functions. We start with the most basic such data structure — Bloom filter — widely used in practice. We then study two distinct generalizations of Bloom filters. The first one, called Count-min sketch, is designed to track multiplicities of keys in a multi-set, possibly provided from a stream. The second generalization, called Invertible Bloom filter, can be seen as a peculiar hash table capable of storing a dynamic set of keys of unlimited size and still having some very useful applications. Both these generalization admit a non-trivial analysis relying on preperties of random graphs that we outline in our course.

Along this journey, we will pay attention to potential applications of these data structures that we illustrate with examples from bioinformatics. If time permits, we will look into other hashing-based data structures successfully applied nowadays in data-intensive applications.

About the lecturer:

Gregory Kucherov is a CNRS research director in Gaspard Monge Lab for Computer Science (LIGM) in Gustave Eiffel University (until 2019, Paris-Est University) at Marne-la-Vallée, France. His research interests include algorithm design and engineering, algorithmic complexity, word combinatorics, data structures, sequence and graph algorithms with applications to bioinformatics and other big data applications. Gregory Kucherov (co)-authored about 50 journal publications and more than 50 conference papers. He regularly serves on program committees of international conferences, has been invited to present his work to various international events, has served as editor of LNCS volumes and journal special issues. He supervised several software projects in bioinformatics, including ProPhyle, mreps, Yass, or Norine. He has been a (co-)advisor of nine PhD students.

Materials: Slides and exam assignement

Location and schedule:
Wednesday, May 22 in 5820
16:15 - 17:45Lecture
Thursday, May 23 in 5820
14:15 - 15:45 Lecture
16:15 - 17:45 Tutorial
Friday, May 24 in 5820
14:15 - 15:45 Lecture
16:15 - 17:45 Tutorial