String Indexing in the Word RAM model

Lecturer: Paweł Gawrychowski (Max-Planck-Institut für Informatik, Germany).


About the lecturer: Paweł Gawrychowski received his PhD from the University of Wrocław in 2011 and then started working at the Institute of Computer Science there. Since November 2011 he has been a post-doc at the Max-Planck-Institut für Informatik in Saarbrücken. He is interested in designing efficient algorithms for processing strings, especially in the setting where the input is given in compressed form. He won the ESA'11 Best Student Paper Award and, in 2013, the Witold Lipski Prize for Young Researchers in Computer Science.

Course summary: We consider a few basic questions concerning processing strings. The strings could be either electronically stored text data or non-text data, say a DNA sequence. Especially in the latter case, the size of the input can be extremely large, hence achieving linear (or close to linear) complexity is important. In such a setting it makes sense to work in the Word RAM model, where space is measured in words, each containing a logarithmic number of bits that can be individually accessed. Operating in such a model allows a really fine-grained analysis of the complexity. We will mostly be interested in indexing a given string, i.e., in building a (small) structure which later allows us to (quickly) detect occurrences of a given pattern. The techniques and tools used to design such structures are often more general and might be interesting in their own right.

We start with the very basic question of string indexing, where, given a string, we want to preprocess it so that later we can retrieve all (or any) occurrences of a pattern inside the preprocessed string. This can be solved efficiently using suffix arrays together with so-called LCP queries. For the latter, one needs an efficient Range Minimum Query structure, which might be of independent interest. An alternative solution is to use suffix trees, and, perhaps surprisingly, it sometimes pays off to combine the two structures.
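To make the basic idea concrete, here is a small sketch (not the linear-time construction covered in the lectures) of searching with a suffix array: the suffixes are sorted, so all occurrences of a pattern form a contiguous range found by two binary searches.

```python
def suffix_array(s):
    # Naive O(n^2 log n) construction -- fine for a sketch; the
    # lectures cover far more efficient constructions.
    return sorted(range(len(s)), key=lambda i: s[i:])

def occurrences(s, sa, p):
    """All starting positions of pattern p in s, found with two
    binary searches over the suffix array sa."""
    n, m = len(s), len(p)
    # Leftmost suffix whose length-m prefix is >= p.
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] < p:
            lo = mid + 1
        else:
            hi = mid
    left = lo
    # Leftmost suffix whose length-m prefix is > p.
    lo, hi = left, n
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] <= p:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[left:lo])
```

For example, `occurrences("banana", suffix_array("banana"), "ana")` yields `[1, 3]`. Each comparison takes O(|p|) time, so a query costs O(|p| log |s|); LCP information is what brings this down further.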

When the alphabet is small, or the string has low entropy, even a linear-size indexing structure might be too large. In such settings we often measure the space in bits instead of words, and want the size of the structure to be proportional to the information-theoretic bound. For instance, given a binary text of length n, we would like to use just O(n) additional bits. A better solution for indexing is then a compact or a compressed suffix array. Implementing them requires a rank/select structure and wavelet trees, which again might be of independent interest. Nevertheless, for some very repetitive sequences even a linear number of additional bits might be too much, because instead of an explicit description of the string we are given its compressed representation. In other words, we might want the space of our structure to depend not on the length of the string, but on the size of its (say, Lempel–Ziv) compressed representation. This turns out to be possible by reducing the problem to geometric range searching.
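As a rough illustration of the rank/select primitive (not the succinct o(n)-extra-bits implementation with two block levels and word tricks discussed in such courses), one level of precomputed prefix counts over a bit string already conveys the idea:

```python
class RankSelect:
    """Sketch of rank/select over a bit string.  rank1(i) counts
    the 1s in bits[0:i]; select1(k) locates the k-th 1."""

    def __init__(self, bits, block=64):
        self.bits = bits
        self.block = block
        # prefix[j] = number of 1s strictly before block j
        self.prefix = [0]
        for i in range(0, len(bits), block):
            self.prefix.append(self.prefix[-1] + bits[i:i + block].count('1'))

    def rank1(self, i):
        """Number of 1s in bits[0:i]: a stored block count plus a
        scan of the partial block (constant time for fixed block)."""
        b = i // self.block
        return self.prefix[b] + self.bits[b * self.block:i].count('1')

    def select1(self, k):
        """Position of the k-th 1 (1-indexed), here reduced to a
        binary search over rank1."""
        lo, hi = 0, len(self.bits)
        while lo < hi:
            mid = (lo + hi) // 2
            if self.rank1(mid + 1) < k:
                lo = mid + 1
            else:
                hi = mid
        return lo
```

For instance, with `rs = RankSelect("1010011")`, `rs.rank1(4)` is 2 and `rs.select1(3)` is 5. A wavelet tree then answers rank/select over a general alphabet by stacking log σ such bitvectors.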

Finally, we consider a generalisation of string indexing where we are interested in approximate occurrences of a pattern. Approximate might mean that the pattern contains some don't-care characters, or that we are looking for a fragment of the string within a bounded edit distance of the pattern. An important tool here is the centroid decomposition of a tree.
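The centroid decomposition itself can be sketched independently of its indexing applications: repeatedly remove a centroid, i.e., a vertex whose removal leaves components of size at most n/2, and recurse on the pieces, giving a decomposition tree of logarithmic depth. A minimal version (assuming the tree is given as an adjacency list over vertices 0..n-1):

```python
def centroid_decomposition(adj):
    """Return (root, parent) of the centroid tree of the tree
    given by adjacency list adj.  parent[c] is the centroid at
    whose removal c became a component's centroid."""
    n = len(adj)
    removed = [False] * n
    size = [0] * n
    parent = [-1] * n

    def compute_sizes(v, p):
        size[v] = 1
        for u in adj[v]:
            if u != p and not removed[u]:
                compute_sizes(u, v)
                size[v] += size[u]

    def find_centroid(v, p, total):
        # Descend into a too-heavy child until none exists.
        for u in adj[v]:
            if u != p and not removed[u] and size[u] > total // 2:
                return find_centroid(u, v, total)
        return v

    def decompose(v, cpar):
        compute_sizes(v, -1)
        c = find_centroid(v, -1, size[v])
        removed[c] = True
        parent[c] = cpar
        for u in adj[c]:
            if not removed[u]:
                decompose(u, c)
        return c

    return decompose(0, -1), parent
```

Each vertex appears in O(log n) components, which is exactly what makes the decomposition useful for the approximate-indexing structures above. (The recursive size computation would need an iterative rewrite for very deep trees.)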

Assignment: You can find the problem set here. Please send the solutions to the address provided on the problem sheet by Wednesday, May 7.