String algorithms in the Word RAM model

Pawel Gawrychowski University of Wrocław

Course Summary | About the lecturer | Location and schedule |

Registration form:

Register here

Course summary:

We consider algorithms and data structures operating on strings, that is, finite sequences of characters. A prime example (and application) is biological data, such as DNA sequences. The sequences might be very long, and thus it is desirable to achieve complexity that is linear (or close to linear) in their length. In the most classical setting, it was assumed that each character is an atomic object, and the only allowed operation is comparing two such objects. However, a more realistic scenario is that each character is a number, and due to the biological applications it is natural to assume that it consists of only a few bits. This naturally leads to the Word RAM model, where each word consists of a logarithmic (in the length of the input) number of bits, and operating on such words takes constant time. In such a setting, we might even assume that the input string is given in a word-packed representation, that is, with multiple characters stored in a single word. This leads to the challenge of designing algorithms and data structures that achieve complexity depending on the size of the input measured in the number of words.

Two basic questions considered in this area is pattern matching (find one string in another string) and string indexing (build a structure for a given string that later allows efficiently locating occurrences of another given string). We will first recap the classical solutions known for those problems, and then move to the Word RAM model. In this model, we will consider the so-called packed string matching, and compact/compressed suffix arrays. To design the latter, we will introduce tools such as rank/select structures and wavelet trees, which might be of independent interest. We will also briefly discuss other problems concerning the so-called succinct data structures, where the goal is to match the information-theoretical bound on the required number of bits (for a given type of queries), where the canonical example are Range Minimum Queries.

We will also discuss how to make an indexing structure dynamic, that is, maintain a string under appending characters while providing efficient pattern matching queries. This will, again, extensively use the Word RAM model.

About the lecturer:

Paweł Gawrychowski received his PhD from the University of Wrocław in 2011, and now works as an associate professor there. Before moving back to Wrocław, he was a post-doc at the Max-Planck-Institut für Informatik in Saarbrücken, University of Warsaw, and University of Haifa. He is interested in designing efficient algorithms for processing strings, especially in the setting where the input is given in a compressed form, and graphs, especially planar graphs. He won the Witold Lipski prize for Young Researchers in Computer Science in 2013, received the Best Student Paper Award at MFCS 2009, ESA 2011, and CPM 2012, and Best Paper Award at CIAA 2011, MFCS 2015, ICALP 2020, ISAAC 2020, WADS 2021, CPM 2022.

Materials: Slides day 1     Problem set day 1     Slides day 2     Problem set day 2     Slides day 3     Homework

Location and schedule:
Wednesday, December 13 in 3180
14:15 - 17:45 lecture + tutorial
Thursday, December 14 in 4420
14:15 - 17:45 lecture + tutorial
Friday, December 15 in 3180
14:15 - 17:45 lecture + tutorial