Longest Repeated Consecutive Substring in Lisp

I wanted to document my journey figuring out how I solved a unique problem and how doing so in Lisp helped me find a concise and correct solution.

Twitch is a live streaming platform for gaming and Twitch chat, a group chat accompanying the live-streaming video, can become incomprehensible and unwieldy with large amounts of messages submitted rapidly. A streamer or a viewer paying attention to Twitch chat may miss messages as they are buried by the ocean of new incoming messages.

I noticed that the screen real estate is occupied by long messages and often repeated messages. So I tried to tackle that problem with compakt, a Chrome Extension for Twitch chat.

To compress long messages, I used suffix arrays and an adaptation to the classic longest repeated substring. Prototyping all this in Lisp really help me decide on a concise and accurate solution.

Suppose we had the following string: abracadabra. Suffixes are substrings that contain all characters from a position in the string to the end of the string. For the aforementioned string, all possible suffixes are:

abracadabra$
bracadabra$
racadabra$
acadabra$
cadabra$
adabra$
dabra$
abra$
bra$
ra$
a$

’$’ denotes the end of the string. Typically a character that isn’t present originally in the string is used for this. A suffix array is an array that contains all the suffixes for a string in sorted order. Again for the aforementioned string, the suffix array is:

a$
abra$
abracadabra$
acadabra$
adabra$
bra$
bracadabra$
cadabra$
dabra$
ra$
racadabra$

We take advantage of this sorted property to find the longest repeated substring!

In the example above, we explicitly list out every suffix for demonstration-sake. But we can use a more compact representation of the suffix array. Instead of the suffixes themselves, we use the position of the start of the suffix (0-based index). Because suffixes are substrings from a position to the end of the string, the position can be used to uniquely identify a suffix. The suffix array from above then becomes:

Suffix array construction takes $O(n^{2} \log (n))$ time: sorting the suffixes is $O(n \log (n))$ and each suffix comparison is $O(n)$. A suffix array takes $O(n)$ space.

Note

There are algorithms that can construct the suffix array in $O(n)$ time though the implementation I present here will not be one of those.

Sources:

1 2 3 4 5 6

Note

You may have heard of suffix trees. Often, these are presented before suffix arrays in literature. I went with a suffix array because it was easier so me to understand.

The longest repeated substring is a well-document problem. What follows is a description of the problem and how it’s solved with suffix arrays. Then I’ll describe the steps I take to adapt it to my problem: the longest repeated consecutive substring.

Longer repeated substrings are more likely to occur between lexicographically consecutive suffixes.

What I want to do first is establish an intuition for determining the longest repeated substring from a suffix array. Suppose we simply wanted to find an occurrence of a pattern in the string abracadabra. Let this pattern be bra. What we would do is a binary search on the suffix array. If the pattern exists in the string it is equal to a suffix or a substring of a suffix, also known as a prefix. Knowing this, how do we find an arbitrary substring that repeats in the string? We compare suffixes to each other! A repeated substring would appear in at least 2 different suffixes. In other words, the repeated substring should appear in at least 2 different places of the string. And knowing that the suffix array is sorted, we don't need a $O(n^2)$ comparison i.e. comparing one suffix to rest to find the repeated substring. Longer repeated substrings are more likely to occur between lexicographically consecutive suffixes. So we look for the longest common prefix between all pairs of consecutive suffixes.

Finding the longest repeated substring is a simple matter of retaining a best-so-far match and comparing the length of new potential matches.

(defun lcp (a b)
  (let ((i (string< a b)))
    (subseq a 0 i))
)

(defun returnLonger (a b)
  (if (> (length a) (length b))
    a
    b
  )
)

(defun findLongestCommonSubstring (suffixarr str best)
  (let ((best (returnLonger
                (lcp (subseq str (first suffixarr)) (subseq str (second suffixarr)))
                best)))
    (if (<= (list-length suffixarr) 2)
      best
      (findLongestCommonSubstring (rest suffixarr) str best)))
)

(defun _range (count i l)
    (if (<= count 0)
        l
        (_range (1- count) (1+ i) (append l (list i))))
)

(defun range (count)
    (_range count 0 ())
)

(let ((s (read-line)))
  (let ((suffixarray (sort (range (length s)) #'(lambda (x y) (string< x y)) :key #'(lambda (i) (subseq s i)))))
    (let ((lrcs (findLongestCommonSubstring suffixarray s "")))
      (print lrcs))))

We benefit from iterating through the suffix array in $O(n)$. But since we compare each character of the string, the algorithm is $O(n^{2})$.

We have a strong foundation for what we want to do. However, it’s not perfect. Take this for example:

ATCGATCGA$

The longest repeated substring is ATCGA. This repeated substring overlaps with itself. We would like to find a repeated substring that doesn’t overlap and is consecutive.

Non-overlapping but also not consecutive:

ABBAissocoolIloveyouABBA\$

Repeated string ABBA occurs but not sequentially.

Non-overlapping and consecutive:

yoloyoloyolobaggins\$

Repeated string yolo is repeated multiple times and is non-overlapping and consecutive.

So how do we do this?

Finding the repeated consecutive substrings is simple: the repeated substring must be the same distance as the distance between 2 suffixes. So for every longest common prefix that we find, check if its length is equal to the distance between the 2 suffixes. If so, it is repeated, non-overlapping and consecutive.

(defun lcp (a b)
  (let ((i (string< a b)))
    (subseq a 0 i))
)

(defun updatelcp (oldlcp checklcp suffix1 suffix2)
  (if (and (> (length checklcp) (length oldlcp))
           (checkConsecutive checklcp suffix1 suffix2))
    checklcp
    oldlcp
  )
)

(defun checkConsecutive (lcp suffix1 suffix2)
  (= (length lcp) (abs (- suffix1 suffix2)))
)

(defun findLongestCommonSubstring (suffixarr str best)
  (let ((best (updatelcp
                best
                (lcp (subseq str (first suffixarr)) (subseq str (second suffixarr)))
                (first suffixarr)
                (second suffixarr))))
    (if (<= (list-length suffixarr) 2)
      best
      (findLongestCommonSubstring (rest suffixarr) str best)))
)

(defun _range (count i l)
    (if (<= count 0)
        l
        (_range (1- count) (1+ i) (append l (list i))))
)

(defun range (count)
    (_range count 0 ())
)

(let ((s (read-line)))
  (let ((suffixarray (sort (range (length s)) #'(lambda (x y) (string< x y)) :key #'(lambda (i) (subseq s i)))))
    (let ((lrcs (findLongestCommonSubstring suffixarray s "")))
      (print lrcs))))

Before I explain how Lisp helped me understand this logic, I want to quickly describe one other extension I made for my application.

I only wanted to find repeated substrings that consisted of whole words from the original message. So I tokenized the original string and treated each token as a character. Then I performed the longest repeated consecutive substring on this new structure. You can give the code here on my GitHub.

I love Lisp

In my head, I imagined this section as a list of features that I would ramble on about how they helped me implementation this algorithm. But I think I only need to say one thing: functional programming.

I was forced to take advantage of the existing structures and their properties instead of iterators and counts making for an efficient use of my allocated memory.
I was able to see the meat and bones of the algorithm: a lcp search between each pair of lexicographically consecutive suffixes.

I’m still very new to Lisp but I know for sure I want to dive deeper into functional programming. I’m happy to hear any feedback on this post or the code in this post. Check out the Lisp code for my application here. And please check out my extension! It’s open-sourced!