Skip to main content

String Hashing

A polynomial rolling hash maps a string ss of length nn to an integer modulo a large prime MM by evaluating it as a polynomial in a fixed base bb. If the characters are ss at position 00, ss at position 11, and so on, then the hash is s0bn1+s1bn2++sn1s_0 \cdot b^{n-1} + s_1 \cdot b^{n-2} + \dots + s_{n-1}, all taken modulo MM. With a precomputed prefix-hash array and the powers of bb, the hash of any substring can be extracted in constant time by subtracting two prefix hashes, exactly the way a prefix-sum supports range sums.

In practice this gives a fast probabilistic test for substring equality: two substrings whose hashes agree under well-chosen bb and MM are equal with overwhelming probability. To drive the collision probability to negligible levels one typically uses a double hash with two independent primes, but a single large prime is enough for most casual uses.

Codes

def build_hash(s, base=131, mod=(1 << 61) - 1):
n = len(s)
prefix = [0] * (n + 1)
power = [1] * (n + 1)
for i in range(n):
prefix[i + 1] = (prefix[i] * base + ord(s[i])) % mod
power[i + 1] = (power[i] * base) % mod
return prefix, power, mod

def substring_hash(prefix, power, mod, l, r):
"""Hash of s at indices l through r inclusive."""
return (prefix[r + 1] - prefix[l] * power[r - l + 1]) % mod

def solve(xs):
s, queries = xs
prefix, power, mod = build_hash(s)
return [substring_hash(prefix, power, mod, l, r) for l, r in queries]

Example

Loading Python runner...

Description

Run time analysis

Building the prefix-hash array is O(n)O(n) because each position does a constant amount of multiplication and addition. Each substring-hash query is then O(1)O(1), so answering qq queries costs O(n+q)O(n + q) in total.

Space analysis

O(n)O(n) for the prefix-hash array and the power array. No extra memory is needed per query.

Proof of correctness

Let H(i)H(i) denote prefix at index ii, so H(i)H(i) equals the polynomial s0bi1+s1bi2++si1s_0 \cdot b^{i-1} + s_1 \cdot b^{i-2} + \dots + s_{i-1}, mod MM. By induction on ii the recurrence H(i+1)=H(i)b+siH(i+1) = H(i) \cdot b + s_i holds. For a substring of length L=rl+1L = r - l + 1, the desired polynomial is slbL1++srs_l \cdot b^{L-1} + \dots + s_r, which by expanding H(r+1)H(r+1) and H(l)bLH(l) \cdot b^{L} equals H(r+1)H(l)bLH(r+1) - H(l) \cdot b^{L} modulo MM. Hence the query formula is exact. Collision probability between two unequal strings of length at most nn is bounded above by roughly n/Mn / M by the Schwartz-Zippel lemma over the prime field, which is negligible for MM near 2612^{61}.

Extensions

Applications

References

  • cp-algorithms — String Hashing
  • Karp, R. M., and Rabin, M. O. "Efficient randomized pattern-matching algorithms". IBM Journal of Research and Development 31(2), 1987.
  • Cormen, Leiserson, Rivest, Stein. Introduction to Algorithms, 3rd ed., Section 32.2 (The Rabin-Karp algorithm).