String Hashing
A polynomial rolling hash maps a string of length to an integer modulo a large prime by evaluating it as a polynomial in a fixed base . If the characters are at position , at position , and so on, then the hash is , all taken modulo . With a precomputed prefix-hash array and the powers of , the hash of any substring can be extracted in constant time by subtracting two prefix hashes, exactly the way a prefix-sum supports range sums.
In practice this gives a fast probabilistic test for substring equality: two substrings whose hashes agree under well-chosen and are equal with overwhelming probability. To drive the collision probability to negligible levels one typically uses a double hash with two independent primes, but a single large prime is enough for most casual uses.
Codes
- Python
- C++
def build_hash(s, base=131, mod=(1 << 61) - 1):
n = len(s)
prefix = [0] * (n + 1)
power = [1] * (n + 1)
for i in range(n):
prefix[i + 1] = (prefix[i] * base + ord(s[i])) % mod
power[i + 1] = (power[i] * base) % mod
return prefix, power, mod
def substring_hash(prefix, power, mod, l, r):
"""Hash of s at indices l through r inclusive."""
return (prefix[r + 1] - prefix[l] * power[r - l + 1]) % mod
def solve(xs):
s, queries = xs
prefix, power, mod = build_hash(s)
return [substring_hash(prefix, power, mod, l, r) for l, r in queries]
#include <string>
#include <vector>
#include <cstdint>
struct StringHash {
static const uint64_t MOD = (1ULL << 61) - 1;
static const uint64_t BASE = 131;
std::vector<uint64_t> prefix, power;
explicit StringHash(const std::string& s) {
int n = (int)s.size();
prefix.assign(n + 1, 0);
power.assign(n + 1, 1);
for (int i = 0; i < n; ++i) {
prefix[i + 1] = (__int128(prefix[i]) * BASE + (uint8_t)s[i]) % MOD;
power[i + 1] = (__int128(power[i]) * BASE) % MOD;
}
}
// Hash of s[l..r] inclusive.
uint64_t get(int l, int r) const {
__int128 h = prefix[r + 1] - __int128(prefix[l]) * power[r - l + 1] % MOD;
h %= (int64_t)MOD;
if (h < 0) h += MOD;
return (uint64_t)h;
}
};
Example
Description
Run time analysis
Building the prefix-hash array is because each position does a constant amount of multiplication and addition. Each substring-hash query is then , so answering queries costs in total.
Space analysis
for the prefix-hash array and the power array. No extra memory is needed per query.
Proof of correctness
Let denote prefix at index , so equals the polynomial
, mod . By
induction on the recurrence holds. For a
substring of length , the desired polynomial is
, which by expanding and
equals modulo . Hence
the query formula is exact. Collision probability between two unequal
strings of length at most is bounded above by roughly by the
Schwartz-Zippel lemma over the prime field, which is negligible for
near .
Extensions
Applications
References
- cp-algorithms — String Hashing
- Karp, R. M., and Rabin, M. O. "Efficient randomized pattern-matching algorithms". IBM Journal of Research and Development 31(2), 1987.
- Cormen, Leiserson, Rivest, Stein. Introduction to Algorithms, 3rd ed., Section 32.2 (The Rabin-Karp algorithm).