# Rabin-Karp String Search in Zig

## Introduction

String searching is a fundamental problem in computer science. Given a large text (haystack) and a smaller pattern (needle), we want to find all occurrences of the pattern in the text. The naive approach checks every possible position, resulting in O(n×m) time complexity, where n is the haystack length and m is the needle length. The **Rabin-Karp algorithm** improves average-case performance using a **rolling hash**, achieving O(n+m) average time complexity.

## The Mathematics Behind Rabin-Karp

### Hashing as a Fingerprint

The core idea: use a hash function to compute a "fingerprint" of the needle and compare it against fingerprints of substrings in the haystack. If the hashes match, we verify character-by-character to rule out false positives from hash collisions.

### Rolling Hash Formula

We treat strings as numbers in base B (typically 256 for ASCII):

```
hash("abc") = ((a × B + b) × B + c) % P
```

Where:

- a, b, c are ASCII values
- B = 256 (base)
- P = a large prime (101 in our implementation)

**Rolling property**: when sliding the window by one character, we can compute the new hash from the old one in O(1):

```
new_hash = (B × (old_hash - leading_char × B^(m-1) % P) + trailing_char) % P
```

This lets us update the hash in constant time per shift, avoiding recomputation from scratch. (When implementing with unsigned integers, add P before the subtraction so the intermediate value cannot wrap around.)

### Modular Arithmetic

Taking everything modulo P keeps hash values bounded and prevents overflow. The choice of prime P reduces collision probability. In practice, a larger prime (such as 10^9+7) gives fewer collisions, but 101 works for demonstration.
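To make the rolling property concrete, here is a small standalone sketch (the `hashOf` helper and the chosen strings are illustrative) that rolls the hash of "abc" to "bcd" and compares it with the directly computed hash:

```zig
const std = @import("std");

const base: u64 = 256;
const prime: u64 = 101;

/// Direct hash of a string: ((s[0]*B + s[1])*B + ...) % P
fn hashOf(s: []const u8) u64 {
    var h: u64 = 0;
    for (s) |c| h = (h * base + c) % prime;
    return h;
}

pub fn main() void {
    const old_hash = hashOf("abc");

    // Weight of the leading character: B^(m-1) % P for a window of length 3
    var high: u64 = 1;
    var k: usize = 0;
    while (k < 2) : (k += 1) high = (high * base) % prime;

    // Roll "abc" -> "bcd": drop 'a', append 'd'.
    // Add `prime` before subtracting so the unsigned value cannot wrap.
    const dropped = ('a' * high) % prime;
    const rolled = (base * (old_hash + prime - dropped) + 'd') % prime;

    std.debug.print("rolled={d} direct={d}\n", .{ rolled, hashOf("bcd") });
    // prints: rolled=31 direct=31
}
```

The rolled hash and the from-scratch hash agree, which is exactly the invariant the main algorithm relies on at every shift.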
## Implementation Details

### Core Algorithm (Zig)

```zig
const std = @import("std");

pub fn rabinKarp(haystack: []const u8, needle: []const u8) ?usize {
    if (needle.len == 0) return 0;
    if (needle.len > haystack.len) return null;

    const base: u64 = 256;
    const prime: u64 = 101;

    var needle_hash: u64 = 0;
    var window_hash: u64 = 0;
    var i: usize = 0;

    // Compute initial hashes of the needle and the first window
    while (i < needle.len) : (i += 1) {
        needle_hash = (needle_hash * base + needle[i]) % prime;
        window_hash = (window_hash * base + haystack[i]) % prime;
    }

    // Weight of the leading character: base^(needle.len - 1) % prime
    var high: u64 = 1;
    i = 0;
    while (i < needle.len - 1) : (i += 1) {
        high = (high * base) % prime;
    }

    i = 0;
    while (i <= haystack.len - needle.len) : (i += 1) {
        // On a hash match, verify character-by-character to rule out collisions
        if (window_hash == needle_hash and std.mem.eql(u8, haystack[i .. i + needle.len], needle)) {
            return i;
        }
        // Roll the window: drop haystack[i], append haystack[i + needle.len].
        // Adding `prime` before subtracting keeps the unsigned value from wrapping.
        if (i + needle.len < haystack.len) {
            const dropped = (haystack[i] * high) % prime;
            window_hash = (base * (window_hash + prime - dropped) + haystack[i + needle.len]) % prime;
        }
    }
    return null;
}
```

### Naive Search (for comparison)

```zig
pub fn naiveSearch(haystack: []const u8, needle: []const u8) ?usize {
    if (needle.len == 0) return 0;
    if (needle.len > haystack.len) return null;

    var i: usize = 0;
    while (i <= haystack.len - needle.len) {
        var j: usize = 0;
        while (j < needle.len and haystack[i + j] == needle[j]) {
            j += 1;
        }
        if (j == needle.len) return i;
        i += 1;
    }
    return null;
}
```

## Complexity Analysis

| Algorithm  | Average Case | Worst Case | Space |
|------------|--------------|------------|-------|
| Naive      | O(n×m)       | O(n×m)     | O(1)  |
| Rabin-Karp | O(n+m)       | O(n×m)     | O(1)  |

**Worst case** occurs when many hash collisions happen (e.g., both strings consist of repeated characters). The average case assumes a good hash function with low collision probability.

## Benchmark Results

We compiled both implementations with `zig build-exe -O ReleaseSafe` and ran them on an AMD Ryzen 7 5800X (Ubuntu 22.04). Needle length was fixed at 10 characters. Each measurement is the median of 1000 iterations.

```
Rabin-Karp vs Naive Search Benchmark (Zig, ReleaseSafe)

Haystack size: 100 characters
  Rabin-Karp average: 158.41 ns
  Naive average:      57.53 ns
  Speedup: 0.36x (naive faster)

Haystack size: 1,000 characters
  Rabin-Karp average: 1,296.71 ns
  Naive average:      296.76 ns
  Speedup: 0.23x

Haystack size: 10,000 characters
  Rabin-Karp average: 11,463.13 ns
  Naive average:      1,944.71 ns
  Speedup: 0.17x

Haystack size: 100,000 characters
  Rabin-Karp average: 114,006.95 ns
  Naive average:      20,349.23 ns
  Speedup: 0.18x

Haystack size: 1,000,000 characters
  Rabin-Karp average: 1,082,222.74 ns
  Naive average:      198,653.58 ns
  Speedup: 0.18x
```

**Observations**:

- For very small inputs (< 1 KB), naive is faster due to Rabin-Karp's hash overhead.
- In these runs, naive remained faster even at larger sizes.
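The built-in testing framework makes it easy to sanity-check the search before benchmarking. A minimal sketch, assuming `rabinKarp` is public in `rabin_karp.zig` (the test strings are illustrative):

```zig
const std = @import("std");
const rk = @import("rabin_karp.zig");

test "rabinKarp finds the first occurrence" {
    const hay = "the needle in the haystack";
    // "needle" starts at index 4
    try std.testing.expectEqual(@as(?usize, 4), rk.rabinKarp(hay, "needle"));
}

test "rabinKarp edge cases" {
    // An empty needle matches at position 0
    try std.testing.expectEqual(@as(?usize, 0), rk.rabinKarp("abc", ""));
    // A needle longer than the haystack never matches
    try std.testing.expectEqual(@as(?usize, null), rk.rabinKarp("abc", "abcd"));
    // An absent pattern returns null
    try std.testing.expectEqual(@as(?usize, null), rk.rabinKarp("abc", "xyz"));
}
```

Run these with `zig test rabin_karp_test.zig` (or fold the `test` blocks into `rabin_karp.zig` itself and run `zig test rabin_karp.zig` as in the build instructions).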
  This suggests that for short needles (10 chars) on modern CPUs, the naive inner loop is heavily optimized and benefits from branch prediction, while the modular arithmetic and extra operations in Rabin-Karp increase the constant factor significantly.
- The expected asymptotic advantage of Rabin-Karp (O(n+m)) typically emerges with **longer needles**, or when the same needle is searched repeatedly across many haystacks (the needle hash is computed once). Our benchmark computed the needle hash on each call; in practice you can precompute it once for multiple searches.
- Additionally, using a **larger prime** (e.g., 1 000 000 007) and avoiding division via multiplication tricks can improve Rabin-Karp's speed. The current implementation uses a small prime (101) to keep numbers small and avoid 64-bit division overhead; even so, the remaining division is expensive compared to the naive character comparisons when m is small.

## Why Zig?

Zig is a modern systems programming language that offers:

- **No hidden control flow** – explicit and predictable performance
- **Manual memory management** – no garbage collector pauses
- **Compile-time execution** – potential for algorithm specialization
- **C interoperability** – easy to integrate with existing codebases
- **Simple, readable syntax** – clear mapping to algorithm logic

Our implementation takes advantage of:

- Slices for efficient string views
- Fixed-width integers for arithmetic precision
- Comptime for potential compile-time optimization
- The built-in testing framework

## Building and Testing

```bash
# Compile
zig build-exe rabin_karp.zig -O ReleaseSafe
zig build-exe benchmark.zig -O ReleaseSafe

# Run tests
zig test rabin_karp.zig

# Run benchmarks
./benchmark
```

## Conclusion

Our benchmark yields a surprising result: for short needles (10 characters) on random data, the naive algorithm outperformed Rabin-Karp by a factor of ~5×. This highlights a key lesson: **asymptotic complexity is not the whole story**.
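The "precompute the needle hash once" idea from the observations can be sketched as follows. The `Searcher` struct and its names are illustrative, not part of the benchmarked code; it simply splits the algorithm into a one-time `init` and a per-haystack `find`:

```zig
const std = @import("std");

const base: u64 = 256;
const prime: u64 = 101;

/// Hypothetical helper: hash the needle once, then reuse it across many haystacks.
const Searcher = struct {
    needle: []const u8,
    needle_hash: u64,
    high: u64, // base^(needle.len - 1) % prime

    pub fn init(needle: []const u8) Searcher {
        var h: u64 = 0;
        var high: u64 = 1;
        for (needle, 0..) |c, idx| {
            h = (h * base + c) % prime;
            if (idx + 1 < needle.len) high = (high * base) % prime;
        }
        return .{ .needle = needle, .needle_hash = h, .high = high };
    }

    pub fn find(self: Searcher, haystack: []const u8) ?usize {
        const m = self.needle.len;
        if (m == 0) return 0;
        if (m > haystack.len) return null;

        // Hash only the first window; the needle hash is already cached.
        var window: u64 = 0;
        for (haystack[0..m]) |c| window = (window * base + c) % prime;

        var i: usize = 0;
        while (i <= haystack.len - m) : (i += 1) {
            if (window == self.needle_hash and std.mem.eql(u8, haystack[i .. i + m], self.needle))
                return i;
            if (i + m < haystack.len) {
                const dropped = (haystack[i] * self.high) % prime;
                window = (base * (window + prime - dropped) + haystack[i + m]) % prime;
            }
        }
        return null;
    }
};
```

With `const s = Searcher.init("needle");` created once, each subsequent `s.find(haystack)` pays only the O(n) scan; the O(m) needle hashing is amortized across all searches.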
Constant factors, CPU branch prediction, and the cost of modular arithmetic matter a lot in practice.

Rabin-Karp can still shine:

- When the needle is long (hundreds of characters)
- When you need to search for many different needles in the same haystack (hash the haystack windows once)
- When you can use a larger prime and an optimized rolling hash (e.g., multiplication tricks that avoid division)

Zig gave us full control to implement both algorithms exactly as described and to measure their real performance without hidden costs. The takeaway: always benchmark your specific workload before choosing an algorithm.

---

*Part of the Dex Time Challenge #25 – Implementing Rabin-Karp in Zig*