Arbitrary-Precision Integer: NTT + Karatsuba Hybrid Multiplication

When implementing arbitrary-precision integers in C++, the key challenge is how fast you can perform multiplication. This post introduces a two-stage hybrid approach combining NTT and Karatsuba.

Internal Representation
Multiplication Strategy — Two-Stage Hybrid
Division — Knuth Algorithm D

Internal Representation

BigInteger stores its absolute value as a little-endian limb array.

\[\text{value} = \sum_{i=0}^{L-1} \text{limbs}[i] \times B^i, \quad B = 10^7\]

Each limb is an int in the range $[0, B)$, with the least significant digits at index 0. The sign is stored as a separate bool flag.

Number: 123456789012345678901234567890
Limb array (little-endian):
  limbs[0] = 1234567   <- least significant 7 digits
  limbs[1] = 8901234
  limbs[2] = 5678901
  limbs[3] = 2345678
  limbs[4] = 1         <- most significant

Why $B = 10^7$?

When multiplying two limbs as long long, there must be no overflow.

\[B^2 = 10^{14} < 2^{63} - 1 \approx 9.2 \times 10^{18}\]

$10^7$ is the largest power of 10 that satisfies this constraint.

Multiplication Strategy — Two-Stage Hybrid

The core of multiplying two numbers is computing the integer convolution of their limb arrays.

\[c[k] = \sum_{i+j=k} a[i] \cdot b[j]\]

Propagating carries through the convolution result yields the final product.

NTT-Based Polynomial Convolution

Number Theoretic Transform (NTT) is FFT implemented over modular arithmetic. It computes integer convolution in $O(n \log n)$ without floating-point error.

The two primes used for the transform are NTT-friendly primes of the form $p = k \cdot 2^m + 1$.

Prime	Value	Form	Max Transform Size
$p_1$	998,244,353	$119 \cdot 2^{23} + 1$	$2^{23}$
$p_2$	985,661,441	$235 \cdot 2^{22} + 1$	$2^{22}$

NTT proceeds as follows.

\[a, b \xrightarrow{\text{NTT mod }p} \hat{a}, \hat{b} \xrightarrow{\text{pointwise}} \hat{c} \xrightarrow{\text{INTT mod }p} c\]

Running this process for each of $p_1$ and $p_2$ gives each result coefficient $c[i]$ expressed under two moduli.

\[c[i] \equiv r_1 \pmod{p_1}, \quad c[i] \equiv r_2 \pmod{p_2}\]

Result Recovery via Garner’s CRT

Recovering the actual value from two residues is the Chinese Remainder Theorem (CRT). We use Garner’s algorithm, which is numerically efficient.

Garner’s algorithm uses a mixed-radix representation.

\[c[i] = c_1 + c_2 \cdot p_1\]

where

\[c_1 = r_1, \quad c_2 = (r_2 - r_1) \cdot p_1^{-1} \pmod{p_2}\]

Since $p_1 \cdot p_2 \approx 9.84 \times 10^{17}$, any coefficient smaller than this product is recovered exactly. The maximum value of each coefficient is $n \cdot (B-1)^2$, so the maximum number of limbs for which NTT can be safely applied is:

\[n \cdot (B-1)^2 < p_1 \cdot p_2 \implies n < \frac{p_1 \cdot p_2}{(10^7)^2} \approx 9860\]

This means NTT can be applied directly to numbers with up to approximately 69,000 decimal digits.

Karatsuba Recursive Split

For larger numbers, Karatsuba’s algorithm splits the problem in half.

Split $a$ and $b$ each into two halves.

\[a = a_{\text{hi}} \cdot x^m + a_{\text{lo}}, \quad b = b_{\text{hi}} \cdot x^m + b_{\text{lo}}\]

The product becomes

\[a \cdot b = z_2 \cdot x^{2m} + z_1 \cdot x^m + z_0\]

Karatsuba’s key insight is computing the middle term $z_1$ with only three multiplications.

\[z_0 = a_{\text{lo}} \cdot b_{\text{lo}}\] \[z_2 = a_{\text{hi}} \cdot b_{\text{hi}}\] \[z_1 = (a_{\text{lo}} + a_{\text{hi}}) \cdot (b_{\text{lo}} + b_{\text{hi}}) - z_0 - z_2\]

Compared to a naive divide-and-conquer with 4 recursive calls, 3 calls reduce the complexity to:

\[T(n) = 3T(n/2) \implies T(n) = O(n^{\log_2 3}) = O(n^{1.585})\]

Two-stage switching logic:

padded_size = minimum power-of-two size needed for the convolution of the two limb arrays

if padded_size <= 2^21:
    compute directly with NTT  <- O(n log n)
else:
    split with Karatsuba -> recurse on each sub-problem  <- O(n^1.585)

This structure allows multiplication of arbitrarily large numbers, even beyond NTT’s transform size limit ($2^{22}$).

Division — Knuth Algorithm D

Division is implemented using Algorithm D from Knuth’s The Art of Computer Programming Vol. 2. The algorithm proceeds in the following steps.

Normalization: Multiply both dividend and divisor by the same integer $d$ so that the most significant limb of the divisor is at least $B/2$.
Quotient estimation: Estimate each digit of the quotient ($\hat{q}$) using the top two limbs.
Correction: Since $\hat{q}$ can exceed the true quotient by at most 2, verify the estimate and reduce it up to twice if needed.
Denormalization: Divide the remainder by $d$ to recover the true remainder.

Sign convention follows the C++ standard (truncate toward zero).

Dividend	Divisor	Quotient	Remainder
17	5	3	2
-17	5	-3	-2
17	-5	-3	2
-17	-5	3	-2

Together, these three components — the $10^7$-based limb representation, the NTT + Garner CRT convolution engine, and Knuth’s Algorithm D — form a self-contained arbitrary-precision integer library that handles numbers of any size with predictable, well-understood complexity at each layer.

Table of Contents