November 2024 > Math, Algorithms

The Karatsuba Algorithm

This is a spin-off page of II. Exact Arithmetic, the second part of the series on triangle mesh boolean operations.

The Karatsuba algorithm (or Toom-Cook variant 2-2, or simply Toom22), is a sub-quadratic algorithm for multiplying multi-precision integers. Multi-precision integers are typically represented as arrays or lists of 32-bit or 64-bit unsigned integers. One can view such a list as a very large collection of big-endian binary digits or — if one prefers the equivalent formulation — a representation in base \(2^{32}\) or \(2^{64}\) with each "digit" (or "word") an entry in the list. Taking the second option in base 64, two numbers \(a=\sum_{j=0}^p a_j 2^{64j}\) and \(b=\sum_{j=0}^q b_j 2^{64j}\) have product \[ ab=\left(\sum_{j=0}^p a_j 2^{64j}\right)\left(\sum_{j=0}^q b_j 2^{64j}\right)=\sum_{j=0}^p \sum_{k=0}^q a_jb_k 2^{64(j+k)}\,. \] Each sub-product \(a_j b_k\) is relatively easy to calculate (on Intel there is a 2-word _mulx_u64 intrinsic for this very purpose), so one way to go about this multiplication is to properly align each product \(a_j b_k\) to the word corresponding to the \(2^{64(j+k)}\) place before adding it to an accumulator. This is the so-called "grade-school" method, and it requires \((p+1)(q+1)\) multiplications and additions.

Of the several options for fast multi-precision multiplication algorithms (see the linked Exact Arithmetic page under "Multiplication" for more of them), there is a range of \(p\) and \(q\) for which the Karatsuba algorithm, aka Toom22, is most efficient. Broadly speaking, it groups the multiplications in the grade-school method in such a way as to trade multiplication for addition. More concretely, the easiest (but perhaps most mysterious) way to approach the Karatsuba algorithm is via the following observation: \[ \begin{align*} (w+2^Nx)(y+2^Nz)&=wy+2^N(xy+wz)+2^{2N}xz \\ &= wy+2^N((x+w)(y+z)-wy-xz)+2^{2N}xz \end{align*} \] We calculate only three products — \(wy\), \(xz\) and \((x+w)(y+z)\) — instead of the expected four pairwise products. This is great news, since addition is far cheaper than multiplication. To transform this observation into a workable algorithm, we apply a recursive divide-and-conquer method in the following way:

Given integers \(a\) and \(b\), if \(a\) and \(b\) are small enough according to a reasonable heuristic, calculate \(ab\) using the grade-school method and return.
Otherwise, split \(a\) and \(b\) into two parts each, writing \(a=w+2^Nx\) and \(b=y+2^Nz\). Here, \(N\) is a multiple of 64 so that this decomposition is literally splitting the word list in two.
Calculate \(wy\), \(xz\) and \((x+w)(y+z)\) recursively. Subtract \(wy\) and \(xz\) from \((x+w)(y+z)\), and combine the results using bit shifting and addition to produce \(ab\).

Polynomial Rings

The Karatsuba algorithm is also named Toom22 because it is a specific example of a Toom-Cook procedure, which more generally splits \(a\) and \(b\) into \(n\) parts. A better way to approach the Karatsuba algorithm that extends to the broader Toom-Cook class is via polynomial rings and the Chinese Remainder Theorem. The "road map," if you will, looks like this: \[ \mathbb{Z}\cong \mathbb{Z}[\beta]/(2^N-\beta)\to\mathbb{Z}[\beta]\to\mathbb{Z}[\beta]/(\beta^2-\beta)\cong(\mathbb{Z}[\beta]/(\beta))\times(\mathbb{Z}[\beta]/(\beta-1))\,. \] We start with the integers \(\mathbb{Z}\), reinterpret them in base \(2^N\) (\(\mathbb{Z}[\beta]/(2^N-\beta)\)), treat \(2^N\) as a formal variable (\(\mathbb{Z}[\beta]\)), consider this polynomial within a suitably large modular ring (\(\mathbb{Z}[\beta]/(\beta^2-\beta)\)), and finally map to an isomorphic ring with the Chinese Remainder Theorem (\((\mathbb{Z}[\beta]/(\beta))\times(\mathbb{Z}[\beta]/(\beta-1))\)).

Before we begin in earnest, let's consider multiplication in a ring as viewed through a homomorphism. It's straightforward to conceptualize the equivalence of multiplication in a ring isomorphic to another, since the isomorphism \(\varphi\) has a two-sided inverse and \[ ab=\varphi^{-1}(\varphi(a))\varphi^{-1}(\varphi(b))=\varphi^{-1}(\varphi(a)\varphi(b))\,. \] We start in some ring \(R\) containing \(a\) and \(b\), map \(a\) and \(b\) to another ring \(S\) via \(\varphi\), perform the multiplication in \(S\) as \(\varphi(a)\varphi(b)\), then map back to \(R\) with \(\varphi^{-1}\).

Consider, however, a non-bijective homomorphism \(\varphi:R\to S\). For \(a,b\in R\), we can say that \(\varphi(ab)=\varphi(a)\varphi(b)\). Insofar as \(\varphi^{-1}\) is a multi-valued function, we have \(ab\in \varphi^{-1}(\varphi(a)\varphi(b))\). The inverse image \(\varphi^{-1}(s)\) is a coset \(r+\text{ker}(\varphi)\subset R\); to identify \(ab\) we need to pick out the correct entry from the coset \(\varphi^{-1}(\varphi(a)\varphi(b))\).

On the other hand, suppose we have a surjective homomorphism \(\varphi:S\to R\). For \(a,b\in R\), we can pick any elements \(s_a\) and \(s_b\) in \(S\) with \(\varphi(s_a)=a\) and \(\varphi(s_b)=b\). Then \(ab=\varphi(s_a)\varphi(s_b)=\varphi(s_as_b)\). This is the easier case. Instead of needed to identify a correct entry from a pre-image, we can pick any entry from a pre-image and correctly generate \(ab\).

With that out of the way, let's return to our road map. \[ \mathbb{Z}\underset{\varphi_1}{\cong} \mathbb{Z}[\beta]/(2^N-\beta)\underset{\varphi_2}{\leftarrow}\mathbb{Z}[\beta]\underset{\varphi_3}{\to}\mathbb{Z}[\beta]/(\beta^2-\beta)\underset{\varphi_4}{\cong}(\mathbb{Z}[\beta]/(\beta))\times(\mathbb{Z}[\beta]/(\beta-1)) \] The first isomorphism, \(\varphi_1\), maps an integer \(a\) to the coset \(a+(2^N-\beta)\mathbb{Z}[\beta]\). It has two-sided inverse \(p(\beta)+(2^N-\beta)\mathbb{Z}[\beta]\mapsto p(2^N)\).

The first homomorphism is backwards: \(\varphi_2:\mathbb{Z}[\beta]\to \mathbb{Z}[\beta]/(2^N-\beta)\). To correctly identify a product \(ab\) in \(\mathbb{Z}[\beta]/(2^N-\beta)\), we need to retrieve any pre-image \(\varphi_2^{-1}(a)\) and \(\varphi_2^{-1}(b)\). However, not all choices will be equally effective for our algorithm. For our purposes, choose the polynomial \(p_a\in \mathbb{Z}[\beta]\) in the coset \(a\) that contains only non-negative integer coefficients less than \(2^N\). This is equivalent to expressing an integer representative \(z\) of \(a\) in base \(2^N\), since \[ \begin{align*} z+(2^N-\beta)\mathbb{Z}[\beta]&=\sum_{i=0}^p z_i2^{iN}+(2^N-\beta)\mathbb{Z}[\beta] \\ &= \sum_{i=0}^p z_i\beta^i+(2^N-\beta)\mathbb{Z}[\beta] \end{align*}\,. \]

Our next homomorphism, \(\varphi_3\), is the canonical one sending \(p(\beta)\mapsto p(\beta)+(\beta^2-\beta)\mathbb{Z}[\beta]\). For polynomials \(p\) and \(q\) in \(\mathbb{Z}[\beta]\), we can only determine \(pq\) if we correctly identify the proper entry in \(\varphi_3^{-1}(\varphi(p)\varphi(q))\). This is where our choice of \(N\) becomes important. Suppose \(a\) and \(b\) are integers. Our map \(\varphi_2^{-1}\circ \varphi_1\) so far consists of writing \(a\) and \(b\) in base \(2^N\), then replacing every instance of \(2^N\) with \(\beta\). By choosing \(N\) large enough so that \(a\) and \(b\) have no \(2^{N+1}\) term or larger, we can ensure that \(p\) and \(q\) have no term larger than \(\beta\). In particular, \(a=w+x2^N\) maps to \(p(\beta)=w+x\beta\), and \(b=y+z2^N\) maps to \(q(\beta)=y+z\beta\). Then \(pq\) is, at most, a quadratic. To completely identify \(pq\) from a coset \(s(\beta)+(\beta^2-\beta)\mathbb{Z}[\beta]\), we first convert \(s(\beta)\) to a lowest-degree representative \(c+d\beta\), and note by degree considerations that \[ p(\beta)q(\beta)=c+d\beta + \gamma(\beta^2 - \beta)=\gamma \beta^2 + (d-\gamma)\beta + c\,. \] The \(\gamma\) term must be \(xz\), since the quadratic term of \(pq\) comes (solely) from \(x\beta\cdot z\beta=xz\beta^2\).

Finally, we arrive at the meat-and-potatoes homomorphism, the isomorphism \(\varphi_4\) provided by the Chinese Remainder Theorem because \(\beta\) and \(\beta-1\) are coprime in \(\mathbb{Z}[\beta]\). For a representative \(c+d\beta\) in \(\mathbb{Z}[\beta]/(\beta^2-\beta)\), \(\varphi_4\) acts via \[ \begin{align*} \varphi_4(c+d\beta+(\beta^2-\beta)\mathbb{Z}[\beta])&=(c+\beta\mathbb{Z}[\beta],c+d+(\beta-1)\mathbb{Z}[\beta]) \\ \varphi_4^{-1}(q(\beta)+\beta\mathbb{Z}[\beta], w(\beta)+(\beta-1)\mathbb{Z}[\beta])&=q(0)+(w(0)-q(1))\beta + (\beta^2-\beta)\mathbb{Z}[\beta] \end{align*} \] Eliding the cluttered coset notation, this is \(\varphi_4(c+d\beta)=(c, c+d)\) and \(\varphi^{-1}(q,w)=q+(w-q)\beta\).

Now let's put it all together. Two integers \(a\) and \(b\) are written in base \(2^N\) notation as \(a=w+x2^N\) and \(b=y+z2^N\). These are converted into polynomials \(p(\beta)=w+x\beta\) and \(q(\beta)=y+z\beta\), reduced modulo \(\beta^2-\beta\) (so, they don't change), then sent to the pairs \((w,w+x)\) and \((y,y+z)\) over \(\varphi_4\). The multiplication happens here, with \[ (w,w+x)\times (y,y+z)=(wy, (w+x)(y+z))\,. \] This is sent back over \(\varphi_4^{-1}\), resulting in the polynomial representative \(wy+((w+x)(y+z)-wy)\beta\). To find the correct representative in \(\varphi_3^{-1}(wy+((w+x)(y+z)-wy)\beta)\), we form \(\gamma \beta^2 + (d-\gamma)\beta + c\) with \(\gamma=xz\), \(c=wy\), and \(d=(w+x)(y+z)-wy\). This results in the polynomial \(xz \beta^2 + ((w+x)(y+z)-wy-xz)\beta + wy\). Mapping through \(\varphi_2^{-1}\) is the evaluation map \(\beta \mapsto 2^N\), resulting in \[ ab=xz 2^{2N} + ((w+x)(y+z)-wy-xz)2^N + wy\,. \]

This seems like a lot of work for such a simple expression! You might change your mind, however, after attempting to rearrange \((a+b2^N+c2^{2N})(d+e2^N+f2^{2N})\) in a way that uses only five \(N\)-bit multiplications. Indeed, larger Toom-\(n\) variants would be extremely difficult to generate by hand.

Recursive Outline

It is not efficient to allocate memory dynamically during the recursive Karatsuba algorithm. Instead, we pass in sufficient extra "scratch" memory to be shared by all recursive calls. This presents an interesting challenge: how much memory will we need? To approach the question, we'll need to understand the particular recursive structure of the algorithm.

Here is the function signature:

template<int n, bool ext>
void exact::mult_karatsuba_internal(
        const exact::integer<n, ext>& a, int32_t a_slice_min, const int32_t a_slice_max_excl, 
        const exact::integer<n, ext>& b, int32_t b_slice_min, const int32_t b_slice_max_excl,
        exact::integer<n, ext>& store, const int32_t store_slice_min, const int32_t store_slice_max_excl,
        exact::integer<n, ext>& aux_store, const int32_t aux_slice_min, const int32_t aux_slice_max_excl
);

Integers a and b have word lists a.word(0), a.word(1), ..., a.word(a.word_count - 1) and b.word(0), b.word(1), ..., b.word(b.word_count - 1) from most significant to least significant. The function definition must compute the product of slices of a and b, namely the non-negative integers with word lists, respectively,

[a.word(a_slice_min), a.word(a_slice_min + 1), ..., a.word(a_slice_max_excl - 1)]
[b.word(b_slice_min), b.word(b_slice_min + 1), ..., b.word(b_slice_max_excl - 1)]

The result must be placed in store, with the least significant result word at the (store_slice_max_excl - 1)th word of store. Finally, aux_store is our designated scratch space, with the words aux_slice_min, aux_slice_min + 1, ..., aux_slice_max_excl - 1 available for use by this particular recursive call. For example, the following code computes the product of the integers a and b:

exact::mult_karatsuba_internal(
    a, 0, a.word_count,
    b, 0, b.word_count,
    store, 0, store.word_count,
    aux_store, 0, aux_store.word_count
);

There are four cases in the body of mult_karatsuba_internal. The first occurs when a_width = a_slice_max_excl - a_slice_min or b_width = b_slice_max_excl - b_slice_min is less than some fixed threshold, that is, the slice of a or b is small enough. This functions as the base case of our algorithm, with the threshold being the maximum size for which the naive, grade-school multiplication algorithm is more efficient. In this case, a separate function exact::mult_school writes directly to store and requires no auxiliary storage.

The remaining cases depend on the value of N, calculated as

const int32_t a_width = a_slice_max_excl - a_slice_min;
const int32_t b_width = b_slice_max_excl - b_slice_min;

const int32_t N = std::max(a_width + 1, b_width + 1) / 2;

This functions as \(N\) from the previous section, though scaled by \(64\). The general process is to split a and b each into two parts \(a=w+x2^{64N}\) and \(b=y+z2^{64N}\). Here \(N\) is supposed to be about half the number of words of \(a\) and \(b\), so that \(w\) and \(x\) are roughly the same number of words, as is \(y\) and \(z\). But because \(a\) and \(b\) might have wildly different orders of magnitude, it may be the case that N is larger than or equal to a_width, corresponding to the case where \(y=0\). There's no point in applying the Karatsuba recursion here, since \(ab=wy+2^{64N}wz\) requires only two N-word multiplications. Instead, we first compute \(wz\) recursively in the higher words of the store slice, transfer some words of the result to aux_store, compute \(wy\) in the lower words of store, then add the cached words in aux_store back at the appropriate place. See the below diagram.