Motivation
This study is about optimization, with a particular focus on matrix multiplication. Matrix multiplication is a topic that has been studied for decades.
This study grew out of reviewing papers on matrix multiplication.
Naive matrix multiplication
Naive matrix multiplication is an operation on matrices, which are rectangular arrays of numbers. The product of two matrices is computed as follows.
$$\begin{pmatrix} a_{1,1} & a_{1,2} & a_{1,3}\\ a_{2,1} & a_{2,2} & a_{2,3} \end{pmatrix} \times \begin{pmatrix} b_{1,1} & b_{1,2}\\ b_{2,1} & b_{2,2}\\ b_{3,1} & b_{3,2}\\ \end{pmatrix} \\= \begin{pmatrix} a_{1,1} \cdot b_{1,1} + a_{1,2} \cdot b_{2,1} + a_{1,3} \cdot b_{3,1} & a_{1,1} \cdot b_{1,2} + a_{1,2} \cdot b_{2,2} + a_{1,3} \cdot b_{3,2}\\ a_{2,1} \cdot b_{1,1} + a_{2,2} \cdot b_{2,1} + a_{2,3} \cdot b_{3,1} & a_{2,1} \cdot b_{1,2} + a_{2,2} \cdot b_{2,2} + a_{2,3} \cdot b_{3,2}\\ \end{pmatrix}$$ In other words, $$c_{i,j} = \sum\limits_{k=1}^{K}a_{i,k}b_{k,j}$$
For $(i,j) \leftarrow (1,1), \cdots, (n,m)$ in parallel
For $k \leftarrow 1, \cdots, K$
$c_{i,j} = c_{i,j} + a_{i,k}b_{k,j}$
Algorithm 1. Naive matrix multiplication
This algorithm takes $O(n^3)$ operations to carry out matrix multiplication. Here are implementations of it.
Dense matrix
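The original code is not reproduced here, so the following is a minimal sketch of a dense matrix type and Algorithm 1 on the CPU, assuming row-major storage in a flat array; the names `DenseMatrix` and `multiplyNaive` are illustrative, not taken from the project.

```cpp
#include <vector>
#include <cstddef>

// Minimal row-major dense matrix: element (i, j) lives at data[i * cols + j].
struct DenseMatrix {
    std::size_t rows, cols;
    std::vector<float> data;
    DenseMatrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0f) {}
    float& at(std::size_t i, std::size_t j)       { return data[i * cols + j]; }
    float  at(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Algorithm 1 on the CPU: C[i][j] = sum_k A[i][k] * B[k][j], O(n^3) work.
DenseMatrix multiplyNaive(const DenseMatrix& A, const DenseMatrix& B) {
    DenseMatrix C(A.rows, B.cols);
    for (std::size_t i = 0; i < A.rows; ++i)
        for (std::size_t j = 0; j < B.cols; ++j)
            for (std::size_t k = 0; k < A.cols; ++k)
                C.at(i, j) += A.at(i, k) * B.at(k, j);
    return C;
}
```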
Cache optimization
The basic optimization is to accumulate the partial sum in a local cache (a register) and write it to $c_{i,j}$ only once, which reduces the number of accesses to main memory.
For $(i,j) \leftarrow (1,1), \cdots, (n,m)$ in parallel
$\text{local_cache} \leftarrow 0$
For $k \leftarrow 1, \cdots, K$
$\text{local_cache} = \text{local_cache} + a_{i,k}b_{k,j}$
$c_{i,j} = \text{local_cache}$
Algorithm 2. Matrix multiplication with a local cache
This algorithm still takes $O(n^3)$ operations. Here is a GPU implementation for it.
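The GPU code itself is not shown in this text, so below is a minimal CUDA sketch of Algorithm 2, assuming row-major float matrices ($A$ is $n \times K$, $B$ is $K \times m$); the kernel name and launch configuration are illustrative.

```cpp
// One thread per output element (i, j); the partial sum is kept in a register
// (local_cache) and written to global memory once at the end.
__global__ void matmulNaiveKernel(const float* A, const float* B, float* C,
                                  int n, int m, int K) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row of C
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column of C
    if (i >= n || j >= m) return;

    float local_cache = 0.0f;
    for (int k = 0; k < K; ++k)
        local_cache += A[i * K + k] * B[k * m + j];
    C[i * m + j] = local_cache;
}

// Example launch: 16x16 threads per block, enough blocks to cover the n x m output.
// dim3 block(16, 16);
// dim3 grid((m + 15) / 16, (n + 15) / 16);
// matmulNaiveKernel<<<grid, block>>>(dA, dB, dC, n, m, K);
```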
Blocked matrix multiplication
A matrix can be subdivided into blocks. In other words, $$A = \begin{pmatrix} a_{1,1} & a_{1,2} & a_{1,3} & a_{1,4}\\ a_{2,1} & a_{2,2} & a_{2,3} & a_{2,4}\\ a_{3,1} & a_{3,2} & a_{3,3} & a_{3,4}\\ a_{4,1} & a_{4,2} & a_{4,3} & a_{4,4} \end{pmatrix} = \begin{pmatrix} A_{1,1} & A_{1,2}\\ A_{2,1} & A_{2,2} \end{pmatrix} $$
where $$A_{1,1} = \begin{pmatrix} a_{1,1} & a_{1,2}\\ a_{2,1} & a_{2,2} \end{pmatrix} \hspace{1cm} A_{1,2} = \begin{pmatrix} a_{1,3} & a_{1,4}\\ a_{2,3} & a_{2,4} \end{pmatrix} $$
$$A_{2,1} = \begin{pmatrix} a_{3,1} & a_{3,2}\\ a_{4,1} & a_{4,2} \end{pmatrix} \hspace{1cm} A_{2,2} = \begin{pmatrix} a_{3,3} & a_{3,4}\\ a_{4,3} & a_{4,4} \end{pmatrix} $$ Matrix multiplication can also be computed block by block in the same way.
Notice that this assumes $BlockSize$ divides the matrix dimensions evenly.
However, it can be easily generalized.
$$C_{i,j} = \sum\limits_{k=1}^{K/BlockSize}A_{i,k}B_{k,j}$$
For $(i,j) \leftarrow (1,1), \cdots, (n/BlockSize, m/BlockSize)$ in parallel
$x_{base} \leftarrow (i - 1) \times BlockSize$
$y_{base} \leftarrow (j - 1) \times BlockSize$
For $(x,y) \leftarrow (1,1), \cdots, (BlockSize, BlockSize)$ in parallel
For $k \leftarrow 1, \cdots, K$
$C_{x + x_{base},y + y_{base}} += a_{x + x_{base},k} \times b_{k,y + y_{base}}$
Algorithm 3. Blocked matrix multiplication
Here is a GPU implementation for it.
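As a sketch of what such an implementation could look like (not the project's exact code), the kernel below assigns one thread block per $BlockSize \times BlockSize$ tile of $C$ while still reading $A$ and $B$ directly from global memory; `BLOCK_SIZE` and the kernel name are assumptions.

```cpp
#define BLOCK_SIZE 16  // assumed tile size; the text calls it BlockSize

// Algorithm 3: one thread block per BLOCK_SIZE x BLOCK_SIZE tile of C.
// Launch with blockDim = (BLOCK_SIZE, BLOCK_SIZE).
__global__ void matmulBlockedKernel(const float* A, const float* B, float* C,
                                    int n, int m, int K) {
    int xBase = blockIdx.y * BLOCK_SIZE;   // top row of this tile
    int yBase = blockIdx.x * BLOCK_SIZE;   // left column of this tile
    int x = xBase + threadIdx.y;           // global row handled by this thread
    int y = yBase + threadIdx.x;           // global column handled by this thread
    if (x >= n || y >= m) return;

    float sum = 0.0f;
    for (int k = 0; k < K; ++k)
        sum += A[x * K + k] * B[k * m + y];
    C[x * m + y] = sum;
}
```

By itself this is essentially Algorithm 2 with the grid organized by tiles; the tiling becomes useful once shared memory is introduced in the next section.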
Shared memory
There is a further optimization that can be applied when the algorithm is divided into blocks and shared memory is available. It assumes a shared memory tile of size $BlockSize \times BlockSize$.
For $(i,j) \leftarrow (1,1), \cdots, (n/BlockSize, m/BlockSize)$ in parallel
$x_{base} \leftarrow (i - 1) \times BlockSize$
$y_{base} \leftarrow (j - 1) \times BlockSize$
$A \leftarrow \text{shared matrix of size }BlockSize \times BlockSize$
$B \leftarrow \text{shared matrix of size }BlockSize \times BlockSize$
For $(x,y) \leftarrow (1,1), \cdots, (BlockSize, BlockSize)$ in parallel
$\text{local_cache} \leftarrow 0$
For $k \leftarrow 1, \cdots, K/BlockSize$
$k_{base} \leftarrow (k - 1) \times BlockSize$
$A_{x,y} = a_{x + x_{base},y + k_{base}}$
$B_{x,y} = b_{x + k_{base},y + y_{base}}$
For $k' \leftarrow 1, \cdots, BlockSize$
$\text{local_cache} = \text{local_cache} + A_{x, k'} \times B_{k', y}$
$C_{x + x_{base},y + y_{base}} = \text{local_cache}$
Algorithm 4. Blocked matrix multiplication with shared memory
This algorithm also takes $O(n^3)$ operations. There are some GPU implementations below.
Notice that there are four possible combinations for the shared memory matrices, obtained by storing each of them transposed or not.
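A minimal CUDA sketch of Algorithm 4 follows, assuming $BlockSize$ divides $n$, $m$, and $K$ as in the text; the names `matmulSharedKernel`, `As`, and `Bs` are illustrative.

```cpp
#define BLOCK_SIZE 16  // assumed tile size; must divide n, m and K here

// Algorithm 4: each k-step first stages one BLOCK_SIZE x BLOCK_SIZE tile of A and
// one of B in shared memory, then every thread accumulates from the staged tiles.
// Launch with blockDim = (BLOCK_SIZE, BLOCK_SIZE).
__global__ void matmulSharedKernel(const float* A, const float* B, float* C,
                                   int n, int m, int K) {
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int xBase = blockIdx.y * BLOCK_SIZE;
    int yBase = blockIdx.x * BLOCK_SIZE;
    int x = threadIdx.y;   // row inside the tile
    int y = threadIdx.x;   // column inside the tile

    float local_cache = 0.0f;
    for (int kb = 0; kb < K; kb += BLOCK_SIZE) {
        // Cooperative load of one tile of A and one tile of B.
        As[x][y] = A[(xBase + x) * K + (kb + y)];
        Bs[x][y] = B[(kb + x) * m + (yBase + y)];
        __syncthreads();

        for (int k = 0; k < BLOCK_SIZE; ++k)
            local_cache += As[x][k] * Bs[k][y];
        __syncthreads();
    }
    C[(xBase + x) * m + (yBase + y)] = local_cache;
}
```

Storing `As` and/or `Bs` transposed gives the four combinations mentioned above; the variants typically differ through shared memory bank conflicts and global load coalescing.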
Strassen's algorithm
Strassen's algorithm computes matrix multiplication asymptotically faster. In general, Strassen's algorithm takes $O(n^{\log_2 7})$.
Notice that Strassen's algorithm requires an even number of rows and columns.
However, this can be handled by padding the matrix with zeros at the end of each row and column.
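For example, a $3 \times 3$ matrix can be zero-padded to $4 \times 4$ so that it splits evenly into $2 \times 2$ blocks; the padded rows and columns of the product are simply discarded afterwards:
$$\begin{pmatrix} a_{1,1} & a_{1,2} & a_{1,3}\\ a_{2,1} & a_{2,2} & a_{2,3}\\ a_{3,1} & a_{3,2} & a_{3,3} \end{pmatrix} \rightarrow \begin{pmatrix} a_{1,1} & a_{1,2} & a_{1,3} & 0\\ a_{2,1} & a_{2,2} & a_{2,3} & 0\\ a_{3,1} & a_{3,2} & a_{3,3} & 0\\ 0 & 0 & 0 & 0 \end{pmatrix}$$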
$$
\begin{array}{lcl}
M_1 & = & (A_{1,1} + A_{2,2})(B_{1,1} + B_{2,2}) \\
M_2 & = & (A_{2,1} + A_{2,2})B_{1,1} \\
M_3 & = & A_{1,1}(B_{1,2} - B_{2,2}) \\
M_4 & = & A_{2,2}(B_{2,1} - B_{1,1}) \\
M_5 & = & (A_{1,1} + A_{1,2})B_{2,2} \\
M_6 & = & (A_{2,1} - A_{1,1})(B_{1,1} + B_{1,2}) \\
M_7 & = & (A_{1,2} - A_{2,2})(B_{2,1} + B_{2,2}) \\
\end{array}
$$
$$
\begin{pmatrix}
C_{1,1} & C_{1,2}\\
C_{2,1} & C_{2,2}
\end{pmatrix}
=
\begin{pmatrix}
M_1 + M_4 - M_5 + M_7 & M_3 + M_5\\
M_2 + M_4 & M_1 - M_2 + M_3 + M_6
\end{pmatrix}
$$
This algorithm uses 7 multiplications and 18 additions. From the master theorem, it takes $O(n^{\log_2 7})$.
Notice that naive matrix multiplication in its blocked form uses 8 multiplications and 4 additions.
To implement this on a GPU, there are some problems that need to be solved.
The first problem is memory allocation.
Since the algorithm needs extra intermediate values, it needs more memory space.
Notice that these memory spaces are needed at every level because the algorithm is called recursively.
In general, this algorithm requires $\frac{21}{4}n^2$ extra space to compute an $n \times n$ product at depth 1.
If the algorithm goes deeper, it needs even more memory: $\frac{21}{4}n^2 = 5.25n^2$ at depth 1, $(\frac{21}{4} + \frac{147}{16})n^2 = 14.4375n^2$ at depth 2, and so on,
which may become too large at some point.
Therefore, this implementation computes $M_1$ to $M_7$ one by one to reduce memory usage.
This results in much smaller extra memory: $\frac{3}{4}n^2 = 0.75n^2$ at depth 1 and $(\frac{3}{4} + \frac{3}{16})n^2 = 0.9375n^2$ at depth 2 for an $n \times n$ product.
Notice that it stays below $n^2$ at any depth.
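Concretely, each additional level adds a quarter of the previous level's extra space, so the total is a geometric series bounded by $n^2$:
$$\frac{3}{4}n^2\left(1 + \frac{1}{4} + \frac{1}{16} + \cdots\right) < \frac{3}{4}n^2 \cdot \frac{1}{1 - \frac{1}{4}} = n^2$$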
At the same time, to remove memory allocation overhead, it precomputes the memory usage, allocates it all up front, and then computes with the pre-allocated memory.
Here is a GPU implementation for it.
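The full implementation is not reproduced here; the sketch below shows the structure of one product step (here $M_1$) at a single recursion level, using three pre-allocated half-size buffers $S$, $T$, $M$ so that the products can be formed one at a time as described above. The kernel names (`addSub`, `accumulate`, `gemmTile`) and the layout assumptions (row-major, square $n \times n$, even $n$, $C$ zero-initialized) are mine, not the project's.

```cpp
#include <cuda_runtime.h>

// Out = alpha * X + beta * Y, where X and Y are half x half blocks with leading
// dimensions ldX / ldY, and Out is a packed half x half buffer (leading dim = half).
__global__ void addSub(const float* X, int ldX, const float* Y, int ldY,
                       float alpha, float beta, float* Out, int half) {
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < half && c < half)
        Out[r * half + c] = alpha * X[r * ldX + c] + beta * Y[r * ldY + c];
}

// Cblk += sign * M, where Cblk is a sub-block with leading dimension ldC and M is packed.
__global__ void accumulate(float* Cblk, int ldC, const float* M, float sign, int half) {
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < half && c < half)
        Cblk[r * ldC + c] += sign * M[r * half + c];
}

// Plain O(n^3) product of two packed half x half buffers (stand-in for the recursive call).
__global__ void gemmTile(const float* S, const float* T, float* M, int half) {
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= half || c >= half) return;
    float acc = 0.0f;
    for (int k = 0; k < half; ++k) acc += S[r * half + k] * T[k * half + c];
    M[r * half + c] = acc;
}

// Host driver for M1 = (A11 + A22)(B11 + B22), accumulated into C11 and C22.
// M2..M7 follow the same pattern with different block choices and signs, reusing S, T, M.
void strassenM1(const float* A, const float* B, float* C, int n,
                float* S, float* T, float* M) {
    int half = n / 2;
    dim3 blk(16, 16), grd((half + 15) / 16, (half + 15) / 16);
    const float* A11 = A;  const float* A22 = A + half * n + half;
    const float* B11 = B;  const float* B22 = B + half * n + half;
    float* C11 = C;        float* C22 = C + half * n + half;

    addSub<<<grd, blk>>>(A11, n, A22, n, 1.0f, 1.0f, S, half);   // S = A11 + A22
    addSub<<<grd, blk>>>(B11, n, B22, n, 1.0f, 1.0f, T, half);   // T = B11 + B22
    gemmTile<<<grd, blk>>>(S, T, M, half);                       // M1 = S * T
    accumulate<<<grd, blk>>>(C11, n, M, +1.0f, half);            // C11 += M1
    accumulate<<<grd, blk>>>(C22, n, M, +1.0f, half);            // C22 += M1
}
```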
Now comes the second problem, which is the computation order.
Thinking about memory usage, when computing $M_2$ there is no need to load $B_{1,1}$ into actual memory to compute $M_2$.
Therefore, that load can be skipped.
Likewise, if consecutive products use the same or similar values, memory traffic can be reduced.
Each $M_i$ can be thought of as $S_i \times T_i$, and an edit distance can then be constructed as follows.
$$\text{Distance}(\text{State},\text{Target}) = \sum\limits_{i=1}^{2}\sum\limits_{j=1}^{2}
\left(\unicode{x1D7D9}(A_{i,j} \in \text{Target}, A_{i,j} \not\in \text{State})
+ \unicode{x1D7D9}(A_{i,j} \in \text{State}, A_{i,j} \not\in \text{Target})\right)$$
$\text{EditDistance}(S)_{i,j} = \operatorname{min}(\text{Distance}(S_i, S_j),\ 1 + \text{Distance}(-S_i, S_j),\ 1 + \text{Distance}(\emptyset, S_j))$
$\text{EditDistance}(T)_{i,j} = \operatorname{min}(\text{Distance}(T_i, T_j),\ 1 + \text{Distance}(-T_i, T_j),\ 1 + \text{Distance}(\emptyset, T_j))$
Notice that there are three cases to update values.
- Just update the values that need to change ($\text{Distance}(S_i, S_j)$)
- Flip the sign of the values and update ($1 + \text{Distance}(-S_i, S_j)$)
- Clear the values and update ($1 + \text{Distance}(\emptyset, S_j)$)
The problem is then an ordering problem: order $M_1$ to $M_7$ so that the sum of edit distances is minimized.
To pick the distance at each step, there are three ways.
- Update both the matrix and the input to the new operands and add a cost of $\text{EditDistance}(S)_{k,k+1} + \text{EditDistance}(T)_{k,k+1}$. In this case, both $S_k$ and $T_k$ are updated.
- If $T_k$ has only one element among the possible values, update only the matrix and add a cost of $\text{EditDistance}(S)_{k,k+1}$. In this case, only $S_k$ is updated.
- If $S_k$ has only one element among the possible values, update only the input and add a cost of $\text{EditDistance}(T)_{k,k+1}$. In this case, only $T_k$ is updated.
Notice that there were originally 18 setMat and 18 addMat operations, but after reordering there are 14 setMat and 18 addMat operations.
Interestingly, the optimal ordering never flips or partially updates values; it simply clears the buffer and fills it from scratch.
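As an illustration of the search itself, here is a small host-side brute force over all $7!$ orderings using the signed block coefficients of Strassen's $S_i$ and $T_i$; it is a simplified re-implementation (it sums the per-buffer edit distances and ignores the single-element special cases above), so the exact costs may differ from the project's.

```cpp
#include <algorithm>
#include <cstdio>

// Signed block coefficients of Strassen's operands: S_i over (A11, A12, A21, A22)
// and T_i over (B11, B12, B21, B22), taken from the M_i definitions above.
const int S[7][4] = {{1,0,0,1},{0,0,1,1},{1,0,0,0},{0,0,0,1},{1,1,0,0},{-1,0,1,0},{0,1,0,-1}};
const int T[7][4] = {{1,0,0,1},{1,0,0,0},{0,1,0,-1},{-1,0,1,0},{0,0,0,1},{1,1,0,0},{0,0,1,1}};

// Distance: number of sub-blocks whose coefficient has to change to turn `from` into `to`.
int distance(const int* from, const int* to) {
    int d = 0;
    for (int b = 0; b < 4; ++b) d += (from[b] != to[b]);
    return d;
}

// Edit distance with the three options: update in place, flip then update,
// clear then refill (the latter two cost one extra operation).
int editDistance(const int* from, const int* to) {
    int flipped[4], empty[4] = {0, 0, 0, 0};
    for (int b = 0; b < 4; ++b) flipped[b] = -from[b];
    return std::min({distance(from, to),
                     1 + distance(flipped, to),
                     1 + distance(empty, to)});
}

int main() {
    int order[7] = {0, 1, 2, 3, 4, 5, 6}, best[7], bestCost = 1 << 30;
    do {
        int cost = 0;
        for (int k = 0; k < 6; ++k)
            cost += editDistance(S[order[k]], S[order[k + 1]])
                  + editDistance(T[order[k]], T[order[k + 1]]);
        if (cost < bestCost) { bestCost = cost; std::copy(order, order + 7, best); }
    } while (std::next_permutation(order, order + 7));

    std::printf("best total cost %d, order:", bestCost);
    for (int i = 0; i < 7; ++i) std::printf(" M%d", best[i] + 1);
    std::printf("\n");
    return 0;
}
```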
Winograd's algorithm
Winograd's variant uses the same idea as Strassen's algorithm with 7 multiplications, but rearranges the computation to use fewer additions:
$$
\begin{array}{lclclcl}
S_1 & = & A_{2,1} + A_{2,2} & \hspace{1cm} & S_2 & = & S_1 - A_{1,1} \\
S_3 & = & A_{1,1} - A_{2,1} & \hspace{1cm} & S_4 & = & A_{1,2} - S_2 \\
T_1 & = & B_{1,2} - B_{1,1} & \hspace{1cm} & T_2 & = & B_{2,2} - T_1 \\
T_3 & = & B_{2,2} - B_{1,2} & \hspace{1cm} & T_4 & = & T_2 - B_{2,1} \\
M_1 & = & A_{1,1}B_{1,1} & \hspace{1cm} & M_2 & = & A_{1,2}B_{2,1} \\
M_3 & = & S_4B_{2,2} & \hspace{1cm} & M_4 & = & A_{2,2}T_4 \\
M_5 & = & S_1T_1 & \hspace{1cm} & M_6 & = & S_2T_2 \\
M_7 & = & S_3T_3 & \hspace{1cm} & U_1 & = & M_1 + M_2 \\
U_2 & = & M_1 + M_6 & \hspace{1cm} & U_3 & = & U_2 + M_7 \\
U_4 & = & U_2 + M_5 & \hspace{1cm} & U_5 & = & U_4 + M_3 \\
U_6 & = & U_3 - M_4 & \hspace{1cm} & U_7 & = & U_3 + M_5 \\
\end{array}
$$
$$
\begin{pmatrix}
C_{1,1} & C_{1,2}\\
C_{2,1} & C_{2,2}
\end{pmatrix}
=
\begin{pmatrix}
U_1 & U_5 \\
U_6 & U_7
\end{pmatrix}
$$
This algorithm uses 7 multiplications and 15 additions, which is fewer operations than Strassen's version.
To implement it, however, there is a problem that Strassen's algorithm does not have.
To compute the final values, a strict ordering is required.
For example, $S_4$ needs $S_2$ and $S_2$ needs $S_1$,
which results in the order $S_1 \rightarrow S_2 \rightarrow S_4$.
Therefore, there are a few ways to arrange the auxiliary memory.
In this project, I used one of them.
As a result, it uses the same amount of memory as Strassen's algorithm but performs far fewer operations.
There were 14 setMat and 18 addMat operations; now there are 8 setMat and 15 addMat operations.
Here is the implementation of it.
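The implementation is not reproduced here; the sketch below shows one level of Winograd's variant on the GPU, reusing the `addSub` and `accumulate` kernels from the Strassen sketch plus a leading-dimension-aware product kernel `gemmLd`. It follows one possible schedule that respects the $S_1 \rightarrow S_2 \rightarrow S_4$ and $T_1 \rightarrow T_2 \rightarrow T_4$ dependencies with only three half-size scratch buffers; for simplicity it accumulates each $M_i$ directly into the blocks of $C$ instead of forming the $U_i$ chain, so its addition count is higher than the 15 quoted above. All names and layout assumptions (row-major, even $n$, $C$ zero-initialized) are illustrative, not necessarily the project's schedule.

```cpp
// M = X * Y for half x half operands with arbitrary leading dimensions, so that
// sub-blocks of A and B and packed scratch buffers can be multiplied directly.
__global__ void gemmLd(const float* X, int ldX, const float* Y, int ldY,
                       float* M, int half) {
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= half || c >= half) return;
    float acc = 0.0f;
    for (int k = 0; k < half; ++k) acc += X[r * ldX + k] * Y[k * ldY + c];
    M[r * half + c] = acc;
}

// One Winograd level: S, T, M are packed half x half scratch buffers; C starts at zero.
void winogradLevel(const float* A, const float* B, float* C, int n,
                   float* S, float* T, float* M) {
    int half = n / 2;
    dim3 blk(16, 16), grd((half + 15) / 16, (half + 15) / 16);
    const float *A11 = A, *A12 = A + half, *A21 = A + half * n, *A22 = A21 + half;
    const float *B11 = B, *B12 = B + half, *B21 = B + half * n, *B22 = B21 + half;
    float *C11 = C, *C12 = C + half, *C21 = C + half * n, *C22 = C21 + half;

    gemmLd<<<grd, blk>>>(A11, n, B11, n, M, half);               // M1 = A11*B11
    accumulate<<<grd, blk>>>(C11, n, M, +1.0f, half);            // M1 feeds every block
    accumulate<<<grd, blk>>>(C12, n, M, +1.0f, half);
    accumulate<<<grd, blk>>>(C21, n, M, +1.0f, half);
    accumulate<<<grd, blk>>>(C22, n, M, +1.0f, half);
    gemmLd<<<grd, blk>>>(A12, n, B21, n, M, half);               // M2 = A12*B21
    accumulate<<<grd, blk>>>(C11, n, M, +1.0f, half);            // C11 = M1 + M2

    addSub<<<grd, blk>>>(A21, n, A22, n, +1.0f, +1.0f, S, half); // S1 = A21 + A22
    addSub<<<grd, blk>>>(B12, n, B11, n, +1.0f, -1.0f, T, half); // T1 = B12 - B11
    gemmLd<<<grd, blk>>>(S, half, T, half, M, half);             // M5 = S1*T1
    accumulate<<<grd, blk>>>(C12, n, M, +1.0f, half);
    accumulate<<<grd, blk>>>(C22, n, M, +1.0f, half);

    addSub<<<grd, blk>>>(S, half, A11, n, +1.0f, -1.0f, S, half); // S2 = S1 - A11
    addSub<<<grd, blk>>>(B22, n, T, half, +1.0f, -1.0f, T, half); // T2 = B22 - T1
    gemmLd<<<grd, blk>>>(S, half, T, half, M, half);              // M6 = S2*T2
    accumulate<<<grd, blk>>>(C12, n, M, +1.0f, half);
    accumulate<<<grd, blk>>>(C21, n, M, +1.0f, half);
    accumulate<<<grd, blk>>>(C22, n, M, +1.0f, half);

    addSub<<<grd, blk>>>(A12, n, S, half, +1.0f, -1.0f, S, half); // S4 = A12 - S2
    gemmLd<<<grd, blk>>>(S, half, B22, n, M, half);               // M3 = S4*B22
    accumulate<<<grd, blk>>>(C12, n, M, +1.0f, half);

    addSub<<<grd, blk>>>(T, half, B21, n, +1.0f, -1.0f, T, half); // T4 = T2 - B21
    gemmLd<<<grd, blk>>>(A22, n, T, half, M, half);               // M4 = A22*T4
    accumulate<<<grd, blk>>>(C21, n, M, -1.0f, half);

    addSub<<<grd, blk>>>(A11, n, A21, n, +1.0f, -1.0f, S, half);  // S3 = A11 - A21
    addSub<<<grd, blk>>>(B22, n, B12, n, +1.0f, -1.0f, T, half);  // T3 = B22 - B12
    gemmLd<<<grd, blk>>>(S, half, T, half, M, half);              // M7 = S3*T3
    accumulate<<<grd, blk>>>(C21, n, M, +1.0f, half);
    accumulate<<<grd, blk>>>(C22, n, M, +1.0f, half);
}
```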