GPU utilization

Motivation

Most of time, GPU sufferes from a lack of memories.
There are many techniques that tries to optimize memory.
However, utilizing GPU memory is a hard problem.
This study was done for studying about algorithmic approach to determine a good memory ordering for a gpu utilization.

Problem definition

Let's assume there are M data.
Then, M data will be consisted by the some combination of address spaces.
For given data $D = \{D_1, D_2, \cdots, D_m\}$, each data will have a transfer cost.
Let's call each transfer costs as $C = \{c_1, c_2, \cdots, c_m\}$.
These data will be used on the kernel calls $K = \{K_1, K_2, \cdots, K_n\}$.
Now each kernel call will require some set of data in D.
Let's define these required data as $R = \{R_1, R_2, \cdots, R_n\}$
Notice that $R_i \subseteq 2^D$ and $c_j > 0$ $\forall 1 \le i \le n ,1 \le j \le m$.
Now, the problem can be a minimization problem as follows.
$$\text{Minimize } \sum\limits_{i = 1}^{n}\sum\limits_{j = 1}^{m} \unicode{x1D7D9}(D_j \in R_i, D_j \not\in R_{i - 1})c_j$$ In here, it is assumed to be a round-kernel call. Which means, it will be $K_1$ $\rightarrow$ $K_2$ $\rightarrow$ $\cdots$ $K_n = K_0$ $\rightarrow$ $K_1$ and so-on.

Naive approach

Native way to solve this is trying every possible order and checking it.
In general, it is possible but it takes too much if there are too many kernel calls.
Notice that it takes $O(n!)$. Therefore, it is not realistic to use as it is.

Approximated approach

Formula above is a bit complex, but it can be transfered as follows.
$ \begin{array}{lcl} & \sum\limits_{i = 1}^{n}\sum\limits_{j = 1}^{m} \unicode{x1D7D9}(D_j \in R_i, D_j \not\in R_{i - 1})c_j \\ = & \frac{1}{2}\sum\limits_{i = 1}^{n}\sum\limits_{j = 1}^{m}\unicode{x1D7D9}(D_j \in R_i, D_j \not\in R_{i - 1})c_j & + & \frac{1}{2}\sum\limits_{i = 1}^{n}\sum\limits_{j = 1}^{m}\unicode{x1D7D9}(D_j \in R_i, D_j \not\in R_{i - 1})c_j \\ = & \frac{1}{2}\sum\limits_{i = 1}^{n}\sum\limits_{j = 1}^{m}(\unicode{x1D7D9}(D_j \in R_i \cup R_{i - 1}) - \unicode{x1D7D9}(D_j \in R_{i - 1}))c_j & + & \frac{1}{2}\sum\limits_{i = 1}^{n}\sum\limits_{j = 1}^{m}(\unicode{x1D7D9}(D_j \in R_i)- \unicode{x1D7D9}(D_j \in R_i \cap R_{i-1}))c_j \\ = & \frac{1}{2}\sum\limits_{i = 1}^{n}\sum\limits_{j = 1}^{m}\unicode{x1D7D9}(D_j \in R_i \cup R_{i - 1})c_j - \frac{1}{2}\sum\limits_{i = 1}^{n}\sum\limits_{j = 1}^{m}\unicode{x1D7D9}(D_j \in R_{i - 1})c_j & + & \frac{1}{2}\sum\limits_{i = 1}^{n}\sum\limits_{j = 1}^{m}\unicode{x1D7D9}(D_j \in R_i)c_j - \frac{1}{2}\sum\limits_{i = 1}^{n}\sum\limits_{j = 1}^{m}\unicode{x1D7D9}(D_j \in R_i \cap R_{i-1})c_j \\ = & \frac{1}{2}\sum\limits_{i = 1}^{n}\sum\limits_{j = 1}^{m}\unicode{x1D7D9}(D_j \in R_i \cup R_{i - 1})c_j - \frac{1}{2}\sum\limits_{i = 1}^{n}\sum\limits_{j = 1}^{m}\unicode{x1D7D9}(D_j \in R_i \cap R_{i-1})c_j & + & \frac{1}{2}\sum\limits_{i = 1}^{n}\sum\limits_{j = 1}^{m}\unicode{x1D7D9}(D_j \in R_i)c_j - \frac{1}{2}\sum\limits_{i = 1}^{n}\sum\limits_{j = 1}^{m}\unicode{x1D7D9}(D_j \in R_{i - 1})c_j \\ = & \frac{1}{2}\sum\limits_{i = 1}^{n}\sum\limits_{j = 1}^{m}(\unicode{x1D7D9}(D_j \in R_i \cup R_{i - 1}) - \unicode{x1D7D9}(D_j \in R_i \cap R_{i-1}))c_j \\ = & \frac{1}{2}\sum\limits_{i = 1}^{n}\sum\limits_{j = 1}^{m}(\unicode{x1D7D9}(D_j \in R_i \cup R_{i - 1}) - \unicode{x1D7D9}(D_j \in R_i \cap R_{i-1}))c_j \\ \end{array} $ Notice that $\sum\limits_{i = 1}^{n}\sum\limits_{j = 1}^{m}\unicode{x1D7D9}(D_j \in R_i)c_j $ $=$ $\sum\limits_{i = 1}^{n}\sum\limits_{j = 1}^{m}\unicode{x1D7D9}(D_j \in R_{i - 1})c_j$ because entire costs doesn't depend on the order.
As a result, problem is converted to be a symmetric unlike before.
Now, it can be converted to the metric.
Let's define $d_{i,j}$ $=$ $\sum\limits_{x = 1}^{m}(\unicode{x1D7D9}(D_x \in R_i \cup R_j) - \unicode{x1D7D9}(D_x \in R_i \cap R_j))c_x$
Now, it fullfills following property of metrics.

$d_{i,j} = 0$ $\Leftrightarrow$ $i = j$
$d_{i,j} = d_{j,i}$
$d_{i,j} \le d_{i,k} + d_{k,j}$

Proofs are as follows.

It need to be proven for both ways.
- ($\Leftarrow$)
  $d_{i,i}$ $=$ $\sum\limits_{x = 1}^{m}(\unicode{x1D7D9}(D_x \in R_i \cup R_i) - \unicode{x1D7D9}(D_x \in R_i \cap R_i))c_x$ $=$ $\sum\limits_{x = 1}^{m}(\unicode{x1D7D9}(D_x \in R_i) - \unicode{x1D7D9}(D_x \in R_i))c_x$ $=$ $0$
- ($\Rightarrow$)
  $0$ $=$ $d_{i,j}$ $=$ $\sum\limits_{x = 1}^{m}(\unicode{x1D7D9}(D_x \in R_i \cup R_j) - \unicode{x1D7D9}(D_x \in R_i \cap R_j))c_x$
  $\Leftrightarrow$ $\unicode{x1D7D9}(D_x \in R_i \cup R_j) - \unicode{x1D7D9}(D_x \in R_i \cap R_j) = 0$ $\Leftrightarrow$ $R_i = R_j$.
Notice that, if two kernels uses the same data, it can be combined in to one kernel.
$d_{i,j}$ $=$ $\sum\limits_{x = 1}^{m}(\unicode{x1D7D9}(D_x \in R_i \cup R_j) - \unicode{x1D7D9}(D_x \in R_i \cap R_j))c_x$ $=$ $\sum\limits_{x = 1}^{m}(\unicode{x1D7D9}(D_x \in R_j \cup R_i) - \unicode{x1D7D9}(D_x \in R_j \cap R_i))c_x$ $=$ $d_{j,i}$
$ \begin{array}{lcl} d_{i,j} & = & \sum\limits_{x = 1}^{m}(\unicode{x1D7D9}(D_x \in R_i \cup R_j) - \unicode{x1D7D9}(D_x \in R_i \cap R_j))c_x \\ & = & \sum\limits_{x = 1}^{m}(\unicode{x1D7D9}(D_x \in R_i - R_j) + \unicode{x1D7D9}(D_x \in R_j - R_i))c_x \\ & = & \sum\limits_{x = 1}^{m}(\unicode{x1D7D9}(D_x \in (R_i - R_j) \cap R_k) + \unicode{x1D7D9}(D_x \in (R_i - R_j) \cap (U - R_k)))c_x \\ & + & \sum\limits_{x = 1}^{m}(\unicode{x1D7D9}(D_x \in (R_j - R_i) \cap R_k) + \unicode{x1D7D9}(D_x \in (R_j - R_i) \cap (U - R_k)))c_x \\ & = & \sum\limits_{x = 1}^{m}(\unicode{x1D7D9}(D_x \in (R_i - R_j) \cap R_k) + \unicode{x1D7D9}(D_x \in (R_j - R_i) \cap (U - R_k)))c_x \\ & + & \sum\limits_{x = 1}^{m}(\unicode{x1D7D9}(D_x \in (R_j - R_i) \cap R_k) + \unicode{x1D7D9}(D_x \in (R_i - R_j) \cap (U - R_k)))c_x \\ & = & \sum\limits_{x = 1}^{m}(\unicode{x1D7D9}(D_x \in ((R_i \cap R_k) - (R_j \cap R_k))) + \unicode{x1D7D9}(D_x \in (R_j - R_i - R_k)))c_x \\ & + & \sum\limits_{x = 1}^{m}(\unicode{x1D7D9}(D_x \in ((R_j \cap R_k) - (R_i \cap R_k))) + \unicode{x1D7D9}(D_x \in (R_i - R_j - R_k)))c_x \\ & \le & \sum\limits_{x = 1}^{m}(\unicode{x1D7D9}(D_x \in (R_k - (R_j \cap R_k))) + \unicode{x1D7D9}(D_x \in (R_j - R_k)))c_x \\ & + & \sum\limits_{x = 1}^{m}(\unicode{x1D7D9}(D_x \in (R_k - (R_i \cap R_k))) + \unicode{x1D7D9}(D_x \in (R_i - R_k)))c_x \\ & = & \sum\limits_{x = 1}^{m}(\unicode{x1D7D9}(D_x \in (R_k - R_j)) + \unicode{x1D7D9}(D_x \in (R_j - R_k)))c_x \\ & + & \sum\limits_{x = 1}^{m}(\unicode{x1D7D9}(D_x \in (R_k - R_i)) + \unicode{x1D7D9}(D_x \in (R_i - R_k)))c_x \\ & = & d_{i,k} + d_{j,k} \\ \end{array} $

As a result, it can be solved with the metric salesman traveling problem.
There are two ways to solve this.

Minimum spanning tree. It will give 2 approximation algorithm for the metric salesman traveling problem.
Christofides algorithm. It will give 1.5 approximation algorithm for the metric salesman traveling problem.

Limitation

This approach works when kerenels can be reordered.
Which means algorithm need to be big enough to be seperated as multiple parts without dependencies.
Therefore, this may only works for only scientific algorithms.

Extension

In general, this study suggests that memory loading ordering problem for GPU can be transfered into TSP.
At the same time, TSP is one of the well-known problem between many comptuer science problems.
Therefore, it can proceed more by using research about TSP.