Kernels
The basic idea of Kernels comes from the problem of separating data that is not linearly separable, as shown below:
The boundary separating the two classes is clearly not a straight line in this 2D space. So, how can we learn this separation? One trick is to transform the data by mapping each point into a new space, learn the separator in that space, and then map the result back to our original space. This is similar to the time-frequency transformations of signals using Fourier transforms, which let us better understand characteristics in the time and frequency domains. The naive way to do this is to map each variable \(x\) from our initial space to a variable \(\phi(x)\) in the new space. If the problem becomes a linear classification problem in this new space, then we are essentially applying a linear classifier to a non-linear classification problem. The only issue is the computational cost that comes with higher-dimensional spaces! To alleviate this, we use Kernels.
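To make this concrete, here is a minimal sketch (with made-up toy data) using the quadratic map \(\phi(x) = [x_1^2 \,\, \sqrt{2}x_1x_2 \,\, x_2^2]^T\) that appears later in this section: two concentric rings cannot be separated by a line in 2D, but in the transformed space the squared radius \(x_1^2 + x_2^2\) is a linear function of the new features, so a plane separates them perfectly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two concentric rings: inner ring = class 0, outer ring = class 1.
def make_ring(radius, n, noise=0.05):
    angles = rng.uniform(0, 2 * np.pi, n)
    r = radius + noise * rng.standard_normal(n)
    return np.c_[r * np.cos(angles), r * np.sin(angles)]

X = np.vstack([make_ring(1.0, 100), make_ring(3.0, 100)])
y = np.r_[np.zeros(100), np.ones(100)]

# Quadratic feature map phi(x) = [x1^2, sqrt(2)*x1*x2, x2^2]
def phi(X):
    return np.c_[X[:, 0] ** 2, np.sqrt(2) * X[:, 0] * X[:, 1], X[:, 1] ** 2]

Z = phi(X)

# In the new space, z1 + z3 = x1^2 + x2^2 (the squared radius), so a simple
# plane such as z1 + z3 = 4 separates the two rings.
pred = (Z[:, 0] + Z[:, 2] > 4.0).astype(float)
print("linear separator accuracy in phi-space:", (pred == y).mean())  # ~1.0
```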
Kernel Trick
Let’s say we have the base features \(x_i\) and we apply a transformation \(\phi(x_i)\) that maps each point according to some fixed rule. Now, if we were to apply a technique such as linear regression on the original points, our problem would be to calculate the weights:
\[w = (X^TX)^{-1} X^TY\]And so, with the transformations, it becomes:
\[w = (\mathbf{\Phi}^T\mathbf{\Phi})^{-1} \mathbf{\Phi}^TY\]where \(\mathbf{\Phi}\) is the matrix whose \(n\)-th row is \(\phi(x_n)^T\). If we add regularization, we just add the ridge regression term:
\[w = (\mathbf{\Phi}^T\mathbf{\Phi} + \lambda I)^{-1} \mathbf{\Phi}^TY\]The problem is, calculating \(\phi(x)\) is hard! With high-volume data, and given the central role these transformations play in the case of SVMs, computing the transformation for every point just adds complexity. Let’s try to simplify this a bit by looking at our ridge regressor:
\[\begin{aligned} &J(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^N(y_n - \mathbf{w}^T\phi(\mathbf{x}_n))^2 + \frac{\lambda}{2} ||\mathbf{w}||^2 \\ \implies &\mathbf{w}^* = \frac{1}{\lambda} \sum_{n=1}^N (y_n - \mathbf{w}^T\phi(\mathbf{x}_n))\phi(\mathbf{x}_n) \end{aligned}\]where the second line comes from setting the gradient with respect to \(\mathbf{w}\) to zero. Let the coefficient of each \(\phi(\mathbf{x}_n)\) be:
\(\alpha_n = \frac{1}{\lambda} (y_n - \mathbf{w}^T\phi(\mathbf{x}_n))\)
Thus, we can re-write the weights as:
\[\mathbf{w}^* = \mathbf{\Phi}^T\mathbf{\alpha}\]Thus, if we substitute this value into the expression for \(J(\mathbf{w})\), we get a dual form that depends on \(\mathbf{\alpha}\) and \(\mathbf{\Phi}\mathbf{\Phi}^T\), very similar to what we get in the dual form of SVMs. This matrix of pairwise dot products can be written as a Kernel Matrix:
\[\mathbf{\Phi} \mathbf{\Phi}^T = \mathbf{K} = [\phi(x_i)^T\phi(x_j)] \quad \forall \,\, i,j = 1, ..., N\]This matrix has 2 properties:
- It is Symmetric → \(K = K^T\)
- It is Positive Semi-Definite → for any vector \(\alpha\), the quadratic form it defines is greater than or equal to 0, i.e. \(\alpha^T K \alpha \geq 0\)
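To spell out the substitution (a step skipped above): plugging \(\mathbf{w}^* = \mathbf{\Phi}^T\mathbf{\alpha}\) back into \(J(\mathbf{w})\) gives an objective that depends on the data only through \(\mathbf{K} = \mathbf{\Phi}\mathbf{\Phi}^T\):
\[J(\mathbf{\alpha}) = \frac{1}{2} ||Y - \mathbf{K}\mathbf{\alpha}||^2 + \frac{\lambda}{2} \mathbf{\alpha}^T \mathbf{K} \mathbf{\alpha}\]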
Minimizing this dual objective with respect to \(\mathbf{\alpha}\), we get the following result:
\[\mathbf{\alpha}^* = (\mathbf{K} + \lambda \mathbf{I})^{-1} Y\]The difference between the original problem and this one is simple → before, we had to compute \(\mathbf{\Phi}^T \mathbf{\Phi}\), but now we have to compute \(\mathbf{K} = \mathbf{\Phi} \mathbf{\Phi}^T\). The two seem similar, but there is a catch:
According to Mercer’s Theorem, any symmetric, positive semi-definite kernel function can be expressed as an inner product in some feature space:
\[K(x,y) = \langle \phi(x), \phi(y) \rangle\]In other words, we can compute the entries of \(\mathbf{K}\) directly from the original features, without ever constructing \(\phi\):
\[\mathbf{\Phi} \mathbf{\Phi}^T = \mathbf{K} = [\phi(x_i)^T\phi(x_j)] = [K(x_i,x_j)]\]This is the Kernel Trick! To elaborate, let’s take the transformation:
\[\phi: x \rightarrow \phi(x) = [x_1^2 \quad \sqrt{2}x_1x_2 \quad x_2^2]^T\]A dot product under this transformation can be re-written as:
\[\begin{aligned} &\phi(\mathbf{x}_m)^T\phi(\mathbf{x}_n) = x_{m1}^2x_{n1}^2 + 2x_{m1}x_{n1}x_{m2}x_{n2} + x_{m2}^2x_{n2}^2 \\ \implies &\phi(\mathbf{x}_m)^T\phi(\mathbf{x}_n) = (x_{m1}x_{n1} + x_{m2}x_{n2})^2 = (\mathbf{x}^T_m\mathbf{x}_n)^2 \end{aligned}\]Thus, all we need is this final form of the kernel, since the whole optimization relies on this dot product and we never have to transform each feature through the original expression! For predictions, we originally needed to calculate the weights, since:
\[y = \mathbf{w}^T \phi(x)\]But, as we have shown, these weights depend on dot products, which can be expressed through the kernel:
\[\mathbf{w}^T \phi(x) = \mathbf{k}_x^T(\mathbf{K} + \lambda \mathbf{I})^{-1} Y\]where \(\mathbf{k}_x = [K(x_1, x), ..., K(x_N, x)]^T\). To compute this, we never need to know the explicit form of \(\phi(x)\)! This is what allows SVMs to work in very high-dimensional feature spaces.
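Putting it all together, here is a minimal numerical sketch (toy data, made up for illustration) of kernel ridge regression with the quadratic kernel \(K(x, z) = (\mathbf{x}^T\mathbf{z})^2\) from the example above: it solves \(\mathbf{\alpha}^* = (\mathbf{K} + \lambda \mathbf{I})^{-1}Y\), predicts via \(\mathbf{k}_x^T\mathbf{\alpha}^*\), and checks that this matches the explicit-feature solution \(w = (\mathbf{\Phi}^T\mathbf{\Phi} + \lambda I)^{-1}\mathbf{\Phi}^TY\).

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: 2D inputs, targets generated from a quadratic function of x.
N = 50
X = rng.standard_normal((N, 2))
Y = X[:, 0] ** 2 + 0.5 * X[:, 0] * X[:, 1] - X[:, 1] ** 2 + 0.1 * rng.standard_normal(N)
lam = 1e-2

# Explicit feature map phi(x) = [x1^2, sqrt(2)*x1*x2, x2^2]  (primal solution).
def phi(X):
    return np.c_[X[:, 0] ** 2, np.sqrt(2) * X[:, 0] * X[:, 1], X[:, 1] ** 2]

Phi = phi(X)
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ Y)

# Kernel trick: K(x, z) = (x^T z)^2 gives the same dot products without building phi.
K = (X @ X.T) ** 2                                  # N x N Gram matrix
alpha = np.linalg.solve(K + lam * np.eye(N), Y)     # alpha* = (K + lam*I)^{-1} Y

# Predict at new points: w^T phi(x)  vs  k_x^T alpha
X_new = rng.standard_normal((5, 2))
primal_pred = phi(X_new) @ w
kernel_pred = ((X_new @ X.T) ** 2) @ alpha          # each row is k_x for a new point

print(np.allclose(primal_pred, kernel_pred))        # True: identical predictions
```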