The difference between kernel ridge regression and feature transform + Ridge

I assume you already know how kernel ridge regression and feature transform + Ridge work. They are largely the same thing, so I will only list some minor differences between them.

  1. You can only use the feature transform + Ridge approach when the number of features is finite. A kernel such as the RBF kernel corresponds to an infinite-dimensional feature map, so it has no explicit-transform counterpart.

  2. You can switch off the bias feature in the transform. Ridge can still include an intercept term, and that intercept is not part of the penalty. For kernel ridge with the polynomial kernel, however, the implicit constant feature is always present and the penalty term always includes it.

  3. You can scale the features generated by the transform before passing them to Ridge, which is equivalent to customizing the penalty strength for each feature, so Ridge is more flexible. On the other hand, you normally have fewer parameters to tune with KRR. (Points 1–3 are illustrated in the first sketch after this list.)

  4. The fit and prediction times are different. In kernel ridge you have to solve the N x N linear system (K + alpha*I) x = y, whereas with the transform + Ridge you only have to solve the (D+1) x (D+1) system (Phi^T Phi + alpha*I) x = Phi^T y, where N is the number of training samples, D is the degree of the polynomial (for a single input feature), and Phi is the N x (D+1) design matrix. Prediction with the kernel model also requires evaluating the kernel against all N training points, while the explicit model only needs its D+1 coefficients. (See the second sketch after this list.)

  5. The kernel matrix becomes (almost) singular when two training samples are (nearly) identical. Moreover, when alpha is very small (for example, when you want to interpolate the data), you run into numerical stability problems, because K + alpha*I is then almost singular. The only way around this is to use Ridge on the explicit features: although the covariance matrix Phi^T Phi can also be singular, that singularity causes no trouble when the primal problem is solved with an SVD-based least-squares solver. This is explained in many machine learning textbooks. (See the third sketch after this list.)
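A minimal sketch of points 1–3, assuming scikit-learn is available and using a made-up 1-D toy problem; the hyperparameters (degree, alpha, gamma, coef0) are arbitrary choices for illustration, not recommendations:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(50, 1))
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.normal(size=50)

# Kernel ridge with a polynomial kernel: the feature map is implicit and
# fixed, and the penalty acts on every implicit feature, bias included.
krr = KernelRidge(kernel="polynomial", degree=3, gamma=1.0, coef0=1.0, alpha=1e-3)
krr.fit(X, y)

# Explicit transform + Ridge: you choose whether the bias column is generated,
# whether an unpenalized intercept is fitted, and you may rescale individual
# columns, which amounts to giving each feature its own penalty strength.
poly = PolynomialFeatures(degree=3, include_bias=False)
Phi = poly.fit_transform(X)
ridge = Ridge(alpha=1e-3, fit_intercept=True)
ridge.fit(Phi, y)

# The two predictors are close but not identical: the polynomial kernel
# implicitly rescales the monomials and penalizes the constant feature.
X_test = np.linspace(-1.0, 1.0, 5).reshape(-1, 1)
print(krr.predict(X_test))
print(ridge.predict(poly.transform(X_test)))
```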
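A minimal NumPy-only sketch of point 4, assuming a 1-D input and a plain monomial feature map, so the kernel here is just the linear kernel on those explicit features. It shows the two linear systems and that they yield the same predictor:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, alpha = 200, 3, 1e-2
x = rng.uniform(-1.0, 1.0, size=N)
y = np.sin(3.0 * x) + 0.1 * rng.normal(size=N)

# Explicit monomial features: Phi is N x (D+1) with columns 1, x, ..., x^D.
Phi = np.vander(x, D + 1, increasing=True)

# Dual / kernel form: K = Phi Phi^T is N x N, solve (K + alpha*I) a = y.
K = Phi @ Phi.T
a = np.linalg.solve(K + alpha * np.eye(N), y)

# Primal / explicit form: solve the (D+1) x (D+1) normal equations
# (Phi^T Phi + alpha*I) w = Phi^T y.
w = np.linalg.solve(Phi.T @ Phi + alpha * np.eye(D + 1), Phi.T @ y)

# Both give the same predictions, but the kernel model has to touch all N
# training points at prediction time, while the primal model only needs
# its D+1 coefficients.
x_test = np.linspace(-1.0, 1.0, 5)
Phi_test = np.vander(x_test, D + 1, increasing=True)
print(Phi_test @ (Phi.T @ a))  # dual prediction
print(Phi_test @ w)            # primal prediction
```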
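Finally, a minimal NumPy sketch of point 5, using the same monomial features as above; the duplicate offset and the tiny alpha are arbitrary values chosen only to expose the conditioning problem. Note that with a low-degree polynomial kernel K already has rank at most D+1 << N, and the near-duplicate sample only makes things worse:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, alpha = 100, 3, 1e-12
x = rng.uniform(-1.0, 1.0, size=N)
x[1] = x[0] + 1e-13          # two (nearly) identical training points
y = np.sin(3.0 * x)

Phi = np.vander(x, D + 1, increasing=True)
K = Phi @ Phi.T              # rank <= D+1, so K itself is already singular

# With a tiny alpha, the N x N dual system is severely ill-conditioned,
# while the small (D+1) x (D+1) primal system stays manageable.
print(np.linalg.cond(K + alpha * np.eye(N)))
print(np.linalg.cond(Phi.T @ Phi + alpha * np.eye(D + 1)))

# An SVD-based least-squares solve of the augmented primal problem avoids
# forming either system explicitly and remains stable even as alpha -> 0.
Phi_aug = np.vstack([Phi, np.sqrt(alpha) * np.eye(D + 1)])
y_aug = np.concatenate([y, np.zeros(D + 1)])
w, *_ = np.linalg.lstsq(Phi_aug, y_aug, rcond=None)
print(w)
```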