Gradients
Gradient Descent
- predicted value → $\hat{y}_i = c + m x_i$ (intercept $c$, slope $m$)
- metric → sum of squared residuals: let
- residual $r_i = y_i - \hat{y}_i$
- sos = $\sum_i r_i^2 = \sum_i \left( y_i - (c + m x_i) \right)^2$
- Thus, our optimization target becomes: $\min_c \sum_i \left( y_i - (c + m x_i) \right)^2$
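For reference, the derivative of this target w.r.t. the intercept (holding the slope $m$ fixed, as in the single-parameter setup above) is:

$$
\frac{\partial}{\partial c} \sum_i \left( y_i - (c + m x_i) \right)^2 = \sum_i 2\left( y_i - (c + m x_i) \right)(-1) = -2 \sum_i r_i
$$

Each descent step nudges the intercept a little in the direction opposite to this derivative.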
Gradient Descent works by taking iterative steps towards the optimal target. This is different from least squares: there we compute the optimal solution analytically, by differentiating the target w.r.t. $c$ and setting the derivative to 0 to find the stationary point, which (for this convex loss) is the minimum. Gradient Descent, on the other hand, starts by selecting a random value of the intercept and then repeatedly updates it: compute the gradient of the loss at the current intercept and step in the opposite (downhill) direction.
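A minimal sketch of this loop in Python, assuming the single-parameter setup above (fit only the intercept $c$, slope $m$ fixed); the function name, learning rate, and stopping rule are illustrative choices, not from the notes:

```python
import numpy as np

def gradient_descent_intercept(x, y, m, lr=0.005, n_steps=1000, tol=1e-6, seed=0):
    """Fit the intercept c of y ~ c + m*x by gradient descent on the sum of squared residuals."""
    rng = np.random.default_rng(seed)
    c = rng.standard_normal()             # start from a random intercept
    for _ in range(n_steps):
        residuals = y - (c + m * x)       # r_i = y_i - (c + m*x_i)
        grad = -2.0 * residuals.sum()     # d(sos)/dc = -2 * sum(r_i)
        step = lr * grad
        if abs(step) < tol:               # stop once the steps become tiny
            break
        c -= step                         # move downhill
    return c

# toy usage: data generated from y = 3 + 2x plus noise
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 3.0 + 2.0 * x + rng.normal(scale=0.5, size=50)
print(gradient_descent_intercept(x, y, m=2.0))   # should land near 3.0
```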
- The learning rate determines the size of the steps we take, and tuning it is important: if it is too small, convergence is slow, while if it is too large, we overshoot the solution → Classic control phenomenon!
- One solution is to start with a large learning rate and make it smaller with each step! → Schedule the Learning Rate (a sketch follows below)
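One simple way to schedule the learning rate is a decaying schedule; a sketch assuming a 1/t-style decay (the exact formula and constants are just one common choice, not specified in the notes):

```python
def decayed_lr(initial_lr, step, decay=0.01):
    """Shrink the learning rate as training progresses: lr_t = lr_0 / (1 + decay * t)."""
    return initial_lr / (1.0 + decay * step)

# start large, end small
print([round(decayed_lr(0.1, t), 4) for t in (0, 10, 100, 1000)])
# -> [0.1, 0.0909, 0.05, 0.0091]
```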
Stochastic Gradient Descent
The computations in each Gradient Descent step scale with the number of data points (every residual is recomputed at every step), so on large datasets convergence becomes slow. SGD resolves this by sampling points for the residual (gradient) calculation! → Instead of using all points, we can randomly sample n points - a Mini-batch - and use only them for the step → This is especially helpful when the data falls into distinct clusters, since the points within one cluster will have more-or-less similar residuals, so a few of them already represent the cluster well!
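A sketch of the same intercept fit with mini-batch sampling, reusing the setup from the gradient-descent sketch above (the batch size and step count are illustrative assumptions):

```python
import numpy as np

def sgd_intercept(x, y, m, lr=0.01, batch_size=8, n_steps=500, seed=0):
    """Fit the intercept c of y ~ c + m*x, estimating the gradient from a random mini-batch each step."""
    rng = np.random.default_rng(seed)
    c = rng.standard_normal()                                     # random starting intercept
    for _ in range(n_steps):
        idx = rng.choice(len(x), size=batch_size, replace=False)  # sample a mini-batch
        residuals = y[idx] - (c + m * x[idx])
        grad = -2.0 * residuals.sum()     # noisy, cheaper estimate of the full-data gradient
        c -= lr * grad
    return c

# usage with the toy data above: sgd_intercept(x, y, m=2.0) ends up close to 3.0
```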
- Again, the sensitivity to the learning rate comes into the picture, and again we can adapt scheduling to overcome this! (see the combined sketch below)
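Combining the two ideas, a sketch of mini-batch SGD with a decaying step size (this just plugs the illustrative schedule from earlier into the SGD loop; all names are from the sketches above, not from the notes):

```python
import numpy as np

def sgd_intercept_scheduled(x, y, m, initial_lr=0.05, decay=0.01,
                            batch_size=8, n_steps=500, seed=0):
    """Mini-batch SGD on the intercept, with the learning rate shrinking over time."""
    rng = np.random.default_rng(seed)
    c = rng.standard_normal()
    for t in range(n_steps):
        lr = initial_lr / (1.0 + decay * t)                       # scheduled learning rate
        idx = rng.choice(len(x), size=batch_size, replace=False)  # mini-batch sample
        residuals = y[idx] - (c + m * x[idx])
        c -= lr * (-2.0 * residuals.sum())                        # noisy step, shrinking over time
    return c
```

Early steps are large and cover ground quickly; as the learning rate decays, the noise from the mini-batch sampling stops throwing the estimate around.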