Regularization is designed to combat overfitting, not to aid gradient descent convergence.
Suppose you are minimizing a function J parameterized by a vector θ, where each element of θ is denoted θj (i.e. you want to minimize J(θ)). The basic idea of batch gradient descent is to iterate until convergence, computing a new value of θ from the previous one at each step. Update every θj simultaneously with the formula:
θj := θj − α ∂J(θ)/∂θj
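To make the update concrete, here is a minimal sketch in Python/NumPy. The cost and gradient here are purely illustrative (a linear-regression squared-error J(θ) on hypothetical data X, y); the simultaneous update of every θj is the part that matters, and the same loop applies to any differentiable J.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, n_iters=1000):
    """Minimal sketch of batch gradient descent with simultaneous updates.

    Uses an illustrative squared-error cost J(theta) = (1/2m) * sum((X@theta - y)^2);
    only the update rule theta_j := theta_j - alpha * dJ/dtheta_j is the point here.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        # Gradient of the illustrative cost with respect to theta
        grad = X.T @ (X @ theta - y) / m
        # Simultaneous update of every theta_j
        theta = theta - alpha * grad
    return theta
```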
That α term is called the learning rate. It is a value you choose: if it is very small, the algorithm converges slowly and takes a long time, but if it is too large, exactly what you are experiencing can happen. In that case θ is updated in the right direction but overshoots, jumping past the minimum, and J(θ) can even climb back out and increase.
The remedy is simply to decrease α until this no longer happens. A sufficiently small learning rate guarantees that J(θ) decreases on every iteration. The trick is to find a value of α that gives fast convergence while still avoiding non-convergence.
A useful approach is to plot J(θ) while the algorithm is running to observe how it decreases. Start with a small value (e.g. 0.01), increase it if convergence appears slow, and decrease it further if the cost fails to converge.
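As a sketch of that monitoring loop (assuming the same illustrative squared-error cost and hypothetical data as above), you can record J(θ) at each iteration and compare the curves for a few candidate values of α:

```python
import numpy as np
import matplotlib.pyplot as plt

def cost(X, y, theta):
    # Illustrative squared-error cost, same J(theta) as in the sketch above
    m = X.shape[0]
    residual = X @ theta - y
    return (residual @ residual) / (2 * m)

def run_and_track(X, y, alpha, n_iters=200):
    m, n = X.shape
    theta = np.zeros(n)
    history = []
    for _ in range(n_iters):
        history.append(cost(X, y, theta))          # record J(theta) before each update
        grad = X.T @ (X @ theta - y) / m
        theta = theta - alpha * grad
    return history

# Hypothetical data, just to exercise the plot
rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=(100, 1))]
y = 3 + 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

for alpha in (0.001, 0.01, 0.1):
    plt.plot(run_and_track(X, y, alpha), label=f"alpha={alpha}")
plt.xlabel("iteration")
plt.ylabel("J(theta)")
plt.legend()
plt.show()
```

A curve that decreases and flattens out indicates a reasonable α; a curve that increases or oscillates indicates α is too large.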
I am not really sure how batch gradient descent behaves when used for logistic regression.
As we iterate, L(W) keeps getting bigger; could it jump across the maximum point so that L(W) starts going down? How can I detect this without computing L(W), knowing only the old w vector and the updated w vector?
If I use regularized logistic regression, will the weights keep getting smaller, or follow some other pattern?