Why Gradient Descent Works Differently in Linear vs Logistic Regression
When you're learning machine learning, one of the first things you encounter is gradient descent — the powerful technique used to train models like linear regression and logistic regression. You might think that since both use gradient descent, they must work in pretty much the same way.
But here's the catch: although the name is the same, how gradient descent behaves in these two models is quite different. The reason lies deep in the type of problem each model tries to solve, and in the shape of the cost function that gradient descent is trying to minimize.
Different Goals, Different Models
At a high level, linear regression and logistic regression are both used for prediction — but they predict different things.
- Linear regression is used when the output is a real, continuous number — like predicting house prices.
- Logistic regression, on the other hand, is used when the output is a category — such as predicting whether an email is spam or not. In this case, we want a probability between 0 and 1.
The models reflect this difference in their mathematical form.
For linear regression, the hypothesis is simple:

h(x) = θ₀ + θ₁x₁ + … + θₙxₙ

It just draws a straight line.
But logistic regression adds a twist:

h(x) = σ(θ₀ + θ₁x₁ + … + θₙxₙ), where σ(z) = 1 / (1 + e^(−z))

Here, the sigmoid function σ is used to squash the output between 0 and 1, turning it into a probability.
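To make the contrast concrete, here is a minimal NumPy sketch of both hypothesis functions; the parameter and input values are made up purely for illustration.

```python
import numpy as np

def linear_hypothesis(theta, x):
    # Plain weighted sum: the output can be any real number
    return np.dot(theta, x)

def sigmoid(z):
    # Squashes any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def logistic_hypothesis(theta, x):
    # Same weighted sum, passed through the sigmoid to get a probability
    return sigmoid(np.dot(theta, x))

theta = np.array([0.5, -1.2])      # made-up parameters
x = np.array([1.0, 2.0])           # first entry plays the role of the bias input
print(linear_hypothesis(theta, x))     # about -1.9: an unbounded real value
print(logistic_hypothesis(theta, x))   # about 0.13: a probability between 0 and 1
```

Same weighted sum in both cases; the sigmoid is the only difference, and it is exactly what changes the behavior of the cost function later on.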
The Cost Function – Where Things Start to Differ
Now comes the heart of the difference: the cost function.
In linear regression, we use Mean Squared Error (MSE):

J(θ) = (1/m) · Σᵢ (h(xᵢ) − yᵢ)²

This makes sense because we want our predicted values to be close to the real numbers we're trying to approximate. And it works beautifully. The cost function is smooth and convex — meaning there's just one bowl-shaped valley for gradient descent to roll down into the minimum.
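As a quick illustration, here is a minimal sketch of gradient descent on the MSE cost of a one-feature linear regression. The synthetic data, learning rate, and iteration count are all assumptions made for this example.

```python
import numpy as np

# Tiny synthetic dataset: y is roughly 2*x + 1 plus noise (made up for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.3, size=50)

theta0, theta1 = 0.0, 0.0   # intercept and slope, both starting at zero
lr = 0.05                   # learning rate, hand-picked for this example

for step in range(1000):
    pred = theta0 + theta1 * x
    error = pred - y
    # Gradients of the MSE cost J = (1/m) * sum((pred - y)^2)
    theta0 -= lr * 2.0 * error.mean()
    theta1 -= lr * 2.0 * (error * x).mean()

print(theta0, theta1)   # lands near the true values 1 and 2: one bowl, one minimum
```

No matter where the parameters start, the updates slide down the same bowl to the same minimum. But what happens if we try to use this same cost function in logistic regression?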
Why You Can’t Use MSE in Logistic Regression
It turns out that MSE simply doesn't work well with the sigmoid function used in logistic regression. The problem is twofold: optimization instability and weak learning signals near the decision boundary.
First, when we plug the sigmoid function into the MSE formula, the resulting cost function becomes non-convex. That means it no longer has a single bowl-shaped valley. Instead, it can have multiple bumps, dips, and flat plateaus, and gradient descent can get stuck or wander trying to find the minimum. This makes training unpredictable and inefficient. But even more importantly, MSE fails to give meaningful feedback when the model is making bad classification decisions.
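The loss of convexity is easy to check numerically. The sketch below is a minimal example, assuming a single training example with x = 1 and true label y = 1, and it applies the standard midpoint test for convexity:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse_with_sigmoid(theta):
    # Squared error of a sigmoid prediction for one example with x = 1, y = 1
    return (sigmoid(theta * 1.0) - 1.0) ** 2

# A convex function must satisfy f(midpoint) <= average of the endpoint values.
a, b = -4.0, 0.0
midpoint = (a + b) / 2
print(mse_with_sigmoid(midpoint))                        # about 0.78
print((mse_with_sigmoid(a) + mse_with_sigmoid(b)) / 2)   # about 0.61
# The midpoint value is larger, so the convexity test fails:
# plugging the sigmoid into MSE destroys the single-bowl shape.
```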
Weak Gradient Behavior Near Class Boundaries
Let’s imagine a case where the true label is 1, and your logistic model predicts 0.01 — it’s confidently wrong. That’s a serious error in a classification problem. But if you compute the MSE here, the cost is just:

(1 − 0.01)² ≈ 0.98

Now suppose the model predicts 0.49 instead of 1 — still wrong, but much closer to the truth:

(1 − 0.49)² ≈ 0.26
Although the model was wildly off in the first case and almost right in the second, the cost for being confidently wrong is capped at 1, so the penalty never reflects how serious the mistake is. Even worse, the gradient — the amount by which the model adjusts itself during learning — passes through the sigmoid's derivative, which shrinks toward zero exactly when the prediction saturates near 0 or 1. The model isn’t "punished" strongly enough for being confidently wrong, and as a result, it doesn’t learn quickly or accurately.
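A quick numerical check makes the weak signal visible. This sketch reuses the two predictions from above (the function names are just for illustration) and also reports the gradient of the squared error with respect to the sigmoid's raw input:

```python
def mse_cost(pred, y):
    return (pred - y) ** 2

def mse_grad_wrt_z(pred, y):
    # Gradient of (sigmoid(z) - y)^2 with respect to z, where pred = sigmoid(z)
    return 2.0 * (pred - y) * pred * (1.0 - pred)

y = 1.0
for pred in [0.01, 0.49]:
    print(f"prediction={pred:.2f}  cost={mse_cost(pred, y):.2f}  "
          f"gradient={mse_grad_wrt_z(pred, y):+.4f}")

# prediction=0.01  cost=0.98  gradient=-0.0196
# prediction=0.49  cost=0.26  gradient=-0.2549
```

Notice that the confidently wrong prediction (0.01) actually produces the smaller gradient, so the parameters barely move in exactly the situation where the model most needs to change. This is where cross-entropy loss (also called log loss) comes in.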
Why Logistic Regression Uses Cross-Entropy
The cross-entropy cost function is tailored for binary classification problems. It’s defined as:

J(θ) = −(1/m) · Σᵢ [ yᵢ · log(h(xᵢ)) + (1 − yᵢ) · log(1 − h(xᵢ)) ]

For a single example, this reduces to −log(h(x)) when y = 1 and −log(1 − h(x)) when y = 0.
When your model makes a confident and correct prediction, this cost becomes very small — near zero. But when it makes a confident and wrong prediction (like predicting 0.01 when the label is 1), the cost shoots up sharply.
For example:
- Predicting 0.99 when true label = 1: cost = −log(0.99) ≈ 0.01
- Predicting 0.01 when true label = 1: cost = −log(0.01) ≈ 4.6
The difference is huge. The model is rewarded for confidence when it’s right — and severely penalized when it’s confident but wrong. This makes training sharper and more focused, especially near the decision threshold (around 0.5), where small changes can make a big difference.
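The same comparison is easy to reproduce in code. The sketch below uses the natural logarithm and clips predictions away from exactly 0 and 1 to avoid log(0); both are conventional choices assumed here for the example.

```python
import numpy as np

def cross_entropy(pred, y, eps=1e-12):
    # Per-example log loss; eps keeps log() away from exactly 0 or 1
    pred = np.clip(pred, eps, 1 - eps)
    return -(y * np.log(pred) + (1 - y) * np.log(1 - pred))

y = 1.0
for pred in [0.99, 0.49, 0.01]:
    print(f"prediction={pred:.2f}  cross-entropy cost={cross_entropy(pred, y):.2f}")

# prediction=0.99  cross-entropy cost=0.01
# prediction=0.49  cross-entropy cost=0.71
# prediction=0.01  cross-entropy cost=4.61
```

With the sigmoid, the gradient of this cost with respect to the raw score works out to simply (prediction − label), so a confidently wrong prediction like 0.01 still pushes the weights with nearly full force instead of stalling.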
Same Tool, Different Terrain
Even though both linear and logistic regression use gradient descent to train, they operate on completely different "terrains." Think of gradient descent like a ball rolling down a hill.
- In linear regression, the hill is smooth and bowl-shaped — easy for the ball to find its way to the bottom.
- In logistic regression, using MSE makes the hill bumpy and unpredictable. But using cross-entropy smooths it out again, guiding the ball reliably toward the best result, as the sketch below illustrates.
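To close the loop, here is a minimal sketch of the same gradient-descent recipe applied to logistic regression with the cross-entropy cost. The synthetic data, learning rate, and iteration count are all illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny synthetic binary dataset: label is 1 roughly when x > 2 (made up for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(0, 4, size=200)
y = (x + rng.normal(0, 0.5, size=200) > 2).astype(float)

theta0, theta1 = 0.0, 0.0
lr = 0.5

for step in range(5000):
    pred = sigmoid(theta0 + theta1 * x)
    error = pred - y
    # The gradient of the averaged cross-entropy cost has the familiar
    # (prediction - label) * input form, so the update loop looks just like before
    theta0 -= lr * error.mean()
    theta1 -= lr * (error * x).mean()

print(theta0, theta1)   # the decision boundary -theta0/theta1 should sit close to 2
```

The update loop is structurally the same as the linear-regression one earlier; only the hypothesis and the cost, and therefore the shape of the terrain, have changed.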