Least Squares: Where Convenience Meets Optimality
Beyond being computationally easy, Least Squares is statistically optimal and has a deep connection with Maximum Likelihood Estimation.

0. Introduction
Least Squares is used almost everywhere when it comes to numerical optimization and regression tasks in machine learning. It aims at minimizing the Mean Squared Error (MSE) of a given model.
Both the L1 (sum of absolute values) and L2 (sum of squares) norms offer an intuitive way to sum signed errors while preventing them from cancelling each other out. Yet the L2 norm results in a much smoother loss function and avoids the kinks of the absolute value.
But why is such a simple loss function so popular? We’ll see that there are pretty solid arguments in favor of the Least Squares, beyond being easy to compute.
- Computational Convenience: The square loss function is easy to differentiate and provides a closed-form solution when optimizing a Linear Regression.
- Mean and Median: We’re all familiar with these two quantities, but amusingly not many people know that they naturally stem from L2 and L1 losses.
- OLS is BLUE: Among all unbiased estimators, Ordinary Least-Squares (OLS) is the Best Linear Unbiased Estimator (BLUE), i.e. the one with lowest variance.
- LS is MLE with normal errors: Using Least-Squares to fit any model, linear or not, is equivalent to Maximum Likelihood Estimation under normally distributed errors.
In conclusion, the Least Squares approach totally makes sense from a mathematical perspective. However, bear in mind that it might become unreliable if the theoretical assumptions are no longer fulfilled, e.g. when the data distribution contains outliers.
N.B. I know there’s already a great Reddit thread, “Why Do We Use Least Squares In Linear Regression?”, on this topic. However, I’d like to focus in this article on presenting both intuitive understanding and rigorous proofs.

1. Computational Convenience
Optimization
Training a model means tweaking its parameters to optimize a given cost function. In some very fortunate cases, differentiating the cost function directly yields a closed-form solution for the optimal parameters, without having to go through an iterative optimization.
More precisely, the square function is convex, smooth, and easy to differentiate. In contrast, the absolute value function is non-differentiable at 0, making the optimization process less straightforward.
Differentiability
When training a regression model with n input-output pairs (xi, yi) and a model f parametrized by θ, the Least-Squares loss function is:

$$L(\theta) = \sum_{i=1}^{n} \big(y_i - f(x_i; \theta)\big)^2$$
As long as the model f is differentiable with respect to θ, we can easily derive the gradient of the loss function.
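Writing the gradient out makes this concrete; with the squared-error loss above, the chain rule gives:

$$\nabla_\theta L(\theta) = -2 \sum_{i=1}^{n} \big(y_i - f(x_i;\theta)\big)\, \nabla_\theta f(x_i;\theta)$$

For a linear model f(x; θ) = x'θ, the inner gradient ∇θ f no longer depends on θ, which is what enables the closed-form solution derived below.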
Linear Regression
Linear Regression estimates the optimal linear coefficients β given a dataset of n input-output pairs (xi, yi).
The equation below shows on the left the L1 loss and on the right the L2 loss used to evaluate the fitness of β on the dataset:

$$\sum_{i=1}^{n} \big| y_i - x_i'\beta \big| \qquad\qquad \sum_{i=1}^{n} \big( y_i - x_i'\beta \big)^2$$
We usually drop the index i and switch to a vectorized notation to better leverage linear algebra. This can be done by stacking the input vectors as rows to form the design matrix X. Similarly, the outputs are stacked into a vector Y, so that the L1 and L2 losses take the compact forms $\lVert Y - X\beta \rVert_1$ and $\lVert Y - X\beta \rVert_2^2$, respectively.

Ordinary Least-Squares
The L1 formulation offers very little room for improvement. The L2 formulation, on the other hand, is differentiable and its gradient vanishes for a single optimal set of parameters β. This approach is known as Ordinary Least-Squares (OLS).
Zeroing the gradient yields the closed-form solution of the OLS estimator, using the pseudo-inverse matrix:

$$\hat{\beta} = (X'X)^{-1} X' Y$$

This means we can directly compute the optimal coefficients without the need for a numerical optimization process.
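As a quick sketch (synthetic data and NumPy, which the article does not prescribe), the closed-form estimator matches what a generic least-squares solver returns:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic dataset: n samples, d features, plus an intercept column.
n, d = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])
true_beta = np.array([1.0, 2.0, -0.5, 0.3])
y = X @ true_beta + rng.normal(scale=0.1, size=n)

# Closed-form OLS via the pseudo-inverse: beta_hat = (X'X)^{-1} X'y.
beta_closed_form = np.linalg.pinv(X) @ y

# Same result from NumPy's iterative-free least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_closed_form, beta_lstsq))  # True
```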
Remarks
Modern computers are really efficient, and the performance gap between analytical and numerical solutions is usually not that significant. Thus, computational convenience is not the main reason why we actually use Least-Squares.

2. Mean and Median
Introduction
You’ve certainly already computed a mean or median, whether with Excel, NumPy, or by hand. They are key concepts in Statistics, and often provide valuable insights for income, grades, test scores or age distributions.
We’re so familiar with these two quantities that we rarely question their origin. Yet, amusingly, they stem naturally from L2 and L1 losses.
Given a set of real values xi, we often try to aggregate them into a single good representative value, e.g. the mean or median. That way, we can more easily compare different sets of values. However, what represents the data “well” is purely subjective and depends on our expectations, i.e. the cost function. For instance, mean and median income are both relevant, but they convey different insights. The mean reflects overall wealth, while the median provides a clearer picture of typical earnings, unaffected by extremely low or high incomes.
Given a cost function ρ, mirroring our expectations, we solve the following optimization problem to find the “best” representative value µ:

$$\mu^{\ast} = \arg\min_{\mu} \sum_{i=1}^{n} \rho(\mu - x_i)$$
Mean
Let’s consider ρ to be the L2 loss. Zeroing the gradient is straightforward and brings out the mean definition:

$$\frac{\mathrm{d}}{\mathrm{d}\mu} \sum_{i=1}^{n} (\mu - x_i)^2 = 2 \sum_{i=1}^{n} (\mu - x_i) = 0 \;\;\Longrightarrow\;\; \mu^{\ast} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Thus, we’ve shown that the mean best represents the xi in terms of the L2 loss.
Median
Let’s consider the L1 loss. Being a sum of piecewise linear functions, it is itself piecewise linear, with discontinuities in its gradient at each xi.
The figure below illustrates the L1 loss for each xi. Without loss of generality, I’ve sorted the xi to order the non-differentiable kinks. Each function |µ-xi| is xi-µ below xi and µ-xi above.
The table below clarifies the piecewise expressions of each individual L1 term |µ-xi|. We can sum these expressions to get the total L1 loss. With the xi sorted, the leftmost part has a slope of -n and the rightmost a slope of +n; more generally, between xk and xk+1 the slope is k - (n - k) = 2k - n, increasing by 2 at each kink. For better readability, I’ve hidden the constant intercepts as Ci.

Intuitively, the minimum of this piecewise linear function occurs where the slope transitions from negative to positive, which is precisely where the median lies since the points are sorted.
Thus, we’ve shown that the median best represents the xi in terms of the L1 loss.
N.B. For an odd number of points, the median is the middle value and the unique minimizer of the L1 loss. For an even number of points, the median is the average of the two middle values, and the L1 loss forms a plateau, with any value between these two minimizing the loss.
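A small numerical check of both results (arbitrary synthetic values, and SciPy’s scalar minimizer standing in for the analytical argument): minimizing the L2 cost recovers the mean, while minimizing the L1 cost recovers the median.

```python
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([1.0, 2.0, 2.5, 7.0, 40.0])  # note the large value at 40

# Minimize the L2 cost sum((mu - x_i)^2) and the L1 cost sum(|mu - x_i|).
mu_l2 = minimize_scalar(lambda mu: np.sum((mu - x) ** 2)).x
mu_l1 = minimize_scalar(lambda mu: np.sum(np.abs(mu - x))).x

print(mu_l2, np.mean(x))    # both ~10.5, pulled up by the extreme value
print(mu_l1, np.median(x))  # both ~2.5, unaffected by it
```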

3. OLS is BLUE
Gauss-Markov theorem
The Gauss-Markov theorem states that the Ordinary Least Squares (OLS) estimator is the Best Linear Unbiased Estimator (BLUE). “Best” means that OLS has the lowest variance among all linear unbiased estimators.
This sampling variance represents how much the estimated coefficients β would vary across different samples drawn from the same population.
The theorem assumes Y follows a linear model, Y = Xβ + ε, with true linear coefficients β and random errors ε. That way, we can analyze how the β estimate of an estimator would vary for different values of noise ε.
The assumptions on the random errors ε ensure they are unbiased (zero mean), homoscedastic (constant finite variance), and uncorrelated (diagonal covariance matrix).
Linearity
Be aware that “linearity” in the Gauss-Markov theorem refers to two different concepts:
- Model Linearity: The regression assumes a linear relationship between Y and X.
- Estimator Linearity: We only consider estimators that are linear in Y, meaning they must include a linear component represented by a matrix C that depends only on X.

Unbiasedness of OLS
The OLS estimator, denoted with a hat, has already been derived earlier. Substituting the random error model for Y gives an expression that better captures the deviation from the true β.

We introduce the matrix A to represent the OLS-specific linear component C for better readability.

As expected, the OLS estimator is unbiased, as its expectation is centered around the true β for unbiased errors ε.
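In symbols, with A = (X'X)⁻¹X' and the model Y = Xβ + ε assumed above:

$$\hat{\beta} = A Y = A (X\beta + \varepsilon) = \beta + A\varepsilon, \qquad \text{since } AX = I$$

$$\mathbb{E}[\hat{\beta}] = \beta + A\,\mathbb{E}[\varepsilon] = \beta$$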
Theorem’s proof
Let’s consider a linear estimator, denoted by a tilde, with its linear component A+D, where D represents a shift from the OLS estimator.

The expected value of this linear estimator turns out to be the true β plus an additional term DXβ. For the estimator to be considered unbiased, this term must be zero, thus DX=0. This orthogonality ensures that the shift D does not introduce any bias.

Note that this also implies that DA'=0, which will be useful later.
Now that we’ve guaranteed the unbiasedness of our linear estimator, we can compare its variance against the OLS estimator.
Since the matrix C is constant and the errors ε are spherical with common variance σ², we obtain the following variance:

$$\mathrm{Var}(\tilde{\beta}) = C\,\mathrm{Var}(\varepsilon)\,C' = \sigma^2 C C'$$

After substituting C with A+D, expanding the terms, and using the orthogonality DA'=0, the variance of our linear estimator splits into a sum of two terms:

$$\mathrm{Var}(\tilde{\beta}) = \sigma^2 (A+D)(A+D)' = \sigma^2 A A' + \sigma^2 D D' = \mathrm{Var}(\hat{\beta}) + \sigma^2 D D'$$

The first term is the variance of the OLS estimator, and the second term can only add variance, since DD' is positive semi-definite.
As a result, we have shown that the OLS estimator achieves the lowest variance among all linear estimators for Linear Regression with unbiased spherical errors.
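As an illustrative sketch (synthetic design matrix and an arbitrarily re-weighted estimator, none of which comes from the article), a small Monte Carlo simulation makes the variance gap visible: both estimators are unbiased, but the non-OLS one fluctuates more.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one feature
beta = np.array([1.0, 2.0])
W = np.diag(rng.uniform(0.1, 10.0, size=n))  # arbitrary "wrong" weights

ols, other = [], []
for _ in range(5000):
    eps = rng.normal(size=n)                  # unbiased, homoscedastic, uncorrelated
    y = X @ beta + eps
    ols.append(np.linalg.solve(X.T @ X, X.T @ y))
    # Another linear unbiased estimator: C = (X'WX)^-1 X'W satisfies CX = I.
    other.append(np.linalg.solve(X.T @ W @ X, X.T @ W @ y))

print(np.mean(ols, axis=0), np.mean(other, axis=0))  # both close to (1.0, 2.0)
print(np.var(ols, axis=0), np.var(other, axis=0))    # OLS has the smaller spread
```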
Remarks
The OLS estimator is considered “best” in terms of minimum variance. However, it’s worth noting that the definition of the variance itself is closely tied to Least Squares, as it reflects the expectation of the squared difference from the expected value.
Thus, the key question would be why variance is typically defined this way.

4. LS is MLE with normal errors
Maximum Likelihood Estimation
Maximum Likelihood Estimation (MLE) is a method for estimating model parameters θ by maximizing the likelihood of observing the given data (x,y) under the model defined by θ.

Assuming the pairs (xi, yi) are independent and identically distributed (i.i.d.), we can express the likelihood as the product of the conditional probabilities.
A common trick consists in applying a logarithm to the product, transforming it into a more convenient and numerically stable sum of logs. Since the logarithm is monotonically increasing, it’s still equivalent to solving the same optimization problem. That’s how we get the well-known log-likelihood.
In numerical optimization, we usually add a minus sign to minimize quantities instead of maximizing them.
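Spelled out, the chain from likelihood to negative log-likelihood reads:

$$\mathcal{L}(\theta) = \prod_{i=1}^{n} p(y_i \mid x_i; \theta) \quad\Longrightarrow\quad \log \mathcal{L}(\theta) = \sum_{i=1}^{n} \log p(y_i \mid x_i; \theta)$$

$$\hat{\theta} = \arg\max_{\theta} \log \mathcal{L}(\theta) = \arg\min_{\theta} \; -\sum_{i=1}^{n} \log p(y_i \mid x_i; \theta)$$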

MLE Inference
Once the optimal model parameters θ have been estimated, inference is performed by finding the value of y that maximizes the conditional probability given the observed x, i.e. the most likely y.
Model Parameters
Note that there’s no specific assumption on the model. It can be of any kind and its parameters are simply stacked into a flat vector θ.
For instance, θ can represent the weights of a neural network, the parameters of a random forest, the coefficients of a linear regression model, and so on.
Normal Errors
As for the errors around the true model, let’s assume that they are unbiased and normally distributed.
It’s equivalent to assuming that y follows a normal distribution with mean predicted by the model and fixed variance σ².
Note that the inference step is straightforward, because the peak of the normal distribution is reached at the mean, i.e. the value predicted by the model.
Interestingly, the logarithm in the log-likelihood cancels out the exponential of the normal density. The problem then turns out to be equivalent to a plain Least-Squares minimization!
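Concretely, plugging the normal density into the negative log-likelihood gives:

$$-\log p(y_i \mid x_i; \theta) = \frac{\big(y_i - f(x_i;\theta)\big)^2}{2\sigma^2} + \frac{1}{2}\log\big(2\pi\sigma^2\big)$$

The second term is a constant and 1/(2σ²) is a positive factor, so minimizing the summed negative log-likelihood is exactly minimizing the sum of squared residuals.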
As a result, using Least-Squares to fit any model, linear or not, is equivalent to Maximum Likelihood Estimation under normally distributed errors.
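A quick numerical sanity check of this equivalence, using a deliberately non-linear toy model (an exponential decay, chosen arbitrarily here) and SciPy’s generic optimizer: minimizing the sum of squares and minimizing the Gaussian negative log-likelihood land on the same parameters.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 100)
y = 2.0 * np.exp(-0.7 * x) + rng.normal(scale=0.05, size=x.size)

def model(theta, x):
    a, b = theta
    return a * np.exp(-b * x)

def sse(theta):  # Least-Squares objective
    return np.sum((y - model(theta, x)) ** 2)

def gaussian_nll(theta, sigma=0.05):  # negative log-likelihood, sigma treated as known
    r = y - model(theta, x)
    return np.sum(r ** 2 / (2 * sigma ** 2) + 0.5 * np.log(2 * np.pi * sigma ** 2))

theta_ls = minimize(sse, x0=[1.0, 1.0]).x
theta_mle = minimize(gaussian_nll, x0=[1.0, 1.0]).x
print(theta_ls, theta_mle)  # both close to (2.0, 0.7)
```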

Conclusion
Fundamental Tool
In conclusion, the popularity of Least-Squares comes from its computational simplicity and its deep links to key statistical principles. It provides a closed-form solution for Linear Regression (whose OLS estimator is the Best Linear Unbiased Estimator), defines the mean, and is equivalent to Maximum Likelihood Estimation under normal errors.
BLUE or BUE?
There’s even debate over whether or not the linearity assumption of the Gauss-Markov Theorem can be relaxed, allowing OLS to also be considered the Best Unbiased Estimator (BUE).
We’re still solving Linear Regression, but this time the estimator is no longer required to be linear in Y, hence BUE instead of BLUE.
The economist Bruce Hansen thought he had proved it in 2022 [1], but Pötscher and Preinerstorfer quickly invalidated his proof [2].
Outliers
Least-Squares is very likely to become unreliable when errors are not normally distributed, e.g. with outliers.
As we’ve seen previously, the mean defined by L2 is highly affected by extreme values, whereas the median defined by L1 simply ignores them.
Robust loss functions like Huber or Tukey tend to still mimic the quadratic behavior of Least-Squares for small errors, while greatly attenuating the impact of large errors with a near L1 or constant behavior. They are much better choices than L2 to cope with outliers and provide robust estimates.
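A sketch of that robustness with scikit-learn (synthetic data, a handful of injected outliers, and default Huber settings, all my choices rather than the article’s):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 1.0 + rng.normal(scale=0.5, size=100)
y[:5] += 50.0  # corrupt a few points with large outliers

ols = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)

# The OLS slope/intercept are pulled toward the outliers,
# while the Huber fit stays close to the true (3.0, 1.0).
print(ols.coef_, ols.intercept_)
print(huber.coef_, huber.intercept_)
```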
Regularization
In some cases, using a biased estimator like Ridge regression, which adds regularization, can improve generalization to unseen data. While introducing bias, it helps prevent overfitting, making the model more robust, especially in noisy or high-dimensional settings.
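A minimal sketch of that bias-variance trade-off, again with scikit-learn and an arbitrary regularization strength:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 60, 50  # few samples, many features: OLS overfits easily
X = rng.normal(size=(n, d))
beta = rng.normal(size=d)
y = X @ beta + rng.normal(scale=5.0, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ols = LinearRegression().fit(X_tr, y_tr)
ridge = Ridge(alpha=10.0).fit(X_tr, y_tr)

# Ridge accepts some bias but usually generalizes better in this setting.
print("OLS   test R^2:", ols.score(X_te, y_te))
print("Ridge test R^2:", ridge.score(X_te, y_te))
```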
[1] Bruce E. Hansen, 2022. “A Modern Gauss–Markov Theorem,” Econometrica, Econometric Society, vol. 90(3), pages 1283–1294, May.
[2] Pötscher, Benedikt M. & Preinerstorfer, David, 2022. “A Modern Gauss-Markov Theorem? Really?,” MPRA Paper 112185, University Library of Munich, Germany.