The least squares regression problem involves solving for unknown parameters $\mathbf{x}$ in a model written as a linear system: $$ A \mathbf{x} = \mathbf{b} $$ where $$ A \in \mathbb{R}^{m \times n}, \quad \mathbf{b} \in \mathbb{R}^{m}. $$
We wish to find $\mathbf{x}$ that minimizes the residual error $\|A \mathbf{x} - \mathbf{b}\|_2^2$. This is a convex optimization problem, since the cost function is quadratic in $\mathbf{x}$.
$$ \min_{\mathbf{x}} \|A\mathbf{x} - \mathbf{b}\|_2^2 $$

If $A$ is square and invertible, then the solution is simply: $$ \mathbf{x} = A^{-1} \mathbf{b}. $$ This case corresponds to having the same number of equations and unknowns ($m = n$).
If $A$ is not square (and hence not invertible), we may have:
- an overdetermined system ($m > n$): more equations than unknowns, or
- an underdetermined system ($m < n$): fewer equations than unknowns.

In neither case does a unique exact solution exist in general: the overdetermined system typically has no exact solution, while the underdetermined system has infinitely many. The least squares approach finds an approximate solution by minimizing $\|A\mathbf{x} - \mathbf{b}\|_2$.
We minimize: $$ \|A\mathbf{x} - \mathbf{b}\|_2^2 = (A\mathbf{x} - \mathbf{b})^T (A\mathbf{x} - \mathbf{b}). $$ Expanding gives: $$ = \mathbf{x}^T A^T A \mathbf{x} - 2 \mathbf{b}^T A \mathbf{x} + \mathbf{b}^T \mathbf{b}. $$ Setting the gradient to zero: $$ \nabla_{\mathbf{x}} \|A\mathbf{x} - \mathbf{b}\|_2^2 = 2A^T(A\mathbf{x} - \mathbf{b}) = 0. $$ This yields the normal equation: $$ A^T A \mathbf{x} = A^T \mathbf{b}. $$ If $A^T A$ is invertible, then $$ \mathbf{x} = (A^T A)^{-1} A^T \mathbf{b}. $$
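As a concrete check of the derivation, here is a minimal NumPy sketch (the matrix $A$, the true $\mathbf{x}$, and the noise level are made-up example data) that solves the normal equations and compares the result with `numpy.linalg.lstsq`:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))                   # overdetermined system: m=100, n=3
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true + 0.01 * rng.normal(size=100)    # noisy observations

# Solve the normal equations A^T A x = A^T b.
# (np.linalg.solve is preferred over forming the explicit inverse.)
x_ne = np.linalg.solve(A.T @ A, A.T @ b)

# Reference solution from NumPy's least squares routine.
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_ne, x_ls))                  # True (up to numerical precision)
```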
When $A$ has full column rank, the matrix $(A^T A)^{-1} A^T$ is known as the (Moore–Penrose) pseudoinverse of $A$, denoted $A^+$.
For general (possibly non-square) matrices, we can define: $$ A^+ = (A^T A)^{-1} A^T \quad \text{if } A^T A \text{ is invertible.} $$
However, for large or rank-deficient matrices, this form may be numerically unstable. We instead compute the pseudoinverse using the Singular Value Decomposition (SVD): $$ A = U \Sigma V^T, $$ where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are orthogonal matrices and $\Sigma \in \mathbb{R}^{m \times n}$ is a rectangular diagonal matrix containing the singular values $\sigma_i \ge 0$.
Then, $$ A^+ = V \Sigma^+ U^T, $$ where $\Sigma^+$ is obtained by taking the reciprocal of each nonzero singular value in $\Sigma$ and transposing the resulting matrix.
The least squares solution becomes: $$ \mathbf{x} = A^+ \mathbf{b} = V \Sigma^+ U^T \mathbf{b}. $$
This approach is numerically stable and works even when $A$ is not full rank.
The rank of $A$ equals the number of nonzero singular values. If $A$ has rank $r$, we can approximate it using a rank-$r$ truncation of the SVD: $$ A \approx U_r \Sigma_r V_r^T. $$
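The sketch below (using an arbitrary random matrix as example data) builds $A^+$ from the SVD, checks it against `numpy.linalg.pinv`, and forms a rank-$r$ truncation:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 4))
b = rng.normal(size=6)

# Thin SVD: U is 6x4, s holds the singular values, Vt is 4x4.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Pseudoinverse: invert only the singular values above a small tolerance.
tol = max(A.shape) * np.finfo(float).eps * s.max()
s_inv = np.where(s > tol, 1.0 / s, 0.0)
A_pinv = Vt.T @ np.diag(s_inv) @ U.T

print(np.allclose(A_pinv, np.linalg.pinv(A)))   # True

# Least squares solution via the pseudoinverse.
x = A_pinv @ b

# Rank-r truncation A ~ U_r Sigma_r V_r^T.
r = 2
A_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]
```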
In the underdetermined case ($m < n$), there are infinitely many exact solutions. We usually choose the one with minimum norm: $$ \min_{\mathbf{x}} \|\mathbf{x}\|_2 \quad \text{subject to } A\mathbf{x} = \mathbf{b}. $$ Assuming $A$ has full row rank, this leads to: $$ \mathbf{x} = A^T (A A^T)^{-1} \mathbf{b}. $$
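A small NumPy sketch of the minimum-norm solution for a made-up wide matrix, assuming full row rank:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 8))        # underdetermined: m=3 < n=8
b = rng.normal(size=3)

# Minimum-norm solution x = A^T (A A^T)^{-1} b (valid when A has full row rank).
x_min = A.T @ np.linalg.solve(A @ A.T, b)

# The pseudoinverse gives the same minimum-norm solution.
print(np.allclose(x_min, np.linalg.pinv(A) @ b))   # True
print(np.allclose(A @ x_min, b))                   # exactly solves A x = b
```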
The condition number of a matrix $A$ is defined as: $$ \kappa(A) = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}. $$ A high condition number means $A$ is nearly singular, making the system very sensitive to small perturbations in $\mathbf{b}$.
If $\mathbf{b}$ has some noise $\delta \mathbf{b}$, then: $$ A(\mathbf{x} + \delta \mathbf{x}) = \mathbf{b} + \delta \mathbf{b}. $$ This results in $$ \frac{\|\delta \mathbf{x}\|}{\|\mathbf{x}\|} \le \kappa(A) \frac{\|\delta \mathbf{b}\|}{\|\mathbf{b}\|}. $$ Hence, reducing $\kappa(A)$ (e.g., via regularization) improves numerical stability.
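The following NumPy experiment illustrates this sensitivity; the matrix is deliberately constructed with a condition number of about $10^8$, and all numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

# Build a 2x2 matrix with a huge condition number by choosing its singular values.
U, _ = np.linalg.qr(rng.normal(size=(2, 2)))
V, _ = np.linalg.qr(rng.normal(size=(2, 2)))
A = U @ np.diag([1.0, 1e-8]) @ V.T
print(np.linalg.cond(A))                # ~1e8

x_true = np.array([1.0, 1.0])
b = A @ x_true

# A tiny perturbation of b ...
db = 1e-10 * rng.normal(size=2)
x_pert = np.linalg.solve(A, b + db)

# ... can produce a much larger relative change in x.
print(np.linalg.norm(x_pert - x_true) / np.linalg.norm(x_true))
print(np.linalg.cond(A) * np.linalg.norm(db) / np.linalg.norm(b))   # the upper bound
```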
When $A^T A$ is ill-conditioned, we can stabilize the solution using Tikhonov regularization (also called ridge regression): $$ \min_{\mathbf{x}} \|A\mathbf{x} - \mathbf{b}\|_2^2 + \alpha \|\mathbf{x}\|_2^2, $$ where $\alpha > 0$ controls the trade-off between bias and variance.
The closed-form solution is: $$ \mathbf{x} = (A^T A + \alpha I)^{-1} A^T \mathbf{b}. $$
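A minimal NumPy sketch of the closed-form ridge solution; the data and the value of `alpha` are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(50, 10))
b = rng.normal(size=50)
alpha = 0.1                        # regularization strength (problem dependent)

n = A.shape[1]
# Tikhonov / ridge solution: (A^T A + alpha I) x = A^T b.
x_ridge = np.linalg.solve(A.T @ A + alpha * np.eye(n), A.T @ b)

# Sanity check: as alpha -> 0 the ordinary least squares solution is recovered.
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
x_small = np.linalg.solve(A.T @ A + 1e-12 * np.eye(n), A.T @ b)
print(np.allclose(x_small, x_ls))
```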
Remarks:
- For any $\alpha > 0$, the matrix $A^T A + \alpha I$ is positive definite and therefore always invertible, even when $A^T A$ itself is singular.
- Larger values of $\alpha$ shrink $\|\mathbf{x}\|_2$ toward zero; as $\alpha \to 0$, the ordinary least squares solution is recovered.
- Regularization also lowers the effective condition number, since every eigenvalue of $A^T A$ is shifted by $\alpha$.
Another common regularization is the LASSO (Least Absolute Shrinkage and Selection Operator): $$ \min_{\mathbf{x}} \|A\mathbf{x} - \mathbf{b}\|_2^2 + \lambda \|\mathbf{x}\|_1. $$ Here, the $\ell_1$ norm encourages sparsity in $\mathbf{x}$, leading to simpler models.
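Unlike ridge regression, the LASSO has no closed-form solution because the $\ell_1$ term is not differentiable at zero, so it is solved iteratively. Below is a minimal proximal-gradient (ISTA) sketch in NumPy; the step size, penalty `lam`, and iteration count are illustrative choices, not tuned values:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(A, b, lam, n_iter=500):
    """Minimize ||A x - b||_2^2 + lam * ||x||_1 by proximal gradient descent (ISTA)."""
    step = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2)   # 1/L for the smooth part ||Ax-b||^2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * A.T @ (A @ x - b)               # gradient of the smooth part
        x = soft_threshold(x - step * grad, step * lam)
    return x

rng = np.random.default_rng(5)
A = rng.normal(size=(40, 20))
x_true = np.zeros(20)
x_true[[2, 7, 11]] = [3.0, -2.0, 1.5]                # sparse ground truth
b = A @ x_true + 0.01 * rng.normal(size=40)

x_hat = lasso_ista(A, b, lam=1.0)
print(np.flatnonzero(x_hat))   # support is (close to) {2, 7, 11}; most entries are exactly 0
```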
We can also impose linear constraints: $$ \min_{\mathbf{x}} \|A\mathbf{x} - \mathbf{b}\|_2^2 \quad \text{subject to } C\mathbf{x} = \mathbf{d}. $$ Rewriting in quadratic form: $$ \min_{\mathbf{x}} \frac{1}{2} \mathbf{x}^T Q \mathbf{x} + \mathbf{c}^T \mathbf{x}, \quad \text{subject to } C\mathbf{x} = \mathbf{d}, $$ where $Q = 2A^T A$ and $\mathbf{c} = -2A^T \mathbf{b}$.
Using Lagrange multipliers: $$ \mathcal{L}(\mathbf{x}, \bm{\lambda}) = \frac{1}{2} \mathbf{x}^T Q \mathbf{x} + \mathbf{c}^T \mathbf{x} + \bm{\lambda}^T (C\mathbf{x} - \mathbf{d}). $$ Setting derivatives to zero gives: $$ \begin{bmatrix} Q & C^T \\ C & 0 \end{bmatrix} \begin{bmatrix} \mathbf{x} \\ \bm{\lambda} \end{bmatrix} = \begin{bmatrix} -\mathbf{c} \\ \mathbf{d} \end{bmatrix}. $$ This linear system can be solved directly, often via SVD.
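A NumPy sketch that assembles and solves this KKT system for a made-up equality-constrained problem:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(size=(30, 5))
b = rng.normal(size=30)
C = rng.normal(size=(2, 5))      # two linear equality constraints
d = rng.normal(size=2)

Q = 2.0 * A.T @ A
c = -2.0 * A.T @ b

# Assemble and solve the KKT system
# [ Q  C^T ] [ x      ]   [ -c ]
# [ C   0  ] [ lambda ] = [  d ]
n, p = Q.shape[0], C.shape[0]
KKT = np.block([[Q, C.T],
                [C, np.zeros((p, p))]])
rhs = np.concatenate([-c, d])
sol = np.linalg.solve(KKT, rhs)
x, lam = sol[:n], sol[n:]

print(np.allclose(C @ x, d))     # constraints are satisfied
```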
In general, least squares and its constrained forms can be expressed as a quadratic program: $$ \min_{\mathbf{x}} \frac{1}{2} \mathbf{x}^T Q \mathbf{x} + \mathbf{c}^T \mathbf{x} \quad \text{subject to } A\mathbf{x} \le \mathbf{b}, $$ where $Q$ is positive semi-definite. Because the problem is convex, any local minimum is also a global minimum.
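As a rough illustration, small quadratic programs of this form can be handed to a general-purpose solver; the sketch below (with made-up data) uses `scipy.optimize.minimize` with the SLSQP method, though a dedicated QP solver would normally be preferred for larger problems:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
M = rng.normal(size=(8, 4))
Q = M.T @ M                      # positive semi-definite by construction
c = rng.normal(size=4)
A_in = rng.normal(size=(3, 4))   # inequality constraints A_in x <= b_in
b_in = rng.normal(size=3)

def objective(x):
    return 0.5 * x @ Q @ x + c @ x

def objective_grad(x):
    return Q @ x + c

# SLSQP expects inequality constraints in the form g(x) >= 0.
constraints = [{"type": "ineq", "fun": lambda x: b_in - A_in @ x}]

res = minimize(objective, x0=np.zeros(4), jac=objective_grad,
               constraints=constraints, method="SLSQP")
print(res.x, np.all(A_in @ res.x <= b_in + 1e-8))
```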
We begin with a nonlinear least squares problem of the form: $$ L(\bm{\theta}) = \frac{1}{2} \sum_{j=1}^{m} \| y_j - f(x_j; \bm{\theta}) \|^2 = \frac{1}{2} \sum_{j=1}^{m} f_j(\bm{\theta})^2 = \frac{1}{2} \| \bm{f}(\bm{\theta}) \|^2, $$ where \( f_j(\bm{\theta}) = y_j - f(x_j; \bm{\theta}) \) are the residuals.
The gradient of \(L\) is given by: $$ \nabla_{\bm{\theta}} L = J^\top \bm{f}, $$ where \( J \) is the Jacobian matrix of residuals: $$ J = \begin{bmatrix} \frac{\partial f_1}{\partial \theta_1} & \frac{\partial f_1}{\partial \theta_2} & \dots & \frac{\partial f_1}{\partial \theta_n} \\ \frac{\partial f_2}{\partial \theta_1} & \frac{\partial f_2}{\partial \theta_2} & \dots & \frac{\partial f_2}{\partial \theta_n} \\ \vdots & & \ddots & \vdots \\ \frac{\partial f_m}{\partial \theta_1} & \frac{\partial f_m}{\partial \theta_2} & \dots & \frac{\partial f_m}{\partial \theta_n} \end{bmatrix}. $$
The exact Hessian is: $$ \nabla^2_{\bm{\theta}} L = J^\top J + \sum_{j=1}^m f_j H_j $$ where $H_j = \nabla^2_{\bm{\theta}} f_j$ is the Hessian of the $j$-th residual function.
Gauss-Newton Approximation: Near a good fit, the residuals $f_j(\bm{\theta})$ are small, so the term $\sum_j f_j H_j$ is often negligible. Dropping it yields the Gauss-Newton Hessian approximation: $$ H \approx J^\top J $$
Newton’s method gives the update: $$ \bm{\theta}_{k+1} = \bm{\theta}_k - H^{-1} \nabla_{\bm{\theta}} L. $$ Using the Gauss–Newton approximation for $H$ and the gradient $\nabla L = J^\top \bm{f}$, we get: $$ (J_k^\top J_k) \, \Delta \bm{\theta}_k = - J_k^\top \bm{f}_k, $$ where the update is $\bm{\theta}_{k+1} = \bm{\theta}_k + \Delta \bm{\theta}_k$. This means the Gauss–Newton method solves a linear least squares problem at each iteration: $$ \min_{\Delta \bm{\theta}} \frac{1}{2} \| J_k \Delta \bm{\theta} + \bm{f}_k \|^2. $$
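A compact NumPy sketch of the Gauss–Newton iteration, fitting a made-up exponential model $f(x; \bm{\theta}) = \theta_1 e^{\theta_2 x}$ with an analytic Jacobian; the data, starting point, and iteration count are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(8)
x_data = np.linspace(0.0, 2.0, 30)
theta_true = np.array([2.0, -1.3])
y_data = theta_true[0] * np.exp(theta_true[1] * x_data) + 0.01 * rng.normal(size=30)

def residuals(theta):
    # f_j(theta) = y_j - f(x_j; theta)
    return y_data - theta[0] * np.exp(theta[1] * x_data)

def jacobian(theta):
    # Partial derivatives of the residuals with respect to theta_1 and theta_2.
    e = np.exp(theta[1] * x_data)
    return np.column_stack([-e, -theta[0] * x_data * e])

theta = np.array([1.0, -0.5])      # initial guess
for _ in range(10):
    J = jacobian(theta)
    f = residuals(theta)
    # Gauss-Newton step: solve min ||J dtheta + f||^2 as a linear least squares problem.
    dtheta, *_ = np.linalg.lstsq(J, -f, rcond=None)
    theta = theta + dtheta

print(theta)                       # close to theta_true
```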
The method reduces a nonlinear least squares problem to a sequence of linearized least squares problems by using the local Jacobian. This approach works well when the model is approximately linear near the solution.
The Gauss–Newton method can be unstable if:
- $J^\top J$ is singular or ill-conditioned (for example, when the Jacobian is rank-deficient),
- the residuals are large, so the neglected term $\sum_j f_j H_j$ is not actually negligible, or
- the initial guess is far from the solution, so the local linearization is a poor approximation.
The LM algorithm improves robustness by combining the Gauss–Newton method with gradient descent. It does this by regularizing the Hessian approximation: $$ H \approx J^\top J + \alpha I, $$ where $\alpha > 0$ is a damping parameter. The update equation becomes: $$ (J^\top J + \alpha I)\, \Delta \bm{\theta} = -J^\top \bm{f}. $$
The parameter $\alpha$ is adjusted dynamically based on how well the linear model predicts the actual reduction in the cost function. This allows the algorithm to smoothly transition between fast Gauss-Newton steps and stable gradient-descent steps.
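A minimal Levenberg–Marquardt loop in NumPy, written as a sketch of the idea above; the rule of multiplying or dividing $\alpha$ by 10 is one common heuristic rather than the only choice, and in practice a library routine such as `scipy.optimize.least_squares` would typically be used:

```python
import numpy as np

def levenberg_marquardt(residuals, jacobian, theta0, n_iter=50, alpha0=1e-2):
    """Minimal LM loop: damped Gauss-Newton steps with a simple alpha update rule."""
    theta, alpha = np.array(theta0, dtype=float), alpha0
    cost = 0.5 * np.sum(residuals(theta) ** 2)
    for _ in range(n_iter):
        J, f = jacobian(theta), residuals(theta)
        # Damped normal equations: (J^T J + alpha I) dtheta = -J^T f
        dtheta = np.linalg.solve(J.T @ J + alpha * np.eye(len(theta)), -J.T @ f)
        new_cost = 0.5 * np.sum(residuals(theta + dtheta) ** 2)
        if new_cost < cost:          # step accepted: trust the linear model more
            theta, cost, alpha = theta + dtheta, new_cost, alpha / 10.0
        else:                        # step rejected: fall back toward gradient descent
            alpha *= 10.0
    return theta

# Example: the same exponential model as in the Gauss-Newton sketch above.
rng = np.random.default_rng(9)
x_data = np.linspace(0.0, 2.0, 30)
y_data = 2.0 * np.exp(-1.3 * x_data) + 0.01 * rng.normal(size=30)
res = lambda th: y_data - th[0] * np.exp(th[1] * x_data)
jac = lambda th: np.column_stack([-np.exp(th[1] * x_data),
                                  -th[0] * x_data * np.exp(th[1] * x_data)])
print(levenberg_marquardt(res, jac, [1.0, -0.5]))   # approaches [2.0, -1.3]
```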
The Conjugate Gradient method is an iterative algorithm for solving large symmetric positive definite (SPD) linear systems of the form $A \bm{x} = \bm{b}$.
In our context, this often corresponds to solving the Gauss-Newton step: $$ (J^\top J) \Delta \bm{\theta} = -J^\top \bm{f}. $$ CG is efficient for large or sparse systems because it avoids explicit matrix factorization and inversion, relying only on matrix-vector products.
Instead of descending along gradient directions, CG minimizes the quadratic function $\phi(\bm{x}) = \frac{1}{2} \bm{x}^\top A \bm{x} - \bm{b}^\top \bm{x}$ by moving along a sequence of A-orthogonal (or "conjugate") directions $\{ \bm{p}_k \}$, where $\bm{p}_i^\top A \bm{p}_j = 0$ for $i \ne j$. This guarantees convergence in at most $n$ steps for an $n \times n$ system (in exact arithmetic).
In nonlinear least squares, CG is often used as an inner solver to compute the Gauss–Newton or Levenberg-Marquardt update step, especially when $J^\top J$ is too large to invert directly.
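A bare-bones conjugate gradient implementation in NumPy for a symmetric positive definite system (in practice `scipy.sparse.linalg.cg` is the usual choice); the test system below is built from a random Jacobian purely for illustration:

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    """Solve A x = b for symmetric positive definite A using conjugate gradients.
    Only matrix-vector products with A are required."""
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.copy()
    r = b - A @ x                       # residual
    p = r.copy()                        # first search direction is the residual direction
    rs_old = r @ r
    for _ in range(max_iter or n):
        Ap = A @ p
        step = rs_old / (p @ Ap)        # exact line search along p
        x += step * p
        r -= step * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p   # new direction, A-orthogonal to the previous ones
        rs_old = rs_new
    return x

# Example: an SPD system of Gauss-Newton type built from a random Jacobian.
rng = np.random.default_rng(10)
J = rng.normal(size=(100, 20))
f = rng.normal(size=100)
A_spd = J.T @ J + 1e-6 * np.eye(20)     # small shift keeps it safely positive definite
dtheta = conjugate_gradient(A_spd, -J.T @ f)
print(np.allclose(A_spd @ dtheta, -J.T @ f))
```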
These methods form the foundation of modern nonlinear least squares optimization, combining analytical insight with numerical stability.