The least squares regression problem involves solving for unknown parameters $\mathbf{x}$ in a model written as a linear system: $$ A \mathbf{x} = \mathbf{b} $$ where $$ A \in \mathbb{R}^{m \times n}, \quad \mathbf{b} \in \mathbb{R}^{m}. $$
We wish to find $\mathbf{x}$ that minimizes the residual error $\|A \mathbf{x} - \mathbf{b}\|_2^2$. This is a convex optimization problem, since the cost function is quadratic in $\mathbf{x}$.
$$ \min_{\mathbf{x}} \|A\mathbf{x} - \mathbf{b}\|_2^2 $$

If $A$ is square and invertible, then the solution is simply: $$ \mathbf{x} = A^{-1} \mathbf{b}. $$ This case corresponds to having the same number of equations and unknowns ($m = n$).
If $A$ is not square (and hence not invertible), we may have:
- an overdetermined system ($m > n$): more equations than unknowns, or
- an underdetermined system ($m < n$): fewer equations than unknowns.

In neither case does a unique exact solution exist in general: the overdetermined system typically has no exact solution, while the underdetermined system has infinitely many. The least squares approach finds an approximate solution by minimizing $\|A\mathbf{x} - \mathbf{b}\|_2$.
We minimize: $$ \|A\mathbf{x} - \mathbf{b}\|_2^2 = (A\mathbf{x} - \mathbf{b})^T (A\mathbf{x} - \mathbf{b}). $$ Expanding gives: $$ = \mathbf{x}^T A^T A \mathbf{x} - 2 \mathbf{b}^T A \mathbf{x} + \mathbf{b}^T \mathbf{b}. $$ Setting the gradient to zero: $$ \nabla_{\mathbf{x}} \|A\mathbf{x} - \mathbf{b}\|_2^2 = 2A^T(A\mathbf{x} - \mathbf{b}) = 0. $$ This yields the normal equation: $$ A^T A \mathbf{x} = A^T \mathbf{b}. $$ If $A^T A$ is invertible, then $$ \mathbf{x} = (A^T A)^{-1} A^T \mathbf{b}. $$
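As a concrete check of the derivation, here is a minimal NumPy sketch (the matrix $A$, the true $\mathbf{x}$, and the noise level are made-up example data) that solves the normal equations and compares the result with `numpy.linalg.lstsq`:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))                   # overdetermined system: m=100, n=3
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true + 0.01 * rng.normal(size=100)    # noisy observations

# Solve the normal equations A^T A x = A^T b.
# (np.linalg.solve is preferred over forming the explicit inverse.)
x_ne = np.linalg.solve(A.T @ A, A.T @ b)

# Reference solution from NumPy's least squares routine.
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_ne, x_ls))                  # True (up to numerical precision)
```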
When $A$ has full column rank, the matrix $(A^T A)^{-1} A^T$ is known as the (Moore–Penrose) pseudoinverse of $A$, denoted $A^+$.
For general (possibly non-square) matrices, we can define: $$ A^+ = (A^T A)^{-1} A^T \quad \text{if } A^T A \text{ is invertible.} $$
However, for large or rank-deficient matrices, this form may be numerically unstable. We instead compute the pseudoinverse using the Singular Value Decomposition (SVD): $$ A = U \Sigma V^T, $$ where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are orthogonal matrices and $\Sigma \in \mathbb{R}^{m \times n}$ is a rectangular diagonal matrix containing the singular values $\sigma_i \ge 0$.
Then, $$ A^+ = V \Sigma^+ U^T, $$ where $\Sigma^+$ is obtained by taking the reciprocal of each nonzero singular value in $\Sigma$ and transposing the resulting matrix.
The least squares solution becomes: $$ \mathbf{x} = A^+ \mathbf{b} = V \Sigma^+ U^T \mathbf{b}. $$
This approach is numerically stable and works even when $A$ is not full rank.
The rank of $A$ equals the number of nonzero singular values. If $A$ has rank $r$, we can approximate it using a rank-$r$ truncation of the SVD: $$ A \approx U_r \Sigma_r V_r^T. $$
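The sketch below (using an arbitrary random matrix as example data) builds $A^+$ from the SVD, checks it against `numpy.linalg.pinv`, and forms a rank-$r$ truncation:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 4))
b = rng.normal(size=6)

# Thin SVD: U is 6x4, s holds the singular values, Vt is 4x4.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Pseudoinverse: invert only the singular values above a small tolerance.
tol = max(A.shape) * np.finfo(float).eps * s.max()
s_inv = np.where(s > tol, 1.0 / s, 0.0)
A_pinv = Vt.T @ np.diag(s_inv) @ U.T

print(np.allclose(A_pinv, np.linalg.pinv(A)))   # True

# Least squares solution via the pseudoinverse.
x = A_pinv @ b

# Rank-r truncation A ~ U_r Sigma_r V_r^T.
r = 2
A_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]
```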
In the underdetermined case ($m < n$), there are infinitely many exact solutions. We usually choose the one with minimum norm: $$ \min_{\mathbf{x}} \|\mathbf{x}\|_2 \quad \text{subject to } A\mathbf{x} = \mathbf{b}. $$ Assuming $A$ has full row rank, this leads to: $$ \mathbf{x} = A^T (A A^T)^{-1} \mathbf{b}. $$
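A small NumPy sketch of the minimum-norm solution for a made-up wide matrix, assuming full row rank:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 8))        # underdetermined: m=3 < n=8
b = rng.normal(size=3)

# Minimum-norm solution x = A^T (A A^T)^{-1} b (valid when A has full row rank).
x_min = A.T @ np.linalg.solve(A @ A.T, b)

# The pseudoinverse gives the same minimum-norm solution.
print(np.allclose(x_min, np.linalg.pinv(A) @ b))   # True
print(np.allclose(A @ x_min, b))                   # exactly solves A x = b
```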
The condition number of a matrix $A$ is defined as: $$ \kappa(A) = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}. $$ A high condition number means $A$ is nearly singular, making the system very sensitive to small perturbations in $\mathbf{b}$.
If $\mathbf{b}$ has some noise $\delta \mathbf{b}$, then: $$ A(\mathbf{x} + \delta \mathbf{x}) = \mathbf{b} + \delta \mathbf{b}. $$ This results in $$ \frac{\|\delta \mathbf{x}\|}{\|\mathbf{x}\|} \le \kappa(A) \frac{\|\delta \mathbf{b}\|}{\|\mathbf{b}\|}. $$ Hence, reducing $\kappa(A)$ (e.g., via regularization) improves numerical stability.
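The following NumPy experiment illustrates this sensitivity; the matrix is deliberately constructed with a condition number of about $10^8$, and all numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

# Build a 2x2 matrix with a huge condition number by choosing its singular values.
U, _ = np.linalg.qr(rng.normal(size=(2, 2)))
V, _ = np.linalg.qr(rng.normal(size=(2, 2)))
A = U @ np.diag([1.0, 1e-8]) @ V.T
print(np.linalg.cond(A))                # ~1e8

x_true = np.array([1.0, 1.0])
b = A @ x_true

# A tiny perturbation of b ...
db = 1e-10 * rng.normal(size=2)
x_pert = np.linalg.solve(A, b + db)

# ... can produce a much larger relative change in x.
print(np.linalg.norm(x_pert - x_true) / np.linalg.norm(x_true))
print(np.linalg.cond(A) * np.linalg.norm(db) / np.linalg.norm(b))   # the upper bound
```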
When $A^T A$ is ill-conditioned, we can stabilize the solution using Tikhonov regularization (also called ridge regression): $$ \min_{\mathbf{x}} \|A\mathbf{x} - \mathbf{b}\|_2^2 + \alpha \|\mathbf{x}\|_2^2, $$ where $\alpha > 0$ controls the trade-off between bias and variance.
The closed-form solution is: $$ \mathbf{x} = (A^T A + \alpha I)^{-1} A^T \mathbf{b}. $$
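A minimal NumPy sketch of the closed-form ridge solution; the data and the value of `alpha` are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(50, 10))
b = rng.normal(size=50)
alpha = 0.1                        # regularization strength (problem dependent)

n = A.shape[1]
# Tikhonov / ridge solution: (A^T A + alpha I) x = A^T b.
x_ridge = np.linalg.solve(A.T @ A + alpha * np.eye(n), A.T @ b)

# Sanity check: as alpha -> 0 the ordinary least squares solution is recovered.
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
x_small = np.linalg.solve(A.T @ A + 1e-12 * np.eye(n), A.T @ b)
print(np.allclose(x_small, x_ls))
```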
Remarks:
- For any $\alpha > 0$, the matrix $A^T A + \alpha I$ is positive definite and therefore always invertible, even when $A^T A$ itself is singular.
- Larger values of $\alpha$ shrink $\|\mathbf{x}\|_2$ toward zero; as $\alpha \to 0$, the ordinary least squares solution is recovered.
- Regularization also lowers the effective condition number, since every eigenvalue of $A^T A$ is shifted by $\alpha$.
Another common regularization is the LASSO (Least Absolute Shrinkage and Selection Operator): $$ \min_{\mathbf{x}} \|A\mathbf{x} - \mathbf{b}\|_2^2 + \lambda \|\mathbf{x}\|_1. $$ Here, the $\ell_1$ norm encourages sparsity in $\mathbf{x}$, leading to simpler models.
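Unlike ridge regression, the LASSO has no closed-form solution because the $\ell_1$ term is not differentiable at zero, so it is solved iteratively. Below is a minimal proximal-gradient (ISTA) sketch in NumPy; the step size, penalty `lam`, and iteration count are illustrative choices, not tuned values:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(A, b, lam, n_iter=500):
    """Minimize ||A x - b||_2^2 + lam * ||x||_1 by proximal gradient descent (ISTA)."""
    step = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2)   # 1/L for the smooth part ||Ax-b||^2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * A.T @ (A @ x - b)               # gradient of the smooth part
        x = soft_threshold(x - step * grad, step * lam)
    return x

rng = np.random.default_rng(5)
A = rng.normal(size=(40, 20))
x_true = np.zeros(20)
x_true[[2, 7, 11]] = [3.0, -2.0, 1.5]                # sparse ground truth
b = A @ x_true + 0.01 * rng.normal(size=40)

x_hat = lasso_ista(A, b, lam=1.0)
print(np.flatnonzero(x_hat))   # support is (close to) {2, 7, 11}; most entries are exactly 0
```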
We can also impose linear constraints: $$ \min_{\mathbf{x}} \|A\mathbf{x} - \mathbf{b}\|_2^2 \quad \text{subject to } C\mathbf{x} = \mathbf{d}. $$ Rewriting in quadratic form: $$ \min_{\mathbf{x}} \frac{1}{2} \mathbf{x}^T Q \mathbf{x} + \mathbf{c}^T \mathbf{x}, \quad \text{subject to } C\mathbf{x} = \mathbf{d}, $$ where $Q = 2A^T A$ and $\mathbf{c} = -2A^T \mathbf{b}$.
Using Lagrange multipliers: $$ \mathcal{L}(\mathbf{x}, \bm{\lambda}) = \frac{1}{2} \mathbf{x}^T Q \mathbf{x} + \mathbf{c}^T \mathbf{x} + \bm{\lambda}^T (C\mathbf{x} - \mathbf{d}). $$ Setting derivatives to zero gives: $$ \begin{bmatrix} Q & C^T \\ C & 0 \end{bmatrix} \begin{bmatrix} \mathbf{x} \\ \bm{\lambda} \end{bmatrix} = \begin{bmatrix} -\mathbf{c} \\ \mathbf{d} \end{bmatrix}. $$ This linear system can be solved directly, often via SVD.
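A NumPy sketch that assembles and solves this KKT system for a made-up equality-constrained problem:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(size=(30, 5))
b = rng.normal(size=30)
C = rng.normal(size=(2, 5))      # two linear equality constraints
d = rng.normal(size=2)

Q = 2.0 * A.T @ A
c = -2.0 * A.T @ b

# Assemble and solve the KKT system
# [ Q  C^T ] [ x      ]   [ -c ]
# [ C   0  ] [ lambda ] = [  d ]
n, p = Q.shape[0], C.shape[0]
KKT = np.block([[Q, C.T],
                [C, np.zeros((p, p))]])
rhs = np.concatenate([-c, d])
sol = np.linalg.solve(KKT, rhs)
x, lam = sol[:n], sol[n:]

print(np.allclose(C @ x, d))     # constraints are satisfied
```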
In general, least squares and its constrained forms can be expressed as a quadratic program: $$ \min_{\mathbf{x}} \frac{1}{2} \mathbf{x}^T Q \mathbf{x} + \mathbf{c}^T \mathbf{x} \quad \text{subject to } A\mathbf{x} \le \mathbf{b}, $$ where $Q$ is positive semi-definite. Because the problem is convex, any local minimum is also a global minimum.
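As a rough illustration, small quadratic programs of this form can be handed to a general-purpose solver; the sketch below (with made-up data) uses `scipy.optimize.minimize` with the SLSQP method, though a dedicated QP solver would normally be preferred for larger problems:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
M = rng.normal(size=(8, 4))
Q = M.T @ M                      # positive semi-definite by construction
c = rng.normal(size=4)
A_in = rng.normal(size=(3, 4))   # inequality constraints A_in x <= b_in
b_in = rng.normal(size=3)

def objective(x):
    return 0.5 * x @ Q @ x + c @ x

def objective_grad(x):
    return Q @ x + c

# SLSQP expects inequality constraints in the form g(x) >= 0.
constraints = [{"type": "ineq", "fun": lambda x: b_in - A_in @ x}]

res = minimize(objective, x0=np.zeros(4), jac=objective_grad,
               constraints=constraints, method="SLSQP")
print(res.x, np.all(A_in @ res.x <= b_in + 1e-8))
```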
We begin with a nonlinear least squares problem of the form: $$ L(\bm{\theta}) = \frac{1}{2} \sum_{j=1}^{m} \| y_j - f(x_j; \bm{\theta}) \|^2 = \frac{1}{2} \sum_{j=1}^{m} f_j(\bm{\theta})^2 = \frac{1}{2} \| \bm{f}(\bm{\theta}) \|^2, $$ where \( f_j(\bm{\theta}) = y_j - f(x_j; \bm{\theta}) \) are the residuals.
The gradient of \(L\) is given by: $$ \nabla_{\bm{\theta}} L = J^\top \bm{f}, $$ where \( J \) is the Jacobian matrix of residuals: $$ J = \begin{bmatrix} \frac{\partial f_1}{\partial \theta_1} & \frac{\partial f_1}{\partial \theta_2} & \dots & \frac{\partial f_1}{\partial \theta_n} \\ \frac{\partial f_2}{\partial \theta_1} & \frac{\partial f_2}{\partial \theta_2} & \dots & \frac{\partial f_2}{\partial \theta_n} \\ \vdots & & \ddots & \vdots \\ \frac{\partial f_m}{\partial \theta_1} & \frac{\partial f_m}{\partial \theta_2} & \dots & \frac{\partial f_m}{\partial \theta_n} \end{bmatrix}. $$
The exact Hessian is: $$ \nabla^2_{\bm{\theta}} L = J^\top J + \sum_{j=1}^m f_j H_j $$ where $H_j = \nabla^2_{\bm{\theta}} f_j$ is the Hessian of the $j$-th residual function.
Gauss-Newton Approximation: Near a good fit, the residuals $f_j(\bm{\theta})$ are small, so the term $\sum_j f_j H_j$ is often negligible. Dropping it yields the Gauss-Newton Hessian approximation: $$ H \approx J^\top J $$
Newton’s method gives the update: $$ \bm{\theta}_{k+1} = \bm{\theta}_k - H^{-1} \nabla_{\bm{\theta}} L. $$ Using the Gauss–Newton approximation for $H$ and the gradient $\nabla L = J^\top \bm{f}$, we get: $$ (J_k^\top J_k) \, \Delta \bm{\theta}_k = - J_k^\top \bm{f}_k, $$ where the update is $\bm{\theta}_{k+1} = \bm{\theta}_k + \Delta \bm{\theta}_k$. This means the Gauss–Newton method solves a linear least squares problem at each iteration: $$ \min_{\Delta \bm{\theta}} \frac{1}{2} \| J_k \Delta \bm{\theta} + \bm{f}_k \|^2. $$
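A compact NumPy sketch of the Gauss–Newton iteration, fitting a made-up exponential model $f(x; \bm{\theta}) = \theta_1 e^{\theta_2 x}$ with an analytic Jacobian; the data, starting point, and iteration count are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(8)
x_data = np.linspace(0.0, 2.0, 30)
theta_true = np.array([2.0, -1.3])
y_data = theta_true[0] * np.exp(theta_true[1] * x_data) + 0.01 * rng.normal(size=30)

def residuals(theta):
    # f_j(theta) = y_j - f(x_j; theta)
    return y_data - theta[0] * np.exp(theta[1] * x_data)

def jacobian(theta):
    # Partial derivatives of the residuals with respect to theta_1 and theta_2.
    e = np.exp(theta[1] * x_data)
    return np.column_stack([-e, -theta[0] * x_data * e])

theta = np.array([1.0, -0.5])      # initial guess
for _ in range(10):
    J = jacobian(theta)
    f = residuals(theta)
    # Gauss-Newton step: solve min ||J dtheta + f||^2 as a linear least squares problem.
    dtheta, *_ = np.linalg.lstsq(J, -f, rcond=None)
    theta = theta + dtheta

print(theta)                       # close to theta_true
```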
The method reduces a nonlinear least squares problem to a sequence of linearized least squares problems by using the local Jacobian. This approach works well when the model is approximately linear near the solution.
The Gauss–Newton method can be unstable if:
- $J^\top J$ is singular or ill-conditioned (for example, when the Jacobian is rank-deficient),
- the residuals are large, so the neglected term $\sum_j f_j H_j$ is not actually negligible, or
- the initial guess is far from the solution, so the local linearization is a poor approximation.
The LM algorithm improves robustness by combining the Gauss–Newton method with gradient descent. It does this by regularizing the Hessian approximation: $$ H \approx J^\top J + \alpha I, $$ where $\alpha > 0$ is a damping parameter. The update equation becomes: $$ (J^\top J + \alpha I)\, \Delta \bm{\theta} = -J^\top \bm{f}. $$
The parameter $\alpha$ is adjusted dynamically based on how well the linear model predicts the actual reduction in the cost function. This allows the algorithm to smoothly transition between fast Gauss-Newton steps and stable gradient-descent steps.
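A minimal Levenberg–Marquardt loop in NumPy, written as a sketch of the idea above; the rule of multiplying or dividing $\alpha$ by 10 is one common heuristic rather than the only choice, and in practice a library routine such as `scipy.optimize.least_squares` would typically be used:

```python
import numpy as np

def levenberg_marquardt(residuals, jacobian, theta0, n_iter=50, alpha0=1e-2):
    """Minimal LM loop: damped Gauss-Newton steps with a simple alpha update rule."""
    theta, alpha = np.array(theta0, dtype=float), alpha0
    cost = 0.5 * np.sum(residuals(theta) ** 2)
    for _ in range(n_iter):
        J, f = jacobian(theta), residuals(theta)
        # Damped normal equations: (J^T J + alpha I) dtheta = -J^T f
        dtheta = np.linalg.solve(J.T @ J + alpha * np.eye(len(theta)), -J.T @ f)
        new_cost = 0.5 * np.sum(residuals(theta + dtheta) ** 2)
        if new_cost < cost:          # step accepted: trust the linear model more
            theta, cost, alpha = theta + dtheta, new_cost, alpha / 10.0
        else:                        # step rejected: fall back toward gradient descent
            alpha *= 10.0
    return theta

# Example: the same exponential model as in the Gauss-Newton sketch above.
rng = np.random.default_rng(9)
x_data = np.linspace(0.0, 2.0, 30)
y_data = 2.0 * np.exp(-1.3 * x_data) + 0.01 * rng.normal(size=30)
res = lambda th: y_data - th[0] * np.exp(th[1] * x_data)
jac = lambda th: np.column_stack([-np.exp(th[1] * x_data),
                                  -th[0] * x_data * np.exp(th[1] * x_data)])
print(levenberg_marquardt(res, jac, [1.0, -0.5]))   # approaches [2.0, -1.3]
```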
The Conjugate Gradient method is an iterative algorithm for solving large symmetric positive definite (SPD) linear systems of the form $A \bm{x} = \bm{b}$.
In our context, this often corresponds to solving the Gauss-Newton step: $$ (J^\top J) \Delta \bm{\theta} = -J^\top \bm{f}. $$ CG is efficient for large or sparse systems because it avoids explicit matrix factorization and inversion, relying only on matrix-vector products.
Instead of descending along gradient directions, CG minimizes the quadratic function $\phi(\bm{x}) = \frac{1}{2} \bm{x}^\top A \bm{x} - \bm{b}^\top \bm{x}$ by moving along a sequence of A-orthogonal (or "conjugate") directions $\{ \bm{p}_k \}$, where $\bm{p}_i^\top A \bm{p}_j = 0$ for $i \ne j$. This guarantees convergence in at most $n$ steps for an $n \times n$ system (in exact arithmetic).
In nonlinear least squares, CG is often used as an inner solver to compute the Gauss–Newton or Levenberg-Marquardt update step, especially when $J^\top J$ is too large to invert directly.
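A bare-bones conjugate gradient implementation in NumPy for a symmetric positive definite system (in practice `scipy.sparse.linalg.cg` is the usual choice); the test system below is built from a random Jacobian purely for illustration:

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    """Solve A x = b for symmetric positive definite A using conjugate gradients.
    Only matrix-vector products with A are required."""
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.copy()
    r = b - A @ x                       # residual
    p = r.copy()                        # first search direction is the residual direction
    rs_old = r @ r
    for _ in range(max_iter or n):
        Ap = A @ p
        step = rs_old / (p @ Ap)        # exact line search along p
        x += step * p
        r -= step * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p   # new direction, A-orthogonal to the previous ones
        rs_old = rs_new
    return x

# Example: an SPD system of Gauss-Newton type built from a random Jacobian.
rng = np.random.default_rng(10)
J = rng.normal(size=(100, 20))
f = rng.normal(size=100)
A_spd = J.T @ J + 1e-6 * np.eye(20)     # small shift keeps it safely positive definite
dtheta = conjugate_gradient(A_spd, -J.T @ f)
print(np.allclose(A_spd @ dtheta, -J.T @ f))
```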
These methods form the foundation of modern nonlinear least squares optimization, combining analytical insight with numerical stability.