Linear Models from a Probabilistic Perspective: Does Logistic Regression Have an Analytical Solution?

By 苏剑林 | July 22, 2021

We know that linear regression is a relatively simple problem because it has an analytical solution. Its variant, Logistic Regression (LR), however, does not have an analytical solution, which is somewhat of a pity. Although it is called "regression," it is actually used for classification problems, and for many readers, classification is more common than regression. To be precise, when we say logistic regression has no analytical solution, we are referring to the fact that "logistic regression has no analytical solution under Maximum Likelihood Estimation (MLE)." Does this mean that if we don't use Maximum Likelihood Estimation, we could find a usable analytical solution?

This article will derive an analytical solution for logistic regression from a non-MLE perspective. Simple experiments show that its performance is not inferior to the MLE solution obtained via gradient descent. Furthermore, this analytical solution is easily generalized to the single-layer Softmax multi-class model.

Linear Regression

Let's first review linear regression. Suppose the training data is $\{(\boldsymbol{x}_i,\boldsymbol{y}_i)\}_{i=1}^N$, where $\boldsymbol{x}\in\mathbb{R}^n,\boldsymbol{y}\in\mathbb{R}^m$. To align with code implementations, we default to using row vectors for all vectors here. Linear regression assumes that $\boldsymbol{x}$ and $\boldsymbol{y}$ satisfy a linear relationship $\boldsymbol{y}=\boldsymbol{x}\boldsymbol{W}+\boldsymbol{b}$, where $\boldsymbol{W}\in\mathbb{R}^{n\times m},\boldsymbol{b}\in\mathbb{R}^m$. We then estimate them by minimizing the following Mean Squared Error (MSE):

\begin{equation}\frac{1}{N}\sum_{i=1}^N \Vert\boldsymbol{y}_i-\boldsymbol{x}_i\boldsymbol{W}-\boldsymbol{b}\Vert^2\label{eq:loss}\end{equation}

This objective can be solved by directly expanding it and taking derivatives; since it is a quadratic minimization problem, it has an analytical solution.
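Concretely (a routine derivation, written with the sample means and covariances that also appear in the next subsection): setting the gradient with respect to $\boldsymbol{b}$ to zero gives $\boldsymbol{b}=\boldsymbol{\mu}_y-\boldsymbol{\mu}_x\boldsymbol{W}$, and substituting this back and differentiating with respect to $\boldsymbol{W}$ gives

\begin{equation}\underbrace{\frac{1}{N}\sum_{i=1}^N (\boldsymbol{x}_i-\boldsymbol{\mu}_x)^{\top}(\boldsymbol{x}_i-\boldsymbol{\mu}_x)}_{\boldsymbol{\Sigma}_{xx}}\,\boldsymbol{W} = \underbrace{\frac{1}{N}\sum_{i=1}^N (\boldsymbol{x}_i-\boldsymbol{\mu}_x)^{\top}(\boldsymbol{y}_i-\boldsymbol{\mu}_y)}_{\boldsymbol{\Sigma}_{xy}}\end{equation}

so $\boldsymbol{W}=\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\Sigma}_{xy}$ whenever $\boldsymbol{\Sigma}_{xx}$ is invertible.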

Probabilistic Perspective

From the perspective of probability distributions, the Mean Squared Error implies that we assume $p(\boldsymbol{y}|\boldsymbol{x})$ is a normal distribution with mean $\boldsymbol{\mu}_{y|x}=\boldsymbol{x}\boldsymbol{W}+\boldsymbol{b}$. Now, let's make a stronger assumption:

Assume the joint distribution $p(\boldsymbol{x},\boldsymbol{y})$ is a normal distribution.

Under this assumption, we can directly write the corresponding conditional distribution:

\begin{equation}\begin{aligned} p(\boldsymbol{y}|\boldsymbol{x}) =&\, \mathcal{N}(\boldsymbol{y};\boldsymbol{\mu}_{y|x},\boldsymbol{\Sigma}_{y|x})\\ \boldsymbol{\mu}_{y|x} =&\, \boldsymbol{\mu}_y + (\boldsymbol{x}-\boldsymbol{\mu}_x)\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\Sigma}_{xy}\\ \boldsymbol{\Sigma}_{y|x} =&\, \boldsymbol{\Sigma}_{yy} - \boldsymbol{\Sigma}_{yx}\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\Sigma}_{xy} \end{aligned}\end{equation}

Here, $\boldsymbol{\mu}_x, \boldsymbol{\mu}_y$ are the mean vectors of $\boldsymbol{x}, \boldsymbol{y}$, and $\begin{pmatrix}\boldsymbol{\Sigma}_{xx} & \boldsymbol{\Sigma}_{xy} \\ \boldsymbol{\Sigma}_{yx} & \boldsymbol{\Sigma}_{yy}\end{pmatrix}$ is the covariance matrix of $\boldsymbol{x}, \boldsymbol{y}$. The form of the conditional distribution for a normal distribution can be found directly on Wikipedia, and its proof can be found on StackExchange or in related probability and statistics textbooks.

Comparing this with $\boldsymbol{\mu}_{y|x}=\boldsymbol{x}\boldsymbol{W}+\boldsymbol{b}$, we obtain:

\begin{equation}\boldsymbol{W} = \boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\Sigma}_{xy}, \quad \boldsymbol{b} = \boldsymbol{\mu}_y - \boldsymbol{\mu}_x\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\Sigma}_{xy}\end{equation}

This is, in fact, the analytical solution for linear regression.
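As a quick numerical sanity check, here is a minimal NumPy sketch (toy synthetic data and variable names of my own choosing) comparing this statistics-based solution with an ordinary least-squares fit:

```python
import numpy as np

# Toy data: N samples, x in R^n and y in R^m, stored as row vectors (synthetic, for illustration only).
rng = np.random.default_rng(0)
N, n, m = 1000, 5, 3
X = rng.normal(size=(N, n))
Y = X @ rng.normal(size=(n, m)) + rng.normal(size=m) + 0.1 * rng.normal(size=(N, m))

# Closed form from the statistics: W = Sigma_xx^{-1} Sigma_xy, b = mu_y - mu_x W.
mu_x, mu_y = X.mean(0), Y.mean(0)
Sigma_xx = (X - mu_x).T @ (X - mu_x) / N
Sigma_xy = (X - mu_x).T @ (Y - mu_y) / N
W = np.linalg.solve(Sigma_xx, Sigma_xy)
b = mu_y - mu_x @ W

# Ordinary least squares on the augmented design matrix [X, 1] for comparison.
Wb, *_ = np.linalg.lstsq(np.hstack([X, np.ones((N, 1))]), Y, rcond=None)
print(np.allclose(W, Wb[:-1]), np.allclose(b, Wb[-1]))  # expected: True True
```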

Analysis and Reflection

Let's take stock of the above process. By default, linear regression only makes an assumption about the conditional distribution $p(\boldsymbol{y}|\boldsymbol{x})$, and we obtain the analytical solution via least squares; this is the conventional presentation of linear regression. Above, however, we made a much stronger assumption, namely that the joint distribution $p(\boldsymbol{x},\boldsymbol{y})$ is normal, yet we still ended up with the same analytical solution.

Why did a stronger assumption lead to the same result? In fact, from the loss function $\eqref{eq:loss}$, we can see it is quadratic with respect to $\boldsymbol{y}$ and also quadratic with respect to $\boldsymbol{x}$. This means it uses at most information regarding the second-order moments of $\boldsymbol{x}$ and $\boldsymbol{y}$. Therefore, assuming the joint distribution is normal does not change the final result because the normal distribution already preserves all moment information up to the second order (mean and covariance).

Furthermore, we can imagine that any linear model (linear regression, logistic regression, single-layer neural networks, etc.) primarily utilizes data statistics that do not exceed second-order moments. Therefore, when dealing with linear models, we can appropriately make a normal distribution assumption based on the specific circumstances. Theoretically, this can yield an equivalent result or a sufficiently good approximation.

Logistic Regression

Using the logic above, we can provide an analytical solution for logistic regression. I first saw this result in "Easy Logistic Regression with an Analytical Solution" and found it quite enlightening.

Assume the training data is $\{(\boldsymbol{x}_i,y_i)\}_{i=1}^N$, where $\boldsymbol{x}\in\mathbb{R}^n, y\in\{0,1\}$. This means it is a binary classification dataset. We establish the probability model:

\begin{equation}p(y|\boldsymbol{x}) = \left\{\begin{aligned}\sigma\left(\boldsymbol{x}\boldsymbol{w}^{\top}+b\right),\quad (y = 1)\\ 1 - \sigma\left(\boldsymbol{x}\boldsymbol{w}^{\top}+b\right),\quad (y = 0)\end{aligned}\right.\end{equation}

where $\sigma(t)=1/(1+e^{-t})$. The conventional estimation method for $\boldsymbol{w},b$ is maximum likelihood, which involves minimizing the following loss:

\begin{equation}-\frac{1}{N}\sum_{i=1}^N \ln p(y_i|\boldsymbol{x}_i)\end{equation}

We cannot calculate its analytical solution. However, if we do not take the maximum likelihood route and design a different solution path, it is possible to obtain an analytical solution.
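For later reference, the "gradient descent" baselines in the experiments below are exactly this MLE route solved iteratively. A minimal plain-NumPy sketch (the function name and hyperparameters are my own) looks like this:

```python
import numpy as np

def fit_lr_mle(X, y, lr=0.1, steps=2000):
    """Minimize the logistic-regression negative log-likelihood by plain gradient descent."""
    N, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted p(1|x) for every sample
        w -= lr * X.T @ (p - y) / N             # gradient of the average NLL w.r.t. w
        b -= lr * (p - y).mean()                # gradient of the average NLL w.r.t. b
    return w, b
```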

An Ingenious Approach

First, it is easy to verify that for the logistic regression model, we have:

\begin{equation}\frac{p(1|\boldsymbol{x})}{p(0|\boldsymbol{x})} = \exp\left(\boldsymbol{x}\boldsymbol{w}^{\top}+b\right) \quad\Leftrightarrow\quad \ln \frac{p(1|\boldsymbol{x})}{p(0|\boldsymbol{x})} = \boldsymbol{x}\boldsymbol{w}^{\top}+b\label{eq:log}\end{equation}
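Indeed, with $\sigma(t)=1/(1+e^{-t})$ we have

\begin{equation}\frac{\sigma(t)}{1-\sigma(t)} = \frac{1/(1+e^{-t})}{e^{-t}/(1+e^{-t})} = e^{t}\end{equation}

and substituting $t=\boldsymbol{x}\boldsymbol{w}^{\top}+b$ gives exactly the relation above.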

This means that logistic regression is equivalent to a linear regression model with $\ln \frac{p(1|\boldsymbol{x})}{p(0|\boldsymbol{x})}$ as the output. However, estimating $\ln \frac{p(1|\boldsymbol{x})}{p(0|\boldsymbol{x})}$ directly is not easy, so we use Bayes' rule:

\begin{equation}p(y|\boldsymbol{x}) = \frac{p(\boldsymbol{x}|y)p(y)}{p(\boldsymbol{x})} \quad\Leftrightarrow\quad \frac{p(1|\boldsymbol{x})}{p(0|\boldsymbol{x})} = \frac{p(\boldsymbol{x}|1)p_1}{p(\boldsymbol{x}|0)p_0}\label{eq:bys}\end{equation}

Here $p_1, p_0$ are the prior probabilities of the positive and negative classes, which are easy to estimate, while $p(\boldsymbol{x}|1)$ and $p(\boldsymbol{x}|0)$ are the class-conditional distributions of the positive and negative samples. For these, we now make the following assumption:

Assume $p(\boldsymbol{x}|1)$ and $p(\boldsymbol{x}|0)$ are normal distributions with the same covariance matrix.

The reader might find the assumption of the "same covariance matrix" a bit mysterious; we will discuss this later. Under this assumption, let:

\begin{equation}p(\boldsymbol{x}|1) = \mathcal{N}(\boldsymbol{x};\boldsymbol{\mu}_1,\boldsymbol{\Sigma}),\quad p(\boldsymbol{x}|0) = \mathcal{N}(\boldsymbol{x};\boldsymbol{\mu}_0,\boldsymbol{\Sigma})\end{equation}

where $\boldsymbol{\mu}_1, \boldsymbol{\mu}_0$ are the mean vectors of the positive and negative samples, and $\boldsymbol{\Sigma}$ is the shared covariance matrix, which can be estimated from the whole sample set (a more careful estimate is derived in the Parameter Estimation section below). Recalling the probability density of a normal distribution:

\begin{equation}\frac{1}{\sqrt{(2\pi)^n \det(\boldsymbol{\Sigma})}}\exp\left\{-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu})\boldsymbol{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{\mu})^{\top}\right\}\end{equation}

Substituting this into Equation $\eqref{eq:bys}$ and expanding, we find that the quadratic terms cancel out exactly. Thus:

\begin{equation}\ln\frac{p(1|\boldsymbol{x})}{p(0|\boldsymbol{x})} = \ln\frac{p(\boldsymbol{x}|1)p_1}{p(\boldsymbol{x}|0)p_0} = \boldsymbol{x}\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0)^{\top} + \frac{1}{2}\left(\boldsymbol{\mu}_0\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_0^{\top} - \boldsymbol{\mu}_1\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1^{\top}\right) + \ln\frac{p_1}{p_0}\label{eq:rate}\end{equation}
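To see the cancellation explicitly: the normalizing factors $1/\sqrt{(2\pi)^n\det(\boldsymbol{\Sigma})}$ are identical for the two classes and drop out of the ratio, leaving

\begin{equation}\ln\frac{p(\boldsymbol{x}|1)}{p(\boldsymbol{x}|0)} = -\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu}_1)\boldsymbol{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{\mu}_1)^{\top} + \frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu}_0)\boldsymbol{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{\mu}_0)^{\top}\end{equation}

in which the $\boldsymbol{x}\boldsymbol{\Sigma}^{-1}\boldsymbol{x}^{\top}$ terms cancel, so only the linear and constant terms shown above remain.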

Comparing this with Equation $\eqref{eq:log}$, we get:

\begin{equation}\begin{aligned} \boldsymbol{w} =&\, (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0)\boldsymbol{\Sigma}^{-1}\\ b =&\, \frac{1}{2}\left(\boldsymbol{\mu}_0\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_0^{\top} - \boldsymbol{\mu}_1\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1^{\top}\right) + \ln\frac{p_1}{p_0} \end{aligned}\label{eq:sol}\end{equation}

This is an analytical solution for logistic regression. Of course, this is not entirely new; its logic is highly consistent with "Linear Discriminant Analysis (LDA)".
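In code, the entire "fit" reduces to a handful of sample statistics. Below is a minimal NumPy sketch (the function name is mine; the shared covariance is estimated with the pooled within-class covariance, which coincides with the estimate derived in the Parameter Estimation section below):

```python
import numpy as np

def fit_lr_analytical(X, y):
    """Closed-form logistic regression: w = (mu1 - mu0) Sigma^{-1}, with a covariance shared by both classes."""
    X1, X0 = X[y == 1], X[y == 0]
    p1, p0 = len(X1) / len(X), len(X0) / len(X)
    mu1, mu0 = X1.mean(0), X0.mean(0)
    Xc = X - np.where(y[:, None] == 1, mu1, mu0)  # center every sample by its own class mean
    Sigma = Xc.T @ Xc / len(X)                    # pooled (shared) covariance estimate
    Si_mu1, Si_mu0 = np.linalg.solve(Sigma, mu1), np.linalg.solve(Sigma, mu0)
    w = Si_mu1 - Si_mu0                           # Sigma is symmetric, so this equals (mu1 - mu0) Sigma^{-1}
    b = 0.5 * (mu0 @ Si_mu0 - mu1 @ Si_mu1) + np.log(p1 / p0)
    return w, b                                   # predict with sigma(X @ w + b)
```

On synthetic data drawn from two Gaussians with a shared covariance, this should closely track the MLE solution; the experiments below compare the two on real sentence-vector features.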

Analysis and Reflection

Currently, the reader's biggest doubt about this solution might be whether the "same covariance matrix" assumption is too strong. From a technical viewpoint, this assumption allows the quadratic terms in $\ln\frac{p(\boldsymbol{x}|1)}{p(\boldsymbol{x}|0)}$ to cancel exactly, leaving only linear terms to directly obtain the analytical solution. But from a theoretical perspective, is there any necessity for this assumption? In fact, we can argue that logistic regression itself (approximately) implies the assumption of the "same covariance matrix."

How do we understand this? First, for the logistic regression model, Equation $\eqref{eq:log}$ holds naturally, and Bayes' rule is always true. Thus, the conclusion is that $\ln\frac{p(\boldsymbol{x}|1)}{p(\boldsymbol{x}|0)}$ must consist only of linear and constant terms. As the linear regression example showed us, making a normal assumption for data distributions in linear models generally doesn't lose much information. So, assuming $p(\boldsymbol{x}|1)$ and $p(\boldsymbol{x}|0)$ are normal distributions is (to some extent) reasonable. Once assumed to be normal, if the result is to have no quadratic terms, the covariance matrices must be identical.

In other words, the moment you decide to use a logistic regression model and accept the normality assumption, you have effectively made the assumption that "positive and negative samples have the same covariance matrix".

Multi-class Classifier

The analytical solution for logistic regression can also be easily generalized to the "Fully Connected + Softmax" multi-class scenario, which assumes the probability of class $i$ is:

\begin{equation}p(i|\boldsymbol{x}) = \frac{\exp\left(\boldsymbol{x}\boldsymbol{w}_i^{\top}+b_i\right)}{\sum\limits_{j=1}^k \exp\left(\boldsymbol{x}\boldsymbol{w}_j^{\top}+b_j\right)}\end{equation}

Based on the same reasoning and assumptions, we can obtain a result similar to Equation $\eqref{eq:log}$:

\begin{equation}\ln \frac{p(j|\boldsymbol{x})}{p(i|\boldsymbol{x})} = \boldsymbol{x}(\boldsymbol{w}_j - \boldsymbol{w}_i)^{\top}+(b_j - b_i)\end{equation}

And a result similar to Equation $\eqref{eq:rate}$:

\begin{equation}\ln\frac{p(j|\boldsymbol{x})}{p(i|\boldsymbol{x})} = \boldsymbol{x}\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_j - \boldsymbol{\mu}_i)^{\top} + \frac{1}{2}\left(\boldsymbol{\mu}_i\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_i^{\top} - \boldsymbol{\mu}_j\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_j^{\top}\right) + \ln\frac{p_j}{p_i}\end{equation}

By comparison, we find a set of solutions as follows:

\begin{equation}\begin{aligned} \boldsymbol{w}_i =&\, \boldsymbol{\mu}_i\boldsymbol{\Sigma}^{-1}\\ b_i =&\, \ln p_i - \frac{1}{2}\boldsymbol{\mu}_i\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_i^{\top} \end{aligned}\end{equation}
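Taking the statistics as given for now (they are estimated in the next subsection), the whole softmax layer can then be written down directly. Here is a minimal NumPy sketch, with array shapes following the row-vector convention used throughout:

```python
import numpy as np

def softmax_weights(priors, mus, Sigma):
    """w_i = mu_i Sigma^{-1}, b_i = ln p_i - mu_i Sigma^{-1} mu_i^T / 2.

    Shapes: priors (k,), mus (k, n), Sigma (n, n); returns W (k, n) and b (k,).
    """
    W = np.linalg.solve(Sigma, mus.T).T          # row i is mu_i Sigma^{-1} (Sigma is symmetric)
    b = np.log(priors) - 0.5 * np.sum(W * mus, axis=1)
    return W, b                                  # class scores are then x @ W.T + b
```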

Parameter Estimation

Finally, let's discuss how to estimate the model parameters. We can see that $\boldsymbol{w}_i, b_i$ are functions of $p_i$, $\boldsymbol{\mu}_i$, and $\boldsymbol{\Sigma}$, so it essentially comes down to estimating these three. As mentioned, $p_i$ is simple—just use the frequency of each class. $\boldsymbol{\mu}_i$ is also straightforward—it's the mean vector of each class. The difficulty lies in estimating $\boldsymbol{\Sigma}$. According to our assumptions, the distribution of the entire data set is:

\begin{equation}\tilde{p}(\boldsymbol{x}) = \sum_{i=1}^k p_i \mathcal{N}(\boldsymbol{x};\boldsymbol{\mu}_i,\boldsymbol{\Sigma})\end{equation}

Multiplying both sides by $\boldsymbol{x}^{\top}\boldsymbol{x}$ and integrating over $\boldsymbol{x}$ (using the identity $\int \mathcal{N}(\boldsymbol{x};\boldsymbol{\mu},\boldsymbol{\Sigma})\,\boldsymbol{x}^{\top}\boldsymbol{x}\,d\boldsymbol{x} = \boldsymbol{\Sigma}+\boldsymbol{\mu}^{\top}\boldsymbol{\mu}$), we get:

\begin{equation}\tilde{\boldsymbol{\Sigma}}+\tilde{\boldsymbol{\mu}}^{\top} \tilde{\boldsymbol{\mu}} = \sum_{i=1}^k p_i \left(\boldsymbol{\Sigma}+\boldsymbol{\mu}_i^{\top} \boldsymbol{\mu}_i\right) = \boldsymbol{\Sigma} + \sum_{i=1}^k p_i\, \boldsymbol{\mu}_i^{\top} \boldsymbol{\mu}_i\end{equation}

where $\tilde{\boldsymbol{\mu}}, \tilde{\boldsymbol{\Sigma}}$ are the mean vector and covariance matrix of all data. Therefore, we have the estimation:

\begin{equation}\boldsymbol{\Sigma} = \tilde{\boldsymbol{\Sigma}}+\tilde{\boldsymbol{\mu}}^{\top} \tilde{\boldsymbol{\mu}} - \sum_{i=1}^k p_i\, \boldsymbol{\mu}_i^{\top} \boldsymbol{\mu}_i\end{equation}

Specifically, it is recommended that before estimation, we perform whitening on the original data (refer to "You Might Not Need BERT-flow: A Linear Transformation Comparable to BERT-flow"), so that the entire dataset has a mean of 0 and a covariance equal to the identity matrix. In this case:

\begin{equation}\boldsymbol{\Sigma} = \boldsymbol{I} - \sum_{i=1}^k p_i\, \boldsymbol{\mu}_i^{\top} \boldsymbol{\mu}_i\end{equation}

Theoretically, the result of estimating after whitening is exactly the same as direct estimation. However, for high-dimensional data, whitening makes the numerical calculations much more stable, so performing whitening before estimation is recommended in practice.
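Putting the estimation recipe together, here is a minimal NumPy sketch (names are mine; it assumes the overall covariance is non-singular) that whitens the data and then reads off $p_i$, $\boldsymbol{\mu}_i$ and $\boldsymbol{\Sigma}$:

```python
import numpy as np

def estimate_statistics(X, y, num_classes):
    """Whiten X to zero mean / identity covariance, then estimate priors p_i, class means mu_i and the shared Sigma."""
    mu, Sigma_all = X.mean(0), np.cov(X, rowvar=False, bias=True)
    eigval, U = np.linalg.eigh(Sigma_all)
    T = U / np.sqrt(eigval)                      # whitening matrix: cov((X - mu) @ T) = I (assumes eigval > 0)
    Xw = (X - mu) @ T
    priors = np.bincount(y, minlength=num_classes) / len(y)
    mus = np.stack([Xw[y == i].mean(0) for i in range(num_classes)])
    Sigma = np.eye(X.shape[1]) - (priors[:, None] * mus).T @ mus  # Sigma = I - sum_i p_i mu_i^T mu_i
    return priors, mus, Sigma, mu, T             # keep (mu, T) so new inputs can be whitened the same way
```

These statistics plug directly into the softmax_weights sketch above; just remember to whiten test inputs with the same (mu, T) before computing the class scores.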

Experimental Evaluation

How usable is the analytical solution derived above? Can it compete with the solution found via gradient descent? I ran several text classification experiments, using RoFormer-Sim-FT to extract fixed sentence-vector features followed by a fully connected layer for classification, and compared the gradient descent solution with the analytical solution. The experimental code has been open-sourced (link omitted here).

Full Sample Experiments

The evaluation includes four classification tasks: Sentiment classification (SENTIMENT), long text classification (IFLYTEK), short news classification (TNEWS), and e-commerce topic classification (SHOPPING). The details are as follows:

\begin{array}{c|cccc} \hline & \text{Total Classes} & \text{Train Samples} & \text{Val Samples} & \text{Test Samples} \\ \hline \text{SENTIMENT} & 2 & 16883 & 2111 & 2111 \\ \text{IFLYTEK} & 119 & 12133 & 2599 & \text{-}\\ \text{TNEWS} & 15 & 53360 & 10000 & \text{-}\\ \text{SHOPPING} & 10 & 47079 & 15694 & \text{-}\\ \hline \end{array}

The evaluation metric for the experiment is accuracy. The results using the full training set are as follows:

\begin{array}{c|ccc} \hline & \text{Train Acc} & \text{Val Acc} & \text{Test Acc} \\ \hline \text{SENTIMENT-GD} & 92.26\% & 91.14\% & 91.14\% \\ \text{SENTIMENT-Analytical} & 91.79\% & 90.81\% & 91.57\% \\ \hline \text{IFLYTEK-GD} & 93.43\% & 51.14\% & \text{-} \\ \text{IFLYTEK-Analytical} & 71.70\% & 56.44\% & \text{-} \\ \hline \text{TNEWS-GD} & 59.62\% & 53.35\% & \text{-} \\ \text{TNEWS-Analytical} & 56.12\% & 54.20\% & \text{-} \\ \hline \text{SHOPPING-GD} & 91.63\% & 86.98\% & \text{-} \\ \text{SHOPPING-Analytical} & 87.89\% & 86.38\% & \text{-} \\ \hline \end{array}

Small Sample Experiments

From the table above, we can see that on the training set the analytical solution is usually inferior to gradient descent (GD), yet on the validation and test sets it is close to, or even exceeds, GD. Overall, the gap between the analytical solution's training and validation performance is smaller, suggesting it may generalize better and be better suited to scenarios with little data or a distribution mismatch between the training and validation sets.

To verify this hypothesis, we kept only 1000 training samples for each dataset and continued the experiment:

\begin{array}{c|ccc} \hline & \text{Train Acc} & \text{Val Acc} & \text{Test Acc} \\ \hline \text{SENTIMENT-1K-GD} & 99.90\% & 66.08\% & 66.79\% \\ \text{SENTIMENT-1K-Analytical} & 100.00\% & 72.67\% & 73.24\% \\ \hline \text{IFLYTEK-1K-GD} & 99.47\% & 15.43\% & \text{-} \\ \text{IFLYTEK-1K-Analytical} & 99.47\% & 15.70\% & \text{-} \\ \hline \text{TNEWS-1K-GD} & 100.00\% & 22.47\% & \text{-} \\ \text{TNEWS-1K-Analytical} & 100.00\% & 26.74\% & \text{-} \\ \hline \text{SHOPPING-1K-GD} & 100.00\% & 49.82\% & \text{-} \\ \text{SHOPPING-1K-Analytical} & 100.00\% & 65.49\% & \text{-} \\ \hline \end{array}

It can be seen that when training data is reduced, the training accuracy gap narrows, but the analytical solution's validation performance surpasses gradient descent across the board. This further highlights the excellent generalization performance of the analytical solution in few-shot scenarios.

Comprehensive Review

This conclusion is not difficult to understand. For the same linear model, the analytical solution adds the assumption that "the samples of each class follow normal distributions with the same covariance matrix" on top of what gradient descent requires. When data is plentiful, each class's distribution can already be estimated well from the data itself, so any mismatch between the real data and this assumption becomes the dominant source of error, and gradient descent, which adapts directly to the data, comes out ahead. Conversely, when data is scarce, estimating each class's distribution is itself difficult; the assumption then acts as useful prior information that helps the model generalize "from points to surfaces," whereas gradient descent may fail to generalize for lack of such a prior.

In other words, with little data, gradient descent merely memorizes the few samples it sees and stops there, with no capacity for "analogy." The analytical solution, through its extra assumptions, effectively "creates" more samples for the model to memorize, and thus learns more. When data is abundant, gradient descent memorizes many real samples and "practice makes perfect," whereas the analytical solution is still "creating" samples from its own assumptions; at that point the created samples are inferior to the real ones, so performance may drop.

Is there room to improve the analytical solution? A direct idea is to estimate $\ln \frac{p(j|\boldsymbol{x})}{p(i|\boldsymbol{x})}$ more precisely and then turn the problem into a linear regression for parameter estimation. There are quite a few ways to estimate $\ln \frac{p(j|\boldsymbol{x})}{p(i|\boldsymbol{x})}$ better, such as assuming normal distributions with unequal covariances, or even using kernel density estimation. I leave these for interested readers to explore.

Summary

This article introduced an analytical solution for logistic regression and generalized it to the single-layer Softmax classification scenario. Experiments show that this analytical solution has better generalization capability than gradient descent, particularly producing better results in few-shot scenarios.