04 Jan 2014, 22:02

12 ways of looking at logistic regression


Logistic regression is arguably the most popular off-the-shelf classification technique. In fact, it’s so popular that every subfield has its own interpretation of it, so practitioners can keep it in their toolkit and compare it, within a single theoretical framework, against the methods they’re actually researching. Below I’ve tried to reproduce some of the many interpretations I’ve heard. Bear in mind that I’m not an expert in all of these fields, so some of the nuance might be lost.

  • The classical statistical interpretation. Each label is a draw from a binomial (Bernoulli) distribution whose success probability depends on that example’s features. You want to estimate that distribution, typically by maximum likelihood.

  • The Bayesian statistical interpretation. In addition to the above, your parameter estimates are themselves probabilistic beliefs with their own distributions. If you could encode a totally non-informative, zero-strength prior, this ends up being more or less the same as the frequentist interpretation.

  • The latent-variable interpretation, popular with social scientists and psychologists. There is some latent continuous variable that determines the outcome, depending on which side of the threshold it falls on, but we can only see the final outcome. Your goal is to estimate the parameters that determine this latent variable as closely as possible.

  • The Kentucky Derby interpretation. Your parameters, once exponentiated, represent multiplicative effects on the odds (as in, a 4:1 bet). Your goal is to find the effect of each feature so that the combined odds match the outcomes you observed.

  • The less-Naive-Bayes interpretation. Like Naive Bayes, but the feature weights are fitted jointly, so correlated features aren’t double-counted, instead of assuming the features are independent given the label.

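  • The information-theory interpretation. Find parameters so that, conditional on the features, the output label distribution has maximum entropy, subject to the constraint that the model reproduces the feature averages observed for each label in the data.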

  • The warp-space interpretation. Perform a kind of quasi-linear-regression in a space where we transform the label dimension via an inverse sigmoid.

  • The loss minimization interpretation. You classify an example by dotting its features with your parameters and applying a sigmoid. You have a loss function that penalizes every prediction according to how far it is from the true label, with the harshest penalties going to confident predictions that turn out to be wrong. Find parameters that minimize the loss.

  • The “minimum bias” interpretation, popular with actuaries. Plot your data as a tensor, with each feature being a dimension, and the outcomes for each feature combo being summed in the appropriate cell (this only works for categorical features). Try to find parameters for each dimension, so that when you sum them together, apply a sigmoid, and multiply by the cell population, you get minimal binomial loss.

  • The neural network interpretation. Your features constitute a stimulus, dot-product’d with your parameters and fed through a sigmoid activation function to get a predicted label. You’re maximizing the fidelity with which your neuron can “remember” the label for the data it has seen.

  • The support vector interpretation. Take your data, and try to segment it with a hyperplane. For each point, apply a “supporting vector force” to the plane, proportional to the logit of the distance. When the forces balance, your hyperplane gives you your parameters.

  • The feedback interpretation. We initialize our parameters to some garbage values. For each observation, we dot the features with our parameters and apply a sigmoid to get a predicted probability. If the outcome in our data is positive, move the parameter vector “forwards”, in the direction of the feature vector. If the outcome is negative, move it “backwards”, in the reverse direction of the feature vector. Either way, the step is bigger the more wrong the prediction was, and close to zero when the prediction was already confidently right. This corresponds to the stochastic gradient descent fitting procedure.
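
Several of these interpretations can be seen side by side in a few lines of code. What follows is only a minimal sketch under my own assumptions, not a reference implementation: the data is synthetic, the names (sigmoid, log_loss, fit_sgd, TRUE_W) are made up for illustration, and the parameters start at zero rather than at garbage values. It fits by the feedback rule, checks that the loss went down, and then reads the slope as an odds multiplier.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def log_loss(w, X, y):
    # Loss minimization view: average binomial (log) loss over the data.
    total = 0.0
    for x, label in zip(X, y):
        p = min(max(sigmoid(dot(w, x)), 1e-12), 1.0 - 1e-12)
        total -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return total / len(X)


def fit_sgd(X, y, lr=0.1, epochs=50, seed=0):
    # Feedback view: nudge the parameters after every observation.
    rng = random.Random(seed)
    w = [0.0] * len(X[0])  # assumption: start at zero, not garbage
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            # err > 0 for positive labels (move toward x),
            # err < 0 for negative labels (move away from x).
            err = y[i] - sigmoid(dot(w, X[i]))
            w = [wi + lr * err * xi for wi, xi in zip(w, X[i])]
    return w


# Synthetic data (an assumption for illustration): one feature plus a
# constant 1.0 column that acts as the intercept.
TRUE_W = [-0.5, 1.5]
data_rng = random.Random(42)
X = [[1.0, data_rng.uniform(-2.0, 2.0)] for _ in range(500)]
y = [1 if data_rng.random() < sigmoid(dot(TRUE_W, x)) else 0 for x in X]

w = fit_sgd(X, y)
print("fitted parameters:", w)
print("loss at zero:", log_loss([0.0, 0.0], X, y))
print("loss after fitting:", log_loss(w, X, y))
# Kentucky Derby view: each unit of the feature multiplies the odds by this.
print("odds multiplier per unit of feature:", math.exp(w[1]))
```

The update inside fit_sgd is the gradient of the log loss for a single observation, which is why the feedback and loss minimization views land on the same parameters. I used a constant step size to keep the sketch short, so the fitted values will jitter around the optimum rather than settle exactly on it.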

There are probably more that I missed. Even though everyone has a similar basic toolkit, there seems to be very little cross-domain pollination on extensions, even between similar fields like machine learning and statistics, or statistics and actuarial science. Maybe that’s because everyone is speaking their own dialect, and is content to restrict their conversations to native speakers.