The logit model, better known as logistic regression, is a binomial regression model. It relates a vector of random variables (the explanatory variables) to a binomial random variable (the outcome). Logistic regression is a special case of the generalized linear model and is widely used in machine learning.

## Applications of Logit Model

Logistic regression is used in many fields. Some examples:

- In medicine, it helps identify the factors that distinguish a group of sick subjects from healthy subjects.
- In insurance, it makes it possible to target the fraction of customers likely to be receptive to a policy covering a particular risk.
- In banking, it is used to detect risk groups among loan applicants.
- In econometrics, it is used to explain a discrete variable, for example voting intentions in elections.

## The explanation of the Model

We denote by Y the variable to predict and by X = (X_{1}, X_{2}, …, X_{n}) the predictive (explanatory) variables. In binary logistic regression, Y takes one of two possible values, {1, 0}. The variables X_{j} are exclusively continuous or binary.

P(Y = 1) denotes the probability that Y takes the value 1, and P(Y = 0) the probability that Y takes the value 0. P(X|1) is the conditional distribution of X given that Y = 1. P(X|0) is defined in the same way for Y = 0.

The posterior probability that Y takes the value 1 given the value taken by X is written P(1|X). P(0|X) is defined in the same way.

## Fundamental Hypothesis about Logit Models

Logistic regression rests on the following fundamental assumption, in which one can recognize the so-called “evidence” measure, Ev(p) = ln(p / (1 – p)):

ln(P(X|1)/P(X|0)) = a_{0} + a_{1}X_{1} + a_{2}X_{2} + … + a_{n}X_{n}

*Here, ln denotes the natural logarithm (the logarithm to base e, Euler's number).*

A large class of distributions meets this specification, including the multivariate normal distribution used in linear discriminant analysis.

Compared with discriminant analysis, it is no longer the conditional densities P(X|1) and P(X|0) themselves that are modeled, but only their ratio. The restriction introduced by this assumption is therefore weaker.

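The specification above can be written differently. By Bayes' theorem, ln(P(1|X)/P(0|X)) = ln(P(Y = 1)/P(Y = 0)) + ln(P(X|1)/P(X|0)), and P(0|X) = 1 – P(1|X). The logit of P(1|X) is therefore also linear in X:

ln(P(1|X)/(1 – P(1|X))) = b_{0} + b_{1}X_{1} + b_{2}X_{2} + … + b_{n}X_{n}

where b_{0} = a_{0} + ln(P(Y = 1)/P(Y = 0)) and b_{j} = a_{j} for j ≥ 1.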

Two points are worth noting here:

- It is indeed a “regression”, because we want to establish a dependency between a variable to be explained and a set of explanatory variables.
- It is a “logistic” regression, because the probability is modeled with the logistic distribution function.

Indeed, after transformation of the equation above, we get

P(1|X) = e^{b_{0} + b_{1}X_{1} + … + b_{n}X_{n}} / (1 + e^{b_{0} + b_{1}X_{1} + … + b_{n}X_{n}})
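As a minimal sketch of this transformation, the Python snippet below computes the logit and its inverse. The function names and the coefficient values are assumptions made for illustration only; they do not come from a fitted model.

```python
import math

def logit(p):
    """Log-odds ln(p / (1 - p)) of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def predict_proba(x, b):
    """P(1|X) for features x and coefficients b = [b0, b1, ..., bn]."""
    z = b[0] + sum(bj * xj for bj, xj in zip(b[1:], x))
    # Numerically stable evaluation of e^z / (1 + e^z).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Hypothetical coefficients and observation, chosen only for illustration.
b = [-1.0, 0.8, 2.0]
x = [0.5, 1.0]
p = predict_proba(x, b)

# Applying the logit to P(1|X) recovers the linear combination b0 + b1*x1 + b2*x2.
linear_part = logit(p)
```

Here the linear combination is -1.0 + 0.8 × 0.5 + 2.0 × 1.0 = 1.4, and `logit(p)` returns that value up to floating-point error, which illustrates that the two forms of the model are equivalent.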

## Confusion matrix – Evaluation of Logit Models

The goal is to produce a model that predicts the values of a categorical variable Y as accurately as possible. A natural way to evaluate the quality of the model is to compare the predicted values with the true values taken by Y: this is the role of the confusion matrix. From it we derive a simple indicator, the error rate (or misclassification rate), which is the ratio of the number of wrong predictions to the sample size.
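The confusion matrix and the error rate can be sketched in a few lines of Python. The helper names and the small label vectors are hypothetical and only serve to illustrate the counting.

```python
def confusion_matrix(y_true, y_pred):
    """Return the counts (tp, fp, fn, tn) for binary labels in {0, 1}."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1  # true positive
        elif t == 0 and p == 1:
            fp += 1  # false positive
        elif t == 1 and p == 0:
            fn += 1  # false negative
        else:
            tn += 1  # true negative
    return tp, fp, fn, tn

def error_rate(y_true, y_pred):
    """Number of wrong predictions divided by the sample size."""
    tp, fp, fn, tn = confusion_matrix(y_true, y_pred)
    return (fp + fn) / (tp + fp + fn + tn)

# Made-up true and predicted labels for eight observations.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_matrix(y_true, y_pred)  # (3, 1, 1, 3)
rate = error_rate(y_true, y_pred)          # 0.25
```

With these labels, two of the eight predictions are wrong (one false positive and one false negative), so the error rate is 0.25.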

When the confusion matrix is built on the same data that was used to fit the model, the error rate is often too optimistic and does not reflect the actual performance of the model in the population. For an unbiased evaluation, it is advisable to build the matrix on a separate sample, called the test sample, which, unlike the learning sample, took no part in the construction of the model.

The main advantage of this method is that it applies to any classification method, which makes it possible to compare methods and select the one that performs best on a given problem.

## Statistical Evaluation of Logistic Regression

It is also possible to test the validity of the model within a probabilistic framework. These tests rely on the asymptotic distribution of the maximum likelihood estimators and use hypothesis testing to evaluate the logit model.
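To make this concrete, here is a rough sketch of one such test, the likelihood-ratio test, for a model with a single explanatory variable. The choice of this particular test, the plain gradient-ascent fitting routine, the function names and the toy data are all assumptions made for illustration. In practice a statistical package would fit the model and report the test.

```python
import math

def sigmoid(z):
    """Logistic function e^z / (1 + e^z), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def log_likelihood(b0, b1, xs, ys):
    """Bernoulli log-likelihood of the model P(1|x) = sigmoid(b0 + b1*x)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def fit_max_likelihood(xs, ys, step=0.1, n_iter=20000):
    """Maximize the log-likelihood by plain gradient ascent (illustrative only)."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        residuals = [y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)]
        b0 += step * sum(residuals) / n
        b1 += step * sum(r * x for r, x in zip(residuals, xs)) / n
    return b0, b1

# Made-up, non-separable toy data.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_max_likelihood(xs, ys)

# Null model: intercept only, whose maximum likelihood estimate is logit(mean(y)).
p_bar = sum(ys) / len(ys)
ll_null = log_likelihood(math.log(p_bar / (1.0 - p_bar)), 0.0, xs, ys)
ll_full = log_likelihood(b0, b1, xs, ys)

# Under the null hypothesis b1 = 0, this statistic is asymptotically
# chi-square distributed with 1 degree of freedom.
lr_stat = 2.0 * (ll_full - ll_null)
```

A large value of `lr_stat` relative to the chi-square distribution with 1 degree of freedom is evidence against the hypothesis that the explanatory variable has no effect. With only eight observations the asymptotic approximation is of course crude.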

## Other ways to evaluate the performance of a Logit Model

Other evaluation procedures are commonly cited for logistic regression. One of them is the Hosmer-Lemeshow test, which uses the “score” (the predicted probability of belonging to a group) to order the observations. In this respect it resembles other evaluation methods such as ROC curves, which carry considerably more information than the simple confusion matrix and its associated error rate.
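As an illustration of how the score can be used to order observations, the sketch below computes one point of a ROC curve for a given threshold and the area under the curve (AUC) by comparing every positive observation with every negative one. The function names and the example scores are hypothetical.

```python
def roc_point(y_true, scores, threshold):
    """Return (false positive rate, true positive rate) when predicting 1 if score >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    return fp / n_neg, tp / n_pos

def auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count for one half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels and scores (predicted probabilities of class 1).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

point = roc_point(y_true, scores, threshold=0.5)  # (0.0, 0.5)
area = auc(y_true, scores)                        # 0.75
```

Sweeping the threshold from high to low traces the full ROC curve. An AUC of 0.5 corresponds to a random ordering and an AUC of 1.0 to a perfect ordering of the observations.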

## Variants of Logistic Regression

Logistic regression applies directly when the explanatory variables are continuous or dichotomous. Categorical variables with more than two categories must first be encoded. The simplest scheme is binary (dummy) coding. Take the example of a habitat variable with three categories {city, periphery, others}. We create two binary variables, “habitat_city” and “habitat_periphery”. The last category is deduced from the other two: when both variables are 0, the observation corresponds to “habitat = others”.
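The binary coding of the habitat example can be sketched as follows. The function name `encode_habitat` and the English variable names are assumptions made for this illustration.

```python
def encode_habitat(value, categories=("city", "periphery", "others"), reference="others"):
    """Binary (dummy) coding of a habitat value, leaving out the reference category."""
    if value not in categories:
        raise ValueError(f"unknown habitat: {value!r}")
    return {
        f"habitat_{c}": int(value == c)
        for c in categories
        if c != reference
    }

city = encode_habitat("city")      # {"habitat_city": 1, "habitat_periphery": 0}
others = encode_habitat("others")  # {"habitat_city": 0, "habitat_periphery": 0}
```

Leaving out the reference category avoids a redundant variable: with an intercept in the model, a third indicator would be an exact linear combination of the other two.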

Finally, logistic regression can also predict the values of a categorical variable with K (K > 2) categories. This is called polytomous (or multinomial) logistic regression. The procedure designates one category as the reference group and then produces (K – 1) linear combinations for the prediction. The interpretation of the coefficients is less straightforward in this case.
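A minimal sketch of the prediction step, assuming the reference group is given a linear score of 0 and each of the other (K – 1) groups has its own coefficient vector. The function name and the coefficient values are hypothetical.

```python
import math

def multinomial_proba(x, coefs):
    """Class probabilities for K classes from K - 1 coefficient vectors.

    coefs[k] = [b0, b1, ..., bn] for non-reference class k + 1.
    Index 0 of the result is the reference class, whose score is fixed at 0.
    """
    scores = [0.0] + [
        b[0] + sum(bj * xj for bj, xj in zip(b[1:], x)) for b in coefs
    ]
    # Subtract the maximum score before exponentiating, for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical coefficients for K = 3 classes and two explanatory variables.
coefs = [
    [0.2, 1.0, -0.5],  # class 1 versus the reference class
    [-0.3, 0.4, 0.9],  # class 2 versus the reference class
]
probs = multinomial_proba([1.0, 2.0], coefs)
```

Each coefficient vector describes the log-odds of its class relative to the reference group, which is why the coefficients are harder to interpret than in the binary case.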
