  • Linear probability model vs. logistic regression

    This is not a question immediately related to Stata, but perhaps there are experts who nevertheless can help:

    I am involved in a discussion about the use of a linear probability model (LPM) instead of a logistic regression model for cross-sectional data. The variables are a binary outcome (0/1; 6.5% and 8.8% = 1 in the two groups), two groups (4,100 and 8,400 cases), and 10 predictors (3 binary, 2 three-categorical, 5 quasi-continuous). One model aims to test the interaction of 2 quasi-continuous variables and the interaction of 1 binary with 1 continuous variable. The research question: Do the effects of the predictors (model 1) or the interactions (model 2) differ between the two groups (identical models per group)?

    I am advocating a binary logistic regression model while my opponent is arguing for an LPM because (a) we are studying rare events (6.5% and 8.8% in groups A and B, respectively), and (b) the interpretation of the interaction effects is more straightforward. My opponent came up with a paper by Timoneda (2021), who argues that the LPM outperforms logistic regression when estimating group fixed effects in panel data with a binary dependent variable. My counterargument: The paper discusses time-series cross-sectional data with many groups (likely to result in the "incidental parameter problem"), but our data are not time-series data and we have a relatively small number of predictors. Precisely because the outcome is binary (and comparatively rare), we should use logistic regression. The problem of interpreting interaction effects should be handled using AMEs (if a single coefficient is required) or, better, using plots of predicted values.

    However, I am far from an expert in time-series analysis (and even that is an overstatement). Hence my question: Is there anyone in the Stata Forum who can help shed light on the issue?
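    In Stata, the logistic route sketched above might look as follows; this is a minimal sketch with hypothetical variable names (y for the 0/1 outcome, c1 and c2 quasi-continuous, b1 binary, group the group indicator), not the actual model:

    ```stata
    * Minimal sketch, hypothetical names; one model per group as described.
    logit y c.c1##c.c2 i.b1##c.c1 if group == 1
    margins, dydx(*)                          // AMEs as single-number summaries
    margins, at(c1 = (0(2)10) c2 = (2 5 8))   // predicted probabilities
    marginsplot                               // a plot rather than one coefficient
    ```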

    Reference: Timoneda, J. C. (2021). Estimating group fixed effects in panel data with a binary dependent variable: How the LPM outperforms logistic regression in rare events data. Social Science Research, 93, 102486. https://doi.org/10.1016/j.ssresearch.2020.102486

  • #2
    thanks for the cite - however, the citation appears to only deal with TSCS data, and, if I understand your post correctly, that is not what you have - so I'm not sure it's relevant

    second, I think that the rare events issue is not particularly relevant anyplace

    third, and most important, logistic regression and LPM answer different questions and any comparison on purely stat grounds is misplaced - are you interested in an OR or a risk difference?

    • #3
      Some time ago I ran a simulation to test whether interaction effects with binary outcomes are better estimated with an LPM or a logit model. I found that the difference is rather small as long as the sample size is large. However, I did not study rare events. See: https://osf.io/preprints/socarxiv/2cdu4_v1
      Best wishes

      Stata 18.0 MP | ORCID | Google Scholar

      • #4
        Answering Rich Goldstein: We are interested in the difference of the effects of the predictors (or the interaction effects) on the outcome between the two groups (with identical models). In the case of logistic regression models that would be the ORs; in the case of an LPM, the b-coefficients. To illustrate the differences we would then compare predicted probabilities for specific values of the predictors. Does this answer your question?

        Felix Bittmann: Thanks. If your results also hold with (comparatively) rare events, I would interpret your findings as showing that the choice between an LPM and logistic regression makes no substantial difference. If I read the paper correctly, the means of the dichotomous outcome variable were .50, .65, and .80 (or .50, .35, and .20). But I still wonder whether the simulation results would also hold for a mean of the outcome variable of .10 or less (down to .05).

        • #5
          sorry, but no it does not - ORs and RDs are different estimands - your main goal should be one of these (or a different estimand such as the RR), not both

          • #6
            Dirk Enzmann: Yes, these three means were tested, so not really rare events. I have excluded this topic from the paper, as it is a rather different question where other approaches might be better (e.g. firthlogit).
            I would also say: why not use both approaches and compare the results? Yet Rich Goldstein makes a relevant point that ORs and RDs are interpreted differently, and you should think about what fits your needs best. I would argue that ORs are difficult to interpret in an interaction setting, yet there is a rather new approach of marginal odds ratios. This paper describes how one can combine ORs and interaction analyses; the Stata ado (lnmor) is also available: https://sociologicalscience.com/articles-v10-10-332/
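            For what it is worth, a hedged sketch of how lnmor might be used (hypothetical variable names; I am going from the paper, so check help lnmor for the exact syntax):

            ```stata
            * Marginal odds ratios after logit, per Jann & Karlson (hypothetical names).
            ssc install lnmor
            logit y i.b1 c.c1 c.c2
            lnmor i.b1 c.c1        // marginal odds ratios for these terms
            ```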
            Best wishes

            Stata 18.0 MP | ORCID | Google Scholar

            • #7
              Related to Rich Goldstein's point, Buis (2010) might be an interesting read.

              Buis, M. L. 2010. Stata tip 87: Interpretation of interactions in nonlinear models. The Stata Journal,10(2), pp. 305--308.

              • #8
                I have championed logistic regression and odds ratios many times on this list, but if you want to interpret the logistic regression results only with AMEs (in other words, risk differences), then I would argue for a linear probability model. In essence, when you interpret your model with AMEs, you fit a linear model on top of your non-linear logit model. Why would anyone prefer a two-step approximation of a model that can easily be estimated in one step? Moreover, we know how to diagnose the one-step model (LPM) really well, and a whole lot of tools are readily available. We don't have standard tools to diagnose the linear model on top of the non-linear model implicit in AMEs.

                Don't get me wrong: I still really like logistic regression and odds ratios, but if you only want AMEs, then don't bother with logistic regression: that is just a two-step complication that you can do more easily and more transparently in one step (an LPM).
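                The two-step point can be made concrete; a sketch with hypothetical names (y, x1, x2): both routes target the same risk-difference estimand, one directly and one via a detour.

                ```stata
                * LPM: coefficients are risk differences in one step.
                regress y x1 x2, vce(robust)
                * Logit + AMEs: risk differences via a non-linear detour.
                logit y x1 x2
                margins, dydx(*)
                ```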
                ---------------------------------
                Maarten L. Buis
                University of Konstanz
                Department of history and sociology
                box 40
                78457 Konstanz
                Germany
                http://www.maartenbuis.nl
                ---------------------------------

                • #9
                  Maarten Buis: Thank you! But if I want to keep the advantage (?) of logistic regression over an LPM: Can't I use MORs (marginal odds ratios) (Karlson & Jann, 2023; Jann & Karlson, 2023) instead of AMEs? Note that we want to compare the coefficients or effects (AMEs or MORs) of a model with covariates (including interaction effects) across two samples (both with identical models). Shouldn't logistic regression then be preferred to an LPM?

                  And: What would be a proper way to compare the MORs or AMEs between the two samples (I don't want to complicate the model by using three-way interactions)?

                  References:
                  • Karlson, K., & Jann, B. (2023). Marginal odds ratios: What they are, how to compute them, and why sociologists might want to use them. Sociological Science, 10, 332–347. https://doi.org/10.15195/v10.a10
                  • Jann, B., & Karlson, K. B. (2023). Estimation of marginal odds ratios. Working paper, Bern. Retrieved from https://ideas.repec.org/p/bss/wpaper/44.html
                  • Jann, B., & Karlson, K. B. (2023, October 27). Marginal odds ratios: What they are, how to compute them, and why applied researchers might want to use them. Presented at the 2023 Mexican Stata Conference Hermosillo, October 26–27, 2023. Retrieved from https://www.stata.com/meeting/mexico...ico23_Jann.pdf




                  Last edited by Dirk Enzmann; 08 Jul 2025, 05:39. Reason: Additional question and correction of typos

                  • #10
                    I suspect that you worry about MORs or AMEs because you want to compare models, and there is the fiction that you cannot do that with odds ratios. See for example Kuha and Mills (2018) (I make a similar argument in a working paper: https://www.maartenbuis.nl/wp/odds_ratio_3.1.pdf ). So if that is the problem you face, then MORs and AMEs are solutions to non-problems. I would rather not complicate my analysis by adding solutions to non-problems. So as far as I am concerned (and not everybody agrees with me), we can leave the MORs and AMEs out of the discussion and focus solely on comparing the odds ratios from logistic regression and the risk differences from an LPM.

                    The LPM can get a bit tricky if you have a lot of 1s or a lot of 0s, because you can then easily get predictions greater than 1 or less than 0. Other than that, it just depends on what measure you want to show.

                    I assume that with "compare" you mean doing inference. I would do a three-way interaction; I would then have to think very hard about how to present that, e.g. with this Stata tip: https://www.stata-journal.com/articl...article=st0250
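                    One way to set up that inference; a sketch with hypothetical names (y, c1, c2, and a group indicator), pooling the samples so that comparing the c1-by-c2 interaction across groups becomes a three-way term:

                    ```stata
                    logit y i.group##c.c1##c.c2
                    * Does the c1*c2 interaction differ between the groups?
                    test 1.group#c.c1#c.c2
                    * Or compare on the probability scale:
                    margins group, dydx(c1) at(c2 = (2 5 8))
                    ```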

                    Kuha, J., & Mills, C. (2018). On Group Comparisons With Logistic Regression Models. Sociological Methods & Research, 49(2), 498-525. https://doi.org/10.1177/0049124117747306
                    ---------------------------------
                    Maarten L. Buis
                    University of Konstanz
                    Department of history and sociology
                    box 40
                    78457 Konstanz
                    Germany
                    http://www.maartenbuis.nl
                    ---------------------------------

                    • #11
                      My approach would be to start with the notions that (a) Prob(y=1|x)=f(x), and (b) there is a correct but typically unknown specification of the functional form of f(x).

                      It could be exp(xb)/(1+exp(xb)) (logit).

                      It could be xb (linear probability).

                      It could be normal(xb) (probit).

                      It could be—and probably is—something else.

                      ORs, AMEs, etc. are correctly specified if they are derived from the correct specification of f(x). Otherwise all bets are off.

                      Of course one might reasonably counter that all models are wrong but some are useful.
                      https://en.wikipedia.org/wiki/All_models_are_wrong
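                      In that spirit, one could simply fit the same specification under the three candidate forms and compare the fitted probabilities; a sketch with hypothetical names (y, x1, x2):

                      ```stata
                      logit y x1 x2
                      predict p_logit, pr
                      probit y x1 x2
                      predict p_probit, pr
                      regress y x1 x2
                      predict p_lpm          // linear predictions; can fall outside [0, 1]
                      summarize p_logit p_probit p_lpm
                      ```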
                      Last edited by John Mullahy; 09 Jul 2025, 08:42.

                      • #12
                        Paul Allison argues that logit + margins gives you the best of both possible worlds:

                        https://statisticalhorizons.com/in-d...-logit-part-2/

                        Depending on your purposes, the LPM may be fine, but if you want marginal effects at representative values (e.g., predictions at high, medium, and low values of income), logit + margins may be much better, because the LPM may yield improbable or even impossible predictions.
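                        A sketch of such predictions at representative values (hypothetical names y, income, b1; the at() values are made up):

                        ```stata
                        logit y c.income i.b1
                        margins, at(income = (10 50 150))   // low, medium, high income
                        marginsplot                         // predictions stay inside [0, 1]
                        ```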
                        -------------------------------------------
                        Richard Williams, Notre Dame Dept of Sociology
                        StataNow Version: 19.5 MP (2 processor)

                        EMAIL: [email protected]
                        WWW: https://www3.nd.edu/~rwilliam

                        • #13
                          What would be a proper way to compare the MORs or AMEs between the two samples (I don't want to complicate the model by using three-way interactions)?
                          I'm a big fan of this paper by Mize, Doan, and Long, "A General Framework for Comparing Predictions and Marginal Effects across Models":

                          https://journals-sagepub-com.proxy.l...81175019852763

                          Abstract
                          Many research questions involve comparing predictions or effects across multiple models. For example, it may be of interest whether an independent variable’s effect changes after adding variables to a model. Or, it could be important to compare a variable’s effect on different outcomes or across different types of models. When doing this, marginal effects are a useful method for quantifying effects because they are in the natural metric of the dependent variable and they avoid identification problems when comparing regression coefficients across logit and probit models. Despite advances that make it possible to compute marginal effects for almost any model, there is no general method for comparing these effects across models. In this article, the authors provide a general framework for comparing predictions and marginal effects across models using seemingly unrelated estimation to combine estimates from multiple models, which allows tests of the equality of predictions and effects across models. The authors illustrate their method to compare nested models, to compare effects on different dependent or independent variables, to compare results from different samples or groups within one sample, and to assess results from different types of models.

                          For software and examples, see

                          https://www.trentonmize.com/software/mecompare
                          -------------------------------------------
                          Richard Williams, Notre Dame Dept of Sociology
                          StataNow Version: 19.5 MP (2 processor)

                          EMAIL: [email protected]
                          WWW: https://www3.nd.edu/~rwilliam
