
  • firthlogit + average marginal effects, but convergence problem

    Dear Statalist community,

    I have a very small sample (n = 98) from experimental data, and one of my logit models drops some observations because of perfect prediction of the outcome variable.

    When trying to find a solution in this forum, I first read #2 . Based on the suggestions from that thread, I tried to run -exlogistic-, but it required more memory than is available on my computer. I then used Joseph Coveney's -firthlogit- command instead:

    Code:
    firthlogit Y c.X1##i.Treatment i.Treatment##c.X2 $ControlVars
    This worked, and I then tried to find a way to estimate the average marginal effects of X1. I found #2 with code by Joseph Coveney addressing exactly my question. Applied to my example, I ran:

    Code:
    firthlogit Y c.X1##i.Treatment i.Treatment##c.X2 $ControlVars
    tempname B
    matrix define `B' = e(b)
    logit Y c.X1##i.Treatment i.Treatment##c.X2 $ControlVars, robust asis from(`B', copy) iterate(0)
    Unfortunately, this would not converge:

    Code:
    Iteration 0:  Log pseudolikelihood = -49.828919  
    convergence not achieved
    Do you have any ideas what I can do instead?

  • #2
    I don't think the goal was to achieve convergence. What Joseph wants to do is transfer the estimates from firthlogit to logit, the latter of which is supported by the margins command.
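    Putting the pieces together, the intended workflow (a sketch using the original poster's model specification) is to refit with -logit- at the firthlogit estimates without iterating, and then call -margins-:

    Code:
    * Fit the penalized-likelihood (Firth) logit
    firthlogit Y c.X1##i.Treatment i.Treatment##c.X2 $ControlVars

    * Transfer the estimates to -logit- without iterating, so that
    * -margins- (which supports logit) can be used afterward
    tempname B
    matrix define `B' = e(b)
    logit Y c.X1##i.Treatment i.Treatment##c.X2 $ControlVars, ///
        robust asis from(`B', copy) iterate(0)

    * Average marginal effect of X1, evaluated at the firthlogit estimates
    margins, dydx(X1)
    The "convergence not achieved" message is expected here: iterate(0) deliberately prevents -logit- from moving away from the firthlogit estimates.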

    Comment


    • #3
      Ok, but if the model does not converge, are the average marginal effects produced by the -margins- command then reliable?

      Comment


      • #4
        I wouldn't trust any model that has not converged. See pages 2+ of this handout and see if any of the tips help you.

        https://academicweb.nd.edu/~rwilliam/xsoc73994/L02.pdf
        -------------------------------------------
        Richard Williams, Notre Dame Dept of Sociology
        StataNow Version: 19.5 MP (2 processor)

        EMAIL: [email protected]
        WWW: https://academicweb.nd.edu/~rwilliam/

        Comment


        • #5
          Originally posted by Kerstin Schmidt View Post
          Ok, but if the model does not converge, are the average marginal effects produced by the -margins- command then reliable?
          Andrew is correct. That nonconvergence message is an artifact of setting the iterations to zero. The fit has converged to the estimates given by firthlogit.

          But there's another matter relating to your question about the reliability of the output produced by margins here: in cases of complete or quasicomplete separation, the curvature of the likelihood profile can be very asymmetric, and use of the asymptotic standard errors (which imposes symmetry) to derive test statistics might be inaccurate.

          The point estimates that margins gives you here will reflect the maximum penalized likelihood derived estimates, and so there's no problem with those if those are what you're mainly interested in.

          But I recommend taking a look at the profile likelihood of the coefficient(s) affected by separation before relying upon the Wald test statistics and associated confidence intervals that margins carries forward. If the profile likelihood looks highly askew, then the test statistics and confidence intervals reported by margins might not be reliable.

          Comment


          • #6
            I assume by "experiment" you mean that assignment to treatment was randomized. In that case, for the average treatment effect, you don't need to include controls: just regress Y on the treatment dummy. Even with n = 98, if there's balance between the treated and control groups, the robust standard error should work pretty well, and that gives you a confidence interval.

            There are two reasons to include covariates. One is to improve efficiency -- as shown in Negi and Wooldridge (2021, Econometric Reviews). But this is only asymptotic efficiency, and it may cause distortions with small n. If this is the purpose, you can use linear regression even though Y is binary.

            The other is to obtain moderating effects. But with an RCT, you don't need extra control variables. Just include those you care about: X1 and X2. With n = 98 you might be asking too much of your data. As an approximation -- often a good one -- you can use linear regression with X1 and X2. If these are binary, the specification is fully saturated and a linear model isn't restrictive. So if it's an RCT, I'd use linear regression of Y on Treat, and then include X1, X2, and their interactions with Treat in a linear regression.

            Comment


            • #7
              Thank you all very much for your comments!
              Joseph Coveney: How do I check for the profile likelihood of the coefficient(s)?
              Jeff Wooldridge: Interesting, never thought I could use OLS with experimental data with a binary outcome variable. You write "I'd use linear regression of Y on Treat". Why do you think it is better to use OLS here and not a binary logit?

              Comment


              • #8
                Originally posted by Kerstin Schmidt View Post
                How do I check for the profile likelihood of the coefficient(s)?
                You use the constraint command to fit the model at various values of the coefficient, collect the corresponding log likelihood values and plot them.
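                As a rough sketch of that procedure for the original model (this assumes -firthlogit- accepts the constraints() option and leaves the penalized log likelihood in e(ll), and the grid of candidate values is arbitrary and should be widened or narrowed to suit your estimates):

                Code:
                * Profile the penalized log likelihood for the coefficient on X1
                generate double b_grid = .
                generate double ll_grid = .
                local row 0
                forvalues b = -2(0.2)2 {               // arbitrary grid of candidate values
                    constraint define 1 _b[X1] = `b'
                    capture quietly firthlogit Y c.X1##i.Treatment ///
                        i.Treatment##c.X2 $ControlVars, constraints(1)
                    if !_rc {
                        local ++row
                        quietly replace b_grid = `b'    in `row'
                        quietly replace ll_grid = e(ll) in `row'
                    }
                }
                line ll_grid b_grid, sort ytitle("Penalized log likelihood") ///
                    xtitle("Constrained coefficient on X1")
                If the resulting curve is roughly symmetric around its maximum, the symmetric Wald-based intervals are more defensible; a visibly skewed curve is the warning sign described in #5.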

                I illustrate the general procedure in the attached do-files (and their SMCL log files) for two of the example datasets available on SSC as ancillary files for the firthlogit user-written command: the dataset in asseryanis.do, which illustrates the procedure for a categorical predictor, and the dataset in collett.do, which illustrates it for a continuous predictor. The illustrations are in the attached Asseryanis Dataset.do and Collett Dataset.do and their log files.

                For comparison, and as a reference for judgment, I illustrate how symmetric the plot can be expected to be for a well-behaved predictor in an artificial dataset. That is in Synthetic Dataset.do and its log file.

                Along with the do-files and their log files, I've attached the corresponding plots.

                [Attached plots: Asseryanis Dataset.png, Collett Dataset.png, Synthetic Dataset.png]


                Note: the forum's software balks at my attempt to attach more than five files, and so I've had to attach all of the do-files and one of the log files to a follow-on post.
                Attached Files

                Comment


                • #9
                  Originally posted by Joseph Coveney View Post
                  the forum's software balks at my attempt to attach more than five files, and so I've had to attach all of the do-files and one of the log files to a follow-on post.
                  They're attached here.
                  Attached Files

                  Comment


                  • #10
                    Joseph Coveney: Thank you sooo much! For my predictor variable the profile likelihood is not skewed, which should make my test statistics and confidence intervals reported by margins reliable:

                    [Attached plot: Test.JPG]

                    Comment


                    • #11
                      Originally posted by Kerstin Schmidt View Post
                      Thank you all very much for your comments!
                      Joseph Coveney: How do I check for the profile likelihood of the coefficient(s)?
                      Jeff Wooldridge: Interesting, never thought I could use OLS with experimental data with a binary outcome variable. You write "I'd use linear regression of Y on Treat". Why do you think it is better to use OLS here and not a binary logit?
                      Maybe this is not the case in your field, but in economics we focus on the treatment effect on the probability: for example, what is the change in the employment probability due to a job training program? If the intervention is randomized, you can always use the difference in employment rates between the treated and untreated. That is the coefficient on Treat in the linear regression of Y on a constant and Treat. If, instead, you use a logit, you will get EXACTLY the same estimate of the effect on the probability. That's because the regression is trivially saturated, and all we're doing is estimating the "success" probabilities for the different subgroups (treated and control in this case). If you want the effect on the log-odds ratio, you'd still get the same estimates using linear versus logit.
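                      A minimal sketch of this equivalence, using the thread's variable names:

                      Code:
                      * Difference in success rates = coefficient on Treatment in OLS
                      regress Y i.Treatment, vce(robust)

                      * The saturated logit reproduces the same probability difference
                      logit Y i.Treatment
                      margins, dydx(Treatment)
                      The point estimate from -margins- after the saturated logit matches the OLS coefficient on Treatment, since both are just the difference in subgroup success rates.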

                      When you add covariates, things will differ. But for the average effect, both linear regression and logit are consistent. Where they generally differ is in the so-called moderating effects. Maybe logit is better here, but there's no guarantee. The linear model can in some cases provide a better approximation.

                      Even if you solve the logit convergence problems, I'd still use linear regression. There are no convergence problems there unless you have perfect collinearity.

                      In looking at your first post again, it occurs to me that not centering the variables X1 and X2 about their sample averages before including them might be contributing to the convergence problems. If one or both cannot be zero, failure to center often results in severe collinearity (and makes the coefficients hard to interpret; of course, that goes away with margins, but you may never get to margins unless you center).
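                      A sketch of that centering step before refitting (the centered variable names X1c and X2c are my own, not from the thread):

                      Code:
                      * Center the continuous predictors at their sample means
                      summarize X1, meanonly
                      generate double X1c = X1 - r(mean)
                      summarize X2, meanonly
                      generate double X2c = X2 - r(mean)

                      * Refit with the centered variables; the Treatment coefficients are
                      * now evaluated at the sample means of X1 and X2
                      logit Y c.X1c##i.Treatment i.Treatment##c.X2c $ControlVars, robust
                      Centering changes the interpretation (and often the numerical stability) of the main-effect and interaction coefficients, but the average marginal effects from -margins- are unaffected by it.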

                      Comment


                      • #12
                        Very interesting discussion and I thank you all for contributing to this!

                        In my case, the initial nonconvergence message is no longer a problem, because "That nonconvergence message is an artifact of setting the iterations to zero. The fit has converged to the estimates given by firthlogit" (see #5), and my profile likelihood is not skewed. As I am reporting all three models ((1) univariate regression, (2) with interactions, (3) with covariates), I should be transparent about these issues.

                        Comment
