Would this be the right way to get the predicted probability from a logistic regression?

Signe Kristine

Join Date: Oct 2018

Posts: 6
#1


04 Nov 2018, 02:13

Hi everyone

We are investigating the relationship between women's education and contraceptive use in India. We are making splines for educational level and have used the following code in Stata:

Code:

mkspline edu1 5 edu2 8 edu3 12 edu4 = education

logistic everused edu1-edu4 age age2 dontknow_caste middle_caste high_caste muslim christian other poorer middle richer richest, robust
adjust age age2 dontknow_caste middle_caste high_caste muslim christian other middle poorer richer richest, gen(pr1)
generate expr1=exp(pr1)
generate prob1=1/(1+expr1)

logistic currentmethod edu1-edu4 age age2 dontknow_caste middle_caste high_caste muslim christian other poorer middle richer richest, robust
adjust age age2 dontknow_caste middle_caste high_caste muslim christian other middle poorer richer richest, gen(pr2)
generate expr2=exp(pr2)
generate prob2=1/(1+expr2)

We are looking at both current use and ever use of contraceptive methods, and the graph we obtain is presented here. We are a bit surprised by the results: contraceptive use is higher than we expected.

So our question is:

Would this be the right way to get the predicted probability from a logistic regression?
Rich Goldstein

Join Date: Mar 2014

Posts: 4466
#2

04 Nov 2018, 06:29

Since the help file for -adjust- starts by saying "adjust has been superseded by margins", I wonder what version of Stata you are using (see the FAQ); you might also want to look at:

Code:

help logistic postestimation##predict
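
As a minimal sketch of what that help entry covers: after -logistic-, -predict- returns the predicted probability directly. This uses the stock lbw dataset as a stand-in, since your data are not posted, and the names p_hat, xb_hat, p_manual and p_flipped are made up for illustration. If you start from the linear prediction instead, the inverse logit is exp(xb)/(1+exp(xb)), which Stata provides as invlogit(). Note that 1/(1+exp(xb)) is the probability that the outcome is 0, not 1. Whether that matters for your code depends on what -adjust, gen()- stored in pr1, which I have not checked for your version.

```stata
* Stand-in data: the stock low-birth-weight dataset, not the thread's survey data
webuse lbw, clear
logistic low age lwt i.race smoke

* Predicted probability that low == 1 (pr is the default statistic)
predict double p_hat, pr

* Same quantity by hand: linear prediction, then the inverse logit
predict double xb_hat, xb
generate double p_manual = invlogit(xb_hat)

* 1/(1+exp(xb)) is the complement, i.e. the probability that low == 0
generate double p_flipped = 1/(1 + exp(xb_hat))

summarize p_hat p_manual p_flipped
```

In the summary, p_hat and p_manual should agree, and p_flipped should equal 1 minus p_hat.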
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#3

04 Nov 2018, 06:36

Signe:
you can also make your code more efficient using -fvvarlist- for categorical variables and interactions (as Rich implicitly reminds you).
Please note that if you create -age- and -agesq- by hand and then run -margins-, Stata will not be able to interpret -agesq- as the squared term for -age- and will treat them as two different predictors; this problem has an easy fix, which relies on -fvvarlist-:

Code:

c.age##c.age
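
A minimal sketch of how that looks in a full command, run on the stock lbw dataset rather than your data (the outcome and the other covariates are only placeholders):

```stata
* Stand-in data: the stock low-birth-weight dataset
webuse lbw, clear

* c.age##c.age expands to age and age squared
logit low c.age##c.age i.race i.smoke, nolog

* Because the square is declared in the model, margins moves age and
* its square together when age is set to each value
margins, at(age=(15(5)40)) atmeans
marginsplot
```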

Kind regards,
Carlo
(Stata 19.0)
Signe Kristine

Join Date: Oct 2018

Posts: 6
#4

04 Nov 2018, 08:47

Okay, now we tried what you said and this is our new code:

Code:

logit everused education age c.age#c.age i.dontknow_caste i.middle_caste i.high_caste i.muslim i.christian i.other i.poorer i.middle i.richer i.richest, nolog

Code:

margins, at(education=(1 5 8 12 20)) atmeans

Code:

marginsplot, noci

But the result we get doesn't make any sense:

1. Is at(education=(1 5 8 12 20)) equivalent to a spline specification?
2. It seems that probability of everuse decreases as years of education increase. This doesn't make any sense since the relationship should be positive.

Do you have any idea of what we are doing wrong/ how to get the results we are expecting?

Last edited by Signe Kristine; 04 Nov 2018, 08:51.
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#5

04 Nov 2018, 11:55

Originally posted by Signe Kristine:

Okay, now we tried what you said and this is our new code:

Code:

logit everused education age c.age#c.age i.dontknow_caste i.middle_caste i.high_caste i.muslim i.christian i.other i.poorer i.middle i.richer i.richest, nolog

Code:

margins, at(education=(1 5 8 12 20)) atmeans

Code:

marginsplot, noci

But the result we get doesn't make any sense:

1. Is at(education=(1 5 8 12 20)) equivalent to a spline specification?
2. It seems that probability of everuse decreases as years of education increase. This doesn't make any sense since the relationship should be positive.

No, the margins code is not a splined specification. You merely asked margins to present the predicted probabilities, holding:

1) Education at 1, 5, 8, 12, and 20 years,

2) And all other covariates at their means.

Logistic models normally produce a curved line even without splines or a quadratic specification of the independent variables. In your case, you may not be seeing much curvature because the probability of using contraception isn't varying much over the entire range of education (look at your Y axis, compared to the other graph you showed).

If you think the effect of education is non-linear in the log odds, then you could include a quadratic term for education. This is probably not justified given the output from the linear probability model above. Nonetheless, example code would be:

Code:

logit everused c.education##c.education c.age##c.age i.caste i.religion i.income_group, nolog
margins, at(education=(1 5 8 12 20)) atmeans
marginsplot

Side note: You appear to still be manually generating dummies for caste, income, and religion. The syntax above relieves you of that burden. I am not sure if you accidentally omitted a base dummy group for income, and if you did, that would be erroneous. It's better to use the factor variable syntax, because it reduces the amount of coding you have to do and the chance of a coding error. Barring the error above, I think it shouldn't change the output from the regression. If you have income as originally coded, a lot of readers would accept it if you included income or log income as continuous.
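
If you only have the dummies, here is a rough sketch of collapsing them into one categorical variable that the i. prefix can use. The toy rows, the variable name wealth, and the assumption that the group without a dummy is "poorest" are all mine, so check them against your codebook:

```stata
* Toy data: one row per wealth group, using the dummy names from the thread
clear
input byte(poorer middle richer richest)
0 0 0 0
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
end

* Assumption: a row with all four dummies equal to 0 is the "poorest" group
generate byte wealth = 1 + 1*poorer + 2*middle + 3*richer + 4*richest
label define wealth_lbl 1 "poorest" 2 "poorer" 3 "middle" 4 "richer" 5 "richest"
label values wealth wealth_lbl
tabulate wealth
```

You would then put i.wealth in the model in place of the four dummies, and Stata picks the base group for you.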

Side note 2: splines don't work as well with margins and marginsplot. If you absolutely, absolutely must introduce splines into the logistic regression, you should let us know, but I don't think you should need to.

Last, if you have no coding errors, then your results are what they are. Given the predicted probabilities on the Y axes, this looks like a different sample. Things could have changed. Or who knows, maybe everused was accidentally coded in reverse (such that 1 is never used).

Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#6

04 Nov 2018, 12:02

Kristine:
as an aside to Weiwen's helpful advice, it would be useful if you shared the legend and related values appearing above the -margins- outcome table when you invoked the -atmeans- option.

Kind regards,
Carlo
(Stata 19.0)
Richard Williams

Join Date: Apr 2014

Posts: 5008
#7

04 Nov 2018, 13:27

Spline functions do complicate your life. Unfortunately, I don't know of an easy way to tell margins that spline variables are not independent of each other. One of my wish list items is for margins to be able to handle more complicated situations, like when you are dealing with functions of a variable (other than things like squaring and cubing).

I go over some simple spline plotting procedures on pp. 13-17 of

https://www3.nd.edu/~rwilliam/stats2/l61.pdf

Maybe they could be adapted for your purposes.

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
WWW: https://www3.nd.edu/~rwilliam
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#8

04 Nov 2018, 16:07

Actually, in my last class, we had a demonstration of how to plot predicted probabilities (or whatever else) after a splined regression. Now, this takes away from the functionality of -margins-, but it does get you a plot. I recall that there have been other discussions of how to run margins properly after a splined regression, but I don't recall the proposed solution.

For Signe's benefit, here's what happens if you take one of the stock datasets, fit a logistic model with dosage as a continuous variable, then fit it with splines. The dataset uses just two variables: dosage and a binary outcome. The blue line shows the predicted probabilities after fitting a logistic model with dosage treated as continuous, no quadratic term. The red line shows the predicted probabilities after we created linear splines at quintile breakpoints, per Stata's example syntax for -mkspline-.

If you needed 95% CIs with a graph after splines, you should note that you can calculate the standard error of the linear prediction (the stdp option of -predict-), form the bounds at the linear prediction +/- 1.96 * SE, and convert both bounds to probabilities with invlogit(). Then plot all 3 lines.

Code:

webuse mksp2, clear
sum dosage, det

                           dosage
-------------------------------------------------------------
      Percentiles      Smallest
 1%            0              0
 5%            3              0
10%            8              1       Obs                 100
25%         24.5              2       Sum of Wgt.         100

50%         48.5                      Mean               48.3
                        Largest       Std. Dev.      29.78729
75%         73.5             96
90%           91             99       Variance       887.2828
95%         95.5             99       Skewness       .0825489
99%         99.5            100       Kurtosis       1.814948

logistic outcome dosage
predict pr_logistic

mkspline dose 5 = dosage, pctile
logistic outcome dose1-dose5
predict pr_spline

twoway (line pr_logistic dosage, sort) (line pr_spline dosage, sort)
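
For the 95% CIs mentioned above, here is one way it could be done on the same stock dataset. Treat it as a sketch rather than the only approach: it builds the interval on the log-odds scale from -predict-'s stdp option and then converts the bounds with invlogit(), which keeps them between 0 and 1. The names xb_spline, se_spline, lb and ub are made up for illustration.

```stata
webuse mksp2, clear
mkspline dose 5 = dosage, pctile
logistic outcome dose1-dose5

* Predicted probability, linear prediction, and SE of the linear prediction
predict double pr_spline, pr
predict double xb_spline, xb
predict double se_spline, stdp

* Build the interval on the log-odds scale, then convert to probabilities
generate double lb = invlogit(xb_spline - 1.96*se_spline)
generate double ub = invlogit(xb_spline + 1.96*se_spline)

* Plot the prediction with its lower and upper bounds (3 lines)
twoway (line pr_spline dosage, sort) ///
       (line lb dosage, sort lpattern(dash)) ///
       (line ub dosage, sort lpattern(dash))
```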

So, as I mentioned, logistic regression usually produces a set of predicted probabilities that have a curve over a large range (see the blue line). Signe's graph after her own logistic model doesn't look curved. My sense is that there probably isn't a lot of variation in her predicted probabilities, so her graph of probabilities looks rather more straight. If you took the mid-section of the blue line, it would also look straight. Also note the predicted probabilities on my Y-axis - they range from nearly 0 to nearly 1, whereas Signe's graph is constrained to .45-.55 or so.

Back to Signe, this is what things could look like if you did a splined logistic regression. You have a set of piecewise logistic functions. They aren't very interpretable to me, but I don't typically model dose-response relationships in this much detail.
