How to get the correct polynomial terms in logit model

Man Yang

Join Date: Mar 2016
Posts: 183

03 Jan 2018, 18:20

Hello folks, I am trying to fit a logit model and after getting the plot of the two target variables that I am interested in, I feel it should be a quadratic model. Below is the plot based on the raw data.

[Image: Screen Shot 2018-01-03 at 5.16.55 PM.png]

However, when I add the quadratic term to the model, its coefficient is not significant (although I feel it should be). Below is the model output. As you can see, KINDER_READ (read) is significant but read2 (the square of read) is not. I am not sure what this tells me. Should I keep adding cubic or fourth-order terms to the model until all the polynomial terms are significant? Thanks.

Code:

. svy: logit IEP_THIRD KINDER_READ read2 
(running logit on estimation sample)

Survey: Logistic regression

Number of strata   =        42                  Number of obs      =      3386
Number of PSUs     =       125                  Population size    = 2380657.3
                                                Design df          =        83
                                                F(   2,     82)    =     95.99
                                                Prob > F           =    0.0000

------------------------------------------------------------------------------
             |             Linearized
   IEP_THIRD |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 KINDER_READ |  -2.153642   .1580315   -13.63   0.000    -2.467961   -1.839324
       read2 |   -.174322    .267403    -0.65   0.516    -.7061758    .3575318
       _cons |   -1.24688   .1244377   -10.02   0.000    -1.494382   -.9993783
------------------------------------------------------------------------------


Joseph Coveney

Join Date: Apr 2014

Posts: 4420
#2

03 Jan 2018, 19:04

Originally posted by Man Yang

As you can see, read is sig but read2 (the quadratic term of read) is not sig. I am not sure what does this tell me.

It tells you that everything is as expected. See this post for the reason why.
Maarten Buis

Join Date: Mar 2014

Posts: 3458
#3

04 Jan 2018, 01:59

A logit model is for binary dependent variables, i.e. variables that can take only two values (0 or 1). Your graph shows a fractional dependent variable, i.e. a variable that can take any value between (and sometimes including) 0 and 1. So I suspect that what you want to estimate is a fractional logit model, not a logit model. To do so in Stata, use the fracreg command.

I agree with Joseph that your graph looks pretty linear in the log odds to me, so I would not expect a square term to do much.
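As a rough illustration of the fracreg call suggested above, here is a minimal sketch on simulated data, since the thread contains no sample data. The variable names x and y, the seed, and the coefficients used to simulate the outcome are all made up for illustration; fracreg requires Stata 14 or later.

```stata
* Simulated data: every name and number here is hypothetical
clear
set seed 12345
set obs 500
generate double x = rnormal()

* A fractional outcome strictly between 0 and 1
generate double y = invlogit(-1 - 2*x + rnormal())

* Fractional logit: models the conditional mean of y with a logit link
fracreg logit y x
```

If the outcome turns out to contain only 0s and 1s, an ordinary logit is the natural model and fracreg is not needed.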

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#4

04 Jan 2018, 06:09

Man, also note that if you add quadratic terms, it's best to use Stata's factor variable syntax so that margins will know that you want to include one variable and its squared term (or cubic, etc). For example:

Code:

svy: logit IEP_THIRD c.KINDER_READ##c.KINDER_READ
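To show what the factor-variable notation buys you, here is a small sketch on the auto dataset that ships with Stata. The variables foreign and mpg are only stand-ins for IEP_THIRD and KINDER_READ, and the grid of mpg values is an arbitrary choice.

```stata
* Illustration on a shipped dataset; foreign and mpg stand in for
* IEP_THIRD and KINDER_READ
sysuse auto, clear
logit foreign c.mpg##c.mpg

* Test only the squared term
testparm c.mpg#c.mpg

* Predicted probabilities over a grid of mpg values; margins moves the
* linear and squared terms together because they share one variable
margins, at(mpg = (15(5)35))
marginsplot
```

With a hand-made squared variable such as read2, margins would treat it as a separate covariate and hold it fixed while the linear term varies, which is the problem the factor-variable syntax avoids.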

Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.
Man Yang

Join Date: Mar 2016

Posts: 183
#5

04 Jan 2018, 12:24

Hello all, thanks for your replies. I should clarify the graph in #1: it was generated using a user-written command called -binscatter-, and it tells me that, of all children who scored within a given bin of the independent variable (e.g., -2 to -1.8), about 80% had a value of 1 on the dependent variable. My question is: does the shape of the relationship between the dependent and independent variables dictate how I should set up the logit model later? If so, the shape suggests that the two variables have a somewhat non-linear relationship, is that so?
William Lisowski

Join Date: Dec 2014

Posts: 10150
#6

04 Jan 2018, 14:01

Your plot may indeed show that the variables - on the horizontal axis, a value of KINDER_READ typical of children in the bin; on the vertical axis, the proportion scoring for the children in the bin - have a non-linear relationship.

The model you are fitting with the logit command does not fit the probability of scoring to KINDER_READ; it fits the logit of the probability of scoring to KINDER_READ. The results of that fit are transformed into estimated probabilities when those are requested.

For your plot to be representative of the model you are fitting, the vertical axis would have to be the logit of the proportion scoring in each bin. It does not appear that binscatter has that capability, given a cursory glance at its help file.
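The difference between the scale on which the model is fit and the scale of the estimated probabilities can be seen in a short sketch. It uses the auto dataset shipped with Stata, with foreign and mpg as stand-ins for the variables in this thread; the new variable names are arbitrary.

```stata
sysuse auto, clear
logit foreign mpg

* Linear predictor (log odds) and predicted probability
predict double xb, xb
predict double pr, pr

* The probability is the inverse logit of the linear predictor
generate double pr_check = invlogit(xb)
assert reldif(pr, pr_check) < 1e-8

* On the log-odds scale the fit is a straight line in mpg;
* on the probability scale it is an S-shaped curve
twoway (line xb mpg, sort), ytitle("log odds") name(logodds, replace)
twoway (line pr mpg, sort), ytitle("probability") name(prob, replace)
```

A plot of raw proportions is comparable to the second graph, not the first, so curvature in such a plot does not by itself call for a quadratic term in the linear predictor.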

Last edited by William Lisowski; 04 Jan 2018, 14:04.
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#7

04 Jan 2018, 14:24

Originally posted by Man Yang

...does this shape between the dependent and independent variable dictate how I am going to set up the logit model later? If so, then the shape shows that the two variables sort of having a non-linear relationship, is that so?

Adding to what's been said already, both here and in your previous thread: when we estimate the probability of something happening, a logit model is a very reasonable choice. Just to be very clear: a logit model has a linear predictor, meaning that the XB (independent variables and their betas) side of the model is specified as linear. The effect in probability terms is not linear. Sticking with your model as specified but minus the quadratic term, a change in the child's reading score from -2 to -1.8 has a different effect on the probability than a change from 1.8 to 2.
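The arithmetic behind that last point can be sketched as follows. The intercept and slope are copied from the output in #1 with the quadratic term simply dropped, not re-estimated, so the numbers are illustrative only.

```stata
* Illustrative values only: intercept and slope copied from the output
* in #1, with the quadratic term simply dropped (not re-estimated)
scalar b0 = -1.24688
scalar b1 = -2.153642

* Change in predicted probability for the same 0.2 step in the score
scalar d_low  = invlogit(b0 + b1*(-1.8)) - invlogit(b0 + b1*(-2))
scalar d_high = invlogit(b0 + b1*2)      - invlogit(b0 + b1*1.8)

display "change from -2 to -1.8: " d_low
display "change from 1.8 to 2:   " d_high
```

Both changes are negative, but the first is much larger in absolute value than the second, even though the step in the reading score and the coefficient are the same.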

It sounds like nobody here is very sure what your binscatter plot is showing. Glancing very quickly at -binscatter-'s help (it's available on SSC), it says:

binscatter groups the x-axis variable into equal-sized bins, computes the mean of the x-axis and y-axis variables within each bin, then creates a scatterplot of these data points.

Say you went and typed

Code:

binscatter IEP_THIRD KINDER_READ

If I'm reading right (not guaranteed!), I think that yes, your y-axis represents the mean of IEP_THIRD within the corresponding bin of reading score. That is a probability. Remember in the other thread I said that linear probability models were (as far as I knew) out of fashion? Well, here, you are showing a sort of maybe vaguely curved relationship between X and untransformed probability. Linear probability models plot your Xs against ... untransformed probability.

Man Yang

Join Date: Mar 2016

Posts: 183
#8

04 Jan 2018, 21:33

Thanks so much for all of your replies. That helps a lot. So can I still fit the logit model, ask for predicted probabilities in postestimation, and then compare them to the plot in #1? Also, just for clarification, if the quadratic term of the independent variable is not significant, it means the model is linear in the log odds scale, correct? Then, my question is: how could I get a sense of the relationship between the IV and the DV before fitting a logit model? What should I plot before fitting the logit model?
Maarten Buis

Join Date: Mar 2014

Posts: 3458
#9

05 Jan 2018, 03:02

Originally posted by Man Yang

Also, just for clarification, if the quadratic term of the independent variable is not significant, it means the model is linear in the log odds scale, correct?

No, there are many ways in which a relationship can be non-linear. A quadratic relationship is just one of many possible ways. It may be that a linear model fits better than a quadratic, but that does not rule out that there is another form that fits better than linear.

Originally posted by Man Yang

Then, my question is how could I get a sense of the relationship between IV and DV before fitting logit model? What should I plot before fitting the logit model?

You can logit transform the proportions in each bin; a linear effect in a logit model then corresponds to a straight line in that graph. I don't see an option in binscatter to do that, so you need to do it yourself. Here is an example (which requires the mylabels program, available from SSC):

Code:

. // open example data
. sysuse nlsw88, clear
(NLSW, 1988 extract)

.
. //mark observation that will be used
. gen byte touse = !missing(wage, union)

.
. // break wage up into 20 equal sized groups
. xtile x=wage if touse, n(20)

.
. // assign each group the "middle" wage
. bysort x touse (wage) : replace x = (wage[1] + wage[_N])/2 if touse
(1,878 real changes made)

.
. // proportion of union members in each wage group
. bysort x touse : egen y = mean(union) if touse
(368 missing values generated)

.
. // logit transform that proportion
. replace y = logit(y)
(1,878 real changes made)

.
. // create nice labels for this transformed proportion
. mylabels 0.05 0.1 0.2 0.3 0.4 0.5, ///
>     myscale(logit(@)) local(ylabs)
-2.94443897916644 ".05" -2.197224577336219 ".1" -1.386294361119891 ".2" -.8472978603872036 ".3" -.4054651081081643 ".4" 0 ".5"

.
. // plot
. scatter y x , ylab(`ylabs') ///
>     ytitle("proportion union members" ///
>            "(logit scale)")

In this case I would consider a linear spline with a knot at 10 and worry about the outlier in wage. (Actually, I would use union membership to predict wage rather than the other way around, but this example is just there to show the mechanics not to make substantive sense.)
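A linear spline with a knot at 10 could be set up along these lines. This is only a sketch of one way to do it with mkspline; the names wage_lo and wage_hi are made up here, and the outlier in wage is left untouched.

```stata
sysuse nlsw88, clear

* Linear spline in wage with a single knot at 10:
* wage_lo carries wage up to 10, wage_hi carries the excess above 10
mkspline wage_lo 10 wage_hi = wage

* Each coefficient is the slope, in log odds, within its segment
logit union wage_lo wage_hi

* Equal slopes would mean a single straight line fits as well
test wage_lo = wage_hi
```

Adding the marginal option to mkspline would instead make the second coefficient the change in slope at the knot, which can be easier to test directly.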

Man Yang

Join Date: Mar 2016

Posts: 183
#10

05 Jan 2018, 12:10

Hi Maarten, thanks for your reply. I followed your syntax but created an odd-looking graph and I don't know how to explain it. Please disregard the second plot because it's inaccurate.

UPDATE: sorry I mistyped one thing in the code...Below should be the correct plot.

So, based on this plot, I would say the relationship is sort of linear if ignoring the outlier around 1. What do you think?

Last edited by Man Yang; 05 Jan 2018, 12:23.
Bruce Weaver

Join Date: May 2014

Posts: 1133
#11

05 Jan 2018, 13:52

Man, if all you want is a plot showing the unconstrained functional form of the relationship between KINDER_READ and the log-odds of the outcome, why not just estimate an initial exploratory model that treats KINDER_READ as a factor variable? I.e., something like this:

Code:

// Round KINDER_READ to nearest unit
generate read = round(KINDER_READ)
// logit model with KINDER_READ as factor variable
quietly svy: logit IEP_THIRD i.read
margins read, predict(xb)
marginsplot

Michael Mitchell takes this approach in some of the examples in his book Interpreting and Visualizing Regression Models Using Stata.

HTH.

--
Bruce Weaver
Email: [email protected]
Version: Stata/MP 19.5 (Windows)
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#12

05 Jan 2018, 14:22

I like the concept in #11, but in this case I think it will not work. Looking at the graph in #10, it appears that the variable of interest takes on negative values. Stata will not accept that with factor variables. (I'm sure the developers at StataCorp had a good reason not to allow factor variables for variables with negative values, but it eludes me, and I wish they had decided otherwise.)
Bruce Weaver

Join Date: May 2014

Posts: 1133
#13

05 Jan 2018, 15:21

Well spotted, Clyde (in #12). One could get around that problem easily enough by adding a constant value to make all values non-negative. This would mean changing one line in my code in #11:

Code:

generate read = round(KINDER_READ) + k // k = value needed to make all values of read non-negative

If the point is to visualize the functional form (in an exploratory model), the absolute numbers on the X-axis don't really matter.
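One way to pick k without inspecting the data by hand is to take it from the minimum of the rounded variable. The sketch below uses the nlsw88 data shipped with Stata; score is a hypothetical stand-in for KINDER_READ, built only so that it has negative values, and read_shifted is a made-up name.

```stata
sysuse nlsw88, clear

* Hypothetical stand-in for a score with negative values
generate score = (wage - 10)/10
generate read = round(score)

* k is the amount that moves the smallest rounded value to zero
summarize read, meanonly
local k = -r(min)
generate read_shifted = read + `k'

* Exploratory model with one indicator per (shifted) level
quietly logit union i.read_shifted
margins read_shifted, predict(xb)
marginsplot
```

Levels with very few observations can be dropped from the fit or show very wide intervals, so coarser rounding may be needed in the tails.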

