  • How to get the correct polynomial terms in logit model

    Hello folks, I am trying to fit a logit model and after getting the plot of the two target variables that I am interested in, I feel it should be a quadratic model. Below is the plot based on the raw data.

    [Figure: Screen Shot 2018-01-03 at 5.16.55 PM.png, plot based on the raw data]


    However, when I add the quadratic term to the model, the coefficient associated with the quadratic term is not significant (but I feel it should be sig). Below is the model output. As you can see, read is sig but read2 (the quadratic term of read) is not sig. I am not sure what this tells me. Should I keep adding cubic or 4th-order terms to the model until all the polynomial terms are significant? Thanks.
    Code:
    . svy: logit IEP_THIRD KINDER_READ read2 
    (running logit on estimation sample)
    
    Survey: Logistic regression
    
    Number of strata   =        42                  Number of obs      =      3386
    Number of PSUs     =       125                  Population size    = 2380657.3
                                                    Design df          =        83
                                                    F(   2,     82)    =     95.99
                                                    Prob > F           =    0.0000
    
    ------------------------------------------------------------------------------
                 |             Linearized
       IEP_THIRD |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
     KINDER_READ |  -2.153642   .1580315   -13.63   0.000    -2.467961   -1.839324
           read2 |   -.174322    .267403    -0.65   0.516    -.7061758    .3575318
           _cons |   -1.24688   .1244377   -10.02   0.000    -1.494382   -.9993783
    ------------------------------------------------------------------------------

  • #2
    Originally posted by Man Yang:
    As you can see, read is sig but read2 (the quadratic term of read) is not sig. I am not sure what this tells me.
    It tells you that everything is as expected. See this post for the reason why.



    • #3
      A logit model is for binary dependent variables, i.e. a variable that can take only two values (0 or 1). Your graph shows a fractional dependent variable, i.e. a variable that can take any value between (and sometimes including) 0 and 1. So I suspect what you want to estimate is a fractional logit, and not a logit model. You can do so in Stata using the fracreg command.

      I agree with Joseph that your graph looks pretty linear in the log odds to me, so I would not expect a square term to do much.
      ---------------------------------
      Maarten L. Buis
      University of Konstanz
      Department of history and sociology
      box 40
      78457 Konstanz
      Germany
      http://www.maartenbuis.nl
      ---------------------------------



      • #4
        Man, also note that if you add quadratic terms, it's best to use Stata's factor variable syntax so that margins will know that you want to include one variable and its squared term (or cubic, etc). For example:

        Code:
         
         svy: logit IEP_THIRD c.KINDER_READ##c.KINDER_READ
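        As a rough sketch of what that buys you: once the model is specified this way, margins can trace out the predicted probability over a grid of reading scores, with the squared term updated automatically. The grid from -2 to 2 in steps of 0.5 is only a placeholder I made up; substitute the observed range of KINDER_READ.

        Code:
         
         // fit the quadratic model using factor-variable notation
         svy: logit IEP_THIRD c.KINDER_READ##c.KINDER_READ
         // predicted probabilities over an assumed grid of reading scores
         margins, at(KINDER_READ = (-2(0.5)2))
         // plot those predicted probabilities against KINDER_READ
         marginsplot

        Had the squared term been entered as a separately generated variable (like read2 in #1), margins would hold it fixed while KINDER_READ changes, which is usually not what you want.
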
        Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

        When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.



        • #5
          Hello all, thanks for your replies. I don't know whether I should make a clarification here about the graph in #1, but the graph was generated using a user-written command called -binscatter-, and it tells me that for all children who scored (performance is the independent variable) in a bin range (i.e., -2 ~ -1.8), about 80% of them had a value of 1 in the dependent variable. My question is: does this shape between the dependent and independent variable dictate how I am going to set up the logit model later? If so, then the shape shows that the two variables sort of have a non-linear relationship, is that so?



          • #6
            Your plot may indeed show that the variables - on the horizontal axis, a value of KINDER_READ typical of children in the bin; on the vertical axis, the proportion scoring for the children in the bin - have a non-linear relationship.

            The model you are fitting with the logit command does not fit the probability of scoring to KINDER_READ; it fits the logit of the probability of scoring to KINDER_READ. The results of that fit are transformed into estimated probabilities when that is what is requested.

            For your plot to be representative of the model you are fitting, the vertical axis would have to be the logit of the proportion scoring in each bin. It does not appear that binscatter has that capability, given a cursory glance at its help file.
            Last edited by William Lisowski; 04 Jan 2018, 14:04.



            • #7
              Originally posted by Man Yang:
              ...does this shape between the dependent and independent variable dictate how I am going to set up the logit model later? If so, then the shape shows that the two variables sort of have a non-linear relationship, is that so?
              Adding to what's been said already, both here and in your previous thread: when we estimate the probability of something happening, a logit model is a very reasonable choice. Just to be very clear: a logit model has a linear predictor, meaning that the XB (independent variables and their betas) side of the model is specified as linear. The effect in probability terms is not linear. Sticking with your model as specified but minus the quadratic term, a change in the child reading score from -2 to -1.8 has a different effect on the probability than a change from 1.8 to 2.
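
              To make that concrete, here is a small sketch using display and invlogit(). The intercept and slope are borrowed from the output in #1 purely for illustration; that model also contained read2, so a model without the quadratic term would give somewhat different numbers.

              Code:
               
               // illustrative coefficients only, copied from the output in #1
               local b0 = -1.24688
               local b1 = -2.153642
               // change in predicted probability when the score moves from -2 to -1.8
               display invlogit(`b0' + `b1'*(-1.8)) - invlogit(`b0' + `b1'*(-2))
               // change in predicted probability when the score moves from 1.8 to 2
               display invlogit(`b0' + `b1'*2) - invlogit(`b0' + `b1'*1.8)

              The step in the linear predictor is identical in both cases (0.2 times the slope), yet the two differences in probability are not: with these numbers the first is noticeably larger in absolute size than the second.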

              It sounds like nobody here is very sure what your binscatter plot is showing. Glancing very quickly at -binscatter-'s help (it's available on SSC), it says:

              binscatter groups the x-axis variable into equal-sized bins, computes the mean of the x-axis and y-axis variables within each bin, then creates a scatterplot of these data points.
              Say you went and typed

              Code:
               
               binscatter IEP_THIRD KINDER_READ
              If I'm reading right (not guaranteed!), I think that yes, your y-axis represents the mean of IEP_THIRD within the corresponding bin of reading score. That is a probability. Remember in the other thread I said that linear probability models were (as far as I knew) out of fashion? Well, here, you are showing a sort of maybe vaguely curved relationship between X and untransformed probability. Linear probability models plot your Xs against ... untransformed probability.



              • #8
                Thanks so much for all of your replies. That helps a lot. So can I still fit the logit model but ask for predicted probabilities in the postestimation and then compare them to the plot in #1? Also, just for clarification, if the quadratic term of the independent variable is not significant, it means the model is linear in the log odds scale, correct? Then, my question is how could I get a sense of the relationship between IV and DV before fitting the logit model? What should I plot before fitting the logit model?



                • #9
                  Originally posted by Man Yang:
                  Also, just for clarification, if the quadratic term of the independent variable is not significant, it means the model is linear in the log odds scale, correct?
                  No, there are many ways in which a relationship can be non-linear. A quadratic relationship is just one of many possible ways. It may be that a linear model fits better than a quadratic, but that does not rule out that there is another form that fits better than linear.

                  Originally posted by Man Yang:
                  Then, my question is how could I get a sense of the relationship between IV and DV before fitting the logit model? What should I plot before fitting the logit model?
                  You can logit transform the proportions in each bin; a linear effect in a logit model then corresponds to a straight line in that graph. I don't see an option in binscatter to do that, so you need to do it yourself. Here is an example (which requires the mylabels program available from SSC):

                  Code:
                  . // open example data
                  . sysuse nlsw88, clear
                  (NLSW, 1988 extract)
                  
                  .
                  . //mark observation that will be used
                  . gen byte touse = !missing(wage, union)
                  
                  .
                  . // break wage up into 20 equal sized groups
                  . xtile x=wage if touse, n(20)
                  
                  .
                  . // assign each group the "middle" wage
                  . bysort x touse (wage) : replace x = (wage[1] + wage[_N])/2 if touse
                  (1,878 real changes made)
                  
                  .
                  . // proportion of union members in each wage group
                  . bysort x touse : egen y = mean(union) if touse
                  (368 missing values generated)
                  
                  .
                  . // logit transform that proportion
                  . replace y = logit(y)
                  (1,878 real changes made)
                  
                  .
                  . // create nice labels for this transformed proportion
                  . mylabels 0.05 0.1 0.2 0.3 0.4 0.5,     ///
                  >     myscale(logit(@)) local(ylabs)
                  -2.94443897916644 ".05" -2.197224577336219 ".1" -1.386294361119891 ".2" -.8472978603872036 ".3" -.4054651081081643 ".4" 0 ".5"
                  
                  .
                  . // plot
                  . scatter y x , ylab(`ylabs')            ///
                  >     ytitle("proportion union members"  ///
                  >                "(logit scale)")
                  [Figure: Graph.png, scatter plot of the proportion of union members (logit scale) against the binned wage]

                  In this case I would consider a linear spline with a knot at 10 and worry about the outlier in wage. (Actually, I would use union membership to predict wage rather than the other way around, but this example is just there to show the mechanics not to make substantive sense.)
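
                  For what it's worth, a linear spline like that could be set up with mkspline. This is only a sketch continuing the nlsw88 example: wage_lo and wage_hi are names I made up, and the knot at 10 is simply read off the graph.

                  Code:
                   
                   // linear spline in wage with a single knot at 10
                   mkspline wage_lo 10 wage_hi = wage
                   // by default each coefficient is the slope within its own segment
                   logit union wage_lo wage_hi
                   // test whether the slope below the knot equals the slope above it
                   test wage_lo = wage_hi

                  If the test gives no evidence that the two slopes differ, a single linear term in the log odds is probably adequate over that range.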
                  ---------------------------------
                  Maarten L. Buis
                  University of Konstanz
                  Department of history and sociology
                  box 40
                  78457 Konstanz
                  Germany
                  http://www.maartenbuis.nl
                  ---------------------------------



                  • #10
                    Hi Maarten, thanks for your reply. I followed your syntax but created an odd-looking graph and I don't know how to explain it. Please disregard the second plot because it's inaccurate.

                    UPDATE: sorry I mistyped one thing in the code...Below should be the correct plot.

                    [Figure: Graph.png, the corrected plot]

                    So, based on this plot, I would say the relationship is sort of linear if I ignore the outlier around 1. What do you think?
                    Last edited by Man Yang; 05 Jan 2018, 12:23.



                    • #11
                      Man, if all you want is a plot showing the unconstrained functional form of the relationship between KINDER_READ and the log-odds of the outcome, why not just estimate an initial exploratory model that treats KINDER_READ as a factor variable? I.e., something like this:

                      Code:
                      // Round KINDER_READ to nearest unit
                      generate read = round(KINDER_READ)
                      // logit model with KINDER_READ as factor variable
                      quietly svy: logit IEP_THIRD i.read
                      margins read, predict(xb)
                      marginsplot
                      Michael Mitchell takes this approach in some of the examples in his book Interpreting and Visualizing Regression Models Using Stata.

                      HTH.
                      --
                      Bruce Weaver
                      Email: [email protected]
                      Version: Stata/MP 18.5 (Windows)



                      • #12
                        I like the concept in #11, but in this case I think it will not work. Looking at the graph in #10, it appears that the variable of interest takes on negative values. Stata will not accept that with factor variables. (I'm sure the developers at StataCorp had a good reason not to allow factor variables for variables with negative values, but it eludes me, and I wish they had decided otherwise.)

                        Comment


                        • #13
                          Well spotted, Clyde (in #12). One could get around that problem easily enough by adding a constant value to make all values non-negative. This would mean changing one line in my code in #11:

                          Code:
                          generate read = round(KINDER_READ) + k // k = value needed to make all values of read non-negative
                          If the point is to visualize the functional form (in an exploratory model), the absolute numbers on the X-axis don't really matter.
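
                          One way to pick that constant without hard-coding it, sketched under the assumption that rounding to the nearest unit is coarse enough for your data, is to shift by the smallest rounded value that summarize reports:

                          Code:
                           
                           // round KINDER_READ to the nearest unit
                           generate read = round(KINDER_READ)
                           // r(min) holds the smallest rounded score after summarize
                           quietly summarize read
                           // subtracting the minimum makes the smallest category 0
                           replace read = read - r(min)
                           quietly svy: logit IEP_THIRD i.read
                           margins read, predict(xb)
                           marginsplot

                          The categories on the X-axis are then the rounded scores minus their minimum, so the shape of the plot is unchanged and only the labels are shifted.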
                          --
                          Bruce Weaver
                          Email: [email protected]
                          Version: Stata/MP 18.5 (Windows)

