
  • Logistic regression results with i.varname vs. varname as an independent variable (Stata)

    I have seen two ways of adding an independent dummy variable to a regression model: listing just the varname, or adding i. before the varname. The two methods produce different results when I run the same model.

    My goal is to estimate the probability of an individual dropping out of a dataset of subreddit posts, controlling for the user's average sentiment in the posts, the timing of the Reddit posts they wrote, and the average number of responses the individual received per post. The outcome variable drop_out equals 1 if a user drops out (i.e. stops using the subreddit once the government launches a specific economic policy) and 0 otherwise. I want to test whether individuals were more likely to drop out in the month right before the policy's implementation, but I am not sure how to structure my time variable in the logistic regression model.

    First, here is a data example:
    ```
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte drop_out float month_year double avg_sentiment float avg_response
    1 625 -1 1
    1 623 -1 0
    1 631 0 0
    . 632 0 .
    . 632 0 .
    1 625 1 6
    0 624 -.2105263157894737 .2105263
    0 629 -.2105263157894737 .2105263
    0 629 -.2105263157894737 .2105263
    0 629 -.2105263157894737 .2105263
    0 629 -.2105263157894737 .2105263
    0 629 -.2105263157894737 .2105263
    0 629 -.2105263157894737 .2105263
    0 629 -.2105263157894737 .2105263
    0 629 -.2105263157894737 .2105263
    0 630 -.2105263157894737 .2105263
    0 630 -.2105263157894737 .2105263
    0 630 -.2105263157894737 .2105263
    0 630 -.2105263157894737 .2105263
    0 630 -.2105263157894737 .2105263
    0 632 -.2105263157894737 .
    0 632 -.2105263157894737 .
    0 632 -.2105263157894737 .
    0 632 -.2105263157894737 .
    0 633 -.2105263157894737 .
    1 623 -.2307692307692307 .6923077
    1 623 -.2307692307692307 .6923077
    1 623 -.2307692307692307 .6923077
    1 623 -.2307692307692307 .6923077
    1 624 -.2307692307692307 .6923077
    end
    format %tm month_year
    ```
    Here is the first model, without using i.varname:
    ```
    xtlogit drop_out month_year avg_sentiment avg_respons
    ```
    Is the result below telling us that higher month_year values are correlated on average, with a lower probability of dropout at -.335?




    I then ran the same model using i.varname for the month_year dummy variable:
    ```
    xtlogit drop_out i.month_year avg_sentiment avg_response
    ```
    Here is the result, but I am a bit confused as to why the coefficients for month_year differ from the prior model by 1) being positive and 2) being much larger in magnitude.

  • #2
    I've already answered this question somewhere else but I think you deleted it. Please don't do that.

    Otherwise this is hard to follow as #1 is a bit messy and the xtlogit context is a complication, but your question seems to boil down to a single comparison.

    month_year is a monthly date variable. When you supply it as such as a predictor you're fitting a linear trend term in time.

    When you supply it as a factor variable, you are trying to estimate a model in which each distinct monthly date is associated with a different level, other things being equal.

    They are quite different uses and the model output is correspondingly quite different. You see one coefficient in the first case and several in the second.

    Neither use is well described as adding an independent dummy variable as the first use doesn't invoke any dummy variables at all, and the second implies fitting a bunch of dummy variables as a bunch of predictors.

    (I prefer the term indicator variable, but that preference doesn't bear on the central question here)
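
    As a sketch of the two specifications described above, in notation that is illustrative and not taken from the posts ($t$ is month_year, $\mathbf{x}_{it}$ collects avg_sentiment and avg_response, $u_i$ is the random effect that xtlogit fits by default, and $m_0$ is the base month):

    $$
    \text{linear trend:}\quad \operatorname{logit}\Pr(\text{drop\_out}_{it}=1) = \beta_0 + \beta_1 t + \mathbf{x}_{it}'\boldsymbol{\gamma} + u_i
    $$

    $$
    \text{factor variable:}\quad \operatorname{logit}\Pr(\text{drop\_out}_{it}=1) = \beta_0 + \sum_{m \neq m_0} \delta_m \,\mathbf{1}[t = m] + \mathbf{x}_{it}'\boldsymbol{\gamma} + u_i
    $$

    The first has a single slope $\beta_1$, the change in the log odds per additional month. The second has one $\delta_m$ per distinct month, each a contrast with the base month, so the two sets of coefficients answer different questions and need not agree in sign or size.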



    • #3
      Originally posted by Nick Cox
      I've already answered this question somewhere else but I think you deleted it. Please don't do that.
      True, and I understood that you prefer the question to be posted here, rather than somewhere else.

      Otherwise this is hard to follow as #1 is a bit messy and the xtlogit context is a complication, but your question seems to boil down to a single comparison.

      month_year is a monthly date variable. When you supply it as such as a predictor you're fitting a linear trend term in time.

      When you supply it as a factor variable, you are trying to estimate a model in which each distinct monthly date is associated with a different level, other things being equal.

      They are quite different uses and the model output is correspondingly quite different. You see one coefficient in the first case and several in the second.

      Neither use is well described as adding an independent dummy variable as the first use doesn't invoke any dummy variables at all, and the second implies fitting a bunch of dummy variables as a bunch of predictors.

      (I prefer the term indicator variable, but that preference doesn't bear on the central question here)
      Makes sense, so how do I construct a time variable in the model that allows for testing whether individuals were more likely to drop out in the month right before the policy's implementation?



      • #4
        Paolo:
        the only aside that I can add to Nick's comprehensive reply, is that -i.timevar- is usually included in (conditional, in -xtlogit-) -fe- specification.
        I cannot say from your screenshots (not recommended by the FAQ) whether this is your case or not.
        Kind regards,
        Carlo
        (Stata 19.0)



        • #5
          Originally posted by Paolo Maldini
          so how do I construct a time variable in the model that allows for testing whether individuals were more likely to dropout in the month right before the policy's implementation?
          You make an indicator (dummy) variable that indicates whether the policy changes next month. You add some appropriately smooth function of time (could be linear, could be a fractional polynomial (see help fp), could be ...) plus that indicator variable, and the coefficient of that indicator variable tells you how much higher or lower the odds are just prior to the policy change relative to the smoothly changing time trend.
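
          One way to write this model down, as a sketch only (the symbols are illustrative assumptions: $f(t)$ is the chosen smooth function of month_year, $D_t$ equals 1 in the month before the policy change and 0 otherwise, $\mathbf{x}_{it}$ holds the other covariates, and $u_i$ is the panel-level effect):

          $$
          \operatorname{logit}\Pr(\text{drop\_out}_{it}=1) = \beta_0 + f(t) + \theta D_t + \mathbf{x}_{it}'\boldsymbol{\gamma} + u_i
          $$

          Here $\theta$ is the shift in the log odds in the pre-implementation month relative to what the smooth trend alone predicts, and $\exp(\theta)$ is the corresponding odds ratio. With a linear trend, $f(t) = \beta_1 t$.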
          ---------------------------------
          Maarten L. Buis
          University of Konstanz
          Department of history and sociology
          box 40
          78457 Konstanz
          Germany
          http://www.maartenbuis.nl
          ---------------------------------



          • #6
            Originally posted by Carlo Lazzaro
            Paolo:
            the only aside that I can add to Nick's comprehensive reply, is that -i.timevar- is usually included in (conditional, in -xtlogit-) -fe- specification.
            I cannot say from your screenshots (not recommended by the FAQ) whether this is your case or not.
            Thanks for the support Lazzaro.

            Hope this is more helpful:
            ```
            xtlogit drop_out i.month_year avg_sentiment avg_respons

            Random-effects logistic regression              Number of obs     =      1,577
            Group variable: id                              Number of groups  =        394

            Random effects u_i ~ Gaussian                   Obs per group:
                                                                          min =          1
                                                                          avg =        4.0
                                                                          max =        191

            Integration method: mvaghermite                 Integration pts.  =         12

                                                            Wald chi2(12)     =     114.56
            Log likelihood = -110.62722                     Prob > chi2       =     0.0000

            --------------------------------------------------------------------------------
                 drop_out | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
            --------------+-----------------------------------------------------------------
               month_year |
                     617  |           0  (empty)
                     618  |           0  (empty)
                     621  |    22.31525   7.492238     2.98   0.003     7.630729    36.99976
                     622  |    20.93282    8.70476     2.40   0.016     3.871801    37.99384
                     623  |     20.5304   3.611588     5.68   0.000     13.45182    27.60898
                     624  |    20.65095    2.82076     7.32   0.000     15.12236    26.17953
                     625  |    20.65573   2.571961     8.03   0.000     15.61478    25.69668
                     626  |       20.86   2.837259     7.35   0.000     15.29908    26.42093
                     627  |    20.79581   2.841784     7.32   0.000     15.22602     26.3656
                     628  |    20.54298   3.051287     6.73   0.000     14.56257    26.52339
                     629  |    19.51106   2.570344     7.59   0.000     14.47328    24.54884
                     630  |     2.95785   3.518059     0.84   0.400     -3.93742    9.853119
                     631  |           0  (omitted)
                          |
            avg_sentiment |   -.1546643   1.709997    -0.09   0.928    -3.506196    3.196868
             avg_response |    .0767591   1.083462     0.07   0.944    -2.046787    2.200305
                    _cons |    19.08004   2.013371     9.48   0.000      15.1339    23.02617
            --------------+-----------------------------------------------------------------
                 /lnsig2u |    7.065322   .0698232                      6.928471    7.202173
            --------------+-----------------------------------------------------------------
                  sigma_u |    34.21489   1.194496                      31.95202    36.63801
                      rho |    .9971976   .0001951                      .9967879    .9975552
            --------------------------------------------------------------------------------
            LR test of rho=0: chibar2(01) = 1715.00                Prob >= chibar2 = 0.000
            ```
            And without i.varname:
            ```
            xtlogit drop_out month_year avg_sentiment avg_respons

            Random-effects logistic regression              Number of obs     =      1,616
            Group variable: id                              Number of groups  =        418

            Random effects u_i ~ Gaussian                   Obs per group:
                                                                          min =          1
                                                                          avg =        3.9
                                                                          max =        191

            Integration method: mvaghermite                 Integration pts.  =         12

                                                            Wald chi2(3)      =       6.31
            Log likelihood = -141.42472                     Prob > chi2       =     0.0975

            --------------------------------------------------------------------------------
                 drop_out | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
            --------------+-----------------------------------------------------------------
               month_year |   -.3355589   .1390517    -2.41   0.016    -.6080952   -.0630225
            avg_sentiment |   -.2001143   .5797818    -0.35   0.730    -1.336466    .9362372
             avg_response |    .4171611   .5271705     0.79   0.429    -.6160742    1.450396
                    _cons |    220.7612   87.12887     2.53   0.011     49.99171    391.5306
            --------------+-----------------------------------------------------------------
                 /lnsig2u |    4.428914   .1518173                      4.131358    4.726471
            --------------+-----------------------------------------------------------------
                  sigma_u |    9.156438   .6950529                      7.890654    10.62527
                      rho |    .9622419   .0055159                      .9498131    .9716845
            --------------------------------------------------------------------------------
            LR test of rho=0: chibar2(01) = 1741.56                Prob >= chibar2 = 0.000
            ```
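
            A worked reading of the model that enters month_year without the i. prefix, offered as an editorial sketch that uses only the reported coefficients. The coefficient on month_year is a change in the log odds per month, not a probability. Exponentiating gives the odds ratio:

            $$
            \exp(-0.3355589) \approx 0.715
            $$

            so each additional month multiplies the odds of dropping out by roughly 0.715, holding the other covariates and the random effect fixed. The large constant is the linear predictor at month_year = 0, which in %tm units is 1960m1, far outside the sample. At an in-sample month such as 625, trend and constant combine to a much smaller value:

            $$
            220.7612 - 0.3355589 \times 625 \approx 11.04
            $$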
            Last edited by Paolo Maldini; 16 Feb 2023, 07:34.



            • #7
              Originally posted by Maarten Buis

              You make an indicator (dummy) variable that indicates whether next month the policy changes. You add some appropriately smooth function of time (could be linear, could be a fractional polynomial (see help fp), could be ...) plus that indicator variable, and the coefficient of that indicator variable tells you how much more or less the odds is just prior to the policy change relative to the smooth changing time.
              Thanks, this is a really helpful suggestion!
              I just want to clarify the part "indicates whether next month the policy changes". Assuming that the last month before the policy changes is 631, should we code the indicator variable as follows, with 1 for the last month before the policy changes and 0 for the months prior to that month?
              ```
              // Create an indicator variable that shows whether next month the policy changes.
              gen pre_impl=.
              replace pre_impl=1 if month_year==631
              replace pre_impl=0 if month_year<631
              ```



              • #8
                That creates missing values for the time after 631, so that is not what you want. I would create it this way:
                Code:
                gen pre_impl = 0 if !missing(month_year)
                replace pre_impl = 1 if month_year == 631
                Or probably:
                Code:
                gen byte pre_impl:yesno = ( month_year == ym(2012,8) ) if !missing(month_year)
                label define yesno 0 "no" 1 "yes"
                label var pre_impl "month before policy implementation"
                note pre_impl: based on month_year \ my_dofile.do \ MLB
                Here my_dofile.do is the do-file that creates that variable and MLB are my initials; you would change these to the name of your do-file and your initials.
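
                A small self-contained sketch of the difference between the two codings, on made-up data. The twelve-month toy series and the names pre_a and pre_b are hypothetical and exist only for this illustration; they are not part of the poster's dataset.

                ```stata
                * Toy data: 12 consecutive months, 2012m1 (624) through 2012m12 (635)
                clear
                set obs 12
                gen month_year = ym(2012, 1) + _n - 1
                format %tm month_year

                * Coding from #7: months after 631 are never assigned, so they stay missing
                gen pre_a = .
                replace pre_a = 1 if month_year == 631
                replace pre_a = 0 if month_year < 631

                * Coding from #8: 0 everywhere except the target month; ym(2012, 8) is 631
                gen byte pre_b = (month_year == ym(2012, 8)) if !missing(month_year)

                * Compare the two: pre_a is missing for 632 to 635, pre_b is never missing
                count if missing(pre_a)
                count if missing(pre_b)
                list month_year pre_a pre_b, sep(0)
                ```

                Observations with a missing indicator are dropped from estimation, which is why the second coding is preferable here. With the indicator in place, one possible specification along the lines of #5, assuming a linear trend, would be xtlogit drop_out c.month_year i.pre_impl avg_sentiment avg_response; that command is an assumption about how the pieces fit together, not something run in this thread.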
                ---------------------------------
                Maarten L. Buis
                University of Konstanz
                Department of history and sociology
                box 40
                78457 Konstanz
                Germany
                http://www.maartenbuis.nl
                ---------------------------------



                • #9
                  Paolo:
                  a minor contribution from my end: with such a relevant number of panels, you may want to consider the -bootstrap- option for your standard errors vs their default counterparts (no cluster robust option is available for -xtlogit-).
                  As an aside, please get yourself (more) familiar with CODE delimiters to share what you typed and what Stata gave you back. Thanks.
                  Kind regards,
                  Carlo
                  (Stata 19.0)



                  • #10
                    Thanks Professors Buis and Lazzaro for the really useful feedback.



                    • #11
                      Originally posted by Carlo Lazzaro
                      Paolo:
                      a minor contribution from my end: with such a relevant number of panels, you may want to consider the -bootstrap- option for your standard errors vs their default counterparts (no cluster robust option is available for -xtlogit-).
                      As an aside, please get yourself (more) familiar with CODE delimiters to share what you typed and what Stata gave you back. Thanks.
                      Thanks again Professor Lazzaro for the bootstrap proposal, I used the guidance here, and after running the same model with the same controls but with the bootstrap option as below:

                      ```
                      bootstrap, reps(100) seed(1): xtlogit drop_out i.month_year avg_sentiment avg_respons
                      ```
                      The coefficients for 1) avg_sentiment and 2) avg_response remained the same, yet both became statistically significant at the 0.05 level.
                      Does that necessarily make the model with the bootstrap option better or more accurate?



                      • #12
                        Paolo:
                        I was probably unclear in my previous reply, as I meant -xtlogit,fe- bootstrapped standard errors:
                        Code:
                        . use https://www.stata-press.com/data/r17/union
                        (NLS Women 14-24 in 1968)
                        
                        . xtlogit union age grade not_smsa i.south##c.year, fe vce(bootstrap, reps(200) dots(1))
                        (running xtlogit on estimation sample)
                        
                        Bootstrap replications (200)
                        ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
                        ..................................................    50
                        ..................................................   100
                        ..................................................   150
                        .........x........................................   200
                        
                        Conditional fixed-effects logistic regression        Number of obs    = 12,035
                                                                             Replications     =    199
                        Group variable: idcode                               Number of groups =  1,690
                        
                                                                             Obs per group:
                                                                                          min =      2
                                                                                          avg =    7.1
                                                                                          max =     12
                        
                                                                             Wald chi2(6)     =  45.19
                        Log likelihood = -4510.888                           Prob > chi2      = 0.0000
                        
                                                      (Replications based on 1,690 clusters in idcode)
                        ------------------------------------------------------------------------------
                                     |   Observed   Bootstrap                         Normal-based
                               union | coefficient  std. err.      z    P>|z|     [95% conf. interval]
                        -------------+----------------------------------------------------------------
                                 age |   .0710973      .1004     0.71   0.479    -.1256831    .2678776
                               grade |   .0816111   .0564763     1.45   0.148    -.0290805    .1923026
                            not_smsa |   .0224809     .16181     0.14   0.890    -.2946609    .3396227
                             1.south |  -2.856488   .9599905    -2.98   0.003    -4.738034    -.974941
                                year |  -.0636853   .1016104    -0.63   0.531    -.2628381    .1354675
                                     |
                        south#c.year |
                                  1  |   .0264136   .0117947     2.24   0.025     .0032963    .0495308
                        ------------------------------------------------------------------------------
                        
                        .
                        As an aside, please call me Carlo, as all on (and many more off) this list do. Thanks.
                        Kind regards,
                        Carlo
                        (Stata 19.0)



                        • #13
                          Thanks so much Carlo for the support.
                          When I ran my code as below, I received the message "an error occurred when bootstrap executed xtlogit".
                          ```
                          #delimit ;
                          xtlogit drop_out avg_sentiment avg_respons, fe vce(bootstrap, reps(100) dots(1))
                          ```
                          However, when I ran the same code following the guideline shown here, it worked well, so I wonder whether both are correct depending on the Stata version? I am using Stata 17.
                          ```
                          #delimit ;
                          bootstrap, reps(100) seed(1): xtlogit drop_out avg_sentiment avg_respons
                          ```



                          • #14
                            Paolo:
                            see -help delimit-.
                            Kind regards,
                            Carlo
                            (Stata 19.0)
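
                            For readers following along, a minimal sketch of how #delimit behaves, using the auto dataset that ships with Stata (the regression itself is arbitrary and chosen only for illustration). After #delimit ; every command must end with a semicolon, and as shown in #13 the commands have no closing semicolon, which may be what the pointer to the help file is about. #delimit works only in do-files and ado-files, so this needs to be saved and run as a do-file.

                            ```stata
                            * Load an example dataset shipped with Stata
                            sysuse auto, clear

                            * Switch the command delimiter to a semicolon; from here on a
                            * line break no longer ends a command
                            #delimit ;
                            regress price mpg
                                weight ;
                            #delimit cr

                            * Back to the default: one command per line
                            display e(N)
                            ```

                            The delimiter also resets to a carriage return when the do-file ends, but switching back explicitly with #delimit cr keeps the rest of the file readable.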



                            • #15
                              Apologies, I have figured out how to use it correctly now. When I ran my code as below, I received the message "an error occurred when bootstrap executed xtlogit".
                              Here is the code that I ran:
                              Code:
                              xtlogit drop_out avg_sentiment avg_respons, fe vce(bootstrap, reps(100) dots(1))
                              However, when I ran the code below, following the guideline shown here, it worked well, so I wonder whether both are correct depending on the Stata version? I am using Stata 17.
                              Code:
                              bootstrap, reps(100) seed(1): xtlogit drop_out avg_sentiment avg_respons
