Logistic regression results with i.varname vs. varname as an independent variable (Stata)

Paolo Maldini

Join Date: Feb 2022

Posts: 49
#1


15 Feb 2023, 11:24

I have seen different ways of adding an independent dummy variable to a regression model: either listing the varname on its own or adding i. before the varname. The two approaches produce different results when I run the same model.

My goal is to estimate the probability of an individual dropping out from a dataset of subreddit posts, controlling for the user's average sentiment in the posts, the timing of the Reddit posts that they wrote, and the average number of responses the individual received per Reddit post. The outcome variable drop_out equals 1 if a user drops out (i.e. stops using the subreddit once the government launches a specific economic policy) and 0 otherwise. I want to test whether individuals were more likely to drop out in the month right before the policy's implementation, but I am not sure how to structure my time variable in the logistic regression model.

First, here is a data example:
```
* Example generated by -dataex-. For more info, type help dataex
clear
input byte drop_out float month_year double avg_sentiment float avg_response
1 625 -1 1
1 623 -1 0
1 631 0 0
. 632 0 .
. 632 0 .
1 625 1 6
0 624 -.2105263157894737 .2105263
0 629 -.2105263157894737 .2105263
0 629 -.2105263157894737 .2105263
0 629 -.2105263157894737 .2105263
0 629 -.2105263157894737 .2105263
0 629 -.2105263157894737 .2105263
0 629 -.2105263157894737 .2105263
0 629 -.2105263157894737 .2105263
0 629 -.2105263157894737 .2105263
0 629 -.2105263157894737 .2105263
0 630 -.2105263157894737 .2105263
0 630 -.2105263157894737 .2105263
0 630 -.2105263157894737 .2105263
0 630 -.2105263157894737 .2105263
0 630 -.2105263157894737 .2105263
0 632 -.2105263157894737 .
0 632 -.2105263157894737 .
0 632 -.2105263157894737 .
0 632 -.2105263157894737 .
0 633 -.2105263157894737 .
1 623 -.2307692307692307 .6923077
1 623 -.2307692307692307 .6923077
1 623 -.2307692307692307 .6923077
1 623 -.2307692307692307 .6923077
1 624 -.2307692307692307 .6923077
end
format %tm month_year
```
Here is the first model, without using i.varname:
```
xtlogit drop_out month_year avg_sentiment avg_response
```
Is the result below telling us that higher month_year values are associated, on average, with a lower probability of dropout, given the coefficient of -.335?

I then ran the same model using i.varname for the month_year dummy variable:
```
xtlogit drop_out i.month_year avg_sentiment avg_response
```
Here is the result, but I am a bit confused as to why the coefficients for month_year differ from those in the prior model by (1) being positive and (2) being much larger in magnitude.
Tags: categorical, logit, panel, panel data, regression
Nick Cox

Join Date: Mar 2014

Posts: 35641
#2

15 Feb 2023, 11:46

I've already answered this question somewhere else but I think you deleted it. Please don't do that.

Otherwise this is hard to follow as #1 is a bit messy and the xtlogit context is a complication, but your question seems to boil down to a single comparison.

month_year is a monthly date variable. When you supply it as such as a predictor you're fitting a linear trend term in time.

When you supply it as a factor variable, you are trying to estimate a model in which each distinct monthly date is associated with a different level, other things being equal.

They are quite different uses and the model output is correspondingly quite different. You see one coefficient in the first case and several in the second.

Neither use is well described as adding an independent dummy variable as the first use doesn't invoke any dummy variables at all, and the second implies fitting a bunch of dummy variables as a bunch of predictors.

(I prefer the term indicator variable, but that preference doesn't bear on the central question here)
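
To make the contrast concrete, here is a minimal sketch on simulated data (not the poster's data). The variable names mirror the thread, but the sample size, seed, and data-generating coefficients are arbitrary assumptions, and plain logit is used instead of xtlogit to keep the illustration short.

```stata
* Simulated data: 6 months, 100 observations each (illustrative values only)
clear
set seed 2023
set obs 600
generate month_year = ym(2012, 1) + mod(_n, 6)
format %tm month_year
generate byte drop_out = runiform() < invlogit(0.5 - 0.3*(month_year - ym(2012, 1)))

* Continuous use: a single slope, i.e. a linear trend in the log odds
logit drop_out month_year
display "model df, linear trend: " e(df_m)

* Factor-variable use: one indicator per month except the base month
logit drop_out i.month_year
display "model df, monthly indicators: " e(df_m)
```

In the second model each coefficient is a contrast with the base month, so its sign and size are not comparable with the single slope from the first model.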
3 likes
Paolo Maldini

Join Date: Feb 2022

Posts: 49
#3

15 Feb 2023, 12:30

Originally posted by Nick Cox

I've already answered this question somewhere else but I think you deleted it. Please don't do that.
True, and I understood that you prefer the question to be posted here, rather than somewhere else.

Otherwise this is hard to follow as #1 is a bit messy and the xtlogit context is a complication, but your question seems to boil down to a single comparison.

month_year is a monthly date variable. When you supply it as such as a predictor you're fitting a linear trend term in time.

When you supply it as a factor variable, you are trying to estimate a model in which each distinct monthly date is associated with a different level, other things being equal.

They are quite different uses and the model output is correspondingly quite different. You see one coefficient in the first case and several in the second.

Neither use is well described as adding an independent dummy variable as the first use doesn't invoke any dummy variables at all, and the second implies fitting a bunch of dummy variables as a bunch of predictors.

(I prefer the term indicator variable, but that preference doesn't bear on the central question here)

Makes sense, so how do I construct a time variable in the model that allows for testing whether individuals were more likely to drop out in the month right before the policy's implementation?
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17701
#4

15 Feb 2023, 15:35

Paolo:
the only aside that I can add to Nick's comprehensive reply is that -i.timevar- is usually included in the -fe- specification (conditional fixed effects, in -xtlogit-).
I cannot say from your screenshots (not recommended by the FAQ) whether this is your case or not.

Kind regards,
Carlo
(Stata 19.0)
1 like
Maarten Buis

Join Date: Mar 2014

Posts: 3449
#5

16 Feb 2023, 01:28

Originally posted by Paolo Maldini

so how do I construct a time variable in the model that allows for testing whether individuals were more likely to dropout in the month right before the policy's implementation?

You make an indicator (dummy) variable that indicates whether next month the policy changes. You add some appropriately smooth function of time (could be linear, could be a fractional polynomial (see help fp), could be ...) plus that indicator variable, and the coefficient of that indicator variable tells you how much higher or lower the odds are just prior to the policy change relative to the smoothly changing time trend.
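
A minimal sketch of this kind of specification on simulated data, assuming that month 631, i.e. ym(2012, 8), is the last month before the policy. The panel structure, the seed, the data-generating coefficients, and the centred time variable t are all invented for illustration, and a linear trend stands in for the smooth function of time.

```stata
* Simulated panel: 300 users observed for 12 months (illustrative values only)
clear
set seed 12345
set obs 300
generate id = _n
generate u = rnormal()                  // user-level random effect
expand 12
bysort id: generate month_year = ym(2012, 1) + _n - 1
format %tm month_year

* Indicator for the last pre-policy month (assumed to be 2012m8 = 631)
generate byte pre_impl = (month_year == ym(2012, 8)) if !missing(month_year)

* Time centred on the pre-policy month (hypothetical helper variable)
generate t = month_year - ym(2012, 8)

* Outcome: linear trend in the log odds plus a jump in the pre-policy month
generate byte drop_out = runiform() < invlogit(-2 + 0.05*t + 1*pre_impl + u)

xtset id month_year
xtlogit drop_out c.t i.pre_impl, re
```

The coefficient on 1.pre_impl is the shift in the log odds of dropping out in that month relative to the fitted trend; exponentiating it gives the corresponding odds ratio.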

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
3 likes

Paolo Maldini

Join Date: Feb 2022

Posts: 49
#6

16 Feb 2023, 07:08

Originally posted by Carlo Lazzaro

Paolo:
the only aside that I can add to Nick's comprehensive reply, is that -i.timevar- is usually included in (conditional, in -xtlogit-) -fe- specification.
I cannot say from your screenshots (not recommended by the FAQ) whether this is your case or not.

Thanks for the support Lazzaro.

Hope this is more helpful:
```
xtlogit drop_out i.month_year avg_sentiment avg_respons
```

Random-effects logistic regression                   Number of obs    =  1,577
Group variable: id                                   Number of groups =    394

Random effects u_i ~ Gaussian                        Obs per group:
                                                                  min =      1
                                                                  avg =    4.0
                                                                  max =    191

Integration method: mvaghermite                      Integration pts. =     12

                                                     Wald chi2(12)    = 114.56
Log likelihood = -110.62722                          Prob > chi2      = 0.0000

-------------------------------------------------------------------------------
     drop_out | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
--------------+----------------------------------------------------------------
   month_year |
         617  |          0  (empty)
         618  |          0  (empty)
         621  |   22.31525   7.492238     2.98   0.003     7.630729    36.99976
         622  |   20.93282    8.70476     2.40   0.016     3.871801    37.99384
         623  |    20.5304   3.611588     5.68   0.000     13.45182    27.60898
         624  |   20.65095    2.82076     7.32   0.000     15.12236    26.17953
         625  |   20.65573   2.571961     8.03   0.000     15.61478    25.69668
         626  |      20.86   2.837259     7.35   0.000     15.29908    26.42093
         627  |   20.79581   2.841784     7.32   0.000     15.22602     26.3656
         628  |   20.54298   3.051287     6.73   0.000     14.56257    26.52339
         629  |   19.51106   2.570344     7.59   0.000     14.47328    24.54884
         630  |    2.95785   3.518059     0.84   0.400     -3.93742    9.853119
         631  |          0  (omitted)
              |
avg_sentiment |  -.1546643   1.709997    -0.09   0.928    -3.506196    3.196868
 avg_response |   .0767591   1.083462     0.07   0.944    -2.046787    2.200305
        _cons |   19.08004   2.013371     9.48   0.000      15.1339    23.02617
--------------+----------------------------------------------------------------
     /lnsig2u |   7.065322   .0698232                      6.928471    7.202173
--------------+----------------------------------------------------------------
      sigma_u |   34.21489   1.194496                      31.95202    36.63801
          rho |   .9971976   .0001951                      .9967879    .9975552
-------------------------------------------------------------------------------
LR test of rho=0: chibar2(01) = 1715.00                Prob >= chibar2 = 0.000

And without i.varname:
```
xtlogit drop_out month_year avg_sentiment avg_respons
```

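Random-effects logistic regression                   Number of obs    =  1,616
Group variable: id                                   Number of groups =    418

Random effects u_i ~ Gaussian                        Obs per group:
                                                                  min =      1
                                                                  avg =    3.9
                                                                  max =    191

Integration method: mvaghermite                      Integration pts. =     12

                                                     Wald chi2(3)     =   6.31
Log likelihood = -141.42472                          Prob > chi2      = 0.0975

-------------------------------------------------------------------------------
     drop_out | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
--------------+----------------------------------------------------------------
   month_year |  -.3355589   .1390517    -2.41   0.016    -.6080952   -.0630225
avg_sentiment |  -.2001143   .5797818    -0.35   0.730    -1.336466    .9362372
 avg_response |   .4171611   .5271705     0.79   0.429    -.6160742    1.450396
        _cons |   220.7612   87.12887     2.53   0.011     49.99171    391.5306
--------------+----------------------------------------------------------------
     /lnsig2u |   4.428914   .1518173                      4.131358    4.726471
--------------+----------------------------------------------------------------
      sigma_u |   9.156438   .6950529                      7.890654    10.62527
          rho |   .9622419   .0055159                      .9498131    .9716845
-------------------------------------------------------------------------------
LR test of rho=0: chibar2(01) = 1741.56                Prob >= chibar2 = 0.000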

Last edited by Paolo Maldini; 16 Feb 2023, 07:34.
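
As a hedged reading aid for the single month_year coefficient reported in this post: xtlogit reports coefficients on the log-odds scale, so exponentiating that coefficient gives an odds ratio per one-month step, conditional on the random effect. The arithmetic below only restates the posted estimate (-.3355589); it does not re-fit the model, and the scalar names are illustrative.

```stata
* Posted coefficient on month_year from the linear-trend model
scalar b_month = -.3355589

* Odds ratio for a one-month increase in month_year
scalar or_month = exp(b_month)
display "odds ratio per month: " %6.4f or_month
```

This gives roughly 0.715, meaning each additional month multiplies the odds of dropping out by about 0.715. The large _cons (220.76) is the log odds extrapolated to month_year = 0, which is 1960m1 in Stata's monthly date scale and far outside the data.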


Paolo Maldini

Join Date: Feb 2022

Posts: 49
#7

16 Feb 2023, 07:33

Originally posted by Maarten Buis

You make an indicator (dummy) variable that indicates whether next month the policy changes. You add some appropriately smooth function of time (could be linear, could be a fractional polynomial (see help fp), could be ...) plus that indicator variable, and the coefficient of that indicator variable tells you how much more or less the odds is just prior to the policy change relative to the smooth changing time.

Thanks, this is a really helpful suggestion!
I just wanted to clarify this part: "indicates whether next month the policy changes." Assuming that the last month before the policy changes is 631, should we code the indicator variable as follows, with 1 for the last month before the policy changes and 0 for the months prior to that month?
```
// Create an indicator variable that shows whether next month the policy changes.
gen pre_impl=.
replace pre_impl=1 if month_year==631
replace pre_impl=0 if month_year<631
```
Maarten Buis

Join Date: Mar 2014

Posts: 3449
#8

16 Feb 2023, 10:12

That creates missing values for the time after 631, so that is not what you want. I would create it this way:

Code:

gen pre_impl = 0 if !missing(month_year)
replace pre_impl = 1 if month_year == 631

Or probably:

Code:

gen byte pre_impl:yesno = ( month_year == ym(2012,8) ) if !missing(month_year)
label define yesno 0 "no" 1 "yes"
label var pre_impl "month before policy implementation"
note pre_impl: based on month_year \ my_dofile.do \ MLB

Where my_dofile.do is the do-file that creates that variable and MLB are my initials, you would change it to the name of your do-file and your initials
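
To illustrate the difference between the two codings, here is a small self-contained check on made-up months around the cutoff. The five month values and the names pre_impl_v1 and pre_impl_v2 are assumptions for the sketch; 631 and ym(2012, 8) come from the thread.

```stata
clear
set obs 5
generate month_year = 628 + _n          // months 629 to 633
format %tm month_year

* Coding from #7: stays missing for months after 631
generate pre_impl_v1 = .
replace pre_impl_v1 = 1 if month_year == 631
replace pre_impl_v1 = 0 if month_year < 631

* Coding from #8: 0 for every non-missing month other than 631
generate byte pre_impl_v2 = (month_year == ym(2012, 8)) if !missing(month_year)

list month_year pre_impl_v1 pre_impl_v2, noobs
count if missing(pre_impl_v1)
count if missing(pre_impl_v2)
```

Observations with a missing indicator would be dropped from the estimation sample, which is why the first coding discards all months after the cutoff.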

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
1 like
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17701
#9

16 Feb 2023, 12:05

Paolo:
a minor contribution from my end: with such a relevant number of panels, you may want to consider the -bootstrap- option for your standard errors vs their default counterparts (no cluster robust option is available for -xtlogit-).
As an aside, please get yourself (more) familiar with CODE delimiters to share what you typed and what Stata gave you back. Thanks.

Kind regards,
Carlo
(Stata 19.0)
1 like
Paolo Maldini

Join Date: Feb 2022

Posts: 49
#10

17 Feb 2023, 09:07

Thanks Professors Buis and Lazzaro for the really useful feedback.
Paolo Maldini

Join Date: Feb 2022

Posts: 49
#11

17 Feb 2023, 11:03

Originally posted by Carlo Lazzaro

Paolo:
a minor contribution from my end: with such a relevant number of panels, you may want to consider the -bootstrap- option for your standard errors vs their default counterparts (no cluster robust option is available for -xtlogit-).
As an aside, please get yourself (more) familiar with CODE delimiters to share what you typed and what Stata gave you back. Thanks.

Thanks again Professor Lazzaro for the bootstrap proposal, I used the guidance here, and after running the same model with the same controls but with the bootstrap option as below:

```
bootstrap, reps(100) seed(1): xtlogit drop_out i.month_year avg_sentiment avg_respons
```
The coefficients remained the same for both 1) avg_sentiment and 2) avg_response, yet they became statistically significant at the 0.05 level.
Does that necessarily make the model with the bootstrap option better or more accurate?

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17701

#12

17 Feb 2023, 11:27

Paolo:
I was probably unclear in my previous reply, as I meant -xtlogit,fe- bootstrapped standard errors:

Code:

. use https://www.stata-press.com/data/r17/union
(NLS Women 14-24 in 1968)

. xtlogit union age grade not_smsa i.south##c.year, fe vce(bootstrap, reps(200) dots(1))
(running xtlogit on estimation sample)

Bootstrap replications (200)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
..................................................    50
..................................................   100
..................................................   150
.........x........................................   200

Conditional fixed-effects logistic regression        Number of obs    = 12,035
                                                     Replications     =    199
Group variable: idcode                               Number of groups =  1,690

                                                     Obs per group:
                                                                  min =      2
                                                                  avg =    7.1
                                                                  max =     12

                                                     Wald chi2(6)     =  45.19
Log likelihood = -4510.888                           Prob > chi2      = 0.0000

                              (Replications based on 1,690 clusters in idcode)
------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
       union | coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   .0710973      .1004     0.71   0.479    -.1256831    .2678776
       grade |   .0816111   .0564763     1.45   0.148    -.0290805    .1923026
    not_smsa |   .0224809     .16181     0.14   0.890    -.2946609    .3396227
     1.south |  -2.856488   .9599905    -2.98   0.003    -4.738034    -.974941
        year |  -.0636853   .1016104    -0.63   0.531    -.2628381    .1354675
             |
south#c.year |
          1  |   .0264136   .0117947     2.24   0.025     .0032963    .0495308
------------------------------------------------------------------------------

.

As an aside, please call me Carlo, as all on (and many more off) this list do. Thanks.

Kind regards,
Carlo
(Stata 19.0)


Paolo Maldini

Join Date: Feb 2022

Posts: 49
#13

20 Feb 2023, 04:57

Thanks so much Carlo for the support.
When I ran my code as below, I received the message "an error occurred when bootstrap executed xtlogit".
```
#delimit ;
xtlogit drop_out avg_sentiment avg_respons, fe vce(bootstrap, reps(100) dots(1))
```
However, when I ran the same code with the guideline shown here, it worked well, so I wonder if both codes are correct depending on the Stata version? I am using Stata 17.
```
#delimit ;
bootstrap, reps(100) seed(1): xtlogit drop_out avg_sentiment avg_respons
```
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17701
#14

20 Feb 2023, 05:11

Paolo:
see -help delimit-.
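
For readers who have not used it, a brief sketch of what that help file describes, using the auto dataset shipped with Stata rather than the poster's data. After #delimit ; in a do-file, Stata treats the semicolon, not the line break, as the end of a command, so every command needs a terminating ; until #delimit cr restores the default.

```stata
* Run from a do-file; #delimit is intended for do-files and ado-files
sysuse auto, clear

#delimit ;
regress price mpg
    weight ;
display "N = " e(N) ;
#delimit cr

display "back to one command per line"
```

Without the semicolon, Stata keeps reading the following lines as part of the same command.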

Kind regards,
Carlo
(Stata 19.0)
1 like
Paolo Maldini

Join Date: Feb 2022

Posts: 49
#15

21 Feb 2023, 07:53

Apologies, I have figured out how to use it correctly now. When I ran my code as below, I received the message "an error occurred when bootstrap executed xtlogit".
Here is the code that I ran:

Code:

xtlogit drop_out avg_sentiment avg_respons, fe vce(bootstrap, reps(100) dots(1))

However, when I ran the same code below with the guideline shown here, it worked well, so I wonder if both codes are correct depending on the Stata version? I am using Stata 17.

Code:

bootstrap, reps(100) seed(1): xtlogit drop_out avg_sentiment avg_respons