Hello! I am seeking feedback on my approach to estimating a difference-in-differences model on an unbalanced panel.
My goal is to estimate the effect of a feature introduced on an online platform on some outcome variable. The feature became available in August 2015. The monthly data available for analysis cover 01, 03, 05, 06, 08, 09, 10, 11, and 12/2015, plus 01/2016:
Starting in August 2015, some platform users began to use the new feature; I will call them the treated group:
While some users used the new feature every single month starting from August, others used it only once or a few times:
Since the feature became available in 08/2015, I create a post-period indicator, time, equal to 1 for months after 08/2015:
I then estimate the following fixed-effects model:
Does everything seem to be appropriate so far?
Also, since different users started using the new feature at different times, does it make sense to create several time* variables and examine how the effect unfolds over time? For example:

I would sincerely appreciate your feedback.
Code:
xtset
       panel variable:  id (unbalanced)
        time variable:  month, 01/2015 to 01/2016, but with gaps
                delta:  1 month

xtdescribe

      id:  105, 2515, ..., 10289394                          n =      68574
   month:  01/2015, 03/2015, ..., 01/2016                    T =         10
           Delta(month) = 1 month
           Span(month)  = 13 periods
           (id*month uniquely identifies each observation)

Distribution of T_i:   min      5%     25%       50%       75%     95%     max
                         1       1       2         4         7      10      10

     Freq.  Percent    Cum. |  Pattern
 ---------------------------+---------------
     8446     12.32   12.32 |  1.1.11.111111
     4239      6.18   18.50 |  ............1
     4194      6.12   24.61 |  ...........11
     3875      5.65   30.27 |  1............
     3182      4.64   34.91 |  .......111111
     3087      4.50   39.41 |  1.1..........
     2541      3.71   43.11 |  ..........111
     2040      2.97   46.09 |  ........11111
     1866      2.72   48.81 |  .........1111
    35104     51.19  100.00 | (other patterns)
 ---------------------------+---------------
    68574    100.00         |  X.X.XX.XXXXXX
Code:
sum treated

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
     treated |    303,038    .0658399    .2480025          0          1

tab treated month

           |                                                     month
   treated |    01/2015    03/2015    05/2015    06/2015    08/2015    09/2015    10/2015    11/2015    12/2015    01/2016 |     Total
-----------+---------------------------------------------------------------------------------------------------------------+----------
         0 |     27,392     27,101     27,319     27,469     27,624     27,522     27,306     28,669     30,752     31,932 |   283,086
         1 |          0          0          0          0      2,903      2,961      2,942      3,271      3,624      4,251 |    19,952
-----------+---------------------------------------------------------------------------------------------------------------+----------
     Total |     27,392     27,101     27,319     27,469     30,527     30,483     30,248     31,940     34,376     36,183 |   303,038
Code:
tab feature_use_count month

feature_us |                               month
   e_count |    08/2015    09/2015    10/2015    11/2015    12/2015    01/2016 |     Total
-----------+-------------------------------------------------------------------+----------
         1 |        642        256        188        296        381      1,187 |     2,950
         2 |        352        416        224        248        861        809 |     2,910
         3 |        247        333        373        572        494        474 |     2,493
         4 |        377        421        619        618        365        344 |     2,744
         5 |        250        500        503        502        488        402 |     2,645
         6 |      1,035      1,035      1,035      1,035      1,035      1,035 |     6,210
-----------+-------------------------------------------------------------------+----------
     Total |      2,903      2,961      2,942      3,271      3,624      4,251 |    19,952
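For readers who want to reproduce a per-user usage count like the one tabulated above, here is a minimal sketch. It assumes that feature_use_count is the number of months in which a user has treated == 1, which the tabulation is consistent with but does not confirm. The variable names use_months, ever_treated and first_use are hypothetical and not part of the original data.

```stata
* Sketch only: assumes treated == 1 marks a user-month in which the feature was used.
* use_months, ever_treated and first_use are hypothetical variable names.
bysort id: egen use_months   = total(treated)                     // months with feature use
bysort id: egen ever_treated = max(treated)                       // 1 if the user ever used it
bysort id: egen first_use    = min(cond(treated == 1, month, .))  // first month of use
format first_use %tm

* Tabulate usage counts among user-months with feature use
tab use_months month if treated == 1
```

A time-invariant indicator such as ever_treated and a first-use month make it easier to see which users adopt the feature late, which is relevant to the question about staggered uptake.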
Code:
gen time = (month > tm(2015m8)) & !missing(month)

sum time

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        time |    303,038    .5386453    .4985051          0          1
Code:
xtreg outcome time##treated, fe vce(robust)

Fixed-effects (within) regression               Number of obs     =    275,646
Group variable: id                              Number of groups  =     64,699

R-sq:                                           Obs per group:
     within  = 0.0013                                         min =          1
     between = 0.0017                                         avg =        4.3
     overall = 0.0029                                         max =          9

                                                F(3,64698)        =      55.45
corr(u_i, Xb)  = 0.0427                         Prob > F          =     0.0000

                                 (Std. Err. adjusted for 64,699 clusters in id)
------------------------------------------------------------------------------
             |               Robust
     outcome |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      1.time |  -.0536213   .0045599   -11.76   0.000    -.0625588   -.0446838
   1.treated |  -.0507019   .0223163    -2.27   0.023    -.0944419   -.0069619
             |
time#treated |
        1 1  |   .1455297   .0233969     6.22   0.000     .0996717    .1913876
             |
       _cons |   1.816555   .0029693   611.79   0.000     1.810735    1.822375
-------------+----------------------------------------------------------------
     sigma_u |  2.7997928
     sigma_e |  .75591234
         rho |  .93205863   (fraction of variance due to u_i)
------------------------------------------------------------------------------

margins time#treated

Adjusted predictions                            Number of obs     =    275,646
Model VCE    : Robust

Expression   : Linear prediction, predict()

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
time#treated |
        0 0  |   1.816555   .0029693   611.79   0.000     1.810735    1.822374
        0 1  |   1.765853   .0215303    82.02   0.000     1.723654    1.808052
        1 0  |   1.762934   .0022852   771.46   0.000     1.758455    1.767412
        1 1  |   1.857761   .0179003   103.78   0.000     1.822677    1.892845
------------------------------------------------------------------------------

marginsplot // see screenshot attached below
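As a quick check on how the interaction coefficient relates to the adjusted predictions, the difference-in-differences can be recomputed by hand from the four margins reported above. The numbers are copied from the margins table; nothing new is estimated here.

```stata
* Change over time for treated == 1 minus change over time for treated == 0,
* using the four adjusted predictions from the margins table
display (1.857761 - 1.765853) - (1.762934 - 1.816555)
* The result agrees with the time#treated coefficient (.1455297) up to rounding
```

The first parenthesis is the change for treated == 1 (about 0.0919) and the second is the change for treated == 0 (about -0.0536), so their difference reproduces the interaction term.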
Code:
gen time1 = (month > tm(2015m9)) & !missing(month)
gen time2 = (month > tm(2015m10)) & !missing(month)
gen time3 = (month > tm(2015m11)) & !missing(month)
// etc.
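The shifted indicators follow a regular pattern, so they could also be generated in a loop. This is a minimal sketch, assuming month is a Stata monthly date (%tm) variable, so that tm(2015m8) + k is the k-th month after August 2015. The upper bound of 3 simply mirrors the three lines shown and is not a recommendation.

```stata
* Sketch: time`k' equals 1 for months later than (August 2015 + k months),
* so time1 uses 2015m9, time2 uses 2015m10, time3 uses 2015m11.
* Assumes time1-time3 do not already exist in the data.
forvalues k = 1/3 {
    gen time`k' = (month > tm(2015m8) + `k') & !missing(month)
}
```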