Ensuring correct specification of the difference-in-difference model

Anton Ivanov

Join Date: Sep 2014
Posts: 267

22 Jul 2018, 16:08

Hello! I am seeking your feedback on my approach to estimate a difference-in-difference model (for an unbalanced panel).

My goal is to estimate the effect of a feature introduced on an online platform on some outcome variable. The feature became available in August 2015. The monthly data available for analysis cover the following months: 01, 03, 05, 06, 08, 09, 10, 11, and 12 of 2015, and 01/2016:

Code:

xtset
       panel variable:  id (unbalanced)
        time variable:  month, 01/2015 to 01/2016, but with gaps
                delta:  1 month

xtdescribe

      id:  105, 2515, ..., 10289394                          n =      68574
   month:  01/2015, 03/2015, ..., 01/2016                    T =         10
           Delta(month) = 1 month
           Span(month)  = 13 periods
           (id*month uniquely identifies each observation)

Distribution of T_i:   min      5%     25%       50%       75%     95%     max
                         1       1       2         4         7      10      10

     Freq.  Percent    Cum. |  Pattern
 ---------------------------+---------------
     8446     12.32   12.32 |  1.1.11.111111
     4239      6.18   18.50 |  ............1
     4194      6.12   24.61 |  ...........11
     3875      5.65   30.27 |  1............
     3182      4.64   34.91 |  .......111111
     3087      4.50   39.41 |  1.1..........
     2541      3.71   43.11 |  ..........111
     2040      2.97   46.09 |  ........11111
     1866      2.72   48.81 |  .........1111
    35104     51.19  100.00 | (other patterns)
 ---------------------------+---------------
    68574    100.00         |  X.X.XX.XXXXXX

Starting in August 2015, some platform users began to use the new feature; let me call them the treated group:

Code:

sum treated

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
     treated |    303,038    .0658399    .2480025          0          1

tab treated month

           |                                                     month
   treated |   01/2015    03/2015    05/2015    06/2015    08/2015    09/2015    10/2015    11/2015    12/2015    01/2016 |     Total
-----------+--------------------------------------------------------------------------------------------------------------+----------
         0 |    27,392     27,101     27,319     27,469     27,624     27,522     27,306     28,669     30,752     31,932 |   283,086
         1 |         0          0          0          0      2,903      2,961      2,942      3,271      3,624      4,251 |    19,952
-----------+--------------------------------------------------------------------------------------------------------------+----------
     Total |    27,392     27,101     27,319     27,469     30,527     30,483     30,248     31,940     34,376     36,183 |   303,038

While some users used the new feature every single month starting from August, others used it only once or a few times:

Code:

tab feature_use_count month

feature_us |                               month
   e_count |   08/2015    09/2015    10/2015    11/2015    12/2015    01/2016 |     Total
-----------+------------------------------------------------------------------+----------
         1 |       642        256        188        296        381      1,187 |     2,950
         2 |       352        416        224        248        861        809 |     2,910
         3 |       247        333        373        572        494        474 |     2,493
         4 |       377        421        619        618        365        344 |     2,744
         5 |       250        500        503        502        488        402 |     2,645
         6 |     1,035      1,035      1,035      1,035      1,035      1,035 |     6,210
-----------+------------------------------------------------------------------+----------
     Total |     2,903      2,961      2,942      3,271      3,624      4,251 |    19,952

Now, given the feature became available in 08/2015, I create a time variable:

Code:

gen time = (month > tm(2015m8)) & !missing(month)

sum time

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        time |    303,038    .5386453    .4985051          0          1

And then estimate the following fixed-effects model:

Code:

xtreg outcome time##treated, fe vce(robust)

Fixed-effects (within) regression               Number of obs     =    275,646
Group variable: id                              Number of groups  =     64,699

R-sq:                                           Obs per group:
     within  = 0.0013                                         min =          1
     between = 0.0017                                         avg =        4.3
     overall = 0.0029                                         max =          9

                                                F(3,64698)        =      55.45
corr(u_i, Xb)  = 0.0427                         Prob > F          =     0.0000

                                (Std. Err. adjusted for 64,699 clusters in id)
------------------------------------------------------------------------------
             |               Robust
     outcome |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      1.time |  -.0536213   .0045599   -11.76   0.000    -.0625588   -.0446838
   1.treated |  -.0507019   .0223163    -2.27   0.023    -.0944419   -.0069619
             |
time#treated |
        1 1  |   .1455297   .0233969     6.22   0.000     .0996717    .1913876
             |
       _cons |   1.816555   .0029693   611.79   0.000     1.810735    1.822375
-------------+----------------------------------------------------------------
     sigma_u |  2.7997928
     sigma_e |  .75591234
         rho |  .93205863   (fraction of variance due to u_i)
------------------------------------------------------------------------------

margins time#treated

Adjusted predictions                            Number of obs     =    275,646
Model VCE    : Robust

Expression   : Linear prediction, predict()

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
time#treated |
        0 0  |   1.816555   .0029693   611.79   0.000     1.810735    1.822374
        0 1  |   1.765853   .0215303    82.02   0.000     1.723654    1.808052
        1 0  |   1.762934   .0022852   771.46   0.000     1.758455    1.767412
        1 1  |   1.857761   .0179003   103.78   0.000     1.822677    1.892845
------------------------------------------------------------------------------

marginsplot
// see screenshot attached below

Does everything seem to be appropriate so far?

Also, since different users started using the new feature at different times, does it make sense to create several time* variables and examine how the effect unfolds over time? E.g.:

Code:

gen time1 = (month > tm(2015m9)) & !missing(month)
gen time2 = (month > tm(2015m10)) & !missing(month)
gen time3 = (month > tm(2015m11)) & !missing(month)
// etc.

[Attached image: Screen Shot 2018-07-22 at 18.04.49.png, the -marginsplot- graph referred to above]

I would sincerely appreciate your feedback.

Last edited by Anton Ivanov; 22 Jul 2018, 16:09. Reason: difference-in-difference

Tags: difference-in-difference, fixed-effects, regression, unbalanced panel data

Clyde Schechter

Join Date: Apr 2014

Posts: 30357
#2

22 Jul 2018, 18:22

Your approach seems appropriate overall. But your data and results confuse me, because they don't seem consistent with each other. Your -tab treated month- output shows that none of the treated users were observed at all in the months preceding the intervention: you have no data at all on this group of people when they did not have access to the intervention. So you don't actually have a full two-by-two layout of treatment status and pre-post intervention status. But that should show up in your regression results: it should not be possible to get estimates for all three of 1.time, 1.treated, and 1.time#1.treated, because that makes three indicators for three categories, so there must be collinearity. Similarly, -margins- should have reported the result for 1.treated#0.time as "(not estimable)". So I don't understand how you could have gotten those results from the data as you show it.
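
A minimal sketch of how the two-by-two layout could be checked directly. The toy data and the variable name post are assumptions for illustration, not taken from the actual data set:

```stata
* Toy data (hypothetical): 3 users, months 1-4, feature launched in month 3.
* treated flags feature use, so it can only be 1 from month 3 onward.
clear
input long id byte(month treated)
1 1 0
1 2 0
1 3 0
1 4 0
2 1 0
2 2 0
2 3 1
2 4 1
3 3 1
3 4 1
end

* post is an assumed name for a period indicator that includes the launch month.
generate byte post = (month >= 3) & !missing(month)

* A classical DID needs observations in all four post-by-treated cells.
tabulate post treated

* An empty (pre, treated) cell means 1.treated and 1.post#1.treated
* cannot be estimated separately.
count if post == 0 & treated == 1
display "pre-period treated observations: " r(N)
```

If the count for the pre-period treated cell is zero, the regression can estimate at most two of the three terms, whatever the estimator.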

different users started using the new feature at different times

What does this mean? For the users who began using the new feature in, say, 10/2015, what information do we have about them for 8/2015 and 9/2015? Were they simply not represented in the data at all in those months? Or were they in the data, but not given access to the intervention until 10/2015, so that they were effectively in the control condition for 8/2015 and 9/2015? It makes a big difference. If they are simply not in the data at all, then your current analysis is fine. But if they were in the data but restricted to control conditions for that time, then you cannot use the classical DID analysis and must do a generalized DID instead.
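
For reference, a generalized DID is commonly set up as a two-way fixed-effects regression on a single indicator that switches on when a unit comes under treatment. The following is only a sketch on simulated data; the variable names start and available, the staggered start months, and the effect size are all assumptions for illustration:

```stata
* Simulated panel (hypothetical): 200 users, 8 months, staggered access.
clear
set seed 12345
set obs 200
generate long id = _n
* start = month in which access begins; 0 means the user never gets access.
generate byte start = cond(mod(id, 4) == 0, 0, 3 + mod(id, 4))
expand 8
bysort id: generate byte month = _n
* available = 1 in user-months in which the feature is accessible.
generate byte available = (start > 0) & (month >= start)
generate double outcome = 1 + 0.05*month + 0.5*available + rnormal()

xtset id month
* User fixed effects plus month indicators; the coefficient on 1.available
* is the generalized DID estimate.
xtreg outcome i.available i.month, fe vce(cluster id)
```

There is no separate pre/post variable here: the month indicators take over that role, which is what allows different users to start at different times.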

Anton Ivanov

Join Date: Sep 2014
Posts: 267

22 Jul 2018, 19:09

Clyde, thank you for your response. Regarding your first comment, I think the problem is in the way I generated the time variable: I used ">" only, whereas it seems I should have used ">=". If I correct that, then the result is:

Code:

 gen time1 = (month >= tm(2015m8)) & !missing(month)

xtreg outcome time1##treated, fe vce(robust)
note: 0b.time1#1.treated identifies no observations in the sample
note: 1.time1#1.treated omitted because of collinearity

Fixed-effects (within) regression               Number of obs     =    275,646
Group variable: id                              Number of groups  =     64,699

R-sq:                                           Obs per group:
     within  = 0.0015                                         min =          1
     between = 0.0023                                         avg =        4.3
     overall = 0.0034                                         max =          9

                                                F(2,64698)        =      85.61
corr(u_i, Xb)  = 0.0463                         Prob > F          =     0.0000

                                 (Std. Err. adjusted for 64,699 clusters in id)
-------------------------------------------------------------------------------
              |               Robust
      outcome |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
      1.time1 |   -.067274   .0054251   -12.40   0.000    -.0779073   -.0566407
    1.treated |   .0817391    .017511     4.67   0.000     .0474174    .1160607
              |
time1#treated |
         0 1  |          0  (empty)
         1 1  |          0  (omitted)
              |
        _cons |   1.831505   .0039702   461.31   0.000     1.823723    1.839286
--------------+----------------------------------------------------------------
      sigma_u |  2.7995229
      sigma_e |  .75582747
          rho |  .93206064   (fraction of variance due to u_i)
-------------------------------------------------------------------------------

margins time1#treated 

Adjusted predictions                            Number of obs     =    275,646
Model VCE    : Robust

Expression   : Linear prediction, predict()

-------------------------------------------------------------------------------
              |            Delta-method
              |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
time1#treated |
         0 0  |   1.831505   .0039702   461.31   0.000     1.823723    1.839286
         0 1  |          .  (not estimable)
         1 0  |   1.764231   .0020898   844.21   0.000     1.760135    1.768327
         1 1  |    1.84597   .0162591   113.53   0.000     1.814102    1.877837
-------------------------------------------------------------------------------

Is there any plausible way for me to address this issue and model this appropriately given the structure of the data?

As for your second comment, let me try to explain. Based on the -tab feature_use_count month- output, some of the users do use the feature literally every month starting from August (i.e., feature_use_count = 6 for six months starting from August). However, others can, for example, use the feature in August, then in October, and then in December. So there can be gaps in the user's usage of the feature. Not to mention that some of the users only used the feature once, say in November, and that's it. Let me check though if any of the data are available specifically for these users for the months when they don't use the feature. I'll post an update on this issue.

I am very thankful for your useful comments.

Clyde Schechter

Join Date: Apr 2014

Posts: 30357
#4

22 Jul 2018, 19:51

Is there any plausible way for me to address this issue and model this appropriately given the structure of the data?

Not really. For a full DID design you need information on the treatment group from before the intervention begins. Without that, all you can do is compare the outcome in the treatment and control groups from the start of the intervention forward. Unless the treatment group was selected by an actual randomization, that is a much weaker design than a DID, and you will be hard pressed to say that you have identified a causal effect with it. If data on the treatment group from before then is available somewhere, you are well advised to get it and include it in your analysis so you can do a full DID design.

Your response to my other comment doesn't quite answer the question I had in mind. So to make it clearer: it is not so important for this purpose whether or not the users actually used the feature, or how often. What is relevant is whether or not the feature was available to them, whether they could have used it if they chose to do so. As a product developer, you do not ordinarily have any control over whether users make use of all the features of your product, so it is not important to know whether actual use of the feature affects the outcome. You need to know whether making the feature available affects the outcome: this is what you have control over. In health care research, this is known as the "intention to treat" analysis and it is considered the best way to estimate the effect of a treatment: based on who was assigned to receive the treatment, not based on who actually used it.

If you wish, you can also do an "as treated" analysis in which the treatment variable is redefined to mean actual use of the feature, rather than just availability of the feature. But this is an inferior analysis because the group that chooses to make use of the feature is self-selected and may very well differ from those who choose not to use it in important ways that confound your analysis of the outcome. In particular, in this "as treated" analysis, neither a DID design nor randomization to treatment assignment can be expected to identify a causal effect.
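
To make the distinction concrete, here is a small sketch of how the two treatment definitions classify the same users. The data and the variable names offered, used, itt, as_treated, and ever_used are hypothetical:

```stata
* Toy data (hypothetical): offered = feature was available to the user,
* used = the user actually used it in that month.
clear
input long id byte(month offered used)
1 1 1 1
1 2 1 0
2 1 1 0
2 2 1 0
3 1 0 0
3 2 0 0
end

* Intention to treat: classify by availability, regardless of use.
generate byte itt = offered

* As treated: classify by actual use, which users select themselves.
generate byte as_treated = used

* User 2 had access but never used the feature: treated under
* intention to treat, untreated in every month under as treated.
bysort id: egen byte ever_used = max(used)
list id month itt as_treated ever_used, sepby(id) noobs
```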

That said, I am presuming that your intent is to identify causal effects. If you are not interested in that and just want an estimate of the association between the use of the feature and outcome, then you can proceed with what you have. Just remain aware that that association may be the result of many influences, some perhaps inherently unobservable, other than the use of the feature itself.

Anton Ivanov

Join Date: Sep 2014
Posts: 267

22 Jul 2018, 21:26

Clyde, please let me clarify.

I would actually start by addressing your latter comments, because it seems to me that they are of primary importance.

The feature was available to absolutely all users of the platform starting from August. The decision whether to use it or not was based solely on the user's desire to do so. It was not randomly assigned, but rather chosen by the users themselves. As far as I understand, the group that chooses to make use of the feature is indeed self-selected and may differ from those who choose not to use it. Therefore, this brings me to the "as treated" type of analysis, correct?

Now, addressing the pre- and post- intervention availability of the data. Although given the self-selection issue, this might not be of much help, I did the following. For the id's that used the feature at least once starting from August, I kept only those that were available both before and after August. Here is an example of the data using -dataex id month outcome time treated-:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long id byte month int outcome float(time treated)
2515  1 106 0 0
2515  3 109 0 0
2515  5 111 0 0
2515  6 112 0 0
2515  8 115 1 1
2515  9 120 1 0
2515 10 121 1 0
2515 11 122 1 0
2515 12 123 1 0
2515 13 124 1 0
3831  1   5 0 0
3831  3   6 0 0
3831  5   6 0 0
3831  6   7 0 0
3831  8   8 1 0
3831  9  14 1 0
3831 10  18 1 0
3831 11  25 1 0
3831 12  28 1 1
3831 13  32 1 1
5099  1  41 0 0
5099  3  42 0 0
5099  5  42 0 0
5099  6  44 0 0
5099  8  46 1 1
5099  9  47 1 1
5099 10  49 1 0
5099 11  51 1 0
5099 12  52 1 0
5099 13  52 1 0
5489  1 185 0 0
5489  3 191 0 0
5489  5 198 0 0
5489  6 207 0 0
5489  8 215 1 1
5489  9 221 1 1
5489 10 223 1 0
5489 11 229 1 0
5489 12 232 1 0
5489 13 234 1 0
5857  8  60 1 1
5857  9  60 1 1
5857 10  63 1 0
5857 11  64 1 0
5857 12  65 1 0
5857 13  66 1 0
6990  1 138 0 0
6990  3 143 0 0
6990  5 149 0 0
6990  6 150 0 0
6990  8 150 1 1
6990  9 150 1 1
6990 10 153 1 1
6990 11 157 1 1
6990 12 161 1 1
6990 13 165 1 1
8225  1  72 0 0
8225  3  74 0 0
8225  5  76 0 0
8225  6  78 0 0
8225  8  79 1 1
8225  9  81 1 0
8225 10  83 1 0
8225 11  84 1 0
8225 12  84 1 0
8225 13  84 1 0
8293 12  59 1 1
8293 13  63 1 1
9668  1 157 0 0
9668  3 157 0 0
9668  5 163 0 0
9668  6 166 0 0
9668  8 174 1 1
9668  9 179 1 1
9668 10 181 1 1
9668 11 185 1 1
9668 12 186 1 1
9668 13 187 1 1
9680  1  58 0 0
9680  3  59 0 0
9680  8  71 1 1
9680  9  73 1 1
9680 10  76 1 1
9680 11  79 1 1
9782  1 109 0 0
9782  3 111 0 0
9782  5 117 0 0
9782  6 119 0 0
9782  8 122 1 0
9782  9 127 1 0
9782 10 129 1 0
9782 11 131 1 1
9782 12 132 1 1
9782 13 133 1 0
9783  1 166 0 0
9783  3 169 0 0
9783  5 173 0 0
9783  6 176 0 0
9783  8 184 1 1
9783  9 184 1 1
end

Given the results in the output, there are pre- and post- data on the outcome available.

But regardless of all this, I am still left with an "as treated" option for analysis, right?

Anton Ivanov

Join Date: Sep 2014

Posts: 267
#6

22 Jul 2018, 21:42

Also, I realize this might be a bit off-topic, but do you think a user-written -ddid- module could then be applicable for my case?

Giovanni Cerulli & Marco Ventura, 2017. "DDID: Stata module to compute pre- and post-treatment estimation of the Average Treatment Effect (ATE) with binary time-varying treatment," Statistical Software Components S458384, Boston College Department of Economics
Clyde Schechter

Join Date: Apr 2014

Posts: 30357
#7

22 Jul 2018, 21:48

OK, now I have a clearer understanding of your situation. This is not suitable for a DID analysis, not even close. In fact, in this data there is no way at all that I can see to identify a causal effect of the intervention from this data set. All you can do is get a simple crude comparison of the outcome when the new feature is used vs when it isn't. This is strongly confounded by lots of things that are not included in the data (and probably by things that you can't even get data on if you want to.) Fortunately, you have longitudinal data on people, so at least by using a fixed-effects panel estimator you can eliminate the effects of attributes of the individual users that are unchanging over time. But that still doesn't help you control for factors that vary over time. In particular, users may perceive the feature as most useful under certain use-cases, and the nature of those use-cases may, itself, be associated with the outcome you are measuring. This would be a serious confounder.

I also think that the data from before August is uninformative and should be omitted from the analysis. It is untreated only. While you might think it has value by increasing sample size, it is also from a different time period and may be affected by factors that make it different from the data where you have both treated and untreated observations (seasonal use patterns or something like that). Not knowing more about the specifics, it's hard to make this concrete. But in general study design terms, I would think the best you can do with this is something like:

Code:

xtset id month
xtreg outcome i.treated if time, fe

If you have other variables available to you that are plausibly related to the outcome and possibly related to the user's decision to use or not use the feature, then including those in the regression would improve things somewhat. Among the things you should consider in this light is including month itself as a predictor variable as there may well be either seasonal or long-term trends in the outcome that need to be adjusted for.
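
As a sketch of the month adjustment just described, run on simulated data so that it is self-contained: the data-generating values are made up, and vce(robust) is carried over from the original poster's command rather than being part of the command above.

```stata
* Simulated data (hypothetical): 300 users observed in months 1-13.
clear
set seed 2018
set obs 300
generate long id = _n
generate double u = rnormal()                  // time-invariant user effect
expand 13
bysort id: generate byte month = _n
generate byte time = month >= 8                // months with the feature in use
generate byte treated = time & (runiform() < 0.3)
generate double outcome = 1 + u + 0.05*month + 0.2*treated + rnormal()

xtset id month
* i.month absorbs shifts common to all users in a month (seasonality, trend).
xtreg outcome i.treated i.month if time, fe vce(robust)
```

Any further time-varying covariates would simply be added to the variable list in the same way.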

And any conclusions you draw from this would have to be very tentative due to the very weak study design. Perhaps there are specific things in your context and relating to the particular outcome measures that mitigate these problems somewhat. And if that is the case, you should make those arguments why things are not quite so bad as they seem. But I think it will be very difficult to reach any persuasive conclusions from this type of data.

Last edited by Clyde Schechter; 22 Jul 2018, 21:52.
Anton Ivanov

Join Date: Sep 2014

Posts: 267
#8

22 Jul 2018, 23:05

Dear Clyde, I am very, very thankful to you for your comprehensive feedback and guidance. Indeed, these (publicly available) data are not without issues, which make it substantially harder to reach persuasive conclusions. You are absolutely correct on that. Your comment on seasonality is also very useful, as I do observe variation across months in the descriptive statistics. Given the limitations, I plan to include about 15 time-varying control variables (at the user level) and, in addition, to use instrumental variables to alleviate endogeneity concerns arising from omitted variable bias.

Once again, thank you very much for your response, sir.
Anton Ivanov

Join Date: Sep 2014

Posts: 267
#9

26 Aug 2018, 15:56

Hello Clyde Schechter,

You have been extremely helpful with my previous questions in this post, and I am thankful to you for that. There is an important update related to the way the feature was introduced to the platform users (this knowledge was not available at the time of my first post). Therefore, I would like to check with you whether it makes a DID model feasible.

The feature became available to all platform users in January 2016. Before that, starting in August 2015, it was tested with only some users. In particular, in July an email was sent out randomly to platform users asking whether they would like to participate in using the new feature. Those who responded positively were given an opportunity to use the feature. Over the 5-month testing period (Aug-Dec 2015), different users used the feature between one and five times. In other words, while some users used it consistently, others just tried it out once or a few times.

I realize there is a bias associated with self-selection. But other than that, what are the caveats of such a design for the purpose of DID?

With sincere appreciation of your feedback,
Anton

Last edited by Anton Ivanov; 26 Aug 2018, 16:03.
Clyde Schechter

Join Date: Apr 2014

Posts: 30357
#10

26 Aug 2018, 19:45

Well, if the email was really sent out randomly (in the statistical, not the lay sense of the word) to platform users, then you have a randomized controlled trial of offering the feature. So you could then do an analysis that contrasts the outcome among those who received the email offer to those who were not sent the email offer. Interestingly, the most efficient way to do this would be, from a modeling perspective, similar to a DID analysis of this phenomenon, so that you adjust for whatever the pre-offer baseline difference in outcome was. (The main difference would be that you can use a random-effects estimator, which is more efficient than the fixed-effects estimator, without having to worry whether the random effects estimator is consistent: the randomization implies its consistency.) But in this case there would be no issue of identifying the causal effect: the randomization does that.

That would not, however, alter anything you could do with regard to the differences in outcome among people who used it 1, 2, 3, 4, or 5 times, or didn't use it despite having it available. That attribute of the users is self-selected, so that direct comparisons may be confounded with unobserved factors that distinguish these groups. For them, the DID approach offers an approximation to a causal analysis, because at least those attributes that are time invariant are ruled out as confounders. But you still cannot exclude confounding by time-varying attributes with this analysis.

So except for the possibility of adding to your analyses a randomized test of the effect of making the feature available (without regard to whether or how often it is used), it doesn't change anything else.
Anton Ivanov

Join Date: Sep 2014

Posts: 267
#11

30 Aug 2018, 10:25

Clyde, thank you very much for your comprehensive response. I took my time to double-check on the randomness issue that you pointed out. The platform developer has confirmed that the platform users were selected at random and that selection was statistically random across all platform users.

Please let me clarify if from the modeling perspective I should use a model similar to the DID or a different one? I.e., those who received an email = 1 (0 otherwise), post-email period = 1 (0 pre-email), and then examine the interaction coefficient between the two dummies? And use a random-effects estimator, as you emphasized.
Clyde Schechter

Join Date: Apr 2014

Posts: 30357
#12

30 Aug 2018, 10:29

You have it right. The regression model is the same as the one used for a DID estimate, and you look at the interaction coefficient between the two indicators (dummies).
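
A minimal sketch of that model on simulated data. The variable names emailed and post, the sample size, and the effect size are assumptions for illustration; in the real data, emailed would have to come from the platform's record of who was sent the offer:

```stata
* Simulated data (hypothetical): 400 users, months 1-12, offer takes effect in month 8.
clear
set seed 42
set obs 400
generate long id = _n
generate byte emailed = runiform() < 0.5       // randomized email offer
generate double u = rnormal()                  // user-level random effect
expand 12
bysort id: generate byte month = _n
generate byte post = month >= 8                // post-offer period
generate double outcome = 2 + u + 0.3*emailed*post + rnormal()

xtset id month
* Same specification as a DID, estimated with random effects.
xtreg outcome i.emailed##i.post, re vce(cluster id)

* The interaction coefficient is the estimated effect of offering the feature.
lincom 1.emailed#1.post
```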
Anton Ivanov

Join Date: Sep 2014

Posts: 267
#13

31 Aug 2018, 00:17

Thank you very much for your guidance, Clyde.

Anton Ivanov

Join Date: Sep 2014

Posts: 267
#14

02 Oct 2018, 10:06

Dear Clyde,

I would like to ask one more clarifying question related to the model that we discussed in this post. For testing purposes, the platform randomly selected n users from among all platform users. Next, it sent emails to the selected users asking whether they would like to participate in testing a new feature. Users who responded favorably had the feature made available for their accounts on the platform. Unfortunately, the data on which of the n users responded favorably are not available. As such, I suspect there could be bias associated with self-selection into use of the feature during the test period.

Based on my understanding, if the self-selection bias were associated with the outcome, I would be able to use a Heckman model, for instance. However, given the bias is on the predictor of interest, I am not sure which correction approach would be feasible here.

I would appreciate your feedback on the possible solution for this issue.

Thankfully,
Anton
Clyde Schechter

Join Date: Apr 2014

Posts: 30357
#15

02 Oct 2018, 11:56

You are correct that the Heckman model is used for biased non-availability of outcome and does not apply to this situation. This problem arises frequently in research involving human subjects because you typically cannot enforce random assignment to treatments. It does not have any simple solutions. An instrumental variables approach is sometimes used in this situation--but from what you describe that doesn't seem feasible here. There may not be a good solution to this problem.