  • Q about Difference-In-Difference model with unbalanced panel data

    Hi all,
    I am currently thinking to use difference-in-difference model in my project but not sure whether this is a feasible way I can pursue for my dataset. My project tries to explore the effect of identity shift for an emerging mobile App on its user base growth. My dataset is restricted between 2014 and 2018. Since my focus is on emerging mobile Apps, my dataset will include all new apps launched during this period and trace their monthly user base change. As those Apps launched in different months and the treatment timing (identity shift chosen by an App owner) happened in multiple time, my panel dataset is unbalanced for both control group and treatment group in the sense that I don't have equal number of observation for different entities (Apps). Is this a serious issue when I try to use DID model in this case?
    Thanks for taking time to read my question!
    Best wishes,
    Eric

  • #2
    Balance is not the issue; balance is never needed for a DID model. The issue here is that different Apps undergo the "treatment" at different times. This precludes the use of a classical DID analysis. But you can use generalized DID instead. https://www.annualreviews.org/doi/pd...-040617-013507 has a very nice explanation of this approach.

    So you will need a variable, which we can call treated, which takes on the value 1 in those observations where the app has already undergone treatment, and 0 in those observations where the app either never undergoes treatment (control group) or does, but has not yet done so. You then run a fixed-effects model using this treated variable along with fixed effects for app and year. The coefficient of treated will be the generalized DID estimate of the effect of treatment.
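    A minimal Stata sketch of this setup (the variable names app, month, users, and shift_month are hypothetical, chosen only for illustration; adapt them to your own data):

    ```stata
    * Hypothetical variables: app (panel id), month (time), users (outcome),
    * and shift_month (month of the app's identity shift; missing if never shifted).
    gen byte treated = !missing(shift_month) & month >= shift_month

    xtset app month
    * App fixed effects via -xtreg, fe-; time fixed effects via i.month.
    * Standard errors clustered at the app level.
    xtreg users i.treated i.month, fe vce(cluster app)
    ```

    The coefficient on 1.treated is then the generalized DID estimate of the treatment effect.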

    If you need more concrete advice, post back showing example data, using the -dataex- command. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.



    • #3
      Many thanks, Clyde! Thank you for such helpful advice and for the attached paper! I will post the example data once I finish the data collection, which should be soon.

      One more concern I have is this: should I restrict my treatment to a specific type (i.e., a particular identity shift occurring for the first time after launch), since different apps may shift their identities in different ways and multiple times over the five-year period? Or could I use the DID model to explore the effect of a general identity shift, without restricting its type or frequency? If I need to restrict the treatment as in the example above, I may have to truncate the data for apps with multiple identity shifts at the time of the second shift, in order to prevent the later shifts from disturbing my DV. In that case my dataset may waste a lot of samples and retain only a small part of the data, which would limit my ability to test causality in the DID model or reduce the reliability of the results. Do you have any suggestions for this?

      Best regards,
      Eric



      • #4
        Your questions in #3 are not statistical in nature. They depend on an understanding of the subject matter itself. I am not in your field: I don't even know what the term "identity shift" means as applied to an app, so I certainly have no feel for general vs specific types of them. To the extent that a model treats different things in the same way, that is a source of error (typically bias). But treating things that are really the same as if they were different also introduces error (typically noise). You need to determine, based on your understanding of the real-world context of your study, which types of identity shift are "the same" (in the sense of affecting the outcome you are studying in the same way) and which are different, and then, to the best of your ability and within the limitations of the available data, try to model accordingly. To the extent that your own experience and judgment in this field are limited, you would be best off consulting the literature or experts in your field for guidance.



        • #5
          Thanks for your response, Clyde. I understand your point. This is also my biggest concern for this project, since the dataset is not very similar to the classical DID setting. I will definitely consult the literature further. To simplify my question in #3: can the DID model handle a treatment group with multiple occurrences of the same intervention over a time period, or should I constrain my analysis to one-time interventions in this case? If the DID model cannot handle such a situation, do you have any suggestions for other available models? I once considered time-to-event analysis (survival analysis), but it seems to have a similar requirement, in the sense that the intervention should be one-time and occur at the same time for all subjects in the treatment group.

          By the way, I really appreciate your patient responses, especially today! I hope I am not disturbing your Christmas holiday! Many thanks again, and Merry Christmas~



          • #6
            The DID model is very flexible, and you can model multiple interventions. It just requires that you a) craft your code to properly represent the different interventions and what you expect to happen after them, and b) have sufficient data to adequately estimate all of the different effects. By a) I mean that when you have multiple interventions, the effect of a second intervention may differ from that of a first intervention: it might be synergistic or antagonistic. And the effects of third interventions could be different still. So you need a very clear conceptual model of how these multiple interventions affect the outcome, not just in isolation but in the context of following another intervention. That, in turn, requires a deep understanding of the subject matter. By b) I mean that each additional intervention requires more variables to be added to the model, which, in turn, means a larger estimation sample is required to get results with useful precision.
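            A hedged Stata sketch of point a) (the variable names shift, users, app, and month are hypothetical): one simple specification gives the second intervention its own indicator, so its effect is allowed to differ from the first.

            ```stata
            * shift is a hypothetical 0/1 indicator of an identity shift in that month.
            bysort app (month): gen cum_shifts = sum(shift)
            gen byte post1 = cum_shifts >= 1   // at or after the first shift
            gen byte post2 = cum_shifts >= 2   // at or after the second shift
            * The coefficient on post2 is the *additional* effect of a second shift
            * relative to the first (synergy if it reinforces it, antagonism if it offsets it).
            xtreg users i.post1 i.post2 i.month, fe vce(cluster app)
            ```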

            In general, I would say that if what you are studying is a relatively new area with little past research to rely on, it is unlikely you will be able to come up with a credible model of multiple interventions, so you should stick to evaluating first interventions only (and cut off the data at any subsequent intervention). If, however, this is a field where a lot is already known, then you might be able to reason out a sensible model for multiple interventions.
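            If you take the "first interventions only" route, one way to cut off the data at any subsequent intervention in Stata (again with hypothetical variable names) is:

            ```stata
            * shift is a hypothetical 0/1 indicator of an identity shift in that month.
            bysort app (month): gen cum_shifts = sum(shift)
            * Drop every observation from the month of the second shift onward.
            drop if cum_shifts >= 2
            ```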



            • #7
              Alright, I think I should stick to evaluating first interventions only in this case. But I am still very curious about running the DID model with multiple interventions, since I have never seen a paper applying DID in this way. Could you please share any papers that apply or explain DID in this way, if you have come across them? Many thanks!



              • #8
                I don't have any such papers to share with you; the approach is rarely used, for the reasons I mentioned: you need to create a complicated model, and that requires a well-developed theory. I'm just familiar enough with the mathematics of the modeling to assure you that it could be done under the right circumstances.
