  • Weighted average counterfactuals

    Hi there, I am having some trouble understanding the weighted average command and how to use it when producing counterfactuals.

    Here is an example of my problem!



    *** Trying to figure out what is going on with these weighted averages
    *** Looking at car data, and the number of sales and defects per model
    *** Want to answer the question: Did the share of cars with defects increase from 2000 to 2005?
    *** And if so, is that due to a composition change towards the production of more cars that were already more defect prone in 2000,
    *** Or is it because the car models themselves became more defect prone

    ** So, I expect defect_percentage and new_defect_percentage to be equal
    ** I also expect counterfactual_defect_percentage and new_countfact_defect_percentage to be equal

    *** Unfortunately, the two counterfactual variables I create are NOT equal, and I don't understand why.

    Code:
    clear all
    
    sysuse auto
    drop price-foreign
    expand 2, gen(dupindicator)
    set seed 12345
    sort make
    gen year = 2000 if dupindicator == 0
    replace year = 2005 if dupindicator == 1
    
    gen sales = runiform()
    replace sales = sales * 1000
    replace sales = round(sales, 1)
    set seed 54321
    gen defects = runiform()
    replace defects = defects * 100
    replace defects = round(defects,1)
    
    ** Defect percentage by make-year
    gen defect_percentage = defects / sales
    
    ** Total number of sales per year
    bysort year: egen total_year_sales = sum(sales)
    
    ** make share of sales in a year
    gen make_share = sales/total_year_sales
    ** What was the total defect share in 2000? What was it in 2005?
    bysort year: egen defect_percentage_year = wtmean(defect_percentage), weight(make_share)
    sort make year
    
    gen make_share_2000 = make_share
    replace make_share_2000 = . if year == 2005
    bysort make: carryforward make_share_2000, replace
    
    bysort year: egen counterfactual_defect_percentage = wtmean(defect_percentage), weight(make_share_2000)
    sort make year
    
    
    bysort year : egen defect_total_year = sum(defects)
    gen new_defect_percentage = defect_total_year / total_year_sales
    
    
    ** Total 2000 sales, carried forward onto each make's 2005 row
    gen sales_2000 = total_year_sales
    replace sales_2000 = . if year == 2005
    sort make year
    bysort make (year): carryforward sales_2000, replace
    
    
    gen new_countfact_defect_percentage = defect_total_year / sales_2000
    sort make year

  • #2
    Let's start from the beginning, shall we? You're asked to say when a command is user-written, as wtmean is; I'd never heard of it before, and I suspect most folks haven't. I really appreciate the data example, but I'm super unclear as to how any of this is related to counterfactuals or producing them. Either way, I think I understand your intervention... but what's your empirical design? Surely there's a better way to get what you want than all of these generate commands, right? Can you talk a little about how you're estimating effects in this instance?

    Anyways, my guess is that you've changed your random seed... that might be the reason. Barring this, post your real data using dataex (if you can't do that, then make a synthetic dataset from that dataset and anonymize it), and then I suspect I can try to answer this better. But for the moment, this question seems super abstract, and I really love causal inference and counterfactual analysis, so I wanna understand this better.


    • #3
      My apologies. I thought wtmean was something everyone has come across!

      This is related to counterfactuals in the sense that I am doing something like a shift-share analysis. In the example I gave, I am trying to see if the number of cars with defects would have been higher in 2005 IF the composition of car sales hadn't changed from 2000 to 2005. That is all I mean by a counterfactual.

      The random seed isn't why the two counterfactuals I create aren't the same. I guess my question is: am I using the wtmean command wrong? Have I discovered a bug? Or am I using wtmean correctly but calculating the counterfactual by hand incorrectly?

      I agree that there should be a better way than to use a bunch of generate commands. That is why I wanted to use the wtmean command. But I'm finding that doing it the long way with a bunch of generate commands is yielding a different result and I do not understand why or which method is correct.


      • #4
        Why not just do a difference-in-differences analysis? Post your data using dataex and I'll give the syntax for DD, there's almost certainly no reason at all to use wtmean or even generate in this instance. You'll likely only need regress with the weight option to do this.


        • #5
          Hmm. It's not clear to me how a DiD answers this question. I'm curious what you mean. And I can't post a dataex, but I'm happy to work through this question using the automobile data.


          • #6
            Why can't you use dataex? Is your Stata version less than 9.2?


            DiD answers the question because it seems like you're trying to answer whether or not the change in policy had any impact on the amount of defects, right? If you have a group of cars that receives the treatment, a group that doesn't receive the treatment, and a pre-post period (which it seems like you do), DD is optimal here. If not, I don't know what other approach would be viable to generate a counterfactual. And if DD isn't viable, I guess my question is, how do you argue that you're getting at the counterfactual in the causal sense? You can't do that short of some kind of cohort/quasi-experimental design. I'm not asking this to be difficult, by the way, I'm curious about what you think your identifying assumptions are in this context? If DD wouldn't work here, what would? This is why I asked to see your data, because without it, any advice I can give short of seeing how it really looks is almost impossible.

            Even when I look at your code you want to use, I'm still lost. Where's our treatment variable? Which cars were treated or not?


            • #7
              I could use dataex but I'm not sure what would be different from just typing in the first 14 lines of code.

              And there is no treatment variable or policy. I am just looking at how defects evolve over time. Say defects increase from 2000 to 2005 by 5 percentage points. I want to decompose that increase to see how much is due to 1) car models becoming more defective and 2) the change in the composition of car sales.
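
              One way to write this decomposition down, as a sketch: the symbols below are assumptions introduced for illustration and do not appear in the code. Let $q_{i,t}$ and $d_{i,t}$ be sales and defects of model $i$ in year $t$, $s_{i,t} = q_{i,t} / \sum_j q_{j,t}$ the model's share of sales, and $r_{i,t} = d_{i,t} / q_{i,t}$ its defect rate. Holding the 2000 shares fixed gives the counterfactual rate $R^{cf}_{2005}$, and the total change splits into a within-model part and a composition part:

```latex
\begin{align*}
R_t &= \sum_i s_{i,t}\, r_{i,t} = \frac{\sum_i d_{i,t}}{\sum_i q_{i,t}}, \qquad
R^{cf}_{2005} = \sum_i s_{i,2000}\, r_{i,2005}, \\
R_{2005} - R_{2000}
  &= \underbrace{\sum_i s_{i,2000}\,\bigl(r_{i,2005} - r_{i,2000}\bigr)}_{\text{models becoming more defective}}
   + \underbrace{\sum_i \bigl(s_{i,2005} - s_{i,2000}\bigr)\, r_{i,2005}}_{\text{change in composition}} .
\end{align*}
```

              The first term equals $R^{cf}_{2005} - R_{2000}$ and the second equals $R_{2005} - R^{cf}_{2005}$, so the two parts add up to the total change.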

              The two counterfactuals I generate are meant to show what happens if you hold the composition constant at its 2000 level. But when I calculate it these two different ways, I get different results. I am confused about whether I am performing the calculations wrong in one or both of the ways, because I am pretty sure they should yield the same result.
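
              A small numeric sketch of the two calculations described above may help show where they part ways. It is written in Python rather than Stata, and the two models "A" and "B" and all figures are made up for illustration (they are not from the auto data). It compares a share-weighted mean of 2005 defect rates using 2000 shares against total 2005 defects divided by total 2000 sales:

```python
# Illustrative figures only: sales and defects for two hypothetical models.
sales = {2000: {"A": 100, "B": 300}, 2005: {"A": 300, "B": 200}}
defects = {2000: {"A": 10, "B": 15}, 2005: {"A": 45, "B": 10}}
models = ["A", "B"]


def shares(year):
    """Each model's share of total sales in the given year."""
    total = sum(sales[year].values())
    return {m: sales[year][m] / total for m in models}


def rates(year):
    """Each model's defects per unit sold in the given year."""
    return {m: defects[year][m] / sales[year][m] for m in models}


def weighted_rate(share_year, rate_year):
    """Share-weighted mean of defect rates (shares and rates may come from different years)."""
    s, r = shares(share_year), rates(rate_year)
    return sum(s[m] * r[m] for m in models)


# Actual 2005 rate: both routes agree when shares and rates come from the same year.
actual_weighted = weighted_rate(2005, 2005)
actual_ratio = sum(defects[2005].values()) / sum(sales[2005].values())

# Counterfactual, route 1: 2005 defect rates weighted by 2000 sales shares.
cf_weighted = weighted_rate(2000, 2005)

# Counterfactual, route 2: total 2005 defects over total 2000 sales.
cf_ratio = sum(defects[2005].values()) / sum(sales[2000].values())

print("actual 2005:", actual_weighted, actual_ratio)
print("counterfactual, weighted mean:", cf_weighted)
print("counterfactual, ratio of totals:", cf_ratio)
```

              In this sketch the two routes agree for the actual rate but not for the counterfactual. Route 1 rescales each model's 2005 defects by that model's own sales ratio (2000 sales over 2005 sales) before summing, while route 2 divides the unscaled 2005 defect total by 2000 sales. The two would coincide if every model sold the same number of cars in both years.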
