  • Comparing coefficients across subsamples in the presence of interactive effects

    I am interested in estimating a relation between two variables across multiple subsamples but I am hung up by the fact that I also want to account for interactive effects within each subsample.

    Here is an example of what I am talking about (the actual variables are different in my study):

    Let's say that I want to examine the effect that fertilizer has on the height of a plant. I could estimate the simple regression:

    (1) Height = a + b1*fertilizer + e

    Where height is the height of the plant and fertilizer is the amount of fertilizer. b1 would then be the effect of fertilizer on height.

    However, I actually have many different types of plants (say 30) and I expect this relation to vary across my plants. What I can do is estimate this regression separately for each type of plant and come up with an estimate of b1 for each plant subsample.

    HOWEVER, I also know that whether the plant is positioned in the sun will change the effect of fertilizer (Sun is a 0/1 variable). That is, Sun has an interactive effect on fertilizer. However, I'm not interested in the effect of sunlight and I would like to remove this variation from the data so that I can focus on comparing the effect of fertilizer between plant types. If I estimate

    (2) Height = a + b1*fertilizer + b2*Sun + b3*fertilizer*Sun + e

    separately for each of my 30 plant types, then comparing b1 across subsamples no longer tells me how the effect of fertilizer varies across plants. Instead, it tests whether the effect of fertilizer varies across plant types ONLY for plants which are not planted in the sun. Similarly, b3 is the difference in the fertilizer effect between sun and no sun, so the effect for plants which are planted in the sun is b1 + b3, and comparing that sum across subsamples speaks ONLY to plants in the sun.
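
    To make that reading concrete: in model (2) the slope on fertilizer is b1 when Sun = 0 and b1 + b3 when Sun = 1. A minimal Stata sketch, on simulated data with hypothetical variable names (height, fertilizer, sun, planttype) and made-up parameter values, fits (2) for one plant type and recovers both slopes:

    ```stata
    * Simulated data: names and parameter values are assumptions, not from the study
    clear
    set seed 12345
    set obs 3000
    * 3 plant types here instead of 30, to keep the output short
    generate planttype = 1 + mod(_n, 3)
    * the share of plants in the sun differs by plant type
    generate byte sun = runiform() < 0.2*planttype
    generate fertilizer = rnormal(5, 1) + sun
    generate height = 10 + planttype*fertilizer + 2*sun + 1.5*sun*fertilizer + rnormal()

    * Model (2) for plant type 1 only
    regress height c.fertilizer##i.sun if planttype == 1

    * b1: fertilizer slope when sun == 0
    lincom fertilizer
    * b1 + b3: fertilizer slope when sun == 1
    lincom fertilizer + 1.sun#c.fertilizer
    ```

    Repeating this for each plant type gives one pair of slopes per subsample, which is exactly why neither coefficient alone is an "on average" effect.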

    I feel I am in a bit of a pickle because I would like to compare the effect of fertilizer across plant types on average, removing the variation caused by sunlight. It would be inappropriate to simply estimate model (1) and compare b1, because if some of my plant subsamples have a higher proportion of plants planted in the sun, then differences in b1 would be driven by the omitted variable Sun and not just by differences in the effect of fertilizer across subsamples. In the actual setting that I am looking at, I have several of these interactive variables and many of them are continuous so I can't just do simple statistics where I compare fertilizer with and without sun separately.

    I'm not sure if it's possible to actually do what I want to. I have tried searching for this (very specific) scenario online and have not found any solutions. I think the following may work but I don't know if it is appropriate:

    First estimate: (3) Height = a + b1*Sun + e
    take e from this regression and call it e_Height

    then estimate: (4) Fertilizer = a + b1*Sun + e
    take e from this regression and call it e_fertilizer

    Now estimate: (5) e_Height = a + b1*e_fertilizer + e, separately for each plant subsample.

    Would comparing b1 from this last equation (5) be appropriate? Would it allow me to compare the average effectiveness of fertilizer across different types of plants without being confounded by Sun? I am concerned that this may not solve the issue because Sun has an interactive effect and is therefore not just a simple additional control variable. Additionally, I don't know if I should estimate (3) and (4) within each plant subsample or for the overall sample (I guess it depends on whether I think the effect of Sun varies across subsamples as well).
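
    For concreteness, steps (3) to (5) could be sketched in Stata roughly as follows. The data are simulated and the variable names (height, fertilizer, sun, planttype) are hypothetical; (3) and (4) are run on the pooled sample here, which is only one of the two options mentioned above. Note that residualizing on Sun this way removes only its additive part, not the fertilizer*Sun interaction.

    ```stata
    * Simulated data: names and parameter values are assumptions, not from the study
    clear
    set seed 12345
    set obs 3000
    generate planttype = 1 + mod(_n, 3)
    generate byte sun = runiform() < 0.2*planttype
    generate fertilizer = rnormal(5, 1) + sun
    generate height = 10 + planttype*fertilizer + 2*sun + 1.5*sun*fertilizer + rnormal()

    * (3): partial Sun out of height, pooled over all plant types
    regress height sun
    predict double e_height, residuals

    * (4): partial Sun out of fertilizer, pooled over all plant types
    regress fertilizer sun
    predict double e_fertilizer, residuals

    * (5): residual-on-residual regression, separately for each plant type
    bysort planttype: regress e_height e_fertilizer
    ```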

    Any thoughts and suggestions would be very helpful!

  • #2
    After thinking about it some more, I am certain that the proposed solution I gave above using equations (3), (4), and (5) is incorrect because it still doesn't pull out the interactive effect.

    I think the only way to really get at this would be to estimate the following:
    (6) Height = a + b1*fertilizer + b2*Sun + b3*fertilizer*Sun + b4*fertilizer*sample1 + b5*fertilizer*sample2 + ... + bN*fertilizer*sampleN + e

    Where sampleN is an indicator variable equal to 1 if the observation is from subsample N. You would need to interact fertilizer with dummies for all of the plant subsamples except one (the omitted group). I think these coefficients would allow you to estimate how the effect of fertilizer differs across the samples while controlling for the interaction with Sun. It's still a little weird to think about the interpretation, because b1 only tells us the effect of fertilizer for observations in the omitted group where Sun=0, but I think my interpretation of the subsample interactions should be appropriate. Or at least I hope so.
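
    A rough Stata sketch of (6) with factor-variable notation, on simulated data with hypothetical variable names (height, fertilizer, sun, planttype). One assumption goes beyond (6) as written: plant-type main effects (i.planttype) are included so that each type gets its own intercept.

    ```stata
    * Simulated data: names and parameter values are assumptions, not from the study
    clear
    set seed 12345
    set obs 3000
    generate planttype = 1 + mod(_n, 3)
    generate byte sun = runiform() < 0.2*planttype
    generate fertilizer = rnormal(5, 1) + sun
    generate height = 10 + planttype*fertilizer + 2*sun + 1.5*sun*fertilizer + rnormal()

    * fertilizer, Sun, fertilizer*Sun, plus fertilizer interacted with plant-type dummies
    * (the base plant type is the omitted group for the interaction terms)
    regress height c.fertilizer##i.sun i.planttype i.planttype#c.fertilizer

    * joint test that the fertilizer slope is the same for every plant type
    testparm i.planttype#c.fertilizer
    ```

    In this form the fertilizer*Sun coefficient is forced to be the same for all plant types, so each planttype#c.fertilizer coefficient is a slope difference from the omitted type that applies equally in sun and shade.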

    Does this sound correct?

    • #3
      You didn't get a quick answer. You'll increase your chances of a helpful answer by following the FAQ on asking questions - provide Stata code in code delimiters, readable Stata output, and sample data using dataex.

      Folks generally don't respond as much to long complex questions as they do to more focused questions.

      I could easily have misinterpreted your questions, but rather than a bunch of separate regressions, you can probably use factor-variable notation to obtain separate estimates for each subsample. In doing so, it may be good to use robust standard errors. This has the advantage of letting you easily compare parameters. That is:

      regress y i.subsample##(c.sun##c.fertilizer)

      would have your interaction and also give separate estimates for each subsample. Whether you want # or ## is a matter of taste - some folks like to specifically enter the main effects rather than letting Stata do it.

      regress y i.subsample i.subsample#(c.sun c.fertilizer c.sun#c.fertilizer)

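      This should fit the same model as the regress statement with ##, just parameterized differently: each subsample gets its own coefficients on sun, fertilizer, and their interaction, rather than differences from the base subsample.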
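
      If what you are after is a fertilizer effect per plant type averaged over sun, one possible follow-up (my suggestion, not something tested on your data) is margins after the fully interacted model. The sketch below uses simulated data and hypothetical variable names (height, fertilizer, sun, planttype), and enters sun as i.sun because it is 0/1 in the example:

      ```stata
      * Simulated data: names and parameter values are assumptions, not from the study
      clear
      set seed 12345
      set obs 3000
      generate planttype = 1 + mod(_n, 3)
      generate byte sun = runiform() < 0.2*planttype
      generate fertilizer = rnormal(5, 1) + sun
      generate height = 10 + planttype*fertilizer + 2*sun + 1.5*sun*fertilizer + rnormal()

      * fully interacted model with robust standard errors
      regress height i.planttype##(i.sun##c.fertilizer), vce(robust)

      * average marginal effect of fertilizer, with planttype set to each level in turn
      margins planttype, dydx(fertilizer)
      ```

      Here margins averages the fertilizer slope over the sun values observed in the whole estimation sample while setting planttype to each level, so every plant type is evaluated at the same mix of sun and shade.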
