  • IV-FD with an endogenous binary regressor

    Hi Statalisters,

    I would like to estimate the following model for individual i in period t:

    Yit = B0 + B1X1it + B2X1it*X2it + B3X3it + B4X4it + Fi + Uit

    where Y is a continuous variable; X1 is an endogenous binary variable (correlated with both the time-variant and time-invariant components of the error term); X2, X3 and X4 are “exogenous” variables (correlated with Fi but not with Uit), where X2 and X3 are continuous and X4 is binary; and Fi are individual fixed effects. I am interested in the causal effect of X1 and X1*X2 on Y, and intend to use Z (a vector of 32 variables) as instruments for X1 and X1*X2. Specifically, I will use 2SLS with first differences to account for endogeneity arising from time-variant and time-invariant heterogeneity. To do this, I will use a panel dataset of 1,777 individuals (i=1...1777) observed over 2 time periods (t=1,2). However, I'd like to flag some key limitations of my data:
    1. Z only varies across individuals and not over time (I only have values for t=1). Still, it seems to be a relevant set of instruments when this model is estimated (ignoring Fi) in cross-sections of the data (either for t=1 or t=2).
    2. X2 is missing whenever X1=0, which is why I did not include it in levels as an additional covariate. Instead, I included X3, which captures a similar characteristic and for which I have values for the whole sample. To keep the interaction from being dropped for observations with missing values, I set X2 to 0 wherever X1=0. Any suggestion about how to better deal with this issue is more than welcome.
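For reference, a short worked step showing what first-differencing does to the model above when there are only two periods. This is a sketch using the post's own symbols (with \beta standing in for B); the intercept B0 and the fixed effect Fi cancel, and the differenced binary regressor takes three values rather than two.

```latex
\begin{aligned}
\Delta Y_i &= \beta_1\,\Delta X_{1i} + \beta_2\,\Delta(X_{1i}X_{2i})
            + \beta_3\,\Delta X_{3i} + \beta_4\,\Delta X_{4i} + \Delta U_i, \\
\Delta X_{1i} &= X_{1i2} - X_{1i1} \in \{-1,\,0,\,1\},
\qquad \Delta F_i = F_i - F_i = 0 .
\end{aligned}
```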
    My main question is: I'd like to follow the procedure suggested by Wooldridge (2002), "Econometric Analysis of Cross Section and Panel Data", to deal with a binary endogenous regressor, since it has the additional benefit of "squeezing" the variation of a large number of instruments into a single one (the fitted value). However, I'm not sure how to go about it in this setting (after taking first differences, X1 is no longer binary). Here is my attempt, using Stata 14:

    Code:
    /* The original dataset has a wide structure, where each observation corresponds to individual i and variables with the suffix _? correspond to period t.*/
    reshape long Y_@ X1_@ X2_@ X3_@ X4_@, i(id) j(time)
    ren *_ *
    xtset id time
    gen X1X2 = X1*X2
    probit D.(X1 X3 X4) Z*, vce(cluster id)
    // Am I ditching information since -1 and 1 are considered the same by -probit-? I cannot use either -xtprobit- (without the difference operator) because of the incidental parameters problem or -xtlogit, fe- because Z doesn't vary over time.
    // Also, given that X4 is a dummy, should I create dummies for the different combinations instead of just using the difference operator?
    predict phat1
    // This creates predicted probabilities for observations both in t=1 and t=2. Thus, now I have time-variant instruments. Is this a valid procedure?
    gen phat2 = phat1*X2
    xtivreg2 Y X3 X4 (X1 X1X2 = phat1 phat2), fd first r
    Many thanks for your help.

    Best wishes,

    Maria

  • #2
    You didn't get a quick answer. You'll increase your chances of a useful answer by following the FAQ on asking questions - provide Stata code in code delimiters, readable Stata output, and sample data using dataex. You will also increase your chances of a response by offering a shorter, more focused posting.

    Generally, 2SLS is consistent with a binary endogenous variable. When you start to do this with your ad hoc approach, it is likely to create problems. If you have 3 real outcomes for the variable, then you're probably making a mistake to treat them as two. Your predicted value won't even match the range of the original variable. If FD is problematic, then why not use FE?
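To make the point about plain 2SLS concrete, here is a minimal sketch of the first-differenced equation estimated directly, without the probit step. It assumes the long-format variables from the original post (id, time, Y, X1, X2, X3, X4, X1X2 and instruments matching Z*). It also assumes, and this is a substantive assumption rather than something the data guarantee, that the time-invariant Z are uncorrelated with the differenced error and correlated with the change in X1.

```stata
* Sketch only: assumes the long-format data from the original post
* (id, time, Y, X1, X2, X3, X4, X1X2, and instruments matching Z*).
xtset id time

* With T = 2, D. is nonmissing only at t = 2, so each individual
* contributes exactly one differenced observation.
* The time-invariant Z* enter in levels as instruments for D.X1 and D.X1X2
* (assumption: Z is uncorrelated with the differenced error).
ivregress 2sls D.Y D.X3 D.X4 (D.X1 D.X1X2 = Z*), vce(robust)

* First-stage strength and overidentification diagnostics
estat firststage
estat overid
```

The constant that -ivregress- includes here acts as a common period effect in the differenced equation. Because Z enters in levels, it is not differenced away, which is what would happen to time-invariant instruments under -xtivreg2, fd-.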
