Oaxaca_rif & standard vs. reweighted model

Quent Dave

Join Date: Mar 2016

Posts: 15
#1

22 Jul 2020, 15:26

Dear all,

I am currently working on a project that uses the Oaxaca-Blinder (OB) decomposition to decompose the test score gap between private and public schools.
In particular, I want to apply an OB-RIF decomposition to see how the parts of the gap explained and unexplained by educational inputs (e.g. teachers' characteristics) evolve across the score distribution.

I have read Jann (2008) and Rios-Avila (2020), which document the two OB commands ("oaxaca" and "oaxaca_rif" respectively), and tried to understand them as well as I could.

- My first question concerns the standard (unweighted) RIF decomposition. The "oaxaca" cmd (Jann 2008) offers a "pooled" option for the twofold decomposition, but "oaxaca_rif" has no such option: one has to use either w(1) or w(0), which amounts to assuming that discrimination is directed toward only one group. In my setting group A = Private and group B = Public, so I use w(1), but does that really make sense? Is it expected that "oaxaca_rif" has no "pooled" option, at least for estimation at the mean?

- My second question concerns the trade-off between the standard and the reweighted OB-RIF when I cannot obtain a well-specified reweighting model. Namely, when using "oaxaca_rif" with a "rwlogit" model, some specification errors are significant (not at the mean, but at Q10, Q20 and Q80). As stated by Rios-Avila (2019), this suggests a misspecification of the latent model. However, I cannot improve this model any further. What would you do? Keep the reweighted model anyway, since it is important to estimate the counterfactual distribution at least approximately, or go for an unweighted model?

For information, I attach a graph that shows how the explained and unexplained parts vary across the distribution of the score gap. The results are quite sensitive to the model used, and I am a bit lost. I prefer the reweighted model because it estimates a counterfactual distribution (even if that estimate is flawed, at least for some quantiles), but I still wonder whether that is the right decision, or whether there is some rule of thumb for choosing between the reweighted and the standard model.

Thank you for any answer you could provide on these questions,

Best,
FernandoRios

Join Date: Apr 2014

Posts: 2471
#2

22 Jul 2020, 16:19

Hi Quent,
So, about the "pool" or "omega" options for oaxaca_rif. Leaving them out was my programming decision, based on what I know about OB decompositions and on the discussion of decompositions in Firpo, Fortin and Lemieux (2018), "Decomposing Wage Distributions Using Recentered Influence Function Regressions," Econometrics 6(2): 28.

The other reason was more about theoretical justification. The omega and pooled options in oaxaca assume that you "pool" all the data and estimate the corresponding equation. But for RIF-Oaxaca, there is no clear answer to what to use as the "pooled" dependent variable. It could be the RIF constructed within each subsample defined by by(variable), or the RIF calculated for the pooled sample.
The first option is easy to implement, but the second is not so straightforward, and it would require going deeper into what oaxaca does internally.
The problem gets a bit more complicated when you try to do this with the reweighted option.

For the first case you could do something like this:
oaxaca y x1 x2 x3, by(z) w(0)
egen rif_y=rifvar(y) if e(sample), q(10) by(z)
oaxaca rif_y x1 x2 x3, by(z) w(0)

This should reproduce the same thing as
oaxaca_rif y x1 x2 x3, by(z) w(0) rif(q(10))

but it allows you to use the omega or pooled options.
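As a rough illustration of that last point, the same three steps with the pooled reference coefficients might look like the sketch below. It assumes the user-written oaxaca and rif packages are installed and reuses the placeholder names y, x1-x3 and z from above; "pooled" is the oaxaca option mentioned in #1. It is not meant as a definitive recipe.

```stata
* Sketch only: y, x1-x3 and z are the placeholder names used in this post.
* Step 1: run oaxaca once so that e(sample) marks the estimation sample.
oaxaca y x1 x2 x3, by(z) pooled
* Step 2: group-specific RIF of y for the 10th percentile.
capture drop rif_y
egen rif_y = rifvar(y) if e(sample), q(10) by(z)
* Step 3: twofold decomposition of the RIF with pooled reference coefficients.
oaxaca rif_y x1 x2 x3, by(z) pooled
```

The RIF here is still built within each group (the "first option" above); only the reference coefficients come from the pooled regression.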

For your second point: the specification error is almost unavoidable, but the decomposition at least gives you a measure of it. My own take is this: if your reweighting error is small and nonsignificant, you are fine. If the specification error is significant, mention it and show that you tried to correct it by improving the model, but note that it could also be related to the "linearity" assumption in the RIF.
You can always show both as a robustness check.

Now, on deciding whether to use the reweighted or the standard decomposition: perhaps it is better to show how robust your results are, and how much they vary under different assumptions.
I also recommend reading the Fortin, Lemieux and Firpo (2011) chapter "Decomposition Methods in Economics".
Best Regards
Fernando
Quent Dave

Join Date: Mar 2016

Posts: 15
#3

23 Jul 2020, 11:26

Hi Fernando,

Thank you for your answer.

- The first point is quite clear and makes sense. I think I will run both models: the twofold pooled model using the oaxaca cmd, and the standard OB-RIF decomposition at the mean using w(1) in oaxaca_rif.

- For the second point, it is good to know about the specification error, and also about the reweighting error. Running different scenarios, I see that those errors are quite high and sometimes significant (see table). This would suggest using a standard decomposition model (as I cannot estimate a better IPW model) and mentioning it in an annex, I guess. The results are sensitive to the choice between the standard and the reweighted decomposition, but I guess that is normal since the reweighting errors are quite large.

Thanks for the recommendation and for your answer, this is really nice.
Best

Last edited by Quent Dave; 23 Jul 2020, 11:28.
FernandoRios

Join Date: Apr 2014

Posts: 2471
#4

23 Jul 2020, 11:32

Hi Quent,
Having such large reweighting errors seems odd. You really want them to be small and nonsignificant.
So perhaps a better question on this point is: have you checked how similar or different the distributions of characteristics are in your data? If the differences are too large, IPW will do little to "fix" the problem.

Bottom line: check the balance of the characteristics. If the distributions are still too different from each other, even after reweighting, you do need to work on the model specification.

For balance tests, you can use pstest or the teffects postestimation command tebalance.
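A minimal sketch of such a balance check with official Stata commands is given below. The names y (outcome), z (0/1 group indicator) and x1-x3 (covariates) are placeholders, and the logit specification is only an assumption for illustration.

```stata
* Sketch only: y, z and x1-x3 are placeholder names.
* Fit an IPW model with a logit propensity score for z.
teffects ipw (y) (z x1 x2 x3, logit)
* Standardized differences and variance ratios, raw versus weighted.
tebalance summarize
* Raw versus weighted densities of one continuous covariate.
tebalance density x1
```

Standardized differences far from 0 or variance ratios far from 1 in the weighted columns would point to covariates that the reweighting does not balance.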

Fernando
Quent Dave

Join Date: Mar 2016

Posts: 15
#5

23 Jul 2020, 13:54

Hi Fernando,

Okay, I think I see; however, I'm getting a bit confused.

The idea of my research, in a few words, is to analyze the comparative advantage of Private (PR) over Public (PB) schools using a dynamic OLS model in which I observe scores at baseline and endline for the same students. Controlling for baseline tests allows me to control for endogenous selection into one type of school. I find a significant positive effect of being enrolled in PR.

Additionally, I want to understand what explains this gap, especially since I observe many educational inputs for both types of schools (i.e. infrastructure and many teacher characteristics). As the two types of schools are different, it is normal that my covariates are not balanced (for example because of different budget allocations or teacher recruitment processes). When using pstest as you advised, I clearly see those imbalances (cf. table). Imbalance is one thing, but the other is that the distributions are sometimes really different across PR/PB. The best example is wages: PB schools pay their teachers far more than PR schools, so that there is (almost) no overlapping support on this variable. I therefore construct dummies that indicate where the teacher is in the earnings distribution within school type, in order to restore the overlap assumption and control for earnings.

According to the table, some variables are really unbalanced (e.g. age or the proportion of teachers with at least Grade 14 education). As those variables are really unbalanced and their distributions really different (as seen by V(T)/V(C) for continuous variables), I should actually reconstruct them so that they are more balanced, for example by creating age dummies such that there is full overlap and not too much imbalance. The others, which cannot be changed (e.g. gender), are not too unbalanced, so it is okay to use them as they are.
=> is that correct?

Considering these imbalances, it means that my basic Oaxaca model (not even talking about the RIF decomposition) is not good, because the differences between the two groups are too large.

Finally, after doing so, I should run the new IPW model, hoping to find small and nonsignificant reweighting errors. Otherwise, something is still wrong, and it will be time to quit economics.

Thanks for your support Fernando, this is really nice and helpful
FernandoRios

Join Date: Apr 2014

Posts: 2471
#6

23 Jul 2020, 14:28

Do not quit economics!
IPW will help with balance; however, it can only do so much. So my best advice is to start by playing with the propensity score.
Basically, proceed as you would for propensity score matching, trying to figure out the best specification for the propensity score.
Check the distributions of all variables, before and after matching.
If the distributions are so different that they do not overlap, I would be inclined to take that variable out of the probit/logit model.
Otherwise you run the risk of adding a variable that explains so much of the "treatment" that your IPW weights become extremely sensitive (for instance, 1/p -> infinity as p -> 0).
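To see whether the weights are getting out of hand, one could inspect the estimated propensity score and the implied weights directly, along the lines of the sketch below. The names z, x1-x3, ps and w01 are placeholders, and the odds weight ps/(1 - ps) for group 0 is one common choice for reweighting group 0 to resemble group 1; it is an assumption here, not necessarily what oaxaca_rif computes internally.

```stata
* Sketch only: z is the 0/1 group indicator, x1-x3 are placeholder covariates.
logit z x1 x2 x3
predict double ps if e(sample), pr
* Overlap check: compare the propensity score distribution in the two groups.
summarize ps if z == 1, detail
summarize ps if z == 0, detail
* Odds weight for group 0; it explodes as ps approaches 1.
generate double w01 = ps/(1 - ps) if z == 0
summarize w01, detail
```

A handful of very large values of w01 relative to the rest would be a sign that a few observations dominate the counterfactual.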

Once you have a reasonable model, you can proceed with the OB reweighted decomposition.

Hope this helps.
Fernando
Quent Dave

Join Date: Mar 2016

Posts: 15
#7

24 Jul 2020, 12:15

Dear Fernando,

Thank you for your answers. I will keep working on it and try to find a good model, reworking the variables to ensure better balance. With time and reading I hope I will get there.

I have one last question, which I think actually explains my whole confusion about OB: I have difficulty understanding the meaning of the basic counterfactual, and hence when to choose weights w(0) or w(1) for the standard decomposition.

In your paper (2019), you write that vc can be written X1.B0 such that dv = X1.(B1 - B0) + (X1 - X0).B0 (in that case we use w(0)), but we can also use the counterfactual X0.B1 (in that case w(1)) such that dv = X0.(B1 - B0) + (X1 - X0).B1.
=> I have difficulty making literal sense of those two equations and deciding which weight I should prefer in my case, where 1 = Private and 0 = Public.
If I choose w(0) => vc = X1.B0, does that mean asking what would happen to public school students if they went to private school but kept their public coefficients?
To be honest I'm getting so lost that it has become blurry to me how to justify the use of w(1) or w(0).

I also tried to apply your code for a pooled rif-decomposition
oaxaca y x1 x2 x3, by(z) w(0)
egen rif_y=rifvar(y) if e(sample), q(10) by(z)
oaxaca rif_y x1 x2 x3, by(z) w(0)

But as I am using multiply imputed data, it seems that the egen rif_y step doesn't work. I'm going to try to find out how to deal with that.

Thanks Fernando for your time
FernandoRios

Join Date: Apr 2014

Posts: 2471
#8

26 Jul 2020, 08:26

Hi Quent,
So if 1 is private and 0 is public, the counterfactual x1*b0 means: what would happen if students from private schools (x1) faced the "coefficients" of public schools (b0).
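For reference, the two decompositions from #7 can be written out as below, with group 1 = private, group 0 = public, X-bar the mean characteristics and beta-hat the estimated coefficients. The notation is only a restatement of the equations quoted in #7; which of w(0) and w(1) corresponds to each line follows the question in #7 and should be checked against the help file.

```latex
\begin{align*}
\Delta v &= \bar{X}_1\hat{\beta}_1 - \bar{X}_0\hat{\beta}_0 \\
% counterfactual v_c = \bar{X}_1\hat{\beta}_0:
% private school characteristics valued at public school coefficients
\Delta v &= \underbrace{\bar{X}_1(\hat{\beta}_1 - \hat{\beta}_0)}_{\text{unexplained}}
          + \underbrace{(\bar{X}_1 - \bar{X}_0)\hat{\beta}_0}_{\text{explained}} \\
% counterfactual v_c = \bar{X}_0\hat{\beta}_1:
% public school characteristics valued at private school coefficients
\Delta v &= \underbrace{\bar{X}_0(\hat{\beta}_1 - \hat{\beta}_0)}_{\text{unexplained}}
          + \underbrace{(\bar{X}_1 - \bar{X}_0)\hat{\beta}_1}_{\text{explained}}
\end{align*}
```

Both lines add up to the same total gap; they differ only in which group's coefficients price the difference in characteristics and which group's characteristics weight the difference in coefficients.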
Also, I didn't write oaxaca_rif to work with multiply imputed data, so I think you need to create the "RIFs" manually, one per imputed sample.
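One possible way to build a RIF per imputed sample is sketched below. It assumes the data are already mi set, that the user-written rifvar() egen function behaves under mi passive as it does on ordinary data (not verified here), and it reuses the placeholder names y and z. Running the decomposition in each imputation and combining the results is left out.

```stata
* Sketch only: assumes the data are mi set; y and z are placeholder names.
* mi passive runs the egen call separately in m = 0, 1, ..., M,
* so each imputed dataset gets its own group-specific RIF for the 10th percentile.
mi passive: egen double rif_y = rifvar(y), q(10) by(z)
```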
Fernando