Want to Know How to Model for Uncertainty of Latent Variables in Country-Level Panel Data

Joseph L. Staats

Join Date: Aug 2015

Posts: 28
#1

09 May 2019, 07:25

Hello,

I have a research project involving country-level panel data in which I want to account for uncertainty in both the outcome variable and the independent variable of interest, each of which is a latent variable. These latent variables (constructed by other researchers) are composite values derived from multiple other variables, not all of which were available for every country and/or year. Each country-year value of a latent variable is accompanied, in an adjacent column, by the standard deviation of its posterior distribution.

I wish to follow the technique for modeling for uncertainty of the latent variables suggested by Charles D. Crabtree and Christopher J. Fariss in a 2015 article in Research and Politics (July-September, pp.1-9) titled “Uncovering Patterns Among Latent Variables: Human Rights and De Facto Judicial Independence.” As described by the authors, they: (1) duplicate their dataset 1,000 times; (2) assign a random draw from the posterior distribution of each latent variable to each country-year observation; (3) use each value thus obtained as new values for each country-year value of the latent outcome variable and independent variable of interest and estimate a set of 1,000 regression models; and then (4) combine the results across the multiple sets of data to create one set of coefficient and standard error estimates.

Crabtree and Fariss did their work using R, not Stata. For various reasons, I want to do my work in Stata, but have been unable to come up with a method to do so, despite my own efforts and consultation with experienced users of Stata (including one of the authors of the above-referenced article). To start things, I want to do this with xtreg with fixed effects, but also plan to use xtabond2 for additional models.

Below is a short sample of my data.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int(i_code year) float(dep_var dep_var_post_sd ind_var ind_var_post_sd control_var1 control_var2)
1 2005  1.33 .27 .9403 .0377  .03395293 10.098463
1 2006  1.09 .27 .9403 .0369  .02516864 10.103597
1 2007  1.25 .26  .943 .0366  .01446522  10.09891
1 2008  1.19 .25 .9433 .0371 -.02323937 10.057076
1 2009   .78 .19 .9417 .0386 -.04175259  9.996817
2 2005 -1.69 .22 .1994 .0522  .01805011  6.534809
2 2006 -1.47 .18 .2093 .0571  .02249222  6.541407
2 2007  -1.4 .17 .2155 .0612  .03343279  6.558741
2 2008 -1.32 .17 .2183 .0636  .00843944   6.55176
2 2009 -1.35 .16 .2199 .0673  .03083248   6.56701
3 2005  -.68 .16 .4752 .0442   .0926275  8.366809
3 2006  -.58 .15 .4738 .0426  .10671155  8.453825
3 2007  -.62 .15 .4716 .0432  .08710414   8.52325
3 2008  -.67 .15 .4692 .0441  .03209504  8.541031
3 2009  -.76 .14 .4736 .0468  .00946155  8.536921
4 2005   .68 .32 .9646 .0285  .06083228  9.514385
4 2006     1 .34 .9617 .0318    .133764  9.629064
4 2007   .98 .33 .9589 .0339  .09498945  9.708728
4 2008   .97 .32 .9552 .0365  .00071111   9.69821
4 2009     1 .34 .9507 .0403 -.12036015  9.558898
end

Phil Bromiley

Join Date: Apr 2014

Posts: 4348
#2

10 May 2019, 11:02

You didn't get a quick answer. You'll increase your chances of a useful answer by following the FAQ on asking questions - provide Stata code in code delimiters, readable Stata output and sample data using dataex. We don't even know exactly what you ran. We also are not likely to read a full paper to try to help you - explain it more clearly.

When you talk about latent variables, I immediately think you're going to use SEM or GSEM. You can bootstrap these easily. You can probably do the 1000 iterations with a loop rather than creating an immense data set.
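
The loop idea could be sketched roughly as follows. This is a minimal sketch, assuming the variables from the -dataex- sample in #1 are already in memory; the seed value, the names draw_y and draw_x, and the layout of the posted results are illustrative choices only. Each pass redraws both latent variables, fits the fixed-effects model, and stores the coefficient and standard error, so the data never need to be expanded.

```stata
* Sketch only: loop over draws instead of expanding the dataset.
* Assumes the -dataex- sample from #1 is in memory.
set seed 20190510                       // arbitrary seed (assumption)
xtset i_code year

tempname results
tempfile draws
postfile `results' int draw double b double se using `draws'

forvalues d = 1/1000 {
    quietly {
        * One draw per country-year from N(point estimate, posterior SD)
        generate double draw_y = rnormal(dep_var, dep_var_post_sd)
        generate double draw_x = rnormal(ind_var, ind_var_post_sd)
        xtreg draw_y draw_x control_var1 control_var2, fe
        post `results' (`d') (_b[draw_x]) (_se[draw_x])
        drop draw_y draw_x              // clear the way for the next draw
    }
}
postclose `results'
use `draws', clear                      // one row per draw
```

Memory use stays at the size of the original panel, at the cost of regenerating the two draw variables on every pass.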
Joseph L. Staats

Join Date: Aug 2015

Posts: 28
#3

10 May 2019, 13:08

Phil,

Thanks for replying.

The dependent variable is the World Bank World Governance Indicators (WGI) for Rule of Law. The WGI Rule of Law variable is considered latent because it is a measure that relies on contributing values from a multitude of other measures, normally gathered annually, that are thought to be related to adherence to the rule of law. Each country-year value for Rule of Law is a point estimate, not an exact value. Each country-year point estimate is accompanied by a separate column with the standard deviation of the posterior distribution of the latent variable. The independent variable of interest is the Latent Judicial Independence (LJI) measure created by Linzer and Staton (2015), which also has a point estimate for each country-year observation of judicial independence and an adjacent column for the standard deviation of the posterior distribution.

The only Stata code I have to show you is what I used when I ran xtreg, fe on the point estimates for the dependent and independent variables, as follows:

Code:

xtset i_code year

Code:

xtreg dep_var ind_var control_var1 control_var2, fe

What I want to do now is run the same regression, but account for the uncertainty in the point estimates in the manner suggested by Crabtree and Fariss (2015), which requires taking 1,000 random draws from each country-year observation of the dependent and independent variables, based on the standard deviation of the posterior distribution of each observation. If this is easier to do in Stata using loops, that's fine with me. What I especially don't understand, however, is how to take random draws based on the posterior standard deviation of each observation.

You mentioned using SEM or GSEM. That is not an option for me. I have a manuscript reviewer who insists that I use xtreg, fe and xtabond2, not SEM. I am confident that once I figure out how to do this using xtreg, I'll be able to duplicate it with xtabond2.

You also suggested using dataex to post a sample of my data. I did that in my initial post.

Joseph Coveney

Join Date: Apr 2014
Posts: 4446
#4

10 May 2019, 18:07

Originally posted by Joseph L. Staats:

how to take random draws based on the posterior standard deviation of each observation

Try this:

Code:

version 15.1

clear *

set seed `=strreverse("1497784")'

input int(i_code year) float(dep_var dep_var_post_sd ind_var ind_var_post_sd control_var1 control_var2)
1 2005  1.33 .27 .9403 .0377  .03395293 10.098463
1 2006  1.09 .27 .9403 .0369  .02516864 10.103597
1 2007  1.25 .26  .943 .0366  .01446522  10.09891
1 2008  1.19 .25 .9433 .0371 -.02323937 10.057076
1 2009   .78 .19 .9417 .0386 -.04175259  9.996817
2 2005 -1.69 .22 .1994 .0522  .01805011  6.534809
2 2006 -1.47 .18 .2093 .0571  .02249222  6.541407
2 2007  -1.4 .17 .2155 .0612  .03343279  6.558741
2 2008 -1.32 .17 .2183 .0636  .00843944   6.55176
2 2009 -1.35 .16 .2199 .0673  .03083248   6.56701
3 2005  -.68 .16 .4752 .0442   .0926275  8.366809
3 2006  -.58 .15 .4738 .0426  .10671155  8.453825
3 2007  -.62 .15 .4716 .0432  .08710414   8.52325
3 2008  -.67 .15 .4692 .0441  .03209504  8.541031
3 2009  -.76 .14 .4736 .0468  .00946155  8.536921
4 2005   .68 .32 .9646 .0285  .06083228  9.514385
4 2006     1 .34 .9617 .0318    .133764  9.629064
4 2007   .98 .33 .9589 .0339  .09498945  9.708728
4 2008   .97 .32 .9552 .0365  .00071111   9.69821
4 2009     1 .34 .9507 .0403 -.12036015  9.558898
end

*
* Begin here
*
quietly expand 1000
bysort i_code year: generate int dataset = _n

// Your question is answered here:
generate double new_dep_var = rnormal(dep_var, dep_var_post_sd)
generate double new_ind_var = rnormal(ind_var, ind_var_post_sd)

// And now the thousand regressions, one for each of the thousand datasets:
tempname file_handle
tempfile thousand_regressions
postfile `file_handle' int dataset double intercept double slope using `thousand_regressions'

xtset i_code
forvalues dataset = 1/1000 {
    quietly xtreg new_dep_var new_ind_var control_var1 control_var2 if dataset == `dataset', fe
    post `file_handle' (`dataset') (_b[_cons]) (_b[new_ind_var])
}
postclose `file_handle'
use `thousand_regressions', clear

/* Combine them however */

exit

Don't forget to set the pseudorandom number generator seed at the top of your do-file.


Joseph Coveney

Join Date: Apr 2014

Posts: 4446
#5

10 May 2019, 18:11

I forgot to mention this, but you can use drawnorm if the authors recommend maintaining some kind of correlation between the dep_var and ind_var in the thousand datasets.
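
A minimal sketch of what correlated draws might look like, assuming one correlation applies to every country-year. The value 0.3 is a placeholder, not something taken from Crabtree and Fariss or from the data. Because drawnorm takes a single set of means and standard deviations for the whole dataset, the sketch draws correlated standard normals with it and then rescales them by each observation's point estimate and posterior standard deviation. It is meant to stand in for the two rnormal() lines, after the expand step.

```stata
* Sketch only: correlated draws for the two latent variables.
* The 0.3 correlation is a placeholder assumption.
matrix C = (1, 0.3 \ 0.3, 1)

* Correlated standard normals, one pair per observation in memory
drawnorm z_dep z_ind, corr(C) double

* Shift and scale by each country-year's point estimate and posterior SD
generate double new_dep_var = dep_var + dep_var_post_sd * z_dep
generate double new_ind_var = ind_var + ind_var_post_sd * z_ind
drop z_dep z_ind
```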
Joseph L. Staats

Join Date: Aug 2015

Posts: 28
#6

10 May 2019, 19:08

Joseph,

Thanks so much for your expert help on this. I just ran the program and it gave me intercepts and slopes for the independent variable for each of the 1,000 regressions. One other thing I need are the standard errors for each of the 1,000 coefficients. Can you show me how to do that? And I need coefficients and standard errors for the two control variables, as well. The control variables don't require random draws themselves, but their coefficients and standard errors will vary for each regression because the independent variable changes for each regression.

Once I have coefficients and standard errors for the 1,000 regressions, I know how to combine them using Rubin's rules.

Thanks again.
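
For reference, Rubin's rules as they are usually stated, where m = 1,000 is the number of datasets and \hat{\beta}_j and SE_j are the coefficient and standard error from regression j (the symbols are notation chosen here, not taken from the article):

```latex
\bar{\beta} = \frac{1}{m} \sum_{j=1}^{m} \hat{\beta}_j
\qquad
\bar{U} = \frac{1}{m} \sum_{j=1}^{m} SE_j^{2}
\qquad
B = \frac{1}{m-1} \sum_{j=1}^{m} \left( \hat{\beta}_j - \bar{\beta} \right)^{2}

T = \bar{U} + \left( 1 + \frac{1}{m} \right) B
\qquad
SE_{\text{pooled}} = \sqrt{T}
```

Here \bar{U} is the average within-dataset variance and B is the between-dataset variance of the coefficient.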
Joseph Coveney

Join Date: Apr 2014

Posts: 4446
#7

10 May 2019, 19:16

Originally posted by Joseph L. Staats:

One other thing I need are the standard errors for each of the 1,000 coefficients. Can you show me how to do that?

The standard errors of the regression coefficients are given by _se[. . .] in the same manner that the regression coefficients, themselves, (including for the "control variables") are given by _b[. . .]

. sysuse auto
(1978 Automobile Data)

. regress gear_ratio c.turn

      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(1, 72)        =     60.69
       Model |  6.95148287         1  6.95148287   Prob > F        =    0.0000
    Residual |  8.24696489        72  .114541179   R-squared       =    0.4574
-------------+----------------------------------   Adj R-squared   =    0.4498
       Total |  15.1984478        73  .208197915   Root MSE        =    .33844

------------------------------------------------------------------------------
  gear_ratio |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        turn |  -.0701437   .0090039    -7.79   0.000    -.0880926   -.0521947
       _cons |   5.795966   .3591537    16.14   0.000     5.080006    6.511926
------------------------------------------------------------------------------

. display in smcl as text _b[turn] " ± " _se[turn]
-.07014366 ± .0090039

.

Joseph L. Staats

Join Date: Aug 2015
Posts: 28
#8

10 May 2019, 20:50

Joseph,

After a little uncertainty on how to add the control variables (the "double" used before the intercept and slope of the independent variable had me confused), I believe I have everything accomplished now. The program runs to completion and gives me values for coefficient and standard error for the independent variable and each of the control variables. Could you look at the code just to make sure I didn't do something wrong?

Thanks.

Code:

version 15.1

clear *

set seed `=strreverse("1497784")'

input int(i_code year) float(dep_var dep_var_post_sd ind_var ind_var_post_sd control_var1 control_var2)
1 2005  1.33 .27 .9403 .0377  .03395293 10.098463
1 2006  1.09 .27 .9403 .0369  .02516864 10.103597
1 2007  1.25 .26  .943 .0366  .01446522  10.09891
1 2008  1.19 .25 .9433 .0371 -.02323937 10.057076
1 2009   .78 .19 .9417 .0386 -.04175259  9.996817
2 2005 -1.69 .22 .1994 .0522  .01805011  6.534809
2 2006 -1.47 .18 .2093 .0571  .02249222  6.541407
2 2007  -1.4 .17 .2155 .0612  .03343279  6.558741
2 2008 -1.32 .17 .2183 .0636  .00843944   6.55176
2 2009 -1.35 .16 .2199 .0673  .03083248   6.56701
3 2005  -.68 .16 .4752 .0442   .0926275  8.366809
3 2006  -.58 .15 .4738 .0426  .10671155  8.453825
3 2007  -.62 .15 .4716 .0432  .08710414   8.52325
3 2008  -.67 .15 .4692 .0441  .03209504  8.541031
3 2009  -.76 .14 .4736 .0468  .00946155  8.536921
4 2005   .68 .32 .9646 .0285  .06083228  9.514385
4 2006     1 .34 .9617 .0318    .133764  9.629064
4 2007   .98 .33 .9589 .0339  .09498945  9.708728
4 2008   .97 .32 .9552 .0365  .00071111   9.69821
4 2009     1 .34 .9507 .0403 -.12036015  9.558898
end

*
* Begin here
*
quietly expand 1000
bysort i_code year: generate int dataset = _n

// Your question is answered here:
generate double new_dep_var = rnormal(dep_var, dep_var_post_sd)
generate double new_ind_var = rnormal(ind_var, ind_var_post_sd)

// And now the thousand regressions, one for each of the thousand datasets:
tempname file_handle
tempfile thousand_regressions
postfile `file_handle' int dataset double intercept double ivslope double ivSE cv1slope cv1SE cv2slope  cv2SE  using `thousand_regressions'

xtset i_code
forvalues dataset = 1/1000 {
    quietly xtreg new_dep_var new_ind_var control_var1 control_var2 if dataset == `dataset', fe
    post `file_handle' (`dataset') (_b[_cons]) (_b[new_ind_var]) (_se[new_ind_var]) (_b[control_var1]) (_se[control_var1]) (_b[control_var2]) (_se[control_var2])
     
}
postclose `file_handle'
use `thousand_regressions', clear

/* Combine them however */

exit
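
One way the "Combine them however" step could be filled in is sketched below, using Rubin's rules as commonly stated (total variance = mean within-draw variance + (1 + 1/m) × between-draw variance). It assumes the posted results file is in memory with the variable names from the code above (ivslope, ivSE, cv1slope, and so on) and would need to run before the exit line. The scalar names are illustrative. Note that in the postfile line above only the variables immediately preceded by double are stored as doubles; the rest default to float, which slightly limits precision.

```stata
* Sketch only: pool the 1,000 sets of estimates with Rubin's rules.
* Assumes one row per regression, with variables <prefix>slope and <prefix>SE.
foreach p in iv cv1 cv2 {
    * Pooled coefficient and between-draw variance (summarize divides by m-1)
    quietly summarize `p'slope
    scalar qbar = r(mean)
    scalar B    = r(Var)

    * Mean within-draw variance (squared standard errors)
    tempvar within
    generate double `within' = `p'SE^2
    quietly summarize `within'
    scalar ubar = r(mean)

    * Total variance; _N is the number of regressions, m
    scalar T = scalar(ubar) + (1 + 1/_N) * scalar(B)

    display as text "`p': pooled coefficient = " as result scalar(qbar) ///
        as text ", pooled std. err. = " as result sqrt(scalar(T))
}
```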
