
  • Robustness check

    Hello,
    I wanted to do a robustness check with the user-written -checkrob-. Basically, it worked. However, when the results are transformed into a table as a last step, the program stops with the error
    variable b_corevar1 not found
    r(111);
    For each variable, two new variables named b_varname and se_varname are created.
    Help file of checkrob: http://fmwww.bc.edu/RePEc/bocode/c/checkrob.html
    I tested the following:
    Code:
    checkrob 10 12: reghdfe indepvar corevar1 ... corevar10 testingvar1 ... testingvar12, absorb(...) vce(cluster ...)
    I already tried to create the first variable myself in the b_corevar1 format, as the program does, but it is still not found.

    Is anybody here familiar with checkrob?
    I would be thankful for any advice.

    Kind regards
    Peter

  • #2
    You didn't get a quick answer. Usually, folks answer quickly or not at all.
    With a user-written program and no one volunteering an answer, you can either look at the code yourself or ask the author. I assume you've looked at the variables in the variable window list? It might be there with a small difference in how it is written.

    However, you might also benefit from following the FAQ on asking questions - provide Stata code in code delimiters, readable Stata output, and sample data using dataex. I'm not sure what it means to "rename the first variable to the format b_corevar1". You can rename variables or change formats, but I don't know about renaming to a format.
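
    On the dataex point, a minimal sketch of what that could look like (the variable names are just placeholders taken from your post; dataex ships with Stata 15.1+, older versions can install it from SSC):
    Code:
    * install dataex if it is not already available
    ssc install dataex
    * post an excerpt of the relevant variables inside CODE delimiters
    dataex corevar1 testingvar1 in 1/20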

    • #3
      Little wonder you get an error message.

      As a matter of fact, there is a cornucopia of reasons for that.

      You may wish to read the help files attentively. Although I have no experience with the user-written program, I observed that:

      a) The output will be presented in two files with a .txt extension.

      b1) You are supposed to use -cluster()-, not vce(cluster).

      b2) The coefficients and SEs will be displayed in the first file, nota bene: "for each regression".

      c) The second file will present "the following statistics for each core and testing variable: maximum (Max), minimum (Min) and average value (Mean) of the coefficient over all regressions, average standard deviation (AvgSTD), share of regressions where the coefficient is significant (PercSigni), share of regressions where the sign is positive (Perc+), respective negative (Perc-), the average t-value (AvgT) and the number of times the variable was included in the regressions (Obs) (this might differ between core variables (and between testing variables) if a variable is sometimes dropped. For each variable the average over optional estimation return variables in these regression where the variable featured is also reported".

      d) The help-file examples present the "testingvars" at a maximum of 4, which seemed reasonable to me. According to the formula also presented in the help file, instead of 16 regressions you would reach in your case - guess how much - 4096 regressions (see the quick check after this list). Quite mind-boggling, to say the least.

      e) You should start by typing the "core variables", then proceed with the "testing variables". However, your command surprisingly starts with the independent variable (I cannot envisage the reason), followed by the expected sequence.

      f) Last but not least, the examples use only -reg-, not -reghdfe- at all, and I suspect for good reason.
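
      A quick check of the count in d), assuming (as the help-file formula suggests) one regression per subset of the testing variables, i.e. 2^k regressions for k testing variables:
      Code:
      display 2^4     //   16 regressions with  4 testing variables
      display 2^12    // 4096 regressions with 12 testing variables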

      Hopefully that helps!
      Last edited by Marcos Almeida; 18 Sep 2017, 13:57.
      Best regards,

      Marcos

      • #4
        A current email address for the author is an easy Google, but he doesn't appear to be a member here, so the advice to contact him directly may be germane.

        Disclaimer: I haven't looked at the program or what it does and don't have a view on how best to work in that area.

        • #5
          Hello guys,
          many thanks for your answers!

          @Phil Bromiley:
          So, the program does create two new variables for each independent variable (in other words, for each corevar and for each testingvar).
          Sorry, it might be a translation mistake on my part, but by "rename to the format" I mean that if a variable is called corevar1, the two variables the program creates are named b_corevar1 and se_corevar1. Thus, it adds a "b_" and an "se_" in front of each variable name. However, apparently all these newly created variables get deleted at some point, because 1. they are not shown in the Variables Manager after the program finishes and 2. the program stops because one of these variables (to be precise, the first one created) is not found, as I mentioned above.
          Edit: So what I did was to create the variable b_corevar1 myself beforehand (simply gen b_corevar1 = corevar1). However, the error is still the same. So apparently the program not only fails to find the variable even if it already exists, it also does not write the correct values (the coefficient of corevar1) to it.

          @Marcos Almeida:
          a) Yes, both are created. The first one (result.txt) is also filled correctly, but the second one (table_result.txt) is empty.
          b1) Why? I am not sure about that. I think the examples given are just examples; I would say that does not mean one cannot use other commands. The error does not occur because my regression command fails: all the coefficients and SEs can be calculated. The same goes for f) and reghdfe.
          b2) Yes. The table looks like the following: in the first row I have the names of all the newly created variables (with "no" in the first column), i.e. no, b_corevar1, se_corevar1, ..., b_testingvar12, se_testingvar12.
          In the first column I have "no" in the first row and then the regression number in ascending order (1, 2, 3, etc.).
          The other cells then contain the respective values of the coefficients/SEs.
          c) Yes.
          d) Yes, I know. The calculation takes about 40 minutes, but it works even with that many variables. Why do you think it is mind-boggling? This is the first time I am working with Stata and doing empirical work, so I am quite inexperienced. I know the rough purpose of a robustness check but nothing more about this topic. I would assume that the more testing variables you have, the better the result. Is that wrong?
          e) Sorry, I meant dependent variable instead of independent. Yes, it is actually not written down in the syntax; I think that is a mistake. If you take a look at the second example:
          "checkrob 2 3: reg wage education experience age educspouse children
          Here education and experience are core variables; age, educspouse and children are testing variables."
          In my opinion this implies that wage is the dependent variable.
          f) see b1)

          Nick Cox:
          Thanks for the hint!


          So, asked differently: when you do a robustness check, how do you do it?
          For general information: I have to write a thesis and do an empirical test. My advisor recommended that I do a robustness check, but I do not really know how to do it; I have only had one introductory econometrics course at university. That is why I thought that if I run a fixed-effects regression with clustering, my regression will be (heteroskedasticity?-)robust. As far as I understand, the program tests all combinations of the testing variables being included in the regression. Is that the normal way to check for robustness? If yes, shouldn't there already be a method implemented in Stata to do it?
          I don't mind how I do the robustness check; I just want to do it in a way that works.
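
          For concreteness, a minimal sketch of the kind of fixed-effects regression with clustered standard errors I mean (firm_id and month are placeholder names, not my actual variables):
          Code:
          * declare the panel structure
          xtset firm_id month
          * fixed effects for entity and time, standard errors clustered by entity
          reghdfe depvar corevar1 corevar2 testingvar1, absorb(firm_id month) vce(cluster firm_id)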

          Kind regards
          Peter
          Last edited by Peter Walser; 19 Sep 2017, 09:02.

          • #6
            Peter:
            the first thing I would do is answer a basic question: robust to what?
            Heteroskedasticity (as you state)? Higher-than-average observations?
            Stata has different suites of commands to deal with these issues (see -regress postestimation- and -regress postestimation diagnostic plots-).
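            For example (a rough sketch only; the variable names are just those from the help-file example, not yours):
            Code:
            regress wage education experience age
            estat hettest        // Breusch-Pagan test for heteroskedasticity
            rvfplot              // residual-versus-fitted plot to eyeball unusual observations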
            As an aside, if I read one of your code examples right, you regressed wage on a set of predictors, including education.
            As ability (lurking behind the residuals) can "influence" both educational achievement and wage bargaining, your regression is at risk of endogeneity.
            Kind regards,
            Carlo
            (Stata 19.0)

            • #7
              Hello Carlo,
              thanks for your answer.
              the first thing I would do is answer a basic question: robust to what?
              That is the question I would like to be able to answer.
              Heteroskedasticity (as you state)?
              Isn't my regression automatically heteroskedasticity-robust if I use fixed effects and clustering for the time and entity components of my panel data?
              Higher-than-average observations?
              What exactly do you mean by that? Outliers? I made sure that all of my variables have a skewness within -3/+3 and a kurtosis below 10. If that was not the case, I winsorized, for example at the 2nd and 98th percentiles. My advisor told me that variables are usable if they fulfil these conditions and that, if not, I should use winsorizing to achieve this.
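              A sketch of how this kind of check and winsorizing can be done (myvar is a placeholder variable name; winsor2 is a user-written command from SSC, not necessarily what I used):
              Code:
              summarize myvar, detail                // reports skewness and kurtosis
              ssc install winsor2
              winsor2 myvar, cuts(2 98) suffix(_w)   // winsorize at the 2nd and 98th percentiles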
              As an aside, if I read one of your code examples right, you regressed wage on a set of predictors, including education.
              As ability (lurking behind the residuals) can "influence" both educational achievement and wage bargaining, your regression is at risk of endogeneity.
              That code is not my actual regression; it is just an example from the checkrob help file.

              • #8
                Peter:
                thanks for providing further details.
                Now it's clear (to me, at least) that you're dealing with a panel dataset.
                Assuming that you have a large-N, small-T panel dataset and you're using -xtreg, fe-, the options -robust- and -cluster- do the same job and accommodate heteroskedasticity and/or autocorrelation.
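                For instance (a sketch only; firm_id and month are placeholder panel identifiers), the two calls below give the same standard errors:
                Code:
                xtset firm_id month
                xtreg depvar corevar1 testingvar1, fe vce(robust)
                xtreg depvar corevar1 testingvar1, fe vce(cluster firm_id)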
                I'm not a fan of winsorizing, but other opinions (especially if they come from advisors!) are perfectly legitimate.
                Glad to read that your code is not at risk of endogeneity.
                Kind regards,
                Carlo
                (Stata 19.0)

                • #9
                  Yes, I have an unbalanced panel dataset with N=2000 and T=180 (monthly data over 15 years), resulting in about 250,000 observations.
                  Sorry for not mentioning it earlier.
                  I'm not a fan of winsorizing, but other opinions (especially if they come from advisors!) are perfectly legitimate.
                  Yeah, he told me it is a quite easy method to implement, so I should use it.
                  Glad to read that your code is not at risk of endogeneity.
                  Honestly, I am not quite sure whether my code is at risk of endogeneity or not. How do I recognize this?
                  Because of some unusual patterns in the coefficients of several lags of a variable (the return of a share, i.e. the return 1 month ago, 2 months ago, etc.), I ran some Wald tests among these lags and found that some of the null hypotheses cannot be rejected (they have high p-values). Is that the problem you mean, and/or a big problem in general?
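                  For illustration, Wald tests among lags can be run like this (a sketch only; return, firm_id and month are placeholder names, and the specification is illustrative, not my actual model):
                  Code:
                  xtset firm_id month
                  reghdfe depvar L1.return L2.return L3.return, absorb(firm_id month) vce(cluster firm_id)
                  test L1.return = L2.return             // equality of the first two lag coefficients
                  test (L1.return = 0) (L2.return = 0)   // joint significance of the first two lags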

                  • #10
                    Peter:
                    one of the most apparent examples of endogeneity (which differs from my previous one) occurs when you have reverse causation (e.g., is it low income that causes depression, or the other way round?); another one that springs to mind (which is instead related to my previous example) is omitted variables (and to avoid it you should have a comprehensive knowledge of the data-generating process).
                    As far as lags are concerned, if they refer to the dependent variable (that is, a lagged dependent variable is included among the predictors), you should switch from a static to a dynamic panel data regression model (say, -xtabond-).
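                    A minimal sketch of what that switch might look like (depvar, corevar1, firm_id and month are placeholder names; -xtabond- adds lags of the dependent variable via its lags() option):
                    Code:
                    xtset firm_id month
                    xtabond depvar corevar1 testingvar1, lags(1) vce(robust)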
                    Kind regards,
                    Carlo
                    (Stata 19.0)

                    • #11
                      one of the most apparent examples of endogeneity (which differs from my previous one) occurs when you have reverse causation (e.g., is it low income that causes depression, or the other way round?); another one that springs to mind (which is instead related to my previous example) is omitted variables (and to avoid it you should have a comprehensive knowledge of the data-generating process).
                      Okay, I might have both problems in my regression. However, I basically only have to test the relation of one of the "independent" variables to the dependent variable. I tried to include as many other variables as possible in order to explain as much variance as possible. My advisor said that I should care about the variables that correlate with my crucial variable and definitely find and include those in the model; if I miss other variables, that would only lead to more noise, which is not too bad. However, most of these variables have simultaneous causality.
                      Additionally, I cannot change much about my data; it was given to me by my advisor.
                      As far as lags are concerned, if they refer to the dependent variable (that is, a lagged dependent variable is included among the predictors), you should switch from a static to a dynamic panel data regression model (say, -xtabond-).
                      My advisor told me -reghdfe- is the method I have to use. I think introducing, explaining and using your suggested method would go beyond the scope of my thesis.

                      • #12
                        Peter:
                        far be it from me to introduce innovative approaches now that, as it seems, your thesis is approaching its end!
                        Follow your advisor's suggestions, especially during the final stretch of the flight.
                        All the best for your research.
                        Kind regards,
                        Carlo
                        (Stata 19.0)
