  • Bootstrapping R2 and Bootstrap testing R2 across subsamples

    Hi,

I need some help on how I can accomplish the following using Stata.

I have panel data containing observations for >100 firms (panels) over a number of years. On a conceptual level I would like to run a pooled OLS univariate regression for each subsample based on a median sample split for a given variable. Subsequently, I want to obtain the R2 of these two regressions and calculate the difference in R2. Then I want to obtain a simulated distribution of differences in R2 based on bootstrapped regressions. So essentially I want to use the bootstrap command in Stata to create pseudo sample splits and produce a distribution of differences in R2 between the subsample regressions. The goal is to test whether the actual difference in R2 is different from the simulated distribution of differences in R2.

To get the median split I have created a dummy, which = 1 for observations > the median value and = 0 for observations <= the median value. The regression I want to run looks like this: reg y x if dummy == 1, cluster(Panel_ID), and likewise for dummy == 0. These regressions will be used to obtain the actual difference in R2 between subsamples. Additionally, I want to obtain a simulated distribution of differences in R2 from a bootstrap without replacement. I was thinking of using something like the following for the bootstrap command: bootstrap r2=e(r2), reps(2000) size(= n from the original regression of each subsample): reg y x, cluster(Panel_ID).
    Based on the actual difference in R2 and the simulated distribution of differences in R2, I want to test whether the actual difference in R2 is different from the bootstrapped distribution of differences in R2 using a bootstrap test.

    Below I have summarized my questions/problems:

(1) How can I save the R2 of each initial regression and calculate the difference in R2 (= the actual difference in R2)? (See the sketch after this list.)
(2) How can I create a simulated distribution of differences in R2 between subsamples from pseudo sample splits? For the bootstrap command I want to set size() to the n observed for each subsample regression performed earlier (these should be approximately the same, so I want my pseudo subsamples to have the same size()). Any suggestions for setting size() are welcome. Furthermore, I want to save the obtained differences in R2 for testing (without losing my original dataset).
(3) Lastly, I want to test whether the actual difference in R2 is different from the simulated distribution of differences in R2 using a bootstrap test. How can I conduct such a test?
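
For (1), here is a minimal sketch of what I have in mind (the scalar names are placeholders; y, x, dummy, and Panel_ID are as described above):

Code:
* store each subsample's R2 in a scalar, then take the difference
regress y x if dummy == 1, cluster(Panel_ID)
scalar r2_hi = e(r2)
regress y x if dummy == 0, cluster(Panel_ID)
scalar r2_lo = e(r2)
scalar r2_diff = r2_hi - r2_lo
display "actual difference in R2 = " r2_diff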

    Any help and suggestions would be much appreciated.

    Many thanks, Ali

  • #2
You didn't get a quick answer. You'll increase your chances of a useful answer by following the FAQ on asking questions: provide Stata code in code delimiters, readable Stata output, and sample data using dataex. Instead of giving us a rough sketch of your model, give us an actual model (it should be as simple as will still demonstrate the problem). I don't know whether your bootstrap command works or not.

Start by simplifying the problem. Instead of doing everything all at once, solve one problem at a time. You've got so many models and estimates running around that I can't follow your posting.



    • #3
      Hi Phil,

Thanks for the helpful suggestions. Unfortunately, I cannot provide a data example because my dataset contains proprietary data.

That being said, I would like some insights that will help me move in the right direction in terms of executing the steps described above. Keeping in mind that I am running a simple univariate pooled OLS regression for two subsamples (median split), in general terms, how can I use the bootstrap command to obtain a simulated distribution of differences in R2 (i.e. let Stata create a sample split, perform the regressions described in post #1, obtain the difference in R2, and repeat this process 2000 times)?

Any help on how I could write the required code, or any information that improves my current knowledge of the bootstrap command and/or the problem described above (or a part of it), would be amazing. I have not worked with the bootstrap command or done any bootstrap testing before, so I am having a tough time figuring out how to solve the problem described in post #1.



      • #4
        Originally posted by Ali Malik View Post
        ...

That being said, I would like some insights that will help me move in the right direction in terms of executing the steps described above. Keeping in mind that I am running a simple univariate pooled OLS regression for two subsamples (median split), in general terms, how can I use the bootstrap command to obtain a simulated distribution of differences in R2 (i.e. let Stata create a sample split, perform the regressions described in post #1, obtain the difference in R2, and repeat this process 2000 times)?

        ...
        Ali,

I assume that by median split, you want to run 2 OLS regressions using data above and below the median of each bootstrapped sample. I don't think you can do this within one command, but you can write a short program to conduct the split and return the r^2 of each subsample as a scalar. Then you call the program within bootstrap. Example 4 from the -bootstrap- manual entry should give you an outline of what you need to do.
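
Something like this untested sketch (the program and scalar names are made up; y and x stand in for your actual variables):

Code:
capture program drop r2split
program define r2split, rclass
    // split the current (bootstrap) sample at the median of x
    summarize x, detail
    tempvar hi
    gen byte `hi' = x > r(p50)
    regress y x if `hi' == 1
    scalar r2_hi = e(r2)
    regress y x if `hi' == 0
    scalar r2_lo = e(r2)
    return scalar diff = r2_hi - r2_lo
end

bootstrap diff=r(diff), reps(2000) seed(12345): r2split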

After you write that program, I believe you would ask -bootstrap- to save the results (the empirical differences in r^2) in a data file. I would guess that under the null, the difference should be 0. I've never seen this sort of application before, so I don't know the theoretical distribution of the r^2 difference. I would guess that you would want to calculate some sort of 95% confidence interval of the difference, and then see whether that interval contains 0 or not. I'm unfamiliar with bootstrapped CIs, so I have no idea whether you calculate the 2.5th and 97.5th percentiles of the difference and that's your CI, or whether you need some sort of transformation. I've only done bootstrapping once!
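
If it helps, I believe the syntax would be roughly the following (untested; same made-up r2split program and file name as in the sketch above):

Code:
bootstrap diff=r(diff), reps(2000) seed(12345) saving(r2diffs, replace): r2split
estat bootstrap, percentile   // 95% CI from the 2.5th and 97.5th percentiles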

I'm honestly not sure there is a good substantive reason to test whether r^2 is different. I would normally think that we'd prefer to test whether the regression coefficients are different. I don't think you need bootstrapping for that; you can search for how to do a Chow test on the forum or on Google if that's your goal (but yes, you could bootstrap it if you wanted to!). People tend to over-focus on r^2, in my opinion.
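
If that's the route you take, a minimal Chow-type version via interactions might look like this (untested sketch, reusing your dummy and Panel_ID from post #1):

Code:
regress y c.x##i.dummy, cluster(Panel_ID)
* joint test that the upper group's intercept and slope shifts are zero
test 1.dummy 1.dummy#c.x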
        Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

When presenting code or results, please use code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.



        • #5
          Hi Weiwen,

          Thanks for the helpful advice!

After you write that program, I believe you would ask -bootstrap- to save the results (the empirical differences in r^2) in a data file. I would guess that under the null, the difference should be 0. I've never seen this sort of application before, so I don't know the theoretical distribution of the r^2 difference. I would guess that you would want to calculate some sort of 95% confidence interval of the difference, and then see whether that interval contains 0 or not. I'm unfamiliar with bootstrapped CIs, so I have no idea whether you calculate the 2.5th and 97.5th percentiles of the difference and that's your CI, or whether you need some sort of transformation. I've only done bootstrapping once!
I am running a prediction model (within sample) with one explanatory variable. The bootstrap test is based on assuming the null hypothesis is true (no relation between the dependent and independent variable). The test for the difference in R2 is conducted to examine the difference in the explanatory power of the independent variable between the two subsamples of my sample (median split); I am trying to replicate the approach used in a paper. I do not use a confidence interval for the purpose of testing the difference in R2.

          In case it is not clear how I want to conduct this test, here is what I want to do:
1. Run a regression for each of the two subsamples (based on the median split), obtain the difference in R2, and save this scalar.
2. Use bootstrap to create a simulated distribution of differences in R2 (under the null hypothesis) and save this empirical distribution.
3. Use something like an F-test to test whether the actual difference in R2 (Point 1) is statistically different from the simulated distribution of differences in R2 (Point 2).
I assume that by median split, you want to run 2 OLS regressions using data above and below the median of each bootstrapped sample. I don't think you can do this within one command, but you can write a short program to conduct the split and return the r^2 of each subsample as a scalar. Then you call the program within bootstrap. Example 4 from the -bootstrap- manual entry should give you an outline of what you need to do.
Regarding Example 4 from the manual, I have thought about using this approach but was not sure whether it would give me the desired result. Your comment makes me want to reconsider it. Earlier I was thinking of obtaining a simulated distribution of differences by running bootstrap twice (i.e. run the bootstrapped regression on a sample of size n = the number of observations in the regression under Point 1, save the R2, do the same thing again, and in the end calculate the difference between these R2s). I figured this is probably not the right approach and therefore asked for help. Do you have any idea how I can write the program to conduct the sample split (where the pseudo subsamples need to be the same size as the original subsamples) and return the difference in R2?
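
Would something like the following permutation-style approach work? Shuffling dummy across observations reassigns the group labels while holding each group's n fixed, which is exactly the pseudo-split property I need. An untested sketch (-permute- stands in for -bootstrap- here; program and file names are made up):

Code:
capture program drop r2bydummy
program define r2bydummy, rclass
    regress y x if dummy == 1
    scalar r2_hi = e(r2)
    regress y x if dummy == 0
    scalar r2_lo = e(r2)
    return scalar diff = r2_hi - r2_lo
end

* permute shuffles dummy, keeping each group's size fixed, and compares
* the observed difference in R2 against the permutation distribution
permute dummy diff=r(diff), reps(2000) seed(12345) saving(nulldist, replace): r2bydummy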



          • #6
            Originally posted by Ali Malik View Post
I am running a prediction model (within sample) with one explanatory variable. The bootstrap test is based on assuming the null hypothesis is true (no relation between the dependent and independent variable). The test for the difference in R2 is conducted to examine the difference in the explanatory power of the independent variable between the two subsamples of my sample (median split); I am trying to replicate the approach used in a paper. I do not use a confidence interval for the purpose of testing the difference in R2.

            In case it is not clear how I want to conduct this test, here is what I want to do:
1. Run a regression for each of the two subsamples (based on the median split), obtain the difference in R2, and save this scalar.
2. Use bootstrap to create a simulated distribution of differences in R2 (under the null hypothesis) and save this empirical distribution.
3. Use something like an F-test to test whether the actual difference in R2 (Point 1) is statistically different from the simulated distribution of differences in R2 (Point 2).
            ...
I see I misunderstood your goal: it sounds like you want to use the bootstrap to obtain an empirical distribution of the R-square difference (e.g. above-median minus below-median) in a dataset where the difference should be zero. If I understood you right, this has some conceptual similarities to the bootstrap likelihood ratio test for latent class models. In that case, we are trying to see if a model with k latent classes fits better than a model with k+1 classes. For reasons best explained by people who actually know what they're talking about, the likelihood ratio test statistic (-2 * the difference in log likelihoods) does not have its usual asymptotic chi-square distribution, but it can be simulated empirically.

            To do this, we take a k-class model's parameters, and simulate data based on those parameters. In each simulated dataset, we fit a k- and a k+1 class model, but we know that the k-class model is true. If we take the -2LL difference in each simulated dataset, we have the empirical distribution of that statistic under the null hypothesis. Compare that to the real-life -2LL difference, and you have your test statistic.

If the question you are asking has some substantive justification, then I am going to guess that you need a parallel process. You need to simulate some data based on the null hypothesis, then bootstrap that. I am going to guess that you could simulate some data based on the parameters of a pooled model (i.e. you don't fit separate models for above and below the median). Doesn't the paper you are trying to replicate describe what they bootstrapped or simulated?
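
To make that concrete, here is an untested sketch of what I mean (all names are made up; y, x, and dummy are as in post #1). Fit the pooled model once, then repeatedly regenerate y from its parameters, so equal explanatory power holds by construction, and record the R2 difference. Note that -simulate- leaves the replication results in memory, so save your dataset first.

Code:
* fit the pooled model once and store its parameters
regress y x
global b0 = _b[_cons]
global b1 = _b[x]
global s  = e(rmse)

capture program drop nulldiff
program define nulldiff, rclass
    // regenerate y from the pooled fit plus homoskedastic noise
    capture drop ysim
    gen double ysim = $b0 + $b1 * x + rnormal(0, $s)
    regress ysim x if dummy == 1
    scalar r2_hi = e(r2)
    regress ysim x if dummy == 0
    scalar r2_lo = e(r2)
    return scalar diff = r2_hi - r2_lo
end

simulate diff=r(diff), reps(2000) seed(12345): nulldiff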

            As I alluded to earlier, some statistician may have actually figured out the sampling distribution of the R^2 difference, and then you can compare your actual R^2 difference to that sampling distribution and get a p-value from there. For example, maybe someone far smarter than either of us proved, somehow, that the statistic follows a chi-square distribution with 5 degrees of freedom. Then, you don't need any bootstrapping. (This actually happened in the latent class example, except that some other statisticians attempted to refute the proof; all the math involved is very far over my head.)



            • #7
              For the exact quote from the paper, please see the attached screenshot. The paper uses quintiles 1 and 5 as their subsamples, whereas in my research I am interested in the subsamples resulting from a median split.




              • #8
                Originally posted by Ali Malik View Post
                For the exact quote from the paper, please see the attached screenshot. The paper uses quintiles 1 and 5 as their subsamples, whereas in my research I am interested in the subsamples resulting from a median split.
                FYI, per the FAQ, screenshots are not preferred as they don't always format properly. You can insert a block of text in quotes by using the " button on the formatting toolbar. That said, I can see that screenshot.

...Instead, we use a bootstrap test based on simulating the empirical distribution of the test statistic, assuming that the null is true (Noreen, 1989). In this case, the null hypothesis is that earnings volatility is unrelated to earnings predictability, and the test statistic is the difference in adjusted R^2 between ...
I don't know the Noreen (1989) article (not enough info to find the full citation), but the gist of your quote is indeed that the authors simulated data where the null hypothesis is true. Translated to your case, I think you need to simulate a parallel dataset in which the R^2 difference is zero, i.e. where the split variable is unrelated to the model's explanatory power. But the description of how they did this is uninformative to me: either it obscures much more than it reveals, or I am a mere applied statistician whose brain cannot comprehend the authors' magnificence. Or perhaps both are true.

                We simulate the empirical distribution under the null by randomly splitting the full sample (11,061 observations) into pseudo-earnings volatility quintiles...
OK, what do they mean by "pseudo" earnings volatility quintiles? Also, they split the sample into quintiles, but they do not know that the null hypothesis is true in the sample they had. They are supposed to be simulating data where the null is true. Maybe the mechanism that generates the pseudo earnings volatility quintiles addresses this issue, and if so, you need to replicate it. I don't mean to give you the runaround, but while I can envision how to program this, I'm not 100% sure I actually can pull it off. Also, the fact that your data are clustered introduces additional complications. You can tell the bootstrap command that your data are clustered, but I don't know if there's anything else you need to be aware of.
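
If you do want cluster resampling, I believe the sketch in #4 would change roughly like this (untested; idcluster() gives the resampled clusters unique ids):

Code:
bootstrap diff=r(diff), reps(2000) cluster(Panel_ID) idcluster(newid) seed(12345): r2split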



                • #9
                  Also, the fact that your data are clustered introduces additional complications. You can tell the bootstrap command that your data are clustered, but I don't know if there's anything else you need to be aware of.
I have been advised not to use the cluster option, but in the end it is my choice whether to cluster or not. Let us assume the cluster option is not used/needed in my regressions; any idea how to write the code, or in general terms what you envision for how to program this?

Indeed, the methodology as described in the screenshot is quite vague.

                  Again, I really appreciate your helpful insights!



                  • #10
                    OK. Since you didn't describe your data structure, let's make up data where we know the null hypothesis is false.

Code:
clear
set obs 1000
set seed 1000                                       // reproducible draws
gen x = rnormal()                                   // standard normal regressor
sum x, det
gen median = x >= r(p50)                            // 1 = above the median of x
gen y = 1.5 * x + rnormal() if median == 0          // error sd = 1 below the median
replace y = 1.5 * x + rnormal(0,3) if median == 1   // error sd = 3 above it
twoway scatter y x, msize(small) || lfit y x
regress y x
You can see that the variance of y is larger above the median. You know that if you fit OLS to the entire dataset, it will be "wrong" in some sense: the data are heteroskedastic. We also know that the true beta for x is 1.5. In these data, the R^2 is 0.2842. y is x plus random noise. The R^2 will be smaller if you increase the amount of random noise by raising the standard deviation in the call to -rnormal(mean,sd)-, even if beta is identical.

The problem from there is that I don't know what they mean when they split the sample into "pseudo" earnings quintiles. If they bootstrapped from the actual data, then that doesn't seem right: they have no idea if the data correspond to the null. I can't think of a principled way to simulate data corresponding to the null. Also, it struck me that in this case, the problem is akin to testing for heteroskedasticity. The paragraph you cited essentially said that earnings volatility may differ by future earnings. You could fit a regression model using a dummy for above/below the median, then run one of the tests for heteroskedasticity (which I am not that familiar with and which you should research). For example,

Code:
regress y c.x##i.median   // interaction lets the intercept and slope differ by group
estat hettest             // Breusch-Pagan/Cook-Weisberg test for heteroskedasticity
estat szroeter, rhs       // Szroeter's rank test for each right-hand-side variable
                    This works even if the betas differ above and below the median:

Code:
clear
set obs 1000
set seed 1000
gen x = rnormal()
sum x, det
gen median = x >= r(p50)
gen y = 1.5 * x + rnormal() if median == 0         // beta = 1.5, error sd = 1 below
replace y = 2.5 * x + rnormal(0,2) if median == 1  // beta = 2.5, error sd = 2 above
regress y c.x##i.median
estat hettest
estat szroeter, rhs
                    Sorry I can't be of more help, but your source is not very informative. Moreover, I am still not convinced that this question is a particularly useful one. You could try looking through their citations to see if anyone has implemented this test and described it more clearly.

