r (2001) insufficient observations

Stefano Grillini

Join Date: Jun 2015

Posts: 85
#1


15 Jul 2016, 10:26

Dear all,

I have a weird problem when looping regressions over a time series dataset. My sample is made of 94 stocks. For each stock I have 3 liquidity measures and its squared returns. I also have 12 aggregated market measures: 3 for each liquidity measure, and the remaining 3 are fixed. For each stock's liquidity measure, I need to run a regression on the above variables of the form:
y= b0 + b1ML1 + b2ML2 + b3ML3 + b4M4 + b5M5 + b6M6 + e
I also need to store (eststo) the estimates. Unfortunately, some stocks' measures are very limited.

My code is the following:

Code:

local N = 94
forvalues i = 1/`N' {
    regress dqspr`i' Dav_qspr lagDav_qspr leadDav_qspr irelandind lagirelandind leadirelandind Dsqrtret`i'
}
eststo

If I

Code:

qui regress

Stata gives me the error 2001 "Insufficient observations" from the first regression, while if I

Code:

regress

Stata gives me error 2001 from the 32nd regression, which is actually the first stock with limited observations (seven to be precise).
How can I overcome this problem without running each regression manually? Is there any way to "ignore" regressions with only limited observations?

Thanks for your help

Stefano Grillini
Clyde Schechter

Join Date: Apr 2014

Posts: 30165
#2

15 Jul 2016, 10:39

Something like this:

Code:

local N = 94
forvalues i = 1/`N' {
    capture regress dqspr`i' Dav_qspr lagDav_qspr leadDav_qspr irelandind ///
        lagirelandind leadirelandind Dsqrtret`i'
    if c(rc) == 0 {
        eststo results`i'
    }
    else if c(rc) == 2001 {
        display "Insufficient results for i == `i': moving on."
    }
    else {
        display "Unanticipated error in regression with i = `i'"
        exit `c(rc)'
    }
}

Notes:

1. The -capture- command, in addition to blocking error conditions, also suppresses output. If you want to see the regression results, insert -noisily- between -capture- and -regress-.

2. This code will give a pass to regressions with insufficient observations, and display a notification in each instance. But it will still break on any other unanticipated error condition.
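As a minimal sketch of note 1, the loop below uses -capture noisily- so that output (and any error message) stays visible while the loop carries on. It is not the original poster's data: the shipped auto dataset and the artificial one-observation sample on the second pass are assumptions made only so the example runs on its own. Return code 2000 (no observations) is treated the same way as 2001 here.

```stata
* Sketch only: the auto dataset stands in for the stock data
sysuse auto, clear
local n_ok = 0
local n_skipped = 0
forvalues i = 1/2 {
    * Pass 1 uses all observations; pass 2 keeps a single observation,
    * which is too few to fit the model
    local cutoff = cond(`i' == 1, _N, 1)
    capture noisily regress price mpg weight if _n <= `cutoff'
    if c(rc) == 0 {
        local ++n_ok
    }
    else if inlist(c(rc), 2000, 2001) {
        display "Too few observations on pass `i': moving on."
        local ++n_skipped
    }
    else {
        display "Unanticipated error on pass `i'"
        exit `c(rc)'
    }
}
display "fitted: `n_ok'  skipped: `n_skipped'"
```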
Stefano Grillini

Join Date: Jun 2015

Posts: 85
#3

15 Jul 2016, 10:47

Thank you very much Clyde, it works (hopefully, as I have to do the same with another dataset with 4000 stocks).

Best Wishes

Stefano Grillini
Stefano Grillini

Join Date: Jun 2015

Posts: 85
#4

16 Jul 2016, 11:15

Hi all, the eststo command can store a maximum of 300 estimates. As I'm now working on a larger dataset, do you have any tips on how to overcome this limitation? (My code is the one above, but for more than 300 regressions.)

Thanks

Stefano
Stefano Grillini

Join Date: Jun 2015

Posts: 85
#5

16 Jul 2016, 11:29

I was thinking about storing estimates using "statsby", as I then need to do further (simple) calculations with the results, so the code I have in mind is something like:

Code:

local N = 351
forvalues i = 1/`N' {
    statsby _b e(r2) _se df=e(df_r), by(date) saving("N:\...\myreg1.dta", replace): regress ///
        dqspr`i' Dav_qspr lagDav_qspr leadDav_qspr irelandind lagirelandind leadirelandind Dsqrtret`i'
    gen t = _b/_se
    gen p = 2*ttail(df,abs(t))
}

As you can see, for each regression I need the coefficients, t-statistics, p-values and r2 (actually I need the adjusted r2, but I don't know how to get it). If I run this command, I have two problems. Firstly, Stata does not recognise _b. In addition, it runs one regression for every day (the date variable). To overcome at least the second limitation, I thought of generating a variable equal to 1 in every observation to use as the by group. Any other ideas or suggestions are appreciated.

Thanks

Stefano
Clyde Schechter

Join Date: Apr 2014

Posts: 30165
#6

16 Jul 2016, 11:36

Well, to be honest, I'm always skeptical of projects that run hundreds of regression analyses and somehow try to synthesize the results. I'm inclined to believe that in the end, the results are either incomprehensible or grossly oversimplified to make them digestible. So maybe Stata's limit is a warning to not do things this way. But at this point, I won't pursue that issue and just credit you with having a reasonable plan.

So, you won't be able to store these estimates all in memory at one time. -help limits- reveals that the maximum of 300 estimates stored in active memory is built into Stata and is not a quirk of -eststo-. But you can save the estimates to disk files instead. So if you replace -eststo results`i'- with -estimates save results`i'-, you will end up with a bunch of files called results1.ster through resultsBIGNUMBER.ster in your working directory. When you need to work with them, you can invoke the -estimates use- command.

At the end of the day, though, if you have a plan that requires having thousands of estimates in memory at the same time, it isn't going to happen. Since the limit of 300 estimates in memory at once applies throughout Stata, if you need to work with more than 300 sets of estimates, you need to find a way to process them serially, or in batches of fewer than 300. (For example, maybe you don't really need the full estimation results for your end product. Maybe you just need specific statistics which can be pulled out of them one at a time and stored in local macros or in matrices, etc.)
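One way the "pull out specific statistics one at a time" idea might look is sketched below. This is an illustration under assumptions, not the code for the actual data: the auto dataset, the three stand-in outcome variables, and the matrix name R are all hypothetical. Each model is fitted, the few numbers of interest are copied into a row of a matrix, and the full estimation results are simply overwritten by the next model, so the 300-estimate limit never comes into play.

```stata
* Sketch only: the auto dataset and these three outcomes are stand-ins
sysuse auto, clear
local outcomes price mpg trunk
local N : word count `outcomes'

* One row per model: coefficient and standard error on weight, adjusted R2
matrix R = J(`N', 3, .)
matrix colnames R = b_weight se_weight r2_a

forvalues i = 1/`N' {
    local y : word `i' of `outcomes'
    capture regress `y' weight length
    if c(rc) == 0 {
        matrix R[`i', 1] = _b[weight]
        matrix R[`i', 2] = _se[weight]
        matrix R[`i', 3] = e(r2_a)
    }
    else if !inlist(c(rc), 2000, 2001) {
        * Anything other than too few observations still stops the loop
        exit `c(rc)'
    }
}
matrix list R
```

Rows for models that could not be fitted stay missing, which makes them easy to spot afterwards.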
Stefano Grillini

Join Date: Jun 2015

Posts: 85
#7

16 Jul 2016, 11:57

Well Clyde, I appreciate your comment and I personally agree with you. However, at this stage, I'm partly replicating another study, so the estimates from all these regressions are mainly used to construct tables. In my field, empirical finance, it does often happen that models are replicated for a considerable number of stocks in a market, as in this case. Generally, I agree with you regarding the increasing complexity of "sometimes meaningless" aggregated measures, but it is often necessary to have a broad overview.

I'll try to follow your suggestion regarding the replacement of -eststo- with -estimates save-.

In any case, do you think the alternative code, using statsby, is in this case a valid solution? If so, any ideas why it does not recognise _b?

I really appreciated your comment

Thanks

Stefano
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17733
#8

16 Jul 2016, 11:58

Stefano wrote:

...(actually I need the adjusted r2, but I don't know how to get it)...

Adjusted R2 is stored as -e(r2_a)-.
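A quick way to check this is sketched below, using the shipped auto dataset as a stand-in (an assumption; any -regress- call would do). The hand calculation holds for a model that includes a constant.

```stata
* Sketch only: the auto dataset stands in for the stock data
sysuse auto, clear
quietly regress price mpg weight
display "R2 = " e(r2) "   adjusted R2 = " e(r2_a)
* By hand: 1 - (1 - R2) * (N - 1) / df_r
display 1 - (1 - e(r2)) * (e(N) - 1) / e(df_r)
```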

Kind regards,
Carlo
(Stata 19.0)
Stefano Grillini

Join Date: Jun 2015

Posts: 85
#9

16 Jul 2016, 11:59

Thanks Carlo,

At least I solved this issue.

Stefano
Clyde Schechter

Join Date: Apr 2014

Posts: 30165
#10

16 Jul 2016, 15:43

Well, the approach using -statsby- can be made to work, but it requires a different data structure from what you have there. It is doing one regression for each date because that is exactly what you told Stata to do when you specified -by(date)-. But your original problem calls for doing 94 regressions based on 94 different pairs of outcome variable and predictor. So you need to get -statsby- to do that. That, in turn, requires going to long layout:

Code:

gen long obs_no = _n
reshape long dqspr Dsqrtret, i(obs_no) j(_j)
statsby _b _se e(r2_a) e(df_r), saving(results, replace) by(_j): ///
    regress dqspr Dav_qspr lagDav_qspr leadDav_qspr irelandind ///
    lagirelandind leadirelandind Dsqrtret

should get you the results you want. The code you had about calculating t from _b and _se does not make sense because _b and _se do not exist: they are not variables in your data set. They are the stubs of a series of variables in the data set results that you are creating, but they are not accessible from within your data set. To do that and other calculations with the results you need next -use results, clear-. That will bring all your regression output into active memory. Then you can start generating new variables. You still won't get to write -gen t = _b/_se- because now there is a whole suite of _b* and _se* variables and you will have to specify which one you want (or construct a loop to get a t statistic for each variable, or whatever it is you need).
Clyde Schechter

Join Date: Apr 2014

Posts: 30165
#11

16 Jul 2016, 19:57

Wait, there's something that's confusing me. Your original request was a loop from 1 through 94. By #5, we're up to 351 iterations and you exceeded the limits of stored estimates. But I also notice that in #5 you are doing 351 iterations, and each iteration is a -statsby, by(date)-, which could involve any number of regressions, depending on how many different dates you have. In my response in #10, I just presumed that the -by(date)- was some kind of misunderstanding of how -statsby- works. You'll notice, by the way, that my response in #10 does not include any explicit loop: the iterating is done within -statsby-. But now I'm wondering if your problem is more complex than you originally made it appear. Do you want to iterate over both dates and 351 variables? If you do, then the -by(_j)- option to -statsby- should be -by(_j date)-. (Still no explicit loop needed.)
Stefano Grillini

Join Date: Jun 2015

Posts: 85
#12

18 Jul 2016, 06:37

Actually your point is correct, Clyde. I could get the information I was looking for with the loop and storing results with eststo (exactly the code you provided in #2). However, as I also specified, I need to replicate the same code for bigger samples (the 351 is not even the biggest). This is where I get stuck, because eststo cannot store all these estimates. I just thought of statsby as an alternative to eststo, whose limit seems impossible to work around with so many estimates.

You are also right in #11, where you say

each iteration is a -statsby, by(date)-, which could involve any number of regressions, depending on how many different dates you have

In fact, Stata runs a regression for each date, with obviously one observation for each variable, which is statistically wrong and meaningless. I could overcome this issue by creating a new variable:

Code:

generate group = _n
replace group = [1]

This solves the problem, as it runs one regression for the whole time period. However, each iteration overwrites the first row of the new .dta file with its own estimates. So if I have 351 regressions, the new .dta file contains only the estimates for the 351st, as each iteration replaces the previous ones.

Thanks

Stefano
Clyde Schechter

Join Date: Apr 2014

Posts: 30165
#13

18 Jul 2016, 07:33

I can't follow what you're describing. I think you need to show a small representative sample of your data (please use -dataex- to do that)*, and then a hand-worked example of what you want the results to be. At this point, I don't understand what regressions you want to do.

*If you do not already have the -dataex- command, you get it by running -ssc install dataex-. Then read -help dataex- for the simple instructions on how to use it.

Stefano Grillini

Join Date: Jun 2015
Posts: 85

#14

20 Jul 2016, 08:34

Dear Clyde, I copied a small sample of the dataset for only the last observations of the time series and for only five stocks.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int date float(dqspr1 dqspr2 dqspr3 dqspr4 dqspr5 Dav_qspr lagDav_qspr leadDav_qspr irelandind lagirelandind leadirelandind Dsqrtret1 Dsqrtret2 Dsqrtret3 Dsqrtret4 Dsqrtret5)
20426            .5   .25000006          .1            3           1     .14946306    -.05095674             .    -.3014796    -1.159085            .    .08694334    -.9833107    .3165502    -.4738849   -.9660669
20429     -.3333333         -.4   -.7272727    -.1666667           0    -.08122023             .      .1716124      2.91223            .    -2.595253            .    212.88423   15.291967     .2820318 -.011661846
20430             0   2.0000002    3.111111           .4         -.5      .1716124    -.08122023   -.005136395    -2.595253      2.91223   -1.3584417            .     -.920926   -.8880337    1.1364436    19.67324
20431            .5    .5555555    .3783783    -.7857143           0   -.005136395      .1716124    -.06889795   -1.3584417    -2.595253   -1.4172257    -.5012603     .6805121   2.2384634    -.9792513           .
20432             0   -.7857143   -.8431373    1.6666666           1    -.06889795   -.005136395    -.10478014   -1.4172257   -1.3584417   -1.6713814    -.7312129    -.9875518   -.4125474     6.902411           .
20433     -.3333333           1         4.5        -.875           0    -.10478014    -.06889795             .   -1.6713814   -1.4172257            .    3.3150625    35.187714   -.6777869     10.07242    8.054161
20436           -.5   1.8333334  -.20454547            3         -.5    .000213613             .    -.14328423   -2.0021513            .       .89406     1.570544    -.8879591   -.8308975    -.9980708           0
20437             1   -.7058824  -.25714287            3           1    -.14328423    .000213613      -.216351       .89406   -2.0021513    .29542577   -.54340386    117.52133   28.039536     153.7277 -.017751353
20438            .5         -.4   -.2307692         -.75         -.5      -.216351    -.14328423     .11927483    .29542577       .89406     .3793121    -.7568971     -.880347    -.995514     2.008746   -.8895476
20439      .3333333    .3333333         -.8    .50000006           2     .11927483      -.216351      .4606423     .3793121    .29542577    -.7395194     7.534437     .8116891    475.6776    -.2791414  .005952462
20440           .25        2.75        -.25    -.1666667   -.3333333      .4606423     .11927483             .    -.7395194     .3793121            .    -.8999519     2.281612  -.10682856    -.8475372    35.76267
20443          -.48         -.4    3.333333         10.2           0     .25564197             .     -.3662845   -4.4447746            .    1.6510344     40.20555    -.6940711   -.7143504     5.478612   -.8875475
20444      .2307693    .5555555     .923077    -.9464286         1.5     -.3662845     .25564197    -.14107579    1.6510344   -4.4447746     .5135787    -.9785348    -.7191756   -.5166548    -.9145004    10.92438
20445    -.26375002   -.7857143        -.96            1         -.6    -.14107579     -.3662845      .1520068     .5135787    1.6510344   -1.8510345     1.390468     -.220939   1.4038823            0   -.8216918
20446     -.5925297   1.6666666          70    -.8333333         -.5      .1520068    -.14107579             .   -1.8510345     .5135787            .    -.3154395     2.252932   -.9268817     .3978528   -.5620139
20447             .           .           .            .           .             .      .1520068             .            .   -1.8510345            .            .            .           .            .           .
20450             .           .           .            .           .             .             .             .            .            .   -1.5234817            .            .           .            .           .
20451             .           .           .            .           .             .             .      .6462319   -1.5234817            .    2.8659816            .            .           .            .           .
20452     -.3162393         2.5        18.4    -.8333333           0      .6462319             .    -.11682374    2.8659816   -1.5234817    .08692798            .     -.929072   -.8248055            .     8.10583
20453           .35   -.2142857   -.7783505           12   -.3333333    -.11682374      .6462319             .    .08692798    2.8659816            .            .    22.306507   1.1994845            .  -.54897934
end
format %tdnn/dd/CCYY date

Using the command discussed above for the five stocks in the sample:

Code:

local N = 5
forvalues i = 1/`N' {
    capture regress dqspr`i' Dav_qspr lagDav_qspr leadDav_qspr irelandind lagirelandind leadirelandind Dsqrtret`i'
    if c(rc) == 0 {
        eststo results`i'
    }
    else if c(rc) == 2001 {
        display "Insufficient results for i == `i': moving on."
    }
    else if c(rc) == 2000 {
        display "Insufficient results for i == `i': moving on."
    }
    else {
        display "Unanticipated error in regression with i = `i'"
        exit `c(rc)'
    }
}

esttab using "N:\...irelandmodel1.csv", ar2
eststo clear

I obtain exactly what I want. The total sample in this case is made of 94 stocks (local N = 94), so I don't have any problem in using eststo.
However, the present analysis has to be replicated for 3 other markets, which contain more than 400 stocks (actually one of them has 3000 stocks). Here is the problem: eststo cannot store all these results, so I need to find an alternative way to do this.

All the estimates obtained are then used to construct a table indicating:
- How many positive coefficients;
- How many coefficients are significant;
- What's the average ar2;
- and so on.

So, as you can see, these calculations can also be done in Stata, so I believe it is not necessary to store the estimates in an Excel file. That's why I thought about statsby.

Hope this clarifies the issue

Thanks

Stefano


Clyde Schechter

Join Date: Apr 2014
Posts: 30165

#15

20 Jul 2016, 09:08

Yes, -statsby- is your friend here. The following code will get you all of the coefficients and p-values.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int date float(dqspr1 dqspr2 dqspr3 dqspr4 dqspr5 Dav_qspr lagDav_qspr leadDav_qspr irelandind lagirelandind leadirelandind Dsqrtret1 Dsqrtret2 Dsqrtret3 Dsqrtret4 Dsqrtret5)
20426            .5   .25000006          .1            3           1     .14946306    -.05095674             .    -.3014796    -1.159085            .    .08694334    -.9833107    .3165502    -.4738849   -.9660669
20429     -.3333333         -.4   -.7272727    -.1666667           0    -.08122023             .      .1716124      2.91223            .    -2.595253            .    212.88423   15.291967     .2820318 -.011661846
20430             0   2.0000002    3.111111           .4         -.5      .1716124    -.08122023   -.005136395    -2.595253      2.91223   -1.3584417            .     -.920926   -.8880337    1.1364436    19.67324
20431            .5    .5555555    .3783783    -.7857143           0   -.005136395      .1716124    -.06889795   -1.3584417    -2.595253   -1.4172257    -.5012603     .6805121   2.2384634    -.9792513           .
20432             0   -.7857143   -.8431373    1.6666666           1    -.06889795   -.005136395    -.10478014   -1.4172257   -1.3584417   -1.6713814    -.7312129    -.9875518   -.4125474     6.902411           .
20433     -.3333333           1         4.5        -.875           0    -.10478014    -.06889795             .   -1.6713814   -1.4172257            .    3.3150625    35.187714   -.6777869     10.07242    8.054161
20436           -.5   1.8333334  -.20454547            3         -.5    .000213613             .    -.14328423   -2.0021513            .       .89406     1.570544    -.8879591   -.8308975    -.9980708           0
20437             1   -.7058824  -.25714287            3           1    -.14328423    .000213613      -.216351       .89406   -2.0021513    .29542577   -.54340386    117.52133   28.039536     153.7277 -.017751353
20438            .5         -.4   -.2307692         -.75         -.5      -.216351    -.14328423     .11927483    .29542577       .89406     .3793121    -.7568971     -.880347    -.995514     2.008746   -.8895476
20439      .3333333    .3333333         -.8    .50000006           2     .11927483      -.216351      .4606423     .3793121    .29542577    -.7395194     7.534437     .8116891    475.6776    -.2791414  .005952462
20440           .25        2.75        -.25    -.1666667   -.3333333      .4606423     .11927483             .    -.7395194     .3793121            .    -.8999519     2.281612  -.10682856    -.8475372    35.76267
20443          -.48         -.4    3.333333         10.2           0     .25564197             .     -.3662845   -4.4447746            .    1.6510344     40.20555    -.6940711   -.7143504     5.478612   -.8875475
20444      .2307693    .5555555     .923077    -.9464286         1.5     -.3662845     .25564197    -.14107579    1.6510344   -4.4447746     .5135787    -.9785348    -.7191756   -.5166548    -.9145004    10.92438
20445    -.26375002   -.7857143        -.96            1         -.6    -.14107579     -.3662845      .1520068     .5135787    1.6510344   -1.8510345     1.390468     -.220939   1.4038823            0   -.8216918
20446     -.5925297   1.6666666          70    -.8333333         -.5      .1520068    -.14107579             .   -1.8510345     .5135787            .    -.3154395     2.252932   -.9268817     .3978528   -.5620139
20447             .           .           .            .           .             .      .1520068             .            .   -1.8510345            .            .            .           .            .           .
20450             .           .           .            .           .             .             .             .            .            .   -1.5234817            .            .           .            .           .
20451             .           .           .            .           .             .             .      .6462319   -1.5234817            .    2.8659816            .            .           .            .           .
20452     -.3162393         2.5        18.4    -.8333333           0      .6462319             .    -.11682374    2.8659816   -1.5234817    .08692798            .     -.929072   -.8248055            .     8.10583
20453           .35   -.2142857   -.7783505           12   -.3333333    -.11682374      .6462319             .    .08692798    2.8659816            .            .    22.306507   1.1994845            .  -.54897934
end
format %tdnn/dd/CCYY date

isid date
reshape long dqspr Dsqrtret, i(date) j(_j)
tempfile results
statsby _b _se e(r2_a) e(df_r), saving(`results') by(_j): regress dqspr Dav_qspr lagDav_qspr leadDav_qspr ///
    irelandind lagirelandind leadirelandind Dsqrtret
    
use `results', clear
rename _eq2_stat_1 adjusted_r2
rename _eq2_stat_2 df_r
ds _b*
local predictors `r(varlist)'
local predictors: subinstr local predictors "_b_" "", all
foreach p of local predictors {
    gen t_`p' = _b_`p'/_se_`p'
    gen p_`p' = 2*ttail(df_r, abs(t_`p'))
}

As I do not understand what you mean by "how many coefficients are positive" and "how many are significant" I will leave it to you to take it from here. It is likely that -egen- functions will play a role in finishing the job, though not knowing where we are going here, I can't be more specific than that.

Note that your example data, while very useful for setting up the code (and I thank you for it), does not contain sufficiently many observations with non-missing values to actually calculate standard errors for your coefficients (note that df_r is always zero), but with a realistic size data set that should not be a problem.

Note also that this code assumes that variable date uniquely identifies observations in your data. If that is not the case, you will have to create a unique identifier for observations and use that variable, rather than date, in the -i()- option of the -reshape- command.
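For the summary table itself, one possible reading of "how many coefficients are positive", "how many are significant" and "average adjusted R2" is sketched below. Everything specific in it is an assumption: the auto dataset grouped by foreign stands in for the per-stock results, weight stands in for one predictor, and the 5% level is an arbitrary threshold. The -statsby- call follows the same pattern as the code above, so the unnamed statistics come back as _eq2_stat_1 and _eq2_stat_2.

```stata
* Sketch only: auto data grouped by foreign stand in for per-stock results
sysuse auto, clear
statsby _b _se e(r2_a) e(df_r), by(foreign) clear: regress price mpg weight
rename _eq2_stat_1 adjusted_r2
rename _eq2_stat_2 df_r
gen t_weight = _b_weight / _se_weight
gen p_weight = 2 * ttail(df_r, abs(t_weight))

* Number of models with a positive coefficient on weight
count if _b_weight > 0 & !missing(_b_weight)
local n_pos = r(N)

* Number significant at the 5% level (the threshold is an assumption)
count if p_weight < 0.05
local n_sig = r(N)

* Average adjusted R2 across models
summarize adjusted_r2, meanonly
local mean_ar2 = r(mean)

display "positive: `n_pos'  significant: `n_sig'  mean adj. R2: " %6.4f `mean_ar2'
```

Because missing values sort above every number in Stata, the explicit !missing() guard matters for the "greater than" count, while the "less than" count excludes missing p-values automatically.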

So, as you can see, these calculations can also be done in Stata, so I believe it is not necessary to store the estimates in an Excel file.

In my very strongly held opinion, Excel should NEVER (is that shouted loud enough?) play any intermediate role in data analysis. Excel should be used only to send final results to, and receive original data sets from, other people. Data analysis should not include any steps that involve Excel along the way, because you have no way of assuring the integrity of data in an Excel file and it leaves no audit trail of any modifications or calculations made in it. Reserve Excel for the beginning and the end only (and even that only if your colleagues prefer it).
