
  • Suest command implementation with more than 300 estimates stored

    Dear community,

    I hope all of you are doing well.

    I am currently working on a project where I need to run the suest command after obtaining estimates from a linear regression model. The model is the same for all outcomes, but the number of observations may differ across outcomes. The problem is that I need to store around 1,000 sets of estimates before running suest, but Stata only allows storing up to 300, a limit intended to guard against memory problems. This is not an issue for me, since I am working on a server with plenty of memory.

    I cannot switch from estimates store to estimates save, because suest works only with the former, not the latter (that is my understanding, but I wish I were wrong).

    To illustrate, here is my code:

    Code:
    foreach Y in `Yvars' {
        reg `Y' `X' `X2'
        estimates store `Y'
    }

    suest `Yvars', vce(cluster ClusterVar)
    After this, I also implement the lincom command.

    My problem is the limit of 300 estimates that can be stored in memory, since suest works only with estimates store, not with estimates save. I am looking for a solution to this problem, and I am thinking of different possibilities:
    1. Is there any other Stata command that does the same work as suest, but that can operate with estimates save instead of estimates store?
    2. Is it possible to link a .ster file generated with estimates save (where the estimation results are stored on the hard drive) to memory, so that the results saved on disk are "loaded" into memory sequentially and suest can do its job? I am thinking of something like Stata "calling into memory" batches of, say, 200 estimates saved on disk, letting suest start to work, and then replacing the estimates in memory as suest finishes with them.
    I am not sure whether there is any other possible solution to my problem, but every suggestion is welcome.

    Thanks a lot.
    Last edited by Santiago Rodrigo Cesteros; 06 Oct 2021, 02:45.

  • #2
    I think what you seek will not be possible.

    Have you tried running your desired analysis using a smaller number of your outcomes – say, 200 of them – to confirm that what you seek to do is not constrained by suest in ways other than the number of estimates that can be stored?



    • #3
      Hi Santiago,
      Considering your model description, I wonder whether you will face other challenges.
      For instance, why is it that you have 1,000 models? Could you describe what it is you are trying to do?
      With so many models, you are bound to have significant results in some of them, so corrections for multiple testing may be necessary (beyond joint estimation).
      You will also have many variables per model. Say, 10 explanatory variables plus a constant, for a total of 11 coefficients per regression. Multiply this by 1,000 and you will be reaching the upper bound of Stata's matrix dimensions (11,000 in older versions, if I remember correctly). A lot of the math and computation will become incredibly sluggish at that point.

      Perhaps if you restate your question and explain more about its purpose, it will be easier to find a feasible solution.
      F




      • #4
        Fernando expresses concerns that I also had while writing post #2. On Statalist, we often see complicated problems which, it turns out, can be avoided rather than solved by suggesting a different path to the poster's ultimate destination.

        With that said, in this openly available paper from Sociological Methodology

        https://journals.sagepub.com/doi/ful...81175019852763

        the authors write

        Canette (2014) explained that the SUEST method can also be implemented using generalized structural equation modeling software, such as Stata’s gsem command (see also Lindsey 2016).

        Canette, Isabel. 2014. “Using gsem to Combine Estimation Results.” The Stata Blog. Retrieved May 16, 2019. http://blog.stata.com/2014/08/18/usi...ation-results/.
        While I remain concerned that you have set out on a difficult path to a result more easily reached by another path, perhaps this article will help.



        • #5
          Originally posted by William Lisowski View Post
          Hi William,

          I ran my code with only around 150 outcomes and it worked well. I got no errors during the process. So far, the only constraint I am facing is the number of estimates that can be stored in memory.
          Last edited by Santiago Rodrigo Cesteros; 07 Oct 2021, 03:15.



          • #6
            Originally posted by FernandoRios View Post

            Hi Fernando,

            We are working with administrative data in which different public entities can spend money to buy goods and services using different modalities. These aggregated amounts can be disaggregated by type of product at different levels. The least disaggregated classification gives us around 50 different types of products, while the most disaggregated one gives us around 1,000 different products (e.g., a first classification could be food, and food can be broken down into water, soft drinks, meat, etc.). The analysis we are performing with the aggregated amounts is what we now want to do at the product level: instead of having one (aggregated) outcome, we have one outcome per type of product.

            I hope this clarifies the context in which I am working.



            • #7
              Originally posted by William Lisowski View Post
              Hi William,

              Thanks for your reply. I will take a look at that. If the gsem command can do the same job as suest, then it may be a possibility to consider.



              • #8
                If the objective of using suest is to obtain the joint variance-covariance (vcov) matrix of all the parameter estimates, then consider this approach. Suppose there are n models, each having m parameters; in your case n>300. To get the overall nm x nm vcov matrix (what you would like suest to produce), you could loop over pairs of models, use suest to compute the joint vcov for each pair j,k=1,...,n, and use Mata to build up the nm x nm overall vcov.

                Here's an illustrative example that uses a probit model with n=4 and m=3, but the model could be anything suest can accommodate.

                Code:
                cap drop _all
                
                set seed 2345
                
                set matastrict off
                
                est clear
                
                local nparm=3
                local nmod=4
                
                set obs 100
                forval j=1/`nmod' {
                 gen y`j'=uniform()>.5
                }
                gen x1=uniform()
                gen x2=uniform()
                
                forval j=1/`nmod' {
                 qui probit y`j' x1 x2
                 est store est`j'
                }
                
                qui suest est*
                
                matrix list e(V)
                
                mata vall=J(`nmod'*`nparm',`nmod'*`nparm',.)
                
                forval j=1/`nmod' {
                 forval k=1/`nmod' {
                  qui suest est`j' est`k'
                  mata {
                    jind=((`j'-1)*`nparm'):+(1..`nparm')
                    kind=((`k'-1)*`nparm'):+(1..`nparm')
                    if (`j'==`k') vall[jind,kind]=st_matrix("e(V)")[1..`nparm',1..`nparm']
                    if (`j'!=`k') vall[jind,kind]=st_matrix("e(V)")[1..`nparm',(`nparm'+1)..(2*`nparm')]
                  }
                 }
                }
                
                mata vall
                
                cap restore
                The results include
                Code:
                . qui suest est*
                
                .
                . matrix list e(V)
                
                symmetric e(V)[12,12]
                                  est1_y1:    est1_y1:    est1_y1:    est2_y2:    est2_y2:    est2_y2:    est3_y3:
                                       x1          x2       _cons          x1          x2       _cons          x1
                   est1_y1:x1   .20144323
                   est1_y1:x2  -.02133833   .22367378
                est1_y1:_cons  -.08849961  -.09432986    .1037153
                   est2_y2:x1   .01781022   .00583711  -.00971164   .20391255
                   est2_y2:x2   .00633163   .03233892  -.00004085  -.01983431   .22355507
                est2_y2:_cons  -.00995285   .00052661  -.00400146  -.08954456  -.09561693   .10448775
                   est3_y3:x1   .00814581  -.01490431   .00534518   .01570078   .02823561  -.02836561   .19094824
                   est3_y3:x2  -.01445972    .0190207   .00012208   .02786313   .01541896  -.01892129  -.01547745
                est3_y3:_cons   .00502301   .00051938  -.00502998  -.02791921  -.01921036   .02617372  -.08662506
                   est4_y4:x1  -.01064478  -.02477466   .01838488   .07090832   .02020261   -.0427962   .04911927
                   est4_y4:x2  -.02509674   .00416723   .01871459   .02132308   .03701142  -.03531496     .005577
                est4_y4:_cons   .01815388   .01873788  -.02365554  -.04304151   -.0347871   .04498214   -.0249226
                
                                  est3_y3:    est3_y3:    est4_y4:    est4_y4:    est4_y4:
                                       x2       _cons          x1          x2       _cons
                   est3_y3:x2   .21944462
                est3_y3:_cons  -.09490548   .10297088
                   est4_y4:x1   .00490062  -.02467669   .20081972
                   est4_y4:x2   .03826455  -.01841812   -.0217125   .21863611
                est4_y4:_cons  -.01807432   .02098608  -.08957031  -.09175165    .1035499
                
                
                .
                . mata vall
                                   1              2              3              4              5              6
                     +-------------------------------------------------------------------------------------------
                   1 |   .2014432289   -.0213383293    -.088499614    .0178102211    .0063316331   -.0099528452
                   2 |  -.0213383293    .2236737818   -.0943298633    .0058371147    .0323389243    .0005266125
                   3 |   -.088499614   -.0943298633    .1037153048   -.0097116429   -.0000408514   -.0040014563
                   4 |   .0178102211    .0058371147   -.0097116429    .2039125516    -.019834313   -.0895445589
                   5 |   .0063316331    .0323389243   -.0000408514    -.019834313    .2235550742   -.0956169275
                   6 |  -.0099528452    .0005266125   -.0040014563   -.0895445589   -.0956169275    .1044877521
                   7 |   .0081458089   -.0149043075    .0053451763    .0157007764    .0282356058   -.0283656056
                   8 |  -.0144597209    .0190207017    .0001220784    .0278631338    .0154189551   -.0189212904
                   9 |   .0050230052    .0005193787   -.0050299753   -.0279192085   -.0192103559    .0261737194
                  10 |   -.010644782   -.0247746616    .0183848769    .0709083151    .0202026114   -.0427962015
                  11 |  -.0250967382    .0041672292    .0187145906    .0213230811      .03701142   -.0353149617
                  12 |   .0181538812    .0187378786   -.0236555394   -.0430415081   -.0347871019    .0449821393
                     +-------------------------------------------------------------------------------------------
                                   7              8              9             10             11             12
                      -------------------------------------------------------------------------------------------+
                   1     .0081458089   -.0144597209    .0050230052    -.010644782   -.0250967382    .0181538812  |
                   2    -.0149043075    .0190207017    .0005193787   -.0247746616    .0041672292    .0187378786  |
                   3     .0053451763    .0001220784   -.0050299753    .0183848769    .0187145906   -.0236555394  |
                   4     .0157007764    .0278631338   -.0279192085    .0709083151    .0213230811   -.0430415081  |
                   5     .0282356058    .0154189551   -.0192103559    .0202026114      .03701142   -.0347871019  |
                   6    -.0283656056   -.0189212904    .0261737194   -.0427962015   -.0353149617    .0449821393  |
                   7     .1909482445   -.0154774518   -.0866250641    .0491192701     .005576996   -.0249226027  |
                   8    -.0154774518    .2194446182   -.0949054836    .0049006152    .0382645516    -.018074319  |
                   9    -.0866250641   -.0949054836    .1029708791   -.0246766864   -.0184181169    .0209860843  |
                  10     .0491192701    .0049006152   -.0246766864    .2008197189   -.0217125046   -.0895703135  |
                  11      .005576996    .0382645516   -.0184181169   -.0217125046    .2186361054   -.0917516532  |
                  12    -.0249226027    -.018074319    .0209860843   -.0895703135   -.0917516532    .1035499044  |
                      -------------------------------------------------------------------------------------------+
                Compare the result displayed by matrix list e(V) after the suest est* command with the result displayed by the mata vall command.

                This is computationally wasteful in at least two ways. First, the suest blocks could be larger (i.e., accommodate more than just two models). Second, since the overall vcov is symmetric, it is only necessary to compute either the lower or the upper off-diagonal blocks. The code shown here could presumably be modified to handle these, at the expense of more complicated loop indexing.
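                The symmetry shortcut just described can be sketched outside Stata. Below is an illustrative NumPy version (Python rather than Stata; pairwise_cov is a hypothetical stand-in for a "suest est_j est_k" call, sampling blocks from a made-up symmetric "true" vcov) showing that computing only the blocks with k >= j and mirroring them recovers the full matrix:

```python
import numpy as np

# pairwise_cov(j, k) stands in for "suest est_j est_k": it returns the joint
# 2m x 2m vcov of models j and k, here sampled from a made-up symmetric
# "true" matrix so the reconstruction can be checked.
n, m = 4, 3                          # n models, m parameters each
rng = np.random.default_rng(0)
A = rng.standard_normal((n * m, n * m))
V_true = A @ A.T                     # symmetric "true" overall vcov

def pairwise_cov(j, k):
    idx = np.r_[j * m:(j + 1) * m, k * m:(k + 1) * m]
    return V_true[np.ix_(idx, idx)]

V = np.empty((n * m, n * m))
for j in range(n):
    for k in range(j, n):            # upper-triangular blocks only: k >= j
        pc = pairwise_cov(j, k)
        block = pc[:m, :m] if j == k else pc[:m, m:]
        V[j * m:(j + 1) * m, k * m:(k + 1) * m] = block
        V[k * m:(k + 1) * m, j * m:(j + 1) * m] = block.T  # mirror the block

assert np.allclose(V, V_true)        # full vcov recovered from half the pairs
```

                The same k >= j indexing carries over directly to the Mata loop above, roughly halving the number of suest calls.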



                • #9
                  Correction to my previous post #8: I had imagined that the constraint was with suest, but I just realized the constraint is with the number of estimates that can be stored using estimates store. Apologies for any confusion I may have created.

                  I think the same basic idea I proposed in #8 can still be used, however. One would run a (j,k) loop to estimate pairs of models, estimates store those estimates, run suest on them, update the Mata vall matrix, estimates clear the recently stored estimates, and continue the (j,k) loop to completion. For the same reasons mentioned in #8 this is computationally wasteful, but that could be remedied with better looping logic if necessary.



                  • #10
                    Originally posted by John Mullahy View Post
                    Hello, John.

                    Thanks a lot for your reply. I am wondering whether, even after clearing the stored estimates, I would be able to run the lincom command, which is a postestimation command. The estimates will be available in the Mata vall matrix, but I am not sure there is a way to "tell" lincom that the estimates are in Mata instead of in memory...



                    • #11
                      Have you considered mvreg (since I think you mentioned using linear regressions only)?
                      This may not work in cases where data are missing. However, if the missingness is in the dependent variable, you will also have to deal with other types of problems.



                      • #12
                        Originally posted by FernandoRios View Post
                        Hi, Fernando.

                        Thanks a lot again for your contribution. Indeed, I am using linear regression, but I see two issues with this command. The first is that mvreg does not allow clustering standard errors. The second is that our outcome variables have missing values (some have more missing observations than others), so it may not be a proper command for this framework.



                        • #13
                          Yes, I thought so,
                          but there are a couple of things to keep in mind. If your dependent variable has missing values, then a simple linear regression may not be appropriate (I don't think it is), because you will have endogenous sample selection (driven by the dependent variable).
                          If you are willing to accept that assumption, the other option, which I suggested on Twitter, was for you to construct your own version of mvreg that estimates MORE equations at the same time.

                          something like this:
                          Code:
                           program myvreg
                              args lnf xb1 xb2
                              qui {
                                replace `lnf'=0
                                replace `lnf'=`lnf'-($ML_y1-`xb1')^2 if $ML_y1!=.
                                replace `lnf'=`lnf'-($ML_y2-`xb2')^2 if $ML_y2!=.
                              }
                           end
                          
                          ml model lf myvreg (y1 = x1 x2) (y2=x1 x2), missing maximize
                          est sto m1
                          ml model lf myvreg (y3 = x1 x2) (y4=x1 x2), missing maximize
                          est sto m2
                          suest m1 m2
                          So above, rather than estimating and storing 4 separate regressions, I only stored 2 estimation results.
                          You could expand the example to, say, 10 equations per ml model, and then you would only need 100 stored estimation results (for 1,000 outcomes) before calling suest.

                          As I mentioned before, however, you may run into other problems with matrix sizes.

                          Also, regarding your question on how to use the Mata output: you will have to use a more "by hand" procedure. Since you can track each equation, and you have every variance and covariance, you will have to do by hand all the math that lincom does.

                          F
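                          To make that "by hand" lincom arithmetic concrete, here is a small NumPy sketch (Python, with made-up illustrative numbers, not results from this thread): for a stacked coefficient vector b with joint vcov V (e.g. the Mata vall matrix from #8, exported with st_matrix), a linear combination c'b has point estimate c@b and standard error sqrt(c'Vc), which is the core of what lincom computes:

```python
import numpy as np

# Illustrative numbers only: b stacks the coefficients of two 2-parameter
# models, V is their joint variance-covariance matrix (in practice this
# would be the overall vcov assembled in post #8).
b = np.array([0.8, -0.3, 1.1, 0.5])
V = np.array([[0.04, 0.01, 0.00, 0.00],
              [0.01, 0.05, 0.00, 0.00],
              [0.00, 0.00, 0.03, 0.01],
              [0.00, 0.00, 0.01, 0.06]])

# Contrast picking out b[0] - b[2], i.e. the difference between the first
# coefficient of each model (what lincom would be asked to test).
c = np.array([1.0, 0.0, -1.0, 0.0])

est = c @ b                        # point estimate of the combination: -0.3
se = float(np.sqrt(c @ V @ c))     # sqrt(0.04 + 0.03); cross terms are zero here
z = est / se                       # normal-based test statistic
ci = (est - 1.96 * se, est + 1.96 * se)   # normal-based 95% confidence interval
```

                          With clustered or otherwise non-standard vcov blocks the same c'b and c'Vc formulas apply; only the contents of V change.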

