Collect mean and std error

Dennis Uhrskov

Join Date: May 2018
Posts: 25

Collect mean and std error

28 May 2018, 11:05

Hi statalist,

I am trying to collect mean and std error using -statsby- for 5 variables, all but one has missing values (rightly so).

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(YM Hh_Lh_1 Hh_Lh_2 Hh_Lh_3 Hh_Lh_4 Hh_Lh_5)
450  -.5541771 . . .  -.5541771
451  -.8913546 . . .  -.8913546
452   1.180533 . . .   1.180533
453 -1.3362566 . . . -1.3362566
454  2.1672866 . . .  2.1672866
455   .1864842 . . .   .1864842
456  -2.896444 . . .  -2.896444
457   -.640608 . . .   -.640608
458   -.329043 . . .   -.329043
459 .068221726 . . . .068221726
460 -.20969166 . . . -.20969166
461 -2.2442036 . . . -2.2442036
462    1.39843 . . .    1.39843
463  3.5063734 . . .  3.5063734
464 -1.2129344 . . . -1.2129344
465  -1.823473 . . .  -1.823473
466  -3.513199 . . .  -3.513199
467   .8716688 . . .   .8716688
468  -4.835299 . . .  -4.835299
469   .7619265 . . .   .7619265
470  -1.219116 . . .  -1.219116
471  -4.920485 . . .  -4.920485
472 -1.1379392 . . . -1.1379392
473  -.4942082 . . .  -.4942082
474  .06450081 . . .  .06450081
475 -.54365623 . . . -.54365623
476 -1.0742679 . . . -1.0742679
477   2.245706 . . .   2.245706
478  -1.124578 . . .  -1.124578
479  -8.332492 . . .  -8.332492
480  -6.281683 . . .  -6.281683
481  -12.63192 . . .  -12.63192
482  3.0493126 . . .  3.0493126
483 -.04377797 . . . -.04377797
484 -.57220787 . . . -.57220787
485  -5.425836 . . .  -5.425836
486   -4.02298 . . .   -4.02298
487   .6950481 . . .   .6950481
488 -1.2252523 . . . -1.2252523
489  -2.062287 . . .  -2.062287
490 -4.6790004 . . . -4.6790004
491 -3.4238594 . . . -3.4238594
492  3.3442326 . . .  3.3442326
493  -4.906322 . . .  -4.906322
494 -3.4882426 . . . -3.4882426
495  2.3637578 . . .  2.3637578
496  -2.596212 . . .  -2.596212
497  -1.723976 . . .  -1.723976
498  .08288559 . . .  .08288559
499  -1.614125 . . .  -1.614125
500   .3808062 . . .   .3808062
501  .05496636 . . .  .05496636
502 -1.1708735 . . . -1.1708735
503   .8644865 . . .   .8644865
504 -2.3744347 . . . -2.3744347
505 -1.0844857 . . . -1.0844857
506 -1.3938882 . . . -1.3938882
507 -2.0912716 . . . -2.0912716
508  -.2791283 . . .  -.2791283
509 -1.7620302 . . . -1.7620302
510  2.3326828 . . .  2.3326828
511   -.476401 . . .   -.476401
512   4.274342 . . .   4.274342
513  -3.094057 . . .  -3.094057
514  -4.921236 . . .  -4.921236
515   2.854689 . . .   2.854689
516 -1.7556505 . . . -1.7556505
517  .26775002 . . .  .26775002
518   .9392418 . . .   .9392418
519 -2.1816537 . . . -2.1816537
520  -3.456911 . . .  -3.456911
521  1.0038673 . . .  1.0038673
522  -2.462537 . . .  -2.462537
523  -2.274594 . . .  -2.274594
524 -2.5359976 . . . -2.5359976
525 -1.1176423 . . . -1.1176423
526   .9934656 . . .   .9934656
527   .4337447 . . .   .4337447
528   -4.44937 . . .   -4.44937
529   2.403954 . . .   2.403954
530  2.5855935 . . .  2.5855935
531   .6877573 . . .   .6877573
532   .7853801 . . .   .7853801
533   .9464576 . . .   .9464576
534  1.5675107 . . .  1.5675107
535   .3676826 . . .   .3676826
536  .27401504 . . .  .27401504
537   .9743056 . . .   .9743056
538    .824662 . . .    .824662
539  -.7150991 . . .  -.7150991
540 -.51185876 . . . -.51185876
541  -.0590181 . . .  -.0590181
542  .16325147 . . .  .16325147
543  .26473868 . . .  .26473868
544 -.08085568 . . . -.08085568
545 -.12526481 . . . -.12526481
546  1.2527564 . . .  1.2527564
547  -.3500435 . . .  -.3500435
548  .05872368 . . .  .05872368
549  -.4448217 . . .  -.4448217
end
format %tm YM

I have tried:

Code:

statsby r(mean) _se, clear : summarize Hh_Lh_1 Hh_Lh_2 Hh_Lh_3 Hh_Lh_4 Hh_Lh_5

And also where i specified each r(mean) and _se as se1=_se[Hh_Lh_1] and so on, none of which works.

Thanks in advance, let me know if i need to clarify something.

Dennis

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

28 May 2018, 12:25

Well, the -summarize- comand does not leave an _se[] vector behind. In fact, it doesn't calculate standard errors at all.

In any case, I'm not sure what you're trying to do. The purpose of -statsby- is to aggregate estimation result carried in subsets of the data. But you don't specify a -by()- option in your command. Do you just want the mean and standard errors of each of these variables for the entire data set? If so, perhaps what you want is -tabstat Hh_*, statistics(mean sem)-.

If that's not what you are looking for, please post back with a clearer explanation.
Comment
Dennis Uhrskov

Join Date: May 2018

Posts: 25
#3

28 May 2018, 12:49

Hi Clyde,

Thank you for your answer, and -tabstat- seems to do the trick, it is because i want to calculate t-statistics.
How can i capture the output? i tried:

Code:

statsby mean sem, clear: tabstat Hh_*, statistics(mean sem)

Some background:
My Hh_* variables are made out of 5 dummy variables (time periods). as so:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input float(period1 period2 period3 period4 period5) 1 . . . 1 1 . . . 1 1 . . . 1 1 . . . 1 1 . . . 1 1 . . . 1 1 . . . 1 1 . . . 1 1 . . . 1 1 . . . 1 1 . . . 1 1 . . . 1 1 . . . 1 1 . . . 1 1 . . . 1 1 . . . 1 1 . . . 1 1 . . . 1 1 . . . 1 end

Some of the periods overlap, so i cannot just have one variable taking different values for different periods.

Whether this is the best way to go about it, i am not sure but it was the only way that i could seem to think of (i am not that well into STATA just yet).

Basically i want to end up with the average of the Hh_* in each period, as well as the t-stat.

Then afterwards i run a regression on each of the Hh_* as well.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#4

28 May 2018, 13:13

I'm completely confused by your description and I have no idea what you want. I don't know what you mean by "capture the output." I have a sense that you want to somehow put these results into a data file, but why do you want to have a data file for just the mean and standard errors of each variable? Perhaps you want

Code:

local sems forvalues i = 1/5 { local sems `sems' se_period`i' = period`i' } collapse (mean) period* (sem) `sems'

This will create a data set containing just the means and standard errors for the five periods. But it will have only a single observation, so I don't understand how you plan to do any kind of regression with it.

I also don't understand what the point of variables period2 through period4 is as they contain only missing values. For that matter, period1 and period5 are actually both constantly 1, so they seem to serve no purpose either.
Comment
Dennis Uhrskov

Join Date: May 2018

Posts: 25
#5

28 May 2018, 13:24

I understand the confusion, I am trying to replicate the following table, using a different dataset:

As for the periods, i have 3.200.000 observation, so each period do have entries, which i have coded as 1 if in the period and 0 otherwise.
Ave is the average of the Hh_* for each period, and t(Ave) is the corresponding t-statistic.

The regression i am going to run is the following:

I hope i start to make some sense
Comment

Dennis Uhrskov

Join Date: May 2018
Posts: 25

28 May 2018, 13:26

And the regression i have run as:

Code:

forvalues i = 1(1)5 {
tempfile copy // Saving data
save `copy'

keep if period`i' == 1

// Running regression Hh-Lh = a + b*MktRf + s*SMB + h*HML
regress Hh_Lh_`i' MktRf SMB HML, vce(robust)
statsby se_alpha=_se[_cons] se_MktRf=_se[MktRf] se_SMB=_se[SMB] se_HML=_se[HML] b_alpha=_b[_cons] b_MktRf=_b[MktRf] b_SMB=_b[SMB] b_HML=_b[HML] r2=e(r2), clear: regress Hh_Lh_`i' MktRf SMB HML, vce(robust)

gen t_alpha = b_alpha / se_alpha
gen t_MktRf = b_MktRf / se_MktRf
gen t_SMB = b_SMB / se_SMB
gen t_HML = b_HML / se_HML
gen r2a  = e(r2_a)

drop r2 se_alpha se_MktRf se_SMB se_HML
export excel using "Regression", sheet("T4 Regression") sheetmodify cell(A`i')
    
// Open dataset again
use `copy', clear
}

Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30100

28 May 2018, 14:12

OK, I think I understand what you are trying to do now. -statsby- does not work the way you are trying to use it. If you had non-overlapping periods, you could create a single period variable and then use -statsby- (although, with so many observations there are faster ways to do it). But since your periods overlap, -statsby- will not help you at all. I'm afraid you have to do this by creating a postfile and updating it in a loop over periods.

I think this is what you want, or is closely related:

Code:

//    SET UP TOY DATA
clear*
set obs 1000
set seed 1234
forvalues i = 1/5 {
    gen period`i' = (runiform() < 0.2)
    gen HhLh_`i' = rnormal(50)
}
gen MktRf = rnormal(100)
gen SMB = rnormal(100)
gen HML = rnormal(100)

//    SET UP A POSTFILE
capture postutil clear
tempfile results
postfile handle int period float (mean_Hh_Lh se_Hh_Lh b_MktRf se_MktRf ///
    b_SMB se_SMB b_HML se_HML r2a) using `results'
    
//    DO CALCULATIONS FOR EACH PERIOD AND STORE RESULTS IN THE POSTFILE
forvalues i = 1/5 {
    local topost (`i')
    tabstat HhLh_`i' if period`i' == 1, save statistics(mean sem)
    matrix M = r(StatTotal)
    local topost `topost' (M[1,1]) (M[2,1])
    regress HhLh_`i' MktRf SMB HML if period`i' == 1, vce(robust)
    foreach v of varlist MktRf SMB HML {
        local topost `topost' (_b[`v']) (_se[`v'])
    }
    local topost `topost' (e(r2a))
    post handle `topost'
}

//    DONE WITH CALCULATIONS
postclose handle

use `results', clear
foreach x in MktRf SMB HML {
    gen t_`x' = b_`x'/se_`x'
    order t_`x', after(se_`x')
}
// INSERT YOUR export excel COMMAND HERE

Comment

Dennis Uhrskov

Join Date: May 2018

Posts: 25
#8

28 May 2018, 15:27

Thanks for the provided code, with just minor tweaks it worked like a charm

One thing though, i don't have any luck tweaking it to provide the _b[_cons] and _se[_cons] as well. could you maybe point me in the right direction?
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30100

28 May 2018, 15:49

Code:

//    SET UP A POSTFILE
capture postutil clear
tempfile results
postfile handle int period float (mean_Hh_Lh se_Hh_Lh b_cons se_cons b_MktRf se_MktRf ///
    b_SMB se_SMB b_HML se_HML r2a) using `results'
    
//    DO CALCULATIONS FOR EACH PERIOD AND STORE RESULTS IN THE POSTFILE
forvalues i = 1/5 {
    local topost (`i')
    tabstat HhLh_`i' if period`i' == 1, save statistics(mean sem)
    matrix M = r(StatTotal)
    local topost `topost' (M[1,1]) (M[2,1])
    regress HhLh_`i' MktRf SMB HML if period`i' == 1, vce(robust)
    local topost `topost' (_b[_cons]) (_se[_cons])
    foreach v of varlist MktRf SMB HML {
        local topost `topost' (_b[`v']) (_se[`v'])
    }
    local topost `topost' (e(r2a))
    post handle `topost'
}

//    DONE WITH CALCULATIONS
postclose handle

use `results', clear
foreach x in cons MktRf SMB HML {
    gen t_`x' = b_`x'/se_`x'
    order t_`x', after(se_`x')
}
// INSERT YOUR export excel COMMAND HERE

Comment

Dennis Uhrskov

Join Date: May 2018

Posts: 25
#10

29 May 2018, 08:22

Hi Clyde,

Thank you for your provided code, it helped me out a lot.

However, I thought I simply needed the mean and the std. error to find the t-statistics for the means of the periods.
I quess i was wrong.

Could you (or someone else) help me out on this subject? i need the t-statestics of the average of Hh-Lh returns, I have now tried several methods, but have gotten no satisfactory values.
So i have 5 averages of Hh-Lh from different periods, one period is the full period, the population, the rest are sub-periods, the samples.

Dennis
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#11

29 May 2018, 08:36

However, I thought I simply needed the mean and the std. error to find the t-statistics for the means of the periods.
I quess i was wrong.

Please elaborate. t = mean/se, so the mean and standard error are, indeed, all you need to calculate a t-statistic. So what is it you think you are missing?
2 likes
Comment
Dennis Uhrskov

Join Date: May 2018

Posts: 25
#12

29 May 2018, 08:54

Ahh, then my 'sanity' check was off..
I just saw the t-stats as being WAY off, as opposed to what range i was trying to hit.

My stats:

Code:

Ave t(Ave)

-0.116 -53.901

0.087 28.175

0.092 23.992

-0.004 -0.794

-0.268 -90.441

From the article the range of Ave is between 0.03 and 0.19 and t(Ave) is between 0.25 and 1.63.
However, i use other periods, and other data, so there will always be a difference, i just didn't anticipate such a big difference. Which is the reason i thought/think that i have done something wrong.

But thanks for clarifying Clyde.
Comment

Ave	t(Ave)
-0.116	-53.901
0.087	28.175
0.092	23.992

-0.004	-0.794
-0.268	-90.441

Announcement

Collect mean and std error

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment