Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Collect mean and std error

    Hi statalist,

    I am trying to collect mean and std error using -statsby- for 5 variables, all but one has missing values (rightly so).

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(YM Hh_Lh_1 Hh_Lh_2 Hh_Lh_3 Hh_Lh_4 Hh_Lh_5)
    450  -.5541771 . . .  -.5541771
    451  -.8913546 . . .  -.8913546
    452   1.180533 . . .   1.180533
    453 -1.3362566 . . . -1.3362566
    454  2.1672866 . . .  2.1672866
    455   .1864842 . . .   .1864842
    456  -2.896444 . . .  -2.896444
    457   -.640608 . . .   -.640608
    458   -.329043 . . .   -.329043
    459 .068221726 . . . .068221726
    460 -.20969166 . . . -.20969166
    461 -2.2442036 . . . -2.2442036
    462    1.39843 . . .    1.39843
    463  3.5063734 . . .  3.5063734
    464 -1.2129344 . . . -1.2129344
    465  -1.823473 . . .  -1.823473
    466  -3.513199 . . .  -3.513199
    467   .8716688 . . .   .8716688
    468  -4.835299 . . .  -4.835299
    469   .7619265 . . .   .7619265
    470  -1.219116 . . .  -1.219116
    471  -4.920485 . . .  -4.920485
    472 -1.1379392 . . . -1.1379392
    473  -.4942082 . . .  -.4942082
    474  .06450081 . . .  .06450081
    475 -.54365623 . . . -.54365623
    476 -1.0742679 . . . -1.0742679
    477   2.245706 . . .   2.245706
    478  -1.124578 . . .  -1.124578
    479  -8.332492 . . .  -8.332492
    480  -6.281683 . . .  -6.281683
    481  -12.63192 . . .  -12.63192
    482  3.0493126 . . .  3.0493126
    483 -.04377797 . . . -.04377797
    484 -.57220787 . . . -.57220787
    485  -5.425836 . . .  -5.425836
    486   -4.02298 . . .   -4.02298
    487   .6950481 . . .   .6950481
    488 -1.2252523 . . . -1.2252523
    489  -2.062287 . . .  -2.062287
    490 -4.6790004 . . . -4.6790004
    491 -3.4238594 . . . -3.4238594
    492  3.3442326 . . .  3.3442326
    493  -4.906322 . . .  -4.906322
    494 -3.4882426 . . . -3.4882426
    495  2.3637578 . . .  2.3637578
    496  -2.596212 . . .  -2.596212
    497  -1.723976 . . .  -1.723976
    498  .08288559 . . .  .08288559
    499  -1.614125 . . .  -1.614125
    500   .3808062 . . .   .3808062
    501  .05496636 . . .  .05496636
    502 -1.1708735 . . . -1.1708735
    503   .8644865 . . .   .8644865
    504 -2.3744347 . . . -2.3744347
    505 -1.0844857 . . . -1.0844857
    506 -1.3938882 . . . -1.3938882
    507 -2.0912716 . . . -2.0912716
    508  -.2791283 . . .  -.2791283
    509 -1.7620302 . . . -1.7620302
    510  2.3326828 . . .  2.3326828
    511   -.476401 . . .   -.476401
    512   4.274342 . . .   4.274342
    513  -3.094057 . . .  -3.094057
    514  -4.921236 . . .  -4.921236
    515   2.854689 . . .   2.854689
    516 -1.7556505 . . . -1.7556505
    517  .26775002 . . .  .26775002
    518   .9392418 . . .   .9392418
    519 -2.1816537 . . . -2.1816537
    520  -3.456911 . . .  -3.456911
    521  1.0038673 . . .  1.0038673
    522  -2.462537 . . .  -2.462537
    523  -2.274594 . . .  -2.274594
    524 -2.5359976 . . . -2.5359976
    525 -1.1176423 . . . -1.1176423
    526   .9934656 . . .   .9934656
    527   .4337447 . . .   .4337447
    528   -4.44937 . . .   -4.44937
    529   2.403954 . . .   2.403954
    530  2.5855935 . . .  2.5855935
    531   .6877573 . . .   .6877573
    532   .7853801 . . .   .7853801
    533   .9464576 . . .   .9464576
    534  1.5675107 . . .  1.5675107
    535   .3676826 . . .   .3676826
    536  .27401504 . . .  .27401504
    537   .9743056 . . .   .9743056
    538    .824662 . . .    .824662
    539  -.7150991 . . .  -.7150991
    540 -.51185876 . . . -.51185876
    541  -.0590181 . . .  -.0590181
    542  .16325147 . . .  .16325147
    543  .26473868 . . .  .26473868
    544 -.08085568 . . . -.08085568
    545 -.12526481 . . . -.12526481
    546  1.2527564 . . .  1.2527564
    547  -.3500435 . . .  -.3500435
    548  .05872368 . . .  .05872368
    549  -.4448217 . . .  -.4448217
    end
    format %tm YM


    I have tried:
    Code:
    statsby r(mean) _se, clear : summarize Hh_Lh_1 Hh_Lh_2 Hh_Lh_3 Hh_Lh_4 Hh_Lh_5
    And also where i specified each r(mean) and _se as se1=_se[Hh_Lh_1] and so on, none of which works.

    Thanks in advance, let me know if i need to clarify something.


    Dennis


  • #2
    Well, the -summarize- comand does not leave an _se[] vector behind. In fact, it doesn't calculate standard errors at all.

    In any case, I'm not sure what you're trying to do. The purpose of -statsby- is to aggregate estimation result carried in subsets of the data. But you don't specify a -by()- option in your command. Do you just want the mean and standard errors of each of these variables for the entire data set? If so, perhaps what you want is -tabstat Hh_*, statistics(mean sem)-.

    If that's not what you are looking for, please post back with a clearer explanation.

    Comment


    • #3
      Hi Clyde,

      Thank you for your answer, and -tabstat- seems to do the trick, it is because i want to calculate t-statistics.
      How can i capture the output? i tried:
      Code:
      statsby mean sem, clear: tabstat Hh_*, statistics(mean sem)
      Some background:
      My Hh_* variables are made out of 5 dummy variables (time periods). as so:
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input float(period1 period2 period3 period4 period5)
      1 . . . 1
      1 . . . 1
      1 . . . 1
      1 . . . 1
      1 . . . 1
      1 . . . 1
      1 . . . 1
      1 . . . 1
      1 . . . 1
      1 . . . 1
      1 . . . 1
      1 . . . 1
      1 . . . 1
      1 . . . 1
      1 . . . 1
      1 . . . 1
      1 . . . 1
      1 . . . 1
      1 . . . 1
      end
      Some of the periods overlap, so i cannot just have one variable taking different values for different periods.

      Whether this is the best way to go about it, i am not sure but it was the only way that i could seem to think of (i am not that well into STATA just yet).

      Basically i want to end up with the average of the Hh_* in each period, as well as the t-stat.

      Then afterwards i run a regression on each of the Hh_* as well.

      Comment


      • #4
        I'm completely confused by your description and I have no idea what you want. I don't know what you mean by "capture the output." I have a sense that you want to somehow put these results into a data file, but why do you want to have a data file for just the mean and standard errors of each variable? Perhaps you want

        Code:
        local sems
        forvalues i = 1/5 {
            local sems `sems' se_period`i' = period`i'
        }
        collapse (mean) period* (sem) `sems'
        This will create a data set containing just the means and standard errors for the five periods. But it will have only a single observation, so I don't understand how you plan to do any kind of regression with it.

        I also don't understand what the point of variables period2 through period4 is as they contain only missing values. For that matter, period1 and period5 are actually both constantly 1, so they seem to serve no purpose either.

        Comment


        • #5
          I understand the confusion, I am trying to replicate the following table, using a different dataset:
          Skærmbillede 2018-05-28 kl. 21.15.56.png
          As for the periods, i have 3.200.000 observation, so each period do have entries, which i have coded as 1 if in the period and 0 otherwise.
          Ave is the average of the Hh_* for each period, and t(Ave) is the corresponding t-statistic.

          The regression i am going to run is the following:
          Click image for larger version

Name:	Skærmbillede 2018-05-28 kl. 21.16.05.png
Views:	1
Size:	15.2 KB
ID:	1446300

          I hope i start to make some sense

          Comment


          • #6
            And the regression i have run as:
            Code:
            forvalues i = 1(1)5 {
            tempfile copy // Saving data
            save `copy'
            
            keep if period`i' == 1
            
            // Running regression Hh-Lh = a + b*MktRf + s*SMB + h*HML
            regress Hh_Lh_`i' MktRf SMB HML, vce(robust)
            statsby se_alpha=_se[_cons] se_MktRf=_se[MktRf] se_SMB=_se[SMB] se_HML=_se[HML] b_alpha=_b[_cons] b_MktRf=_b[MktRf] b_SMB=_b[SMB] b_HML=_b[HML] r2=e(r2), clear: regress Hh_Lh_`i' MktRf SMB HML, vce(robust)
            
            gen t_alpha = b_alpha / se_alpha
            gen t_MktRf = b_MktRf / se_MktRf
            gen t_SMB = b_SMB / se_SMB
            gen t_HML = b_HML / se_HML
            gen r2a  = e(r2_a)
            
            drop r2 se_alpha se_MktRf se_SMB se_HML
            export excel using "Regression", sheet("T4 Regression") sheetmodify cell(A`i')
                
            // Open dataset again
            use `copy', clear
            }

            Comment


            • #7
              OK, I think I understand what you are trying to do now. -statsby- does not work the way you are trying to use it. If you had non-overlapping periods, you could create a single period variable and then use -statsby- (although, with so many observations there are faster ways to do it). But since your periods overlap, -statsby- will not help you at all. I'm afraid you have to do this by creating a postfile and updating it in a loop over periods.

              I think this is what you want, or is closely related:
              Code:
              //    SET UP TOY DATA
              clear*
              set obs 1000
              set seed 1234
              forvalues i = 1/5 {
                  gen period`i' = (runiform() < 0.2)
                  gen HhLh_`i' = rnormal(50)
              }
              gen MktRf = rnormal(100)
              gen SMB = rnormal(100)
              gen HML = rnormal(100)
              
              //    SET UP A POSTFILE
              capture postutil clear
              tempfile results
              postfile handle int period float (mean_Hh_Lh se_Hh_Lh b_MktRf se_MktRf ///
                  b_SMB se_SMB b_HML se_HML r2a) using `results'
                  
              //    DO CALCULATIONS FOR EACH PERIOD AND STORE RESULTS IN THE POSTFILE
              forvalues i = 1/5 {
                  local topost (`i')
                  tabstat HhLh_`i' if period`i' == 1, save statistics(mean sem)
                  matrix M = r(StatTotal)
                  local topost `topost' (M[1,1]) (M[2,1])
                  regress HhLh_`i' MktRf SMB HML if period`i' == 1, vce(robust)
                  foreach v of varlist MktRf SMB HML {
                      local topost `topost' (_b[`v']) (_se[`v'])
                  }
                  local topost `topost' (e(r2a))
                  post handle `topost'
              }
              
              //    DONE WITH CALCULATIONS
              postclose handle
              
              use `results', clear
              foreach x in MktRf SMB HML {
                  gen t_`x' = b_`x'/se_`x'
                  order t_`x', after(se_`x')
              }
              // INSERT YOUR export excel COMMAND HERE



              Comment


              • #8
                Thanks for the provided code, with just minor tweaks it worked like a charm

                One thing though, i don't have any luck tweaking it to provide the _b[_cons] and _se[_cons] as well. could you maybe point me in the right direction?

                Comment


                • #9
                  Code:
                  //    SET UP A POSTFILE
                  capture postutil clear
                  tempfile results
                  postfile handle int period float (mean_Hh_Lh se_Hh_Lh b_cons se_cons b_MktRf se_MktRf ///
                      b_SMB se_SMB b_HML se_HML r2a) using `results'
                      
                  //    DO CALCULATIONS FOR EACH PERIOD AND STORE RESULTS IN THE POSTFILE
                  forvalues i = 1/5 {
                      local topost (`i')
                      tabstat HhLh_`i' if period`i' == 1, save statistics(mean sem)
                      matrix M = r(StatTotal)
                      local topost `topost' (M[1,1]) (M[2,1])
                      regress HhLh_`i' MktRf SMB HML if period`i' == 1, vce(robust)
                      local topost `topost' (_b[_cons]) (_se[_cons])
                      foreach v of varlist MktRf SMB HML {
                          local topost `topost' (_b[`v']) (_se[`v'])
                      }
                      local topost `topost' (e(r2a))
                      post handle `topost'
                  }
                  
                  //    DONE WITH CALCULATIONS
                  postclose handle
                  
                  use `results', clear
                  foreach x in cons MktRf SMB HML {
                      gen t_`x' = b_`x'/se_`x'
                      order t_`x', after(se_`x')
                  }
                  // INSERT YOUR export excel COMMAND HERE

                  Comment


                  • #10
                    Hi Clyde,

                    Thank you for your provided code, it helped me out a lot.

                    However, I thought I simply needed the mean and the std. error to find the t-statistics for the means of the periods.
                    I quess i was wrong.

                    Could you (or someone else) help me out on this subject? i need the t-statestics of the average of Hh-Lh returns, I have now tried several methods, but have gotten no satisfactory values.
                    So i have 5 averages of Hh-Lh from different periods, one period is the full period, the population, the rest are sub-periods, the samples.

                    Dennis

                    Comment


                    • #11
                      However, I thought I simply needed the mean and the std. error to find the t-statistics for the means of the periods.
                      I quess i was wrong.
                      Please elaborate. t = mean/se, so the mean and standard error are, indeed, all you need to calculate a t-statistic. So what is it you think you are missing?

                      Comment


                      • #12
                        Ahh, then my 'sanity' check was off..
                        I just saw the t-stats as being WAY off, as opposed to what range i was trying to hit.

                        My stats:
                        Code:
                         
                        Ave t(Ave)
                        -0.116 -53.901
                        0.087 28.175
                        0.092 23.992
                        -0.004 -0.794
                        -0.268 -90.441
                        From the article the range of Ave is between 0.03 and 0.19 and t(Ave) is between 0.25 and 1.63.
                        However, i use other periods, and other data, so there will always be a difference, i just didn't anticipate such a big difference. Which is the reason i thought/think that i have done something wrong.

                        But thanks for clarifying Clyde.

                        Comment

                        Working...
                        X