Summary Statistics for the Sample Used in Regression

Attaullah Shah

#1

10 Mar 2015, 23:40

I have a data set with 303,706 observations; however, my regression model uses only 177,714 observations (because some of the variables have missing values). Is there a way to get summary statistics for only the observations that are used in the regression, i.e. only those 177,714 observations? Thanks in advance.

Regards
--------------------------------------------------
Attaullah Shah, PhD.
Professor of Finance, Institute of Management Sciences Peshawar, Pakistan
FinTechProfessor.com
https://asdocx.com
Check out my asdoc program, which sends outputs to MS Word.
For more flexibility, consider using asdocx which can send Stata outputs to MS Word, Excel, LaTeX, or HTML.

Carlo Lazzaro

10 Mar 2015, 23:45

Attaullah may want to consider the -e(sample)- option:

Code:

. sysuse auto, clear
(1978 Automobile Data)

. sum price mpg rep78

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       price |        74    6165.257    2949.496       3291      15906
         mpg |        74     21.2973    5.785503         12         41
       rep78 |        69    3.405797    .9899323          1          5

. reg price mpg rep78

      Source |       SS       df       MS              Number of obs =      69
-------------+------------------------------           F(  2,    66) =   11.06
       Model |   144754063     2  72377031.7           Prob > F      =  0.0001
    Residual |   432042896    66  6546104.48           R-squared     =  0.2510
-------------+------------------------------           Adj R-squared =  0.2283
       Total |   576796959    68  8482308.22           Root MSE      =  2558.5

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |  -271.6425   57.77115    -4.70   0.000    -386.9864   -156.2987
       rep78 |   666.9568   342.3559     1.95   0.056     -16.5789    1350.492
       _cons |   9657.754    1346.54     7.17   0.000       6969.3    12346.21
------------------------------------------------------------------------------

. sum price mpg rep78 if e(sample)

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       price |        69    6146.043     2912.44       3291      15906
         mpg |        69    21.28986    5.866408         12         41
       rep78 |        69    3.405797    .9899323          1          5

Kind regards,
Carlo
(Stata 19.0)

Richard Williams

10 Mar 2015, 23:52

Or better yet, estat sum, which I just recently discovered.

Code:

. sysuse auto, clear
(1978 Automobile Data)

. reg price mpg rep78

      Source |       SS       df       MS              Number of obs =      69
-------------+------------------------------           F(  2,    66) =   11.06
       Model |   144754063     2  72377031.7           Prob > F      =  0.0001
    Residual |   432042896    66  6546104.48           R-squared     =  0.2510
-------------+------------------------------           Adj R-squared =  0.2283
       Total |   576796959    68  8482308.22           Root MSE      =  2558.5

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |  -271.6425   57.77115    -4.70   0.000    -386.9864   -156.2987
       rep78 |   666.9568   342.3559     1.95   0.056     -16.5789    1350.492
       _cons |   9657.754    1346.54     7.17   0.000       6969.3    12346.21
------------------------------------------------------------------------------

. estat sum

  Estimation sample regress              Number of obs =     69

  -------------------------------------------------------------
      Variable |        Mean     Std. Dev.       Min        Max
  -------------+-----------------------------------------------
         price |    6146.043      2912.44       3291      15906
           mpg |    21.28986     5.866408         12         41
         rep78 |    3.405797     .9899323          1          5
  -------------------------------------------------------------

The e(sample) qualifier is useful in many other cases, e.g. when you want to do things besides summary statistics.
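
As a minimal sketch of such other uses, again on the auto data: any command that accepts an if qualifier can be restricted to the estimation sample of the most recent estimation command. The particular commands below are only illustrative choices.

Code:

* Fit the model; e(sample) now marks the 69 observations that were used
sysuse auto, clear
regress price mpg rep78

* Tabulate another variable over the estimation sample only
tabulate foreign if e(sample)

* Count how many observations were left out of the regression
count if !e(sample)

Note that e(sample) always refers to the latest estimation results, so it changes as soon as another model is fitted.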

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
EMAIL: [email protected]
WWW: https://www3.nd.edu/~rwilliam

Attaullah Shah

#4

11 Mar 2015, 00:12

Wow, that was a wonderful solution. Thank you, Carlo Lazzaro.

Regards

Carlo Lazzaro

#5

11 Mar 2015, 00:45

Thanks Richard,
I wasn't aware of -estat sum-.

Kind regards,
Carlo

Nick Cox

#6

11 Mar 2015, 04:36

Strictly, e(sample) is not an option. It's a function. It works even if no estimation results are in memory, but not usefully...
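
A small sketch of that behaviour, assuming the auto data; ereturn clear is used here only to make sure no estimation results are in memory. With nothing estimated, e(sample) evaluates to 0 for every observation, so the if qualifier selects nothing.

Code:

sysuse auto, clear

* Remove any estimation results left in memory
ereturn clear

* No error, but e(sample) is 0 everywhere, so no observations are counted
count if e(sample)

* After a regression, the same command counts the estimation sample
regress price mpg rep78
count if e(sample)

The second count should match the Number of obs reported by regress, which is 69 in the output shown earlier in the thread.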

Carlo Lazzaro

#7

11 Mar 2015, 07:52

Nick,
sorry for my earlier mistake, and thanks for the clarification.

Kind regards,
Carlo

Richard Williams

#8

11 Mar 2015, 08:56

Vince Wiggins explained e(sample) way back when.

http://www.stata.com/statalist/archi.../msg00635.html

Key phrase: "-e(sample)- is nothing more than a cleverly hidden variable exposed through the -e(sample)- function"

If you want to make sure you are working with the same cases throughout an analysis, you may want to run an analysis that includes all variables of interest and then generate a new variable equal to e(sample). This would be useful if, say, you want cases dropped that are missing on any of the variables of interest, not just the subset of variables used in a particular analysis.
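
A minimal sketch of that approach on the auto data. The variable name insample and the choice of headroom as an extra variable of interest are assumptions made for illustration.

Code:

* Fit one model that includes every variable of interest
sysuse auto, clear
regress price mpg rep78 headroom

* Copy the estimation sample into an ordinary 0/1 variable (hypothetical name)
generate byte insample = e(sample)

* Later analyses can be restricted to exactly the same cases,
* even though each one uses only a subset of the variables
regress price mpg if insample
summarize price mpg rep78 headroom if insample

Unlike e(sample) itself, the saved variable is not overwritten when another model is fitted, so the same cases stay marked throughout the analysis.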
