Testing differences in means

Carlos Noyola

Join Date: Feb 2019
Posts: 13

04 Dec 2019, 11:49

I am trying to replicate some of the results in (Krueger, 1999) to get better at econometrics. Treatment means being assigned to a small class; not being treated means being assigned to a regular class. There is, however, a third type of class, regular size with a teaching assistant, and I am not sure whether I should count it as untreated.

To study the impact of class size on learning outcomes, Krueger first shows the sample means of several variables: free lunch (a measure of socioeconomic status), White/Asian, age in 1985, attrition rate, class size, and percentile score. He presents the sample mean of each variable broken down by class type, followed by a joint p-value. For instance, one row gives the percentage of students receiving free lunch in small classes, then in regular classes, and then in regular classes with a teaching assistant. The joint p-value tests whether the share of students receiving free lunch differs across the three class types. My problem is: where does that p-value come from?

I realized that, since we want to see the effect of the treatment, I have to create a variable for it, regress that variable on the variables I am interested in, and then test the hypothesis that, jointly, free lunch does not differ among small, regular, and regular-with-TA classes. But I do not know how to finish it. My main confusion is that class type is recorded in a single variable, cltypek, which takes 3 different values. Should I interact each variable with cltypek? Is this a t test or an F test? Can I obtain the p-values for all 6 rows with just one regression, or do I have to run one for every variable? The treatment variable I created is treat_k, and here is what I did:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(treat_k freelunch_k white_or_asian age1985 attrition_k avg_perc_k csk) byte cltypek
. . 0 6 .         . 5273 .
1 0 1 5 0   57.8278   15 1
1 0 0 6 0   80.9352   17 1
. . 1 6 .         . 5273 .
0 1 0 5 1  48.92318   22 3
. . 1 6 .         . 5273 .
. . 0 6 .         . 5273 .
. . 1 6 .         . 5273 .
. . 1 6 .         . 5273 .
. . 1 6 .         . 5273 .
0 0 1 5 0  82.97334   22 2
1 1 1 5 0 65.025406   15 1
0 1 1 5 0  38.99766   24 2
0 1 0 5 1  6.663901   25 3
. . 1 5 .         . 5273 .
. . 1 5 .         . 5273 .
. . 0 5 .         . 5273 .
. . 0 6 .         . 5273 .
. . 0 5 .         . 5273 .
. . 0 5 .         . 5273 .
0 0 1 5 0  71.85199   21 2
. . 1 6 .         . 5273 .
1 0 0 5 0  54.26017   14 1
. . 1 6 .         . 5273 .
1 0 1 5 0  81.48058   14 1
. . 1 6 .         . 5273 .
. . 1 5 .         . 5273 .
0 0 1 5 0   46.6673   17 2
0 0 1 5 0  49.86032   23 3
0 1 0 5 1         .   24 2
. . 1 6 .         . 5273 .
1 0 1 5 0  92.33137   14 1
1 0 1 5 0  51.51086   17 1
0 0 1 6 0  89.32625   22 3
0 0 1 5 1  58.13467   21 2
0 1 1 6 1 35.491154   23 3
0 1 0 5 1 71.690445   23 2
. . 1 6 .         . 5273 .
1 0 1 5 0  43.22603   17 1
0 1 1 6 0  66.37458   22 2
. . 1 6 .         . 5273 .
0 0 1 5 1         .   20 2
. . 1 6 .         . 5273 .
0 0 1 5 1  38.99766   25 2
. . 0 5 .         . 5273 .
. . 0 . .         . 5273 .
0 0 1 5 1  60.37717   24 3
1 1 1 5 0  89.28534   14 1
. . 1 6 .         . 5273 .
. . 1 5 .         . 5273 .
0 1 0 5 1 33.207203   19 3
0 1 0 5 0  90.76884   25 3
0 0 0 5 1  60.87349   24 3
0 0 1 6 0  95.92397   22 2
0 0 1 5 0 27.713375   19 3
0 0 1 5 1  52.00714   23 2
. . 0 5 .         . 5273 .
. . 0 5 .         . 5273 .
. . 1 6 .         . 5273 .
0 1 0 5 1  20.12069   26 2
0 0 1 5 1  57.15902   24 3
. . 1 7 .         . 5273 .
. . 0 6 .         . 5273 .
. . 1 6 .         . 5273 .
1 0 1 5 0  22.93918   16 1
0 1 0 5 1  24.96125   24 2
. . 0 5 .         . 5273 .
1 1 0 5 0 19.405693   16 1
. . 0 5 .         . 5273 .
. . 1 5 .         . 5273 .
0 1 0 6 1  40.08408   27 3
. . 1 6 .         . 5273 .
. . 0 5 .         . 5273 .
. . 0 5 .         . 5273 .
. . 0 6 .         . 5273 .
0 0 1 5 1  64.27664   24 2
1 1 1 5 0  87.63383   14 1
1 0 1 5 0  50.46646   13 1
. . 1 6 .         . 5273 .
1 0 1 5 1  66.91418   13 1
1 0 1 6 0   94.0528   13 1
. . 0 6 .         . 5273 .
. . 1 5 .         . 5273 .
0 0 1 5 0  71.02356   21 2
. . 1 6 .         . 5273 .
. . 1 5 .         . 5273 .
0 1 0 6 0 20.411865   23 3
1 0 1 5 1  88.51144   14 1
. . 1 6 .         . 5273 .
. . 1 6 .         . 5273 .
0 0 1 5 1  54.41739   22 3
. . 0 7 .         . 5273 .
1 0 1 5 0  92.31004   14 1
. . 1 6 .         . 5273 .
. . 0 6 .         . 5273 .
. . 0 5 .         . 5273 .
1 1 1 6 1 15.655228   14 1
0 0 0 5 0  92.45034   27 2
0 0 1 5 1  50.44516   22 3
1 0 1 5 0 69.302605   15 1
end
label values cltypek cltypek
label def cltypek 1 "small class", modify
label def cltypek 2 "regular class", modify
label def cltypek 3 "regular + aide class", modify

Code:

. gen treat_k=.
(11,598 missing values generated)

. replace treat_k=0 if cltypek==2 | cltypek==3
(4,425 real changes made)

. replace treat_k=1 if cltypek==1
(1,900 real changes made)

. *Regress
. reg treat_k freelunch_k white_or_asian age1985 attrition_k avg_perc_k csk

      Source |       SS           df       MS      Number of obs   =     5,853
-------------+----------------------------------   F(6, 5846)      =   1027.79
       Model |  630.994112         6  105.165685   Prob > F        =    0.0000
    Residual |  598.175886     5,846  .102322252   R-squared       =    0.5133
-------------+----------------------------------   Adj R-squared   =    0.5129
       Total |     1229.17     5,852   .21004272   Root MSE        =    .31988

------------------------------------------------------------------------------
     treat_k |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 freelunch_k |   .0342283   .0096581     3.54   0.000     .0152948    .0531617
white_or_a~n |  -.0261966   .0100654    -2.60   0.009    -.0459285   -.0064647
     age1985 |   .0030928   .0094153     0.33   0.743    -.0153648    .0215503
 attrition_k |   .0230936   .0088482     2.61   0.009     .0057479    .0404393
  avg_perc_k |   .0006209   .0001673     3.71   0.000     .0002928    .0009489
         csk |  -.0678052   .0008694   -77.99   0.000    -.0695094   -.0661009
       _cons |   1.643794   .0539098    30.49   0.000     1.538111    1.749477
------------------------------------------------------------------------------

.

Hopefully you can help me understand.

Tags: joint pvalue, Krueger, paper, regression, replicate

Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#2

04 Dec 2019, 14:08

Perhaps Krueger 1999 is folklore among your circle. But this is a multi-disciplinary, international forum. I have never heard of it, and I suspect the majority of Forum members haven't either. References should always be provided as a citation complete enough to enable the reader to actually find the article should they need to check it. Or better still, if possible, post a URL for an ungated web site containing it.

Your approach does not do what is intended. The purpose of this part of the analysis is to show the extent of similarity or difference among the three treatment groups. p-values alone are not an adequate approach (and arguably they aren't even useful in this context). You need to show the actual average values of each of these variables in each group. For the categorical ones, n and % are simplest, and for the continuous ones typically mean and standard deviation (or median and interquartile range) is shown. Each variable requires its own separate analysis. If you wish to include p-values, you can get them from chi square tests (for the discrete ones) and from the overall F-tests of regression analyses (for the continuous ones). The following code is illustrative:

Code:

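* Categorical variables: column percentages and a chi-square test by class type
foreach v of varlist freelunch_k white_or_asian attrition_k {
    tab `v' cltypek, col chi2
}

* Continuous variables: N, mean, and SD by class type, then the overall F-test
foreach v of varlist age1985 avg_perc_k {
    tabstat `v', by(cltypek) statistics(N mean sd)
    regress `v' i.cltypek
}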
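
As a sketch of where a joint p-value of this kind can come from for a single variable: the overall F-test reported by regress, with i.cltypek as the only predictor, tests that all three group means are equal. The variable names are the ones from the -dataex- example; that this is the same test Krueger reports is an assumption, not something checked against the paper.

```stata
* Joint test of equal means of age1985 across the three class types.
regress age1985 i.cltypek

* Wald test that both class-type indicators are zero; with default
* (non-robust) standard errors this reproduces the overall F-test.
testparm i.cltypek

* The overall F-test's p-value, recomputed from the stored results.
display "joint p-value = " Ftail(e(df_m), e(df_r), e(F))

* One-way ANOVA yields the same F statistic and p-value.
oneway age1985 cltypek
```

Running the same regression with a 0/1 variable such as freelunch_k on the left-hand side gives an F-test of equal proportions, which could serve as an alternative to the chi-square test.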

I have skipped the variable csk because I do not know what to make of it. It has a very bizarre distribution and, also not knowing what it is supposed to represent, I have no idea how best to show its descriptive statistics.

I think it is better, at least at this point in the analysis, to keep all three levels of class type distinct. If the differences between regular class and regular class + teaching assistant are seen to be minimal, then you might consider combining those two into a single "untreated" group. But I would be cautious about that, as there may be differences on variables not measured that are nevertheless relevant. If your ultimate analyses of the effect of class type on whatever your outcome of interest is show that the two "untreated" groups have pretty much the same outcomes, then the case for combining the two into a single untreated group becomes a bit stronger.
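
One way to check whether the two regular-size groups differ on a given variable is to test the equality of their coefficients after the same kind of regression. This is only a sketch: it assumes the three-level coding of cltypek shown in the data example (1 small, 2 regular, 3 regular + aide), and avg_perc_k is used purely as an example.

```stata
* Small class (cltypek == 1) is the omitted base category.
regress avg_perc_k i.cltypek

* Test that regular and regular + aide classes have the same mean.
test 2.cltypek = 3.cltypek

* Estimated difference (regular + aide minus regular) with a confidence interval.
lincom 3.cltypek - 2.cltypek
```

A large p-value here is not proof that the two groups are interchangeable, so the estimated difference and its confidence interval are worth reading alongside the test.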
Carlos Noyola

Join Date: Feb 2019

Posts: 13
#3

05 Dec 2019, 12:03

Thank you very much, and sorry for not providing the link. Probably it was even unnecessary to bring it up. Now I understand why the way I was thinking about it is wrong.