Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Test Post

    There are several ways of doing "t-tests" and tests for differences in proportions using survey data. The t-test is a test of difference in means between two subpopulations
    That the PSUs in 2009 and 2011 were drawn independently does not affect the validity of the test in any way.



    For the t-test see: http://www.ats.ucla.edu/stat/stata/faq/svyttest.htm

    which you could have found by typing in Stata:

    "search survey ttest" or "search survey t-test"

    For comparing two proportions, there are parallel commands:

    Code:
    sysuse auto, clear
    gen mkr = substr(make,1,2)
    svyset mkr [pw = turn]
    
    gen hiprice = price>8000
    
    svy: prop hiprice, over(foreign)
    test _b[Foreign]=_b[Domestic]
    lincom _b[Foreign]-_b[Domestic]
    
    /* A slightly different approach */
    svy: tab foreign hiprice, row se
    lincom _b[p12]-_b[p22]
    
    /* My favorite */
    svy: reg hiprice foreign
    Last edited by Steve Samuels; 15 Nov 2014, 21:03.
    Steve Samuels
    Statistical Consulting
    [email protected]

    Stata 14.2

  • #2


    For "t-tests" of survey data (comparison of two means) see:http://http://www.ats.ucla.edu/stat/...q/svyttest.htm

    which you could have found by typing in Stata either of
    Code:
    search survey ttest"
    Code:
    search survey t-test"
    .

    So the answer to your question:
    Also, I have read that there is no svy:ttest option, but that I can use svy: regress or svy: mean to run a survey-adjusted t test. Is that correct?
    is "Yes".

    Proportions are means of 0-1 variables. Therefore the procedures for comparing means will also compare proportions. However confidence intervals for the proportions themselves may extend below zero or above one.Stata will do all adjustments correctly, provided your svyset statement is correct. At a minimum you will need to create "super-strata". See, e.g. http://www.stata.com/statalist/archi.../msg00048.html. If you have a question about svyset, ask in a separate post.

    There are survey commands designed to specifically estimate proportions. Howeve only svy: tabulate is guaranteed to have CIs for the proportions within the [0,1] intervals. This is because the command estimates CIs on the logistic scale, then converts back to the probability scale.

    Some examples for proportions:
    Code:
    sysuse auto, clear
    gen mkr = substr(make,1,2)
    svyset mkr [pw = turn]
    gen hiprice = price>8000  
    
    svy: prop hiprice, over(foreign) coeflegend
    lincom _b[_prop_2:Foreign]-_b[_prop_2:Domestic]
    
    svy: reg hiprice ibn.foreign,nocons
    svy: reg hiprice foreign
    
    /* recommended */
    svy: tab foreign hiprice, row se ci
    lincom _b[p12]-_b[p22]
    Last edited by Steve Samuels; 16 Nov 2014, 07:20.
    Steve Samuels
    Statistical Consulting
    [email protected]

    Stata 14.2

    Comment


    • #3

      The reason that the upper confidence limits are >1 is that the standard error for the risk difference is based on RD\( \pm 1.96 \thinspace\)SE. Given that the RDs are so close to 1, the upper limit is greater than 1. The same thing can happen with proportions that are close to zero or one. You will have to truncate values upper limits >1 to 1 by hand when you present your results.
      Last edited by Steve Samuels; 22 Nov 2014, 15:25.
      Steve Samuels
      Statistical Consulting
      [email protected]

      Stata 14.2

      Comment


      • #4


        A diagram of a three-source capture-recapture data setup is shown Hook and Regal (1997), page 3.


        [1] E. B. Hook and R. R. Regal. Validity of methods for model selection, weighting for model uncertainty, and small sample adjustment in capture-recapture estimation. American Journal of Epidemiology, 145(12):1138–1144, 1997.

        Linked to here http://aje.oxfordjournals.org/conten.../1138.full.pdf


        N_hat, N_lb, and N_ub are the estimated total, and the upper and lower confidence limits

        x_hat is the estimate of the missing cell predicted by the model..

        The total except for the missing cell is known; call it N_obs. The estimated total is formed by adding the estimate of the missing cell:

        N_hat = N_obs + x_hat

        The models being compared are fit by poisson, with Model000 the simplest, having main effects of the three categories + a constant term. The models , increasing in complexity (single two-way interactions, two two-way interactions) to Model123, which contains the three possible pairs of two-way interactions. This is the saturated model;it has seven parameters, one for each observed cell.


        Gp2 is a fit test described in the article referenced in the help; I don't have the article so I can't describe the test in detail. The \(\texttt{DoF}\) shown in the output is the number of parameters in the model minus 1 (for the constant). AIC and BIC are well-known criteria that can be used to compare models. See p. 157 of the Stata "Base Reference" manual, or just type "help BIC" to get the link.. Smaller values indicate a better fit. The AIC and BIC values shown are actually the difference between the single model AICs (BICx) for that model (not shown in the output) subtracted from those of the saturated model, which serves as a baseline.
        Reference

        Code:
        . clear
        
        . input freq a b c
         . 5       1       1       1
         . 15      1       0       1
         . 3       0       1       1
         . 20      0       0       1
         . 7       1       1       0
         . 37      1       0       0
         . 14      0       1       0
         . .       0       0       0
         . end
        
        . list
        
             +------------------+
             | freq   a   b   c |
             |------------------|
          1. |    5   1   1   1 |
          2. |   15   1   0   1 |
          3. |    3   0   1   1 |
          4. |   20   0   0   1 |
          5. |    7   1   1   0 |
             |------------------|
          6. |   37   1   0   0 |
          7. |   14   0   1   0 |
          8. |    .   0   0   0 |
             +------------------+
        
        
        . . recap freq a b c
        ........
          +-------------------------------------------------------------------------------+
          |    Model   DoF    Gp2   p_value     AIC     BIC   x_hat   N_hat   N_lb   N_ub |
          |-------------------------------------------------------------------------------|
          | Model000     3   2.04       .56   -3.96    -3.8      51     152    127    193 |
          | Model100     4   2.03       .36   -1.97   -1.86      50     151    124    201 |
          | Model020     4   1.12       .57   -2.88   -2.77      67     168    129    249 |
          | Model003     4   2.03       .36   -1.97   -1.86      50     151    126    196 |
          | Model120     5    .72        .4   -1.28   -1.23      93     194    127    529 |
          | Model103     5   2.02       .16     .02     .07      49     150    122    209 |
          | Model023     5   1.01       .31    -.99    -.93      73     174    128    308 |
          | Model123     6      0         1       0       0     164     265    126   1258 |
          +-------------------------------------------------------------------------------+
        Note that the best fitting model, in terms of smallest AIC and BIC is Model000, the simplest model, so that you would choose x_hat = 152 and N_hat = 127 as your best estimate of the population total.
        Last edited by Steve Samuels; 04 Dec 2014, 21:21.
        Steve Samuels
        Statistical Consulting
        [email protected]

        Stata 14.2

        Comment


        • #5
          Welcome to Statalist, Dominique

          When you bootstrap, you implictly reject the assumptions of ordinary least squares, especially the assumption that the "error terms" have a constant standard deviation \(\sigma^2_e\).

          The formula for the standard error of a forecast (new observation) depends on that assumption. As a result, Stata (and you) can't use it.


          Linear model: Y(x) is the outcome with covariates \(\mu(x)\) the mean, and \(e\) is the random part, called the "error" or deviation from the mean

          \[
          y(x) = \mu(x) + e
          \]

          Regression estimates provide an estimate of the mean, which is the predicted value: \(\widehat{\mu}\).

          If a regression is run on a set of data. Let \(\widehat{\mu}(x)\) be the predicted value at covariates x. Suppose you now want the standard error for an observation \(y^*\) with the same values of x. You don't know the true mean, but you can estimate it with \(\widehat{\mu}(x)\) , so that we can write:

          \[
          y^*(x) =
          \widehat{\mu}(x) + e^*
          \]

          Here \(e^*\) is the new random part of \(y^*$. Since \(e^*\) is a new observation, independent of those that went into the calculation of \(\widehat{\mu}(x)\), the variance of the new observation is a sum of the variances of the two parts above:

          \[
          \text{Var}(y^*(x)) = \text{Var}(\widehat{\mu}(x)) + \text{Var}(e^*)
          \]

          It's quite possible in some instances that \(\text{Var}(e^*)\) will be a function of \(\widehat{\mu}(x)\)

          The bootstrap provides an estimated variance for \(\widehat{\mu}(x)$, but not for \(e^*\). The modern approach to finding what is known as "prediction
          Steve Samuels
          Statistical Consulting
          [email protected]

          Stata 14.2

          Comment


          • #6
            Welcome to Statalist, Dominique!

            Bottom line: To compute a standard error for the forecast (a new observation \(y^*)\), you need an estimate of the variance for the error or deviation term. OLS assumes that that variance is either a constant \(\sigma\) or a known function of a constant. The bootstrap makes no assumption about the variance of the error terms. As a consequence it provides no information for estimating the

            Longer:

            If you do a regression, the prediction for a new individual * with covariates \(x\) is the estimated mean \(\widehat{\mu}(x)\).

            However the actual value for a new individual (*) with covariate values $x$, will be the true mean + a deviation or error term:

            \[
            y^* = \mu(x) + e^*
            \]

            Here \(\mu(x)\) is the mean of \(y\) evaluated at the covariates x, and \(e^*\) is the new random part of \(y^*\). The only assumption about \(e^*\) is that it has expectation zero.

            We don't know \(\mu(x)\), so we have to substitute the estimated mean from the regression:

            \[
            y^*(x) =
            \widehat{\mu}(x) + e^*
            \]

            The variance of the new observation is a sum of the variances of the two parts on the left-hand side.

            \[
            \text{var}(y^*(x)) = \text{var}(\widehat{\mu}(x)) + \text{var}(e^*)
            \]

            Both OLS and the bootstrap estimate the first term. But OLS estimates the second only under the assumption of constant error variance. Bootstrapping the regression coefficients alone provides no information about the error variance.


            The modern approach to getting at what is called "prediction error" is cross-validation or a bootstrap analysis, e.g. Efron and Tibshirani, 1997.


            Reference
            Efron, Bradley, and Robert Tibshirani. 1997. Improvements on cross-validation: the 632+ bootstrap method. Journal of the American Statistical Association 92, no. 438: 548-560.
            Steve Samuels
            Statistical Consulting
            [email protected]

            Stata 14.2

            Comment


            • #7
              I've read the Hautch preprint and have two observations: 1. The durations they describe are the time intervals \(\tau\) between events, not the times of events themselves; 2. They discretize the data; and 3) the model they describe cannot be fit by stcox.
              Steve Samuels
              Statistical Consulting
              [email protected]

              Stata 14.2

              Comment


              • #8

                I think there is a misunderstanding here related to terminology and to theory.

                In your earlier post( http://www.statalist.org/forums/foru...ulation-option)

                you asked how to compute "the standard deviation for mean". The standard deviation of the mean is what is universally called the standard error. However it was clear that what you were requesting the the standard deviation for the individual values of log lead exposure (your example) in a subpopulation. Call those values \(X\).

                Let the population mean for the \(X\)s is
                \[
                \overline{X} =
                \sum X_i /N
                \]

                The standard deviation of a measurement \(X\) in a finite population is:
                \[
                S_x = \sqrt{\sum (X_i - \overline{X})^2
                /N}\]
                This is a fixed attribute of the population; it describes variation of \(X\) within the population.

                The standard deviation for the sample mean, on the other hand, represents how variable the estimated mean is from sample to sample. It will depend on the sample design and sample size. To avoid confusion with the population standard deviation, it is referred to as the standard error.

                Now to the current question.

                Now you are asking for the "standard deviation" of "count data" in age categories. Fror "count data" there are two related population parameters for a category \(j\): one is the number of people in the category \(N_j\) and the proportion in the category \(P_j=N_j/N \), where \(N\) is the total count in the population. The estimates for these are given by svy: tabulate. You obviously are interested in the "count", as you use that option.

                The wording of your question implies (in analogy without previous post) that you are really interested in a population standard deviation associated with count. Is this so? If this is the case, then it is easy to derive.

                The population count for category j is \(T_j\), defined as:
                \[
                T_j =
                \sum_{i=1}^N Y_i^j
                \]
                where
                \[
                Y_i^j =
                \begin{cases}
                0 & \text{(element \(i\) is in category \(j\)} \\
                1 & \text{element \(i\) is not in category \(j\)}
                \end{cases}
                \]

                The \(Y_i\) is the individual "count" variable. The population standard deviation of the \(Y_i^j\) is:
                \[
                S_j = P_j=N_j \times (1 - P_j=N_j/)
                \]
                This is the same as the true SD for a theoretical binomial random variable with probability of success \( P_j\). In Stata , you can compute these values in many ways. Here is one:

                Code:
                webuse nhanes2, clear
                svy , subpop(if region==2): prop agegrp, coeflegend
                mata:
                m = st_matrix("e(b)")'
                sd = diagonal(sqrt(diag(m)-diag(m*m')))
                sd
                end
                Last edited by Steve Samuels; 18 Jan 2015, 14:25.
                Steve Samuels
                Statistical Consulting
                [email protected]

                Stata 14.2

                Comment


                • #9
                  You can do this directly.

                  For a multiple regression with outcome \(y\) and predictors \(x_k, \thinspace k = 1 \dots P\), suppose \(b_k\) is the regression coefficient associated with \(x_k\). Then the standardized regression coefficient \(b_k^'\) is:

                  \[
                  b_k^{'} =b_k \frac{s_{x_k}}{s_y}
                  \]

                  where the \(s_*\) indicates the standard deviation of the variable (*). This is the form used in a handout by Richard Williams: https://www3.nd.edu/~rwilliam/stats1/x92.pdf ).

                  Richard gets coefficients for survey regression by transforming variables in a 2007 Statalist post :http://http://www.stata.com/statalis.../msg00075.html

                  Here's an approach that does it by transforming the coefficients. It requires that you estimate the SDs yourself:

                  Code:
                  sysuse auto, clear
                  local y mpg   /* outcome */
                  local xvars turn length weight /* predictors */
                  
                  svyset turn [pw = price]
                  /* Get coefficients */
                  
                   svy: regress `y' `xvars'
                      matrix b = e(b)
                  qui {
                       /* Get SDs of y and predictors */
                      svy: mean `y'
                      estat sd
                      matrix sy = r(sd)
                  
                      svy: mean `xvars'
                      estat sd
                      matrix sx = r(sd)
                  
                      /*Compute standardized coefficients */
                      mata:
                       sy = st_matrix("sy")
                       sx = st_matrix("sx")'
                       b = st_matrix("b")
                       bx = b[1 ,1..(cols(b)-1)]'
                       st_matrix("betas",(sx:/sy):*bx)
                      end
                  } // end quiet block
                  
                  /* Display Results */
                  matrix rownames betas = `xvars'
                  matrix list betas
                  
                  betas[3,1]
                                  c1
                    turn   .09803076
                  length  -.45034061
                  weight  -.45578916
                  Last edited by Steve Samuels; 27 Jan 2015, 11:15.
                  Steve Samuels
                  Statistical Consulting
                  [email protected]

                  Stata 14.2

                  Comment


                  • #10
                    Use \(\LaTeX\)
                    Last edited by Steve Samuels; 21 Feb 2015, 17:03.
                    Steve Samuels
                    Statistical Consulting
                    [email protected]

                    Stata 14.2

                    Comment


                    • #11
                      I received the following private query, edited:

                      Dear Sir:

                      I’ve got your e-mail address from Statalist. May I ask you one question. I found your explanation about sampling weights somewhere at Statalist (it was as follows:

                      Each person in a selected household shares the selection probability of the household and gets the same sampling weight.

                      I have sampling weights at household level, and I’ve to report the results at individual (specifically per adult equivalent expenditure per annum). I’ve found some suggestions to multiply sampling weights with household sizes to analyze the household expenditure data at individual level. May I know your above suggestion is in line with other suggestions.

                      I have confused that whether I should directly use sampling weights given (believe to be household weights) with svy command or I should use individual weights / population weights.

                      It's an interesting question, and I replied that I would answer if it were asked on Statalist, I've decided not to wait. (The post referred to was at: http://www.stata.com/statalist/archi.../msg01516.html). The bottom line answer is that creating if one creates such a new "individual" weight, one can indeed estimate averages per-person, but doing so requires creation of two new variables and three extra statements. The simplest approach is to use svy: ratio to estimate the per-person average. The following code illustrates the two approaches. For simplicity, assume every person is an adult

                      Code:
                      clear
                      sysuse auto, clear
                      
                      /*  Set up household data with "turn" as the income variable*/
                      gen hhwt = trunk
                      gen hhid = substr(make,1,2)
                      egen hhinc = total(turn), by(hhid)  /* household adult income */
                      egen hhsize = count(turn), by(hhid)  /* number of . adults */
                      
                      /* svyset */
                      svyset hhid [pweight= hhwt]
                      
                      
                      /* Estimate average income per person */
                      svy: ratio av_adult_inc: hhinc/hhsize
                      
                      /* Four Statements if you want to revise weight */
                      gen av_adult_inc = hhinc/hhsize
                      gen new_wt = hhwt*hhsize
                      svyset hhid [pweight= new_wt]
                      svy: mean  av_indiv_inc
                      Both svy: ratio and svy: mean give the same estimate and standard error. I quote only svy: ratio

                      Code:
                      Number of strata =       1          Number of obs    =      74
                      Number of PSUs   =      23          Population size  =    1018
                                                          Design df        =      22
                      
                       av_adult_inc: hhinc/hhsize
                      
                      --------------------------------------------------------------
                                   |             Linearized
                                   |      Ratio   Std. Err.     [95% Conf. Interval]
                      -------------+------------------------------------------------
                      av_adult_inc |   40.77771    .659689       39.4096    42.14582
                      --------------------------------------------------------------
                      So the analysis which creates a new weight variable is not only unnecessary but undesirable in two ways: 1) it requires the creation of two extra variables and one extra svyset statement; 2) the analyst will have to explain that the new weight is not a real per-person weight. (If the the study did have incomes for individuals, each would get the household weight, as I said in the earlier post.)


                      Last edited by Steve Samuels; 17 Mar 2015, 10:41.
                      Steve Samuels
                      Statistical Consulting
                      [email protected]

                      Stata 14.2

                      Comment


                      • #12
                        I've now read Athey and Imbens (2006), Here's the proof that plain DID (no reweighting, no predictive margins) estimates ATET. See also Imbens and Woodridge, 2008.

                        Definitions: \(i\) indexes individuals; Group \(G_i = 0,1\); time \(T_i =0,1\). \(Y_i(0)\) is the, possibly counterfactual, response of person \(i\) in the absence of intervention. \(Y_i(1)\) is the, possibly counterfactual, response if person \(i\) got the intervention.

                        The standard (linear) DID model is:

                        \[
                        E(Y_i) = \alpha +\beta\, T_i + \gamma\, G_i + \tau \,G_i T_i \quad \quad \quad \quad (1)
                        \]
                        where \(\tau\) is the DID, to be estimated by least squares regression.

                        In the (possibly counterfactual) absence of intervention, the expected outcome is:
                        \[
                        E(Y_i(0)) = \alpha +\beta\, T_i + \gamma\, G_i \quad \quad \quad \quad (2)
                        \]
                        In the (possibly counterfactual) presence of intervention, the expected outcome is:
                        \[
                        E(Y_i(1) = \text{E}(Y_i(0)) + \tau \quad \quad \quad \quad (3)
                        \]
                        Now, ATET is the expected difference in \(Y_i(0) -Y_i(1)\) for those treated by time 1, i.e. with \(G = 1\) and \(T = 1\). Plugging these values into Equations 2 & 3 we get:
                        \[

                        \text{ATET} = E(Y_i(1) -Y_i(0)\, | \,G=1,T =1) = \\
                        \\
                        E(Y_i(1)| \,G =1, T=1) -E(Y_i(0)\, | \,G=1, T =1) = \tau
                        \]

                        References:

                        Athey, Susan, and Guido W Imbens. 2006. Identification and inference in nonlinear difference in differences models. Econometrica 74, no. 2: 431-497.

                        Imbens, Guido M, and Jeffrey M Wooldridge. 2008. Recent developments in the econometrics of program evaluation.
                        Last edited by Steve Samuels; 25 Mar 2015, 12:16.
                        Steve Samuels
                        Statistical Consulting
                        [email protected]

                        Stata 14.2

                        Comment


                        • #13
                          I've now read Athey and Imbens (2006), Here's the proof that plain DID (no reweighting, no predictive margins) estimates ATET. See also Imbens and Woodridge, 2008.


                          Attached Files
                          Steve Samuels
                          Statistical Consulting
                          [email protected]

                          Stata 14.2

                          Comment


                          • #14
                            Treatment effects are in the units of the dependent variables, as the definition in the TE Intro chapter of the manual shows (p. 192). Therefore your statement that
                            Just to elaborate ATT value for unit cost is 0.164 which implies it is 16.4 % increase
                            is clearly wrong. The effect is on the scale of whatever currency units your data were in.

                            Your statement about the your debt ratio variable is also ambiguous. If it entered as percents \(Y_1 -Y_0\) is should be read of as an effect in "percentage points". A statement that the effect is a "-2.28%" difference would commonly be understood as a relative effect, e.g.

                            \[
                            \frac{E(Y_1)-E(Y_0)}{E(Y_0)} = -0.0228.
                            \]
                            Last edited by Steve Samuels; 12 Apr 2015, 18:42.
                            Steve Samuels
                            Statistical Consulting
                            [email protected]

                            Stata 14.2

                            Comment


                            • #15
                              \begin{eqnarray}
                              a & = & b \\
                              c & = & d
                              \end{eqnarray}
                              Steve Samuels
                              Statistical Consulting
                              [email protected]

                              Stata 14.2

                              Comment

                              Working...
                              X