
  • svy: reg get standardized coefficients for both continuous and categorical predictors

    Hi Statalist,

    I came across an issue when getting standardized coefficients from svy: reg when there are both continuous and categorical predictors.

    For continuous predictors, I used to just standardize both y and x before running regress, which works. Alternatively, there is a post here showing how to do that when all predictors are continuous: https://www.statalist.org/forums/for...-weighted-data
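    For what it's worth, a minimal sketch of that all-continuous approach (not from either linked post; it uses summarize with aweights to get approximate weighted means and SDs, and the auto variables purely as an illustration):

    Code:
    sysuse auto, clear
    svyset turn [pw = price]

    * standardize the outcome and a continuous predictor with the weighted mean/SD
    foreach v of varlist mpg turn {
        quietly summarize `v' [aweight = price]
        generate z_`v' = (`v' - r(mean)) / r(sd)
    }
    svy: regress z_mpg z_turn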

    However, it becomes tricky when I have categorical variables. I am not sure how to standardize a categorical/binary variable (I tried, but the svy: reg failed).

    There is also a post showing how to get the standardized coefficient when there is *one* 0/1 binary predictor (for example, the foreign variable in the "sysuse auto" dataset): https://www.statalist.org/forums/for...-weighted-data. I also wonder whether that strategy would work when the predictor is, say, i.rep78 (a categorical variable).

    I wonder how to combine the solutions from both posts. The example regression model I am interested in is:

    Code:
    sysuse auto, clear
    svyset turn [pw = price]
    svy: reg mpg turn length weight i.foreign i.rep78
    Thanks in advance for any thoughts or suggestions!

    Yingyi


  • #2
    Hi Statalist,

    I tried to figure this out but still could not find a solution. I tried to modify the scripts from this post (from Steve Samuels: https://www.statalist.org/forums/for...eighted-data):

    Code:
     
     sysuse auto, clear
     local y mpg                      /* outcome */
     local xvars turn length weight   /* predictors */

     svyset turn [pw = price]

     /* Get coefficients */
     svy: regress `y' `xvars'
     matrix b = e(b)

     /* Get SDs of y and predictors */
     svy: mean `y'
     estat sd
     matrix sy = r(sd)

     svy: mean `xvars'
     estat sd
     matrix sx = r(sd)

     /* Compute standardized coefficients */
     mata:
         sy = st_matrix("sy")
         sx = st_matrix("sx")'
         b = st_matrix("b")
         bx = b[1, 1..(cols(b)-1)]'
         st_matrix("betas", (sx:/sy):*bx)
     end

     matrix rownames betas = `xvars'
     matrix list betas
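    (For reference, the Mata step above just applies the usual conversion for standardized coefficients, beta_j = b_j * SD(x_j) / SD(y), using the design-based SDs returned by estat sd.)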

    I added another local macro for the categorical predictors, but I started to get an error message when calculating the SDs:
    Code:
    sysuse auto, clear
    local y mpg   /* outcome */
    local xvars turn length weight /* predictors */
    local xvarscat i.foreign i.rep78 /*categorical predictors*/
    svyset turn [pw = price]
    /* Get coefficients */
     svy: regress `y' `xvars' `xvarscat'
        matrix b = e(b)
        matrix list b
        
         /* Get SDs of y and predictors */
        svy: mean `y'
        estat sd
        matrix sy = r(sd)
    
        svy: mean `xvars'
        estat sd
        matrix sx = r(sd)
        
        svy: mean `xvarscat'
        estat sd
        matrix sxcat = r(sd)
    Here is the error message:
    . svy: mean `xvarscat'
    (running mean on estimation sample)
    factor-variable and time-series operators not allowed
    r(101);
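    If svy: mean rejects the factor-variable notation like this, one possible workaround (a sketch, not from the linked posts) is to expand the categorical predictors into explicit 0/1 dummies first, so that svy: mean and estat sd will accept them:

    Code:
    sysuse auto, clear
    svyset turn [pw = price]
    * tabulate creates the 0/1 indicators rep78_1 ... rep78_5; foreign is already 0/1
    tabulate rep78, generate(rep78_)
    svy: mean foreign rep78_1-rep78_5
    estat sd
    matrix sxcat = r(sd)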



    I wonder if anyone has any insights into this question? I would very much appreciate any pointers. Thanks again!

    Yingyi

    Comment


    • #3
      I'm not sure why this example isn't working for you, but it works for me:
      Code:
      . sysuse auto, clear
      (1978 automobile data)
      
      . local xvarscat i.foreign i.rep78 /*categorical predictors*/
      
      . svyset turn [pw = price]
      
      Sampling weights: price
                   VCE: linearized
           Single unit: missing
              Strata 1: <one>
       Sampling unit 1: turn
                 FPC 1: <zero>
      
      . svy: mean `xvarscat'
      (running mean on estimation sample)
      
      Survey: Mean estimation
      
      Number of strata =  1                Number of obs   =      69
      Number of PSUs   = 18                Population size = 424,077
                                           Design df       =      17
      
      --------------------------------------------------------------
                   |             Linearized
                   |       Mean   std. err.     [95% conf. interval]
      -------------+------------------------------------------------
           foreign |
         Domestic  |   .6994107    .121862      .4423043    .9565171
          Foreign  |   .3005893    .121862      .0434829    .5576957
                   |
             rep78 |
                1  |   .0215268    .014786      -.009669    .0527225
                2  |   .1125763   .0469625      .0134941    .2116584
                3  |    .454816   .0645902      .3185425    .5910895
                4  |   .2577056   .0536099      .1445986    .3708125
                5  |   .1533754   .0775002     -.0101357    .3168866
      --------------------------------------------------------------
      David Radwin
      Senior Researcher, California Competes
      californiacompetes.org
      Pronouns: He/Him

      Comment


      • #4
        Hi David, thank you very much for your input here. I really appreciate it!

        I was looking for a way to get standardized coefficients and SEs for each predictor (both continuous and categorical) using svy: regress.

        I could not find a way to automate what I want, so I ended up standardizing every individual predictor manually. For the categorical variables, I created dummies for each level of the variable. (That said, the results are not perfect because I did not know how to standardize the intercept...)

        I really wish that svy: regress supported the ", beta" option like the regular regress command does.
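        One reading of that manual approach might look like this (a sketch; the z_ names and the choice to leave the 0/1 dummies unstandardized are assumptions, not taken from the post):

        Code:
        sysuse auto, clear
        svyset turn [pw = price]
        * standardize the outcome and continuous predictors with the weighted mean/SD
        foreach v of varlist mpg turn length weight {
            quietly summarize `v' [aweight = price]
            generate z_`v' = (`v' - r(mean)) / r(sd)
        }
        * hand-made dummies for each level of rep78 (rep78_1 serves as the base)
        tabulate rep78, generate(rep78_)
        svy: regress z_mpg z_turn z_length z_weight foreign rep78_2-rep78_5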





        Comment


        • #5
          I think I understand. If you just want the standardized coefficients and don't care about standard errors, or if you want to check your manual calculations, you could use aweights like
          Code:
          regress `y' `xvars' `xvarscat' [aweight=price] , beta
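          In context, with the locals as defined in the earlier posts, that line would run as, for example:
          Code:
          sysuse auto, clear
          local y mpg
          local xvars turn length weight
          local xvarscat i.foreign i.rep78
          regress `y' `xvars' `xvarscat' [aweight=price], beta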
          David Radwin
          Senior Researcher, California Competes
          californiacompetes.org
          Pronouns: He/Him

          Comment


          • #6
            Originally posted by David Radwin View Post
            I think I understand. If you just want the standardized coefficients and don't care about standard errors, or if you want to check your manual calculations, you could use aweights like
            Code:
            regress `y' `xvars' `xvarscat' [aweight=price] , beta
            Hi David, thank you so much for the follow-up! The [aweight = ] option seems to be the way to go to solve my puzzle. I wonder if it could also incorporate sampling strata? For instance, this is how I set up the sampling weights:
            svyset turn [pw = price], strata(rep78)

            Code:
            sysuse auto, clear
            local y mpg   /* outcome */
            local xvars  length weight /* predictors */
            local xvarscat i.foreign i.rep78 /*categorical predictors*/
            svyset turn [pw = price], strata(rep78) 
            svy: mean `xvarscat'
            regress `y' `xvars' `xvarscat' [aweight=price] , beta
            For the last line of the script, how can I incorporate the stratum variable?

            Thanks again!


            Comment


            • #7
              How to properly incorporate sampling strata is a question that is probably best answered by the creator or distributor of the dataset. I'm sorry I can't help with that topic.
              David Radwin
              Senior Researcher, California Competes
              californiacompetes.org
              Pronouns: He/Him

              Comment


              • #8
                David says his solution works so long as you don't care about the standard errors, i.e., you only want the point estimates. If you only want the point estimates, things like sampling strata do not matter (they are just used to adjust the standard errors, not the point estimates).

                Also, somebody can correct me if I'm wrong, but wouldn't the t-values for the standardized coefficients be the same as the t-values you got by using svy: reg? That is, if you add the command to your do-file

                svy: regress `y' `xvars' `xvarscat'

                Aren't the t-values it reports for each variable also the correct t-values for the model where you instead use aweights? (I don't know that for a fact, but it seems logical to me. At least it seems logical if David's statement about the use of aweights is correct -- off the top of my head, I don't know if it is or isn't.)
                -------------------------------------------
                Richard Williams, Notre Dame Dept of Sociology
                StataNow Version: 19.5 MP (2 processor)

                EMAIL: [email protected]
                WWW: https://www3.nd.edu/~rwilliam

                Comment


                • #9
                  If I understand your (Richard Williams') question correctly, I am afraid not. Moreover, the difference in t-values between the two methods is not trivial.

                  To illustrate, in the following example based on the original post, using svy: reg the t-value for length is -1.69, whereas in the same regression model using reg with aweights the t-value for length is -2.52. Setting aside for a moment the important debate over the use of null hypothesis statistical tests, under the conventional p < .05 standard and a 2-tailed test, the second result is statistically significant and the first is not.

                  Code:
                  . sysuse auto, clear
                  (1978 automobile data)
                  
                  . svyset turn [pw = price]
                  
                  Sampling weights: price
                               VCE: linearized
                       Single unit: missing
                          Strata 1: <one>
                   Sampling unit 1: turn
                             FPC 1: <zero>
                  
                  . svy: reg mpg turn length weight i.foreign i.rep78
                  (running regress on estimation sample)
                  
                  Survey: Linear regression
                  
                  Number of strata =  1                                Number of obs   =      69
                  Number of PSUs   = 18                                Population size = 424,077
                                                                       Design df       =      17
                                                                       F(8, 10)        =   24.58
                                                                       Prob > F        =  0.0000
                                                                       R-squared       =  0.7003
                  
                  ------------------------------------------------------------------------------
                               |             Linearized
                           mpg | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
                  -------------+----------------------------------------------------------------
                          turn |  -.0021137   .1683286    -0.01   0.990    -.3572561    .3530286
                        length |  -.1510464   .0894002    -1.69   0.109    -.3396644    .0375716
                        weight |  -.0021393   .0019522    -1.10   0.288    -.0062581    .0019795
                               |
                       foreign |
                      Foreign  |  -3.501613   1.348396    -2.60   0.019    -6.346481   -.6567456
                               |
                         rep78 |
                            2  |  -.3229413   1.215734    -0.27   0.794    -2.887916    2.242034
                            3  |   .2452496   .9106538     0.27   0.791    -1.676062    2.166561
                            4  |   1.687671   1.158219     1.46   0.163    -.7559587      4.1313
                            5  |   3.532869   1.864861     1.89   0.075    -.4016428    7.467381
                               |
                         _cons |   56.19769   8.184245     6.87   0.000     38.93044    73.46494
                  ------------------------------------------------------------------------------
                  
                  . reg mpg turn length weight i.foreign i.rep78 [aweight=price]
                  (sum of wgt is 424,077)
                  
                        Source |       SS           df       MS      Number of obs   =        69
                  -------------+----------------------------------   F(8, 60)        =     17.53
                         Model |  1575.88882         8  196.986103   Prob > F        =    0.0000
                      Residual |   674.31245        60  11.2385408   R-squared       =    0.7003
                  -------------+----------------------------------   Adj R-squared   =    0.6604
                         Total |  2250.20127        68  33.0911952   Root MSE        =    3.3524
                  
                  ------------------------------------------------------------------------------
                           mpg | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
                  -------------+----------------------------------------------------------------
                          turn |  -.0021137    .218006    -0.01   0.992    -.4381906    .4339632
                        length |  -.1510464   .0600039    -2.52   0.015     -.271072   -.0310208
                        weight |  -.0021393   .0016669    -1.28   0.204    -.0054736     .001195
                               |
                       foreign |
                      Foreign  |  -3.501613   1.460497    -2.40   0.020    -6.423042   -.5801845
                               |
                         rep78 |
                            2  |  -.3229413   3.019317    -0.11   0.915    -6.362475    5.716592
                            3  |   .2452496   2.842812     0.09   0.932    -5.441222    5.931721
                            4  |   1.687671   2.961134     0.57   0.571     -4.23548    7.610821
                            5  |   3.532869   3.151439     1.12   0.267    -2.770947    9.836686
                               |
                         _cons |   56.19769   8.248686     6.81   0.000     39.69786    72.69752
                  ------------------------------------------------------------------------------
                  David Radwin
                  Senior Researcher, California Competes
                  californiacompetes.org
                  Pronouns: He/Him

                  Comment


                  • #10
                    Well, now my intuition is getting confused. ;-) But remember, svy uses pweights, and David's regression example was using aweights. pweights are basically aweights plus vce(robust). Further, the beta option DOES work with pweights. So,

                    Code:
                    . svy, vce(robust): reg mpg turn length weight i.foreign i.rep78
                    (running regress on estimation sample)
                    
                    Survey: Linear regression
                    
                    Number of strata =  1                                Number of obs   =      69
                    Number of PSUs   = 18                                Population size = 424,077
                                                                         Design df       =      17
                                                                         F(8, 10)        =   24.58
                                                                         Prob > F        =  0.0000
                                                                         R-squared       =  0.7003
                    
                    ------------------------------------------------------------------------------
                                 |             Linearized
                             mpg | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
                    -------------+----------------------------------------------------------------
                            turn |  -.0021137   .1683286    -0.01   0.990    -.3572561    .3530286
                          length |  -.1510464   .0894002    -1.69   0.109    -.3396644    .0375716
                          weight |  -.0021393   .0019522    -1.10   0.288    -.0062581    .0019795
                                 |
                         foreign |
                        Foreign  |  -3.501613   1.348396    -2.60   0.019    -6.346481   -.6567456
                                 |
                           rep78 |
                              2  |  -.3229413   1.215734    -0.27   0.794    -2.887916    2.242034
                              3  |   .2452496   .9106538     0.27   0.791    -1.676062    2.166561
                              4  |   1.687671   1.158219     1.46   0.163    -.7559587      4.1313
                              5  |   3.532869   1.864861     1.89   0.075    -.4016428    7.467381
                                 |
                           _cons |   56.19769   8.184245     6.87   0.000     38.93044    73.46494
                    ------------------------------------------------------------------------------
                    
                    . reg mpg turn length weight i.foreign i.rep78 [pweight=price], beta
                    (sum of wgt is 424,077)
                    
                    Linear regression                               Number of obs     =         69
                                                                    F(8, 60)          =      20.51
                                                                    Prob > F          =     0.0000
                                                                    R-squared         =     0.7003
                                                                    Root MSE          =     3.3524
                    
                    ------------------------------------------------------------------------------
                                 |               Robust
                             mpg | Coefficient  std. err.      t    P>|t|                     Beta
                    -------------+----------------------------------------------------------------
                            turn |  -.0021137   .1988278    -0.01   0.992                -.0016806
                          length |  -.1510464   .0862039    -1.75   0.085                -.5974503
                          weight |  -.0021393   .0021876    -0.98   0.332                -.3142225
                                 |
                         foreign |
                        Foreign  |  -3.501613   1.105775    -3.17   0.002                 -.281148
                                 |
                           rep78 |
                              2  |  -.3229413   1.245539    -0.26   0.796                -.0178742
                              3  |   .2452496   .8204691     0.30   0.766                 .0213851
                              4  |   1.687671   1.190294     1.42   0.161                 .1292563
                              5  |   3.532869   2.036567     1.73   0.088                 .2229281
                                 |
                           _cons |   56.19769   8.345655     6.73   0.000                        .
                    ------------------------------------------------------------------------------
                    So, the t-values are not identical across approaches, contrary to what I thought, but they also aren't as far apart as David's example indicated.
                    -------------------------------------------
                    Richard Williams, Notre Dame Dept of Sociology
                    StataNow Version: 19.5 MP (2 processor)

                    EMAIL: [email protected]
                    WWW: https://www3.nd.edu/~rwilliam

                    Comment


                    • #11
                      the t-values are not identical across approaches
                      My guess is that with svy, not only are the weights used, but the standard errors are also calculated taking into account clustering on the declared variable. In this case, we are comparing two regressions that differ because in the first there is clustering with respect to the variable "turn", while in the second there is none. Is my conjecture correct? If so, are there solutions to this issue? I see that clustering of errors may not be combined with the "beta" option. If I understand correctly, this is because looking for beta coefficients when the observations are not i.i.d. is questionable: https://stackoverflow.com/questions/...d-coefficients
                      If, however, my coefficient has a clear interpretation (a difference between the means of two groups), I find it quite natural to want a standardized mean difference/effect size.
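                      To check that conjecture, a possible sketch (not from the thread): with this svyset, the svy run should correspond to a pweighted regression that clusters on the PSU variable turn, so the following should give similar standard errors (they differ by small-sample adjustments, and the p-values also differ because regress uses N - k residual degrees of freedom while svy uses the design degrees of freedom):

                      Code:
                      sysuse auto, clear
                      svyset turn [pw = price]
                      svy: regress mpg turn length weight i.foreign i.rep78
                      * same model with pweights and explicit clustering on the PSU variable
                      regress mpg turn length weight i.foreign i.rep78 [pweight = price], vce(cluster turn)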

                      Comment


                      • #12
                        Originally posted by Federico Tedeschi View Post
                        If however my coefficient has a clear interpretation (a difference between the means of two groups), I find it quite natural to be willing to find a standardized mean difference/effect size.
                        On second thought, the solution in this case is easier, since it is enough to properly standardize the outcome in the sample used for the regression. In this case, for example, if we were interested in the difference between foreign and domestic cars:
                        Code:
                        center mpg [pw=price] if foreign!=., standardize
                        svy: reg c_mpg i.foreign
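                        (center is presumably the user-written center module by Ben Jann, available from SSC via ssc install center; its standardize option rescales the variable to mean 0 and SD 1.)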

                        Comment
