
  • Weighted Variance Decomposition

    I want to decompose the variance of a variable into within- and between-group components (over time). I also want to apply sampling weights to this decomposition. Is there a command which efficiently provides this weighted decomposition? Of course, I can calculate it step by step, but since my data set is very large, looping over all groups, for example, would take quite a lot of time. I was therefore hoping that there might be a Stata-provided solution which is more efficient than this manual implementation.
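
    To make the goal concrete, here is a sketch of the kind of step-by-step (unweighted, single-level) calculation I would like to avoid at scale; variable names are placeholders:

    Code:
    * manual within/between split (population variances); y and group are placeholders
    egen grp_mean  = mean(y), by(group)        // group means, E(Y|group)
    gen  sqdev_w   = (y - grp_mean)^2
    egen v_within  = mean(sqdev_w)             // within component, E[Var(Y|group)]
    egen all_mean  = mean(y)
    gen  sqdev_b   = (grp_mean - all_mean)^2
    egen v_between = mean(sqdev_b)             // between component, Var(E(Y|group))
    * v_within + v_between equals the (population) variance of y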

  • #2
    Hi Alexander,
    You might find the community-contributed commands domin and domme by Joseph N. Luchman of interest.
    Code:
    ssc install domin , replace
    which domin , all
    *! domin version 3.6.0  5/5/2025 Joseph N. Luchman
    h domin    // Dominance analysis
    For documentation and examples, consult his GitHub page.
    Further reading is listed under 7. References in the help file, and in this Stata Journal paper: Luchman, J. N. (2021). Determining relative importance in Stata using dominance analysis: domin and domme. The Stata Journal, 21(2), 510–538. https://doi.org/10.1177/1536867X211025837

    http://publicationslist.org/eric.melse



    • #3
      I am not really sure what the OP is interested in. Would xtreg work for you? It allows a decomposition into within and between components, and it allows sampling weights (as long as the weights are constant within a panel); a minimal sketch follows after the code below. Can you give us some more details? As an alternative, what about this:
      Code:
      mixed outcome || idcode: || timevar:
      estat icc
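
      For the xtreg route, a minimal sketch (a constant-only fixed-effects model; the reported sigma_u, sigma_e, and rho give the between/within split, and a sampling weight could be added via [pw=...] as long as it is constant within idcode):
      Code:
      webuse nlswork, clear
      xtset idcode year
      xtreg ln_wage, fe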
      Last edited by Felix Bittmann; 09 Sep 2025, 07:03.
      Best wishes

      Stata 18.0 MP | ORCID | Google Scholar



      • #4
        Thank you for the advice! I ended up writing my own little program building on wmean (from the SSC package _gwmean, written by Gueorgui I. Kolev). It might be useful to some. It is not thoroughly tested, so I apologize for any errors in advance, but it may be a starting point for others with the same problem!

        Code:
        * first, define weighted variance function 
        cap prog drop wvar // weighted variance 
        program define wvar, eclass
            syntax varlist(max=1),                             /// 
                    GENerate(string)                         /// name for new variable
                    [by(string)]                             /// optional: calculate decomposition within another group 
                    [weight(string)]                        /// optional: weight
                    [SMALLsample]                            // optional: add small sample correction, same as egen sd()
        
            gettoken variable 0 : varlist 
            
            * get system locals 
            if "`by'"!="" local by_option "by(`by')"
            if "`by'"!="" local by_option2 "bys `by': "    
            if "`weight'"!="" local base_weight "weights(`weight')"
            cap drop help_wvar
            
            * variance 
            quietly egen help_wvar = wmean(`variable') , `by_option' `base_weight' 
            quietly replace help_wvar = (help_wvar - `variable')^2    
            quietly egen `generate' = wmean(help_wvar) , `by_option' `base_weight' 
            if "`smallsample'"!="" quietly `by_option2' replace `generate' = `generate' * (_N/(_N-1))
            
            drop help_wvar 
            
        end 
        
        * defining variance decomposition with multiple levels 
        
        cap prog drop var_decomp 
        program define var_decomp, eclass
        
            syntax varlist(max=1) ,                         /// 
                    GENerate(string)                         /// stub for new variables containing components
                    level(string)                            /// decomposition variables in ascending order (starting with lowest within-level), assumes that levels are nesting each other 
                    [by(string)]                             /// optional: calculate decomposition within other group(s)
                    [weight(string)]                        ///
                    [SMALLsample]                            // optional: add small sample correction, same as egen sd()
        
            gettoken variable 0 : varlist 
            cap drop decomp* 
            cap drop `generate'*
            
            * get system locals 
            local num_level : list sizeof local(level)
            forvalues i=1/`num_level'{
                local level`i' : word `i' of `level'
            } 
            if "`weight'"=="" gen decomp_weight = 1 
            if "`weight'"!="" gen decomp_weight = `weight'    
                
            * overall variance 
            if "`by'"!="" local by0 "by(`by')"
            wvar `variable' , `by0' gen(`generate'0) weight(decomp_weight) `smallsample'
            
            * within variances 
            forvalues i=1/`num_level'{
                local j = `i'-1
                if `i'==1 gen decomp_mean1 = `variable' // no mean on first level 
                if `i'!=1 egen decomp_mean`i' = wmean(`variable') , by(`by' `level`j'') weights(decomp_weight) // means at the next-lower level `j'
                wvar decomp_mean`i' , by(`by' `level`i'') gen(decomp_var`i') weight(decomp_weight) `smallsample' 
                egen `generate'`i' = wmean(decomp_var`i') , `by0' weights(decomp_weight) // aggregate variances
            }
            
            * final between variance
            local i = `num_level' + 1
            local j = `i'-1
            egen decomp_mean`i' = wmean(`variable') , by(`by' `level`j'') weights(decomp_weight) 
            wvar decomp_mean`i' , `by0' gen(`generate'`i') weight(decomp_weight) `smallsample' // between variance at the highest level
        
            cap drop decomp* 
        
        end
        
        * example code
        var_decomp wage , by(year) level(establishment industry) gen(var) weight(weight)
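
        A quick (equally untested) sanity check: with constant weights and the smallsample option, wvar should reproduce the square of egen sd(), for example on the auto toy data:

        Code:
        * requires wmean from the SSC package _gwmean (ssc install _gwmean)
        sysuse auto, clear
        gen one = 1
        wvar price , gen(v_price) weight(one) smallsample
        egen sd_price = sd(price)
        di v_price[1], sd_price[1]^2   // both numbers should (nearly) coincide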



        • #5
          Perhaps some points of clarification: wvar simply computes a weighted variance, while var_decomp applies a variance decomposition building on wvar (and therefore allows weights). var_decomp explicitly allows for a nested variance decomposition (e.g., you observe workers within establishments that are nested within industries, and you want to decompose the variance in wages by establishment and industry within each year).
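
          Since var_decomp assumes that the levels nest, it may be worth checking that assumption before running it; a sketch using the hypothetical variable names from the example in #4:

          Code:
          * check that each establishment belongs to exactly one industry
          bysort establishment (industry): assert industry[1] == industry[_N]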



          • #6
            Thanks a lot for sharing your work. I think this is quite interesting and could be relevant for many people. However, I am currently not sure how to interpret the results. I have the following example:

            Code:
            . webuse nhanes2, clear
            
            . sum bpsystol
            
                Variable |        Obs        Mean    Std. dev.       Min        Max
            -------------+---------------------------------------------------------
                bpsystol |     10,351    130.8817    23.33265         65        300
            
            . var_decomp bpsystol, level(strata region) gen(test)
            
            . sum test*
            
                Variable |        Obs        Mean    Std. dev.       Min        Max
            -------------+---------------------------------------------------------
                   test0 |     10,351      544.36           0     544.36     544.36
                   test1 |     10,351    533.3657           0   533.3657   533.3657
                   test2 |     10,351    10.85856           0   10.85856   10.85856
                   test3 |     10,351      .13573           0     .13573     .13573
            Is test0 the overall variance, test1 the level-1 (strata) variance, test2 the level-2 (region) variance, and finally test3 the between variance? If so, how would you interpret the numbers? Would it be possible to have some kind of standardization so that the results are scale-invariant (similar to R²)? Could you say that level 1 accounts for 97.98% of the total variance?
            Best wishes

            Stata 18.0 MP | ORCID | Google Scholar



            • #7
              I just stumbled upon this post. I am wondering whether xtsum already does what you want:

              Code:
              use https://www.stata-press.com/data/r18/nlswork.dta
              xtset idcode year
              xtsum race age
              
              Variable         |      Mean   Std. dev.       Min        Max |    Observations
              -----------------+--------------------------------------------+----------------
              race     overall |  1.303392   .4822773          1          3 |     N =   28534
                       between |             .4862111          1          3 |     n =    4711
                       within  |                    0   1.303392   1.303392 | T-bar = 6.05689
                               |                                            |
              age      overall |  29.04511   6.700584         14         46 |     N =   28510
                       between |             5.485756         14         45 |     n =    4710
                       within  |              5.16945   14.79511   43.79511 | T-bar = 6.05308
              Cheers,
              Felix
              Stata Version: MP 18.0
              OS: Windows 11



              • #8
                Felix B:

                Yes, this is the idea. Basically, you are just applying the law of total variance:

                Var(Y) = E[Var(Y|X)] + Var(E[Y|X])

                You then introduce additional group variables which nest X, so that you can decompose the second variance term according to the same law:

                Var(Y) = E[Var(Y|X)] + [ E[Var(E[Y|X]|Z)] + Var(E[E[Y|X]|Z]) ]

                If X is nested within Z, the final term simplifies to Var(E[Y|Z]). You can continue decomposing the last between-variance term with further nesting categorical variables as many times as you like; with a third level W nesting Z, for example, you get Var(Y) = E[Var(Y|X)] + E[Var(E[Y|X]|Z)] + E[Var(E[Y|Z]|W)] + Var(E[Y|W]).

                In the program:
                > var0 is the total variance, Var(Y)
                > var1 is the average within-variance at the first "level", E[Var(Y|X)]
                > var2 is the average within-variance at the second "level", E[Var(E[Y|X]|Z)]
                > var3 is the between-variance at the second "level", Var(E[Y|Z])
                >> var1 to var3 (or more, if there are more levels) should sum to var0.
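
                Using the numbers from the nhanes2 example in #6, this identity checks out up to rounding:

                Code:
                . di 533.3657 + 10.85856 + .13573
                544.35999
                which matches test0 = 544.36.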

                Interpretation: imagine you look at individual wages with level(firm state) [assuming no firms span state borders]:
                > var1/var0 is the share of variation in wages explained by within-firm variation
                > var2/var0 is the share of variation in wages explained by between-firm variation within a given state
                > var3/var0 is the share of variation in wages explained by between-state variation
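
                So, regarding the standardization question: after running var_decomp as in #6, scale-invariant shares are simply the component variances divided by the total variance, e.g.:

                Code:
                forvalues i = 1/3 {
                    gen share`i' = test`i' / test0   // fraction of total variance per component
                }
                sum share*
                With the numbers from #6, share1 is about 0.98, so yes, the first level accounts for roughly 98% of the total variance.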

                Again, not thoroughly tested for all eventualities, but perhaps useful for others who find this thread.

                Felix K:

                I am not sure I can follow; I believe this variance decomposition is not what I had in mind (see above). I understand that I did not express myself clearly in my initial posting; I hope the explanation above helps!

