
  • Weighted Variance Decomposition

    I want to decompose the variance of a variable into within- and between-group components (over time). I also want to apply sampling weights to this decomposition. Is there a command which efficiently provides this weighted decomposition? Of course, I can calculate it step by step, but since my data set is very large, looping over all groups, for example, would take quite a lot of time. I was therefore hoping that there might be a Stata-provided solution which is more efficient than this manual implementation.
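
    To make the goal concrete, here is a sketch of the kind of step-by-step (unweighted, single-level) calculation I would like to avoid at scale; variable names are placeholders:

    Code:
    * manual within/between split (population variances); y and group are placeholders
    egen grp_mean  = mean(y), by(group)        // group means, E(Y|group)
    gen  sqdev_w   = (y - grp_mean)^2
    egen v_within  = mean(sqdev_w)             // within component, E[Var(Y|group)]
    egen all_mean  = mean(y)
    gen  sqdev_b   = (grp_mean - all_mean)^2
    egen v_between = mean(sqdev_b)             // between component, Var(E(Y|group))
    * v_within + v_between equals the (population) variance of y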

  • #2
    Hi Alexander,
    You might find the community-contributed commands domin and domme by Joseph N. Luchman of interest.
    Code:
    ssc install domin , replace
    which domin , all
    *! domin version 3.6.0  5/5/2025 Joseph N. Luchman
    h domin    // Dominance analysis
    For documentation and examples, consult his GitHub page.
    Further reading is listed under 7. References in the help file, and in this Stata Journal paper: Luchman, J. N. (2021). Determining relative importance in Stata using dominance analysis: domin and domme. The Stata Journal, 21(2), 510–538. https://doi.org/10.1177/1536867X211025837

    http://publicationslist.org/eric.melse



    • #3
      I am not really sure what the OP is interested in. Would xtreg work for you? It allows a decomposition into within and between components, and it allows sampling weights (as long as the weights are constant within a panel); a minimal sketch follows after the code below. Can you give us some more details? As an alternative, what about this:
      Code:
      mixed outcome || idcode: || timevar:
      estat icc
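
      For the xtreg route, a minimal sketch (a constant-only fixed-effects model; the reported sigma_u, sigma_e, and rho give the between/within split, and a sampling weight could be added via [pw=...] as long as it is constant within idcode):
      Code:
      webuse nlswork, clear
      xtset idcode year
      xtreg ln_wage, fe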
      Last edited by Felix Bittmann; 09 Sep 2025, 07:03.
      Best wishes

      Stata 18.0 MP | ORCID | Google Scholar



      • #4
        Thank you for the advice! I ended up writing my own little program building on wmean (from the SSC package _gwmean, written by Gueorgui I. Kolev). It might be useful to some. It is not thoroughly tested, so I apologize for any errors in advance, but it may be a starting point for others with the same problem!

        Code:
        * first, define weighted variance function 
        cap prog drop wvar // weighted variance 
        program define wvar, eclass
            syntax varlist(max=1),                             /// 
                    GENerate(string)                         /// name for new variable
                    [by(string)]                             /// optional: calculate decomposition within another group 
                    [weight(string)]                        /// optional: weight
                    [SMALLsample]                            // optional: add small sample correction, same as egen sd()
        
            gettoken variable 0 : varlist 
            
            * get system locals 
            if "`by'"!="" local by_option "by(`by')"
            if "`by'"!="" local by_option2 "bys `by': "    
            if "`weight'"!="" local base_weight "weights(`weight')"
            cap drop help_wvar
            
            * variance 
            quietly egen help_wvar = wmean(`variable') , `by_option' `base_weight' 
            quietly replace help_wvar = (help_wvar - `variable')^2    
            quietly egen `generate' = wmean(help_wvar) , `by_option' `base_weight' 
            if "`smallsample'"!="" quietly `by_option2' replace `generate' = `generate' * (_N/(_N-1))
            
            drop help_wvar 
            
        end 
        
        * defining variance decomposition with multiple levels 
        
        cap prog drop var_decomp 
        program define var_decomp, eclass
        
            syntax varlist(max=1) ,                         /// 
                    GENerate(string)                         /// stub for new variables containing components
                    level(string)                            /// decomposition variables in ascending order (starting with lowest within-level), assumes that levels are nesting each other 
                    [by(string)]                             /// optional: calculate decomposition within other group(s)
                    [weight(string)]                        ///
                    [SMALLsample]                            // optional: add small sample correction, same as egen sd()
        
            gettoken variable 0 : varlist 
            cap drop decomp* 
            cap drop `generate'*
            
            * get system locals 
            local num_level : list sizeof local(level)
            forvalues i=1/`num_level'{
                local level`i' : word `i' of `level'
            } 
            if "`weight'"=="" gen decomp_weight = 1 
            if "`weight'"!="" gen decomp_weight = `weight'    
                
            * overall variance 
            if "`by'"!="" local by0 "by(`by')"
            wvar `variable' , `by0' gen(`generate'0) weight(decomp_weight) `smallsample'
            
            * within variances 
            forvalues i=1/`num_level'{
                local j = `i'-1
                if `i'==1 gen decomp_mean1 = `variable' // no mean on first level 
                if `i'!=1 egen decomp_mean`i' = wmean(`variable') , by(`by' `level`j'') weights(decomp_weight) // means at the next-lower level `j'
                wvar decomp_mean`i' , by(`by' `level`i'') gen(decomp_var`i') weight(decomp_weight) `smallsample' 
                egen `generate'`i' = wmean(decomp_var`i') , `by0' weights(decomp_weight) // aggregate variances
            }
            
            * final between variance
            local i = `num_level' + 1
            local j = `i'-1
            egen decomp_mean`i' = wmean(`variable') , by(`by' `level`j'') weights(decomp_weight) 
            wvar decomp_mean`i' , `by0' gen(`generate'`i') weight(decomp_weight) `smallsample' // between variance at the highest level
        
            cap drop decomp* 
        
        end
        
        * example code
        var_decomp wage , by(year) level(establishment industry) gen(var) weight(weight)
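
        A quick (equally untested) sanity check: with constant weights and the smallsample option, wvar should reproduce the square of egen sd(), for example on the auto toy data:

        Code:
        * requires wmean from the SSC package _gwmean (ssc install _gwmean)
        sysuse auto, clear
        gen one = 1
        wvar price , gen(v_price) weight(one) smallsample
        egen sd_price = sd(price)
        di v_price[1], sd_price[1]^2   // both numbers should (nearly) coincide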



        • #5
          Perhaps some points of clarification: wvar simply computes a weighted variance, while var_decomp applies a variance decomposition building on wvar (and therefore allows weights). var_decomp explicitly allows for a nested variance decomposition (e.g., you observe workers within establishments that are nested within industries, and you want to decompose the variance in wages by establishment and industry within each year).
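
          Since var_decomp assumes that the levels nest, it may be worth checking that assumption before running it; a sketch using the hypothetical variable names from the example in #4:

          Code:
          * check that each establishment belongs to exactly one industry
          bysort establishment (industry): assert industry[1] == industry[_N]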



          • #6
            Thanks a lot for sharing your work. I think this is quite interesting and could be relevant for many people. However, I am currently not sure how to interpret the results. I have the following example:

            Code:
            . webuse nhanes2, clear
            
            . sum bpsystol
            
                Variable |        Obs        Mean    Std. dev.       Min        Max
            -------------+---------------------------------------------------------
                bpsystol |     10,351    130.8817    23.33265         65        300
            
            . var_decomp bpsystol, level(strata region) gen(test)
            
            . sum test*
            
                Variable |        Obs        Mean    Std. dev.       Min        Max
            -------------+---------------------------------------------------------
                   test0 |     10,351      544.36           0     544.36     544.36
                   test1 |     10,351    533.3657           0   533.3657   533.3657
                   test2 |     10,351    10.85856           0   10.85856   10.85856
                   test3 |     10,351      .13573           0     .13573     .13573
            Is test0 the overall variance, test1 the level-1 (strata) variance, test2 the level-2 (region) variance, and finally test3 the between variance? If so, how would you interpret the numbers? Would it be possible to have some kind of standardization so that the results are scale-invariant (similar to R²)? Could you say that level 1 accounts for 97.98% of the total variance?
            Best wishes

            Stata 18.0 MP | ORCID | Google Scholar



            • #7
              I just stumbled upon this post. I am wondering whether xtsum already does what you want:

              Code:
              use https://www.stata-press.com/data/r18/nlswork.dta
              xtset idcode year
              xtsum race age
              
              Variable         |      Mean   Std. dev.       Min        Max |    Observations
              -----------------+--------------------------------------------+----------------
              race     overall |  1.303392   .4822773          1          3 |     N =   28534
                       between |             .4862111          1          3 |     n =    4711
                       within  |                    0   1.303392   1.303392 | T-bar = 6.05689
                               |                                            |
              age      overall |  29.04511   6.700584         14         46 |     N =   28510
                       between |             5.485756         14         45 |     n =    4710
                       within  |              5.16945   14.79511   43.79511 | T-bar = 6.05308
              Cheers,
              Felix
              Stata Version: MP 18.0
              OS: Windows 11



              • #8
                Felix B:

                Yes, this is the idea. Basically, you are just applying the law of total variance:

                Var(Y) = E[Var(Y|X)] + Var(E[Y|X])

                You then introduce additional group variables which nest X, so that you can decompose the second variance term according to the same law:

                Var(Y) = E[Var(Y|X)] + [ E[Var(E[Y|X]|Z)] + Var(E[E[Y|X]|Z]) ]

                If X is nested within Z, the final term simplifies to Var(E[Y|Z]). You can continue decomposing the last between-variance term with further nesting categorical variables as many times as you like; with a third level W nesting Z, for example, you get Var(Y) = E[Var(Y|X)] + E[Var(E[Y|X]|Z)] + E[Var(E[Y|Z]|W)] + Var(E[Y|W]).

                In the program:
                > var0 is the total variance, Var(Y)
                > var1 is the average within-variance at the first "level", E[Var(Y|X)]
                > var2 is the average within-variance at the second "level", E[Var(E[Y|X]|Z)]
                > var3 is the between-variance at the second "level", Var(E[Y|Z])
                >> var1 to var3 (or more, if there are more levels) should sum to var0.
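
                Using the numbers from the nhanes2 example in #6, this identity checks out up to rounding:

                Code:
                . di 533.3657 + 10.85856 + .13573
                544.35999
                which matches test0 = 544.36.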

                Interpretation: imagine you look at individual wages with level(firm state) [assuming no firms span state borders]:
                > var1/var0 is the share of variation in wages explained by within-firm variation
                > var2/var0 is the share of variation in wages explained by between-firm variation within a given state
                > var3/var0 is the share of variation in wages explained by between-state variation
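
                So, regarding the standardization question: after running var_decomp as in #6, scale-invariant shares are simply the component variances divided by the total variance, e.g.:

                Code:
                forvalues i = 1/3 {
                    gen share`i' = test`i' / test0   // fraction of total variance per component
                }
                sum share*
                With the numbers from #6, share1 is about 0.98, so yes, the first level accounts for roughly 98% of the total variance.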

                Again, not thoroughly tested for all eventualities, but perhaps useful for others who find this thread.

                Felix K:

                I am not sure I can follow; I believe this variance decomposition is not what I had in mind (see above). I understand that I did not express myself clearly in my initial posting; I hope the explanation above helps!

