
  • Using reldist for distance/divergence of two distributions x two indicators -- how to aggregate?

    Dear Statalist,

    I have the following setup: Two independent samples, one baseline (census) and one survey sample.

    The baseline (census data) comes in aggregated form. I therefore aggregated the survey data likewise.

    A mock example of the data structure is given below.
    For this mock data I can estimate the distance between the two distributions on each of two indicators.
    I can then average these two distances.

    Code:
    ** if not already installed, install reldist and other required packages
    /* ssc install reldist, replace // thanks to Ben Jann for providing these!!!
    ssc install moremata, replace
    ssc install kmatch, replace
    ssc install kdens, replace */
    
    ** mock census and survey data in aggregate form, differing on two categorical variables (gender, agegroup)
    sysuse pop2000 , clear // mock census data
    keep agegrp maletotal femtotal
    expand 2, gen(male)
    gen total = maletotal if male == 1
    replace total = femtotal if male == 0
    drop maletotal femtotal
    
    expand 2, gen(sample) // mock survey data
    set seed 1234
    replace total = runiformint(200,400) if sample == 1 & male == 1
    replace total = runiformint(250,450) if sample == 1 & male == 0
     
    ** chi2 distance between census baseline (sample==0) and survey (sample==1) for two categorical variables (agegrp; male)
    reldist divergence agegrp [fweight=total], by(sample) categorical chi2
    local est1 = e(b)[1,1]
    reldist divergence male [fweight=total], by(sample) categorical chi2
    local est2 = e(b)[1,1]
    
    ** average distance between sample 1 and sample 0 across both indicators
    di (`est1' + `est2')/2
    This gives me two chi2 distance estimates:
    - estimate 1: 0.23 (se: 0.013), the distance between sample 0 and sample 1 on the age-group indicator.
    - estimate 2: 0.01 (se: 0.0019), the distance between sample 0 and sample 1 on the gender indicator.

    If I understand correctly, these distances are additive.
    Hence I can interpret the average distance of 0.12 ((`est1' + `est2')/2) as a multivariate distance between sample 0 and sample 1.
    But how can I get a confidence interval around this average of 0.12?

    Any insight would be much appreciated! I need this confidence interval because my real data have one baseline and three survey samples. I compare each survey sample to the baseline and want to say which sample is closest to or farthest from it, and whether the distances of the three samples differ significantly from one another. Hence, confidence intervals for the average distance would be key.

    Thanks so much for any ideas.
    Last edited by Max Hartz; 14 May 2025, 02:06.

  • #2
    After researching a bit (see e.g. https://stats.stackexchange.com/ques...dard-deviation), I wonder whether

    Code:
    sqrt((`se1'^2 + `se2'^2)/4)
    would be appropriate, i.e., the square root of the sum of the squared standard errors divided by 2^2. I hope this is correctly deduced...
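
    The formula above can be sketched in Stata as follows, continuing from the mock data in the first post. This is only a sketch under stated assumptions: that reldist posts a variance matrix in e(V) (the standard errors reported above suggest it does), that matrix subscripting such as e(V)[1,1] is available (Stata 16 or newer), and that a normal approximation is acceptable. The local names se1, se2, avg, seavg and z are illustrative. Note that the general variance of the average is (Var(est1) + Var(est2) + 2*Cov(est1, est2))/4; the formula above sets the covariance to zero, which may not hold here because both estimates are computed from the same two samples.

```stata
* Sketch only: run after the mock data from the first post are in memory.
* Assumes e(V) is posted by reldist and Stata 16+ matrix subscripting.
reldist divergence agegrp [fweight=total], by(sample) categorical chi2
local est1 = e(b)[1,1]
local se1  = sqrt(e(V)[1,1])      // standard error of estimate 1

reldist divergence male [fweight=total], by(sample) categorical chi2
local est2 = e(b)[1,1]
local se2  = sqrt(e(V)[1,1])      // standard error of estimate 2

* Average distance and its standard error.
* ASSUMPTION: Cov(est1, est2) = 0, i.e. the covariance term is ignored.
local avg   = (`est1' + `est2')/2
local seavg = sqrt((`se1'^2 + `se2'^2)/4)

* Normal-approximation 95% confidence interval
local z = invnormal(0.975)
di "average distance = " `avg' "   se = " `seavg'
di "95% CI: [" (`avg' - `z'*`seavg') ", " (`avg' + `z'*`seavg') "]"
```

    If the two estimates are positively correlated, this standard error will be too small, so the interval should be read as approximate. A symmetric interval can also extend below zero when the average is close to zero, although a divergence cannot be negative.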
