Dear Statalist,
I have the following setup: two independent samples, one baseline (census) and one survey sample.
The baseline (census data) comes in aggregated form, so I aggregated the survey data in the same way.
A mock example of the data structure is given below.
For this mock data, I can estimate the distance between the two distributions on each of two indicators.
I can then average these distances.
Code:
** if not already installed, install reldist and other required packages
/*
ssc install reldist, replace // thanks to Ben Jann for providing these!!!
ssc install moremata, replace
ssc install kmatch, replace
ssc install kdens, replace
*/

** mock census and survey data in aggregate form, differing on two categorical variables (gender, agegroup)
sysuse pop2000, clear // mock census data
keep agegrp maletotal femtotal
expand 2, gen(male)
gen total = maletotal if male == 1
replace total = femtotal if male == 0
drop maletotal femtotal
expand 2, gen(sample) // mock survey data
set seed 1234
replace total = runiformint(200,400) if sample == 1 & male == 1
replace total = runiformint(250,450) if sample == 1 & male == 0

** comparison of chi2 distance between census baseline (sample==0) and survey (sample==1) for two categorical variables (agegrp; male)
reldist divergence agegrp [fweight=total], by(sample) categorical chi2
local est1 = e(b)[1,1]
reldist divergence male [fweight=total], by(sample) categorical chi2
local est2 = e(b)[1,1]

** average distance of sample 1 and sample 0 on both indicators
di (`est1' + `est2')/2
This gives me two chi2 distance estimates:
- estimate 1: 0.23 (se: 0.013), the distance between sample 0 and sample 1 on the agegroup indicator.
- estimate 2: 0.01 (se: 0.0019), the distance between sample 0 and sample 1 on the gender indicator.
If I understand this correctly, these distances are additive.
Hence I can interpret the average distance of 0.12 ((`est1' + `est2')/2) as a multivariate distance between sample 0 and sample 1.
But how can I get a confidence interval around this average of 0.12?
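To make the question concrete, here is a naive sketch of the arithmetic using the rounded estimates and standard errors listed above. It assumes the two estimates are uncorrelated and that a normal approximation is reasonable. Both are assumptions on my part, and the first is doubtful because both distances are computed from the same two samples. The scalar names are purely illustrative.

```stata
* Naive sketch: normal-approximation CI for the average of two estimates.
* ASSUMPTION: the two estimates are uncorrelated (Cov = 0). This is not
* something reldist reports, and it may not hold for the same samples.
scalar d_est1 = 0.23      // chi2 distance on agegrp (rounded, from above)
scalar d_se1  = 0.013
scalar d_est2 = 0.01      // chi2 distance on male (rounded, from above)
scalar d_se2  = 0.0019

* Var((a+b)/2) = (Var(a) + Var(b))/4 when Cov(a,b) = 0
scalar d_avg   = (d_est1 + d_est2)/2
scalar d_seavg = sqrt(d_se1^2 + d_se2^2)/2
scalar d_z     = invnormal(0.975)

display "average distance = " %6.4f (d_avg)
display "naive se         = " %6.4f (d_seavg)
display "naive 95% CI     = [" %6.4f (d_avg - d_z*d_seavg) ", " %6.4f (d_avg + d_z*d_seavg) "]"
```

If the two estimates are positively correlated, this interval would be too narrow, which is exactly why I am unsure whether the approach is defensible.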
Any insight would be much appreciated! I need this confidence interval because in my real data I have one baseline and three survey samples. I compare each of the three survey samples to the baseline and want to be able to say which survey sample is closest to or farthest from the baseline, and whether the differences in distance between the three samples are statistically significant. Hence, confidence intervals for the average distance would be key.
Thanks so much for any ideas.