Confidence interval of (Y/X), where Y and X are from different datasets?

Tin-chi Lin

Join Date: Jul 2014

Posts: 9
#1

29 Jul 2014, 14:36

Dear Statalisters,

I know this is not a question about Stata. But much of my statistical knowledge was learned here, and I was hoping this question and the discussion that might follow can contribute to the growth of the forum.

In short, I wonder whether it is possible (and how to do it) to compute the confidence intervals of Y/X, where
(i) Y is the occurrence of events during a given period of time, and
(ii) X is the exposure of events during the same period of time
The tricky part is that Y and X come from two different datasets. In other words, the count of events is from survey A, and the exposure is from survey B. The two surveys used different survey methods, although their sampling universes refer to the same population.
Some people suggested that I can use bootstrapping or jackknife to simulate the distribution of (Y/X); does this approach make sense?
Thanks very much.
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#2

30 Jul 2014, 04:34

The population event rate is the number of events divided by total exposure. Let $Y_\bullet$ be the population total of $Y$ (survey 1) and $X_\bullet$ the population total of $X$ (survey 2), with $\overline{Y}$ and $\overline{X}$ the corresponding population means. Then the number of events in the population divided by the total exposure is:
$$
R = \frac{Y_\bullet}{X_\bullet} = \frac{\overline{Y}}{\overline{X}}
$$
The natural estimate of $R$ is
$$
\widehat{R} = \frac{\widehat{\overline{Y}}}{\widehat{\overline{X}}}
$$
where $\widehat{A}$ indicates a sample estimate.

To get standard errors, apply the delta method to a ratio of independent random variables. For a confidence interval, the best approach, I think, would be to get a standard error and confidence interval for $\log(R)$, then convert that to one for $R$. Here is the standard error for $\log(R)$. I leave the rest to you.
$$
\text{SE}(\log\widehat{R})= \sqrt{\text{CV}_{{\widehat{\overline{Y}}}}^2 + \text{CV}_{{\widehat{\overline{X}}}}^2} \quad \quad \quad (1)
$$
where CV stands for coefficient of variation, e.g.
$$
\text{CV}_{\widehat{\overline{Y}}} = \frac{\text{SE}(\widehat{\overline{Y}})}{ \widehat{\overline{Y}}}
$$
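The remaining step can be written out as follows; this is a sketch assuming a normal approximation for $\log\widehat{R}$ and independence of the two surveys, with $z$ the normal quantile (about 1.96 for 95% coverage). Independence is what lets the two squared CVs in (1) simply add:
$$
\text{Var}(\log\widehat{R}) = \text{Var}(\log\widehat{\overline{Y}}) + \text{Var}(\log\widehat{\overline{X}}) \approx \text{CV}_{\widehat{\overline{Y}}}^2 + \text{CV}_{\widehat{\overline{X}}}^2
$$
The interval for $\log R$ is $\log\widehat{R} \pm z\,\text{SE}(\log\widehat{R})$, and exponentiating its endpoints gives the interval for $R$:
$$
\left[\, \widehat{R}\, e^{-z\,\text{SE}(\log\widehat{R})},\ \ \widehat{R}\, e^{+z\,\text{SE}(\log\widehat{R})} \,\right]
$$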
There are other potential problems besides how to form a confidence interval. Although both samples were drawn from the same population, it's certain that distributions of age, gender, and other characteristics will differ, and this could bias the estimate of $R$. Suppose, for example, that, in the population, exposure is similar for males and females, but for a given value of $X$, females have higher risk of $Y$. If the proportion female is higher in sample 1, then $\widehat{R}$ could be biased upward.

There is a fix for this kind of bias. If you have the original data for the two surveys, you should re-weight the original sampling weights of both (if available) to the same set of control totals for demographic characteristics, via survey raking or calibration. For the former, download Nick Winter's survwgt or Stas Kolenikov's ipfweight; for the latter, John D'Souza's calibrate and calibest. All are at SSC.

Bias could also arise if the surveys were not done at the same time and there is a seasonal variation or temporal trend in outcome. I don't have an easy fix for that.

Last edited by Steve Samuels; 30 Jul 2014, 05:10.

Steve Samuels
Statistical Consulting
[email protected]

Stata 14.2
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#3

28 Aug 2014, 21:15

I should have pointed out that all the estimates can be found after running svy mean on the data for each survey. Try the following example:

Code:

sysuse auto, clear
svyset _n [pw= length]
svy: mean turn
di "(CV of Mean)^2 =" (0.5180212 /40.09305)^2
/* Automated */
return list
matrix a = r(table)
matrix list a
matrix se = a["se","turn"]
matrix mean = a["b", "turn"]
scalar semean = el(se,1,1)
scalar mean = el(mean,1,1)
scalar cv2 = (semean/mean)^2
scalar list cv2
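
Putting formula (1) from post #2 together with the svy: mean output, a minimal sketch of the whole calculation might look like the following. The auto data stand in for both surveys purely to show the syntax (turn plays the role of Y, trunk the role of X; both choices are illustrative assumptions). With real data each block would load and svyset its own survey, and the two samples would need to be independent for the formula to apply.

Code:

* "Survey 1": mean of Y and its squared CV
sysuse auto, clear
svyset _n [pw = length]
svy: mean turn
scalar ybar = _b[turn]
scalar cvy2 = (_se[turn]/_b[turn])^2

* "Survey 2": mean of X and its squared CV
sysuse auto, clear
svyset _n [pw = length]
svy: mean trunk
scalar xbar = _b[trunk]
scalar cvx2 = (_se[trunk]/_b[trunk])^2

* Ratio, SE of the log ratio from (1), and a 95% CI back-transformed from the log scale
scalar rhat = ybar/xbar
scalar selog = sqrt(cvy2 + cvx2)
scalar z = invnormal(0.975)
di "ratio = " rhat
di "ll_ratio = " rhat*exp(-z*selog)
di "ul_ratio = " rhat*exp(z*selog)

The scalars survive the second sysuse because the clear option drops only the data in memory.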

Last edited by Steve Samuels; 28 Aug 2014, 21:51.

Tin-chi Lin

Join Date: Jul 2014

Posts: 9
#4

03 Sep 2014, 13:59

Hi Steve,

Thanks--this helps a lot. I have another question regarding variance approximation -- why do you choose to work on the log ratio, instead of just the ratio itself? Is it because using the log ratio makes it easier to re-weight survey data to population totals (via the calibration or survey raking method)?

Regards,

Tinchi

Last edited by Tin-chi Lin; 03 Sep 2014, 14:10.
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2170
#5

03 Sep 2014, 19:40

Do you have the original disaggregated data sets? If so, you can append one to the other and create dummy variables to indicate to which set an observation belongs. Then define a variable, say W, which is X for the first data set and Y for the second. Then do the following:

Code:

gen W = X if D1
replace W = Y if D2
reg W D1 D2, nocons robust
nlcom _b[D1]/_b[D2]
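
The appending step described above could be sketched as follows. The file names survey1 and survey2 and the marker variable source are hypothetical; the sketch assumes X is stored in the first file and Y in the second, as in the description.

Code:

* Stack the two surveys; source is 0 for rows from survey1 and 1 for rows from survey2
use survey1, clear
append using survey2, generate(source)

* Dummy variables indicating which data set each observation belongs to
gen byte D1 = (source == 0)
gen byte D2 = (source == 1)

The gen, replace, reg, and nlcom lines would then run on the stacked data.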
Tin-chi Lin

Join Date: Jul 2014

Posts: 9
#6

04 Sep 2014, 11:01

Thank you both, Steve and Prof. Wooldridge -- it really helps a lot. A basic question just came to my mind: would the variance approximation (as provided by Steve or the bottom of p.49 of the lecture note here) still be "reasonably good" if X and Y are from complex surveys? I tend to think so, but I am not 100% sure the reasoning makes sense. Would you care to clarify it for me?

My understanding is that, the delta method works as long as the random variables we want to approximate are asymptotically normal. But this is a somewhat big assumption anywhere, no matter whether the random variables come from simple random surveys or complex ones. Therefore the validity of the variance approximation, and whether the variables come from complex surveys, are two separate issues.

The other question is, what is the preferred way to take into account the effect of survey design when we approximate the variance of the ratio? I can think of two ways:

(1) Without combining the two datasets, simply use -svy- to compute mean and standard error for X, and then do the same thing for Y. Then plug the values (mean of X, SE of X, mean of Y, SE of Y) into the variance approximation formula.

(2) Combine the two datasets first, and then try to figure out how to correctly specify the design parameters (strata, cluster, fpc..) of the combined sample like here. Then apply the delta method using -nlcom- after running -svy: reg W D1 D2, nocons-, along the lines Prof. Wooldridge suggested.

Sincerely,

Tin-chi
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#7

06 Sep 2014, 18:33

1. Your understanding that the delta method requires normality of the base statistic is not correct. See the Wikipedia entry for the delta method for references. Yet a CLT does apply to survey estimates.

2. If you svyset a survey with complex design, then use a svy estimation command, the resulting variances will properly adjust for that design.

3. I like starting with the log ratio for three reasons:

a. Ratios are positive numbers. As such, I would expect their estimates to have asymmetric distributions skewed to the right. A symmetric confidence interval for the log ratio will translate to an asymmetric CI for the ratio itself.

b. Intervals starting with the ratio (R ± k SE), instead of the log ratio (log R ± k SE), can have lower endpoints < 0. For this reason, I don't favor Jeff Wooldridge's version of nlcom applied to the ratio. But it can be adapted to the log ratio, as in the nlcom statement below.

Code:

svy: reg W D1 D2, nocons
nlcom log(_b[D1]/_b[D2])
return list
di "ll_ratio =" exp(el(r(b),1,1)-1.96*sqrt(el(r(V),1,1))) /* el() is a matrix function */
di "ratio = " _b[D1]/_b[D2]
di "ul_ratio =" exp(el(r(b),1,1)+1.96*sqrt(el(r(V),1,1)))

c. For many ratios, both Y/X and X/Y have meaningful interpretations, though I don't know if this is true in your case. Think of "miles per gallon" and "gallons consumed per mile". For event studies with an exponential distribution, the parameter $\hat{\lambda}$ = Y/X (Y events, X person years) is the maximum likelihood estimate of the constant hazard rate and $1/\hat{\lambda}$ is the expected lifetime.

Since log(X/Y) = –log(Y/X), the confidence intervals on the log scale will be consonant: you can invert the CI endpoints for R = Y/X to get the CI for 1/R = X/Y. In contrast, you cannot go easily from a CI for R to one for 1/R if you start with (R ± k SE).
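
To spell out the consonance, write $s = \text{SE}(\log\widehat{R})$ and let $[L, U]$ be the back-transformed interval for $R$ (the symbols $s$, $L$, and $U$ are introduced here only for illustration):
$$
[L,\ U] = \left[\, \widehat{R}\, e^{-ks},\ \ \widehat{R}\, e^{+ks} \,\right]
$$
Because $\log(1/\widehat{R}) = -\log\widehat{R}$ has the same standard error $s$, the same construction applied to $1/R$ gives
$$
\left[\, \widehat{R}^{-1} e^{-ks},\ \ \widehat{R}^{-1} e^{+ks} \,\right] = \left[\, \frac{1}{U},\ \ \frac{1}{L} \,\right]
$$
so the endpoints for $1/R$ are the reciprocals of those for $R$, in reverse order.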

Tin-chi Lin

Join Date: Jul 2014

Posts: 9
#8

09 Sep 2014, 13:17

Hi Steve,

Thanks very much. I can appreciate the use of log ratios. I guess the question now drills down to whether the two surveys can be combined, which depends on the survey details and the study purpose, as pointed out in one of your previous posts. I searched for the conditions under which two surveys can be combined, but didn’t find much (most of it is about how to do it using Stata, like here). Could you (or anyone else in the forum) advise me on this issue?

Purpose of the study: This is not an analytic study. We simply want to compute the population event rate (X/Y) and the standard error of the rate, SE(X/Y). X indicates the total number of events and Y total exposure. In practice we use the sample means of X and Y to compute the rate and the SE. X is from survey 1 and Y survey 2; both surveys are national household surveys based on complex sampling design, and the target universes differ somewhat. Below is the detail of each survey:

Survey 1: The target universe is defined as all dwelling units in the U.S. that contain members of the civilian non-institutionalized population. One adult per family is randomly selected for interview and provides proxy responses for other family members.

Survey 1 uses multistage sampling that involves stratification, clustering, and oversampling of specific population subgroups. However the public version contains only one simplified PSU and strata variable. It appears to me that people who use the public version treat the survey as single-staged stratified clustering survey WITH replacement, as indicated in p.5 of this report.

Survey 2: The universe is composed of the civilian, noninstitutional population residing in occupied households in the United States that are at least 15 years of age. An eligible person, who needs to be at least 15 years old, is randomly selected from the household to conduct the interview.

The design of survey 2 is based on a stratified, three-stage sample.

There are two additional questions:

1. Variance estimation for subsetted data. It may be necessary to restrict survey 1 to those aged 15 or older, because that’s the universe of survey 2. However, I am not sure whether nlcom would be able to compute accurate standard errors (i.e. SE(X/Y)) for the subsetted data after we svyset the combined data; using the if qualifier may not be the right way, as pointed out here. What would you suggest to compute SE(X/Y) for the subset of the combined data?

2. Is there a way to accommodate the correlation between X and Y in the variance estimation? My guess is that this would not be a problem if X and Y were from the same data, and nlcom should be able to handle it. But X and Y are from different surveys; when survey 1 is appended to survey 2, by construction X and Y will be independent of each other in the combined data. Conceptually, however, X is correlated with Y: the more time a person is exposed, the more likely an event will occur. Ignoring the (positive) correlation would inflate the variance estimate, as p.50 of this note shows.
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#9

09 Sep 2014, 21:51

1. For an approach to combining different surveys, see: http://www.statalist.org/forums/foru...ighted-surveys. That is useful if you want to compute several ratios. However, given that you want only one estimate, you can compute the SE of the log Ratio by estimating the CVs of the sample means for the two surveys individually, as I showed in my first post. By the way, the delta-method SE of a ratio is the SE of the log ratio multiplied by the estimated ratio itself.

2. Restrict the analysis to a subpopulation? I think that you have a potential measurement error problem because of proxy responses in the first survey. Are the selected adults in that survey accurate reporters for other family members? If not, restrict the analysis to adults only, not to age 15+: the informant adults in the first survey (who presumably report without error) and to adults in the second survey. You do this in the second survey by adding the subpop() option to the svy: mean statement. You almost never use an if statement to restrict to a subgroup in survey analysis; for the reason, see the Section on subpopulation estimation in the Survey Manual, page 59 or any good sampling book. The exceptions are when the subgroup was a sampling stratum or its total was fixed by post-stratification techniques. If reporting error is not an issue in the first survey, then you can use the subpop() in that survey and restrict the analysis to those 15+.

3. Correlation of Y and X in individuals. This is a non-issue: the question is the correlation between the estimated totals and means, which, as you say, is zero by construction.
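
The subpop() advice in point 2 can be sketched as follows for one survey. The variable name age and the cutoff of 18 for "adult" are illustrative assumptions, X stands for that survey's analysis variable, and the data are assumed to be svyset already.

Code:

* Indicator for the subpopulation of interest (missing when age is missing)
gen byte adult = (age >= 18) if !missing(age)

* Estimate the mean within the subpopulation, keeping the full sample for variance estimation
svy, subpop(adult): mean X

Unlike an if qualifier, subpop() leaves the observations outside the subpopulation in the design when the standard error is computed.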

I will say that your program of estimating only one rate is puzzling to me. There aren't many natural phenomena that don't vary by age or gender, at least. Still, it's your analysis. You've not addressed the other issues that I mentioned in my first response, but I already gave my best advice, such as it is, about them.

Last edited by Steve Samuels; 09 Sep 2014, 21:56.

Steve Samuels

Join Date: Mar 2014

Posts: 1786
#10

10 Sep 2014, 06:00

I have to correct a statement in my last post. You may use an if qualifier in a survey estimation command if the subgroup is a sampling stratum, as I said; but if the subgroup total is fixed by a post-stratification re-weighting you must use the subpop() option.

Tin-chi Lin

Join Date: Jul 2014

Posts: 9
#11

10 Sep 2014, 11:03

Hi Steve,

Thanks again for all this input. We were actually interested in getting estimates for different demographic groups. But we had no idea how to estimate the ratio when X and Y are from different surveys. The issues and knowledge involved appeared very complex to us; we didn't have any prior experience. So we decided to start with a simpler question, i.e., how to compute the estimate for the population. Thanks to your responses, we now feel much more confident than three weeks ago. And we'll start to learn how to reweight the sample as pointed out in your first response.

Regards,

Tin-chi

Last edited by Tin-chi Lin; 10 Sep 2014, 11:37.
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#12

11 Sep 2014, 15:20

You are quite welcome, Tin-chi. There is an example of rates with numerator and denominator from different surveys in Korn, Edward Lee, and Barry I Graubard. 1999. Analysis of health surveys. New York: Wiley, pp 207-2011 and Graubard, B.I., and E.L. Korn. 1996. Survey inference for subpopulations. American Journal of Epidemiology 144, no. 1: 102-106.

Steve

Steven J Samuels
Consultant in Statistics
18 Cantine's Island
Saugerties NY 12477 USA

[email protected]
Phone: 1-845-246-0774
