Weighting sample size to calculate the national estimate

Reza Hosseini

Join Date: Nov 2015

Posts: 36
#1

17 Nov 2015, 19:51

Hi all,

I am analyzing a very large database that is a sample of roughly 20% of all hospitalizations in the US each year. The database has a variable, DISCWT, which is used as a weight to produce national estimates. Applying it should make the population counts and descriptive totals roughly 5 times greater: for example, if I have 8 million observations in my data, the national estimate should be about 5*8=40 million.

To weight the data, I use the code below in Stata:

Code:

svyset HOSPID [pw=DISCWT], strata(NIS_STRATUM) pus(HOSPID)
svy: mean AGE if mycases==1

mycases is an indicator for the cases that I am interested in.
HOSPID contains the code of the hospital where the procedure was done or the patient was hospitalized.

The database provider offers a tool that quickly gives the number of cases at the 'national estimate' (weighted) level.
After applying the code above, I get estimates that are very close to, but not EXACTLY the same as, what the provider gives me. My UNWEIGHTED numbers are exactly the same, so I think the problem must be in the way that I weight the data.

I have checked the website, and it says they calculate the national estimate in SAS as follows:

Code:

PROC SURVEYMEANS DATA=mycases SUM STD MEAN STDERR;
    VAR mycases;
    WEIGHT DISCWT;
    CLUSTER HOSPID;
    STRATA NIS_STRATUM;
run;

I would be very grateful if anyone could help me with this.

Thank you very much!
Reza
Tags: stratification, svyset, syntax, weight, weighting
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#2

17 Nov 2015, 20:46

You are using the Nationwide Inpatient Sample (NIS) data set. If you correctly use the subpopulation option instead of "if mycases==1", then the output of svy: mean will give you the estimated count for the entire population. See the entry in the Survey manual for "Subpopulation estimation for survey data". There is no "pus" option to svyset, so your command as written would have failed. The correct Stata code should be:

Code:

svyset hospid [pweight = discwt], strata(nis_stratum)
svy, subpop(if mycases==1): mean AGE //assuming AGE is in upper case

That said, the word "cases" is ambiguous. So, show us the code you used to implement the data provider's instruction, and also, as FAQ 14 asks, show the Stata and SAS output between CODE delimiters.
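
A minimal sketch of how the SAS request quoted in the first post (SUM of mycases with WEIGHT, CLUSTER, and STRATA) might be mirrored in Stata. It assumes that mycases is a 0/1 indicator with no missing values and that the variable names are upper case, as in the original post. With "if", observations outside the subpopulation are dropped before the variance is computed, whereas subpop() keeps the full design, so the point estimates agree but the standard errors can differ.

Code:

* Declare the design: hospitals are the PSUs, DISCWT is the sampling weight
svyset HOSPID [pweight = DISCWT], strata(NIS_STRATUM)

* Weighted total of a 0/1 indicator = estimated national number of cases
* (the counterpart of SUM in PROC SURVEYMEANS)
svy: total mycases

* Mean age among the cases, with the full design kept for variance estimation
svy, subpop(mycases): mean AGE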

Steve Samuels
Statistical Consulting

Stata 14.2

Reza Hosseini

Join Date: Nov 2015
Posts: 36

17 Nov 2015, 21:55

Thank you so much for the reply, Steve!
I read the subpopulation documentation and used it in the code below. This is the Stata output, run on the NIS_2007 database:

Code:

. gen asthma=0

. replace asthma=1 if DXCCS1==128
(81443 real changes made)

. svyset HOSPID [pweight = DISCWT], strata(NIS_STRATUM)

      pweight: DISCWT
          VCE: linearized
  Single unit: missing
     Strata 1: NIS_STRATUM
         SU 1: HOSPID
        FPC 1: <zero>

. svy, subpop(asthma): mean AGE
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =      60        Number of obs    =   8043250
Number of PSUs   =    1044        Population size  =  39541194
                                  Subpop. no. obs  =     81278
                                  Subpop. size     =  401334.3
                                  Design df        =       984

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
         AGE |   39.43323   .8418483      37.78121    41.08526
--------------------------------------------------------------

and below are the results that hcupnet.ahrq.gov gives you (with DISCWT):

Code:

CCS principal diagnosis category
128 Asthma

All discharges    402,088 (100.00%)    13,985
Age (mean)        39.43                0.84

The number of unweighted observations is the same in both cases (81443), but as you can see above, the total number of discharges differs.

I would appreciate it if you could help me solve this issue.

Thanks!
Reza


Steve Samuels

Join Date: Mar 2014
Posts: 1786

18 Nov 2015, 02:29

The problem is that discharges with unknown age are not counted in the svy: mean results. I can't duplicate your numbers for 2007 in HCUPnet (see below), but the percent with missing age (0.20%) is nearly identical to the discrepancy in your results: (402,088 - 401,334.3)/402,088 = 0.19%.

To find this out for yourself, you can:
1. Add "age category" to your HCUPnet search
or
2. Run Stata code like

Code:

codebook AGE
codebook AGE if asthma
total DISCWT if asthma  // independent count
total DISCWT if asthma & !missing(AGE)
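
As a rough cross-check of the explanation above, the sketch below assumes the asthma indicator and the upper-case variable names from the earlier output, and that the svyset command shown earlier has already been run. Because svy: total asthma does not involve AGE, discharges with missing age stay in the estimate, which is why it may line up with the HCUPnet "All discharges" figure rather than with the Subpop. size reported by svy: mean.

Code:

* How many asthma discharges in the sample have no recorded age?
count if asthma == 1 & missing(AGE)

* Estimated national number of asthma discharges, independent of AGE
svy: total asthma

* Relative gap between the two weighted counts quoted in this thread, in percent
display %5.2f 100*(402088 - 401334.3)/402088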

HCUPnet Search (The standard errors column is cut off by the forum software):

2007 National statistics - principal diagnosis only
CCS principal diagnosis category
128 Asthma

All discharges          387,880 (100.00%)    13,402
Age (mean)              39.45                0.84
Age group   <1          10,430 (2.69%)       780
            1-17        110,303 (28.44%)     8,618
            18-44       76,729 (19.78%)      2,588
            45-64       108,874 (28.07%)     3,674
            65-84       67,771 (17.47%)      2,147
            85+         13,009 (3.35%)       478
            Missing     764 (0.20%)          136


Reza Hosseini

Join Date: Nov 2015
Posts: 36

19 Nov 2015, 15:09

Thank you very much for your help!
