  • Weighting sample size to calculate the national estimate

    Hi all,

    I am analyzing a very large database that is a sample of roughly 20% of all hospitalizations in the US in each year. The database has a variable, DISCWT, that is used for weighting to produce national estimates. Applying it should make the counts and descriptive totals roughly 5 times greater: for example, if I have 8 million observations in my data, the national estimate should be about 5*8=40 million.

    To weight the data, I use the code below in Stata:
    Code:
    svyset HOSPID [pw=DISCWT], strata(NIS_STRATUM) pus(HOSPID)
    svy: mean AGE if mycases==1
    mycases is the indicator variable for the cases that I am interested in.
    HOSPID is the variable that contains the codes of the hospitals where the procedure was done or the patient was hospitalized.

    The database provider itself offers a tool that quickly gives you the number of cases at the 'national estimate' (=weighted) level.
    After applying the above code to weight the data, I get estimates that are very close but not EXACTLY the same as what the provider gives me. My UNWEIGHTED numbers are exactly the same, so I think the problem must be in the way that I weight the data.

    I have checked the website and it says the way they calculate the national estimate in SAS is as follows:
    Code:
    PROC SURVEYMEANS DATA=mycases SUM STD MEAN STDERR;
    VAR mycases;
    WEIGHT DISCWT;
    CLUSTER HOSPID;
    STRATA NIS_STRATUM;
    run;
    I would be very grateful if anyone can help me in this regard.

    Thank you very much!
    Reza

  • #2
    You are using the Nationwide Inpatient Sample (NIS) data set. If you use the subpopulation option instead of "if mycases==1", the output of svy: mean will also give you the estimated count at the population level. See the entry in the Survey manual for "Subpopulation estimation for survey data". There is no "pus" option to svyset, so your command as written would have failed. The correct Stata code should be:

    Code:
    svyset HOSPID [pweight = DISCWT], strata(NIS_STRATUM)  // names are case-sensitive; upper case as in your post
    svy, subpop(if mycases==1): mean AGE  // assuming AGE is in upper case
    That said, the word "cases" is ambiguous. So, show us the code you used to implement the data provider's instruction, and also, as FAQ 14 asks, show the Stata and SAS output between CODE delimiters.
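
    For comparison with the SAS step quoted in #1, a minimal Stata sketch of the same weighted total might look like the following. It assumes that mycases is a 0/1 indicator with no missing values and that the variable names are upper case as in the original post; it has not been checked against HCUPnet output.

    ```stata
    * Declare the design: hospitals as PSUs, DISCWT as the sampling weight
    svyset HOSPID [pweight = DISCWT], strata(NIS_STRATUM)

    * Weighted total of the 0/1 indicator = estimated national number of cases
    * (assumption: mycases is coded 0/1 for every record)
    svy: total mycases
    ```

    svy: total reports the estimated national count with a design-based standard error, which is the quantity the SUM statistic of PROC SURVEYMEANS is meant to produce.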

    Steve Samuels
    Statistical Consulting

    Stata 14.2

    • #3
      Thank you so much for the reply, Steve!
      I read the subpopulation documentation and used it in the code below. This is the output from Stata, run on the NIS_2007 database:

      Code:
      . gen asthma=0
      
      . replace asthma=1 if DXCCS1==128
      (81443 real changes made)
      
      . svyset HOSPID [pweight = DISCWT], strata(NIS_STRATUM)
      
            pweight: DISCWT
                VCE: linearized
        Single unit: missing
           Strata 1: NIS_STRATUM
               SU 1: HOSPID
              FPC 1: <zero>
      
      . svy, subpop(asthma): mean AGE
      (running mean on estimation sample)
      
      Survey: Mean estimation
      
      Number of strata =      60        Number of obs    =   8043250
      Number of PSUs   =    1044        Population size  =  39541194
                                        Subpop. no. obs  =     81278
                                        Subpop. size     =  401334.3
                                        Design df        =       984
      
      --------------------------------------------------------------
                   |             Linearized
                   |       Mean   Std. Err.     [95% Conf. Interval]
      -------------+------------------------------------------------
               AGE |   39.43323   .8418483      37.78121    41.08526
      --------------------------------------------------------------
      and below are the results that hcupnet.ahrq.gov gives you (with DISCWT):

      Code:
      CCS principal diagnosis category 
      128 Asthma 
      All discharges 402,088 (100.00%) 13,985
      Age (mean) 39.43 0.84
      The number of unweighted observations is the same in both cases (81443), but as you can see above, the total numbers of discharges are different.

      I would appreciate it if you could help me solve this issue.

      Thanks!
      Reza

      • #4
        The problem is that discharges with unknown age are not counted in the svy: mean results. I can't duplicate your numbers for 2007 in HCUPnet (see below), but the percent with missing age (0.20%) is nearly identical to the discrepancy in your results: (402,088-401,334.3)/402,088 = 0.19%.
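
        As a quick check of that arithmetic, here is a minimal sketch that uses only figures already quoted in this thread (nothing in it touches the data):

        ```stata
        * Relative gap between the HCUPnet total and the Stata subpopulation size, in percent
        display %5.2f 100 * (402088 - 401334.3) / 402088

        * Unweighted gap: asthma records flagged (81443) minus records used by svy: mean (81278)
        display 81443 - 81278
        ```

        The first line displays about 0.19 and the second displays 165. In other words, 165 sampled asthma discharges did not enter the mean, which would be consistent with their AGE being missing.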

        To find this out for yourself, you can:
        1. Add "age category" to your HCUPnet search
        or
        2. Run Stata code like
        Code:
        codebook AGE
        codebook AGE if asthma
        total DISCWT if asthma  // independent count
        total DISCWT if asthma & AGE != .  // count excluding missing age
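
        The same check can also be run inside the survey framework. This is a sketch only: agemiss is a hypothetical helper variable that is not in the NIS file, and the upper-case names follow the output posted in #3.

        ```stata
        * Flag records with missing age (agemiss is an illustrative name)
        generate byte agemiss = missing(AGE)

        svyset HOSPID [pweight = DISCWT], strata(NIS_STRATUM)

        * Estimated national number of asthma discharges with missing age
        svy, subpop(asthma): total agemiss

        * Estimated share of asthma discharges with missing age
        svy, subpop(asthma): proportion agemiss
        ```

        If missing age explains the gap, the estimated total of agemiss should be close to the difference between the HCUPnet count and the "Subpop. size" reported by svy: mean.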

        HCUPnet Search (The standard errors column is cut off by the forum software):



        2007 National statistics - principal diagnosis only
        CCS principal diagnosis category: 128 Asthma

        All discharges          387,880 (100.00%)   13,402
        Age (mean)                39.45               0.84
        Age group   <1           10,430   (2.69%)      780
                    1-17        110,303  (28.44%)    8,618
                    18-44        76,729  (19.78%)    2,588
                    45-64       108,874  (28.07%)    3,674
                    65-84        67,771  (17.47%)    2,147
                    85+          13,009   (3.35%)      478
                    Missing         764   (0.20%)      136

        Steve Samuels
        Statistical Consulting

        Stata 14.2

        • #5
          Thank you very much for your help!
