  • Using frequency weights on graph bar to produce weighted averages.

    Hello. I am working with a dataset that has the number of births and number of preterm births in different facilities in different districts. I want to show each district's preterm birth rate on an hbar graph. The first way (and more intuitive way, to me) I tried this was to just create variables pretermtotal and totaldels as the total of preterm and total births, respectively, and then divide them to get the district preterm birth rate.

    I then thought, maybe it's more efficient to create one preterm birth variable, which in my code is pretermrate2 - the preterm birth rate for each individual observation - and then graph the mean of pretermrate2 using each observation's total deliveries as the frequency weight, which would in effect give me a weighted average. If it works I could cut out two lines of code and create fewer new variables.

    The problem is that when I run both versions of this code, the final graphs show slightly different preterm birth rates. In most districts they differ by between 0.05 and 0.5, and in only one district are they the same. I suspect this problem lies in the ado file for Stata weights, but I'm really not sure how to find out if that's true, and running this method on different data gave the same numbers for both graphs.

    If anyone knows why one kind of code produces different numbers than the other, I would greatly appreciate it!
    * note - in the sample code I gave, the final graphed averages are a bit farther apart than when using the full unedited dataset


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str10 district int(pregpreterm delinst)
    "York" 10 158
    "York" 18 272
    "York" 11 155
    "York" 13 153
    "York" 12 206
    "York" 14 321
    "York" 12 215
    "York" 12 222
    "York" 14 194
    "York" 18 208
    "Jersey" 15 220
    "Jersey" 18 299
    "Jersey" 12 146
    "Jersey" 16 175
    "Jersey" 10 181
    "Jersey" 13 179
    "Jersey" 12 175
    "Jersey" 15 274
    "Jersey" 17 189
    "Jersey"  9 160
    "Jersey" 12 139
    "Jersey" 16 210
    "Jersey" 14 171
    "Jersey" 14 207
    "Jersey"  . 114
    "Jersey"  .  84
    "Jersey"  .  69
    "Jersey"  .  88
    "Jersey"  .  75
    "Guernsey"  1  89
    "Guernsey"  1 138
    "Guernsey"  .  55
    "Guernsey"  .  96
    "Guernsey"  .  59
    "Guernsey"  . 102
    "Guernsey"  .  66
    "Guernsey"  1  76
    "Guernsey"  1  92
    "Guernsey"  1 114
    "Guernsey"  .  67
    "Guernsey"  1  72
    "Guernsey"  1 103
    "Guernsey"  .  44
    "Guernsey"  . 122
    "Guernsey"  . 117
    "Guernsey"  . 135
    "Guernsey"  .  57
    "Guernsey"  1  73
    "Mersey"  .  35
    "Mersey"  .  59
    "Mersey"  .  31
    "Mersey"  1  37
    "Mersey"  .  46
    "Mersey"  .  37
    "Mersey"  1  32
    "Mersey"  1  37
    "Mersey"  .  46
    "Mersey"  .  40
    "Mersey"  .  35
    "Mersey"  .  48
    "Mersey"  .  34
    "Mersey"  .  53
    "Mersey"  .  50
    "Mersey"  .  44
    "Mersey"  .  35
    "Mersey"  .  52
    "Mersey"  1  41
    "Mersey"  1  21
    "Mersey"  .  32
    "Mersey"  1  41
    "Mersey"  .  56
    "Mersey"  .  20
    "Mersey"  .  94
    "Mersey"  5 145
    "Mersey"  . 117
    "Mersey"  5 107
    "Mersey"  0  83
    "Mersey"  . 106
    "Mersey"  2  78
    "Mersey"  3  83
    "Mersey"  2 101
    "Mersey"  3 152
    "Percy"  2 102
    "Percy"  0  61
    "Percy"  0 152
    "Percy"  5 192
    "Percy"  5  95
    "Percy"  .  97
    "Percy"  5 103
    "Percy"  3 132
    "Percy"  3  67
    "Percy"  3  64
    "Percy"  3  65
    "Percy"  . 128
    "Percy"  5 138
    "Percy"  5  92
    "Percy"  0  45
    "Percy"  .  40
    "Percy"  .  49
    "Percy"  .  53
    end
    
            tempfile g1
            tempfile g2
            
            bys district: egen pretermtotal=total(pregpreterm), missing
            bys district: egen totaldels=total(delinst), missing
            gen pretermrate1=100*pretermtotal/totaldels
            sum delinst, d
            local myn `r(sum)'
            graph hbar pretermrate1, over(district) blabel(bar, ///
                size(small)) title("Preterm birth rate by district") ytitle("rate of preterm births") ///
                note("Source: MP HMIS data for FY '17-'18 and '18-'19, n = `:di %-12.0fc `myn''") ///
                bargap(40) saving(`g1', replace)
            
            gen pretermrate2=100* pregpreterm/delinst
            graph hbar (mean) pretermrate2 [fw=delinst], over(district) blabel(bar, ///
                size(small)) title("Preterm birth rate by district") ytitle("rate of preterm births") ///
                note("Source: MP HMIS data for FY '17-'18 and '18-'19, n = `:di %-12.0fc `myn''") ///
                bargap(40) saving(`g2', replace)
            
            graph combine "`g1'" "`g2'" // compare output from both methods

  • #2
    Thanks for providing a data example, and it's not a problem that the data don't look real.

    Good news: it's your bug. The graphs show the same results for York, where there are no missing values, but not otherwise: in all the other places missing values are present. So, my immediate guess is that the problem will lie in how missing values are handled in your code.

    The problem is that (inserting some spaces for readability) this code

    Code:
    bys district: egen pretermtotal = total(pregpreterm), missing
    bys district: egen totaldels = total(delinst), missing
    gen pretermrate1 = 100 * pretermtotal/totaldels
    does not ignore missings as you would wish. The missing option only ensures that, when every value in a group is missing, the total is returned as missing rather than 0. It does not ensure that observations with missing values are excluded when non-missing values are also present in the group. Nor is there any sense in which the egen code for one variable will look across at other variables not mentioned to see if there are missing values. (In summing values, missing values are ignored, so it is as if they were zero: the sum of 42 and missing is 42, not missing.)
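
    A minimal sketch of that behaviour, on made-up toy data (the variables num and den and all the new variable names are hypothetical, chosen only for illustration): the numerator total skips the missing values, but the denominator total still includes the rows where the numerator is missing, so the ratio changes.

    ```stata
    * Toy data, invented for illustration: one group, numerator missing in two rows
    clear
    input byte(num den)
     3 10
     . 20
     . 30
    end

    egen numtotal = total(num)      // 3: the missing values are skipped in the sum
    egen dentotal = total(den)      // 60: rows with missing num are still counted

    * Totals restricted to rows where num is missing, with and without the option
    egen zero    = total(num) if missing(num)            // 0 in those rows
    egen allmiss = total(num) if missing(num), missing   // missing in those rows

    * Denominator restricted to rows that are complete on both variables
    egen dencc = total(den) if !missing(num, den)        // 10 in the complete row

    gen rate_all = 100 * numtotal/dentotal   // 5: uses every den
    gen rate_cc  = 100 * numtotal/dencc      // 30 in the complete row only

    list, noobs
    ```

    The two rates differ only because of which rows enter the denominator, which is the same mechanism as in the district graphs.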

    It's really better to assume that Stata's official code has been banged on many more times than your code so that bugs have been found before you use the code -- and, dare I say, was written by people with more experience.

    Here's how to get the same results. In the home-grown calculations you must tell Stata to use only observations with non-missing values on both variables. I've simplified your code in some small respects.

    Code:
    gen touse = !missing(pregpreterm, delinst)
    bys district: egen pretermtotal = total(pregpreterm) if touse  
    bys district: egen totaldels = total(delinst) if touse
    gen pretermrate1 = 100 * pretermtotal/totaldels
    
    sum delinst, d
    local myn `r(sum)'
    
    graph hbar pretermrate1, over(district) blabel(bar, ///
    size(small)) title("Preterm birth rate by district") ytitle("rate of preterm births") ///
    note("Source: MP HMIS data for FY '17-'18 and '18-'19, n = `:di %-12.0fc `myn''") ///
    bargap(40) name(g1, replace)
            
    gen pretermrate2 = 100 * pregpreterm/delinst
    graph hbar (mean) pretermrate2 [fw=delinst], over(district) blabel(bar, ///
    size(small)) title("Preterm birth rate by district") ytitle("rate of preterm births") ///
    note("Source: MP HMIS data for FY '17-'18 and '18-'19, n = `:di %-12.0fc `myn''") ///
    bargap(40) name(g2, replace)
            
    graph combine g1 g2
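
    If you would rather check the agreement numerically than by reading bar labels, here is a sketch on a small made-up dataset. The district names, the counts and the variable wmean are all hypothetical; summarize with frequency weights stands in for what graph hbar (mean) reports, which is my assumption about the simplest equivalent calculation.

    ```stata
    * Toy data, invented for illustration
    clear
    input str8 district int(pregpreterm delinst)
    "A" 10 100
    "A"  .  50
    "A"  5 150
    "B"  2  40
    "B"  7  60
    end

    * Method 1: ratio of district totals over complete observations only
    gen touse = !missing(pregpreterm, delinst)
    bys district: egen pretermtotal = total(pregpreterm) if touse
    bys district: egen totaldels = total(delinst) if touse
    gen pretermrate1 = 100 * pretermtotal/totaldels

    * Method 2: frequency-weighted mean of the facility-level rates
    gen pretermrate2 = 100 * pregpreterm/delinst
    gen double wmean = .
    levelsof district, local(levels)
    foreach d of local levels {
        quietly summarize pretermrate2 [fw=delinst] if district == "`d'"
        quietly replace wmean = r(mean) if district == "`d'" & touse
    }

    * The two methods should agree up to float precision
    assert reldif(pretermrate1, wmean) < 1e-6 if touse
    ```

    In this toy example both methods give 6 for district A and 9 for district B.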



    • #3
      Thanks for your help. I understand now what you mean about the missing variables. So this new code should use the same observations for both approaches. However I lose some observations in the first part of the code, so I have to determine if that is a reasonable constraint to put on the data given what I know about it.

      Thanks again for the help!



      • #4
        It's missing values, not missing variables, that are part of the problem. A variable is, in other terms, an entire column in the dataset.

        I don't understand the complaint in #3. If either the numerator or the denominator is missing, you can't calculate a ratio, or equivalently the ratio will be calculated as a missing value.

        Both methods ignore observations with any missing values: that is why they give the same result. I don't see that you lose any observations that aren't useless for the purpose. In

        Code:
        gen pretermrate2 = 100 * pregpreterm/delinst
        a missing value anywhere on the right-hand side results in missings for the result.



        • #5
          Sorry if it seemed like a complaint; it was just a comment about the data I am working with. My point was that I know that in this dataset, if there are no preterm births in a given facility, preterm births may be set to missing instead of zero (this is a very poorly maintained data set). So I have to make a judgment call about whether to include observations where preterm births are missing but total births are not missing: do I believe that the missing value actually denotes 0 preterm births, or is it legitimately missing, so that I should exclude it from this graph? If I write
          Code:
          bys district: egen pretermtotal=total(pregpreterm), missing
          bys district: egen totaldels=total(delinst), missing
          gen pretermrate1=100*pretermtotal/totaldels
          then I get to include every observation for which there is any value in either pregpreterm or delinst, which might be desirable.
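
          A sketch of what that choice looks like when made explicit, on made-up toy data (the variable pregpreterm0 and the counts are hypothetical, and the recode rests entirely on the assumption that missing really means zero): if the missing preterm counts are set to 0 in a new variable, the weighted-mean approach uses the same observations as the totals approach.

          ```stata
          * Toy data, invented for illustration
          clear
          input str8 district int(pregpreterm delinst)
          "A" 10 100
          "A"  .  50
          "A"  5 150
          end

          * ASSUMPTION: a missing preterm count with known deliveries means 0 preterm births
          gen pregpreterm0 = cond(missing(pregpreterm) & !missing(delinst), 0, pregpreterm)

          * Ratio of totals, now with the zeros stated explicitly
          bys district: egen pretermtotal = total(pregpreterm0)
          bys district: egen totaldels = total(delinst)
          gen pretermrate1 = 100 * pretermtotal/totaldels

          * Frequency-weighted mean of facility-level rates over the same observations
          gen pretermrate2 = 100 * pregpreterm0/delinst
          quietly summarize pretermrate2 [fw=delinst]
          display r(mean)
          ```

          Here both calculations use all 300 deliveries, so both come to 100 * 15/300 = 5.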

          re missing values: yes, sorry for the brain fart.

