Using frequency weights on graph bar to produce weighted averages.

Max Bricker

Join Date: Jul 2019
Posts: 3

21 Jul 2019, 11:59

Hello. I am working with a dataset that has the number of births and the number of preterm births in different facilities in different districts. I want to show each district's preterm birth rate on an hbar graph. My first approach, and to me the more intuitive one, was to create variables pretermtotal and totaldels holding each district's total preterm births and total births, respectively, and then divide one by the other to get the district preterm birth rate.

I then thought, maybe it's more efficient to create one preterm birth variable, which in my code is pretermrate2 - the preterm birth rate for each individual observation - and then graph the mean of pretermrate2 using each observation's total deliveries as the frequency weight, which would in effect give me a weighted average. If it works I could cut out two lines of code and create fewer new variables.

The problem is that when I run both versions of this code, the final graphs show slightly different preterm birth rates. In most districts they differ by between 0.05 and 0.5, and in only one district is the number the same. I suspect the problem lies in the ado file for Stata weights, but I'm really not sure how to find out whether that's true, and running this method on different data gave the same numbers for both graphs.

If anyone knows why one kind of code produces different numbers than the other, I would greatly appreciate it!
* Note: with the sample data below, the final graphed averages are a bit farther apart than with the full, unedited dataset.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str10 district int(pregpreterm delinst)
"York" 10 158
"York" 18 272
"York" 11 155
"York" 13 153
"York" 12 206
"York" 14 321
"York" 12 215
"York" 12 222
"York" 14 194
"York" 18 208
"Jersey" 15 220
"Jersey" 18 299
"Jersey" 12 146
"Jersey" 16 175
"Jersey" 10 181
"Jersey" 13 179
"Jersey" 12 175
"Jersey" 15 274
"Jersey" 17 189
"Jersey"  9 160
"Jersey" 12 139
"Jersey" 16 210
"Jersey" 14 171
"Jersey" 14 207
"Jersey"  . 114
"Jersey"  .  84
"Jersey"  .  69
"Jersey"  .  88
"Jersey"  .  75
"Guernsey"  1  89
"Guernsey"  1 138
"Guernsey"  .  55
"Guernsey"  .  96
"Guernsey"  .  59
"Guernsey"  . 102
"Guernsey"  .  66
"Guernsey"  1  76
"Guernsey"  1  92
"Guernsey"  1 114
"Guernsey"  .  67
"Guernsey"  1  72
"Guernsey"  1 103
"Guernsey"  .  44
"Guernsey"  . 122
"Guernsey"  . 117
"Guernsey"  . 135
"Guernsey"  .  57
"Guernsey"  1  73
"Mersey"  .  35
"Mersey"  .  59
"Mersey"  .  31
"Mersey"  1  37
"Mersey"  .  46
"Mersey"  .  37
"Mersey"  1  32
"Mersey"  1  37
"Mersey"  .  46
"Mersey"  .  40
"Mersey"  .  35
"Mersey"  .  48
"Mersey"  .  34
"Mersey"  .  53
"Mersey"  .  50
"Mersey"  .  44
"Mersey"  .  35
"Mersey"  .  52
"Mersey"  1  41
"Mersey"  1  21
"Mersey"  .  32
"Mersey"  1  41
"Mersey"  .  56
"Mersey"  .  20
"Mersey"  .  94
"Mersey"  5 145
"Mersey"  . 117
"Mersey"  5 107
"Mersey"  0  83
"Mersey"  . 106
"Mersey"  2  78
"Mersey"  3  83
"Mersey"  2 101
"Mersey"  3 152
"Percy"  2 102
"Percy"  0  61
"Percy"  0 152
"Percy"  5 192
"Percy"  5  95
"Percy"  .  97
"Percy"  5 103
"Percy"  3 132
"Percy"  3  67
"Percy"  3  64
"Percy"  3  65
"Percy"  . 128
"Percy"  5 138
"Percy"  5  92
"Percy"  0  45
"Percy"  .  40
"Percy"  .  49
"Percy"  .  53
end

        tempfile g1
        tempfile g2
        
        bys district: egen pretermtotal=total(pregpreterm), missing
        bys district: egen totaldels=total(delinst), missing
        gen pretermrate1=100*pretermtotal/totaldels
        sum delinst, d
        local myn `r(sum)'
        graph hbar pretermrate1, over(district) blabel(bar, ///
            size(small)) title("Preterm birth rate by district") ytitle("rate of preterm births") ///
            note("Source: MP HMIS data for FY '17-'18 and '18-'19, n = `:di %-12.0fc `myn''") ///
            bargap(40) saving(`g1', replace)
        
        gen pretermrate2=100* pregpreterm/delinst
        graph hbar (mean) pretermrate2 [fw=delinst], over(district) blabel(bar, ///
            size(small)) title("Preterm birth rate by district") ytitle("rate of preterm births") ///
            note("Source: MP HMIS data for FY '17-'18 and '18-'19, n = `:di %-12.0fc `myn''") ///
            bargap(40) saving(`g2', replace)
        
        graph combine "`g1'" "`g2'" // compare output from both methods


Nick Cox

Join Date: Mar 2014

Posts: 35688
#2

22 Jul 2019, 03:17

Thanks for providing a data example, and it's not a problem that the data don't look real.

Good news: it's your bug. The graphs show the same results for York, where there are no missing values, but not otherwise: in all the other places missing values are present. So, my immediate guess is that the problem will lie in how missing values are handled in your code.

The problem is that (inserting some spaces for readability) this code

Code:

bys district: egen pretermtotal = total(pregpreterm), missing
bys district: egen totaldels = total(delinst), missing
gen pretermrate1 = 100 * pretermtotal/totaldels

does not ignore missings as you would wish. The missing option at most ensures that a total of 0, returned when all values in a group are missing, is mapped to missing instead. It doesn't ensure that missings are ignored completely when non-missings are present in a group. Nor is there any sense in which the egen code for one variable will look across at other variables not mentioned to see if there are missing values. (In summing values, missing values are ignored, so it is as if they were zero: the sum of 42 and missing is 42, not missing.)
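
To make the discrepancy concrete, here is a short sketch of the two quantities for one district. The notation is an assumption introduced for illustration: $p_i$ is pregpreterm and $d_i$ is delinst for facility $i$, $P$ and $D$ are the sets of facilities with non-missing $p_i$ and $d_i$, and $C = P \cap D$ is the set of complete observations. It also assumes every $d_i$ used as a weight is positive.

```latex
\begin{align*}
\text{weighted mean of rates:}\quad
\bar r_w &= \frac{\sum_{i \in C} d_i \, r_i}{\sum_{i \in C} d_i}
          = 100\,\frac{\sum_{i \in C} p_i}{\sum_{i \in C} d_i},
          \qquad r_i = 100\,\frac{p_i}{d_i} \\[4pt]
\text{ratio of egen totals:}\quad
R &= 100\,\frac{\sum_{i \in P} p_i}{\sum_{i \in D} d_i}
\end{align*}
```

In the example data delinst is never missing, so $D$ contains every facility and $C = P$. The numerators then agree, but the denominator of $R$ also counts deliveries at facilities where pregpreterm is missing, so $R \le \bar r_w$, with equality only when a district has no missing values (York).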

It's really better to assume that Stata's official code has been banged on many more times than your code so that bugs have been found before you use the code -- and, dare I say, was written by people with more experience.

Here's how to get the same results. In the home-grown calculations you must tell Stata to use only observations with non-missing values on both variables. I've simplified your code in some small respects.

Code:

gen touse = !missing(pregpreterm, delinst)
bys district: egen pretermtotal = total(pregpreterm) if touse
bys district: egen totaldels = total(delinst) if touse
gen pretermrate1 = 100 * pretermtotal/totaldels
sum delinst, d
local myn `r(sum)'

graph hbar pretermrate1, over(district) blabel(bar, ///
    size(small)) title("Preterm birth rate by district") ytitle("rate of preterm births") ///
    note("Source: MP HMIS data for FY '17-'18 and '18-'19, n = `:di %-12.0fc `myn''") ///
    bargap(40) name(g1, replace)

gen pretermrate2 = 100 * pregpreterm/delinst
graph hbar (mean) pretermrate2 [fw=delinst], over(district) blabel(bar, ///
    size(small)) title("Preterm birth rate by district") ytitle("rate of preterm births") ///
    note("Source: MP HMIS data for FY '17-'18 and '18-'19, n = `:di %-12.0fc `myn''") ///
    bargap(40) name(g2, replace)

graph combine g1 g2
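
The agreement between the two approaches can also be checked numerically, without reading values off bar labels. The following is a minimal sketch, assuming the dataex example from #1 is in memory; the names touse_check, rate_check and rate_totals are hypothetical and chosen so as not to clash with variables created elsewhere in the thread.

```stata
* Sketch: compare the two calculations as numbers rather than graphs.
* Assumes pregpreterm, delinst and district exist as in the dataex example.
gen byte touse_check = !missing(pregpreterm, delinst)
gen double rate_check = 100 * pregpreterm / delinst

* Weighted mean of facility-level rates, weighted by deliveries
tabstat rate_check [fw=delinst], by(district) statistics(mean) format(%6.3f)

* Ratio of district totals, using complete observations only
preserve
keep if touse_check
collapse (sum) pregpreterm delinst, by(district)
gen double rate_totals = 100 * pregpreterm / delinst
list district pregpreterm delinst rate_totals, noobs
restore
```

If the two sets of rates agree for every district, the difference in the original graphs comes only from which observations enter each total.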
Max Bricker

Join Date: Jul 2019

Posts: 3
#3

22 Jul 2019, 10:45

Thanks for your help. I understand now what you mean about the missing variables. So this new code should use the same observations for both approaches. However I lose some observations in the first part of the code, so I have to determine if that is a reasonable constraint to put on the data given what I know about it.

Thanks again for the help!
Nick Cox

Join Date: Mar 2014

Posts: 35688
#4

22 Jul 2019, 10:54

It's missing values, not missing variables, that are part of the problem. A variable is, in other terms, an entire column in the dataset.

I don't understand the complaint in #3. If either the numerator or the denominator is missing, you can't calculate a ratio, or equivalently the ratio will be calculated as a missing value.

Both methods ignore observations with any missing values: that is why they give the same result. I don't see that you lose any observations that aren't useless for the purpose. In

Code:

gen pretermrate2 = 100 * pregpreterm/delinst

a missing value anywhere on the right-hand side results in missings for the result.
Max Bricker

Join Date: Jul 2019

Posts: 3
#5

22 Jul 2019, 11:12

Sorry if it seemed like a complaint; it was just a comment about the data I am working with. My point was that in this dataset, if there are no preterm births in a given facility, preterm births may be set to missing instead of zero (this is a very poorly maintained dataset). So I have to make a judgment call about whether to include observations where preterm births are missing but total births are not: do I believe that the missing value actually denotes 0 preterm births, or is it legitimately missing, so that I should exclude the observation from this graph? If I write

Code:

bys district: egen pretermtotal=total(pregpreterm), missing
bys district: egen totaldels=total(delinst), missing
gen pretermrate1=100*pretermtotal/totaldels

then I get to include every observation for which there is any value in either pregpreterm or delinst, which might be desirable.
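
If a missing pregpreterm really does mean zero preterm births, that choice can be made explicit in the weighted version instead of being left implicit in how egen's total() handles missings. This is only a sketch under that assumption about the data; pregpreterm0 and pretermrate3 are hypothetical names.

```stata
* Sketch only: recoding missing pregpreterm to 0 is an assumption about
* how the data were recorded, not something the data can confirm.
gen pregpreterm0 = cond(missing(pregpreterm), 0, pregpreterm) if !missing(delinst)
gen double pretermrate3 = 100 * pregpreterm0 / delinst

* Weighted mean of the recoded rates, weighted by deliveries
graph hbar (mean) pretermrate3 [fw=delinst], over(district) ///
    blabel(bar, size(small)) ytitle("rate of preterm births") ///
    name(g3, replace)
```

In the example data, where delinst is never missing, this should give the same district rates as the ratio of the two egen totals, while stating the zero-for-missing decision in one visible line.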

re missing values: yes, sorry for the brain fart.
