  • Determining the proportion of units that meet a criterion

    Hey everyone, say we have a dataset that looks like this:
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte(treat mediterranean) float indexedprice str7 yrweek byte id str9 fullname byte is_barcelona int time
    0 1 110.5 "2011-00"  6 "Donor 6"   0 1
    0 0 118.4 "2011-00" 14 "Donor 14"  0 1
    0 0 107.6 "2011-00" 17 "Donor 17"  0 1
    0 1  86.9 "2011-00" 19 "Donor 19"  0 1
    0 0 105.8 "2011-00" 20 "Donor 20"  0 1
    0 1  93.8 "2011-00" 28 "Barcelona" 1 1
    0 1  99.2 "2011-00" 30 "Donor 30"  0 1
    0 1   111 "2011-00" 31 "Donor 31"  0 1
    0 0  91.4 "2011-00" 32 "Donor 32"  0 1
    0 1 110.7 "2011-00" 36 "Donor 36"  0 1
    0 0 107.6 "2011-00" 38 "Donor 38"  0 1
    0 1  97.4 "2011-00" 40 "Donor 40"  0 1
    0 1 144.8 "2011-00" 41 "Donor 41"  0 1
    0 0  98.8 "2011-00" 45 "Donor 45"  0 1
    0 1  87.1 "2011-00" 48 "Donor 48"  0 1
    0 0  95.6 "2011-00" 49 "Donor 49"  0 1
    0 0 101.7 "2011-00" 53 "Donor 53"  0 1
    0 1    99 "2011-00" 54 "Donor 54"  0 1
    0 0    92 "2011-00" 57 "Donor 57"  0 1
    0 0 115.4 "2011-00" 59 "Donor 59"  0 1
    0 0  96.4 "2011-00" 60 "Donor 60"  0 1
    0 0  98.6 "2011-00" 62 "Donor 62"  0 1
    0 1 106.8 "2011-00" 65 "Donor 65"  0 1
    0 0   104 "2011-00" 72 "Donor 72"  0 1
    0 0 136.6 "2011-00" 73 "Donor 73"  0 1
    0 1 113.1 "2011-00" 76 "Donor 76"  0 1
    0 0 110.6 "2011-00" 80 "Donor 80"  0 1
    0 0 133.9 "2011-00" 83 "Donor 83"  0 1
    end
    I'm interested in determining the proportion of untreated units (that is, not Barcelona) that are on the Mediterranean Sea (mediterranean==1). How would I do this? I know I can just take the average with collapse or something, but I was curious whether there is a different way.
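    For reference, a minimal sketch of that averaging idea, using the variables from the dataex above (the mean of a 0/1 indicator is the proportion of interest):
    Code:
    * sketch: the mean of the 0/1 indicator is the proportion among untreated units
    summarize mediterranean if is_barcelona==0    // r(mean) holds the proportion
    mean mediterranean if is_barcelona==0         // same point estimate, with a 95% CI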

  • #2
    Code:
    * using collapse - least flexible because it relies on binary coded variables.
    * by groups are another option to consider.
    preserve
    collapse (count) denom=is_barcelona (sum) num=mediterranean if is_barcelona==0 & !mi(mediterranean)
    gen pr = num / denom
    list
    restore
    
    * more general
    gen byte want = mediterranean==1 if is_barcelona==0
    tab want
    Output:

    Code:
    . list
    
         +-------------------------+
         | denom   num          pr |
         |-------------------------|
      1. |    27    11   .40740741 |
         +-------------------------+
    
    . tab want
    
           want |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |         16       59.26       59.26
              1 |         11       40.74      100.00
    ------------+-----------------------------------
          Total |         27      100.00
    Edit to add: Of course, there are still other ways to go about this question, and I assumed Jared was asking for a programming solution. Given this toy data, you could simply ask for the cross-tab directly.

    Code:
    tab mediterranean if is_barcelona==0
    Last edited by Leonardo Guizzetti; 02 Jun 2023, 13:37.



    • #3
      In very large datasets, why not this?

      Code:
      regress mediterranean if is_barcelona==0



      • #4
        I didn't think about this, but using reg is a great idea!



        • #5
          I would gently urge you away from the use of -regress- for this problem, for three reasons.

          1) -regress- is far slower than -tab-, so it would waste time with large data, especially if only point estimates are needed. If confidence intervals are needed, -proportion- is a faster choice than -regress- (see the sketch after these points).
          On my machine, after expanding the dataex 10,000 times (N = 280,000 in total), -regress- took about 2 seconds, -tab- under 0.1 seconds, and -proportion- under 0.5 seconds. That is a noticeable slowdown on what is already a modest dataset.

          2) -regress- naturally fails if the proportion is 0% or 100%.

          3) If regression is the preferred choice, then with small data you'll get narrower confidence intervals using -logit-.
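
          A minimal sketch of the -proportion- and -logit- alternatives mentioned above, assuming the same variables as the dataex:
          Code:
          * direct estimate of the proportion, with a confidence interval
          proportion mediterranean if is_barcelona==0

          * constant-only logit; back-transform the constant to recover the proportion
          logit mediterranean if is_barcelona==0
          display invlogit(_b[_cons])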



          • #6
            Indeed, I think Leonardo is right: -regress- should be slower than -tab- for large datasets. However, I have used -regress- because it is faster than -collapse-, at least in my settings.
            The intercept in -regress- will not be affected if the proportion is 0 or 1. For very large datasets, the 95% CIs of any approach should be indistinguishable.
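
            A quick sketch of that point about the intercept, using the same variables as above (the constant in a covariate-free regression is simply the sample mean, i.e. the proportion):
            Code:
            * constant-only regression: _b[_cons] equals the sample proportion
            regress mediterranean if is_barcelona==0
            display _b[_cons]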



            • #7
              You’re right about the constant with regression. I had already jumped to logit in my mind. Another concern with -regress- is that confidence intervals may be out of bounds if the proportion is close to 0% or 100%.
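
              If bounded intervals are the concern, one option (a sketch, using the same variables) is -ci proportions- with a Wilson or exact interval, both of which stay within [0, 1]:
              Code:
              * binomial confidence intervals that cannot leave [0, 1]
              ci proportions mediterranean if is_barcelona==0, wilson
              ci proportions mediterranean if is_barcelona==0, exact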
