  • Determining the proportion of units that meet a criterion

    Hey everyone, say we have a dataset that looks like this:
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte(treat mediterranean) float indexedprice str7 yrweek byte id str9 fullname byte is_barcelona int time
    0 1 110.5 "2011-00"  6 "Donor 6"   0 1
    0 0 118.4 "2011-00" 14 "Donor 14"  0 1
    0 0 107.6 "2011-00" 17 "Donor 17"  0 1
    0 1  86.9 "2011-00" 19 "Donor 19"  0 1
    0 0 105.8 "2011-00" 20 "Donor 20"  0 1
    0 1  93.8 "2011-00" 28 "Barcelona" 1 1
    0 1  99.2 "2011-00" 30 "Donor 30"  0 1
    0 1   111 "2011-00" 31 "Donor 31"  0 1
    0 0  91.4 "2011-00" 32 "Donor 32"  0 1
    0 1 110.7 "2011-00" 36 "Donor 36"  0 1
    0 0 107.6 "2011-00" 38 "Donor 38"  0 1
    0 1  97.4 "2011-00" 40 "Donor 40"  0 1
    0 1 144.8 "2011-00" 41 "Donor 41"  0 1
    0 0  98.8 "2011-00" 45 "Donor 45"  0 1
    0 1  87.1 "2011-00" 48 "Donor 48"  0 1
    0 0  95.6 "2011-00" 49 "Donor 49"  0 1
    0 0 101.7 "2011-00" 53 "Donor 53"  0 1
    0 1    99 "2011-00" 54 "Donor 54"  0 1
    0 0    92 "2011-00" 57 "Donor 57"  0 1
    0 0 115.4 "2011-00" 59 "Donor 59"  0 1
    0 0  96.4 "2011-00" 60 "Donor 60"  0 1
    0 0  98.6 "2011-00" 62 "Donor 62"  0 1
    0 1 106.8 "2011-00" 65 "Donor 65"  0 1
    0 0   104 "2011-00" 72 "Donor 72"  0 1
    0 0 136.6 "2011-00" 73 "Donor 73"  0 1
    0 1 113.1 "2011-00" 76 "Donor 76"  0 1
    0 0 110.6 "2011-00" 80 "Donor 80"  0 1
    0 0 133.9 "2011-00" 83 "Donor 83"  0 1
    end
    I'm interested in determining the proportion of untreated units (that is, not Barcelona) that are on the Mediterranean Sea (mediterranean==1). How would I do this? I know I can just take the average with collapse or something, but I was curious whether there is a different way.
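    For reference, a minimal sketch of that averaging idea, using the variables from the dataex above (the mean of a 0/1 indicator is the proportion of interest):
    Code:
    * sketch: the mean of the 0/1 indicator is the proportion among untreated units
    summarize mediterranean if is_barcelona==0    // r(mean) holds the proportion
    mean mediterranean if is_barcelona==0         // same point estimate, with a 95% CI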

  • #2
    Code:
    * using collapse - least flexible because it relies on binary coded variables.
    * by groups are another option to consider.
    preserve
    collapse (count) denom=is_barcelona (sum) num=mediterranean if is_barcelona==0 & !mi(mediterranean)
    gen pr = num / denom
    list
    restore
    
    * more general
    gen byte want = mediterranean==1 if is_barcelona==0
    tab want
    Output:

    Code:
    . list
    
         +-------------------------+
         | denom   num          pr |
         |-------------------------|
      1. |    27    11   .40740741 |
         +-------------------------+
    
    . tab want
    
           want |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |         16       59.26       59.26
              1 |         11       40.74      100.00
    ------------+-----------------------------------
          Total |         27      100.00
    Edit to add: Of course, there are still other ways to go about this question, and I assumed Jared was asking for a programming solution. Given this toy data, you could simply ask for the cross-tab directly.

    Code:
    tab mediterranean if is_barcelona==0
    Last edited by Leonardo Guizzetti; 02 Jun 2023, 13:37.



    • #3
      In very large datasets, why not this?

      Code:
      regress mediterranean if is_barcelona==0



      • #4
        I didn't think about this, but using reg is a great idea!



        • #5
          I would gently urge you away from the use of -regress- for this problem, for three reasons.

          1) -regress- is far slower than -tab-, so it would waste time with large data, especially if only point estimates are needed. If confidence intervals are needed, -proportion- is a faster choice than -regress- (see the sketch after these points).
          On my machine, after expanding the dataex 10,000 times (N = 280,000 in total), -regress- took about 2 seconds, -tab- under 0.1 seconds, and -proportion- under 0.5 seconds. That is a noticeable slowdown on what is already a modest dataset.

          2) -regress- naturally fails if the proportion is 0% or 100%.

          3) If regression is the preferred choice, then with small data you'll get narrower confidence intervals using -logit-.
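
          A minimal sketch of the -proportion- and -logit- alternatives mentioned above, assuming the same variables as the dataex:
          Code:
          * direct estimate of the proportion, with a confidence interval
          proportion mediterranean if is_barcelona==0

          * constant-only logit; back-transform the constant to recover the proportion
          logit mediterranean if is_barcelona==0
          display invlogit(_b[_cons])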



          • #6
            Indeed, I think Leonardo is right: -regress- should be slower than -tab- for large datasets. However, I have used -regress- because it is faster than -collapse-, at least in my settings.
            The intercept in -regress- will not be affected if the proportion is 0 or 1. For very large datasets, the 95% CIs of any approach should be indistinguishable.
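
            A quick sketch of that point about the intercept, using the same variables as above (the constant in a covariate-free regression is simply the sample mean, i.e. the proportion):
            Code:
            * constant-only regression: _b[_cons] equals the sample proportion
            regress mediterranean if is_barcelona==0
            display _b[_cons]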



            • #7
              You’re right about the constant with regression. I had already jumped to logit in my mind. Another concern with -regress- is that confidence intervals may be out of bounds if the proportion is close to 0% or 100%.
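
              If bounded intervals are the concern, one option (a sketch, using the same variables) is -ci proportions- with a Wilson or exact interval, both of which stay within [0, 1]:
              Code:
              * binomial confidence intervals that cannot leave [0, 1]
              ci proportions mediterranean if is_barcelona==0, wilson
              ci proportions mediterranean if is_barcelona==0, exact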
