Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Flag unique values relative to another variable

    I have a deceivingly complicated (to me, at least!) two-part question.

    We have a large dataset where zones are smaller geographical areas than counties. Multiple zones can be within a single county. But, it is not clear whether or not zones can stretch across counties. In the first data example below, I've created one zone for which this is true: zone 16 is in both county 14 and county 38.

    Does anyone have an efficient solution to creating some sort of flag variable that will tell us when this happens?

    Data for the first issue:
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte(zone county wave) int id
    13 35 1 100
    13 35 1 101
    13 35 1 102
    19 35 1 103
    19 35 1 104
    19 35 1 105
     2 22 1 106
     2 22 1 107
     2 22 1 108
     2 22 1 109
     2 22 1 110
     6 90 1 111
     6 90 1 112
    16 14 1 113
    16 14 1 114
    16 38 1 115
    16 38 1 116
    25 38 1 117
    25 38 1 118
    25 38 1 119
    end

    Now, let's say that in our real data we find out that it is NOT true that zones can spread across multiple counties. We have zone data for two waves; but only county data for wave 1.

    Thus, if it is true that zones can only belong to a single county, we could then theoretically backfill wave 2 county based on wave 1 county -- zone information. However, because a) individuals can move between waves 1 and 2, and b) multiple zones can be inside one county, this isn't quite as simple (I think) as just doing a simple equivalency. You'd have to account for, e.g., the fact that zones 13 and 19 are both within county 35, and can't just rely on an individual's zone/county staying the same between wave 1 and wave 2.

    Data for the second issue, with 2 waves and where no zones spread across counties (so zone 16 now only belongs in county 14, you'll also note that a few individuals reside in new zones):

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte(zone county wave) int id
    13 35 1 100
    13 35 1 101
    13 35 1 102
    19 35 1 103
    19 35 1 104
    19 35 1 105
     2 22 1 106
     2 22 1 107
     2 22 1 108
     2 22 1 109
     2 22 1 110
     6 90 1 111
     6 90 1 112
    16 14 1 113
    16 14 1 114
    16 14 1 115
    16 14 1 116
    25 38 1 117
    25 38 1 118
    25 38 1 119
    13  . 2 100
    13  . 2 101
    13  . 2 102
    19  . 2 103
    19  . 2 104
    19  . 2 105
     2  . 2 106
     2  . 2 107
     6  . 2 108
     25  . 2 109
     19  . 2 110
     6  . 2 111
     6  . 2 112
    16  . 2 113
    16  . 2 114
    16  . 2 115
    25  . 2 116
    25  . 2 117
    25  . 2 118
    25  . 2 119
    end

  • #2
    I can suggest a way to track the zone duplicates:

    Code:
    egen uniq_pair = tag(zone county)
    
    * To tabulate
    table zone, statistic(sum uniq_pair)
    
    * To create an indicator
    egen total_county = sum(uniq_pair), by(zone)
    list zone county if total_county > 1 & uniq_pair
    Not sure how to help with the rest, it seems more like a possible scenario than a question.

    Comment


    • #3
      Thanks for clear data example. Like Ken Chui I can wrap my head around the first question but the second is too hard for me to follow.

      Code:
      . bysort zone (county) : gen different = county[1] != county[_N]
      
      . 
      . tab zone if different 
      
             zone |      Freq.     Percent        Cum.
      ------------+-----------------------------------
               16 |          4      100.00      100.00
      ------------+-----------------------------------
            Total |          4      100.00
      See also https://www.stata.com/support/faqs/d...ions-in-group/

      Comment


      • #4
        Thanks both. For the second question, I'm basically just hoping to use Wave 1 information on zone --> county to fill in county, based on zone, in Wave 2. So if we observe in Wave 1 that zones 3, 10, and 15 are in county 25, then we'd ask Stata to fill in Wave 2 county == 25 if zone == 3 | zone == 10 | zone == 15.

        The contingency is: doing that would only be possible if zones cannot be in multiple counties (which is what I'm trying to figure out in the first part). Because if zone 3 crossed into county 25 and county 26, then knowing someone is in zone 3 in Wave 2 wouldn't be enough to slot them into one county--it could be either 25 or 26.

        The 2nd data example is set up as if it is true that zones cannot be in multiple counties.
        Last edited by Anne Todd; 01 Mar 2023, 18:16. Reason: Added detail

        Comment

        Working...
        X