Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Wanting to backfill observations for a gender variable (values M or F) by matching on an identifier.

    My dataset is several years of scholarship data, so enrollees are usually listed more than once. One of the years of data excludes the gender variable, I have it for all other years though, so I want to backfill the gender obs by matching on a unique id that I have created. In excel I would use an XLOOKUP. I want to do this in Stata today (better practice, right?). Example below - I cannot include more variables due to privacy concerns. Thanks for any guidance!

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str1 gender str62 concat float id_concat
    "F" "AHMAOGAKEDALEE"                    19
    "M" "ATOSRAYMOND CHRISTIAN"             56
    "M" "BARRAGAN-WILHELMOSCAR A"           69
    "F" "CHRISTENSENJAMIE LYNN"            138
    "F" "ELAVGAKGRETA LEE"                 212
    "M" "FOYLERUSSELL THOMAS"              256
    "F" "LEAVITTWILMA CARRIE"              450
    "F" "MERCULIEFBRITTE"                  507
    "M" "MONTATKELTY"                      522
    "F" "RICOTALICIA LITTLEFEATHER"        650
    "F" "SANDERSSHANA MARIE"               674
    "F" "SAXTONRAINA ROWENA"               682
    "M" "TAYLORRYAN BRADLEY"               769
    "F" "TELFAIRANGELICA DAWN"             770
    "F" "WILLIAMSALYSSA MICHELLE"          841
    "F" "AHSOAKAQPAURAQ MARY"               23
    "F" "BURNELLMONICA R"                  117
    "F" "DOWNEYAMBER"                      191
    "F" "DUNLAPJULIE ANNE"                 198
    "F" "DURIEZMARIE BETHANY"              200
    "F" "EDWARDSCAROLYN"                   201
    "F" "ELBERTAICLARISSA"                 214
    "M" "ELLIOTT JR.LEROY JOSEPH"          219
    "F" "FISCHERBERNADETTE AANAVAK"        241
    "F" "FRANTZDONNA KAY"                  268
    "F" "GLENNPATUK"                       280
    "F" "IVANOFFPATRICIA E"                355
    "F" "JUDKINSTENNESSEE DAWN"            371
    "M" "KAGAK-AGUILARALEXANDER THOMAS"    372
    "M" "KANAYURAKKEIFER ANGELO"           385
    "F" "MANUELGENESIS"                    477
    "F" "MAUPINSARAH MARY"                 491
    "F" "MCDERMOTTSHANNON LEIGH TUIGANA"   496
    "F" "MCFADDENKYMBERLY"                 498
    "F" "MEKIANABRITTANY ROSE"             503
    "M" "MONTATKELTY"                      522
    "F" "NESTEBYSERENA"                    567
    "F" "NINGEOKFLORENCE MARY"             572
    "M" "NUSUNGINYAVINCENT AARON"          576
    "F" "RUCKLEJENIFER LYNNE"              657
    "F" "RUSSELLLEANNA ROCHELLE"           662
    "F" "SANDERSSHANA MARIE"               674
    "F" "SANDERSJAIME LEE"                 673
    "F" "SHELLABARGERLAURIE KOLTES"        691
    "F" "SIMMONDSDOREEN N"                 704
    "F" "SOPLUARIEL R."                    725
    "F" "TINGOOKJANELLE ANN"               776
    "M" "TRUPPCONNOR"                      789
    "F" "WARRIORAURORA MARIE"              820
    "F" "WEBERKAITLANN JOY"                825
    "F" "WHITEMARTHA"                      833
    "F" "ABADCOLLEEN ESTELLE"                3
    "F" "AGUVLUKPIQUURAQ MARGARET"           8
    "F" "AHMAOGAKEDALEE"                    19
    "F" "ALBRIGHTALYSSA HAYLEY"             41
    "F" "AMPONSAHNAA AYELE"                 44
    "M" "AMPONSAHOHENEBA KOFI"              46
    "F" "APPLEGARTHSARAH MICHELLE"          54
    "M" "ATOSRAYMOND CHRISTIAN"             56
    "F" "AVALOSLORYN"                       58
    "F" "BAGGETTMIKAYLA"                    61
    "F" "BANKSTONALYNNE REANE"              67
    "F" "BANKSTONALISON R."                 66
    "M" "BARDONKAINOA JOHN LAAKEAOKAHONUA"  68
    "M" "BARRAGAN-WILHELMOSCAR A"           69
    "F" "BELZSYDNEY JOLYNN"                 74
    "F" "BORSTADMIKELLA KYLANDER"           84
    "M" "BOYNTONJUSTIN KADE"                89
    "M" "BROWERTHOMAS PANEATAQ"            106
    "M" "BROWERTERANCE JAMES"              104
    "F" "BROWERTERZA KASAK"                105
    "F" "BURNSTRACY LYNN"                  119
    "F" "CAINGABRIELLE SUZANNE"            122
    "M" "CARRPAUL ANDREW"                  132
    "F" "CARROLLKAYLA TAMARA"              130
    "F" "CLARKHANNAH"                      141
    "M" "COOKLESLIE UGAROOK"               146
    "M" "COOPERCORIN JAMES"                148
    "M" "CROSBYDARREN JAMES KUNNIAQ"       156
    "F" "DANNERMICHELE JOY"                170
    "F" "DANNERKORY ANN"                   167
    "M" "DARLINGRUSSELL JONAS"             172
    "F" "DEUBLERMEAGAN"                    184
    "F" "DEUBLERLAUREN"                    183
    "F" "DIAZLIZZIE"                       186
    "M" "DIAZJOSEPH MARIO"                 185
    "M" "DIEMERTJESSE DIEMERT"             187
    "F" "DINGMANHEATHER MARIE"             188
    "F" "DRISCOLLCASSANDRA RENEE"          196
    "F" "DUNLAPJULIE ANNE"                 198
    "F" "EICKHOLTAMBER LYNN"               210
    "M" "ELAVGAKDAVID FREDERICK"           211
    "F" "ELBERTAICLARISSA"                 214
    "M" "ELDER WATERSMICHAEL STEPHEN"      216
    "F" "ELDER WATERSRACHEL ELIZABETH"     217
    "M" "ERICKSONTALON JAMES"              227
    "F" "FAULKNERANGELENE DORIS"           232
    "F" "FERRISCARLY DANA"                 235
    "M" "FERRISJUSTIN BRADLEY"             236
    "F" "FILEFLORA KATE"                   238
    end

  • #2
    There is here no year variable and the data example doesn't include any missing values. Nevertheless, stripolate is available from SSC as part of the mipolate package.

    For more, see https://www.statalist.org/forums/for...-interpolation

    Here is an example.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(year id) str1 gender
    2020 1 "F"
    2021 1 "" 
    2022 1 "" 
    2020 2 "M"
    2021 2 "M"
    2022 2 "" 
    end
    
    -
    . stripolate gender year, by(id) gen(new) groupwise
    
    .. list, sepby(id)
    
         +--------------------------+
         | year   id   gender   new |
         |--------------------------|
      1. | 2020    1        F     F |
      2. | 2021    1              F |
      3. | 2022    1              F |
         |--------------------------|
      4. | 2020    2        M     M |
      5. | 2021    2        M     M |
      6. | 2022    2              M |
         +--------------------------+

    Comment


    • #3
      Thanks Nick. Here is a better dataex with the trouble year and a normal year provided. I'll work with your code and share any magic

      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input str1 gender str62 concat float id_concat str18 sq
      "" "KHANSELMA NADINE" 398 "Spring 2022 no DOB"
      "" "WAGNERANGELINA" 814 "Spring 2022 no DOB"
      "" "AWALINEDNA SUE" 59 "Spring 2022 no DOB"
      "" "KILLBEARKAYLENE" 403 "Spring 2022 no DOB"
      "" "DANNERERIK DARIAN" 165 "Spring 2022 no DOB"
      "" "PEREAJASMINE NICOLE TIGIQURAQ" 624 "Spring 2022 no DOB"
      "" "NYQUISTKIATCHA LOUISE" 577 "Spring 2022 no DOB"
      "" "ELDER WATERSMICHAEL STEPHEN" 216 "Spring 2022 no DOB"
      "" "HILLDONNA MAE" 322 "Spring 2022 no DOB"
      "" "WOLGEMUTHSARA LYNAE" 849 "Spring 2022 no DOB"
      "" "DESCOMBESLEYA MARIE" 181 "Spring 2022 no DOB"
      "" "WILLIAMSALYSSA MICHELLE" 841 "Spring 2022 no DOB"
      "" "LOWERYJERRY DILLON ROY" 470 "Spring 2022 no DOB"
      "" "TELFAIRANGELICA DAWN" 770 "Spring 2022 no DOB"
      "" "FISHERSADIE KARINA" 244 "Spring 2022 no DOB"
      "" "UPICKSOUNHILARY JOYCE" 803 "Spring 2022 no DOB"
      "" "GUTIERREZPAULINA ANGELICA ECKLEY" 300 "Spring 2022 no DOB"
      "" "STEWARTMADELEINE" 745 "Spring 2022 no DOB"
      "" "WELLSSTACEY" 829 "Spring 2022 no DOB"
      "" "CARROLLKAYLA TAMARA" 130 "Spring 2022 no DOB"
      "" "REXFORDLINDSAY" 642 "Spring 2022 no DOB"
      "" "SMITHMARRIAH SHEA" 713 "Spring 2022 no DOB"
      "F" "DAVISJAMIE LOREN" 173 "Summer 22"
      "M" "OMNIKJOHN ADAM" 588 "Summer 22"
      "M" "AKOOTCHOOKBERDELL PANINGUNA" 37 "Summer 22"
      "F" "GREENE- HULLSARA LOVE" 292 "Summer 22"
      "F" "ANDERSONMARLENE" 50 "Summer 22"
      "F" "LANEDAISY SUEYUK" 426 "Summer 22"
      "M" "TEVUKDWIGHT" 771 "Summer 22"
      "F" "NICELYMACKENZEE" 569 "Summer 22"
      "F" "MATTHEWSDIONDRA" 485 "Summer 22"
      "F" "SEARSNELLIE" 685 "Summer 22"
      "F" "JONESJAIME JOYCE" 366 "Summer 22"
      "F" "WIEHLMONICA" 839 "Summer 22"
      "M" "TOOYAK IIIANDREW" 784 "Summer 22"
      "F" "ALBRIGHTALYSSA HAYLEY" 41 "Summer 22"
      "F" "NYQUISTKIATCHA LOUISE" 577 "Summer 22"
      "F" "PONCESALENA" 629 "Summer 22"
      "F" "SOPLUARIEL" 724 "Summer 22"
      "F" "NEHERSARAH" 561 "Summer 22"
      "F" "DECKERKRISTINA MAY" 177 "Summer 22"
      "F" "MCALISTERKIMMIALUK" 494 "Summer 22"
      "F" "LEAVITT-MIGUELASIA MARIE" 436 "Summer 22"
      "F" "EDWARDSCAROLYN" 201 "Summer 22"
      "M" "LEMAYKEITH ROBERT" 453 "Summer 22"
      "F" "REICHBRANDYE" 636 "Summer 22"
      "M" "TOOYAKAARON KIPPOGAN" 786 "Summer 22"
      "M" "TOOYAK, JR.ANDREW" 785 "Summer 22"
      "F" "GLENNROBERTA" 281 "Summer 22"
      "M" "DANJINNICHOLAS" 163 "Summer 22"
      "F" "BOWENASHLEY" 86 "Summer 22"
      "M" "PANIGEOCHARLES" 607 "Summer 22"
      "F" "OUDEANANGELA LOUISE" 591 "Summer 22"
      "F" "CAINKRISTINA JENNIE" 123 "Summer 22"
      "F" "CRISTTOMMI" 155 "Summer 22"
      "F" "FLORESALICIA" 247 "Summer 22"
      "F" "FEATHERLYTAMARA" 233 "Summer 22"
      "M" "SANCHEZMATTHEW JAMES" 671 "Summer 22"
      "M" "MURANOPHILLIP JOSEPH" 541 "Summer 22"
      "F" "BROWERMURIEL D." 101 "Summer 22"
      "F" "MARTINROSEMARY" 482 "Summer 22"
      "M" "BAHRREMY" 62 "Summer 22"
      "F" "WELLSSTACEY" 829 "Summer 22"
      "F" "WAGNERANGELINA" 814 "Summer 22"
      "F" "PATKOTAKABRA" 612 "Summer 22"
      "F" "TOOVAKKAREN" 783 "Summer 22"
      "M" "NALIKAKURIAH RONALD" 545 "Summer 22"
      "F" "VALENZUELAMARIAM" 807 "Summer 22"
      "M" "SMITHZACKARY" 718 "Summer 22"
      "F" "PEREAJASMINE NICOLE TIGIQURAQ" 624 "Summer 22"
      "F" "HILLDONNA MAE" 322 "Summer 22"
      "F" "FRANKSONHANNAH" 258 "Summer 22"
      "F" "SMITHJESSI LYNN" 711 "Summer 22"
      "M" "TOOYAK IIIANDREW" 784 "Summer 22"
      "F" "TAYLORCASSIE" 767 "Summer 22"
      "F" "TELFAIRANGELICA DAWN" 770 "Summer 22"
      end
      Last edited by raniyah bakr; 01 Feb 2023, 12:24.

      Comment


      • #4
        Originally posted by Nick Cox View Post
        There is here no year variable and the data example doesn't include any missing values. Nevertheless, stripolate is available from SSC as part of the mipolate package.

        For more, see https://www.statalist.org/forums/for...-interpolation

        Here is an example.

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input float(year id) str1 gender
        2020 1 "F"
        2021 1 ""
        2022 1 ""
        2020 2 "M"
        2021 2 "M"
        2022 2 ""
        end
        
        -
        . stripolate gender year, by(id) gen(new) groupwise
        
        .. list, sepby(id)
        
        +--------------------------+
        | year id gender new |
        |--------------------------|
        1. | 2020 1 F F |
        2. | 2021 1 F |
        3. | 2022 1 F |
        |--------------------------|
        4. | 2020 2 M M |
        5. | 2021 2 M M |
        6. | 2022 2 M |
        +--------------------------+
        Nick: to deconstruct the code, what is the purpose of adding the 'year' var? Is that to identify the var with missing obs? Thanks!

        Comment


        • #5
          In my understanding this kind of interpolation is always within sequences and some variable is needed to define at least the correct order and sometimes exactly when or where a value occurred.

          Comment


          • #6
            I would approach this differently from what Nick has suggested. First, since ordinarily gender would not change over time, to be safe, you should verify that the non-missing values of gender for each person are consistent. If they aren't, then it is not clear what to use to backfill the missing values. In my experience, when data on the same people are pulled from multiple sources, inconsistencies in variables like gender that should be constant are actually pretty common. Second, given that one will just be copying the one consistent non-missing value of gender, I don't think the time variable is actually needed here. I see this is a copying action, not an interpolation. So, I would do this as:

            Code:
            by id (gender), sort: assert gender == gender[_N] | missing(gender)
            by id (gender): replace gender = gender[_N]
            The logic behind this is that gender is a string variable, and in strings variables, missing sorts to the top and non-missing values to the bottom.

            Comment


            • #7
              Clyde Schechter is clearly correct in urging caution, but the approaches are closer than he implies, as it's documented that groupwise will not act if there is inconsistent information.

              groupwise specifies that non-missing values be copied to missing values if, and only if, just one distinct non-missing value occurs in each
              group. Thus a group of values "", "A", "", "" qualifies as "A" is not missing and is the only non-missing value in the group. Hence the
              missing values in that group will be replaced with "A" in the new variable. By the same rules "A", "", "A", "" qualifies but "A", "",
              "B", "" does not. Normally, but not necessarily, this option is used in conjunction with by:, which is how groups are specified;
              otherwise the (single) group is the entire set of observations being used.
              Besides, there are other options in stripolate.

              The syntax of stripolate was written to be consistent with other interpolation commands, the grandparent of all being ipolate. Note that a sequence variable really is needed for forward or backward copying to be applied. It's easy enough for data to get out of time order if they are sorted on some other variable, but any interpolation command corrects that.
              Last edited by Nick Cox; 01 Feb 2023, 17:23.

              Comment


              • #8
                I would do what Clyde suggests.

                On another note, one is learning something new every days. I thought that Stata always considered missing values larger than anything else. It is a surprise to me to learn that if you sort on a string variable the missings come first. I have never encountered this fact because I never work with string variables, except maybe when they are identifiers.

                And in the line of thought of avoiding working with strings, I would just define a variable say female, equal to 1 for females and 0 for males, and then I would use egen functions to spread out the sex across the missings by say -egen, max()-, or -egen, min()-.

                Comment


                • #9
                  Making a numeric indicator variable with values 0 and 1 consistent across groups is a simpler analogue of the problem in this thread, as good enough code for interpolating (or imputing) strings works with arbitrary string variables. You would often need to check that the minimum and maximum of a (0, 1) variable were identical before filling in the missings.

                  As already pointed out, the approaches of #2 and #6 are really identical. The history is relevant: I too posted variants of #6 over many years but the facts are that (1) it requires some fluency with by: and associated trickery to see how to do it, which is beyond what most new or occasional users see easily, and no criticism there (2) there were persistent requests for a string interpolation command, which I resisted until I wrote one, and it's really not a big deal.

                  Having two or more ways to do something beats not knowing any: no more, no less.

                  Comment


                  • #10
                    Originally posted by Clyde Schechter View Post
                    I would approach this differently from what Nick has suggested. First, since ordinarily gender would not change over time, to be safe, you should verify that the non-missing values of gender for each person are consistent. If they aren't, then it is not clear what to use to backfill the missing values. In my experience, when data on the same people are pulled from multiple sources, inconsistencies in variables like gender that should be constant are actually pretty common. Second, given that one will just be copying the one consistent non-missing value of gender, I don't think the time variable is actually needed here. I see this is a copying action, not an interpolation. So, I would do this as:

                    Code:
                    by id (gender), sort: assert gender == gender[_N] | missing(gender)
                    by id (gender): replace gender = gender[_N]
                    The logic behind this is that gender is a string variable, and in strings variables, missing sorts to the top and non-missing values to the bottom.
                    This is great, thanks Clyde!

                    Comment

                    Working...
                    X