  • Obvious deviation resulting from customary 'encode' and 'recode' of a string variable

    Hi Community, I need to recode the original string variable ragender into a numeric variable and assign new values to it, namely 0 for Male (encoded as 1) and 1 for Female (encoded as 2). But the customary code produced a very different split between Male and Female after recoding. I do not see an obvious logic error in the code. Any ideas?

    //----- Tab original variable ---------------
    . tab ragender, m

       ragender |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              . |         11        0.04        0.04
         1.Male |     12,296       48.21       48.26
       2.FeMale |      1,472        5.77       54.03
       2.Female |     11,725       45.97      100.00
    ------------+-----------------------------------
          Total |     25,504      100.00


    //----------encode & recode the string variable ------
    Code:
    encode ragender, gen(encode_r1gender)
    recode encode_r1gender (1=0)(2=1)(else=.)
    lab var encode_r1gender "respondents' gender"
    lab val encode_r1gender r1gender_lab1
    lab def r1gender_lab1 0 "0 Male" 1 "1 Female"

    //---------Obvious deviation resulted from encode & recode---------
    . tab encode_r1gender, m

    respondents |
       ' gender |      Freq.     Percent        Cum.
    ------------+-----------------------------------
         0 Male |         11        0.04        0.04
       1 Female |     12,296       48.21       48.26
              . |     13,197       51.74      100.00
    ------------+-----------------------------------
          Total |     25,504      100.00

  • #2
    Looks like your Males have all been given 1 Female as a value, your missing values have been recoded as 0 Male, and everything else is coded as missing. Is it just me, or do you have two separate values for female: labels FeMale and Female? My guess is that the numbers that actually represent these values are not what you expect. For instance, it looks like perhaps things labeled as missing actually have a value of 1 in the original variable. Notice that

    Code:
    tab ragender, m
    Should show you missing values as the last entry. This makes me think you have "." strings rather than missing values in your string variable.
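
    A quick way to test that guess, sketched on a small made-up dataset (the three rows below are illustrative, not the poster's data): missing() is true for a string variable only when it holds the empty string, so a literal "." is counted as an ordinary value.

    ```stata
    * Toy data: a string variable holding a literal "." (illustrative only)
    clear
    input str10 ragender
    "."
    "1.Male"
    "2.Female"
    end

    * For strings, missing() is true only for the empty string ""
    count if missing(ragender)

    * The literal "." is an ordinary one-character string
    count if ragender == "."
    ```

    If the second count is positive while the first is zero, the dots are string values and encode will give them a code of their own.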


    • #3
      Thanks for replying, Daniel. You're right that the original string variable 'ragender' has "." strings rather than missing values. I removed the inconsistent spellings of the Female label, but that did not fix the issue. The deviation persists.

      . tab ragender, m

         ragender |      Freq.     Percent        Cum.
      ------------+-----------------------------------
                . |         11        0.04        0.04
           1.Male |     12,296       48.21       48.26
         2.Female |     13,197       51.74      100.00
      ------------+-----------------------------------
            Total |     25,504      100.00

      //--------- Deviation persists ------------
      . encode ragender, gen(encode_r1gender)

      . recode encode_r1gender (1=0)(2=1)(else=.)
      (encode_r1gender: 25504 changes made)

      . lab var encode_r1gender "respondents' gender"

      . lab val encode_r1gender r1gender_lab1

      . lab def r1gender_lab1 0 "0 Male" 1 "1 Female"

      .
      . tab encode_r1gender, m

      respondents |
         ' gender |      Freq.     Percent        Cum.
      ------------+-----------------------------------
           0 Male |         11        0.04        0.04
         1 Female |     12,296       48.21       48.26
                . |     13,197       51.74      100.00
      ------------+-----------------------------------
            Total |     25,504      100.00


      • #4
        Hi Sophia,

        It doesn't look like you've addressed everything I mentioned in #2, though I am pleased to see you appear to have only one "female" level this time around. You acknowledge that "." strings exist in the input string column, but you haven't done anything to address that. Encode will not translate these strings into missing values. Please try the following code and post the results:

        Code:
        tab ragender, m
        encode ragender, gen(encode_r1gender)
        recode encode_r1gender (1=.)(2=0)(3=1)
        lab var encode_r1gender "respondents' gender"
        lab val encode_r1gender r1gender_lab1
        lab def r1gender_lab1 0 "0 Male" 1 "1 Female"
        tab encode_r1gender, m
        checkvar ragender encode_r1gender
        Last edited by Daniel Schaefer; 08 Aug 2023, 22:04.


        • #5
          A secondary comment: longstanding advice is to name a (0, 1) indicator variable for the category coded 1. So here Female or female fits the bill better than some ambiguous name such as a variation on gender or sex.

          I have heard too many stories about presentations in which someone asked which way round a gender indicator (dummy variable) was coded, and the researcher couldn't answer in the heat of the moment. Nor do you want to make your readers (of a paper, report, dissertation, thesis, book) search hard for the definition (which here, and everywhere, is crucial to interpretation).

          This point is one of several gathered together in https://journals.sagepub.com/doi/pdf...36867X19830921. The custom goes way back before that paper and can be seen in Stata's auto data, in which foreign is so named.

          Orthogonal yet again is whether a binary indicator is sufficient here, noting that missings are allowed. In practice most researchers here are downstream of someone else's decision on how to code up people's answers on gender, or on which categories were offered.
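
          A minimal sketch of that naming advice, assuming the cleaned strings are exactly "1.Male", "2.Female" and "." (toy data; the variable name female and its label text are illustrative choices, not taken from the thread's dataset):

          ```stata
          * Toy data standing in for the cleaned string variable (illustrative only)
          clear
          input str10 ragender
          "1.Male"
          "2.Female"
          "."
          end

          * Name the (0, 1) indicator after the category coded 1;
          * rows holding the literal "." are left as numeric missing
          generate byte female = (ragender == "2.Female") if ragender != "."
          label variable female "Respondent is female"
          tab female, m
          ```

          With the variable called female, a value of 1 needs no lookup to interpret.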


          • #6
            Nick, thanks for providing a relevant reference (https://journals.sagepub.com/doi/pdf...36867X19830921). It uses two existing variables to generate a new variable in the code "generate foreign_himpg = (foreign == 1) & (mpg > 30)". But in my case there is only one indicator, 'ragender', in wave 1, and no other related variable that could be combined with it to define Female or Male. Am I misunderstanding this piece, or did you mean other points in that reference? Or were you pointing to the section [alternatives to 0 and 1 as codes for binary states] instead?


            • #7
              Thanks for your input, Daniel, but running your suggested code did not give the desired result.

              //------------- results of your iteration -------------------
              . recode encode_r1gender (1=.)(2=0)(3=1)
              (encode_r1gender: 25504 changes made)

              .
              end of do-file

              . lab var encode_r1gender "respondents' gender"


              . lab val encode_r1gender r1gender_lab1


              . lab def r1gender_lab1 0 "0 Male" 1 "1 Female"


              . tab encode_r1gender, m

              respondents |
                 ' gender |      Freq.     Percent        Cum.
              ------------+-----------------------------------
                        . |     25,504      100.00      100.00
              ------------+-----------------------------------
                    Total |     25,504      100.00



              . checkvar ragender encode_r1gender
              command checkvar is unrecognized
              r(199);

              end of do-file

              r(199);

              //---------- Another try of mine as below -------------
              . encode ragender, gen(encode_r1gender)

              . recode encode_r1gender (1=0)(2=1)(.=.)
              (encode_r1gender: 12307 changes made)

              . lab var encode_r1gender "respondents' gender"

              . lab val encode_r1gender r1gender_lab1

              . lab def r1gender_lab1 0 "0 male" 1 "1 female"


              . tab encode_r1gender, m

              respondents |
                 ' gender |      Freq.     Percent        Cum.
              ------------+-----------------------------------
                   0 male |         11        0.04        0.04
                 1 female |     12,296       48.21       48.26
                        3 |     13,197       51.74      100.00
              ------------+-----------------------------------
                    Total |     25,504      100.00

              .


              • #8
                I had two points in #5. The more important one, I guess, for your purposes is just to urge that you choose an informative name for your indicator variable. As said, this point is one of several made in my 2019 paper with Clyde Schechter, but was even then long since standard.

                The example you cite from the paper, in which an indicator is constructed from two variables, was not on my mind, was not referred to in #5 and is not, so far as I can see, at all relevant to what you're trying, so I am puzzled why you raise it at all.

                I hope that helps with what you found unclear.


                • #9
                  Hi Sophia,

                  Oops, checkvar is on SSC. Can you show us the output of the following?

                  Code:
                  encode ragender, gen(encode_r1gender)
                  tab encode_r1gender, m
                  tab encode_r1gender, m nolab


                  • #10
                    I have to say, I'm really surprised this doesn't work:

                    Code:
                    encode ragender, gen(encode_r1gender)
                    recode encode_r1gender (1=.)(2=0)(3=1)
                    I'm struggling to think of a scenario where the recode line would result in all missing. It seems like this should only happen if ragender contains only one category, contains only the empty string, or contains only one category and the empty string. Otherwise, even if the line is incorrect and 1, 2, and 3 don't refer to the correct category values, it still shouldn't recode to a column with all values missing.

                    It really seems like you have a situation essentially like this:

                    Code:
                    clear
                    input str80(gender)
                    "."
                    "1.male"
                    "1.male"
                    "2.female"
                    "2.female"
                    "2.female"
                    end
                    Which results in a string variable with three levels:

                    Code:
                    . tab gender, m
                    
                                                     gender |      Freq.     Percent        Cum.
                    ----------------------------------------+-----------------------------------
                                                          . |          1       16.67       16.67
                                                     1.male |          2       33.33       50.00
                                                   2.female |          3       50.00      100.00
                    ----------------------------------------+-----------------------------------
                                                      Total |          6      100.00
                    
                    .
                    If you encode that variable, you get a categorical variable with three levels:

                    Code:
                    . encode gender, gen(egender)
                    
                    . tab egender, m
                    
                        egender |      Freq.     Percent        Cum.
                    ------------+-----------------------------------
                              . |          1       16.67       16.67
                         1.male |          2       33.33       50.00
                       2.female |          3       50.00      100.00
                    ------------+-----------------------------------
                          Total |          6      100.00
                    
                    . tab egender, m nolab
                    
                        egender |      Freq.     Percent        Cum.
                    ------------+-----------------------------------
                              1 |          1       16.67       16.67
                              2 |          2       33.33       50.00
                              3 |          3       50.00      100.00
                    ------------+-----------------------------------
                          Total |          6      100.00
                    
                    .
                    Note that none of the values are set to the Stata missing value (not even the "." string):

                    Code:
                    . count if missing(egender)
                      0
                    If I recode I get the expected result:

                    Code:
                    . recode egender (1=.) (2=0) (3=1)
                    (6 changes made to egender)
                    
                    . rename egender female
                    
                    . tab female, m nolab
                    
                         female |      Freq.     Percent        Cum.
                    ------------+-----------------------------------
                              0 |          2       33.33       33.33
                              1 |          3       50.00       83.33
                              . |          1       16.67      100.00
                    ------------+-----------------------------------
                          Total |          6      100.00
                    
                    .
                    And I can even reproduce the same patterns you get in your alternate attempts, e.g.:

                    Code:
                    clear
                    input str80(gender)
                    "."
                    "1.male"
                    "1.male"
                    "2.female"
                    "2.female"
                    "2.female"
                    end
                    
                    encode gender, gen(egender)
                    recode egender (1=0)(2=1)(.=.)
                    lab val egender r1gender_lab1
                    lab def r1gender_lab1 0 "0 male" 1 "1 female"
                    tab egender, m
                    Code:
                    . tab egender, m
                    
                        egender |      Freq.     Percent        Cum.
                    ------------+-----------------------------------
                         0 male |          1       16.67       16.67
                       1 female |          2       33.33       50.00
                              3 |          3       50.00      100.00
                    ------------+-----------------------------------
                          Total |          6      100.00
                    
                    .
                    Last edited by Daniel Schaefer; 09 Aug 2023, 16:32. Reason: clearer example


                    • #11
                      Hello Daniel, I tried your code from #4 again and it worked on another copy of the data. It seems the earlier .dta file was broken. The logic of "recode encode_r1gender (1=.)(2=0)(3=1)" did not make sense to me, because 'recode encode_r1gender (1=0)(2=1)(.=.)' has worked perfectly well in many other cases.

                      Besides, you're right about what my data source looks like. I guess the features of string types with prevalent string '.' affected the customary coding logic.
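
                      The mapping behind this can be inspected directly. A minimal sketch, on toy data like that in #10 (illustrative rows only): encode numbers the distinct strings in sorted order starting from 1, and "." sorts before "1.male", so it takes code 1 and pushes the real categories to 2 and 3.

                      ```stata
                      * Toy data like the example in #10 (illustrative only)
                      clear
                      input str10 gender
                      "."
                      "1.male"
                      "2.female"
                      end

                      * encode assigns integer codes to the sorted distinct strings
                      encode gender, generate(egender)

                      * Show which integer was attached to which string
                      label list egender

                      * Same information, row by row
                      list gender egender, nolabel
                      ```

                      Checking label list before writing any recode rule shows whether 1 and 2 really are the two categories of interest.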

                      //--------Results in #4 ---

                      . encode ragender, gen(encode_r1gender)

                      . recode encode_r1gender (1=.)(2=0)(3=1)
                      (encode_r1gender: 25504 changes made)

                      . lab var encode_r1gender "respondents' gender"

                      . lab val encode_r1gender r1gender_lab1

                      . lab def r1gender_lab1 0 "0 Male" 1 "1 Female"

                      . tab encode_r1gender, m

                      respondents |
                         ' gender |      Freq.     Percent        Cum.
                      ------------+-----------------------------------
                           0 Male |     12,296       48.21       48.21
                         1 Female |     13,197       51.74       99.96
                                . |         11        0.04      100.00
                      ------------+-----------------------------------
                            Total |     25,504      100.00


                      • #12
                        Hello Nick, thanks for your comment. I strongly agree that indicator variables make much more sense than dummy variables. Although your paper has several informative points, I did not find a code example that transfers to my case. I did not understand your point when you mentioned how the variable foreign was named in #5, and that was why I questioned the relevance of 'generate foreign_himpg = (foreign == 1) & (mpg > 30)' in #6.


                        • #13
                          Hi Sophia,

                          I'm glad to hear you were able to solve this.

                          I guess the features of string types with prevalent string '.' affected the customary coding logic.
                          Exactly. It looks to me like the data file is the source of your troubles. The . character represents a missing numeric value. Strings use the empty string to represent missing data (literally a string without any characters in it, like this -> "" ). The dot character in a string (-> ".") looks like it should be missing, but it is actually a string like any other, not a missing value. I can definitely see why that might be confusing. It's a good thing you caught the issue, since this is a silent error.
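
                          One possible way to guard against this, sketched on toy data (the names gender_clean, female and female_lab are illustrative): blank out the "." strings first, so that encode turns them into numeric missing and the remaining categories get codes 1 and 2.

                          ```stata
                          * Toy data with a literal "." string (illustrative only)
                          clear
                          input str10 ragender
                          "."
                          "1.Male"
                          "2.Female"
                          "2.Female"
                          end

                          * Work on a copy so the original strings stay untouched
                          generate gender_clean = ragender
                          replace gender_clean = "" if gender_clean == "."

                          * Empty strings become numeric missing under encode
                          encode gender_clean, generate(female)

                          * Now 1 = "1.Male" and 2 = "2.Female", so the customary recode applies
                          recode female (1=0) (2=1)
                          label define female_lab 0 "0 Male" 1 "1 Female"
                          label values female female_lab
                          tab female, m
                          ```

                          Attaching a fresh value label after the recode matters here, because the label that encode created still maps the old codes 1 and 2.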


                          • #14
                            Sophia, are you really sure that "FeMale" is simply a spelling mistake (capital M instead of lower case m)? Questionnaires are currently widespread in which, when asking about gender, not only the answer options "male" and "female" are used, but also the option "non-binary". It may be that somebody tried to denote the latter by using a combination of Female and Male in the form "FeMale". That about 5.8% of a sample would choose "non-binary" is not exceptional.

                            Whether questions like this are useful is a different matter (for example, we found that respondents who answered "non-binary" show more inattentive responding behavior and are far more likely to report having committed violent acts), but if my guess is correct it would be a serious mistake to confuse "Female" and "FeMale".
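
                            Before merging any spellings, it may help to list the values that differ only by letter case, so that each one can be checked against the codebook. A rough sketch on toy data (the helper variables lowered and variants are hypothetical names):

                            ```stata
                            * Toy data with two spellings that differ only by case (illustrative only)
                            clear
                            input str10 ragender
                            "1.Male"
                            "2.FeMale"
                            "2.Female"
                            end

                            * Group values by their lowercase form
                            generate lowered = lower(ragender)

                            * Flag groups whose first and last sorted spellings differ
                            bysort lowered (ragender): generate byte variants = (ragender[1] != ragender[_N])

                            * Show the spellings that need a decision
                            list ragender if variants
                            ```

                            This only surfaces the candidates; whether "FeMale" is a typo or a category of its own still has to be settled from the survey documentation.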


                            • #15
                              Thanks for your input, Daniel. I think your comment #10 explained a lot.
