  • Encoding strings stored as negative numbers

    Hello

    I noticed something which puzzled me when trying to convert a variable stored partly as negative numbers to a factor variable.

    The variable is cohort90, which is recorded as years from 1990, so -6 equals 1984,-4 equals 1986 and so on.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float caseid byte(score cohort90)
      339 49 -6
      340 18 -6
      345 46 -6
      346 43 -6
      352 17 -6
      353 29 -6
      354 15 -6
      361 19 -6
      362 45 -6
      363 12 -6
     6824  0 -4
     6826  0 -4
     6827 20 -4
     6828 32 -4
     6829  0 -4
     6834 24 -4
     6836 23 -4
    13206  7 -2
    13209 38 -2
    13215 46 -2
    13217 28 -2
    13218 32 -2
    18681 36  0
    18682 21  0
    18685 26  0
    18686 34  0
    26586 25  6
    26591 38  6
    26594 27  6
    26595 28  6
    31001 40  8
    31005 36  8
    31009 39  8
    31011 44  8
    end
    I've tried "manual" ways of generating a factor variable from cohort90, which do work, such as

    Code:
    generate cohort90yr84 = cohort90==-6
    generate cohort90yr86 = cohort90==-4
    generate cohort90yr88 = cohort90==-2
    generate cohort90yr90 = cohort90==0
    generate cohort90yr96 = cohort90==6
    generate cohort90yr98 = cohort90==8
    or
    Code:
    generate cohort=1 if cohort90==-6
    replace cohort=2 if cohort90==-4
    replace cohort=3 if cohort90==-2
    replace cohort=4 if cohort90==0
    replace cohort=5 if cohort90==6
    replace cohort=6 if cohort90==8
    However, I was looking for a quicker way with less typing.

    Code:
     quietly tabulate cohort90, generate(new_cohort)
    works, but it gives you 6 dummy variables. Using

    Code:
    tostring cohort90, generate(another)
    encode another, generate(another_1)
    gives me what I want, but the curious part, which I don't understand, is that another_1 has the order of the factor levels changed:

    Code:
      tab another_1
    
         Cohort |      Freq.     Percent        Cum.
    ------------+-----------------------------------
             -2 |      5,245       15.43       15.43
             -4 |      6,325       18.61       34.04
             -6 |      6,478       19.06       53.10
              0 |      4,371       12.86       65.96
              6 |      4,244       12.49       78.45
              8 |      7,325       21.55      100.00
    ------------+-----------------------------------
          Total |     33,988      100.00
    So now 1988 is the base year instead of 1984, followed by 1986 and then 1984. You have to be careful when selecting your reference category when doing

    Code:
    regress score i.another_1
    Not sure what I have done wrong here but it's probably how Stata interprets the negative numbers.

    Regards

    Chris

  • #2
    Code:
    gen year = 1990 + cohort90
    *BASE YEAR=1984
    regress score ib1984.year
    *BASE YEAR=1986
    regress score ib1986.year
    See

    Code:
    help fvvarlist
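
    As a minimal sketch of the base-level operators documented there, the following uses a small subset of the example data. The subset and the choice of ib(first) and ib(last) are for illustration only; with the full data the last level would be 1998 rather than 1990.

    ```stata
    * Small illustrative subset of the posted data (not the full dataset)
    clear
    input byte(score cohort90)
    49 -6
    18 -6
    20 -4
    32 -4
    38 -2
    46 -2
    36  0
    21  0
    end

    * Recover the calendar year, which is a nonnegative integer
    generate year = 1990 + cohort90

    * ib(first) uses the smallest level as the base (1984 here)
    regress score ib(first).year

    * ib(last) uses the largest level as the base (1990 in this subset)
    regress score ib(last).year
    ```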


    • #3
      This is quite subtle and it's not surprising that you were bitten by it.

      If you present "-6" "-4" "-2" "0" "6" "8" as strings, then Stata treats them as strings. That's what you asked for.

      So encode resolves the sort order dictionary fashion, comparing each character in turn.

      It happens that the minus sign sorts before any digit. The ascii() function in Mata is convenient here.

      Code:
      . mata
      ------------------------------------------------- mata (type end to exit) -----------------------------------
      : ascii("-")
        45
      
      : ascii("0")
        48
      So, by accident of the ASCII code, all the minus numbers sort before zero and the positive numbers. (Unicode is naturally consistent, as play with uchar(45) and uchar(48) will confirm.)

      Then you break ties (strings with first character minus) by sorting on the next character.

      This is all on all fours with the fact that "a6" "a4" "a2" would sort as "a2" to "a6". "-" is just another character.
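
      The character-by-character ordering can be checked directly with a short sketch. The variable name s is hypothetical and the six strings are just the distinct values from the example data.

      ```stata
      * Hold the distinct cohort90 values as strings
      clear
      input str2 s
      "-6"
      "-4"
      "-2"
      "0"
      "6"
      "8"
      end

      * String sort: compares the first character, then the second
      sort s
      list, noobs
      * Sorted order is "-2" "-4" "-6" "0" "6" "8", matching the tabulation of another_1
      ```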

      If the expectation, or hope, is that some intelligence in Stata will look inside the strings and think that "-6" is really a number, so it belongs before "-2", that isn't going to happen.

      The only way round this with encode is to insist on specific value labels.
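
      A minimal sketch of that approach follows, assuming the six values seen in the example data. The names cohortlbl, cohort_str and cohort_enc are hypothetical, chosen so they do not clash with variables created earlier in the thread.

      ```stata
      * One observation per distinct value, for illustration only
      clear
      input byte cohort90
      -6
      -4
      -2
      0
      6
      8
      end

      tostring cohort90, generate(cohort_str)

      * Define the value labels in the numeric order wanted
      label define cohortlbl 1 "-6" 2 "-4" 3 "-2" 4 "0" 5 "6" 6 "8"

      * encode uses the existing label; noextend refuses any string not in it
      encode cohort_str, generate(cohort_enc) label(cohortlbl) noextend

      * Show the underlying integer codes next to the original values
      list cohort90 cohort_str cohort_enc, nolabel noobs
      ```

      With the label defined first, "-6" is coded 1 and so is the default base level in i.cohort_enc.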

      The underlying issue is clearly that

      "Categorical variables to which factor-variable operators are applied must contain nonnegative integers with values in the range 0 to 32,740, inclusive."
      I am confident Chris and Andrew understand this, but it is the context for anyone wondering what this is all about. The title should be encoding negative numbers stored as strings, however.
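
      One way to satisfy that requirement without going through strings at all is egen's group() function, which numbers the distinct values 1, 2, ... in numeric order. This is a sketch on a small subset of the example data; the name cohort_grp is hypothetical.

      ```stata
      * Small illustrative subset of the posted data (not the full dataset)
      clear
      input byte(score cohort90)
      49 -6
      18 -6
      20 -4
      32 -4
      38 -2
      46 -2
      36  0
      21  0
      end

      * group() assigns 1 to the smallest value, 2 to the next, and so on;
      * the label option attaches the original values as value labels
      egen cohort_grp = group(cohort90), label

      tabulate cohort_grp
      regress score i.cohort_grp
      ```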


      • #4
        Thanks Nick and Andrew.
        Subtle indeed.
        I thought it might be something to do with how Stata interprets the minus sign in the string.
        Regards
        Chris
