  • Should a cross-section identifier be a string or a numeric variable?

    Hello!

    At present, I am constructing a panel data set by appending a number of cross-sectional data sets. The cross-section identifiers here are Family identifiers. In some of the data sets, the family identifier (all numbers, no letters) is saved as a numeric variable, while in others, it is saved as a string variable. While I know I can destring or tostring these variables so there is no mismatch, I am not sure whether they should all be converted into strings or into numeric.

    What format should the identifiers be saved in for a panel data analysis?


    Thanks!

  • #2
    If you need to xtset the data, then the identifiers need to be numeric. See

    Code:
    help xtset
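
    A minimal sketch of that conversion, assuming a hypothetical string identifier famid made up only of digits and a hypothetical time variable year (neither name comes from the original post):

    Code:
    * convert the all-digit string identifier to numeric in place
    destring famid, replace
    * declare the panel structure: family identifier, then time variable
    xtset famid year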



    • #3
      Besides -xtset- there are other considerations that might push you one way or another on this question. If some of the digits of the identifier have a meaning of their own--say the last two digits encode a geographic region or something like that--then extracting it will be more straightforward if you go with a string version.
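
      As a rough illustration, assuming made-up 8-digit IDs whose last two digits stand for a region (the data and the variable names region, ID_num and region_num are hypothetical), the string version needs only -substr()-, while the numeric version needs arithmetic:

      Code:
      clear
      input str8 ID
      "10023401"
      "10023512"
      end
      * string ID: take the last two characters directly
      generate region = substr(ID, -2, 2)
      * numeric ID: the same information has to be recovered with mod()
      generate long ID_num = real(ID)
      generate region_num = mod(ID_num, 100)
      list

      Note that the string result keeps the leading zero ("01"), whereas the numeric result is simply 1.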

      There is also a possible safety issue. If you keep the ID variable as a string, were you to mistakenly code some meaningless operation like adding them, Stata will recognize the error and give you a -type mismatch- or -syntax error- message. That is, Stata will save you from your mistake. If you keep the ID variable as numeric, then Stata will let you blunder ahead if you do something meaningless like that--you are not safe. In most situations I strongly prefer safe code, but I have to say that I have rarely seen people make the mistake of trying to do calculations on ID variables, so I won't press too hard on the safety issue in this particular case.
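
      A small sketch of that safety net, using a made-up one-observation data set (the variable name bad is hypothetical). -capture noisily- is used only so that a do-file keeps running after the error:

      Code:
      clear
      input str8 ID
      "10023401"
      end
      * arithmetic on a string ID is refused with a type mismatch error
      capture noisily generate bad = ID + 1
      * a nonzero return code confirms that the command failed
      display _rc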

      Then there is the precision issue. If the numbers are 9 digits or shorter, they will fit exactly in a long, and if they are 15 digits or shorter, they will fit exactly in a double, and you have no problem. (A double stores integers exactly only up to 2^53 = 9,007,199,254,740,992, so some 16-digit numbers are safe but not all of them.) Beyond that limit, attempting to turn them to numeric will result in loss of low-order digits, which, in turn, may result in different households having the same value of the ID variable. So with long IDs, you must store them as strings.
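
      To see the collision concretely, here is a sketch with two made-up 17-digit IDs that differ only in the last digit (the data and the name ID_num are hypothetical). At this magnitude a double can only represent every fourth integer, so both strings convert to the same number:

      Code:
      clear
      input str17 ID
      "20000000000000000"
      "20000000000000001"
      end
      * real() returns a double, which cannot hold the last digit here
      generate double ID_num = real(ID)
      format ID_num %20.0f
      list
      * the string IDs are distinct, but the numeric versions are not
      duplicates report ID
      duplicates report ID_num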

      If these considerations lead you to prefer a string variable, but you also need to -xtset- your data, you can keep the variable as a string, and create a numeric variable that is in 1-1 correspondence with the ID variable with -egen double n_ID = group(ID)-. Then you can -xtset- with n_ID, but you still have the string variable to work with in situations where that version is more convenient.
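
      A minimal sketch of that approach, again with made-up long IDs and a hypothetical time variable year:

      Code:
      clear
      input str17 ID year
      "20000000000000000" 2001
      "20000000000000000" 2002
      "20000000000000001" 2001
      "20000000000000001" 2002
      end
      * n_ID takes the values 1, 2, ... with one value per distinct string ID
      egen double n_ID = group(ID)
      * the panel is declared on n_ID; ID itself stays available as a string
      xtset n_ID year
      list

      Because -group()- numbers the distinct strings rather than converting their digits, the two households stay separate even though their IDs are too long for a double.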
