
  • "Encode" with too many values

    Hi,

    I am a beginner in Stata. I have a problem with a variable that has 834,785 observations. The values contain letters and numbers (so -destring- is not applicable, for sure). I want to encode them so I can get a unique id. However, the limit for -encode- is about 65,000 values. How can I solve this problem? It has bothered me for a long time, and I have tried to write a loop, but nothing works. May I have your help? Thanks a lot!


    data:
    clear
    input str16 dm
    "zg4849878"
    "zg4351159"
    "zg4351124"
    "zg4350182"
    "zg3876991"
    "zg3837170"
    "zg3803974"
    "zg380386x"
    "ys0008263"
    "yn1169386"
    "yn1169298"
    "yg4342249"
    "yg4341254"
    "yg4339816"
    "yg4339808"
    "yg4339744"
    "yg4339728"
    "yg4339664"
    "xj8307030"
    "xj7882084"
    "xj6368403"
    "xj550924X"
    "xj4400764"
    "xj2996998"
    "xj2710478"
    "xj270000X"
    "xj2601458"
    "xj1809792"
    "xj1807624"
    "xj1380010"
    "xh142013x"
    "xh1420121"
    "xh1420113"
    "xh1420092"
    "xh1420068"
    "xh1420041"
    "xh1411102"
    "xa0857987"
    "x70087551"
    "x31855183"
    "x31750787"
    "x31746681"
    "x31727763"
    "x31724132"
    "x31677858"
    "x31634241"
    "x29948997"
    "x29709793"
    "x29024401"
    "x27440540"
    "x27329535"
    "x27289000"
    "x27288067"
    "x27248719"
    "x25551837"
    "x25105458"
    "x25105095"
    "x24192094"
    "x24191809"
    "x24144113"
    "x24131187"
    "x24124518"
    "x24116606"
    "x24104154"
    "x24102511"
    "x23991951"
    "x23962157"
    "x23945349"
    "x23939037"
    "x23920416"
    "x23907174"
    "x23711382"
    "x2371089x"
    "x23710144"
    "x23606080"
    "x23606048"
    "x2358669x"
    "x23525045"
    "x23310001"
    "x23285037"
    "x23272228"
    "x23135514"
    "x22295305"
    "x22063942"
    "x21909395"
    "x21403247"
    "x21399217"
    "x21286626"
    "x21166165"
    "x21042381"
    "x20939007"
    "x20934257"
    "x20851283"
    "x20723214"
    "x20703627"
    "x20702990"
    "x20701795"
    "x20490633"
    "x20002097"
    "x20000649"
    end

  • #2
    Unfortunately, this variable has too many distinct values for a value label, and a list that long would be too unwieldy to use in a tabulation, for example. You have two simple options:
    1) Don't encode the data at all, and just use the string values as they are. There's nothing wrong with this, since the strings already uniquely identify each individual (or whatever the unit is). Merges and joins can easily accommodate string-valued keys such as this as well.
    2) If you need numeric ids for some reason, you can get them from any unique numbering system. One way to do this is with

    Code:
    egen newid = group(dm)
    What this does is assign the integers from 1 to N, where N is the number of distinct values of -dm-, into -newid-, taking the distinct values in sorted (alphanumeric) order. Beware that this is not a value label and in no way maintains a link between -newid- and -dm-; it is simply a new variable based on the data that you have now. If you later merge or use data with a different set of key values, or even add observations with the same set of key values, -newid- will not be updated, its value as an identifier will be lost, and further workarounds will be needed.
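
    One way to guard against losing that link, offered as a sketch rather than the only approach, is to check the mapping and save a crosswalk between -dm- and the numeric id while both are still in memory. This assumes the numeric id is built with -egen, group()-; the file name crosswalk is hypothetical.

    Code:
    * start clean so the sketch can be rerun
    capture drop newid
    * number the sorted distinct values of dm as 1, 2, ...
    egen newid = group(dm)
    * check that each value of dm maps to exactly one newid
    bysort dm (newid): assert newid[1] == newid[_N]
    * save one row per dm as a lookup table (crosswalk is a hypothetical file name)
    preserve
    keep dm newid
    duplicates drop
    isid dm
    save crosswalk, replace
    restore
    Later, something like -merge m:1 dm using crosswalk- should reattach the same ids to other data that use the same keys. Observations whose key is absent from the crosswalk would come through with -newid- missing, which at least makes the gap visible.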



    • #3
      Many thanks, Guizzetti! Your reply just gave me new inspiration!
