
  • Bug: encoding ID string with tens of millions of unique IDs

I have found what looks like a bug in Stata 17 MP. My dataset has hundreds of millions of observations and tens of millions of unique ID strings (six letters, hashed). I want to convert the string ID variable into a numeric variable because my aim is a panel regression, and xtset requires the panel variable to be numeric. If one tries
    Code:
    encode id, g(id_num)
    Stata tells you that there are too many values. I tried
    Code:
    egen id_num = group(id)
    instead, but after running
    Code:
    format id_num %12.0f
    by id_num tax_year, sort: gen n = _n
    tab n, miss
    
    by id_num: egen n_max = max(n)
    br id tax_year id_num n n_max if n_max>1
    I see different values of id sharing the same id_num!

    A colleague recommended slowing down the group() function:
    Code:
    cap drop id_num n n_max
    by id, sort: gen id_num=1  if _n==1
    replace id_num = sum(id_num)   // cumulative sum
    replace id_num =.  if missing(id)
    format id_num %12.0f
    
    by id_num tax_year, sort: gen n = _n
    tab n, miss
    
    by id_num: egen n_max = max(n)
    br id tax_year id_num n n_max if n_max>1
    However, I get the same problem. Can you confirm this with a test dataset? I cannot share the data I am working on, as it is sensitive.

    If this is a problem with the Stata program, it is probably due to the vast number of unique IDs. Is this the correct platform to give StataCorp feedback about a shortcoming in the program?

    In Python, the following works (after which I re-open the saved file in Stata so that xtset works properly):
    Code:
    import pandas as pd
    from sklearn.preprocessing import LabelEncoder as le
    ip = pd.read_stata('Income_panel.dta')
    print(ip.columns)
    ip['id_num'] = le().fit_transform(ip['id'])
    ip.to_stata('Income_panel_Python.dta')
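    If scikit-learn is not available in the data lab, pandas alone can do the same mapping with pd.factorize, which assigns consecutive int64 codes, so tens of millions of distinct IDs are represented exactly. A minimal sketch (the toy DataFrame below stands in for Income_panel.dta):
    Code:
    ```python
    import pandas as pd

    # Small stand-in for the real panel data
    ip = pd.DataFrame({'id': ['abcdef', 'ghijkl', 'abcdef', 'mnopqr']})

    # factorize assigns code 0 to the first unique id seen, 1 to the next, etc.
    ip['id_num'], uniques = pd.factorize(ip['id'])

    print(ip['id_num'].tolist())   # -> [0, 1, 0, 2]
    ```
    On the real data, this line would replace the LabelEncoder call, followed by the same to_stata() step.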

  • #2
    That is intended behavior; see help precision for more on what is going on. For a solution, just create your variable as a double.
    ---------------------------------
    Maarten L. Buis
    University of Konstanz
    Department of history and sociology
    box 40
    78457 Konstanz
    Germany
    http://www.maartenbuis.nl
    ---------------------------------
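    To see the precision limit concretely: NumPy's float32 corresponds to Stata's float and float64 to Stata's double, so the collapse of nearby IDs can be reproduced outside Stata. Single precision represents integers exactly only up to 2^24 = 16,777,216; above that, neighbouring integers round to the same stored value:
    Code:
    ```python
    import numpy as np

    # float32 holds integers exactly only up to 2**24
    assert np.float32(16_777_216) == 16_777_216   # still exact
    assert np.float32(16_777_217) == 16_777_216   # collides with its neighbour

    # The two IDs from the Stata listing in #3 collapse the same way
    assert np.float32(50_000_001) == 50_000_000
    assert np.float32(100_000_001) == 100_000_000

    # A long (32-bit integer) or double (53-bit mantissa) holds them exactly
    assert np.int32(50_000_001) == 50_000_001
    assert np.float64(50_000_001) == 50_000_001
    ```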



    • #3
      It's not obviously a bug. It's more likely a well-documented, and sometimes biting, precision issue that should be checked first.

      You need to specify a variable or storage type that is fit for your purpose.

      Tens of millions of identifiers imply to me that you might want identifiers such as 50000001 or 100000001.

      My Stata has a default numeric storage type float, and floats are not fit for that purpose. But longs work fine.

      That default is your default too unless you changed it.

      Code:
      . clear
      
      . set obs 2
      Number of observations (_N) was 0, now 2.
      
      . gen id = cond(_n == 1, 5e7 + 1, 10e7 + 1)
      
      . format id %10.0f
      
      . l
      
           +-----------+
           |        id |
           |-----------|
        1. |  50000000 |
        2. | 100000000 |
           +-----------+
      
      . gen long betterid = cond(_n == 1, 5e7 + 1, 10e7 + 1)
      
      . format betterid %10.0f
      
      . l
      
           +-----------------------+
           |        id    betterid |
           |-----------------------|
        1. |  50000000    50000001 |
        2. | 100000000   100000001 |
           +-----------------------+
      So, on this diagnosis, you should check whether longs will work for the maximum identifier you need and if so specify that to egen, or any work-around you choose.

      If you think about it, Stata has to have numeric variable types that could serve as unique observation identifiers for the dataset you have, which implies that it has variable types that could serve as identifiers for subsets.



      • #4
        The option -group(), autotype- purports to circumvent this problem.



        • #5
          Thank you very much for the feedback about long numbers (rather than float numbers). I will try this out next week, when the data lab is open.



          • #6
            On a related note, if you are dealing with a relatively large dataset, check out the user-contributed -gtools- suite. It typically speeds up such calculations substantially.

            For an example that I calibrated to your description of your data in #1, the native Stata -egen, group()- takes about 100 seconds, while -gegen, group()- from gtools takes only about 15 seconds.

