
  • Bug: encoding ID string with tens of millions of unique IDs

I have found what looks like a bug in Stata 17 MP. My dataset has hundreds of millions of observations and tens of millions of unique ID strings (six letters, hashed). I want to convert the string ID variable into a numeric variable because my aim is a panel regression, and xtset requires the panel variable to be numeric. If one tries
    Code:
    encode id, g(id_num)
    Stata tells you that there are too many values. I tried
    Code:
    egen id_num = group(id)
    instead, but after running
    Code:
    format id_num %12.0f
    by id_num tax_year, sort: gen n = _n
    tab n, miss
    
    by id_num: egen n_max = max(n)
    br id tax_year id_num n n_max if n_max>1
    I see different values of id sharing the same id_num!

    A colleague recommended slowing down the group() function:
    Code:
    cap drop id_num n n_max
    by id, sort: gen id_num=1  if _n==1
    replace id_num = sum(id_num)   // cumulative sum
    replace id_num =.  if missing(id)
    format id_num %12.0f
    
    by id_num tax_year, sort: gen n = _n
    tab n, miss
    
    by id_num: egen n_max = max(n)
    br id tax_year id_num n n_max if n_max>1
    However, I get the same problem. Can you confirm this with a test dataset? I cannot share the data I am working on, as it is sensitive.

    If this is a problem with the Stata program, it is probably due to the vast number of unique IDs. Is this the correct platform to give StataCorp feedback about a shortcoming in the program?

    In Python, the following works (after which I re-open the saved file in Stata so that xtset works properly):
    Code:
    import pandas as pd
    from sklearn.preprocessing import LabelEncoder as le
    ip = pd.read_stata('Income_panel.dta')
    print(ip.columns)
    ip['id_num'] = le().fit_transform(ip['id'])
    ip.to_stata('Income_panel_Python.dta')
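    If scikit-learn is not available in the data lab, pandas alone can do the same mapping with pd.factorize, which assigns consecutive int64 codes, so tens of millions of distinct IDs are represented exactly. A minimal sketch (the toy DataFrame below stands in for Income_panel.dta):
    Code:
    ```python
    import pandas as pd

    # Small stand-in for the real panel data
    ip = pd.DataFrame({'id': ['abcdef', 'ghijkl', 'abcdef', 'mnopqr']})

    # factorize assigns code 0 to the first unique id seen, 1 to the next, etc.
    ip['id_num'], uniques = pd.factorize(ip['id'])

    print(ip['id_num'].tolist())   # -> [0, 1, 0, 2]
    ```
    On the real data, this line would replace the LabelEncoder call, followed by the same to_stata() step.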

  • #2
    That is intended behavior; see help precision for more on what is going on. For a solution, just create your variable as a double.
    ---------------------------------
    Maarten L. Buis
    University of Konstanz
    Department of history and sociology
    box 40
    78457 Konstanz
    Germany
    http://www.maartenbuis.nl
    ---------------------------------
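    To see the precision limit concretely: NumPy's float32 corresponds to Stata's float and float64 to Stata's double, so the collapse of nearby IDs can be reproduced outside Stata. Single precision represents integers exactly only up to 2^24 = 16,777,216; above that, neighbouring integers round to the same stored value:
    Code:
    ```python
    import numpy as np

    # float32 holds integers exactly only up to 2**24
    assert np.float32(16_777_216) == 16_777_216   # still exact
    assert np.float32(16_777_217) == 16_777_216   # collides with its neighbour

    # The two IDs from the Stata listing in #3 collapse the same way
    assert np.float32(50_000_001) == 50_000_000
    assert np.float32(100_000_001) == 100_000_000

    # A long (32-bit integer) or double (53-bit mantissa) holds them exactly
    assert np.int32(50_000_001) == 50_000_001
    assert np.float64(50_000_001) == 50_000_001
    ```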



    • #3
      It's not obviously a bug. It's more likely a well-documented, and sometimes biting, precision issue that should be checked first.

      You need to specify a variable or storage type that is fit for your purpose.

      Tens of millions of identifiers imply to me that you might want identifiers such as 50000001 or 100000001.

      My Stata has a default numeric storage type float, and floats are not fit for that purpose. But longs work fine.

      That default is your default too unless you changed it.

      Code:
      . clear
      
      . set obs 2
      Number of observations (_N) was 0, now 2.
      
      . gen id = cond(_n == 1, 5e7 + 1, 10e7 + 1)
      
      . format id %10.0f
      
      . l
      
           +-----------+
           |        id |
           |-----------|
        1. |  50000000 |
        2. | 100000000 |
           +-----------+
      
      . gen long betterid = cond(_n == 1, 5e7 + 1, 10e7 + 1)
      
      . format betterid %10.0f
      
      . l
      
           +-----------------------+
           |        id    betterid |
           |-----------------------|
        1. |  50000000    50000001 |
        2. | 100000000   100000001 |
           +-----------------------+
      So, on this diagnosis, you should check whether longs will work for the maximum identifier you need and if so specify that to egen, or any work-around you choose.

      If you think about it, Stata has to have numeric variable types that could serve as unique observation identifiers for the dataset you have, which implies that it has variable types that could serve as identifiers for subsets.



      • #4
        The option -group(), autotype- purports to circumvent this problem.



        • #5
          Thank you very much for the feedback about long numbers (rather than float numbers). I will try this out next week, when the data lab is open.



          • #6
            On a related note, if you are dealing with a relatively large dataset, check out the user-contributed -gtools- suite. It typically speeds up such calculations substantially.

            For an example that I calibrated to your description of your data in #1, the native Stata -egen, group()- takes about 100 seconds, while -gegen, group()- from gtools takes only about 15 seconds.

