I have found a bug in Stata 17 MP. My dataset has hundreds of millions of observations and tens of millions of unique ID strings (six letters, hashed). I want to convert the string ID variable into a numeric variable, because I want to run a panel regression and xtset requires the panel variable to be numeric. If one tries
Code:
encode id, g(id_num)
Stata tells you that there are too many values. I tried
Code:
egen id_num = group(id)
instead, but after running
Code:
format id_num %12.0f
by id_num tax_year, sort: gen n = _n
tab n, miss
by id_num: egen n_max = max(n)
br id tax_year id_num n n_max if n_max>1
I see different id's for the same id_num!
A colleague recommended slowing down the group() function:
Code:
cap drop id_num n n_max
by id, sort: gen id_num=1 if _n==1
replace id_num = sum(id_num) // cumulative sum
replace id_num =. if missing(id)
format id_num %12.0f
by id_num tax_year, sort: gen n = _n
tab n, miss
by id_num: egen n_max = max(n)
br id tax_year id_num n n_max if n_max>1
However, I get the same problem. Can you confirm this with a test dataset? I cannot share the data that I am working on, as it is sensitive.
If this is a problem with Stata itself, it is probably due to the vast number of unique IDs. Is this the correct platform to provide feedback to StataCorp about a shortcoming in the program?
In Python, the following works (after which I need to re-save the data in Stata, so that xtset works properly):
Code:
import pandas as pd
from sklearn.preprocessing import LabelEncoder as le

ip = pd.read_stata('Income_panel.dta')
print(ip.columns)
ip['id_num'] = le().fit_transform(ip['id'])
ip.to_stata('Income_panel_Python.dta')
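For anyone who wants to try this without my data, here is a minimal sketch of the kind of synthetic test I have in mind. It uses only the Python standard library. The function names (make_test_panel, encode_ids, collisions), the year range and the sizes are all hypothetical choices for illustration, and 1,000 IDs is far below the tens of millions at which I see the problem, so n_ids would need to be raised a great deal (memory permitting) for a realistic test.

```python
import random
import string
from collections import defaultdict


def make_test_panel(n_ids, years, seed=0):
    """Build (id, tax_year) rows for n_ids random six-letter IDs (synthetic data)."""
    rng = random.Random(seed)
    ids = set()
    while len(ids) < n_ids:
        ids.add("".join(rng.choices(string.ascii_uppercase, k=6)))
    return [(id_, year) for id_ in sorted(ids) for year in years]


def encode_ids(rows):
    """Map each distinct ID to an integer 1..K in sorted order."""
    distinct = sorted({id_ for id_, _year in rows})
    codes = {id_: k for k, id_ in enumerate(distinct, start=1)}
    return [(id_, year, codes[id_]) for id_, year in rows]


def collisions(encoded):
    """Return {id_num: set of ids} for every id_num shared by more than one ID."""
    seen = defaultdict(set)
    for id_, _year, id_num in encoded:
        seen[id_num].add(id_)
    return {k: v for k, v in seen.items() if len(v) > 1}


rows = make_test_panel(n_ids=1000, years=range(2015, 2020))
encoded = encode_ids(rows)
print(len(rows), "rows;", len(collisions(encoded)), "id_num values with more than one id")
```

A correct encoding should report no id_num values with more than one id. The collisions() check looks for the same symptom as the br command above, namely different id's sharing one id_num.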