Convert multiple variables from str to int respecting regularities (with more than 18million of unique values needed)

Marco Ven

Join Date: May 2022

Posts: 5
#1

Convert multiple variables from str to int respecting regularities (with more than 18million of unique values needed)

25 May 2022, 03:16

Dear all,
I have a dataset of two string variables: Input_address and Output_address. I need to convert them from string to int to save memory (about 100million rows).
The point is that some Input_address and output_address are the same, and I want them to remain the same also after encode the variables.
I mean, if Input_address (first column) is "hanb23bd4.."*, and in Output_address I have the same "hanb23bd4.."*, I want them to be converted into the same number once run encode.
I don't know how to do that.

*fictitious example

This was my previous post, and I received a good suggestion of using "multencode".
The problem is that I quickly reach the limit of unique values of encode (65 thousands more ore less, and I need approximately 18million of unique values).
How can solve it? I need something that works exactly in the same way of multencode, but with much more unique values.
I also have the need to export the resulting dataset in a .csv file, but doing multencode and exporting as a .csv I still have a file with the string instead of the integer.
Thank you in advance,
Marco

Last edited by Marco Ven; 25 May 2022, 03:28.
Tags: None

Maarten Buis

Join Date: Mar 2014
Posts: 3494

25 May 2022, 03:50

18 million distinct values cannot be stored in an int, for that you need a long.

Here is an attempt. With two reshapes it is probably slow in large datasets, which seem to be what you have.

Code:

// create some example data
clear
input str2 (addr_orig addr_dest)
"a" "g"
"b" "h"
"c" "i"
"d" "a"
"b" "i"
end

// create the new variables
gen id = _n
reshape long addr_, i(id) j(what) string
sort addr_
by addr_ : gen long code = _n == 1
gen byte touse = code
replace code = sum(code)

mata:
st_sview(lab="",.,"addr_", "touse")
st_view(val=.,.,"code","touse")
st_vlmodify("code_lb",val,lab)
end

label values code code_lb
drop touse
reshape wide addr_ code, i(id) j(what) string

list

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------

Comment

Marco Ven

Join Date: May 2022

Posts: 5
#3

25 May 2022, 09:13

Originally posted by Maarten Buis View Post

18 million distinct values cannot be stored in an int, for that you need a long.

Here is an attempt. With two reshapes it is probably slow in large datasets, which seem to be what you have.

Code:

// create some example data clear input str2 (addr_orig addr_dest) "a" "g" "b" "h" "c" "i" "d" "a" "b" "i" end // create the new variables gen id = _n reshape long addr_, i(id) j(what) string sort addr_ by addr_ : gen long code = _n == 1 gen byte touse = code replace code = sum(code) mata: st_sview(lab="",.,"addr_", "touse") st_view(val=.,.,"code","touse") st_vlmodify("code_lb",val,lab) end label values code code_lb drop touse reshape wide addr_ code, i(id) j(what) string list

Can you please tell me why I performed your example and worked well, while with my dataset I obtain this error?
Of course the variable id does not uniquely idenfity the observations, but how can I manage it?

reshape long addr_, i(id) j(what) string
(j = dest orig)
variable id does not uniquely identify the observations
Your data are currently wide. You are performing a reshape long. You specified i(id) and j(what). In the current wide form,
variable id should uniquely identify the observations. Remember this picture:

long wide
+---------------+ +------------------+
| i j a b | | i a1 a2 b1 b2 |
|---------------| <--- reshape ---> |------------------|
| 1 1 1 2 | | 1 1 3 2 4 |
| 1 2 3 4 | | 2 5 7 6 8 |
| 2 1 5 6 | +------------------+
| 2 2 7 8 |
+---------------+
Type reshape error for a list of the problem observations.

Last edited by Marco Ven; 25 May 2022, 09:36.
Comment
Maarten Buis

Join Date: Mar 2014

Posts: 3494
#4

30 May 2022, 01:44

Did you create the variable id using the code I gave you?

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
Comment
Marco Ven

Join Date: May 2022

Posts: 5
#5

08 Jun 2022, 08:25

Yes, I used the code you gave me.
Sorry for the late answer.
I solved the problem using Python instead of Stata.
Thank you once again,
Marco
Comment

Announcement

Convert multiple variables from str to int respecting regularities (with more than 18million of unique values needed)

Comment

Comment

Comment

Comment