Appending with value labels

laura wolff

Join Date: Apr 2016

Posts: 16
#1

Appending with value labels

07 May 2016, 06:05

Hello,

I am working with a large dataset (15GB, 90mio observations) that is really slow to handle. I have thus split it up into four files and performed all my commands separately. However, I'm having issues with the value labels when I append the four files in the end:

My files contain string data (for example on company names), that I encode in each separate file, using

Code:

encode oldvar, generate(newvar)

.
Encode numbers the variables in each file from 1:n, so when I append the files, the 1s, 2s,...ns are included in the same value label (which is logical). However, the string values in my files don't overlap, so in file1 newvar with value 1 would be company A while in file2, newvar with value 1 would be company B.

Is there any way for append to recognize that the newvar==1 in file1 is not equal to the newvar==1 in file 2, even though they are stored in the same variable?

What I am doing now is to recode my values in the individual files by adding the number of observations in the previous file, and to then create a new value label in the merged file. This is a little cumbersome however...Leaving the variables as string and encoding in the final dataset is not an option, unfortunately, since it takes >3 hours per variable.

Thank you for your help!
Tags: None
daniel klein

Join Date: Mar 2014

Posts: 3842
#2

07 May 2016, 06:26

Did not read until the end, but are you aware of encode's label() option? You probably want something like

Code:

label define my_country_code /// 1 "Afghanistan" /// 2 "Albania" /// ... k "Zimbabwe" encode oldvar , generate(newvar) label(my_country_code)

Best
Daniel
Comment
Carole J. Wilson

Join Date: Jan 2015

Posts: 932
#3

07 May 2016, 08:05

A single number cannot have more than 1 label, so that will definitely bite you.

This is what I would do:

Encode and get newvar

Code:

encode oldvar, gen(newvar)

keep oldvar too

When you append, use the generate option (gen(ID) --now you have an ID number for each file)

Code:

append...,gen(ID)

If you are appending in a loop, you may need to do this on your own:

Code:

*open first file gen ID=1 local id=1 whatever loop to append { append.... replace ID=`id'+1 if ID==. }

Identify the largest number of digits that newvar could possibly be in a single file (let's say 5 digits)

Once you append all your files, create a new variable that combines ID & newvar:

Code:

gen newID=real(string(ID)+string(newvar, "%05.0f") ) // %05.0f will ensure leading zeros so that ID=1 & newvar=1 --> 100001

Now use labmask (from SSC as part of the -labutil- package ssc install labutil) to create labels from oldvar

Code:

labmask newID, values(oldvar)

Stata/MP 14.1 (64-bit x86-64)
Revision 19 May 2016
Win 8.1
1 like
Comment
Carole J. Wilson

Join Date: Jan 2015

Posts: 932
#4

07 May 2016, 10:31

Actually, an even quicker way to accomplish this:

1) encode in each dataset, keeping newvar and oldvar

2) append as usual

3) use egen group

Code:

egen newID=group(newvar oldvar)

4) then use labmask here

Code:

labmask newID, value(oldvar)

You could use the label option for egen group instead of labmask, but I think the labels will be nicer with labmask.

Last edited by Carole J. Wilson; 07 May 2016, 10:33.

Stata/MP 14.1 (64-bit x86-64)
Revision 19 May 2016
Win 8.1
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

07 May 2016, 11:42

I don't see any reason to use encode for company names. Note that there's a limit of 65,536 codings within a value label. The more codings you generate, the longer your code will run. Here's a quick demonstration:

Code:

clear all
set obs 65000
gen s = "string " + string(_n)
expand 10

* order the data by the variable(s) to group
sort s

* using first principles, create groups of distinct values
timer on 1
by s: gen one = _n == 1
gen long sg = sum(one)
timer off 1

* do the same using -egen-, it's a bit slower
timer on 2
egen long sg2 = group(s)
timer off 2

* now try to encode instead
timer on 3
encode s, gen(ns) label(ns)
timer off 3

timer list

and here are the results:

Code:

. timer list
   1:      0.02 /        1 =       0.0230
   2:      0.13 /        1 =       0.1310
   3:     10.89 /        1 =      10.8890

No matter if you try to create a group identifier or an encoding, if the numeric values are to make sense across separate datasets, you need to combine the datasets in order to perform the grouping or encoding. This does not mean that you must append all your data, just the variable(s) that you are trying to group or encode. For example:

Code:

* say the standard auto dataset is a big dataset
sysuse auto, clear
gen id = _n
save "big_data.dta", replace

* split it in 2 parts
keep in 1/30
save "auto1.dta", replace
use "big_data.dta", clear
drop in 1/30
save "auto2.dta", replace

* extract the target string variable from the parts
use make using "auto1.dta", clear
save "make1.dta", replace
use make using "auto2.dta", clear
save "make2.dta", replace

* combine the value of the target variable across all datasets;
* (I repeat the process 4 times to simulate duplicates)
clear
append using "make1.dta"
append using "make2.dta"
append using "make1.dta"
append using "make2.dta"

* create a group identifier using first principles
bysort make: gen one = _n == 1
gen long makegroup = sum(one)

* or, if there's a reason to encode
encode make, gen(make_enc) label(make_enc)

* reduce to one obs per group
by make: keep if _n == 1
drop one
save "make_group.dta", replace

* You can now merge back the group identifiers or the encoding to each part
use "auto1.dta", clear
merge m:1 make using "make_group.dta", assert(match using) keep(master match) nogen

Comment

laura wolff

Join Date: Apr 2016

Posts: 16
#6

09 May 2016, 09:55

Thank you all so much for your help! The idea to split up the dataset into just the variables I need to encode and then to merge back was really good, since any kind of command I perform on my large dataset takes really long. I realize this would have also worked using the label option in encode, but typing up all the labels manually would have taken very long as well.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35637
#7

09 May 2016, 10:02

For future reference, note that multencode (SSC) is dedicated to encoding several string variables consistently.
Comment

Announcement