We need to append annual data to a large dataset and are running into memory constraints (the final dataset is ~22GB and our server has 24GB RAM). Historically, we have encoded strings to int to decrease dataset size but the value labels differ per dataset and this practice yielded data quality issues post-merge. This phenomenon is already documented on Statalist.
Essentially, I would like to expand Eric's work into a more generic solution that would allow us to convert strings to integers across datasets to decrease our dataset size. Please provide any feedback, advice, alternatives or things to consider. We process a lot of data with Stata and our use of encode has been sloppy. I am a new Stata user but have worked in Python, SAS, SQL and Powershell previously.
Input
Output
Requirements
Essentially, I would like to expand Eric's work into a more generic solution that would allow us to convert strings to integers across datasets to decrease our dataset size. Please provide any feedback, advice, alternatives or things to consider. We process a lot of data with Stata and our use of encode has been sloppy. I am a new Stata user but have worked in Python, SAS, SQL and Powershell previously.
Input
- Dataset(s) with matching string variables
- Optional: list of string variables to generate/apply labels for
Output
- .Do file containing fully specified value label for each string variable
- Updated dataset(s) with strings converted to integers with attached value labels
Requirements
- Automatically generate value labels for all string variables in a dataset unless variables are specified
- Automatically apply value labels for "all strings" or "specified strings" in specified datasets
- Give int-like variables logical value labels during encode
- Check for an existing do file containing value labels and append new value labels to it so variables are nor recoded year-to-year
- Throw-error when appending and value labels don't match
Code:
**solution**
//1. mk lookup table of values//
clear
save "lookuptable.dta", emptyok replace
foreach file in using master /* put all your files here */ {
append using "`file'.dta", keep(citation)
}
duplicates drop
g citation_number = _n
l
save "lookuptable.dta", emptyok replace
cap which labmask
if _rc ssc install labutil, replace
labmask citation_number, value(citation) lblname(cit)
la save cit using "citationlabels.do" , replace
//2. mk final tbl w. citation_number, not citation//
clear
save "final.dta", emptyok replace
foreach file in using master /* put all your files here */ {
u "`file'.dta", clear
merge 1:1 citation using "lookuptable.dta"
drop if _m!=3
drop _m
drop citation
sa "`file'_encoded.dta", replace
append using "final.dta"
sa "final.dta", replace
}
//3. apply labels to citation_number//
l
fre citation
do "citationlabels.do"
lab val citation_number cit
fre citation
