Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Test

    We need to append annual data to a large dataset and are running into memory constraints (the final dataset is ~22GB and our server has 24GB RAM). Historically, we have encoded strings to int to decrease dataset size but the value labels differ per dataset and this practice yielded data quality issues post-merge. This phenomenon is already documented on Statalist.

    Essentially, I would like to expand Eric's work into a more generic solution that would allow us to convert strings to integers across datasets to decrease our dataset size. Please provide any feedback, advice, alternatives or things to consider. We process a lot of data with Stata and our use of encode has been sloppy. I am a new Stata user but have worked in Python, SAS, SQL and Powershell previously.

    Input
    • Dataset(s) with matching string variables
    • Optional: list of string variables to generate/apply labels for

    Output
    • .Do file containing fully specified value label for each string variable
    • Updated dataset(s) with strings converted to integers with attached value labels

    Requirements
    • Automatically generate value labels for all string variables in a dataset unless variables are specified
    • Automatically apply value labels for "all strings" or "specified strings" in specified datasets
    • Give int-like variables logical value labels during encode
    • Check for an existing do file containing value labels and append new value labels to it so variables are nor recoded year-to-year
    • Throw-error when appending and value labels don't match


    Code:
     
    **solution**
    
    //1.  mk lookup table of values//
    clear
    save "lookuptable.dta", emptyok replace
    foreach file in using master   /* put all your files here */  {
        append using "`file'.dta", keep(citation)
        }
        duplicates drop
        g citation_number = _n
        l
        save "lookuptable.dta", emptyok replace
        cap which labmask
        if _rc ssc install labutil, replace
        labmask citation_number, value(citation) lblname(cit)
        la save cit  using "citationlabels.do" , replace
        
        
        
    //2. mk final tbl w. citation_number, not citation//
    clear
    save "final.dta", emptyok replace
    
    foreach file in using master  /* put all your files here */  {
        u "`file'.dta", clear
        merge 1:1 citation using "lookuptable.dta"
        drop if _m!=3
        drop _m
        drop citation
        sa "`file'_encoded.dta", replace
        append using "final.dta"
        sa "final.dta", replace
        }
    
    //3.  apply labels to citation_number//
    l
    fre citation
    do "citationlabels.do"
    lab val citation_number cit
    fre citation
Working...
X