Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Appending with value labels

    Hello,

    I am working with a large dataset (15GB, 90mio observations) that is really slow to handle. I have thus split it up into four files and performed all my commands separately. However, I'm having issues with the value labels when I append the four files in the end:

    My files contain string data (for example on company names), that I encode in each separate file, using
    Code:
    encode oldvar, generate(newvar)
    .
    Encode numbers the variables in each file from 1:n, so when I append the files, the 1s, 2s,...ns are included in the same value label (which is logical). However, the string values in my files don't overlap, so in file1 newvar with value 1 would be company A while in file2, newvar with value 1 would be company B.

    Is there any way for append to recognize that the newvar==1 in file1 is not equal to the newvar==1 in file 2, even though they are stored in the same variable?

    What I am doing now is to recode my values in the individual files by adding the number of observations in the previous file, and to then create a new value label in the merged file. This is a little cumbersome however...Leaving the variables as string and encoding in the final dataset is not an option, unfortunately, since it takes >3 hours per variable.

    Thank you for your help!

  • #2
    Did not read until the end, but are you aware of encode's label() option? You probably want something like

    Code:
    label define my_country_code ///
        1 "Afghanistan" ///
        2 "Albania"     ///
        ...
        k "Zimbabwe"
        
    encode oldvar , generate(newvar) label(my_country_code)
    Best
    Daniel

    Comment


    • #3

      A single number cannot have more than 1 label, so that will definitely bite you.

      This is what I would do:

      Encode and get newvar
      Code:
      encode oldvar, gen(newvar)
      keep oldvar too

      When you append, use the generate option (gen(ID) --now you have an ID number for each file)
      Code:
      append...,gen(ID)
      If you are appending in a loop, you may need to do this on your own:
      Code:
      *open first file
      gen ID=1
      local id=1
      whatever loop to append {
         append....
         replace ID=`id'+1 if ID==.
         }
      Identify the largest number of digits that newvar could possibly be in a single file (let's say 5 digits)

      Once you append all your files, create a new variable that combines ID & newvar:
      Code:
      gen newID=real(string(ID)+string(newvar, "%05.0f") )                 // %05.0f will ensure leading zeros so that ID=1 & newvar=1 --> 100001
      Now use labmask (from SSC as part of the -labutil- package ssc install labutil) to create labels from oldvar
      Code:
      labmask newID, values(oldvar)








      Stata/MP 14.1 (64-bit x86-64)
      Revision 19 May 2016
      Win 8.1

      Comment


      • #4
        Actually, an even quicker way to accomplish this:

        1) encode in each dataset, keeping newvar and oldvar

        2) append as usual

        3) use egen group
        Code:
        egen newID=group(newvar oldvar)
        4) then use labmask here
        Code:
        labmask newID, value(oldvar)
        You could use the label option for egen group instead of labmask, but I think the labels will be nicer with labmask.
        Last edited by Carole J. Wilson; 07 May 2016, 10:33.
        Stata/MP 14.1 (64-bit x86-64)
        Revision 19 May 2016
        Win 8.1

        Comment


        • #5
          I don't see any reason to use encode for company names. Note that there's a limit of 65,536 codings within a value label. The more codings you generate, the longer your code will run. Here's a quick demonstration:

          Code:
          clear all
          set obs 65000
          gen s = "string " + string(_n)
          expand 10
          
          * order the data by the variable(s) to group
          sort s
          
          * using first principles, create groups of distinct values
          timer on 1
          by s: gen one = _n == 1
          gen long sg = sum(one)
          timer off 1
          
          * do the same using -egen-, it's a bit slower
          timer on 2
          egen long sg2 = group(s)
          timer off 2
          
          * now try to encode instead
          timer on 3
          encode s, gen(ns) label(ns)
          timer off 3
          
          timer list
          and here are the results:
          Code:
          . timer list
             1:      0.02 /        1 =       0.0230
             2:      0.13 /        1 =       0.1310
             3:     10.89 /        1 =      10.8890
          No matter if you try to create a group identifier or an encoding, if the numeric values are to make sense across separate datasets, you need to combine the datasets in order to perform the grouping or encoding. This does not mean that you must append all your data, just the variable(s) that you are trying to group or encode. For example:

          Code:
          * say the standard auto dataset is a big dataset
          sysuse auto, clear
          gen id = _n
          save "big_data.dta", replace
          
          * split it in 2 parts
          keep in 1/30
          save "auto1.dta", replace
          use "big_data.dta", clear
          drop in 1/30
          save "auto2.dta", replace
          
          * extract the target string variable from the parts
          use make using "auto1.dta", clear
          save "make1.dta", replace
          use make using "auto2.dta", clear
          save "make2.dta", replace
          
          * combine the value of the target variable across all datasets;
          * (I repeat the process 4 times to simulate duplicates)
          clear
          append using "make1.dta"
          append using "make2.dta"
          append using "make1.dta"
          append using "make2.dta"
          
          * create a group identifier using first principles
          bysort make: gen one = _n == 1
          gen long makegroup = sum(one)
          
          * or, if there's a reason to encode
          encode make, gen(make_enc) label(make_enc)
          
          * reduce to one obs per group
          by make: keep if _n == 1
          drop one
          save "make_group.dta", replace
          
          * You can now merge back the group identifiers or the encoding to each part
          use "auto1.dta", clear
          merge m:1 make using "make_group.dta", assert(match using) keep(master match) nogen

          Comment


          • #6
            Thank you all so much for your help! The idea to split up the dataset into just the variables I need to encode and then to merge back was really good, since any kind of command I perform on my large dataset takes really long. I realize this would have also worked using the label option in encode, but typing up all the labels manually would have taken very long as well.

            Comment


            • #7
              For future reference, note that multencode (SSC) is dedicated to encoding several string variables consistently.

              Comment

              Working...
              X