Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • To append or to merge

    I would like to create a long panel from the 17 waves of panel data from the HILDA survey but am struggling with two key issues:
    (1) code to append (or merge) the 17 waves, and
    (2) dealing with the variable names, which include a prefix letter representing the number wave, e.g. ahgage - age in wave 1, bhgage - age in wave 2, chgage - age in wave 3, hence prefix 'a' = 1, 'b' = 2, etc. In this case, I believe I can use rename group (?) to remove the wave reference, but not sure of the exact code, and secondly, would I do so while appending the waves?

    Most of the variables are in all waves (such as age, gender, education, wages), however there are three (categorical) variables only in waves 4, 7, 10 , 14 (hence the separate append).

    My code for appending the waves follows:

    clear
    set memory 1g
    set more off
    use "C:\data\Combined_a170c.dta" // original data file
    keep xwaveid ahgsex ahgage aeduc awage

    tempfile master
    save "`master'", replace // save to temp data file

    // add in data from other waves (in and not in wave 1)

    * Cleaning data for waves 2 to 17 // (excl waves 4, 7, 10, 14 as these include specific data not in other waves)

    local wave b c e f h i k l m o p q
    foreach x of local wave {
    use "C:\data\Combined_`x'170c.dta", clear
    keep xwaveid `x'hgsex `x'hgage `x'educ `x'wage
    save "`master'", replace

    use "C:\data\Hilda\Combined_`x'170c.dta", clear
    append using "`master'"
    save "C:\data\basedata.dta", replace // new data file
    }

    * Cleaning data for waves 4, 7, 10, 14

    local wave2 d g j n // (waves 4, 7, 10, 14)
    foreach y of local wave2 {
    use "C:\data\Combined_`y'170c.dta", clear
    keep xwaveid `y'hgsex `y'hgage `y'educ `y'wage `y'reltype `y'relimp `y'relat //
    save "`master'", replace

    use "C:\data\Combined_`y'170c.dta", clear
    append using "`master'"
    save "C:\data\basedata.dta", replace
    }

    //
    Stata responded with "invalid file specification"
    r(198);


    Your kind assistance is appreciated as always.
    Last edited by Chris Boulis; 23 Sep 2019, 03:13.

  • #2
    You can find out which -save- or -use- statement has the invalid file specification with
    Code:
    set trace on
    before the -for- loop.

    Note also that you can -append- to an empty workspace, no need to treat "a" differently from "b...q".

    Comment


    • #3
      Using set trace will ineed help pinpoint the problem. I do not immediately see anything out of the ordinary with the way you call files.

      For renaming, prior to appending files, you can do, e.g.,
      Code:
      local wave b c e f h i k l m o p q
      foreach x of local wave {
      use "C:\data\Combined_`x'170c.dta", clear
      keep xwaveid `x'hgsex `x'hgage `x'educ `x'wage
      ren `x'hgsex hgsex
      ren `x'hgage some_common_varname
      }
      Also, do note that your current append and save ordering, i.e.:
      Code:
      local wave b c e f h i k l m o p q
      foreach x of local wave {
      use "C:\data\Combined_`x'170c.dta", clear
      keep xwaveid `x'hgsex `x'hgage `x'educ `x'wage
      save "`master'", replace
      
      use "C:\data\Hilda\Combined_`x'170c.dta", clear
      append using "`master'"
      save "C:\data\basedata.dta", replace // new data file
      }
      you are just overwriting: you append "C:\data\Combined_b170c.dta" with "C:\data\Combined_b170c.dta", save, and then overwrite the basedata.dta file
      In the next loop, you append "C:\data\Combined_c170c.dta" with "C:\data\Combined_c170c.dta", save, and then overwrite (not append) the basedata.dta file
      So you keep ovewriting that file in the loop and will end up with the last iteration of the loop only.

      You probably want, roughly:
      Code:
      use "C:\data\Combined_a170c.dta" // original data file
      keep xwaveid ahgsex ahgage aeduc awage
      
      tempfile master
      save "`master'", replace // save to temp data file
      
      local wave b c e f h i k l m o p q
      foreach x of local wave {
      use "C:\data\Combined_`x'170c.dta", clear
      keep xwaveid `x'hgsex `x'hgage `x'educ `x'wage
      append using "`master'"
      save "`master'", replace // save to temp data file
      }
      save "C:\data\basedata.dta", replace // new data file
      And note the advice in #2 on appending to an empty file, meaning you can simplify to:
      Code:
      tempfile master
      save "`master'", replace emptyok // save to temp data file
      
      local wave a b c e f h i k l m o p q
      foreach x of local wave {
      use "C:\data\Combined_`x'170c.dta", clear
      keep xwaveid `x'hgsex `x'hgage `x'educ `x'wage
      append using "`master'"
      save "`master'", replace // save to temp data file
      }
      save "C:\data\basedata.dta", replace // new data file

      Edit:
      and on the append vs merge idea: both would work, but append is advisable here. You could merge and then reshape, but it's less intuitive and likely quite a bit slower. Append requires the variable names to be the same, but that is dealt with easily enough, as in above example.
      Last edited by Jorrit Gosens; 23 Sep 2019, 06:25.

      Comment


      • #4
        Thank you Jorrit Gosens. As suggested, I renamed the vars in wave 1, then waves 2-17, then for the remaining 4 waves. I then appended these files as per your last code without success. Here's my updated code below:

        // renaming vars

        use "C:\data\Hilda\Combined_a170c.dta"
        keep xwaveid ahgsex ahgage aedhigh1 awsfei amrcurr aesbrd anlreapm
        rename ahgsex hgsex
        rename ahgage hgage
        rename aedhigh1 educ
        rename awsfei wage
        rename amrcurr mrcurr
        rename aesbrd empstat
        rename anlreapm preg

        tempfile master
        save "`master'", replace emptyok

        * rename vars to remove wave-prefix letter and append waves b-q (2-17), excl 4, 7, 10, 14,

        local wave b c e f h i k l m o p q
        foreach x of local wave {
        use "C:\data\Hilda\Combined_`x'170c.dta", clear
        keep xwaveid `x'hgsex `x'hgage `x'educ `x'wage
        rename `x'hgsex hgsex
        rename `x'hgage hgage
        rename `x'edhigh1 educ
        rename `x'wsfei wage
        rename `x'mrcurr mrcurr
        rename `x'esbrd empstat
        rename `x'nlreapm preg
        save "`master'", replace
        }
        // This was met with error r(111) "variable beduc not found".

        * rename vars to remove wave-prefix letter and append waves 4, 7, 10, 14,

        local wave2 d g j n
        foreach y of local wave2 {
        use "C:\data\Hilda\Combined_`y'170c.dta", clear
        keep xwaveid `y'hgsex `y'hgage `y'edhigh1 `y'wsfei `y'mrcurr `y'religb `y'relimp `y'relat `y'esbrd `y'nlreapm
        rename `y'hgsex hgsex
        rename `y'hgage hgage
        rename `y'edhigh1 educ
        rename `y'wsfei wage
        rename `y'mrcurr mrcurr
        rename `y'esbrd empstat
        rename `y'nlreapm preg
        rename `y'religb religb
        rename `y'relimp relimp
        rename `y'relat relat
        }

        // appending waves

        local wave a b c e f h i k l m o p q
        foreach x of local wave {
        use "C:\data\Combined_`x'170c.dta", clear
        keep xwaveid hgsex hgage educ wage mrcurr empstat preg
        append using "`master'"
        save "`master'", replace emptyok // temp data file
        }
        save "C:\data\basedata.dta", replace // new data file

        local wave2 d g j n // (waves 4, 7, 10, 14)
        foreach y of local wave2 {
        use "C:\data\Combined_`y'170c.dta", clear
        keep xwaveid hgsex hgage educ wage mrcurr empstat preg reltyp relimp relat
        append using "`master'"
        save "`master'", replace // temp data file
        }
        save "C:\data\basedata.dta", replace // new data file

        I can see I am saving over each wave of renaming, but not sure how to deal with renaming over the three iterations (wave 1, waves 2+ and waves 4, 7, 10, 14).

        Again I appreciate any guidance.

        Comment


        • #5
          I should have actually suggested a more simple way of renaming these many variables.
          This can be done in one go with:
          Code:
          rename `x'* *
          The error // This was met with error r(111) "variable beduc not found". has to do with the data, not the code: your dataset b does not have a variable beduc. Adn Stata complains it cannot keep a non-existent variable.


          And for the rest of your problem, I would do:
          Code:
          tempfile master
          save "`master'", replace emptyok
          
          local wave a b c e f h i k l m o p q d g j n
          foreach x of local wave {
          use "C:\data\Hilda\Combined_`x'170c.dta", clear
          rename `x'* *
          append using "`master'" // meaning include all data from previous iterations of the loop
          save "`master'", replace // save to temp data file
          }
          keep xwaveid hgsex hgage educ wage mrcurr empstat preg reltyp relimp relat
          So note: the idea of saving an empty tempfile first is to prevent the need for creating a tempfile like you do with your code for file a. So once we use that, that first block of code can go.
          Second, by loading all variables from every set, and only keeping selected variables after the append loop, you can save anther block of code. Append is happy to append even if the new set has fewer or more variables.

          Then the question: why did it not work originally? Because here:
          Code:
          local wave b c e f h i k l m o p q
          foreach x of local wave {
          use "C:\data\Hilda\Combined_`x'170c.dta", clear
          keep xwaveid `x'hgsex `x'hgage `x'educ `x'wage
          rename `x'hgsex hgsex
          rename `x'hgage hgage
          rename `x'edhigh1 educ
          rename `x'wsfei wage
          rename `x'mrcurr mrcurr
          rename `x'esbrd empstat
          rename `x'nlreapm preg
          save "`master'", replace
          }
          You load dataset a, make changes, and overwrite the master file.
          The next iteration, you load dataset b, make changes, and overwrite the master file.
          So you end up with the loop doing about 12 iterations, but in the end you end up with just the data from the last iteration.
          You have to include the step append before you save, as in the first block of code in this reply. That way you include the data from the previous iteration.

          You could also do the renaming in one loop, and the appending in another. If so, you could load dataset `x', make changes, and overwrite dataset `x' (but thats a bad idea), or save each dataset with a suffix (e.g., save "C:\data\Hilda\Combined_`x'170c_common_var_names.d ta"

          Comment


          • #6
            Thank you very much Jorrit Gosens that worked perfectly. Very clear and tidy code too. A nice way to end the day off . Now on to the next bit of code ...

            Comment


            • #7
              Memory space issue:

              Further to #5, I have run into a memory issue in Stata error code "r(900) no room to add more variables". This is stopping me from running my program. The code in the first part of #5 Jorrit Gosens, but because of the very large number of variables in each wave, it is causing a memory space problem and stops.

              I need to merge partner data and responding person date for all waves. Do you think the best way to do this is to merge these two for each wave separately, then append or should I append each of these two datafiles then merge the two? Can I run a loop to do either of these?

              (Note, I have some variables are not in all waves, hence the beauty of Jorrit's code to 'keep' the vars at end of program)?

              Help kindly appreciated.

              Chris

              Here's my code - I know something's not right, but I don't know what. Please help:
              Code:
               
              tempfile append
              save `writedatadir'/append, replace emptyok
              
              // merge partner and respondent data from wave 1
              use `origdatadir'\Combined_a170c.dta, clear 
              rename a* p_*
              rename xwaveid hhpxid
              sort hhpxid
              save `writedatadir'/temp1, replace
              
              use `origdatadir'\Combined_a170c.dta, clear
              rename `x'* *
              drop if hhpxid=="" 
              
              sort hhpxid
              merge 1:1 hhpxid using `writedatadir'/temp1, replace
              save `writedatadir'/append, replace emptyok
              
              * Apppend the rperson-partner data for all remaining waves
              
              local wave b c d e f g h i j k l m n o p q
              foreach x of local wave {
              use `origdatadir'\Combined_`x'170c.dta, clear
              rename `x'* p_* // replaces wave-prefix with Partner prefix
              rename xwaveid hhpxid
              sort hhpxid
              save `writedatadir'/temp2, replace
              
              use `origdatadir'\Combined_`x'170c.dta, clear
              rename `x'* *
              drop if hhpxid=="" 
              
              sort hhpxid
              merge 1:1 hhpxid using `writedatadir'/temp2, replace
              
              append using `writedatadir'/append 
              save `writedatadir'/append, replace 
              }
              
              keep xwaveid hhpxid hgage hgsex esbrd edhigh1 wsfei mrcurr esbrd nlreapm jbhruc religb relimp relat ///
              p_hgage p_hgsex p_esbrd p_edhigh1 p_wsfei p_mrcurr p_esbrd p_nlreapm p_jbhruc p_religb p_relimp p_relat // vars of interest
              
              save `newdatadir'\partner_rperson.dta, replace emptyok
              Last edited by Chris Boulis; 12 Oct 2019, 20:33. Reason: Adding in code

              Comment


              • #8
                I'm surprised there would be an issue with variable creation.
                If you look at
                Code:
                help maxvar
                You will see that you can have 2048 variables maximum as a default. There's options to increase that as well, so you could try doing that.

                But your code also indicates you wont need that many variables. Your line with keep makes clear you need about 2 dozen vars.
                If there are a whole lot more variables in your original datasets, then a good moment to reduce the amount of variables used would be just prior to the line:
                Code:
                save `writedatadir'/temp2, replace
                If there are some difficulties with different sets of variables in each dataset, meaning different lists of variables might need to be retained, then that might need a fix. But you'll have to specify what sort of logic those vars have.


                Also, I dont think you got the point made in post #5 just yet:
                the idea of saving an empty tempfile first is to prevent the need for creating a tempfile like you do with your code for file a. So once we use that, that first block of code can go.
                So by the looks of it, your code could be simplified to:
                Code:
                tempfile append
                save `writedatadir'/append, replace emptyok
                
                * Apppend the rperson-partner data
                local wave a b c d e f g h i j k l m n o p q
                foreach x of local wave {
                use `origdatadir'\Combined_`x'170c.dta, clear
                rename `x'* p_* // replaces wave-prefix with Partner prefix
                rename xwaveid hhpxid
                sort hhpxid
                save `writedatadir'/temp2, replace
                
                use `origdatadir'\Combined_`x'170c.dta, clear
                rename `x'* *
                drop if hhpxid=="" 
                
                sort hhpxid
                merge 1:1 hhpxid using `writedatadir'/temp2, replace
                
                append using `writedatadir'/append 
                save `writedatadir'/append, replace 
                }
                
                keep xwaveid hhpxid hgage hgsex esbrd edhigh1 wsfei mrcurr esbrd nlreapm jbhruc religb relimp relat ///
                p_hgage p_hgsex p_esbrd p_edhigh1 p_wsfei p_mrcurr p_esbrd p_nlreapm p_jbhruc p_religb p_relimp p_relat // vars of interest
                
                save `newdatadir'\partner_rperson.dta, replace emptyok

                Comment


                • #9
                  Thank you for your help Jorrit Gosens. Yes that is much better. I have further developed this now and have decided that I need to create a wave variable before I remove the wave prefix. I'm still unsure how to do this correctly. I've derived the code from a post using the same dataset but it does not work in its current form. Also, I suspect I have written over 'hhpxid' as I save this in both the partner and respondent data. Help is appreciated.

                  Code:
                  tempfile mergedata
                  save "`mergedata'", replace emptyok
                  
                  local wave = substr(c(alpha), 1
                      foreach waveprefix in `wave'{
                          quietly{
                              use `origdatadir'/Combined_`waveprefix'170c.dta, clear
                              renpfix `waveprefix'
                              local list ""
                              foreach variable in xwaveid `variables'{
                                  capture desc `variable', varlist
                                  if !_rc {
                                      local list "`list' `r(varlist)'"
                                  }
                              }
                              keep `list'
                              gen wave = index("abcdefghijklmnopqrstuvwxyz","`waveprefix'")
                              lab var wave "Wave of survey"
                              sort hhpxid
                              save "`mergedata'", replace
                          }
                          disp in green "Wave `waveprefix' completed"
                      }
                  
                  * partner data
                  local wave a b c d e f g h i j k l m n o p q
                  foreach `waveprefix' of local wave {
                  use `origdatadir'/Combined_`x'170c.dta, clear
                  rename `x'* p_*                 // replaces wave-prefix with partner prefix
                  rename p_hhpxid hhpxid            
                  rename xwaveid hhpxid        // xwaveid is unique individual identifier for all waves // hhpxid - partner's cross-wave identifier
                  
                  sort hhpxid
                  
                  * respondent data
                  use `origdatadir'/Combined_`x'170c.dta, clear
                  rename `x'* *                    // respondent data
                  drop if hhpxid=="" 
                  sort hhpxid
                  merge 1:1 hhpxid using "`mergedata'", replace 
                  
                  tempfile append
                  append using "`append'", replace emptyok
                  save "`append'", replace
                  }
                  keep xwaveid hhpxid age sex empstat educ wage mrcurr relb relimp relat ///
                  p_age p_sex p_empstat p_educ p_wage p_mrcurr p_relb p_relimp p_relat // vars of interest
                  
                  destring hhpxid, replace
                  sort hhpxid
                  save `newdatadir'/partner_rperson.dta, replace emptyok

                  Comment


                  • #10
                    Well, that's a fair bit of code to go over. So to narrow it down a bit, you are saying your current issue is creating the wave variable, which you are doing with:
                    Code:
                    gen wave = index("abcdefghijklmnopqrstuvwxyz","`waveprefix'")
                    If I read the rest of that loop and understand the purpose correctly, it can probably just be dealt with with:
                    Code:
                    gen wave = "`wave'"
                    Does that help? If not, please make the problem a bit more specific.


                    Edit:
                    Also, with this bit here:
                    Code:
                    local wave = substr(c(alpha), 1
                        foreach waveprefix in `wave'{
                    Are you trying to create a local that holds all letters of the alphabet?
                    I'm not so sure of the logic and it looks like it is missing at least one bracket, but I would do:
                    Code:
                    local wave `c(alpha)'
                    foreach waveprefix in `wave'{
                    Last edited by Jorrit Gosens; 18 Oct 2019, 04:50.

                    Comment


                    • #11
                      Thank you Jorrit Gosens yes that is a big help. (yes I did leave out a bracket at the end of that line of code). Ok, so for my purposes I don't need the substring fn. Thank you very much. I'll work through this and see how I go. Kind regards, Chris

                      Comment


                      • #12
                        Chris: Please close down two of these three threads with a cross-reference to one that you are still running. Or perhaps even better: close them all down and start a new thread with a full explanation of the problems as you now understand them.

                        Comment


                        • #13
                          Yes will do thank you Nick Cox.

                          Comment

                          Working...
                          X