  • Value Labels and Appending Data

    I have a question about appending data from multiple rounds of the Demographic and Health Surveys across 10 countries.

    When I append multiple rounds/surveys, all my value labels are messed up. For example, the "region" variable seemed to include only the value labels from the last data appended. So, if country A has regions 1 2 3, and country B has regions 3 4 5, I would expect the appended data to include all 6 regions. But in my case, only regions 3 4 5 are populated.

    Do you have any hints or strategies for synchronizing the value labels, given your experience?

    Thanks, Yawo

  • #2
    Perhaps the discussion at

    https://www.statalist.org/forums/for...nding-datasets

    will be useful.



    • #3
      Here is the outline of an approach that may be more effective.
      Code:
      cls
      
      // create two example datasets
      
      clear
      input float country
      1
      end
      label define country 1 "USA"
      label values country country
      tempfile data1
      save `data1'
      
      clear
      input float country
      2
      end
      label define country 2 "Canada"
      label values country country
      tempfile data2
      save `data2'
      
      // code starts here
      
      use `data1', clear
      tempfile label1
      label save country using `label1'
      label drop country
      
      append using `data2'
      label list country
      
      do `label1'
      label list country
      Code:
      . append using `data2'
      
      . label list country
      country:
                 2 Canada
      
      . 
      . do `label1'
      
      . label define country 1 `"USA"', modify
      
      . 
      end of do-file
      
      . label list country
      country:
                 1 USA
                 2 Canada



      • #4
        There is also the possibility that the labeling schemes in the different waves are not only incomplete, but might be inconsistent. So before using the approach suggested in #3, go through each wave and examine the labeling schemes for this. If the same country is never assigned to a different number, nor vice versa, in the different labels, then the code in #3 will do the job very nicely. But if there are inconsistencies, that approach will end up with some observations mislabeled. In that case, you have to do something a little more complicated:

        Code:
        clear*
        tempfile building
        save `building', emptyok
        local n_waves 5 // OR HOWEVER MANY WAVES THERE ARE
        forvalues i = 1/`n_waves' {
            use data_from_wave_`i', clear
            decode country, gen(_country)
            drop country
            append using `building'
            save `"`building'"', replace
        }
        
        encode _country, gen(country)

        The end result will be all of the data sets appended together, and with a single consistent and complete labeling of the country variable.
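
        To see why inconsistent codes matter, here is a minimal sketch with two made-up waves. The tempfile names wave1, wave2 and wave1_str and the country names are invented for illustration. Both waves use code 1, but for different countries. A plain -append- keeps the master data's definition of the label, so the second wave's observation is silently mislabeled; decoding before appending avoids that.

        ```stata
        // Hypothetical wave 1: code 1 means Ghana
        clear
        input byte country
        1
        end
        label define country 1 "Ghana"
        label values country country
        tempfile wave1
        save `wave1'

        // Hypothetical wave 2: the same code 1 means Togo
        clear
        input byte country
        1
        end
        label define country 1 "Togo"
        label values country country
        tempfile wave2
        save `wave2'

        // Naive append: the master's label definition is kept,
        // so both observations are displayed as Ghana
        use `wave1', clear
        append using `wave2'
        list country

        // Decode each wave to a string first, then append and encode once
        use `wave1', clear
        decode country, gen(_country)
        drop country
        tempfile wave1_str
        save `wave1_str'

        use `wave2', clear
        decode country, gen(_country)
        drop country
        append using `wave1_str'
        encode _country, gen(country)
        label list country
        ```

        Because -encode- builds the new label from the distinct strings it finds, the final label list should contain both countries, each with its own code.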



        • #5
          Thanks for all your suggestions: I agree the decode-encode sequence will work for my needs.

          I was going to go via a manual process, open each dataset and execute the decode/encode sequence.

          But given that the variables are about 90% similar across datasets (there are a few questions that were country-specific), I think it is feasible to use Clyde's approach, which automates the process.

          So, just to be sure I understand his suggestion correctly, I will annotate the suggested code below, and I will be very grateful for any clarifications.



          My Questions:

          Code:
          clear*
          tempfile building
          save `building', emptyok

          My Question: instead of an empty dataset, I can start with data for Country1, right? *


          Code:
           local n_waves 5 // OR HOWEVER MANY WAVES THERE ARE

          My Question: since I have 10 countries with 18 rounds - some have 1, some 2, my N-waves will be 18, is that right

          Code:
          forvalues i = 1/`n_waves' {
              use data_from_wave_`i', clear

          My Question: this will call and cycle through each of the n_waves data. Can I rename n_waves n_surveys?

          Code:
          decode country, gen(_country)
          drop country

          Comment: I am not sure about this line: Do I place all the variables across all datasets here, even if some are missing in some countries?

          Code:
              append using `building'
              save `"`building'"', replace
          }
          encode _country, gen(country)

          What do I place here for the append command: the names of one or all of the datasets? Or do I have multiple append commands?



          Thanks very much, ... I look forward to further comments on this.

          best - Yy
          Last edited by Yawo Kokuvi; 02 May 2019, 10:06.



          • #6
            Line 1 and 2: instead of an empty dataset, I can start with data for Country1, right? *
            I won't say you can't, but it will make things more complicated, because at the top of the loop you read in a new data set. In order for the first data set to be included, you would have to add code to the loop to avoid overwriting it.

            Also, even if you do make that modification, you still need the tempfile building to accumulate the results as each data set is appended.

            Line 4: since I have 10 countries with 18 rounds - some have 1, some 2, my N-waves will be 18, is that right
            Your n_waves will be the number of data sets. It isn't clear to me from your description how many that will be. Look, using a -forvalues- loop may not be the best approach here. I gave that as an example because survey data sets usually have names that include the round number or the year number or something like that, which makes it easy. But if your data sets' names do not include a round or year number, then you might be better off using the -local: dir- command to create a local macro containing the names of the files, and then doing a -foreach- loop over that local macro instead.

            Line 5-6: this will call and cycle through each of the n_waves data. Can I rename n_waves n_surveys?
            You can call it anything you like, as long as you do so consistently in both places, and so long as you do not use the name of some other local macro that is active at that point.

            Line 7: I am not sure about this line: Do I place all the variables across all datasets here, even if some are missing in some countries?
            In your problem description you referred to only a single problematic variable, country, and the code reflects that. If there are several variables that present this same problem, then you need to have a separate -decode- command for each of them (and, correspondingly, a separate -encode- command at the end). -encode- and -decode- only take one variable at a time. If the number of variables you have to deal with in this way is large, then rather than writing them out one by one, you would use another loop here.

            Line 9: what do I place here - the names of all the datasets ? or do I have multiple append commands?
            Don't change that line! Use it exactly as you see it. The file `building' keeps growing as the code runs. At first it contains only the results from the first file, then next time through it contains the results from both the first and second files. And on and on until finally, when the loop terminates, it contains the results from all of the files.



            • #7
              Thanks very much, Clyde and others:

              Given that my data has multiple variables that needed to be decoded, I am following up your suggestion to use a loop for the decode.
              My approach is to first get the variables that have value labels (using the -ds- command), then immediately use the variable list it returns in r(varlist). But I received an error: invalid name

              Here is my code for the decode portion. I intend to use the same approach to encode these variables in the appended dataset.

              Code:
              set more off
              ds, has(vallabel)
              local vars `r(varlist)'
              foreach v of varlist 'vars'{
              decode `v', gen(s_`v')
              }

              I would appreciate some help diagnosing this problem.

              Thanks - Yy



              • #8
                You used the wrong character (') to start the reference to local macro vars in your -foreach- command. It should be:
                Code:
                foreach v of varlist `vars'{



                • #9
                  Thanks, I made the correction and it worked.

                  Now all my datasets are in a single directory / folder. I want to use a loop to call each of them, and then go through the foreach. Here is an extract of the dataset names ... there are 30 of them.

                  Do I have to do a double foreach, a loop within a loop?

                  Thanks - Yy

                  Attached Files
                  Last edited by Yawo Kokuvi; 04 May 2019, 10:03.



                  • #10
                    No. This can be accomplished in a single loop. If these are all the files you need, and if there are no other .dta files in that directory, then, with the current working directory set to that directory, you can do this:

                    Code:
                    clear*
                    tempfile building
                    save `building', emptyok
                    local filenames: dir "." files "*.dta"
                    foreach f of local filenames {
                        use `"`f'"', clear
                        // CODE TO CLEAN UP THE VARIABLES GOES HERE
                        append using `building'
                        save `"`building'"', replace
                    }



                    • #11
                      Thanks. I want to drop the variables that were in the varlist after they have been decoded. I think the right place to do this is after the decode command.

                      Is it OK to still refer to r(varlist) at this stage? Would this be the code: drop vars r(varlist)?

                      Thanks. Yy



                      • #12
                        Well, it might work, but it's not a good idea. It will work provided that no commands between -ds...- and -drop `r(varlist)'- do anything that overwrites r(). I think that's the case for the code shown in #7 (after correction as per #8). But even if it does, you might come back later and decide to change the code in some way that causes it to break. And then you will be mystified that something that worked perfectly well before suddenly throws error messages! It can be very difficult to perceive what has changed, because the list of commands that overwrite -r()- is very large, but not systematic enough to easily remember. So if you want to re-use the contents of r(varlist) it is better to store it in a named local macro that you create, and then refer to that local macro later.

                        Alternatively, if all you are concerned about is dropping those variables, you can also do that by putting -drop `v'- right after the -decode- command inside the loop.
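
                        A small sketch of the point about r(), using the auto dataset that ships with Stata. The choice of dataset and of -summarize- as the intervening command is only for illustration; any r-class command would have the same effect.

                        ```stata
                        sysuse auto, clear

                        ds, has(vallabel)
                        local vars `r(varlist)'      // copy the result right away

                        summarize price              // an r-class command: replaces what is in r()

                        display "r(varlist) now: `r(varlist)'"   // no longer the -ds- result
                        display "local vars:     `vars'"         // still the value-labeled variables
                        ```

                        The local macro survives because nothing overwrites it unless you do so yourself, which is why it is the safer thing to refer to later in the do-file.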



                        • #13
                          Thanks. So is this the full code then?

                          Code:
                          set more off
                          clear*
                          tempfile building
                          save `building', emptyok
                          local filenames: dir "." files "*.dta"
                          foreach f of local filenames {
                          use `"`f'"', clear
                          ds, has(vallabel)
                          local vars `r(varlist)'
                          foreach v of varlist `vars'{
                          decode `v', gen(s_`v')
                          drop `v'
                          }
                          append using `building'
                          save `"`building'"', replace
                          }



                          • #14
                            Yes, that looks right.

                            You should get into the habit of indenting the code inside loops, as I have done in my responses. Although it makes no difference to Stata, it makes it easier to read the code and see what is going on. It also makes it much easier to debug issues like unbalanced curly braces. It's just a matter of style, but good programming style will save you time and trouble in the long run.
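
                            The loop in #13 leaves every formerly labeled variable as a string named s_*. The matching -encode- step, run once on the appended data, could look like the sketch below. It uses a tiny made-up dataset in place of the real appended file, and it assumes that the s_ prefix is used only for the decoded variables.

                            ```stata
                            // Made-up stand-in for the appended data: two decoded string variables
                            clear
                            input str10 s_region str6 s_sex
                            "North" "female"
                            "South" "male"
                            "North" "male"
                            end

                            // Collect the decoded strings and store the list in a named local
                            ds s_*, has(type string)
                            local strvars `r(varlist)'

                            foreach v of local strvars {
                                // strip the "s_" prefix to recover the original variable name
                                local newname = substr("`v'", 3, .)
                                encode `v', gen(`newname')
                                drop `v'
                            }

                            describe
                            label list
                            ```

                            Each -encode- creates one value label per variable from the strings found in the combined data, so the labeling is consistent across all of the appended files.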



                            • #15
                              Is this better:

                              Code:
                              set more off
                              tempfile building
                              save `building', emptyok
                              local filenames: dir "." files "*.dta"
                              foreach f of local filenames {
                                  use `"`f'"', clear
                                  ds, has(vallabel)
                                  local vars `r(varlist)'
                                     foreach v of varlist `vars'{
                                     decode `v', gen(s_`v')
                                     drop `v'
                                     append using `building'
                                     save `"`building'"', replace
                                     }
                              }
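
                              Note that in this version -append- and -save- sit inside the inner -foreach- loop, so they would run once per labeled variable rather than once per file. The opening -clear*- is also gone, so -save ..., emptyok- would save whatever happens to be in memory. A sketch of the code from #13 with only indentation and comments added, assuming the working directory contains just the survey .dta files:

                              ```stata
                              set more off
                              clear*
                              tempfile building
                              save `building', emptyok

                              local filenames: dir "." files "*.dta"
                              foreach f of local filenames {
                                  use `"`f'"', clear

                                  // decode every value-labeled variable in this file
                                  ds, has(vallabel)
                                  local vars `r(varlist)'
                                  foreach v of varlist `vars' {
                                      decode `v', gen(s_`v')
                                      drop `v'
                                  }

                                  // once per file: add what has been accumulated so far, then save
                                  append using `building'
                                  save `"`building'"', replace
                              }
                              ```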
