Value Labels and Appending Data

Yawo Kokuvi

Join Date: May 2015

Posts: 137
#1

01 May 2019, 12:42

I have a question about appending data from multiple survey rounds across 10 countries, using the Demographic and Health Surveys (DHS) data.

When I append multiple rounds/surveys, all my value labels are messed up. For example, the "region" variable seemed to include only the value labels from the last data appended. So, if country A has regions 1 2 3, and country B has regions 3 4 5, I would expect the appended data to include all 6 regions. But in my case, only regions 3 4 5 are populated.

Do you have any hints or strategies for synchronizing the value labels, given your experience?

Thanks, Yawo
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

01 May 2019, 14:22

Perhaps the discussion at

https://www.statalist.org/forums/for...nding-datasets

will be useful.
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150
#3

01 May 2019, 14:38

Here is the outline of an approach that may be more effective.

Code:

cls

// create two example datasets

clear
input float country
1
end
label define country 1 "USA"
label values country country
tempfile data1
save `data1'

clear
input float country
2
end
label define country 2 "Canada"
label values country country
tempfile data2
save `data2'

// code starts here

use `data1', clear
tempfile label1
label save country using `label1'
label drop country

append using `data2'
label list country

do `label1'
label list country

Code:

. append using `data2'

. label list country
country:
           2 Canada

. 
. do `label1'

. label define country 1 `"USA"', modify

. 
end of do-file

. label list country
country:
           1 USA
           2 Canada

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30029
#4

01 May 2019, 17:30

There is also the possibility that the labeling schemes in the different waves are not only incomplete, but might be inconsistent. So before using the approach suggested in #3, go through each wave and examine the labeling schemes for this. If the same country is never assigned to a different number, nor vice versa, in the different labels, then the code in #3 will do the job very nicely. But if there are inconsistencies, that approach will end up with some observations mislabeled. In that case, you have to do something a little more complicated:

Code:

clear*
tempfile building
save `building', emptyok
local n_waves 5 // OR HOWEVER MANY WAVES THERE ARE
forvalues i = 1/`n_waves' {
    use data_from_wave_`i', clear
    decode country, gen(_country)
    drop country
    append using `building'
    save `"`building'"', replace
}
encode _country, gen(country)

The end result will be all of the data sets appended together, and with a single consistent and complete labeling of the country variable.
Comment
Yawo Kokuvi

Join Date: May 2015

Posts: 137
#5

02 May 2019, 09:50

Thanks for all your suggestions: I agree the decode-encode sequence will work for my needs.

I was going to go via a manual process, open each dataset and execute the decode/encode sequence.

But given that the variables are about 90% similar across datasets (only a few questions were country-specific), I think it is feasible to use Clyde's approach, which automates the process.

So, just to be sure I am understanding his suggestion correctly, I will annotate the suggested code below, and I will be very grateful for any clarifications.

My Questions:

Code:

clear*
tempfile building
save `building', emptyok

My Question: instead of an empty dataset, I can start with data for Country1, right? *

Code:

local n_waves 5 // OR HOWEVER MANY WAVES THERE ARE

My Question: since I have 10 countries with 18 rounds - some have 1, some 2, my N-waves will be 18, is that right

Code:

forvalues i = 1/`n_waves' {
    use data_from_wave_`i', clear

My Question: this will call and cycle through each of the n_waves data. Can I rename n_waves n_surveys?

Code:

    decode country, gen(_country)
    drop country

Comment: I am not sure about this line: Do I place all the variables across all datasets here, even if some are missing in some countries?

Code:

    append using `building'
    save `"`building'"', replace
}
encode _country, gen(country)

what do I place here for the append command - the names of one or all the datasets ? or do I have multiple append commands?

Thanks very much, ... I look forward to further comments on this.

best - Yy

Last edited by Yawo Kokuvi; 02 May 2019, 10:06.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30029
#6

02 May 2019, 10:14

Line 1 and 2: instead of an empty dataset, I can start with data for Country1, right? *

I won't say you can't, but it will make things more complicated, because at the top of the loop you read in a new data set. In order for the first data set to be included, you would have to add code to the loop to avoid overwriting it.

Also, even if you do make that modification, you still need the tempfile building to accumulate the results as each data set is appended.

Line 4: since I have 10 countries with 18 rounds - some have 1, some 2, my N-waves will be 18, is that right

Your n_waves will be the number of data sets. It isn't clear to me from your description how many that will be. Look, using a -forvalues- loop may not be the best approach here. I gave that as an example because survey data sets usually have names that include the round number or the year number or something like that, which makes it easy. But if your data sets' names do not include a round or year number, then you might be better off using the -local: dir- command to create a local macro containing the names of the files, and then doing a -foreach- loop over that local macro instead.

Line 5-6: this will call and cycle through each of the n_waves data. Can I rename n_waves n_surveys?

You can call it anything you like, as long as you do so consistently in both places, and so long as you do not use the name of some other local macro that is active at that point.

Line 7: I am not sure about this line: Do I place all the variables across all datasets here, even if some are missing in some countries?

In your problem description you referred to only a single problematic variable, country, and the code reflects that. If there are several variables that present this same problem, then you need to have a separate -decode- command for each of them (and, correspondingly, a separate -encode- command at the end). -encode- and -decode- only take one variable at a time. If the number of variables you have to deal with in this way is large, then rather than writing them out one by one, you would use another loop here.

Line 9: what do I place here - the names of all the datasets ? or do I have multiple append commands?

Don't change that line! Use it exactly as you see it. The file `building' keeps growing as the code runs. At first it contains only the results from the first file, then next time through it contains the results from both the first and second files. And on and on until finally, when the loop terminates, it contains the results from all of the files.
Comment
Yawo Kokuvi

Join Date: May 2015

Posts: 137
#7

04 May 2019, 09:38

Thanks very much, Clyde and others:

Given that my data has multiple variables that need to be decoded, I am following up on your suggestion to use a loop for the decode.
My approach is to first get the variables that have value labels (using the -ds- command), then immediately use those saved variable names (from the r() macro). But I received an error: invalid name

Here is my code for the decode portion. I intend to use the same approach to encode these same variables in the appended dataset.

Code:

set more off
ds, has(vallabel)
local vars `r(varlist)'
foreach v of varlist 'vars'{
decode `v', gen(s_`v')
}

I will appreciate some help to diagnose this problem.

Thanks - Yy
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30029
#8

04 May 2019, 09:45

You used the wrong character (') to start the reference to local macro vars in your -foreach- command. It should be:

Code:

foreach v of varlist `vars'{
Comment
Yawo Kokuvi

Join Date: May 2015

Posts: 137
#9

04 May 2019, 09:49

Thanks, I made the correction and it worked.

Now all my datasets are in a single directory / folder. I want to use a loop to call each of them, and then go through the foreach. Here is an extract of the dataset names ... there are 30 of them.

Do I have to do a double foreach, a loop within a loop?

Thanks - Yy

Attached Files

Last edited by Yawo Kokuvi; 04 May 2019, 10:03.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30029
#10

04 May 2019, 11:08

No. This can be accomplished in a single loop. If these are all the files you need, and if there are no other .dta files in that directory, then, with the current working directory set to that directory, you can do this:

Code:

clear*
tempfile building
save `building', emptyok
local filenames: dir "." files "*.dta"
foreach f of local filenames {
    use `"`f'"', clear
    // CODE TO CLEAN UP THE VARIABLES GOES HERE
    append using `building'
    save `"`building'"', replace
}
Comment
Yawo Kokuvi

Join Date: May 2015

Posts: 137
#11

04 May 2019, 11:54

Thanks. I want to drop the variables that were in the varlist after they have been decoded. I think the right place to do this is after the decode command.

Is it OK to still refer to r(varlist) at this stage? Would this be the code: drop vars r(varlist)?

Thanks. Yy
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30029
#12

04 May 2019, 12:48

Well, it might work, but it's not a good idea. It will work provided that no commands between -ds...- and -drop `r(varlist)'- do anything that overwrites r(). I think that's the case for the code shown in #7 (after correction as per #8). But even if it works now, you might come back later and decide to change the code in some way that causes it to break. And then you will be mystified that something that worked perfectly well before suddenly throws error messages! It can be very difficult to perceive what has changed, because the list of commands that overwrite -r()- is very large, but not systematic enough to easily remember. So if you want to re-use the contents of r(varlist) it is better to store it in a named local macro that you create, and then refer to that local macro later.

Alternatively, if all you are concerned about is dropping those variables, you can also do that by putting -drop `v'- right after the -decode- command inside the loop.
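
A minimal sketch of the named-macro approach, using Stata's bundled auto dataset purely for illustration. The macro name labeled and the s_ prefix are arbitrary choices made for this example; nothing in Stata requires them.

Code:

sysuse auto, clear
// find the value-labeled variables and copy the list out of r() right away
ds, has(vallabel)
local labeled `r(varlist)'
foreach v of local labeled {
    decode `v', gen(s_`v')
}
// later commands may have overwritten r(), but the local macro is unaffected
drop `labeled'

Because the list lives in a local macro that you created, later edits to the code between -ds- and -drop- should not change which variables get dropped.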
Comment

Yawo Kokuvi

Join Date: May 2015
Posts: 137

#13

04 May 2019, 15:05

Thanks.. So is this the full code then?

Code:

set more off
clear*
tempfile building
save `building', emptyok
local filenames: dir "." files "*.dta"
foreach f of local filenames {
use `"`f'"', clear
ds, has(vallabel)
local vars `r(varlist)'
foreach v of varlist `vars'{
decode `v', gen(s_`v')
drop `v'
}
append using `building'
save `"`building'"', replace
}

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30029
#14

04 May 2019, 18:03

Yes, that looks right.

You should get into the habit of indenting the code inside loops, as I have done in my responses. Although it makes no difference to Stata, it makes it easier to read the code and see what is going on. It also makes it much easier to debug issues like unbalanced curly braces. It's just a matter of style, but good programming style will save you time and trouble in the long run.
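
The loop in #13 leaves the decoded strings in s_* variables, and the thread does not show the matching -encode- step. What follows is only a sketch of one way to do it once the loop has finished. It assumes every variable whose name begins with s_ was created by the -decode- loop (an assumption about your data, since an original variable could share that prefix), and it uses a tiny made-up dataset in place of the appended file so that it can be run on its own.

Code:

// toy stand-in for the appended data: one decoded string variable
clear
input str10 s_country
"USA"
"Canada"
"USA"
end
foreach v of varlist s_* {
    local orig = substr("`v'", 3, .)   // strip the s_ prefix to recover the name
    encode `v', gen(`orig')            // builds one value label from the strings
    drop `v'
}
label list country

Two things to keep in mind with this round trip: -encode- assigns its numeric codes in alphabetical order of the strings, so they will generally not match the original survey codes, and any value that had no label decodes to an empty string and so comes back as missing.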
Comment

Yawo Kokuvi

Join Date: May 2015
Posts: 137

#15

04 May 2019, 21:20

is this better:

Code:

set more off
tempfile building
save `building', emptyok
local filenames: dir "." files "*.dta"
foreach f of local filenames {
    use `"`f'"', clear
    ds, has(vallabel)
    local vars `r(varlist)'
       foreach v of varlist `vars'{
       decode `v', gen(s_`v')
       drop `v'
       append using `building'
       save `"`building'"', replace
       }
}
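
For comparison with #13, here is a sketch in which only -decode- and -drop- sit inside the inner loop, while -append- and -save- run once per file in the outer loop. In the version above, -append- and -save- have moved inside the inner loop, so they would run once per labeled variable rather than once per file, and the opening -clear*- is gone, so -save ..., emptyok- would save whatever data happen to be in memory. As in #10, the sketch assumes the current working directory contains only the .dta files to be combined.

Code:

set more off
clear*
tempfile building
save `building', emptyok
local filenames: dir "." files "*.dta"
foreach f of local filenames {
    use `"`f'"', clear
    ds, has(vallabel)
    local vars `r(varlist)'
    // vars is already an expanded list of names, so -of local- is enough here
    foreach v of local vars {
        decode `v', gen(s_`v')
        drop `v'
    }
    // once per file, after all of its labeled variables have been decoded
    append using `building'
    save `"`building'"', replace
}

Indenting each loop body one level deeper than its -foreach- line makes it easy to see which closing brace belongs to which loop.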
