How to loop over a merge of inconsistent variables over time

Chris Boulis

Join Date: Feb 2019
Posts: 368

How to loop over a merge of inconsistent variables over time

13 Oct 2019, 20:21

Dear StataList.

I need to merge two sets of data (partner + respondent) per wave for many waves. The complication is that while the control variables appear in each wave (though some begin in wave 2, not wave 1), some variables (of interest) appear irregularly and inconsistently (every 3 or 4 years). How can I deal with this merge and apply a loop for all waves? I read over https://www.statalist.org/forums/new-content/51 and @Cox and Kantor suggested there may be better options than the "if, then" command (which I attempted using without success below).
Note: this is a very large dataset (000s of variables) ...

Code:

*partner data
local p = substr("abcdefghijklmnopqrstuv",`wave',1)
if wave == d g j n {
use waveid age sex empstat educ inc marstat nlpreg workhr relb relimp relat using "c:/data/Combined_a170c.dta", clear
}else { if wave == a b c e f h i k l m o p q     // all waves a-q, excl d, g, j, n
use waveid age sex empstat educ inc marstat nlpreg workhr using "c:/data/Combined_a170c.dta", clear
}rename `p'* p_* // replace wave with partner data prefix
  rename waveid hhidsort hhpxid save "c:/data/temp", replace

*respondent data
local p = substr("abcdefghijklmnopqrstuv",`wave',1)
if wave == d g j n {
use waveid hhid age sex empstat educ inc marstat nlpreg workhr relb relimp relat using "c:/data/Combined_a170c.dta", clear
}else { if wave == a b c e f h i k l m o p q     // all waves a-q, excl d, g, j, n
use waveid hhid age sex empstat educ inc marstat nlpreg workhr using `readdatadir'/Combined_q170c.dta, clear
}rename `p'* *
drop if hhid=="" sort hhid

merge 1:1 hhid using "c:/data/temp", replace

// I cannot figure out how to then loop this for all other waves?

Tags: None

Clyde Schechter

Join Date: Apr 2014
Posts: 30147

13 Oct 2019, 21:09

Well, this code is riddled with invalid syntax and logical errors. So before we worry about embedding it in a loop over waves (which will be very simple to do) you need to get this fixed up first. I can't really tell from the code, nor from your description, what's going on and what you are trying to do. So what I've put below is your code annotated with my comments pointing out problems with it. Look it over and get it fixed up. Then post back with your fixed-up code, and try to explain a little more clearly where you want to go.

Code:

*partner data

// WHY ARE YOU USING LETTER DESIGNATIONS FOR THE WAVES?
// IS THERE SOME IMPORTANT REASON? IT'S GOING TO MAKE LIFE
// HARDER IN MANY WAYS THAN HAVING IT BE A NUMBER.
local p = substr("abcdefghijklmnopqrstuv",`wave',1)

// THE NEXT COMMAND IS NOT LEGAL SYNTAX. TAKE A LOOK AT THE -inlist()- FUNCTION
// ALSO IF, AS WOULD APPEAR FROM THE PRECEDING -local p = ...- COMMAND, wave IS
// A LOCAL MACRO, THEN YOU MUST REFER TO IT AS `wave', NOT wave.

// THEN AGAIN, THE WHOLE IF-ELSE IF STRUCTURE HERE SEEMS POINTLESS: YOU DO
// EXACTLY THE SAME THING REGARDLESS OF WHICH BRANCH YOU ARE ON.
if wave == d g j n {
use waveid age sex empstat educ inc marstat nlpreg workhr relb relimp relat using "c:/data/Combined_a170c.dta", clear
}
else {
if wave == a b c e f h i k l m o p q // all waves a-q, excl d, g, j, n
use waveid age sex empstat educ inc marstat nlpreg workhr using "c:/data/Combined_a170c.dta", clear
}

// TAKE A LOOK AT YOUR -use- COMMAND. THERE AREN'T GOING TO BE ANY VARIABLES
// THAT START WITH `p' EXCEPT SPORADICALLY AND BY COINCIDENCE ON CERTAIN WAVES
// I'M REALLY NOT SURE WHAT YOU EVEN INTEND HERE.
rename `p'* p_* // replace wave with partner data prefix
rename waveid hhid
// AGAIN LOOK AT YOUR -use- AND -rename- COMMANDS: THERE IS NO VARIABLE hhpxid.
sort hhpxid
save "c:/data/temp", replace

// THIS LOOKS REMARKABLY SIMILAR TO WHAT YOU DID ABOVE. AND IT REUSES THE SAME DATA
// FILES. SO IT DOESN'T LOOK LIKE THIS CODE SUCCESSFULLY DISTINGUISHES RESPONDENTS
// FROM PARTNERS. SOMETHING IS MISSING.
*respondent data
local p = substr("abcdefghijklmnopqrstuv",`wave',1)
if wave == d g j n {
use waveid hhid age sex empstat educ inc marstat nlpreg workhr relb relimp relat using "c:/data/Combined_a170c.dta", clear
}
else {
if wave == a b c e f h i k l m o p q // all waves a-q, excl d, g, j, n
// OK, SO MAYBE THIS IS A DIFFERENT FILE THAN USED INH THE CODE ABOVE. IT APPEARS TO BE
// IN A DIFFERENT DIRECTORY. IS THE FILENAME REALLY Combined_q170c? OR IS THE q A
// TYPO FOR a?
use waveid hhid age sex empstat educ inc marstat nlpreg workhr using `readdatadir'/Combined_q170c.dta, clear
}
rename `p'* *
drop if hhid==""
sort hhid

// THIS COMMAND WILL FAIL BECAUSE THERE IS NO hhid VARIABLE IN c:/data/temp
merge 1:1 hhid using "c:/data/temp", replace

Last edited by Clyde Schechter; 13 Oct 2019, 21:12.

Comment

Chris Boulis

Join Date: Feb 2019
Posts: 368

16 Oct 2019, 22:30

Thank you very much Clyde Schechter for your comments. My updated code: (this appears to work, but I feel something is wrong. e.g. I am confident I have lost hhid in the partner data)

Thank you in advance for any guidance.

Code:

* partner data
local wave a b c d e f g h i j k l m n o p q  // All vars contain a prefix that denotes the wave (letters not numbers), i.e. age in wave 1 is aage, in wave 2 is bage, etc
foreach x of local wave {
use "c:/data/Combined_`x'170c.dta, clear
bysort `x'hhid: gen wave=_n   // hhid has a wave prefix
rename `x'* p_*     // replaces wave identifier with partner identifier
rename p_hhid hhid          
rename xwaveid hhid  // xwaveid has no wave prefix
sort hhid         // xwaveid - individual unique identifier // hhid - partner's cross wave identifier

tempfile mergedata
save "`mergedata'", replace emptyok

use "c:/data/Combined_`x'170c.dta, clear
rename `x'* *     // pulling in respondent data
drop if hhid=="" 
sort hhid
merge 1:1 hhid using "`mergedata'", replace 

tempfile append
append using "`append'", replace emptyok
save "`append'", replace
}

keep xwaveid hhid age sex empstat educ wage marstat relb relimp relat p_age p_sex p_empstat p_educ p_wage p_marstat p_relb p_relimp p_relat // vars of interest for respondent and partner (p_) // vars relg relimp relat are only in waves d g j n

save "c:/newdata/rpdata.dta, replace emptyok

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#4

16 Oct 2019, 22:46

Well, I can see one problem with this code. When you -rename `x'* p_*- or -rename `x'* *-, think about what happens when x = a (which is the first time through the loop.) The variable age is going to be renamed p_ge or ge. And, probably more serious a problem, when x = h, hhid will be renamed p_hid or hid, and in both cases a crucial identifier variable is being missed.
1 like
Comment
Chris Boulis

Join Date: Feb 2019

Posts: 368
#5

17 Oct 2019, 01:22

Thanks for your reply Clyde Schechter. I thought that this will only remove the wave prefix, say for example in wave 1, 'aage' would become 'p_age' or 'age', 'asex' would become 'p_sex' and 'sex'.

Are there any other issues in my code in #3? In the 'partner' data I have hhid (which is really xwaveid), I also have hhid in the 'respondent' data, but when I merge the two, I will lose one of the two. Can anyone help with how to deal with this problem?

Thank you.

Last edited by Chris Boulis; 17 Oct 2019, 01:25.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#6

18 Oct 2019, 12:43

Well, let me clarify. I made certain assumptions about what variables are in your data sets based on the code in #1. But whether my assumptions are right or wrong, either way you have a problem, though the two cases present different problems.

If the names of the variables in the data set for wave x are named xage xsex xempstat xeduc, etc., then your -rename `x'* p_*- command will appropriately rename them to p_age, p_sex, etc. If, however there are variables whose names do not start with x, such as, I inferred, hhid, they will get mishandled when `x' == h. Also, if the variable names really do start with `x', then your -use waveid age sex empstat educ inc marstat nlpreg///- command will fail because the variables age sex empstat, etc. will not exist in the data set. So either way, something goes awry.

In the 'partner' data I have hhid (which is really xwaveid), I also have hhid in the 'respondent' data, but when I merge the two, I will lose one of the two. Can anyone help with how to deal with this problem?

I don't follow this. I guess I don't understand what you're doing, but I thought you are using -hhid- as the -merge- key when you merge the partner and respondent data. If that's the case, then the hhid will be the same in both data sets, so no information is lost by not having the same information in two separate variables after the merge runs. So there is no problem to deal with. What am I missing here?
1 like
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35754
#7

19 Oct 2019, 01:52

Chris: Best to run on thread at a time. I wasn’t following this one but several of the points made here by Clyde have been made by others in other threads.
1 like
Comment
Chris Boulis

Join Date: Feb 2019

Posts: 368
#8

20 Oct 2019, 23:22

Yes thank you Nick Cox. And thank you Clyde Schechter - you have made some very good points which I will address in my code. As 'x' applies to the 'local wave' list (a-q) we wouldn't reach 'x' right? so that shouldn't be an issue. The partner cross-wave identifier (hhid) also contains the wave-prefix so that should not be a problem either - hopefully that came through in the code update shown in #3. What I am worried about is that in #3, I use hhid as the merge key - in the partner data part of the code I "rename xwaveid hhid" so then I thought hhid in the partner data is really xwaveid, so when I merge on hhid from the respondent data - i thought I would be merging two different id sets (xwaveid from partner and hhid from respondent)? Is this not of concern?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#9

23 Oct 2019, 08:05

As 'x' applies to the 'local wave' list (a-q) we wouldn't reach 'x' right?

Well, the value of local macro x will never get to the letter x, but that's not what I was referring to. Rather my point is that, for example, on the first iteration when `x' is the letter a, the variable age will be renamed to p_ge. Now maybe there is no variable age: it's really aage, with the first a being the wave prefix. Fine, but then in that case your original -use- command that mentions the variable age will fail, because it has to use aage instead. The code in #3 seems to have overcome some of this.

If you are trying to merge the files, yes, the merge key variable(s) must have the same name in both files. There is a problem in #3, though because you rename p_hhid to hhid and then you rename xwaveid to hhid. Well, you can't do both. When you reach the second of those -rename- commands, Stata will stop and remind you that hhid already exists. So you need to figure out which of those two original variables is the one you need to use as the merge key, and then rename only that one to hhid. The merged dataset will then contain only one copy of hhid, but that makes complete sense because the observations that get paired up in the merge always have the same value of hhid, so there is nothing lost by having only one copy of it in the end.
1 like
Comment
Chris Boulis

Join Date: Feb 2019

Posts: 368
#10

28 Oct 2019, 21:35

Thank you very much for your help Clyde Schechter.
Comment

Announcement

How to loop over a merge of inconsistent variables over time

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment