Merging respondent-helper level data with respondent level data

Angela Jiang

Join Date: Feb 2023

Posts: 22
#1

Merging respondent-helper level data with respondent level data

30 Aug 2023, 18:17

Hi all,

I am trying to see if I can simplify some code I was given to improve readability and efficiency.

My goal is to merge HRS (Health and Retirement Study) respondent-helper level files with respondent level files and then create some aggregate information on the care respondents get.

The respondent-helper level file looks like the following, each respondent can have multiple helpers. The helperid is not unique but the combination of helperid and id is. The helper level file has more information about that helper, like whether they are paid.

* Example generated by -dataex-. For more info, type help dataex
clear
input byte(id helperid) float paid
1 21 1
1 11 0
2 21 0
end

The respondent level files look like the following, h1 and h2 are the helperids for helper 1 and helper 2. Here person 2 only has one helper, in the real data there are also other information we want to keep from the respondent level file that is not in the helper level file, hence the need to merge.

* Example generated by -dataex-. For more info, type help dataex
clear
input byte id float(h1 h2)
1 21 11
2 21 .
end

The initial merged product might look like the following:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte id float(h1 h2 h1_paid h2_paid)
1 21 11 1 0
2 21 . 0 .
end

The final product should look like the following, where any_paid is whether the respondent had any helper that was paid in that period.

* Example generated by -dataex-. For more info, type help dataex
clear
input byte id float any_paid
1 1
2 0
end

The current code I have look something like below. The idea is to use each helperfile, rename the helperid variable to h1 h2 etc and the characteristics to h1_paid h2_paid etc , save as new file, and then merge with the original respondent file. This works but the dofile can quickly become confusing to read (in reality we don't just have h1 h2 but ha1 ha2..etc and hb1 and hb2 etc) and is not very intuitive (as we are saving the same file over and over again as different things). I would really appreciate any suggestions on how to simply this process, let me know if you need any further clarification.

//rename helperfiles

forval x=1/2 {
use `helperfile.dta", clear
gen h`x'=helperid
gen h`x'_paid = paid
tempfile h`x'
save "h`x'", replace
}

//merge with respondent file and get final information

use "respondentfile.dta"

forval x=1/2 {
merge 1:1 id h`x' using "h`x'"
}

forval x=1/2 {
gen any_paid = 1 if h`x'_paid==1
}

keep id any_paid
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30164
#2

30 Aug 2023, 19:30

This appears complicated because you are thinking in terms of a dysfunctional wide data layout, with a single observation per respondent and the various helpers laid out side by side. Most Stata commands work best, or only, with data in long layout, where there is a single observation per respondent-helper pair (just like your helper file).

Code:

* Example generated by -dataex-. For more info, type help dataex clear input byte id float(h1 h2) 1 21 11 2 21 . end tempfile respondent_file save `respondent_file' * Example generated by -dataex-. For more info, type help dataex clear input byte(id helperid) float paid 1 21 1 1 11 0 2 21 0 end tempfile helper_file save `helper_file' // RESHAPE THE RESPONDENT FILE TO LONG use `respondent_file', clear reshape long h, i(id) drop _j rename h helperid drop if missing(helperid) // COMBINE WITH THE HELPER FILE merge 1:1 id helperid using `helper_file' // CALCULATE THE ANY PAID VARIABLE by id (helperid), sort: egen byte any_paid = max(paid) drop _merge

Evidently, replace `respondent_file' and `helper_file' by the actual names of those files.

Notice how the -merge- is completely transparent this way, and the calculation of the any_paid variable is a one-liner. Notice, too, that no explicit loops appear in the code anywhere. And not only will this work regardless of how many helpers any respondent has, you don't even have to know in advance what the largest number of helpers is: the code figures it all out for you.

Now, if you know for a fact that you will be using Stata commands that work better (or at all) in wide layout, you can -reshape- this data set to wide and you will have your stated final product. But given how Stata works, you will likely regret doing that. You are almost certainly better off leaving it the way it is, in long layout. Nearly everything will be easier this way, just as this is.

Last edited by Clyde Schechter; 30 Aug 2023, 19:33.
Comment
Angela Jiang

Join Date: Feb 2023

Posts: 22
#3

31 Aug 2023, 18:37

Hi Clyde,

Thanks so much for this. I started rewriting the code with your suggestion and I think the final product will look much better. I think it just never occurred to me to try making the file long form first because it wasn't longitudinal, so in this sense I learnt a new way of using reshape.

Best,
Angela

Last edited by Angela Jiang; 31 Aug 2023, 18:40.
Comment

Announcement

Merging respondent-helper level data with respondent level data

Comment

Comment