Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging respondent-helper level data with respondent level data

    Hi all,

    I am trying to see if I can simplify some code I was given to improve readability and efficiency.

    My goal is to merge HRS (Health and Retirement Study) respondent-helper level files with respondent level files and then create some aggregate information on the care respondents get.

    The respondent-helper level file looks like the following, each respondent can have multiple helpers. The helperid is not unique but the combination of helperid and id is. The helper level file has more information about that helper, like whether they are paid.

    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte(id helperid) float paid
    1 21 1
    1 11 0
    2 21 0
    end

    The respondent level files look like the following, h1 and h2 are the helperids for helper 1 and helper 2. Here person 2 only has one helper, in the real data there are also other information we want to keep from the respondent level file that is not in the helper level file, hence the need to merge.

    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte id float(h1 h2)
    1 21 11
    2 21 .
    end

    The initial merged product might look like the following:

    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte id float(h1 h2 h1_paid h2_paid)
    1 21 11 1 0
    2 21 . 0 .
    end


    The final product should look like the following, where any_paid is whether the respondent had any helper that was paid in that period.

    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte id float any_paid
    1 1
    2 0
    end



    The current code I have look something like below. The idea is to use each helperfile, rename the helperid variable to h1 h2 etc and the characteristics to h1_paid h2_paid etc , save as new file, and then merge with the original respondent file. This works but the dofile can quickly become confusing to read (in reality we don't just have h1 h2 but ha1 ha2..etc and hb1 and hb2 etc) and is not very intuitive (as we are saving the same file over and over again as different things). I would really appreciate any suggestions on how to simply this process, let me know if you need any further clarification.


    //rename helperfiles

    forval x=1/2 {
    use `helperfile.dta", clear
    gen h`x'=helperid
    gen h`x'_paid = paid
    tempfile h`x'
    save "h`x'", replace
    }


    //merge with respondent file and get final information

    use "respondentfile.dta"

    forval x=1/2 {
    merge 1:1 id h`x' using "h`x'"
    }

    forval x=1/2 {
    gen any_paid = 1 if h`x'_paid==1
    }


    keep id any_paid


















  • #2
    This appears complicated because you are thinking in terms of a dysfunctional wide data layout, with a single observation per respondent and the various helpers laid out side by side. Most Stata commands work best, or only, with data in long layout, where there is a single observation per respondent-helper pair (just like your helper file).
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte id float(h1 h2)
    1 21 11
    2 21 .
    end
    tempfile respondent_file
    save `respondent_file'
    
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte(id helperid) float paid
    1 21 1
    1 11 0
    2 21 0
    end
    tempfile helper_file
    save `helper_file'
    
    // RESHAPE THE RESPONDENT FILE TO LONG
    use `respondent_file', clear
    reshape long h, i(id)
    drop _j
    rename h helperid
    drop if missing(helperid)
    
    // COMBINE WITH THE HELPER FILE
    merge 1:1 id helperid using `helper_file'
    
    // CALCULATE THE ANY PAID VARIABLE
    by id (helperid), sort: egen byte any_paid = max(paid)
    drop _merge
    Evidently, replace `respondent_file' and `helper_file' by the actual names of those files.

    Notice how the -merge- is completely transparent this way, and the calculation of the any_paid variable is a one-liner. Notice, too, that no explicit loops appear in the code anywhere. And not only will this work regardless of how many helpers any respondent has, you don't even have to know in advance what the largest number of helpers is: the code figures it all out for you.

    Now, if you know for a fact that you will be using Stata commands that work better (or at all) in wide layout, you can -reshape- this data set to wide and you will have your stated final product. But given how Stata works, you will likely regret doing that. You are almost certainly better off leaving it the way it is, in long layout. Nearly everything will be easier this way, just as this is.
    Last edited by Clyde Schechter; 30 Aug 2023, 19:33.

    Comment


    • #3
      Hi Clyde,

      Thanks so much for this. I started rewriting the code with your suggestion and I think the final product will look much better. I think it just never occurred to me to try making the file long form first because it wasn't longitudinal, so in this sense I learnt a new way of using reshape.

      Best,
      Angela


      Last edited by Angela Jiang; 31 Aug 2023, 18:40.

      Comment

      Working...
      X