  • Demographic surveys: Adding data from adult records to child records

    Hi there,

    I have a question relating to a dataset that looks something like this:
    ID  caregiver_id_c  gender_c  gender_a  gender_caregiver
    1   6               1         .         .
    2   6               0         .         .
    3   7               1         .         .
    4   8               1         .         .
    5   9               1         .         .
    6   .               .         0         .
    7   .               .         1         .
    8   .               .         0         .
    9   .               .         0         .
    The dataset was created by merging two datasets from a national demographic survey: one with child-related information (collected via a survey answered by knowledgeable adults) and the other with adult-related information (collected by surveying the adults themselves). The suffixes _c and _a in the variable names indicate which of the two datasets a variable comes from. ID is the variable that uniquely identifies observations: each child and each adult has a unique ID number. In the dataset above, IDs 1-5 are children; 6-9 are adults.

    For my research I am interested primarily in child outcomes, such as anthropometric measures and school attendance. In analysing these variables, I need to add data about the caregiver of the child to each child's record. For example, in the child dataset, the ID number of the child's primary caregiver ('caregiver_id_c') is recorded, whereas the adult dataset contains data corresponding to that ID number (e.g. 'gender_a'). So what I would need to do is fill the final column above by creating a new variable, 'gender_caregiver', with values which are identical to the values of 'gender_a' when 'ID' is equal to 'caregiver_id_c'. For example, the first two rows include data for children with IDs 1 and 2. These children's caregiver has an ID of 6, and from the adult dataset, the gender of adult with ID 6 takes a value of 0. I need to add this 0 to the 'gender_caregiver' variable for children with IDs 1 and 2. I would have to repeat the process for children with IDs 3-5 using data from adults with IDs 7-9.

    I've looked in the forums and Stata help without success. Generally what I have found is how to generate a new variable from a second variable when a third variable takes on a specific value (e.g. gen mpg2 = mpg if foreign==0). What I need is something like gen gender_caregiver = gender_a if caregiver_id_c == ID, except that the match has to be made across observations (the child's row and the caregiver's row) rather than within a single observation. I.e., gen var1 = var2 when var3 and var4 take on identical values, whatever that value may be.

    I am using Stata version 13.0. This is my first post and I'm not that experienced with Stata, so apologies in advance if I haven't searched the forums and the Stata help sufficiently, or if this query is too basic.

    Thanks,
    Zoheb Khan
    PhD candidate, Development Studies
    University of Johannesburg

  • #2
    Note that it would help if simple data examples were created using dataex (from SSC, see the recent announcement here). It would have saved me some editing time putting together the example below and made the question a bit clearer.

    I think this is what you are looking for:

    Code:
    clear
    input ID caregiver_id_c gender_c gender_a
    1 6 1 .
    2 6 0 .
    3 7 1 .
    4 8 1 .
    5 9 1 .
    6 . . 0
    7 . . 1
    8 . . 0
    9 . . 0
    end
    tempfile f
    save "`f'"
    
    * retain adult information and rename to match
    keep ID gender_a
    rename ID caregiver_id_c
    rename gender_a gender_caregiver
    
    *  merge with main dataset
    merge 1:m caregiver_id_c using "`f'", keep(match using) nogen
    
    * make it pretty
    sort ID
    order ID caregiver_id_c gender_c gender_a 
    list
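
    As a quick check on the result, the values described in post #1 can be asserted directly. This is only a sketch: it assumes the code above has just been run on the nine-observation example, and the expected values are read off that example (caregivers 6, 8 and 9 have gender_a of 0, caregiver 7 has 1).

    ```stata
    * children 1 and 2 (caregiver 6), 4 (caregiver 8) and 5 (caregiver 9) should get 0
    assert gender_caregiver == 0 if inlist(ID, 1, 2, 4, 5)
    * child 3 (caregiver 7) should get 1
    assert gender_caregiver == 1 if ID == 3
    * adult rows have no caregiver, so the new variable stays missing
    assert missing(gender_caregiver) if ID >= 6
    ```

    Each assert is silent when the condition holds and stops with an error otherwise.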

    • #3
      Thank you Robert, this is exactly what I needed. And noted - in future I will make use of dataex. Thanks again!

      • #4
        Dear all

        I have an update to the query above, relating to a dataset of children that is unique on the ID of the child (child_id_c). The dataset includes information about each child's carer, e.g. gender and carer ID (caregiver_id_c), merged in from a dataset of adults as per Robert's suggestion above. The suffix _c or _a indicates which of the original datasets (child or adult) a given variable came from, i.e. before merging.

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input str4 child_id_c byte(caregiver_id_c gender_c gender_a state_support_c) int(weight_c weight_a) byte(province cluster)
        "1"  6 1 0 1 1440  357 1 12
        "2"  6 0 0 0 1220  357 1 12
        "3"  7 1 1 0 1220 1000 3 16
        "4"  8 1 0 0 1600 1100 5 15
        "5"  9 1 0 0 1500  900 4 86
        "6" 10 0 1 1 1500  750 8  2
        "7" 10 1 1 1 1650  750 8  2
        end

        I am currently analysing this dataset using survey methods. To estimate, for example, the proportion of children whose carers receive state support on their behalf, the commands are simple:

        Code:
        svyset cluster [pweight=weight_c], strata(province)
        svy: prop state_support_c

        These commands produce estimates that are consistent with the official national estimates, both for children overall and for children whose carers receive state support on their behalf.

        However, I would also like to analyse the carers. I'm unsure of how to do this using a dataset arranged around children without including duplicates, which would also lead to strange weighted estimates. If I'm interested in the proportion of carers who are male (as opposed to the number of children with male carers), is there a way of counting only values unique on caregiver_id_c in the analysis? In the example above, children with IDs 1 and 2 both have the same carer, with an ID of 6. Values are therefore identical for all carer-related variables for children 1 and 2, and I don't wish to double-count that information.

        One way around this would be to simply drop duplicates on caregiver_id_c with the duplicates drop command and to analyse carers in a separate dataset. I imagine this would be OK for analysis at the level of the carer, using the carers' weights (weight_a). But can anyone recommend a better option? I would rather not drop duplicates and lose that data: while carer variables contain the same values for records with the same carer ID, the corresponding child variables will often differ (e.g. the carer with ID 10, who is the carer of both child 6 and child 7).
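
        For concreteness, the separate-dataset route described above might look like the sketch below. It makes two assumptions that are not confirmed anywhere in this thread: that weight_a is the appropriate carer-level sampling weight, and that cluster and province describe the design for carers as well as for children. preserve and restore are used so that the child-level data are not lost.

        ```stata
        * work on a temporary one-row-per-carer copy of the data
        preserve
        * keep one observation per carer; force is required because a varlist is given
        duplicates drop caregiver_id_c, force
        * assumption: weight_a is the carer-level weight
        svyset cluster [pweight=weight_a], strata(province)
        svy: proportion gender_a
        * bring back the full child-level dataset
        restore
        ```

        With the seven-row example above, each province contains a single cluster, so standard errors would be reported as missing; the full survey data should not have that limitation.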


        Many thanks - I hope I've been clear enough.
        Zoheb

        • #5
          I think a simpler way to put the question is: how can I get the effect of duplicates drop without actually dropping observations, by instead tagging one observation in each group of observations that share the same value of a given variable (in this case caregiver_id_c)? If I am calculating the proportion of caregivers who are male, I would then include only one observation from each group with the same caregiver ID. So for rows 6 and 7 (children with caregiver 10), I'd need to count gender_a=1 only once. Is this possible?
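
          One possible way to do this kind of tagging is egen's tag() function, which marks exactly one observation in each group with 1 and all others with 0. The sketch below is not a confirmed answer from this thread: carer_tag is a hypothetical variable name, and treating the tagged rows as a subpopulation under weight_a is an assumption about the survey design that should be checked against the survey documentation.

          ```stata
          * tag one observation per distinct caregiver_id_c (1 = tagged, 0 = not)
          egen byte carer_tag = tag(caregiver_id_c)
          * assumption: weight_a is the carer-level weight
          svyset cluster [pweight=weight_a], strata(province)
          * restrict estimation to the tagged rows without dropping any data
          svy, subpop(carer_tag): proportion gender_a
          ```

          Because nothing is dropped, the child-level variables stay available; svyset would need to be issued again with weight_c before returning to child-level estimates.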
