Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging Cross-sections of a Panel Data Set

    Hello everyone,

    This is my first post on the forum, so I hope that I will be able to provide all the information necessary to answer my question. Please bear with me if this is not the case.

    I am currently trying to merge a cross-section of a panel data set called the Indonesian Family Life Survey (https://www.rand.org/well-being/soci.../FLS/IFLS.html). The original data set contains different dta.files on different questions of the survey. The data may have different levels of observations, e.g. the individual, the household, physical activity of an individual, individual interview time, etc. Accordingly, there are different identifiers for each level of observation. The number of identifiers may also change between different levels of observations. For example, the household only has one identifier (hhid14), whereas the individual has two identifiers for one unique observation (hhid14, pid14). Most other variables have 3 different identifiers for one unique observation, e.g. for individual interview time, it's hhid14, pid14, and time_occ. However, the third identifier may not the be same for different levels of observations. For an individual's physical activity, we would they would be hhid14, pid14, and kktype. All observations have hhid14 as ONE identifier.

    My goal is to have a fully merged data set that contains all questions of the cross-section. I have tried to use the merge command but I am not quite which dta.file I should use as a master file or which option (1:1, 1:m, m:1, m:m) is correct for the different levels of observation. I am wary of the m:m option, but I think it may actually be the appropriate choice here.

    If I use a dataset as a master file where the household is the level of observation (and hence hhid14 the unique ID), I am not quite sure what my code should look like.

    Here an example of my code:

    merge 1:1 hhid14 using b3a_cov, nogen (household-level data, only one unique identifiers)

    merge 1:m hhid14 using b3a_dl1, nogen (individual-level data, not quite sure how to tell STATA that pid14 is the other ID)

    merge 1:m pid14 hhid14 using b3a_dl2, nogen (three unique identifiers)

    merge m:m pid14 hhid14 using b3a_dl3, nogen (three unique identifiers, but a different one from the line before. Since I already have three different IDs in the data set, I use m:m)

    I am not quite sure if this is the correct approach as I am pretty new to merging data sets. Would anyone be willing to help me out here?

    Thank you for your time.

    Best,

    Jérôme

  • #2
    merge 1:1 hhid14 using b3a_cov, nogen (household-level data, only one unique identifiers)
    So far, so good.

    merge 1:m hhid14 using b3a_dl1, nogen (individual-level data, not quite sure how to tell STATA that pid14 is the other ID)
    No need to specify the other ID at this point since you are merging 1:m; the second identifier in ba3_dl1 is not relevant at this point. This command is fine as written.

    merge 1:m pid14 hhid14 using b3a_dl2, nogen (three unique identifiers)
    Again, since there is a 1:m match and the combination of pid14 and hhid14 uniquely identify the observations in memory, the third identifier in b3a_dl2 is of no concern. For reference in the discussion below, let's call the data set after this -merge- the "incompletely merged data set."

    merge m:m pid14 hhid14 using b3a_dl3, nogen (three unique identifiers, but a different one from the line before. Since I already have three different IDs in the data set, I use m:m)
    ABSOLUTELY NOT! The m:m match will just pair each observation in memory with some arbitrarily chosen observation from b3a_dl3: it will take into account no other variables in doing this and the resulting pairing is meaningless. It's data salad. It can't be right.

    So you have to figure out what is going on in this data so you can create a meaningful pairing. I have several questions about the data:

    1. It is possible that pid14 and hhid14 still uniquely identify the observations in the incompletely merged data set even though there is also a third variable that could be used as an identifier. You can check that by running -isid pid14 hhid14- on that data. If Stata gives you no output in response, then they do uniquely identify, and you can proceed to -merge 1:1 pid14 hhid14 using b3a_dl3-.

    2. It is possible that pid14 and hhid14 uniquely identify observations in ba3_dl3. Try -isid pid14 hhid14- on this data. Again, if you get no output from that, you have unique identification by those two variables and you can proceed with -merge m:1 pid14 hhid14 using b3a_dl3-.

    3a. If neither of those works, then you need to get a deeper understanding of your data. You want to pair up observations in the incompletely merged data set with observations in b3a_dl3. You want the paired observations to have the same values of pid14 and hhid14. But for each observation in the incompletely merged data set there are many choices of observation in b3a_dl3. So, imagine that you were going to do this by hand. (Think as if you had printouts of both data sets and you would do the pairing by drawing lines connecting the appropriate pairs of data sets.) How would you go about doing that? Is there information in these data sets that allows you to pick out the right match(es) for each observation? What variables give you that information? Most likely the solution will then by to include some or all of those variables as part of the merge key variable list and proceed with 1:1, 1:m, or m:1 merging.

    3b. Now, an alternative possibility is that there is no such information and that what you really want to do is pair every observation in the incompletely merged data set having given values of pid14 and hhid14 with every observation in b3a_dl3 that has the same values of pid14 and hhid14. That is legitimate (though the resulting data set may be huge). If that's the case, then this task is not accomplished with -merge-; this is done with the -joinby- command. -joinby pid14 hhid14 using b3a_dl3-. Important: read -help joinby- before using this, paying careful attention to how to set the -unmatched()- option to get the results you need.

    Comment

    Working...
    X