Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging to cross sections to create panel dataset

    Hi all,

    I have a question regarding the merging of two cross sectional datasets. I am using a survey which was carried out in 2006 and again in 2012. Approximately 50% of those interviewed in the first year were reinterviewed in the second year the survey was done. I am planning to do some difference-in-difference analysis to basically compare the individual in 06 to himself in 12 and how changes in that person’s resources affects my dependent variable (a score predicted using factor analysis). However, I am not sure if I am correctly merging the data to create a panel.

    In the cross section data, one variable uniquely identify each person: the individual ID.

    These ID's are not the same in the 2006 and 2012 cross sections. In the merged dataset I have two variables: indid and indid_06. If a person has a value for both, then they were interviewed in 2012 and 2006. If they don’t have a value for indid_06 then they were only interviewed in 2012. Please see below using dataex:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input double indid long indid_06 float year byte(urban sex)
    12010409102         . 2012 0 2
    12010550102 601072302 2012 0 2
    12011172102 621546503 2012 0 2
    12011173102 601016704 2012 0 2
    12011025102 601089002 2012 0 2
    12011176102 601024503 2012 0 2
    12011070102         . 2012 0 2
    12011180102 624655703 2012 0 2
    12011184102         . 2012 0 2
    12010046102 601079402 2012 0 2
    12010886102         . 2012 0 2
    12010755102         . 2012 0 2
    12010882102 601103802 2012 0 2
    12010516102 601073202 2012 0 2
    12011127102 612233903 2012 0 2
    12010623102 601055605 2012 0 2
    12010481102 601100102 2012 0 2
    12010830102 601013702 2012 0 2
    12010567102         . 2012 0 2
    12010353102 601033103 2012 0 2
    12011008102 601061602 2012 0 2
    12010665102         . 2012 0 2
    12010329102 601105602 2012 0 2
    12010870102         . 2012 0 2
    12010868102         . 2012 0 2
    12011171102 622588912 2012 0 2
    12011032102 601004802 2012 0 2
    12010477102 601073702 2012 0 2
    12011035102 601004603 2012 0 2
    12010936102         . 2012 0 2
    12011157102 601108503 2012 0 2
    12010473102 601027302 2012 0 2
    12010855102 601011902 2012 0 2
    12011153102 601012503 2012 0 2
    12011040102 601088102 2012 0 2
    12010591102 601071502 2012 0 2
    12011143102         . 2012 0 2
    12010031102 621569503 2012 0 2
    12010443102 601101202 2012 0 2
    12010179102 601056502 2012 0 2
    12010445102 601101002 2012 0 2
    12011140102         . 2012 0 2
    12010896102         . 2012 0 2
    12011129102 601012406 2012 0 2
    12010721102 601095102 2012 0 2
    12010931102         . 2012 0 2
    12011124102 601002103 2012 0 2
    12011046102 601003802 2012 0 2
    12010606102 601018902 2012 0 2
    12011110102 601085001 2012 0 2
    12010933102         . 2012 0 2
    12011092102 601086602 2012 0 2
    12010461102         . 2012 0 2
    12010492102 601098402 2012 0 2
    12011083102 601002502 2012 0 2
    12011076102         . 2012 0 2
    12010319102 601107202 2012 0 2
    12010866102 601012404 2012 0 2
    12010614102 601070402 2012 0 2
    12010750102         . 2012 0 2
    12011112102 601084502 2012 0 2
    12010556102         . 2012 0 2
    12010616102 601070202 2012 0 2
    12010414102 601030402 2012 0 2
    12010456102         . 2012 0 2
    12010865102         . 2012 0 2
    12010208102 601041802 2012 0 2
    12010619102 601019304 2012 0 2
    12010527102 601023802 2012 0 2
    12010280102 601110704 2012 0 2
    12010279102 601110706 2012 0 2
    12010277102 601075801 2012 0 2
    12010621102         . 2012 0 2
    12010125102 601112702 2012 0 2
    12010416102 601030202 2012 0 2
    12011037102 601088302 2012 0 2
    12010014102 601054402 2012 0 2
    12010553102 601097205 2012 0 2
    12010088102 601047802 2012 0 2
    12011017102         . 2012 0 2
    12010565102         . 2012 0 2
    12010108102         . 2012 0 2
    12010004102         . 2012 0 2
    12010563102 601097202 2012 0 2
    12010667102         . 2012 0 2
    12010971102         . 2012 0 2
    12010961102 601062702 2012 0 2
    12010105102 601046502 2012 0 2
    12010656102         . 2012 0 2
    12010564102 601097203 2012 0 2
    12010862102 601065402 2012 0 2
    12010784102         . 2012 0 2
    12011156102 601099405 2012 0 2
    12010169102 601044203 2012 0 2
    12010482102 601100002 2012 0 2
    12010261102 601037302 2012 0 2
    12010736102         . 2012 0 2
    12010985102 601090602 2012 0 2
    12010197102         . 2012 0 2
    12010807102         . 2012 0 2
    end
    label values urban urban
    label values sex q101
    label def q101 2 "female", modify


    The command I used was:

    Code:
    append using "C:\Users\Shellfile\Dropbox\Data\cross section 2006.dta"

    With the master data being the 12 cross section. My question is, is there a way to check that I have merged the data correctly? Am I wrong to be using append instead of the merge command?

    I am relatively new to stata so any help is appreciated.
    Thanks.

    Last edited by Sherine Maui; 10 Jul 2018, 06:18.

  • #2
    I'm confused. First you say you used -append- and then you ask whether you -merge-d the data correctly? Which was it, -append- or -merge-?

    This kind of data where a different ID is used for the same person in different waves of a survey is always a bit confusing to work with. I don't really understand why people create data sets in this way. But it is a fact of life that we have to deal with.

    In general, when, as here, you have each wave in a separate data set, and more or less the same variables are measured in both waves, then -append- is the preferred approach. It directly stacks one wave's data "on top" of the other and leaves you with a long-layout data set that is well arranged for data analysis in Stata. (Using -merge- would be difficult to do in the first place because of the clash of ID variables, and would require renaming the variables in one of the data sets so that both wave's values would be included. And, after going to all that unnecessary trouble, you would be left with a wide layout that makes analysis difficult to impossible, so you would then have to -reshape- it.)

    The major issue face having appended these data sets is the two IDs. You are fortunate that each data set apparently contains both IDs for those people who were in both waves. (Most data sets do not have this and you have to then use a third data set that crosswalks the id variables to resolve the inconsistency.) So what I would do from here is:

    Code:
    drop if missing( indid, indid06) 
    by indid (indid06), sort: assert indid06[1] == indid06[_N]
    by indid06 (indid), sort: assert indid[1] == indid[_N]
    by indid06 indid: assert _N == 2
    The first line eliminates any observation that is missing either of the id variables. Otherwise put, we are retaining only those who were included in both waves and have an id in both waves. The second and third lines verify that there is a one to one correspondence between indid and indid06, i.e. we are checking for inconsistencies in which value of indid06 go with any value of indid and vice versa. Finally, the last line verifies that there are two observations for each person, no more and no less.

    If your data passes this test, you can be confident that you have correctly combined the two waves into a single longitudinal data set. If the tests fail, it is, in my experience, far more likely to be a problem with the data than with the way in which you have combined them (assuming you actually did the -append- as you showed it.) Even data sets curated by the best in the business often contain errors. If these tests fail, you will have probably uncovered some such errors and you will have to consult the documentation that came with the data, or possibly contact the data supplier(s) directly to resolve them.

    Comment


    • #3
      Thank you for the quick response Clyde. Apologies about the confusion as I was initially confused between the two concepts or merging and appending datasets. I have appended the data sets using the code provided earlier, however, after running the code provided it returns the following error just below:

      Code:
      by indid06 indid: assert _N == 2
      "4,947 contradictions in 4,947 by-groups
      assertion is false
      r(9);"

      where 4,947 is essentially my whole sample. Following from what you said, is it then a case of being an issue with the dataset itself?

      Comment


      • #4
        Probably. So the first thing I would do is figure out whether this assertion is failing because there are duplicated observations, or whether there are just singletons.

        Code:
        by indid06 indid, sort: gen group_count = _N if _n == 1
        tab group_count
        and it will tell you how many indid06 indid pairs are represented once, three times, four times, etc. If you find that every group is represented only once, then it means that the id numbers in the two data sets don't overlap, suggesting some inconsistency in the coding of the id variables. If you find that you have groups represented more than 2 times, then there must be some duplicate observations in one or both data sets. In that case:

        Code:
        use name_of_2012_dataset, clear
        drop if missing(indid06, indid)
        duplicates tag indid06 indid, gen(flag)
        browse if flag
        will show you duplicate observations in the 2012 data set. Then do the same for the 2006 data set. Once you have identified the observations, you will have to figure out why they are there and what, if anything, to do about them.

        Comment


        • #5
          Clyde thank you so much, you've been of great help I appreciate you taking the time to write out the code.

          Comment

          Working...
          X