Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to create unique identifier for two datasets for merging

    Dear all,
    I have two datasets (A &B) from two sections (A &B) of the same survey (Ghana Living Standard Survey, GLSS 6). Dataset A is on FINANCIAL SERVICES of households while dataset B is on AS SETS AND DURABLE CONSUMER GOODS of these households
    But each of this datasets has no unique identifier. I want to know the proportion of households owing various assets and consumer durables by locality.
    1. Please How do I create ID that uniquely identify observations for the these two datasets so as merge them?

    2. Please help me understand what could possibly went wrong if you do crosstabs and some of your observations dropped, is it from dataset or stata command issues?
    for example

    tab loc5 (where var loc5 means various levels of locality)

    loc5 Freq. Percent
    ---------------+-----------------------------------
    Accra (GAMA) | 1,697 10.12
    Other Urban | 5,748 34.27
    Rural Coastal | 1,156 6.89
    Rural Forest | 3,863 23.03
    Rural Savannah | 4,308 25.69
    ---------------+-----------------------------------
    Total | 16,772 100.00


    tab s12aq16 (where var s12aq16 means do you have an insurance policy?

    s12aq16 Freq. Percent
    ------------+-----------------------------------
    yes | 13,822 22.54
    no | 47,500 77.46
    ------------+-----------------------------------
    Total | 61,322 100.00

    And then I do crosstabs of loc5 by s12aq16 or the other way round but some of my observations dropped. Such as below.
    tab loc5 s12aq16



    | s12aq16
    Loc5 yes no | Total
    ---------------+----------------------+----------
    Accra (GAMA) | 553 1,141 | 1,694
    Other Urban | 2,129 3,608 | 5,737
    Rural Coastal | 237 918 | 1,155
    Rural Forest | 873 2,989 | 3,862
    Rural Savannah | 589 3,708 | 4,297
    ---------------+----------------------+----------
    Total | 4,381 12,364 16,745

    Ideally the minimum total should have been 16,772 and not 16,745.
    So what went wrong and 27 observations were dropped?
    Please sort me out.

  • #2
    #2 Look out for missing values. tabulate has an option to show them as such.

    Comment


    • #3
      sorry the tables were not properly displayed it should rather look at these:

      tab loc5 (where var loc5 means various levels of locality)
      loc5 Freq. . Percent
      Accra (GAMA) 1,697 10.12
      Other Urban 5,748 34.27
      Rural Coastal 1,156 6.89
      Rural Forest 3,863 23.03
      Rural Savannah 4,308 25.69
      Total 16,772 100.00










      tab s12aq16 (where var s12aq16 means do you have an insurance policy?
      s12aq16 Freq. . Percent
      yes 13,822 22.54
      no 47,500 77.46
      Total 61,322 100






      And then I do crosstabs of loc5 by s12aq16 or the other way round but some of my observations dropped. Such as below.
      tab loc5 s12aq16
      loc5 yes no Total
      Accra (GAMA) 1,697 10.12 1,694
      Other Urban 5,748 34.27 5,737
      Rural Coastal 1,156 6.89 1,155
      Rural Forest 3,863 23.03 3,862
      Rural Savannah 4,308 25.69 4,297
      Total 4,381 12,364 16,745










      Ideally the minimum total should have been 16,772 and not 16,745.
      So what went wrong and 27 observations are dropped?
      Please sort me out

      Comment


      • #4
        How do I create ID that uniquely identify observations for the these two datasets so as merge them?
        This may or may not be possible. One possibility is that the two data sets come with corresponding observations in the same order. This is unusual, but the documentation that accompanied the survey data would explain this if it is so. In that case -merge 1:1 _n- will do the job.

        Another possibility is that although there is no single identifying variable in either data set, there is some combination of variables, common to both datasets, that together uniquely identify respondents. There might be, for example, a household identifier variable in combination with another variable that identifies individual people within the household. In a situation like that you could use -merge 1:1 household_id individual_id- to combine the data sets.

        If neither of those situations obtains, then you cannot merge them. The best you could hope to do is an approximate merge. Even this would require a set of common variables that almost identify individual respondents. For example, the combination of a household identifier and date of birth would usually, but not always, identify individuals (but would fail to distinguish two household members with the same date of birth.) In that case you could combine the data sets using those variables, but you would not have certainty that all of your pairings are correct.

        If you don't even have a set of common variables that come close to identifying the individuals, then I don't think there is anything at all you can do.

        Comment


        • #5
          HI,

          (2) has been answered already: It is tabulate's option ", missing" that can be used to include all observations in the tables.

          Regarding your question (1):
          According to the online documentation of the named survey, both the dataset on ASSETS AND DURABLE CONSUMER GOODS and the dataset on FINANCIAL SERVICES contain a variable "HID", the unique household number. This is your primary merging variable, and it will allow you to join your information together.

          However, I assume what irritates you is the fact that, inside of each of both data files, this HID variable is not uniquely identifying observations; without having access to example data from both files (by the way: please read the FAQ, especially the section about how to post example data using -dataex-, to enable us giving the best possible answers to your question) it is hard to tell, but my assumption is that each file contains more than one observation per household because it lists several assets (or financial services, respectively) in a long data format.

          Thus, the solution would be to reduce each file to the information you're interested in (like, for instance, the total value of all assets per household), so that each datafile is shrunk to one observation per household, before merging them.

          Regards
          Bela

          Comment


          • #6
            Thank you all for your tremendous suggestions on unique ID problem. I has been able to successful merge the datasets using your suggestions.
            Kindly permit me to share this with you.
            First, I created an ID unique to both datasets and then merged based on this unique ID..
            The second option which equally worked perfectly was that, I realized there was a particular ID (i.e., person id) common to both datasets but labeled differently (i.e., one with lower case and the other with upper case). so the simplest thing I did was to change the labeling such that it suits both datasets. After this then I merged the two datasets. To be sure I did the right thing, I replicated a couple of 10 tables of interest in the Main report (GLSS 6 main report). Fortunately, there were consistencies indicating that I am at least save with the merging.
            Once again, thank you all.

            Comment

            Working...
            X