Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging individuals and domiciles

    Hi,
    Can somebody please advise me on the best way to merge information I have on individuals and their domiciles? For instance, my analysis will be about the "reference person" in each domicile, and there might be many people living in each domicile. When the data was collected, there were 2 different questionnaires, one about the domicile (location, aggregate income and expenditure, etc.) and the other about the individuals (personal and income and expenditure, height, age, etc.). The original dataset I have access to has it all together, all the information about both domiciles and individuals (approximately 2,5GB and 5 million observations). I do not have access to a unique id, but rather 5 or 6 variables that, together, identify the person/domicile.
    Any advice on how I can prepare the data to get rid of the excess data and keep only the information on the reference person of each domicile as well as all the information about the domicile itself?
    Thanks a lot in advance. Any suggestions will be greatly appreciated.

  • #2
    Welcome to Statalist.

    It is not at all clear how your dataset is organized.

    Does your dataset have one observation for each individual, containing that individuals's questionnaire data and the data from the questionnaire about the domicile? Then in that case you need to identify the "reference person" in each domicile and use the keep command to retain their observations and get rid of the observations from the other individuals in the same domicile. Suppose the variable reference is 1 for the reference person and some other number for the other individuals in the same domicile. Then
    Code:
    keep if reference==1
    would do what you need.

    But if that is not what you have, you will need to be clearer about what it is that you do have. Please take a few moments to review the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post. Note especially sections 9-12 on how to best pose your question.

    The more you help others understand your problem, the more likely others are to be able to help you solve your problem.
    Last edited by William Lisowski; 27 Jan 2019, 17:20.

    Comment


    • #3
      Thank you, William. I am sorry if I did not make myself clear enough. I'll read the FAQ and link you mentioned before trying what you've suggested and especially before posting here again.

      Comment


      • #4
        First of all I'd like to say that code line has helped me. Thank you for that and for trying to understand my problem despite the fact I was unable to express myself properly.

        The reason why the code line helped me was because I was able to find a different data file where the data were arranged in a way similar to that you had imagined.

        Now, even though I've moved on (and in fact am now struggling with different matters), I'd like to try to explain my problem once again because I think it's something potentially useful to know and it might help others who eventually come across this thread in the future.

        Let's say I only have access to two variables: SEX and QTY_BATH (quantity of bathrooms in the domicile). Sex is clearly a variable that relates only to individuals, not domiciles; whereas QTY_BATH only sheds light on domiciles, not on individuals. Also, as soon as I open the data file, Stata says there are 223 variables and 4,939,885 observations.

        When I type (on Windows 10 using Stata 14):
        Code:
        sum SEX
        I get:

        Variable | Obs Mean Std. Dev. Min Max
        -------------+---------------------------------------------------------
        SEX | 190,159 1.510015 .499901 1 2

        That is, I have about 190,000 observations on individuals. Similarly, when I type:
        Code:
        sum QTY_BATH
        I get:

        Variable | Obs Mean Std. Dev. Min Max
        -------------+---------------------------------------------------------
        QTY_BATH | 55,970 1.230749 .6946201 0 9

        Thus I conclude I have information on about 56,000 domiciles.

        Then, if I use the suggested code (keep if reference==1), Stata deletes 4,883,794 observations and when I run the same commands again I get:

        Variable | Obs Mean Std. Dev. Min Max
        -------------+---------------------------------------------------------
        SEX | 56,091 1.305236 .4605115 1 2


        Variable | Obs Mean Std. Dev. Min Max
        -------------+---------------------------------------------------------
        QTY_BATH | 0

        Which in my head makes sense: I have lost all the information on the domiciles.

        So, trying my best, my original question could be rephrased as: how could I get rid of all the excess information (millions of observations) without losing important information on the domiciles? Would it be possible and make logical sense to sort of "match" the reference person in each domicile to all the information on that domicile? So that when I dropped all the other people in the domicile I would still be able to access the information on the domicile (as well as on the reference person, obviously)?

        Thank you (all) for taking the time to read and trying to help.

        Comment


        • #5
          Hi,

          Have you tried this command?
          Code:
          drop if QTY_BATH==.
          And yes, it is possible to match info of individual and domicile if you have ID. It is impossible if you "only have access to 2 var: SEX and QTY_BATH"
          Last edited by Leon Nguyen; 29 Jan 2019, 00:07.

          Comment


          • #6
            I cannot imagine what your dataset is like. You apparently have 55,970 domiciles (assuming there are no domiciles for which QTY_BATH is missing) and 190,159 individuals (assuming there are no individuals for which SEX is missing). I cannot imagine how those were combined to produce a dataset with 4,939,885 observations. And really, I have a very vivid imagination.

            In no case would I work with such a dataset. I would, as you did, discard it and find something more suitable.

            In general I would expect to start with two datasets:
            • 55,970 domiciles, with a set of variables that distinctly identify each domicile
            • 190,159 individuals, with a set of variables that identify the domicile associated with the individual, and a set of variables that when taken together with the domicile identification identifies the individual.
            I would merge using the domicile identifiers common to the two datasets, achieving a dataset of 190,159 observations of individuals with the variables for their domiciles now included as part of the individual's observation. This is what I described in post #2, and it appears to be what you have found.

            Comment


            • #7
              Thank you, Leon. Unfortunately, typing the command you suggested makes me lose the information on all individuals and retain only the information on domiciles.

              Thank you, William, for another enlightening answer.

              Comment

              Working...
              X