Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Creating a variable without duplicates

    Dear all,
    I have a variable (household No.) that assigns an id number to each household. Within each household I have 2 people interviewed. The data is in long format. Hence, each Household No. is repeated twice.

    Household No. Respondent Number
    1 1
    1 2
    2 1
    2 2
    3 1
    3 2
    4 1
    4 2

    I would like to see how the households are distributed across geographical areas and other aspects. In order to do this, I need each household to be listed only once.

    How can I create a variable with includes the household ID only once and not twice?

    Or do you have other suggestions on how to go about it?

    Trying to change the data to wide format gives me the following error:

    values of variable RespondentNO not unique within HouseholdNo
    Your data are currently long. You are performing a reshape
    wide. You specified i(HouseholdNo) and j(RespondentNO).
    There are observations within i(HouseholdNo) with the same
    value of j(RespondentNO). In the long data, variables i()
    and j() together must uniquely identify the observations.


    Thank you for your help.

  • #2
    since this problem does not exist in the sample data you show (please read the FAQ and use -dataex-), no one can be certain but the following is one way (note that this may impede other goals of your project, none of which you have shared with us):
    Code:
    keep if resp_no==1
    duplicates drop
    note that I called your variable "resp_no" - please change that to the "real" name

    clearly there is something incorrect about either your data or your understanding of the data so be very careful; you might want to check things:
    Code:
    ta resp_no
    duplicates report house_no resp_no
    and other checks you can probably think of given your knowledge of your data
    Last edited by Rich Goldstein; 10 Sep 2018, 05:40.

    Comment


    • #3
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input int HouseholdNo float RespondentNO
        2 1
        2 2
        6 1
        6 2
        7 1
        7 2
       12 1
       12 2
       15 1
       15 2
       21 1
       21 2
       24 1
       24 2
       31 1
       31 2
       32 1
       32 2
       33 1
       33 2
       35 1
       35 2
       36 1
       36 2
       37 1
       37 2
       43 1
       43 2
       44 1
       44 2
       45 1
       45 2
       47 1
       47 2
       51 1
       51 2
       51 2
       53 1
       53 2
       56 1
       56 2
       57 1
       57 2
       62 1
       62 2
       63 1
       63 2
       65 1
       65 2
       71 1
       71 2
       72 1
       72 2
       73 1
       73 2
       77 1
       77 2
       81 1
       81 2
       87 1
       87 2
       94 1
       94 2
      102 1
      102 2
      106 1
      106 2
      112 1
      112 2
      113 1
      113 2
      115 1
      115 2
      121 1
      121 2
      122 1
      122 2
      125 1
      125 2
      127 1
      127 2
      131 1
      131 2
      133 1
      133 2
      141 1
      141 2
      142 1
      147 1
      147 2
      151 1
      151 2
      152 1
      152 2
      153 1
      153 2
      154 1
      154 2
      155 1
      155 2
      end

      Comment


      • #4
        Thank you.

        Above is the data generated with dataex.

        I do not notice the presence of the issue described in the error message stata gives me when converting from long to wide format.

        Also, I would not like to drop the data which are duplicates, because in fact they have information on different interviewees that I will need later on. I just would like to create a new variable that has only one household number per household and possibly a missing for the other, so that when I explore the distribution of households across clusters I only consider each household one time using "if HouseholdNo==." for example.

        Thank you.

        Comment


        • #5
          If I understood correctly, you may need something like:

          Code:
          . duplicates report
          
          Duplicates in terms of all variables
          
          --------------------------------------
             copies | observations       surplus
          ----------+---------------------------
                  1 |           98             0
                  2 |            2             1
          --------------------------------------
          
          . duplicates list
          
          Duplicates in terms of all variables
          
            +----------------------------+
            | obs:   Househ~o   Respon~O |
            |----------------------------|
            |   36         51          2 |
            |   37         51          2 |
            +----------------------------+
          
           
          . duplicates tag, gen(dup)
          
          Duplicates in terms of all variables
          
          . tab dup
          
                  dup |      Freq.     Percent        Cum.
          ------------+-----------------------------------
                    0 |         98       98.00       98.00
                    1 |          2        2.00      100.00
          ------------+-----------------------------------
                Total |        100      100.00
          
          . duplicates drop
          
          Duplicates in terms of all variables
          
          (1 observation deleted)
          
          . tab dup
          
                  dup |      Freq.     Percent        Cum.
          ------------+-----------------------------------
                    0 |         98       98.99       98.99
                    1 |          1        1.01      100.00
          ------------+-----------------------------------
                Total |         99      100.00
          
          . duplicates report
          
          Duplicates in terms of all variables
          
          --------------------------------------
             copies | observations       surplus
          ----------+---------------------------
                  1 |           99             0
          --------------------------------------
          Actually, you just need - duplicates drop - or - duplilcates tag - commands, but I thought it would be interesting to show you several possibilities to tackle the issue.
          Best regards,

          Marcos

          Comment


          • #6
            Unless I'm missing something, you do not need to drop observations or to reshape to wide to do this. Just bless one household member to represent the household and tabulate as needed.

            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input int HouseholdNo float RespondentNO str2 state
             2 1 "MI"
             2 2 "MI"
             6 1 "CA"
             6 2 "CA"
             7 1 "CA"
             7 2 "CA"
            12 1 "MA"
            12 2 "MA"
            15 1 "MI"
            15 2 "MI"
            end
            
            * verify that observations are uniquely identified; leave the data sorted
            isid HouseholdNo RespondentNO, sort
            
            * tag the first obs per household; it will represent the household
            by HouseholdNo: gen hhtag = _n == 1
            
            * make sure that state is a household-level variable
            by HouseholdNo: assert state == state[1]
            
            * tabulate households per state
            tab state if hhtag
            and the results:
            Code:
            . tab state if hhtag
            
                  state |      Freq.     Percent        Cum.
            ------------+-----------------------------------
                     CA |          2       40.00       40.00
                     MA |          1       20.00       60.00
                     MI |          2       40.00      100.00
            ------------+-----------------------------------
                  Total |          5      100.00
            
            .

            Comment


            • #7
              Thank you very much. What you stated, Robert, is exactly what I need to do.

              Unfortunately through, STATA reports the error that variables HouseholdNo RespondentNO do not uniquely identify the observations. I thought they did. How can I go about checking and resolving this? Thanks again.

              Comment

              Working...
              X