Creating a variable without duplicates

Allie Sun

Join Date: Jul 2018

Posts: 16
#1

Creating a variable without duplicates

10 Sep 2018, 05:35

Dear all,
I have a variable (household No.) that assigns an id number to each household. Within each household I have 2 people interviewed. The data is in long format. Hence, each Household No. is repeated twice.

Household No. Respondent Number
1 1
1 2
2 1
2 2
3 1
3 2
4 1
4 2

I would like to see how the households are distributed across geographical areas and other aspects. In order to do this, I need each household to be listed only once.

How can I create a variable with includes the household ID only once and not twice?

Or do you have other suggestions on how to go about it?

Trying to change the data to wide format gives me the following error:

values of variable RespondentNO not unique within HouseholdNo
Your data are currently long. You are performing a reshape
wide. You specified i(HouseholdNo) and j(RespondentNO).
There are observations within i(HouseholdNo) with the same
value of j(RespondentNO). In the long data, variables i()
and j() together must uniquely identify the observations.

Thank you for your help.
Tags: None
Rich Goldstein

Join Date: Mar 2014

Posts: 4459
#2

10 Sep 2018, 05:38

since this problem does not exist in the sample data you show (please read the FAQ and use -dataex-), no one can be certain but the following is one way (note that this may impede other goals of your project, none of which you have shared with us):

Code:

keep if resp_no==1 duplicates drop

note that I called your variable "resp_no" - please change that to the "real" name

clearly there is something incorrect about either your data or your understanding of the data so be very careful; you might want to check things:

Code:

ta resp_no duplicates report house_no resp_no

and other checks you can probably think of given your knowledge of your data

Last edited by Rich Goldstein; 10 Sep 2018, 05:40.
Comment

Allie Sun

Join Date: Jul 2018
Posts: 16

10 Sep 2018, 08:26

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int HouseholdNo float RespondentNO
  2 1
  2 2
  6 1
  6 2
  7 1
  7 2
 12 1
 12 2
 15 1
 15 2
 21 1
 21 2
 24 1
 24 2
 31 1
 31 2
 32 1
 32 2
 33 1
 33 2
 35 1
 35 2
 36 1
 36 2
 37 1
 37 2
 43 1
 43 2
 44 1
 44 2
 45 1
 45 2
 47 1
 47 2
 51 1
 51 2
 51 2
 53 1
 53 2
 56 1
 56 2
 57 1
 57 2
 62 1
 62 2
 63 1
 63 2
 65 1
 65 2
 71 1
 71 2
 72 1
 72 2
 73 1
 73 2
 77 1
 77 2
 81 1
 81 2
 87 1
 87 2
 94 1
 94 2
102 1
102 2
106 1
106 2
112 1
112 2
113 1
113 2
115 1
115 2
121 1
121 2
122 1
122 2
125 1
125 2
127 1
127 2
131 1
131 2
133 1
133 2
141 1
141 2
142 1
147 1
147 2
151 1
151 2
152 1
152 2
153 1
153 2
154 1
154 2
155 1
155 2
end

Comment

Allie Sun

Join Date: Jul 2018

Posts: 16
#4

10 Sep 2018, 08:34

Thank you.

Above is the data generated with dataex.

I do not notice the presence of the issue described in the error message stata gives me when converting from long to wide format.

Also, I would not like to drop the data which are duplicates, because in fact they have information on different interviewees that I will need later on. I just would like to create a new variable that has only one household number per household and possibly a missing for the other, so that when I explore the distribution of households across clusters I only consider each household one time using "if HouseholdNo==." for example.

Thank you.
Comment

Marcos Almeida

Join Date: Apr 2014
Posts: 4047

10 Sep 2018, 09:04

If I understood correctly, you may need something like:

Code:

. duplicates report

Duplicates in terms of all variables

--------------------------------------
   copies | observations       surplus
----------+---------------------------
        1 |           98             0
        2 |            2             1
--------------------------------------

. duplicates list

Duplicates in terms of all variables

  +----------------------------+
  | obs:   Househ~o   Respon~O |
  |----------------------------|
  |   36         51          2 |
  |   37         51          2 |
  +----------------------------+

 
. duplicates tag, gen(dup)

Duplicates in terms of all variables

. tab dup

        dup |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         98       98.00       98.00
          1 |          2        2.00      100.00
------------+-----------------------------------
      Total |        100      100.00

. duplicates drop

Duplicates in terms of all variables

(1 observation deleted)

. tab dup

        dup |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         98       98.99       98.99
          1 |          1        1.01      100.00
------------+-----------------------------------
      Total |         99      100.00

. duplicates report

Duplicates in terms of all variables

--------------------------------------
   copies | observations       surplus
----------+---------------------------
        1 |           99             0
--------------------------------------

Actually, you just need - duplicates drop - or - duplilcates tag - commands, but I thought it would be interesting to show you several possibilities to tackle the issue.

Best regards,

Marcos

Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

10 Sep 2018, 13:04

Unless I'm missing something, you do not need to drop observations or to reshape to wide to do this. Just bless one household member to represent the household and tabulate as needed.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int HouseholdNo float RespondentNO str2 state
 2 1 "MI"
 2 2 "MI"
 6 1 "CA"
 6 2 "CA"
 7 1 "CA"
 7 2 "CA"
12 1 "MA"
12 2 "MA"
15 1 "MI"
15 2 "MI"
end

* verify that observations are uniquely identified; leave the data sorted
isid HouseholdNo RespondentNO, sort

* tag the first obs per household; it will represent the household
by HouseholdNo: gen hhtag = _n == 1

* make sure that state is a household-level variable
by HouseholdNo: assert state == state[1]

* tabulate households per state
tab state if hhtag

and the results:

Code:

. tab state if hhtag

      state |      Freq.     Percent        Cum.
------------+-----------------------------------
         CA |          2       40.00       40.00
         MA |          1       20.00       60.00
         MI |          2       40.00      100.00
------------+-----------------------------------
      Total |          5      100.00

.

Comment

Allie Sun

Join Date: Jul 2018

Posts: 16
#7

13 Sep 2018, 02:51

Thank you very much. What you stated, Robert, is exactly what I need to do.

Unfortunately through, STATA reports the error that variables HouseholdNo RespondentNO do not uniquely identify the observations. I thought they did. How can I go about checking and resolving this? Thanks again.
Comment

Announcement