Problems with duplicates in a panel dataset

Alessandro Pompa

Join Date: Oct 2021
Posts: 6

Problems with duplicates in a panel dataset

03 Oct 2021, 15:17

Hello, I am using Stata 16.0 on Windows 10.

I am dealing with a dataset like this one

Code:

. input id year income

            id       year     income
  1. 9 1 10
  2. 9 1 5
  3. 9 1 7
  4. 9 1 14
  5. 9 1 18
  6. 9 2 11
  7. 9 2 6
  8. 9 2 8
  9. 9 2 15
 10. 9 2 19
 11. 10 1 3
 12. 10 2 4
 13. 11 1 1
 14. 11 1 2
 15. 11 2 4
 16. 11 2 4
 17. 12 1 2
 18. 12 1 3
 19. 12 2 3
 20. 12 2 4 
 21. end

. list

     +--------------------+
     | id   year   income |
     |--------------------|
  1. |  9      1       10 |
  2. |  9      1        5 |
  3. |  9      1        7 |
  4. |  9      1       14 |
  5. |  9      1       18 |
     |--------------------|
  6. |  9      2       11 |
  7. |  9      2        6 |
  8. |  9      2        8 |
  9. |  9      2       15 |
 10. |  9      2       19 |
     |--------------------|
 11. | 10      1        3 |
 12. | 10      2        4 |
 13. | 11      1        1 |
 14. | 11      1        2 |
 15. | 11      2        4 |
     |--------------------|
 16. | 11      2        4 |
 17. | 12      1        2 |
 18. | 12      1        3 |
 19. | 12      2        3 |
 20. | 12      2        4 |
     +--------------------+

. 
end of do-file

.

Just for the sake of simplicity let's say that Id is the family identifier and income is the income of a family member. So when I see these duplicates I am actually seeing different incomes from different members of the family. What I would like to do then, is to build a variable which is equal to the sum of the income of the various family member, so that I build a family income indicator.
The final dataset should look like this :

Code:

    +--------------------+
     | id   year   income |
     |--------------------|
  1. |  9      1       54 |
  2. |  9      2       19 |
  3. | 10      1        3 |
  4. | 10      2        4 |
  5. | 11      1        3 |
     |--------------------|
  6. | 11      2        8 |
  7. | 12      1        5 |
  8. | 12      2        7 |
     +--------------------+

I used this simplified dataset to sketch my problem. However my dataset has around 90k observations, and around 4k duplicates :

Code:

duplicates tag id year, gen(isdup)

Duplicates in terms of id year

. tab isdup

      isdup |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |     93,230       95.93       95.93
          1 |      2,128        2.19       98.12
          2 |        930        0.96       99.08
          3 |        492        0.51       99.58
          4 |        205        0.21       99.79
          5 |        126        0.13       99.92
          7 |         40        0.04       99.97
          9 |         10        0.01       99.98
         10 |         11        0.01       99.99
         12 |         13        0.01      100.00
------------+-----------------------------------
      Total |     97,185      100.00

Thanks in advance

Tags: None

Clyde Schechter

Join Date: Apr 2014
Posts: 30095

03 Oct 2021, 15:30

Code:

collapse (sum) income, by(id year)

Thank you for trying to use -dataex- on your first post. Unfortunately, you didn't quite get it right. Instead of posting the output of -dataex- itself, you posted Stata's output from running the output of -dataex-. Fortunately, your question was simple enough that there was no need to try out the code on example data. But for more complicated questions, the ability to easily import data to develop and test code becomes important. Correct use of -dataex- produces something that looks like this (taken from the first 10 observations in auto.dta):

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str18 make int(price mpg rep78) float headroom int(trunk weight length turn displacement) float gear_ratio byte foreign
"AMC Concord"    4099 22 3 2.5 11 2930 186 40 121 3.58 0
"AMC Pacer"      4749 17 3   3 11 3350 173 40 258 2.53 0
"AMC Spirit"     3799 22 .   3 12 2640 168 35 121 3.08 0
"Buick Century"  4816 20 3 4.5 16 3250 196 40 196 2.93 0
"Buick Electra"  7827 15 4   4 20 4080 222 43 350 2.41 0
"Buick LeSabre"  5788 18 3   4 21 3670 218 43 231 2.73 0
"Buick Opel"     4453 26 .   3 10 2230 170 34 304 2.87 0
"Buick Regal"    5189 20 3   2 16 3280 200 42 196 2.93 0
"Buick Riviera" 10372 16 3 3.5 17 3880 207 43 231 2.93 0
"Buick Skylark"  4082 19 3 3.5 13 3400 200 42 231 3.08 0
end
label values foreign origin
label def origin 0 "Domestic", modify

Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

03 Oct 2021, 16:35

Let me expand slightly on Clyde's example. Below I show a dataex command and its output, copied from Stata's Results window. The text in red was colored by me. Clyde selected all the text that is I did not color in red, copied it, and pasted it into his post. That is how dataex is used.

Code:

. dataex in 1/10

----------------------- copy starting from the next line -----------------------
[CODE]
* Example generated by -dataex-. For more info, type help dataex
clear
input str18 make int(price mpg rep78) float headroom int(trunk weight length turn displacement) float gear_ratio byte foreign
"AMC Concord"    4099 22 3 2.5 11 2930 186 40 121 3.58 0
"AMC Pacer"      4749 17 3   3 11 3350 173 40 258 2.53 0
"AMC Spirit"     3799 22 .   3 12 2640 168 35 121 3.08 0
"Buick Century"  4816 20 3 4.5 16 3250 196 40 196 2.93 0
"Buick Electra"  7827 15 4   4 20 4080 222 43 350 2.41 0
"Buick LeSabre"  5788 18 3   4 21 3670 218 43 231 2.73 0
"Buick Opel"     4453 26 .   3 10 2230 170 34 304 2.87 0
"Buick Regal"    5189 20 3   2 16 3280 200 42 196 2.93 0
"Buick Riviera" 10372 16 3 3.5 17 3880 207 43 231 2.93 0
"Buick Skylark"  4082 19 3 3.5 13 3400 200 42 231 3.08 0
end
label values foreign origin
label def origin 0 "Domestic", modify
[/CODE]
------------------ copy up to and including the previous line ------------------

Listed 10 out of 74 observations

.

Comment

Alessandro Pompa

Join Date: Oct 2021

Posts: 6
#4

04 Oct 2021, 02:48

Hi, thank you Clyde and Williams.
Next time I want to use dataex should I post something like this ?

Code:

clear input id year income 9 1 10 9 1 5 9 1 7 9 1 14 9 1 18 9 2 11 9 2 6 9 2 8 9 2 15 9 2 19 10 1 3 10 2 4 11 1 1 11 1 2 11 2 4 11 2 4 12 1 2 12 1 3 12 2 3 12 2 4 end

Also, if I found a way to expand on duplicates in a panel dataset (e.g. including string variables etc.) should I add it here? Or am I supposed to create a new thread?
Comment

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17707

04 Oct 2021, 03:00

Alessandro:
the way you used -dataex- is absolutely correct and useful for listers interested in your thread:

Code:

. input id year income

            id       year     income
  1.
. 9 1 10
  2.
. 9 1 5
  3.
. 9 1 7
  4.
. 9 1 14
  5.
. 9 1 18
  6.
. 9 2 11
  7.
. 9 2 6
  8.
. 9 2 8
  9.
. 9 2 15
 10.
. 9 2 19
 11.
. 10 1 3
 12.
. 10 2 4
 13.
. 11 1 1
 14.
. 11 1 2
 15.
. 11 2 4
 16.
. 11 2 4
 17.
. 12 1 2
 18.
. 12 1 3
 19.
. 12 2 3
 20.
. 12 2 4
 21.
. end

. list

     +--------------------+
     | id   year   income |
     |--------------------|
  1. |  9      1       10 |
  2. |  9      1        5 |
  3. |  9      1        7 |
  4. |  9      1       14 |
  5. |  9      1       18 |
     |--------------------|
  6. |  9      2       11 |
  7. |  9      2        6 |
  8. |  9      2        8 |
  9. |  9      2       15 |
 10. |  9      2       19 |
     |--------------------|
 11. | 10      1        3 |
 12. | 10      2        4 |
 13. | 11      1        1 |
 14. | 11      1        2 |
 15. | 11      2        4 |
     |--------------------|
 16. | 11      2        4 |
 17. | 12      1        2 |
 18. | 12      1        3 |
 19. | 12      2        3 |
 20. | 12      2        4 |
     +--------------------+

.

Opening a new thread is basically an issue of efficency: if your new inputs/data/resulst relate to the previous query you started, there's no gain in opening a new one.

Kind regards,
Carlo
(Stata 19.0)

Comment

William Lisowski

Join Date: Dec 2014

Posts: 10150
#6

04 Oct 2021, 05:26

Alessandro Pompa -

I almost always agree with Carlos, but not in this case, I am sorry to say.

While the data shown in post #1 and in post #4 was useful to answer your question, it is not the output of the dataex command. For other questions it may misrepresent your data in crucial ways, as Clyde suggests in post #2. And on reflection, I don't think it was, as Clyde suggested, created by running the dataex command, because the input command created by dataex would include variable types (e.g. float) which this data does not include, as in the followng.

Code:

input float(id year income)

Please take a few moments to review the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post. Note especially sections 9-12 on how to best pose your question. It's particularly helpful to use the dataex command to provide sample data, as described in section 12 of the FAQ and demonstrated in posts #2 and #3.
2 likes
Comment
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17707
#7

04 Oct 2021, 05:47

William:
I've probably misunderstood OP's reply #4.
I thought his aim was to test whether the way he used -dataex- was OK for data sharing.

Kind regards,
Carlo
(Stata 19.0)
Comment

Announcement

Problems with duplicates in a panel dataset

Comment

Comment

Comment

Comment

Comment

Comment