  • Panel data - matching observations from an unsorted dataset

    Dear all,

    I have a panel dataset, No.1, with multiple variables, including time, personal id, education level, training, etc. I now need to add another variable - ethnicity - which has to be imported from dataset No.2 and matched to the original dataset No.1 by personal id. The problem is that the original dataset has been "cleaned up": missing observations were dropped and some variables were winsorised. As a result, the numbers of observations in datasets No.1 and No.2 no longer match, so I cannot simply import the data.

    I need to merge the two datasets. Is there a command that matches observations by personal id, so that I can drop all observations that were not used in dataset No.1? Thank you in advance!

  • #2
    This is what the -merge- command does. But precisely how to use it depends on the details of your data sets. Does the first data set contain just one observation for each person, or can there be many? The same question applies to the second data set. From your description, I would guess that the first can contain many observations per person (differing on the time variable) but the second contains only one per person. If so, the code would be:

    Code:
    use dataset_1, clear
    merge m:1 using dataset_2
    Added: Since I'm only guessing about your data organization, this code may not work on your real data. If that happens, please post back and use the -dataex- command to show example data from both data sets. Be sure to pick examples that include at least some observations that should match across the two data sets.

    If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    When asking for help with code, always show example data. When showing example data, always use -dataex-.
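
    For instance, a minimal sketch of that workflow (the variable names below are placeholders, not necessarily the variables in your data):

    Code:
    * Install -dataex- if it is not already part of your Stata installation
    capture which dataex
    if _rc ssc install dataex

    * Post an extract of the relevant variables; id/time/educ are placeholder names
    dataex id time educ in 1/20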

    Last edited by Clyde Schechter; 25 Mar 2022, 15:49.

    Comment


    • #3
      Dear Clyde Schechter, thank you for your advice! I tried what you recommended and received an error:

      Code:
      use "/Users/guest/Desktop/78_N.dta", clear
      merge m:1 using "/Users/guest/Desktop/78_new.dta"
      varlist must be specified after merge m:1
      you must specify the key variables after merge m:1
      r(198);
      When I specified 'id' as the key variable, another error appeared: "variable _merge already defined
      r(110);".

      Please see the example data below:

      Dataset 1 with many variables:
      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input double id byte(region age high_qual sector) float(time training_hrs wages) byte _freq
        22445  7  5 3 4 2013   0 10.007475 1
        22445  7  5 3 5 2012 168  11.49954 1
        22445  7  5 3 5 2014  30  16.31351 1
        22445  7  6 3 5 2015   0 16.195204 1
        22445  7  6 3 5 2016   0 18.026306 1
        22445  7  6 3 5 2017   0 15.332698 1
        22445  7  6 3 5 2018  60 13.799448 1
        22445  7  6 1 5 2019  80 11.073648 1
        29925  7  7 1 3 2016   0 14.374425 1
        29925  7  7 1 4 2014   0         0 1
        29925  7  7 1 4 2015   0         0 1
        29925  7  7 1 . 2012   0         0 1
        29925  7  8 1 3 2017   0  13.41613 1
        29925  7  8 1 3 2018   0 16.659334 1
        29925  7  8 1 3 2019   0 14.785123 1
        76165  5  6 3 2 2015   0 17.742147 1
        76165  5  6 3 2 2016   0  22.99908 1
        76165  5  6 3 2 2017   0 101.85307 1
        76165  5  7 3 2 2018   0  26.16638 1
        76165  5  7 3 2 2019   0  28.12459 1
       223725  8  8 5 2 2015   0         0 1
       223725  8  8 5 2 2016   0         . 1
       280165  8  6 4 2 2011   0         0 1
       280165  8  6 4 2 2012   0         0 1
       280165  8  6 4 2 2013   0 19.067024 1
       280165  8  6 4 3 2010   0  9.318868 1
       280165  8  7 4 2 2014   0         0 1
       280165  8  7 4 2 2015   0  18.15717 1
       280165  8  7 4 2 2016   0  17.24931 1
       280165  8  7 4 2 2017   0  17.24931 1
       280165  8  7 4 2 2018   0  17.24931 1
       280165  8  8 4 2 2019   0 15.130974 1
       333205  5  4 4 2 2014  56  13.20893 1
       333205  5  5 2 3 2015 420  11.49954 1
       333205  5  5 2 3 2016  72  13.20893 1
       333205  5  5 2 3 2017   0 12.431935 1
       333205  5  5 2 3 2018 147 15.572274 1
       333205  5  5 2 3 2019 280 16.783112 1
       387605  9  4 3 . 2012   0         0 1
       387605 10  5 3 6 2015   0 12.882072 1
       387605  9  5 3 . 2013   0         0 1
       387605  9  5 3 . 2014  56         0 1
       469205  4  5 5 5 2018  16  9.774609 1
       469205  4  5 5 5 2019   0  10.45495 1
       469205  4  5 5 . 2017   0  8.213957 1
       541285  4  5 1 5 2011   0         0 1
       541285  4  5 1 . 2012   0         0 1
       541285  4  5 1 . 2013   0         0 1
       541285  4  5 1 . 2014   0         0 1
       541965  4  4 3 3 2011   0         0 1
       599765  5  5 3 2 2012   0         0 1
       599765  5  5 2 2 2013 150 13.176576 1
       599765  5  6 2 2 2017  21  19.46642 1
       599765  5  6 2 2 2018  16   20.7412 1
       599765  5  6 1 2 2019 125 22.053913 1
       665045  5  5 3 2 2011   0  5.929899 1
       665045  5  6 3 2 2012   0  5.036798 1
       665045  5  6 3 2 2013   0  6.924448 1
       665045  5  6 3 2 2014   0  4.408157 1
       665045  5  6 3 2 2016   0  8.882475 1
       665045  .  7 3 2 2018   0 3.8868446 1
       665045  5  7 3 2 2019   0  5.606026 1
       732365  2  6 9 . 2017   0         0 1
       732365  2  6 9 . 2018   0         0 1
       732365  2  6 9 . 2019   0         0 1
       813285  2  8 5 4 2012   0         0 1
       813285  2  8 5 4 2013   0         0 1
       813285  2  8 5 4 2014   0         0 1
       813285  2  9 5 4 2015   0         0 1
       813285  2  9 5 4 2016   0         . 1
       850005  3  5 4 4 2015   0         0 1
       956765  1 11 9 4 2010   .         . 1
       956765  1 11 9 4 2011   0         0 1
       987365  2  4 1 4 2011   0         0 1
       987365  2  4 3 . 2010   .         . 1
      1114525 11  7 3 4 2010   .         . 1
      1558565 10  3 4 . 2011   0         0 1
      1587125  1  9 2 . 2014   0         0 1
      1587125  1 10 2 2 2015   0         0 1
      1587125  1 10 2 2 2016   0         . 1
      1587125  1 10 2 2 2017   0         0 1
      1587125  1 10 2 2 2018   0         0 1
      1587125  1 10 2 2 2019   0         0 1
      1697285  4  8 1 3 2016   0  8.432996 1
      1697285  4  8 1 3 2017   0 10.272923 1
      1697285  4  9 1 3 2018  20  8.690586 1
      1731965 11  4 4 4 2013  84  15.33272 1
      1833965  8  9 5 2 2012   0         0 1
      1833965  8  9 5 2 2013   0  9.255444 1
      1833965  8  9 5 3 2011   0 10.180926 1
      1833965  8  9 5 4 2010   0         . 1
      1833965  8  9 5 4 2014   0  8.080758 1
      1833965  8 10 5 2 2015   0 15.736213 1
      1833965  8 10 5 3 2016   0 12.777267 1
      1833965  8 10 5 5 2017   0  7.973014 1
      2067205 10  5 1 5 2015   0  11.51978 1
      2067205 10  5 1 . 2014   0         0 1
      2270525  3  3 3 . 2016   0         . 1
      2270525  3  4 3 . 2017   0         0 1
      2292285  6  7 9 . 2011   0         0 1
      end
      label values region b_gor_dv
      label def b_gor_dv 1 "North East", modify
      label def b_gor_dv 2 "North West", modify
      label def b_gor_dv 3 "Yorkshire and the Humber", modify
      label def b_gor_dv 4 "East Midlands", modify
      label def b_gor_dv 5 "West Midlands", modify
      label def b_gor_dv 6 "East of England", modify
      label def b_gor_dv 7 "London", modify
      label def b_gor_dv 8 "South East", modify
      label def b_gor_dv 9 "South West", modify
      label def b_gor_dv 10 "Wales", modify
      label def b_gor_dv 11 "Scotland", modify
      label values age b_agegr13_dv
      label def b_agegr13_dv 3 "18-19 years old", modify
      label def b_agegr13_dv 4 "20-24 years old", modify
      label def b_agegr13_dv 5 "25-29 years old", modify
      label def b_agegr13_dv 6 "30-34 years old", modify
      label def b_agegr13_dv 7 "35-39 years old", modify
      label def b_agegr13_dv 8 "40-44 years old", modify
      label def b_agegr13_dv 9 "45-49 years old", modify
      label def b_agegr13_dv 10 "50-54 years old", modify
      label def b_agegr13_dv 11 "55-59 years old", modify
      label values high_qual b_hiqual_dv
      label def b_hiqual_dv 1 "Degree", modify
      label def b_hiqual_dv 2 "Other higher degree", modify
      label def b_hiqual_dv 3 "A-level etc", modify
      label def b_hiqual_dv 4 "GCSE etc", modify
      label def b_hiqual_dv 5 "Other qualification", modify
      label def b_hiqual_dv 9 "No qualification", modify
      label values sector b_jbrgsc_dv
      label def b_jbrgsc_dv 2 "managerial & technical occupation", modify
      label def b_jbrgsc_dv 3 "skilled non-manual", modify
      label def b_jbrgsc_dv 4 "skilled manual", modify
      label def b_jbrgsc_dv 5 "partly skilled occupation", modify
      label def b_jbrgsc_dv 6 "unskilled occupation", modify
      Dataset 2:

      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input long id byte(race _freq)
        1 . 1
        2 . 1
        3 . 1
        4 . 1
        5 . 1
        6 . 1
        7 . 1
        8 . 1
        9 . 1
       10 . 1
       11 . 1
       12 . 1
       13 . 1
       14 . 1
       15 . 1
       16 . 1
       17 . 1
       18 . 1
       19 . 1
       20 . 1
       21 . 1
       22 . 1
       23 . 1
       24 . 1
       25 . 1
       26 . 1
       27 . 1
       28 . 1
       29 . 1
       30 . 1
       31 . 1
       32 . 1
       33 . 1
       34 . 1
       35 . 1
       36 . 1
       37 . 1
       38 . 1
       39 . 1
       40 . 1
       41 . 1
       42 . 1
       43 . 1
       44 . 1
       45 . 1
       46 . 1
       47 . 1
       48 . 1
       49 . 1
       50 . 1
       51 . 1
       52 . 1
       53 . 1
       54 . 1
       55 . 1
       56 . 1
       57 . 1
       58 . 1
       59 . 1
       60 . 1
       61 . 1
       62 . 1
       63 . 1
       64 . 1
       65 . 1
       66 . 1
       67 . 1
       68 . 1
       69 . 1
       70 . 1
       71 . 1
       72 . 1
       73 . 1
       74 . 1
       75 . 1
       76 . 1
       77 . 1
       78 . 1
       79 . 1
       80 . 1
       81 . 1
       82 . 1
       83 . 1
       84 . 1
       85 . 1
       86 . 1
       87 . 1
       88 . 1
       89 . 1
       90 . 1
       91 . 1
       92 . 1
       93 . 1
       94 . 1
       95 . 1
       96 . 1
       97 . 1
       98 . 1
       99 . 1
      100 . 1
      end
      label values race b_racel_dv
      Last edited by sladmin; 25 Apr 2022, 07:29. Reason: anonymize original poster

      Comment


      • #4
        My apologies, Clyde Schechter - you asked for examples in which at least some observations match across the two data sets; I have just realised this. Here is a new example:

        Dataset 1: the same data as shown in #3 above.

        Dataset 2:
        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input long id byte(race _freq)
           22445  1 72
           29925  1 63
           76165  1 45
          223725  1 18
          280165  1 82
          333205  1 54
          387605  1 36
          469205  1 27
          541285  1 36
          541965  1  9
          599765  1 45
          665045  1 63
          732365  1 27
          760925 -9 18
          813285  1 45
          850005  1  9
          956765  1 10
          987365 -9 10
         1114525  1  1
         1126765  1  1
         1558565  1 10
         1587125 13 54
         1697285  1 27
         1731965  1  9
         1833965  1 64
         2067205  1 18
         2270525  1 27
         2292285  1 18
         2297045  1 18
         2401085  1  1
         2539125  1  1
         2626845  1 46
         2670365  1 10
         2817245  1 10
         2853965  1 46
         2888645  1 45
         2903605  1  1
         2932845  1  9
         3229325  1 18
         3424485 -9 18
         3564565  1  9
         3565925  2 18
         3567965  2  9
         3568645 -9 18
         3587685  1 27
         3621685  1  1
         3658405  1  1
         3663845  2 54
         3667245  2 63
         3705325  2 81
         3720965  1  1
         3914085  1  9
         3914765  1  9
         3915445  1 19
         3916125  1  9
         4091565  1 37
         4192205  1 19
         4454005  1 82
         4473725 -9 54
         4562125  1 19
         4626045  1 19
         4626725  1 19
         4794685  1 36
         4849085  1 63
         4853165  1 45
        68002045  1 19
        68002049  1 19
        68002725 12 73
        68004087  4 73
        68006127  1 82
        68006135  1 36
        68006139  1 45
        68006807  1 64
        68007487  1 19
        68007491 -9  9
        68008167  1  1
        68008171 -9  1
        68008847  1 82
        68009527  1 82
        68010887  1 82
        68011567  1 46
        68014287  1 64
        68014291  1 55
        68014295  1 18
        68017687  1 19
        68019047  1 36
        68019051  1 27
        68020407  1 82
        68020564  1 54
        68021765 15 73
        68021769 15 46
        68021773 15 37
        68021777 15 10
        68021781 15 64
        68025167  1  1
        68025847  1 55
        68028567  1 46
        68028571  1 46
        68028575  1 73
        68028579 -9 37
        end
        label values race b_racel_dv
        label def b_racel_dv -9 "missing", modify
        label def b_racel_dv 1 "british/english/scottish/welsh/northern irish (white)", modify
        label def b_racel_dv 2 "irish (white)", modify
        label def b_racel_dv 4 "any other white background (white)", modify
        label def b_racel_dv 12 "chinese (asian or asian british)", modify
        label def b_racel_dv 13 "any other asian background (asian or asian british)", modify
        label def b_racel_dv 15 "african (black or black britih)", modify

        Comment


        • #5
          Forgetting to mention the key id variable in the -merge- command was my error--apologies.

          With these example data sets:
          Code:
          use dataset1, clear
          merge m:1 id using dataset2, keep(master match) nogenerate
          runs with no error messages and matches all 100 observations in the first data set. With the -nogenerate- option specified, you will also not get the "variable _merge already defined" error. That error does not even arise in your example data, because neither example data set contains a variable named _merge; but apparently one of your real data sets does. The code shown here works around that. Unless you actually need that pre-existing _merge variable for something later, I recommend -drop-ing it once you have completed the above: it will only get in the way of future -merge-s, and it wastes space if not needed. If you do need it, I suggest -rename-ing it so that it will not interfere with future -merge- commands.
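
          As a sketch of the -drop-/-rename- suggestion (the file and variable names here are assumptions based on this thread, not your actual files):

          Code:
          * Before merging: move any pre-existing _merge variable out of the way
          use dataset1, clear
          capture confirm variable _merge
          if !_rc rename _merge merge_prior   // or: drop _merge, if it is not needed later

          merge m:1 id using dataset2, keep(master match) nogenerate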




          Comment


          • #6
            Thank you very much, Clyde Schechter! I am slightly confused, as I got the error "variable id does not uniquely identify observations in the using data r(459)". But the "id" I am using is the one issued by BHPS - it is supposed to be unique...

            Comment


            • #7
              Well, there is something wrong with that data set. To find the offending observations, do this:

              Code:
              use dataset2, clear
              duplicates tag id, gen(flag)
              browse if flag
              There are a few possibilities. If the data set was originally a spreadsheet or CSV file that was imported into Stata, it is possible that some extra rows with no information were dragged in along with the real data. If that is the case, the browser will show observations where id is missing, and -drop if missing(id)- will remove them. Re-save the data set and try the merge again. For some reason, when Stata imports these files it sometimes gets the boundaries of the data wrong and brings in extra empty lines; I don't know why that happens, nor whether it is something StataCorp could fix or a quirk of Excel. In any case, in this situation you can just -drop- the observations with missing ids and then proceed.

              But if you see observations in the browser that have real id's in them that are duplicates, then you have a bigger problem. There are two distinct situations:

              Case 1. The duplicate id observations agree on the other variables. In that case -duplicates drop- will solve the immediate problem. But you should not stop there.
              Case 2. The duplicate id observations disagree on the other variables. In that case, you have conflicting information and you cannot proceed until you can resolve these inconsistencies in the data. That will require some investigation on your part, or perhaps contacting BHPS to find out what's going on here.

              However, whether it's case 1 or case 2, it's a sign that something went wrong in the data management that created that data set. Those duplicate id observations shouldn't be there. So you need to review all of the data management that you and your colleagues used to create this data set and fix the mistakes that led to this situation. There is a good chance that in the course of doing that you will find other mistakes which need fixing. If you find nothing wrong in your data management, then look directly at the source BHPS data and find the same duplication there--contact BHPS, inform them of the problem, and ask them to fix their mistake(s).
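
              Putting the steps above together, one possible clean-up sketch (assuming the duplicates turn out to be Case 1; dataset2 is a placeholder filename):

              Code:
              use dataset2, clear
              drop if missing(id)        // remove empty rows dragged in by the import, if any
              duplicates tag id, gen(flag)
              browse if flag             // inspect any remaining duplicate ids
              duplicates drop            // Case 1 only: drops observations identical on all variables
              isid id                    // verify that id now uniquely identifies observations
              save dataset2_deduped, replace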

              Comment


              • #8
                Thank you very much for everything, Clyde Schechter!

                Comment


                • #9
                  The code you recommended revealed that 3,382,588 of 3,90,185 observations are duplicates, which does not really make much sense.

                  Comment
