Keep a subset of observations from large database

Mariska Snel

Join Date: Mar 2015

Posts: 33
#1

Keep a subset of observations from large database

02 Jun 2015, 06:46

Dear Statalists,

I'd like to ask for your advice on the " keep if " command. My database uses a unique identifier for each person (such as: 123456 or 654467 or 987344). As I want to keep only 40 persons from my database of around 30.000 persons I cannot use the "drop if" commmand. It is not possible to use

Code:

list

, because the identifiers are spread throughout the whole file, so there is no uniformity. After reading the FAQ by Nick Cox "How do I select a subset of observations using a complicated criterion?" this made me use
keep if identifier == "123456" | id == "654467" | id == "987344"| , but this is very time-consuming.
Do you happen to have another approach to select a subset of observations? Thanks in advance, Mariska
Tags: None
Rich Goldstein

Join Date: Mar 2014

Posts: 4458
#2

02 Jun 2015, 06:47

help inlist
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30065
#3

02 Jun 2015, 08:50

Rich Goldstein's advice would work if your identifier were a numeric variable. But you show it as a string, and for strings, -inlist()- allows at most 10 values to be specified.

Here's another approach. Create a new data set that contains only one variable, id, and contains only the 40 particular id's that you want to keep. Save that data set: I'll call it keepers.dta. Then open (-use-) your original data set in Stata and run

Code:

merge 1:1 id using keepers, keep(match) assert(match master) nogenerate

At the end of this, the data in memory will consist of all the records for those 40 id's that you want to keep. You can save it as a new data set or proceed to analyze it directly as you prefer.

Note: if you make a mistake in creating keepers.dta and it contains any id's that don't actually exist in the original data, you will get an error message when you run the -merge- command.
Comment
Rich Goldstein

Join Date: Mar 2014

Posts: 4458
#4

02 Jun 2015, 10:04

actually, my advice works anyway - you just need to "or" several inlists - in my experience, many users have trouble with merge; further, I note that Clyde Schechter suggests forming a new variable with just the 40 id's but does not tell the OP how to do that <grin>; once the OP does it (presumably Clyde Schechter means to do it from scratch), then merge is fine - whether it is easier depends on both setting up the new data file and getting the merge correct
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35642
#5

02 Jun 2015, 11:26

On the merge tactic see also

http://www.stata.com/support/faqs/da...ets/index.html
Comment
Fernando Bastidas

Join Date: Nov 2020

Posts: 28
#6

16 Nov 2022, 13:09

Originally posted by Clyde Schechter View Post

Rich Goldstein's advice would work if your identifier were a numeric variable. But you show it as a string, and for strings, -inlist()- allows at most 10 values to be specified.

Here's another approach. Create a new data set that contains only one variable, id, and contains only the 40 particular id's that you want to keep. Save that data set: I'll call it keepers.dta. Then open (-use-) your original data set in Stata and run

Code:

merge 1:1 id using keepers, keep(match) assert(match master) nogenerate

At the end of this, the data in memory will consist of all the records for those 40 id's that you want to keep. You can save it as a new data set or proceed to analyze it directly as you prefer.

Note: if you make a mistake in creating keepers.dta and it contains any id's that don't actually exist in the original data, you will get an error message when you run the -merge- command.

Hello Clyde, I'm having the same problem as Mariska and I ran the code above. I created this data set containing only the id that I'm interested (just one column of data), however it appears this error:

Code:

variable mrun does not uniquely identify observations in the master data

My master data set is a panel data, so I'm wondering that maybe that's the problem, but I don't know how to fix it. I would really appreciate your help.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30065
#7

16 Nov 2022, 13:16

Code:

variable mrun does not uniquely identify observations in the master data
My master data set is a panel data, so I'm wondering that maybe that's the problem, but I don't know how to fix it.

Yes, if your master data set is panel data, then there will be multiple observations per panel identifier, one for each time period. The solution is simple. In the -merge- command, change -1:1- to -m:1-.
Comment
Fernando Bastidas

Join Date: Nov 2020

Posts: 28
#8

16 Nov 2022, 13:31

Originally posted by Clyde Schechter View Post

Yes, if your master data set is panel data, then there will be multiple observations per panel identifier, one for each time period. The solution is simple. In the -merge- command, change -1:1- to -m:1-.

It doesn't work, it appears:

Code:

variable mrun does not uniquely identify observations in the using data

And this is my code:

Code:

merge m:1 mrun using "/Users/fernandobastidasespinoza/Desktop/Universidad/2022 S2/Tesis/Datos/Asistencia/Listado 2022.dta", keep(match) assert(match master) nogenerate
Comment
Rich Goldstein

Join Date: Mar 2014

Posts: 4458
#9

16 Nov 2022, 13:38

although mrun apparently is repeated in each data set, it may be the case the combination of mrun and something else works - but you have not presented in any example data (using -dataex- please - see the FAQ) so no specific help can be given; as a general point, note that in many panel data sets there is another variable (e.g, some measure of time if longitudinal or some measure of group if cross-sectional but nested) that will uniquely identify each row in at least one of the files
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30065
#10

16 Nov 2022, 14:20

Rich Goldstein raises one possibility here. But I don't think that's it. If the task to which Fernando Bastidas has set himself is to keep a particular subset of the panels, then he has done something wrong when he created his using data set. Because for that purpose, the using data set should contain only the panel identifier variable, and there should be no duplicates in it. So he needs to go back and fix his using data set.

To find the duplicates in the using data set he can -use- that data set and then run

Code:

duplicates tag mrn, gen(flag) list if flag

If his task is different and involves keeping only certain panels at certain times, then he needs to go back and re-create his using data set so that it contains two variables: the panel identifier and the time periods, listing only those combinations of panel and time that he wants to retain. And again, there should be no duplicates there.
Comment

Fernando Bastidas

Join Date: Nov 2020
Posts: 28

#11

16 Nov 2022, 15:40

This is the panel dataset

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int agno long mrun
2016  17186
2017  17186
2018  17186
2019  17186
2016  23419
2017  23419
2018  23419
2019  23419
2016  75267
2017  75267
2018  75267
2019  75267
2016  90459
2017  90459
2018  90459
2019  90459
2016  94450
2017  94450
2018  94450
2019  94450
2016  94619
2017  94619
2018  94619
2019  94619
2016 129078
2017 129078
2018 129078
2019 129078
2016 134699
2017 134699
2018 134699
2019 134699
2016 135137
2017 135137
2018 135137
2019 135137
2016 141664
2017 141664
2018 141664
2019 141664
2016 143421
2017 143421
2018 143421
2019 143421
2016 150893
2017 150893
2018 150893
2019 150893
2016 175838
2017 175838
2018 175838
2019 175838
2016 178837
2017 178837
2018 178837
2019 178837
2016 180657
2017 180657
2018 180657
2019 180657
2016 181067
2017 181067
2018 181067
2019 181067
2016 234579
2017 234579
2018 234579
2019 234579
2016 241085
2017 241085
2018 241085
2019 241085
2016 243512
2017 243512
2018 243512
2019 243512
2016 253260
2017 253260
2018 253260
2019 253260
2016 275701
2017 275701
2018 275701
2019 275701
2016 277911
2017 277911
2018 277911
2019 277911
2016 287988
2017 287988
2018 287988
2019 287988
2016 310689
2017 310689
2018 310689
2019 310689
2016 324059
2017 324059
2018 324059
2019 324059
end

This is the id data set:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long mrun
  332504
 5640101
 7355008
10861035
11032673
12182377
14900082
17030434
18835797
19374640
19438028
20118466
20704466
22389334
22598101
23353492
23720050
24011292
  430656
  509505
 2565853
 2798915
 3950312
 4037972
 6201012
 7362722
 7637145
 8204569
 8302681
 9581946
10568115
11194389
11249555
16899847
17037962
17552679
17721477
18044536
18751937
19329157
19496651
22789339
  840858
 2626271
 4240916
 5909188
 7095367
 8452141
 8983078
 9290764
12130049
13330439
15318444
15646754
17368071
18271994
19545622
20559799
21194585
23648213
23822522
24613222
  659268
 2477491
 2839298
 2898887
 4091455
 4754059
 8091813
11266456
15472654
16037628
16913199
17731897
20149650
20384269
20670119
21375873
21894732
22730771
25971366
26359738
  661236
 2995433
 9991286
12266946
17945179
19001140
22977503
  236335
  841285
 1662546
 3409833
 3474482
 3882500
 4239379
 7118563
 7953771
 8379151
 9104056
end

Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30065

#12

16 Nov 2022, 17:09

I cannot replicate your problem on my setup. Here's what I get:

Code:

. * Example generated by -dataex-. For more info, type help dataex
. clear

. input int agno long mrun

         agno          mrun
  1. 2016  17186
  2. 2017  17186
  3. 2018  17186
  4. 2019  17186
  5. 2016  23419
  6. 2017  23419
  7. 2018  23419
  8. 2019  23419
  9. 2016  75267
 10. 2017  75267
 11. 2018  75267
 12. 2019  75267
 13. 2016  90459
 14. 2017  90459
 15. 2018  90459
 16. 2019  90459
 17. 2016  94450
 18. 2017  94450
 19. 2018  94450
 20. 2019  94450
 21. 2016  94619
 22. 2017  94619
 23. 2018  94619
 24. 2019  94619
 25. 2016 129078
 26. 2017 129078
 27. 2018 129078
 28. 2019 129078
 29. 2016 134699
 30. 2017 134699
 31. 2018 134699
 32. 2019 134699
 33. 2016 135137
 34. 2017 135137
 35. 2018 135137
 36. 2019 135137
 37. 2016 141664
 38. 2017 141664
 39. 2018 141664
 40. 2019 141664
 41. 2016 143421
 42. 2017 143421
 43. 2018 143421
 44. 2019 143421
 45. 2016 150893
 46. 2017 150893
 47. 2018 150893
 48. 2019 150893
 49. 2016 175838
 50. 2017 175838
 51. 2018 175838
 52. 2019 175838
 53. 2016 178837
 54. 2017 178837
 55. 2018 178837
 56. 2019 178837
 57. 2016 180657
 58. 2017 180657
 59. 2018 180657
 60. 2019 180657
 61. 2016 181067
 62. 2017 181067
 63. 2018 181067
 64. 2019 181067
 65. 2016 234579
 66. 2017 234579
 67. 2018 234579
 68. 2019 234579
 69. 2016 241085
 70. 2017 241085
 71. 2018 241085
 72. 2019 241085
 73. 2016 243512
 74. 2017 243512
 75. 2018 243512
 76. 2019 243512
 77. 2016 253260
 78. 2017 253260
 79. 2018 253260
 80. 2019 253260
 81. 2016 275701
 82. 2017 275701
 83. 2018 275701
 84. 2019 275701
 85. 2016 277911
 86. 2017 277911
 87. 2018 277911
 88. 2019 277911
 89. 2016 287988
 90. 2017 287988
 91. 2018 287988
 92. 2019 287988
 93. 2016 310689
 94. 2017 310689
 95. 2018 310689
 96. 2019 310689
 97. 2016 324059
 98. 2017 324059
 99. 2018 324059
100. 2019 324059
101. end

. tempfile panel_data

. save `panel_data'
file C:\Users\clyde\AppData\Local\Temp\ST_74e0_000001.tmp saved as .dta format

.
. * Example generated by -dataex-. For more info, type help dataex
. clear

. input long mrun

             mrun
  1.   332504
  2.  5640101
  3.  7355008
  4. 10861035
  5. 11032673
  6. 12182377
  7. 14900082
  8. 17030434
  9. 18835797
 10. 19374640
 11. 19438028
 12. 20118466
 13. 20704466
 14. 22389334
 15. 22598101
 16. 23353492
 17. 23720050
 18. 24011292
 19.   430656
 20.   509505
 21.  2565853
 22.  2798915
 23.  3950312
 24.  4037972
 25.  6201012
 26.  7362722
 27.  7637145
 28.  8204569
 29.  8302681
 30.  9581946
 31. 10568115
 32. 11194389
 33. 11249555
 34. 16899847
 35. 17037962
 36. 17552679
 37. 17721477
 38. 18044536
 39. 18751937
 40. 19329157
 41. 19496651
 42. 22789339
 43.   840858
 44.  2626271
 45.  4240916
 46.  5909188
 47.  7095367
 48.  8452141
 49.  8983078
 50.  9290764
 51. 12130049
 52. 13330439
 53. 15318444
 54. 15646754
 55. 17368071
 56. 18271994
 57. 19545622
 58. 20559799
 59. 21194585
 60. 23648213
 61. 23822522
 62. 24613222
 63.   659268
 64.  2477491
 65.  2839298
 66.  2898887
 67.  4091455
 68.  4754059
 69.  8091813
 70. 11266456
 71. 15472654
 72. 16037628
 73. 16913199
 74. 17731897
 75. 20149650
 76. 20384269
 77. 20670119
 78. 21375873
 79. 21894732
 80. 22730771
 81. 25971366
 82. 26359738
 83.   661236
 84.  2995433
 85.  9991286
 86. 12266946
 87. 17945179
 88. 19001140
 89. 22977503
 90.   236335
 91.   841285
 92.  1662546
 93.  3409833
 94.  3474482
 95.  3882500
 96.  4239379
 97.  7118563
 98.  7953771
 99.  8379151
100.  9104056
101. end

. tempfile id_file

. save `id_file'
file C:\Users\clyde\AppData\Local\Temp\ST_74e0_000002.tmp saved as .dta format

.
. use `panel_data', clear

. merge m:1 mrun using `id_file', keep(match)

    Result                      Number of obs
    -----------------------------------------
    Not matched                             0
    Matched                                 0  (_merge==3)
    -----------------------------------------

.
.
end of do-file

There are no error messages, as you see. Now, at least in the example data you show, none of the mrun's in the panel data file find a match in the id file. But I would imagine that the real panel data set is larger and does contain at least some matches.

I suppose it is also possible that the real id file is also larger and it may be that it contains duplications. You need to find them yourself.

[code]
use `id_file', clear // PUT ACTUAL NAME OF THE ID FILE IN PLACE OF `id_file'
duplicates tag mrun, gen(flag)
list mrun if flag
[//code]
That will show you the duplicated id's. Then you need to figure out what to do about them. It may be that they arose because you entered the same id twice. If so, you can just delete one of them. But another possibility is that one of the id's is actually a typo and should be some different value. Figuring out what value it should be and correcting the error is the next step. Ultimately you need to end up with an id data set that lists all and only the mrun's that you want to retain from the panel data, and lists each of them exactly once.

Announcement