Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Keep a subset of observations from large database

    Dear Statalists,

    I'd like to ask for your advice on the " keep if " command. My database uses a unique identifier for each person (such as: 123456 or 654467 or 987344). As I want to keep only 40 persons from my database of around 30.000 persons I cannot use the "drop if" commmand. It is not possible to use
    Code:
    list
    , because the identifiers are spread throughout the whole file, so there is no uniformity. After reading the FAQ by Nick Cox "How do I select a subset of observations using a complicated criterion?" this made me use
    keep if identifier == "123456" | id == "654467" | id == "987344"| , but this is very time-consuming.
    Do you happen to have another approach to select a subset of observations? Thanks in advance, Mariska

  • #2
    help inlist

    Comment


    • #3
      Rich Goldstein's advice would work if your identifier were a numeric variable. But you show it as a string, and for strings, -inlist()- allows at most 10 values to be specified.

      Here's another approach. Create a new data set that contains only one variable, id, and contains only the 40 particular id's that you want to keep. Save that data set: I'll call it keepers.dta. Then open (-use-) your original data set in Stata and run

      Code:
      merge 1:1 id using keepers, keep(match) assert(match master) nogenerate
      At the end of this, the data in memory will consist of all the records for those 40 id's that you want to keep. You can save it as a new data set or proceed to analyze it directly as you prefer.

      Note: if you make a mistake in creating keepers.dta and it contains any id's that don't actually exist in the original data, you will get an error message when you run the -merge- command.

      Comment


      • #4
        actually, my advice works anyway - you just need to "or" several inlists - in my experience, many users have trouble with merge; further, I note that Clyde Schechter suggests forming a new variable with just the 40 id's but does not tell the OP how to do that <grin>; once the OP does it (presumably Clyde Schechter means to do it from scratch), then merge is fine - whether it is easier depends on both setting up the new data file and getting the merge correct

        Comment


        • #5
          On the merge tactic see also

          http://www.stata.com/support/faqs/da...ets/index.html

          Comment


          • #6
            Originally posted by Clyde Schechter View Post
            Rich Goldstein's advice would work if your identifier were a numeric variable. But you show it as a string, and for strings, -inlist()- allows at most 10 values to be specified.

            Here's another approach. Create a new data set that contains only one variable, id, and contains only the 40 particular id's that you want to keep. Save that data set: I'll call it keepers.dta. Then open (-use-) your original data set in Stata and run

            Code:
            merge 1:1 id using keepers, keep(match) assert(match master) nogenerate
            At the end of this, the data in memory will consist of all the records for those 40 id's that you want to keep. You can save it as a new data set or proceed to analyze it directly as you prefer.

            Note: if you make a mistake in creating keepers.dta and it contains any id's that don't actually exist in the original data, you will get an error message when you run the -merge- command.
            Hello Clyde, I'm having the same problem as Mariska and I ran the code above. I created this data set containing only the id that I'm interested (just one column of data), however it appears this error:

            Code:
            variable mrun does not uniquely identify observations in the master data
            My master data set is a panel data, so I'm wondering that maybe that's the problem, but I don't know how to fix it. I would really appreciate your help.

            Comment


            • #7
              Code:

              variable mrun does not uniquely identify observations in the master data
              My master data set is a panel data, so I'm wondering that maybe that's the problem, but I don't know how to fix it.
              Yes, if your master data set is panel data, then there will be multiple observations per panel identifier, one for each time period. The solution is simple. In the -merge- command, change -1:1- to -m:1-.

              Comment


              • #8
                Originally posted by Clyde Schechter View Post

                Yes, if your master data set is panel data, then there will be multiple observations per panel identifier, one for each time period. The solution is simple. In the -merge- command, change -1:1- to -m:1-.
                It doesn't work, it appears:

                Code:
                variable mrun does not uniquely identify observations in the using data
                And this is my code:

                Code:
                merge m:1 mrun using "/Users/fernandobastidasespinoza/Desktop/Universidad/2022 S2/Tesis/Datos/Asistencia/Listado 2022.dta", keep(match) assert(match master) nogenerate

                Comment


                • #9
                  although mrun apparently is repeated in each data set, it may be the case the combination of mrun and something else works - but you have not presented in any example data (using -dataex- please - see the FAQ) so no specific help can be given; as a general point, note that in many panel data sets there is another variable (e.g, some measure of time if longitudinal or some measure of group if cross-sectional but nested) that will uniquely identify each row in at least one of the files

                  Comment


                  • #10
                    Rich Goldstein raises one possibility here. But I don't think that's it. If the task to which Fernando Bastidas has set himself is to keep a particular subset of the panels, then he has done something wrong when he created his using data set. Because for that purpose, the using data set should contain only the panel identifier variable, and there should be no duplicates in it. So he needs to go back and fix his using data set.

                    To find the duplicates in the using data set he can -use- that data set and then run
                    Code:
                    duplicates tag mrn, gen(flag)
                    list if flag
                    If his task is different and involves keeping only certain panels at certain times, then he needs to go back and re-create his using data set so that it contains two variables: the panel identifier and the time periods, listing only those combinations of panel and time that he wants to retain. And again, there should be no duplicates there.

                    Comment


                    • #11
                      This is the panel dataset

                      Code:
                      * Example generated by -dataex-. For more info, type help dataex
                      clear
                      input int agno long mrun
                      2016  17186
                      2017  17186
                      2018  17186
                      2019  17186
                      2016  23419
                      2017  23419
                      2018  23419
                      2019  23419
                      2016  75267
                      2017  75267
                      2018  75267
                      2019  75267
                      2016  90459
                      2017  90459
                      2018  90459
                      2019  90459
                      2016  94450
                      2017  94450
                      2018  94450
                      2019  94450
                      2016  94619
                      2017  94619
                      2018  94619
                      2019  94619
                      2016 129078
                      2017 129078
                      2018 129078
                      2019 129078
                      2016 134699
                      2017 134699
                      2018 134699
                      2019 134699
                      2016 135137
                      2017 135137
                      2018 135137
                      2019 135137
                      2016 141664
                      2017 141664
                      2018 141664
                      2019 141664
                      2016 143421
                      2017 143421
                      2018 143421
                      2019 143421
                      2016 150893
                      2017 150893
                      2018 150893
                      2019 150893
                      2016 175838
                      2017 175838
                      2018 175838
                      2019 175838
                      2016 178837
                      2017 178837
                      2018 178837
                      2019 178837
                      2016 180657
                      2017 180657
                      2018 180657
                      2019 180657
                      2016 181067
                      2017 181067
                      2018 181067
                      2019 181067
                      2016 234579
                      2017 234579
                      2018 234579
                      2019 234579
                      2016 241085
                      2017 241085
                      2018 241085
                      2019 241085
                      2016 243512
                      2017 243512
                      2018 243512
                      2019 243512
                      2016 253260
                      2017 253260
                      2018 253260
                      2019 253260
                      2016 275701
                      2017 275701
                      2018 275701
                      2019 275701
                      2016 277911
                      2017 277911
                      2018 277911
                      2019 277911
                      2016 287988
                      2017 287988
                      2018 287988
                      2019 287988
                      2016 310689
                      2017 310689
                      2018 310689
                      2019 310689
                      2016 324059
                      2017 324059
                      2018 324059
                      2019 324059
                      end
                      This is the id data set:

                      Code:
                      * Example generated by -dataex-. For more info, type help dataex
                      clear
                      input long mrun
                        332504
                       5640101
                       7355008
                      10861035
                      11032673
                      12182377
                      14900082
                      17030434
                      18835797
                      19374640
                      19438028
                      20118466
                      20704466
                      22389334
                      22598101
                      23353492
                      23720050
                      24011292
                        430656
                        509505
                       2565853
                       2798915
                       3950312
                       4037972
                       6201012
                       7362722
                       7637145
                       8204569
                       8302681
                       9581946
                      10568115
                      11194389
                      11249555
                      16899847
                      17037962
                      17552679
                      17721477
                      18044536
                      18751937
                      19329157
                      19496651
                      22789339
                        840858
                       2626271
                       4240916
                       5909188
                       7095367
                       8452141
                       8983078
                       9290764
                      12130049
                      13330439
                      15318444
                      15646754
                      17368071
                      18271994
                      19545622
                      20559799
                      21194585
                      23648213
                      23822522
                      24613222
                        659268
                       2477491
                       2839298
                       2898887
                       4091455
                       4754059
                       8091813
                      11266456
                      15472654
                      16037628
                      16913199
                      17731897
                      20149650
                      20384269
                      20670119
                      21375873
                      21894732
                      22730771
                      25971366
                      26359738
                        661236
                       2995433
                       9991286
                      12266946
                      17945179
                      19001140
                      22977503
                        236335
                        841285
                       1662546
                       3409833
                       3474482
                       3882500
                       4239379
                       7118563
                       7953771
                       8379151
                       9104056
                      end

                      Comment


                      • #12
                        I cannot replicate your problem on my setup. Here's what I get:

                        Code:
                        . * Example generated by -dataex-. For more info, type help dataex
                        . clear
                        
                        . input int agno long mrun
                        
                                 agno          mrun
                          1. 2016  17186
                          2. 2017  17186
                          3. 2018  17186
                          4. 2019  17186
                          5. 2016  23419
                          6. 2017  23419
                          7. 2018  23419
                          8. 2019  23419
                          9. 2016  75267
                         10. 2017  75267
                         11. 2018  75267
                         12. 2019  75267
                         13. 2016  90459
                         14. 2017  90459
                         15. 2018  90459
                         16. 2019  90459
                         17. 2016  94450
                         18. 2017  94450
                         19. 2018  94450
                         20. 2019  94450
                         21. 2016  94619
                         22. 2017  94619
                         23. 2018  94619
                         24. 2019  94619
                         25. 2016 129078
                         26. 2017 129078
                         27. 2018 129078
                         28. 2019 129078
                         29. 2016 134699
                         30. 2017 134699
                         31. 2018 134699
                         32. 2019 134699
                         33. 2016 135137
                         34. 2017 135137
                         35. 2018 135137
                         36. 2019 135137
                         37. 2016 141664
                         38. 2017 141664
                         39. 2018 141664
                         40. 2019 141664
                         41. 2016 143421
                         42. 2017 143421
                         43. 2018 143421
                         44. 2019 143421
                         45. 2016 150893
                         46. 2017 150893
                         47. 2018 150893
                         48. 2019 150893
                         49. 2016 175838
                         50. 2017 175838
                         51. 2018 175838
                         52. 2019 175838
                         53. 2016 178837
                         54. 2017 178837
                         55. 2018 178837
                         56. 2019 178837
                         57. 2016 180657
                         58. 2017 180657
                         59. 2018 180657
                         60. 2019 180657
                         61. 2016 181067
                         62. 2017 181067
                         63. 2018 181067
                         64. 2019 181067
                         65. 2016 234579
                         66. 2017 234579
                         67. 2018 234579
                         68. 2019 234579
                         69. 2016 241085
                         70. 2017 241085
                         71. 2018 241085
                         72. 2019 241085
                         73. 2016 243512
                         74. 2017 243512
                         75. 2018 243512
                         76. 2019 243512
                         77. 2016 253260
                         78. 2017 253260
                         79. 2018 253260
                         80. 2019 253260
                         81. 2016 275701
                         82. 2017 275701
                         83. 2018 275701
                         84. 2019 275701
                         85. 2016 277911
                         86. 2017 277911
                         87. 2018 277911
                         88. 2019 277911
                         89. 2016 287988
                         90. 2017 287988
                         91. 2018 287988
                         92. 2019 287988
                         93. 2016 310689
                         94. 2017 310689
                         95. 2018 310689
                         96. 2019 310689
                         97. 2016 324059
                         98. 2017 324059
                         99. 2018 324059
                        100. 2019 324059
                        101. end
                        
                        . tempfile panel_data
                        
                        . save `panel_data'
                        file C:\Users\clyde\AppData\Local\Temp\ST_74e0_000001.tmp saved as .dta format
                        
                        .
                        . * Example generated by -dataex-. For more info, type help dataex
                        . clear
                        
                        . input long mrun
                        
                                     mrun
                          1.   332504
                          2.  5640101
                          3.  7355008
                          4. 10861035
                          5. 11032673
                          6. 12182377
                          7. 14900082
                          8. 17030434
                          9. 18835797
                         10. 19374640
                         11. 19438028
                         12. 20118466
                         13. 20704466
                         14. 22389334
                         15. 22598101
                         16. 23353492
                         17. 23720050
                         18. 24011292
                         19.   430656
                         20.   509505
                         21.  2565853
                         22.  2798915
                         23.  3950312
                         24.  4037972
                         25.  6201012
                         26.  7362722
                         27.  7637145
                         28.  8204569
                         29.  8302681
                         30.  9581946
                         31. 10568115
                         32. 11194389
                         33. 11249555
                         34. 16899847
                         35. 17037962
                         36. 17552679
                         37. 17721477
                         38. 18044536
                         39. 18751937
                         40. 19329157
                         41. 19496651
                         42. 22789339
                         43.   840858
                         44.  2626271
                         45.  4240916
                         46.  5909188
                         47.  7095367
                         48.  8452141
                         49.  8983078
                         50.  9290764
                         51. 12130049
                         52. 13330439
                         53. 15318444
                         54. 15646754
                         55. 17368071
                         56. 18271994
                         57. 19545622
                         58. 20559799
                         59. 21194585
                         60. 23648213
                         61. 23822522
                         62. 24613222
                         63.   659268
                         64.  2477491
                         65.  2839298
                         66.  2898887
                         67.  4091455
                         68.  4754059
                         69.  8091813
                         70. 11266456
                         71. 15472654
                         72. 16037628
                         73. 16913199
                         74. 17731897
                         75. 20149650
                         76. 20384269
                         77. 20670119
                         78. 21375873
                         79. 21894732
                         80. 22730771
                         81. 25971366
                         82. 26359738
                         83.   661236
                         84.  2995433
                         85.  9991286
                         86. 12266946
                         87. 17945179
                         88. 19001140
                         89. 22977503
                         90.   236335
                         91.   841285
                         92.  1662546
                         93.  3409833
                         94.  3474482
                         95.  3882500
                         96.  4239379
                         97.  7118563
                         98.  7953771
                         99.  8379151
                        100.  9104056
                        101. end
                        
                        . tempfile id_file
                        
                        . save `id_file'
                        file C:\Users\clyde\AppData\Local\Temp\ST_74e0_000002.tmp saved as .dta format
                        
                        .
                        . use `panel_data', clear
                        
                        . merge m:1 mrun using `id_file', keep(match)
                        
                            Result                      Number of obs
                            -----------------------------------------
                            Not matched                             0
                            Matched                                 0  (_merge==3)
                            -----------------------------------------
                        
                        .
                        .
                        end of do-file
                        There are no error messages, as you see. Now, at least in the example data you show, none of the mrun's in the panel data file find a match in the id file. But I would imagine that the real panel data set is larger and does contain at least some matches.

                        I suppose it is also possible that the real id file is also larger and it may be that it contains duplications. You need to find them yourself.

                        [code]
                        use `id_file', clear // PUT ACTUAL NAME OF THE ID FILE IN PLACE OF `id_file'
                        duplicates tag mrun, gen(flag)
                        list mrun if flag
                        [//code]
                        That will show you the duplicated id's. Then you need to figure out what to do about them. It may be that they arose because you entered the same id twice. If so, you can just delete one of them. But another possibility is that one of the id's is actually a typo and should be some different value. Figuring out what value it should be and correcting the error is the next step. Ultimately you need to end up with an id data set that lists all and only the mrun's that you want to retain from the panel data, and lists each of them exactly once.

                        Comment

                        Working...
                        X