Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Dropping observations based on a group

    Hello, I have problem, I have a variable named dup_casnum which lists the duplicated case numbers, I need to specify some complex commands, If there is an instance where phonetype has 1 cell phone entry as well as other entries, I will retain only the cell phone entries, if there is a case where the duplicates have no cell phone entry but "Other" I will retain "Other" if there is an entry with "home" and "work" I will retain only the "home" entry and so forth. so in a situation like the one shown above the code, I will only retain the "cell" entry and delete the other phone types, in a second situation where there is no "Cell" like in the "SAPD24009076" casenumber I will retain only the "Other" entry and delete "Home". I hope this makes sense

    "SAPD24010220" "Work"
    "SAPD24010220" "Other"
    "SAPD24010220" "Cell"
    "SAPD24010220" "Home"

    "SAPD24009076" "Other"
    "SAPD24009076" "Home"
    "SAPD24009076" "Home"


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str12 dup_casenum str16 PHONETYPE
    ""             "Cell" 
    ""             "Cell" 
    ""             "Cell" 
    ""             "Cell" 
    ""             "Cell" 
    ""             "Other"
    ""             "Cell" 
    ""             "Cell" 
    "SAPD24001556" ""     
    "SAPD24001556" "Cell" 
    ""             "Other"
    ""             "Cell" 
    ""             "Cell" 
    ""             "Cell" 
    ""             "Other"
    ""             "Other"
    ""             "Other"
    ""             "Cell" 
    ""             "Cell" 
    ""             "Cell" 
    ""             "Cell" 
    ""             "Other"
    ""             "Cell" 
    ""             "Cell" 
    ""             "Cell" 
    ""             "Other"
    ""             "Cell" 
    ""             "Cell" 
    ""             "Home" 
    ""             "Cell" 
    ""             "Other"
    ""             "Cell" 
    ""             "Other"
    ""             "Other"
    ""             "Cell" 
    ""             "Cell" 
    ""             "Cell" 
    ""             "Cell" 
    ""             "Cell" 
    ""             "Other"
    ""             "Cell" 
    ""             "Cell" 
    ""             "Cell" 
    ""             "Cell" 
    "SAPD24007083" "Cell" 
    "SAPD24007083" "Cell" 
    ""             "Cell" 
    ""             "Cell" 
    ""             "Cell" 
    ""             "Cell" 
    "SAPD24007908" "Work" 
    "SAPD24007908" "Other"
    ""             "Cell" 
    "SAPD24008281" "Cell" 
    "SAPD24008281" "Cell" 
    ""             "Cell" 
    ""             "Cell" 
    ""             "Cell" 
    ""             "Cell" 
    ""             "Cell" 
    "SAPD24009076" "Other"
    "SAPD24009076" "Home" 
    "SAPD24009076" "Home" 
    ""             "Cell" 
    ""             "Cell" 
    "SAPD24009638" "Cell" 
    "SAPD24009638" "Cell" 
    "SAPD24009638" "Cell" 
    ""             "Cell" 
    ""             "Cell" 
    ""             "Cell" 
    ""             "Cell" 
    "SAPD24009906" "Other"
    "SAPD24009906" "Home" 
    ""             "Cell" 
    ""             "Cell" 
    ""             "Cell" 
    ""             "Cell" 
    ""             "Cell" 
    "SAPD24010220" "Work" 
    "SAPD24010220" "Other"
    "SAPD24010220" "Cell" 
    "SAPD24010220" "Home" 
    ""             "Cell" 
    ""             "Cell" 
    ""             "Cell" 
    ""             "Cell" 
    "SAPD24011302" "Cell" 
    "SAPD24011302" "Work" 
    ""             "Cell" 
    ""             "Cell" 
    ""             "Cell" 
    ""             "Cell" 
    "SAPD24012030" "Work" 
    "SAPD24012030" "Cell" 
    ""             "Cell" 
    "SAPD24012182" "Cell" 
    "SAPD24012182" "Cell" 
    "SAPD24012672" "Cell" 
    "SAPD24012672" "Cell" 
    end

  • #2
    I think it would be easier to provide help if you provide the respondent ID rather than the duplicate id.

    Comment


    • #3
      Originally posted by George Ford View Post
      I think it would be easier to provide help if you provide the respondent ID rather than the duplicate id.
      [CODE]
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input str12 CASENUM str16 PHONETYPE
      "SAPD23235461" "Cell"
      "SAPD24000177" "Cell"
      "SAPD24000195" "Cell"
      "SAPD24000290" "Cell"
      "SAPD24001036" "Cell"
      "SAPD24001107" "Other"
      "SAPD24001116" "Cell"
      "SAPD24001334" "Cell"
      "SAPD24001556" ""
      "SAPD24001556" "Cell"
      "SAPD24002046" "Other"
      "SAPD24002250" "Cell"
      "SAPD24002287" "Cell"
      "SAPD24002352" "Cell"
      "SAPD24002944" "Other"
      "SAPD24003514" "Other"
      "SAPD24003568" "Other"
      "SAPD24003700" "Cell"
      "SAPD24003811" "Cell"
      "SAPD24003953" "Cell"
      "SAPD24004052" "Cell"
      "SAPD24004141" "Other"
      "SAPD24004200" "Cell"
      "SAPD24004220" "Cell"
      "SAPD24004271" "Cell"
      "SAPD24004665" "Other"
      "SAPD24004733" "Cell"
      "SAPD24004737" "Cell"
      "SAPD24004751" "Home"
      "SAPD24004839" "Cell"
      "SAPD24004847" "Other"
      "SAPD24004894" "Cell"
      "SAPD24005422" "Other"
      "SAPD24005476" "Other"
      "SAPD24005649" "Cell"
      "SAPD24005664" "Cell"
      "SAPD24005807" "Cell"
      "SAPD24005845" "Cell"
      "SAPD24006049" "Cell"
      "SAPD24006108" "Other"
      "SAPD24006131" "Cell"
      "SAPD24006513" "Cell"
      "SAPD24006710" "Cell"
      "SAPD24006942" "Cell"
      "SAPD24007083" "Cell"
      "SAPD24007083" "Cell"
      "SAPD24007434" "Cell"
      "SAPD24007532" "Cell"
      "SAPD24007541" "Cell"
      "SAPD24007690" "Cell"
      "SAPD24007908" "Work"
      "SAPD24007908" "Other"
      "SAPD24008168" "Cell"
      "SAPD24008281" "Cell"
      "SAPD24008281" "Cell"
      "SAPD24008349" "Cell"
      "SAPD24008354" "Cell"
      "SAPD24008438" "Cell"
      "SAPD24008453" "Cell"
      "SAPD24008815" "Cell"
      "SAPD24009076" "Other"
      "SAPD24009076" "Home"
      "SAPD24009076" "Home"
      "SAPD24009298" "Cell"
      "SAPD24009308" "Cell"
      "SAPD24009638" "Cell"
      "SAPD24009638" "Cell"
      "SAPD24009638" "Cell"
      "SAPD24009834" "Cell"
      "SAPD24009859" "Cell"
      "SAPD24009879" "Cell"
      "SAPD24009895" "Cell"
      "SAPD24009906" "Other"
      "SAPD24009906" "Home"
      "SAPD24009962" "Cell"
      "SAPD24009966" "Cell"
      "SAPD24009987" "Cell"
      "SAPD24009991" "Cell"
      "SAPD24010178" "Cell"
      "SAPD24010220" "Work"
      "SAPD24010220" "Other"
      "SAPD24010220" "Cell"
      "SAPD24010220" "Home"
      "SAPD24010530" "Cell"
      "SAPD24010886" "Cell"
      "SAPD24011065" "Cell"
      "SAPD24011100" "Cell"
      "SAPD24011302" "Cell"
      "SAPD24011302" "Work"
      "SAPD24011485" "Cell"
      "SAPD24011508" "Cell"
      "SAPD24011680" "Cell"
      "SAPD24011762" "Cell"
      "SAPD24012030" "Work"
      "SAPD24012030" "Cell"
      "SAPD24012175" "Cell"
      "SAPD24012182" "Cell"
      "SAPD24012182" "Cell"
      "SAPD24012672" "Cell"
      "SAPD24012672" "Cell"
      end


      here you go

      Comment


      • #4
        Some of these are pure duplicates. Do you have other variables that identify family members or some such? Do you care about that?

        Comment


        • #5
          Originally posted by George Ford View Post
          Some of these are pure duplicates. Do you have other variables that identify family members or some such? Do you care about that?
          no other variable identifies family members, I only have the case numbers, I subsetted the case numbers to those with duplicates because only those individuals with duplicates will have multiple entries for PHONETYPE

          Comment


          • #6
            Code:
            duplicates drop PHONETYPE CASENUM , force
            * unless you want to keep them for some reason; if so, you need to egen new caseid using group with CASENUM and other relevant ID variables.
            g prefer = (PHONETYPE=="Cell")*1 + (PHONETYPE=="Other")*2 + (PHONETYPE=="Home")*3
            egen prefert = min(prefer), by(CASENUM)
            unique id
            keep if prefer==prefert
            unique id
            egen ID_N = count(CASENUM), by(CASENUM)  //should be 1 for all
            Last edited by George Ford; 09 Jul 2024, 09:02.

            Comment


            • #7
              Originally posted by George Ford View Post
              Code:
              duplicates drop PHONETYPE CASENUM , force
              * unless you want to keep them for some reason; if so, you need to egen new caseid using group with CASENUM and other relevant ID variables.
              g prefer = (PHONETYPE=="Cell")*1 + (PHONETYPE=="Other")*2 + (PHONETYPE=="Home")*3
              egen prefert = min(prefer), by(CASENUM)
              unique id
              keep if prefer==prefert
              unique id
              egen ID_N = count(CASENUM), by(CASENUM) //should be 1 for all
              there seems to be an error in your code,

              . unique id
              command unique is unrecognized
              r(199);

              Comment


              • #8
                ssc install unique

                you can drop it, but it shows that the CASENUM count is the same before and after the keep.

                Comment


                • #9
                  Originally posted by George Ford View Post
                  Code:
                  duplicates drop PHONETYPE CASENUM , force
                  * unless you want to keep them for some reason; if so, you need to egen new caseid using group with CASENUM and other relevant ID variables.
                  g prefer = (PHONETYPE=="Cell")*1 + (PHONETYPE=="Other")*2 + (PHONETYPE=="Home")*3
                  egen prefert = min(prefer), by(CASENUM)
                  unique id
                  keep if prefer==prefert
                  unique id
                  egen ID_N = count(CASENUM), by(CASENUM) //should be 1 for all
                  My apologies but I'm still getting another another error

                  unique id
                  variable id not found

                  Comment


                  • #10
                    sorry. change id to CASENUM

                    Comment


                    • #11
                      Originally posted by George Ford View Post
                      sorry. change id to CASENUM
                      Sorry but I'm not sure the code fixes my problem, for example in this situation the code should retain both cell phones, but the code is retaining the Work phonetype. The code should apply to the cases where there are multiple phone entries per each participant, so that's why I originally subsetted the duplicate CASENUM, because the duplicate CASENUM's are the ones with multiple phone entries.
                      SAPD24111630
                      SAPD24111630
                      SAPD24111630
                      SAPD24111630
                      Cell
                      Cell
                      Work
                      Other
                      Last edited by Luis Mijares Castaneda; 09 Jul 2024, 09:46.

                      Comment


                      • #12
                        I see. Work was being assigned a value of 0, which made it the minimum.

                        The duplicates drop only drops exact replicas.

                        Try this:

                        Code:
                        duplicates drop PHONETYPE CASENUM , force
                        unique CASENUM
                        g prefer = (PHONETYPE=="Cell")*1 + (PHONETYPE=="Other")*2 + (PHONETYPE=="Home")*3 + (PHONETYPE=="Work")*4
                        egen prefert = min(prefer), by(CASENUM)
                        keep if prefer==prefert
                        unique CASENUM
                        egen ID_N = count(CASENUM), by(CASENUM)

                        Comment


                        • #13
                          [QUOTE=George Ford;n1758275]I see. Work was being assigned a value of 0, which made it the minimum.

                          The duplicates drop only drops exact replicas.

                          Try this:

                          Code:
                          duplicates drop PHONETYPE CASENUM , force
                          unique CASENUM
                          g prefer = (PHONETYPE=="Cell")*1 + (PHONETYPE=="Other")*2 + (PHONETYPE=="Home")*3 + (PHONETYPE=="Work")*4
                          egen prefert = min(prefer), by(CASENUM)
                          keep if prefer==prefert
                          unique CASENUM
                          egen ID_N = count(CASENUM), by(CASENUM)
                          sorry one more question I have another variable named phone number, I need to retain the phone number of the second duplicate CASENUM if Phonetype is cell phone, so in the example below, the second cell PHONENUM will become another variables, does this make sense? Sorry for the confusion
                          SAPD24111630 Cell
                          SAPD24111630 Cell
                          SAPD24111630 Work
                          SAPD24111630 Other


                          * Example generated by -dataex-. For more info, type help dataex
                          clear
                          input str12 CASENUM str16 PHONETYPE double PHONENUMBER
                          "SAPD24118937" "Cell" 18304316624
                          "SAPD24118889" "Cell" 12109791643
                          "SAPD24118835" "Cell" 14024994996
                          "SAPD24118777" "Cell" 12812544320
                          "SAPD24118755" "Cell" 17856141487
                          "SAPD24118755" "Cell" 13038701951
                          "SAPD24118519" "Cell" 12106053538
                          "SAPD24118322" "Cell" 12106438156
                          "SAPD24118198" "Cell" 12252784565
                          "SAPD24118113" "Cell" 13613433082
                          "SAPD24117355" "Cell" 12109622257
                          "SAPD24117332" "Cell" 12107711890
                          "SAPD24117332" "Home" 17547799
                          "SAPD24117332" "Home" 12104452311
                          "SAPD24117313" "Cell" 15095928290
                          "SAPD24117309" "Cell" 12103866666
                          "SAPD24117285" "Cell" 12107710816
                          "SAPD24117283" "Home" 16263285
                          "SAPD24117283" "Cell" 12104541543
                          "SAPD24117155" "Cell" 12108105758
                          "SAPD24116684" "Cell" 12105178751
                          "SAPD24116652" "Cell" 15614258338
                          "SAPD24116633" "Cell" 12107179800
                          "SAPD24116617" "Cell" 12547072120
                          "SAPD24116556" "Cell" 17262067496
                          "SAPD24116238" "Work" 12106890915
                          "SAPD24116238" "Cell" 12107168732
                          "SAPD24116238" "Other" 12105733008
                          "SAPD24115375" "Cell" 12102871475
                          "SAPD24115319" "Other" 12104265972
                          "SAPD24115279" "" .
                          "SAPD24115279" "Cell" 18325854444
                          "SAPD24115247" "Cell" 12108484932
                          "SAPD24114875" "Cell" 18305818695
                          "SAPD24114836" "Work" 12103662304
                          "SAPD24114836" "Home" 14796002
                          "SAPD24114836" "Other" 12102136075
                          "SAPD24114836" "Other" 12102136975
                          "SAPD24114824" "Home" 12102871383
                          "SAPD24114824" "Cell" 13614459044
                          "SAPD24114824" "Other" 18303050607
                          "SAPD24114770" "Other" 12103855645
                          "SAPD24114759" "Cell" 18306942226
                          "SAPD24114692" "Cell" 12145639030
                          "SAPD24114658" "Work" 12102062238
                          "SAPD24114658" "Cell" 12103780789
                          "SAPD24114544" "Cell" 17263339822
                          "SAPD24114451" "Cell" 12106300710
                          "SAPD24114368" "Cell" 14328160330
                          "SAPD24114342" "Other" 12109877906
                          "SAPD24114271" "Cell" 12103859859
                          "SAPD24114255" "Cell" 12102515863
                          "SAPD24114025" "Cell" 19542257783
                          "SAPD24113868" "Cell" 13526344163
                          "SAPD24113859" "Cell" 19404358533
                          "SAPD24113829" "Work" 12104106553
                          "SAPD24113829" "Other" 18305708881
                          "SAPD24113799" "Cell" 14328894618
                          "SAPD24113799" "Cell" 14328892030
                          "SAPD24113740" "Cell" 14328038084
                          "SAPD24113630" "Cell" 17872426437
                          "SAPD24113458" "Cell" 16083439978
                          "SAPD24113439" "Cell" 12102547088
                          "SAPD24113428" "Cell" 18315962388
                          "SAPD24113306" "Cell" 12104730533
                          "SAPD24113212" "Cell" 12102907162
                          "SAPD24113212" "Cell" 12102942307
                          "SAPD24113193" "Cell" 19034531168
                          "SAPD24113108" "Cell" 17262094027
                          "SAPD24111630" "Cell" 17267770236
                          "SAPD24111630" "Cell" 12107945931
                          "SAPD24111630" "Work" 12105341874
                          "SAPD24111630" "Other" 12104655628
                          "SAPD24111551" "Cell" 12108129810
                          "SAPD24111487" "Cell" 15596946571
                          "SAPD24111414" "Cell" 18303189541
                          "SAPD24111414" "Other" 12107794694
                          "SAPD24111379" "Cell" 12103653666
                          "SAPD24110966" "Cell" 18325840971
                          "SAPD24110660" "Cell" 12108595319
                          "SAPD24110498" "Cell" 12105890399
                          "SAPD24110011" "Cell" 12103320671
                          "SAPD24109950" "Cell" 12104390646
                          "SAPD24109941" "Cell" 17209660481
                          "SAPD24109786" "Cell" 18329357982
                          "SAPD24109677" "Cell" 12108834344
                          "SAPD24108872" "" .
                          "SAPD24108807" "Cell" 18305131205
                          "SAPD24108746" "Cell" 12106680556
                          "SAPD24108727" "Cell" 19562363062
                          "SAPD24108472" "Home" 19033609332
                          "SAPD24108472" "Cell" 12103800924
                          "SAPD24108472" "Cell" 12107276540
                          "SAPD24108372" "Cell" 12105961760
                          "SAPD24108372" "Cell" 12106678965
                          "SAPD24108326" "Cell" 19794804347
                          "SAPD24108235" "" .
                          "SAPD24108235" "Cell" 12108127268
                          "SAPD24108217" "Cell" 18169165169
                          "SAPD24107440" "Cell" 14694505791
                          end
                          [/CODE]
                          ------------------ copy up to and including the previous line ------------------

                          Comment


                          • #14
                            Well that's quite the wrinkle.

                            As a first shot, try this and then sort out what you get.

                            Code:
                            sort CASENUM (PHONETYPE)
                            bys CASENUM: g id_n = _n
                            tab id_n
                            
                            forv i = 1/4 {
                            preserve 
                            keep if id_n == `i'
                            rename PHONETYPE PHONETYPE`i' 
                            rename PHONENUMBER PHONENUMBER`i' 
                            save group`i', replace
                            restore
                            }
                            
                            use group1, clear
                            forv i = 2/4 {
                                joinby CASENUM using group`i' , unmatched(master) 
                                drop _merge
                            }
                            
                            drop id_n

                            Comment


                            • #15
                              I wonder if reshape might do this more directly. I'll try it.

                              Comment

                              Working...
                              X