Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Deleting observations based on a condition

    Hi all,

    I have been working with this data set and trying to delete the observations for which I got a perfect match and also with the observations with values non-zero for the variable 'd', provided there is one observation with value 0 for the variable 'd'.

    In the following example data attached, I tried to find matches for candidate_m. Stata has produced a lot of perfect matches with the same name which are given in candidate_u. But not all the matches are exactly right even though they have a similarity score of 1 from the 'matchit' command. What I would like to do now is, within each identifier variable 'id' (combination of year_m, const_m and candidate_m), I have to check whether d is zero. If it is zero, then delete the other observations within it that are non-zero. If none of the observation has the value of 0 for d within an id, then do not delete anything within that id.


    For ex, there are three ids in the given example below: 17782, 18156 and 19101. The last two ids have a value of zero for the variable 'd'. but the id 17782 does not have a value of zero for the variable 'd'. Therefore, none of the observations within 17782 should be deleted but for the ids 18156 and 19101, all the observations that are non-zero for 'd' should be deleted. To execute this, I have written the following lines of code.

    egen id = group( year_m const_m candidate_m )
    xtset id
    bysort id : gen cum_simil=sum(similscore) // to take care of duplicates
    by id, sort: gen has_perfect_match = 1 if d==0 & cum_simil>1
    drop if has_perfect_match & d!=0

    But the above set of commands delete even those ids which have non-zero value for 'd' i.e. the id 17782 has been deleted altogether, which I don't want to be deleted. There is some mistake in the set of codes and I have tried different combinations for a couple of hours already, but in vain. Any suggestion would be helpful.

    Regards


    ***************************** Example dataset is as below


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(year_m year_u) str26 const_m long ac_m str26 const_u float(order_m order_u) str55 candidate_m float id byte dup double d float(cum_simil has_perfect_match) double similscore str55 candidate_u
    1991 1993 "Garautha" 139 "Bangarmau"  3 10 "CHANDRA PAL SINGH" 17782 6 107.89971231641277 7 . 1 "CHANDRA PAL SINGH"
    1991 1993 "Garautha" 139 "Chhibramau" 3 22 "CHANDRA PAL SINGH" 17782 6  43.09548223868419 2 . 1 "CHANDRA PAL SINGH"
    1991 1993 "Garautha" 139 "Kanth"      3  4 "CHANDRA PAL SINGH" 17782 6  64.94923194380358 4 . 1 "CHANDRA PAL SINGH"
    1991 1993 "Garautha" 139 "Kasganj"    3 26 "CHANDRA PAL SINGH" 17782 6  51.19461464844054 6 . 1 "CHANDRA PAL SINGH"
    1991 1993 "Garautha" 139 "Kashipur"   3 34 "CHANDRA PAL SINGH" 17782 6            49.9999 5 . 1 "CHANDRA PAL SINGH"
    1991 1993 "Garautha" 139 "Shahabad"   3  3 "CHANDRA PAL SINGH" 17782 6            49.9999 1 . 1 "CHANDRA PAL SINGH"
    1991 1993 "Garautha" 139 "Siana"      3 16 "CHANDRA PAL SINGH" 17782 6 113.41941735080039 3 . 1 "CHANDRA PAL SINGH"
    1991 1993 "Kanth"    205 "Bangarmau"  3 10 "CHANDRA PAL SINGH" 18156 6 167.82627376584603 5 . 1 "CHANDRA PAL SINGH"
    1991 1993 "Kanth"    205 "Chhibramau" 3 22 "CHANDRA PAL SINGH" 18156 6 104.90555977626428 4 . 1 "CHANDRA PAL SINGH"
    1991 1993 "Kanth"    205 "Kanth"      3  4 "CHANDRA PAL SINGH" 18156 6                  0 2 1 1 "CHANDRA PAL SINGH"
    1991 1993 "Kanth"    205 "Kasganj"    3 26 "CHANDRA PAL SINGH" 18156 6 13.355054613263283 6 . 1 "CHANDRA PAL SINGH"
    1991 1993 "Kanth"    205 "Kashipur"   3 34 "CHANDRA PAL SINGH" 18156 6            49.9999 7 . 1 "CHANDRA PAL SINGH"
    1991 1993 "Kanth"    205 "Shahabad"   3  3 "CHANDRA PAL SINGH" 18156 6            49.9999 3 . 1 "CHANDRA PAL SINGH"
    1991 1993 "Kanth"    205 "Siana"      3 16 "CHANDRA PAL SINGH" 18156 6 47.062286729680835 1 . 1 "CHANDRA PAL SINGH"
    1991 1993 "Shahabad" 371 "Bangarmau"  2 10 "CHANDRA PAL SINGH" 19101 6            49.9999 7 . 1 "CHANDRA PAL SINGH"
    1991 1993 "Shahabad" 371 "Chhibramau" 2 22 "CHANDRA PAL SINGH" 19101 6            49.9999 3 . 1 "CHANDRA PAL SINGH"
    1991 1993 "Shahabad" 371 "Kanth"      2  4 "CHANDRA PAL SINGH" 19101 6            49.9999 5 . 1 "CHANDRA PAL SINGH"
    1991 1993 "Shahabad" 371 "Kasganj"    2 26 "CHANDRA PAL SINGH" 19101 6            49.9999 6 . 1 "CHANDRA PAL SINGH"
    1991 1993 "Shahabad" 371 "Kashipur"   2 34 "CHANDRA PAL SINGH" 19101 6            49.9999 2 . 1 "CHANDRA PAL SINGH"
    1991 1993 "Shahabad" 371 "Shahabad"   2  3 "CHANDRA PAL SINGH" 19101 6                  0 4 1 1 "CHANDRA PAL SINGH"
    1991 1993 "Shahabad" 371 "Siana"      2 16 "CHANDRA PAL SINGH" 19101 6            49.9999 1 . 1 "CHANDRA PAL SINGH"
    end


    Last edited by Vignesh Ponnusamy; 17 Nov 2018, 00:16.

  • #2
    You have two problems in your code.
    Code:
    bysort id : gen cum_simil=sum(similscore)
    creates a running total of similscore, so that one observation has a value of 1 even if the total value is more than 1. You need
    Code:
    bysort id : egen cum_simil=total(similscore)
    Code:
    by id, sort: gen has_perfect_match = 1 if d==0 & cum_simil>1
    creates has_perfect_match with the value 1 or missing. But then
    Code:
    drop if has_perfect_match & d!=0
    is the same as
    Code:
    drop if has_perfect_match!=0 & d!=0
    since any nonzero value - including missing - is treated as true, so every observations with d!=0 is dropped. What you needed was
    Code:
    by id, sort: gen has_perfect_match = d==0 & cum_simil>1

    Putting this all together, the following code, run on your data, demonstrates the differences between your cum_simil and has_perfect_match and my n_cum_simil and n_has_perfect_match.
    Code:
    // egen id = group( year_m const_m candidate_m )
    xtset id
    bysort id : egen n_cum_simil=total(similscore) // to take care of duplicates
    by id, sort: gen n_has_perfect_match = d==0 & n_cum_simil>1
    // drop if has_perfect_match & d!=0
    
    list id d similscore cum_simil has_perfect_match n_cum_simil n_has_perfect_match ///
      if !(has_perfect_match & d!=0) | !(n_has_perfect_match), sepby(id) noobs
    Code:
      +--------------------------------------------------------------------------+
      |    id           d   simils~e   cum_si~l   has_pe~h   n_cum_~l   n_has_~h |
      |--------------------------------------------------------------------------|
      | 17782   107.89971          1          7          .          7          0 |
      | 17782   43.095482          1          2          .          7          0 |
      | 17782   64.949232          1          4          .          7          0 |
      | 17782   51.194615          1          6          .          7          0 |
      | 17782     49.9999          1          5          .          7          0 |
      | 17782     49.9999          1          1          .          7          0 |
      | 17782   113.41942          1          3          .          7          0 |
      |--------------------------------------------------------------------------|
      | 18156   167.82627          1          5          .          7          0 |
      | 18156   104.90556          1          4          .          7          0 |
      | 18156           0          1          2          1          7          1 |
      | 18156   13.355055          1          6          .          7          0 |
      | 18156     49.9999          1          7          .          7          0 |
      | 18156     49.9999          1          3          .          7          0 |
      | 18156   47.062287          1          1          .          7          0 |
      |--------------------------------------------------------------------------|
      | 19101     49.9999          1          7          .          7          0 |
      | 19101     49.9999          1          3          .          7          0 |
      | 19101     49.9999          1          5          .          7          0 |
      | 19101     49.9999          1          6          .          7          0 |
      | 19101     49.9999          1          2          .          7          0 |
      | 19101           0          1          4          1          7          1 |
      | 19101     49.9999          1          1          .          7          0 |
      +--------------------------------------------------------------------------+

    Comment


    • #3
      Thanks for your reply, Lisowski.
      I have tried this version of the code as well before. If I assign either zero or one to has_perfect_match and try to drop that has perfect match and also d!=0, then none of the observations will be deleted. But, if I assign the value of one or a missing value to has_perfect_match and try to delete has_perfect_match and d!=0, as in the code I have provided earlier, then all the observations that are non-zero are deleted, totally discarding the has_perfect_match condition.

      I have also tried your code for the sample dataset and same as before, '0 observations are deleted' is the result I end up with. There is some simple mistake but can't figure it out yet.

      Regards.

      Comment


      • #4
        Perhaps instead
        Code:
        by id, sort: gen has_perfect_match = 1 if d==0 & cum_simil>1
        should be
        Code:
        by id, sort: egen has_perfect_match = max(d==0 & cum_simil>1)
        Code:
        // egen id = group( year_m const_m candidate_m )
        xtset id
        bysort id : egen n_cum_simil=total(similscore) // to take care of duplicates
        by id, sort: egen n_has_perfect_match = max(d==0 & n_cum_simil>1)
        
        list id d similscore cum_simil has_perfect_match n_cum_simil n_has_perfect_match ///
          if !(has_perfect_match & d!=0) | !(n_has_perfect_match), sepby(id) noobs
          
        drop if n_has_perfect_match & d!=0
        
        list id d similscore cum_simil has_perfect_match n_cum_simil n_has_perfect_match ///
          , sepby(id) noobs
        Code:
        . list id d similscore cum_simil has_perfect_match n_cum_simil n_has_perfect_match ///
        >   if !(has_perfect_match & d!=0) | !(n_has_perfect_match), sepby(id) noobs
        
          +--------------------------------------------------------------------------+
          |    id           d   simils~e   cum_si~l   has_pe~h   n_cum_~l   n_has_~h |
          |--------------------------------------------------------------------------|
          | 17782   107.89971          1          7          .          7          0 |
          | 17782   43.095482          1          2          .          7          0 |
          | 17782   64.949232          1          4          .          7          0 |
          | 17782   51.194615          1          6          .          7          0 |
          | 17782     49.9999          1          5          .          7          0 |
          | 17782     49.9999          1          1          .          7          0 |
          | 17782   113.41942          1          3          .          7          0 |
          |--------------------------------------------------------------------------|
          | 18156           0          1          2          1          7          1 |
          |--------------------------------------------------------------------------|
          | 19101           0          1          4          1          7          1 |
          +--------------------------------------------------------------------------+
        
        .   
        . drop if n_has_perfect_match & d!=0
        (12 observations deleted)
        
        . 
        . list id d similscore cum_simil has_perfect_match n_cum_simil n_has_perfect_match ///
        >   , sepby(id) noobs
        
          +--------------------------------------------------------------------------+
          |    id           d   simils~e   cum_si~l   has_pe~h   n_cum_~l   n_has_~h |
          |--------------------------------------------------------------------------|
          | 17782   107.89971          1          7          .          7          0 |
          | 17782   43.095482          1          2          .          7          0 |
          | 17782   64.949232          1          4          .          7          0 |
          | 17782   51.194615          1          6          .          7          0 |
          | 17782     49.9999          1          5          .          7          0 |
          | 17782     49.9999          1          1          .          7          0 |
          | 17782   113.41942          1          3          .          7          0 |
          |--------------------------------------------------------------------------|
          | 18156           0          1          2          1          7          1 |
          |--------------------------------------------------------------------------|
          | 19101           0          1          4          1          7          1 |
          +--------------------------------------------------------------------------+

        Comment


        • #5
          Thanks, Lisowski.
          Works perfectly.

          Regards
          Vignesh.

          Comment

          Working...
          X