Deleting observations based on a condition

Vignesh Ponnusamy

Join Date: May 2018
Posts: 29

Deleting observations based on a condition

17 Nov 2018, 00:01

Hi all,

I have been working with this data set and trying to delete the observations for which I got a perfect match and also with the observations with values non-zero for the variable 'd', provided there is one observation with value 0 for the variable 'd'.

In the following example data attached, I tried to find matches for candidate_m. Stata has produced a lot of perfect matches with the same name which are given in candidate_u. But not all the matches are exactly right even though they have a similarity score of 1 from the 'matchit' command. What I would like to do now is, within each identifier variable 'id' (combination of year_m, const_m and candidate_m), I have to check whether d is zero. If it is zero, then delete the other observations within it that are non-zero. If none of the observation has the value of 0 for d within an id, then do not delete anything within that id.

For ex, there are three ids in the given example below: 17782, 18156 and 19101. The last two ids have a value of zero for the variable 'd'. but the id 17782 does not have a value of zero for the variable 'd'. Therefore, none of the observations within 17782 should be deleted but for the ids 18156 and 19101, all the observations that are non-zero for 'd' should be deleted. To execute this, I have written the following lines of code.

egen id = group( year_m const_m candidate_m )
xtset id
bysort id : gen cum_simil=sum(similscore) // to take care of duplicates
by id, sort: gen has_perfect_match = 1 if d==0 & cum_simil>1
drop if has_perfect_match & d!=0

But the above set of commands delete even those ids which have non-zero value for 'd' i.e. the id 17782 has been deleted altogether, which I don't want to be deleted. There is some mistake in the set of codes and I have tried different combinations for a couple of hours already, but in vain. Any suggestion would be helpful.

Regards

***************************** Example dataset is as below

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(year_m year_u) str26 const_m long ac_m str26 const_u float(order_m order_u) str55 candidate_m float id byte dup double d float(cum_simil has_perfect_match) double similscore str55 candidate_u
1991 1993 "Garautha" 139 "Bangarmau"  3 10 "CHANDRA PAL SINGH" 17782 6 107.89971231641277 7 . 1 "CHANDRA PAL SINGH"
1991 1993 "Garautha" 139 "Chhibramau" 3 22 "CHANDRA PAL SINGH" 17782 6  43.09548223868419 2 . 1 "CHANDRA PAL SINGH"
1991 1993 "Garautha" 139 "Kanth"      3  4 "CHANDRA PAL SINGH" 17782 6  64.94923194380358 4 . 1 "CHANDRA PAL SINGH"
1991 1993 "Garautha" 139 "Kasganj"    3 26 "CHANDRA PAL SINGH" 17782 6  51.19461464844054 6 . 1 "CHANDRA PAL SINGH"
1991 1993 "Garautha" 139 "Kashipur"   3 34 "CHANDRA PAL SINGH" 17782 6            49.9999 5 . 1 "CHANDRA PAL SINGH"
1991 1993 "Garautha" 139 "Shahabad"   3  3 "CHANDRA PAL SINGH" 17782 6            49.9999 1 . 1 "CHANDRA PAL SINGH"
1991 1993 "Garautha" 139 "Siana"      3 16 "CHANDRA PAL SINGH" 17782 6 113.41941735080039 3 . 1 "CHANDRA PAL SINGH"
1991 1993 "Kanth"    205 "Bangarmau"  3 10 "CHANDRA PAL SINGH" 18156 6 167.82627376584603 5 . 1 "CHANDRA PAL SINGH"
1991 1993 "Kanth"    205 "Chhibramau" 3 22 "CHANDRA PAL SINGH" 18156 6 104.90555977626428 4 . 1 "CHANDRA PAL SINGH"
1991 1993 "Kanth"    205 "Kanth"      3  4 "CHANDRA PAL SINGH" 18156 6                  0 2 1 1 "CHANDRA PAL SINGH"
1991 1993 "Kanth"    205 "Kasganj"    3 26 "CHANDRA PAL SINGH" 18156 6 13.355054613263283 6 . 1 "CHANDRA PAL SINGH"
1991 1993 "Kanth"    205 "Kashipur"   3 34 "CHANDRA PAL SINGH" 18156 6            49.9999 7 . 1 "CHANDRA PAL SINGH"
1991 1993 "Kanth"    205 "Shahabad"   3  3 "CHANDRA PAL SINGH" 18156 6            49.9999 3 . 1 "CHANDRA PAL SINGH"
1991 1993 "Kanth"    205 "Siana"      3 16 "CHANDRA PAL SINGH" 18156 6 47.062286729680835 1 . 1 "CHANDRA PAL SINGH"
1991 1993 "Shahabad" 371 "Bangarmau"  2 10 "CHANDRA PAL SINGH" 19101 6            49.9999 7 . 1 "CHANDRA PAL SINGH"
1991 1993 "Shahabad" 371 "Chhibramau" 2 22 "CHANDRA PAL SINGH" 19101 6            49.9999 3 . 1 "CHANDRA PAL SINGH"
1991 1993 "Shahabad" 371 "Kanth"      2  4 "CHANDRA PAL SINGH" 19101 6            49.9999 5 . 1 "CHANDRA PAL SINGH"
1991 1993 "Shahabad" 371 "Kasganj"    2 26 "CHANDRA PAL SINGH" 19101 6            49.9999 6 . 1 "CHANDRA PAL SINGH"
1991 1993 "Shahabad" 371 "Kashipur"   2 34 "CHANDRA PAL SINGH" 19101 6            49.9999 2 . 1 "CHANDRA PAL SINGH"
1991 1993 "Shahabad" 371 "Shahabad"   2  3 "CHANDRA PAL SINGH" 19101 6                  0 4 1 1 "CHANDRA PAL SINGH"
1991 1993 "Shahabad" 371 "Siana"      2 16 "CHANDRA PAL SINGH" 19101 6            49.9999 1 . 1 "CHANDRA PAL SINGH"
end

Last edited by Vignesh Ponnusamy; 17 Nov 2018, 00:16.

Tags: None

William Lisowski

Join Date: Dec 2014
Posts: 10150

17 Nov 2018, 09:28

You have two problems in your code.

Code:

bysort id : gen cum_simil=sum(similscore)

creates a running total of similscore, so that one observation has a value of 1 even if the total value is more than 1. You need

Code:

bysort id : egen cum_simil=total(similscore)

Code:

by id, sort: gen has_perfect_match = 1 if d==0 & cum_simil>1

creates has_perfect_match with the value 1 or missing. But then

Code:

drop if has_perfect_match & d!=0

is the same as

Code:

drop if has_perfect_match!=0 & d!=0

since any nonzero value - including missing - is treated as true, so every observations with d!=0 is dropped. What you needed was

Code:

by id, sort: gen has_perfect_match = d==0 & cum_simil>1

Putting this all together, the following code, run on your data, demonstrates the differences between your cum_simil and has_perfect_match and my n_cum_simil and n_has_perfect_match.

Code:

// egen id = group( year_m const_m candidate_m )
xtset id
bysort id : egen n_cum_simil=total(similscore) // to take care of duplicates
by id, sort: gen n_has_perfect_match = d==0 & n_cum_simil>1
// drop if has_perfect_match & d!=0

list id d similscore cum_simil has_perfect_match n_cum_simil n_has_perfect_match ///
  if !(has_perfect_match & d!=0) | !(n_has_perfect_match), sepby(id) noobs

Code:

  +--------------------------------------------------------------------------+
  |    id           d   simils~e   cum_si~l   has_pe~h   n_cum_~l   n_has_~h |
  |--------------------------------------------------------------------------|
  | 17782   107.89971          1          7          .          7          0 |
  | 17782   43.095482          1          2          .          7          0 |
  | 17782   64.949232          1          4          .          7          0 |
  | 17782   51.194615          1          6          .          7          0 |
  | 17782     49.9999          1          5          .          7          0 |
  | 17782     49.9999          1          1          .          7          0 |
  | 17782   113.41942          1          3          .          7          0 |
  |--------------------------------------------------------------------------|
  | 18156   167.82627          1          5          .          7          0 |
  | 18156   104.90556          1          4          .          7          0 |
  | 18156           0          1          2          1          7          1 |
  | 18156   13.355055          1          6          .          7          0 |
  | 18156     49.9999          1          7          .          7          0 |
  | 18156     49.9999          1          3          .          7          0 |
  | 18156   47.062287          1          1          .          7          0 |
  |--------------------------------------------------------------------------|
  | 19101     49.9999          1          7          .          7          0 |
  | 19101     49.9999          1          3          .          7          0 |
  | 19101     49.9999          1          5          .          7          0 |
  | 19101     49.9999          1          6          .          7          0 |
  | 19101     49.9999          1          2          .          7          0 |
  | 19101           0          1          4          1          7          1 |
  | 19101     49.9999          1          1          .          7          0 |
  +--------------------------------------------------------------------------+

Comment

Vignesh Ponnusamy

Join Date: May 2018

Posts: 29
#3

17 Nov 2018, 11:16

Thanks for your reply, Lisowski.
I have tried this version of the code as well before. If I assign either zero or one to has_perfect_match and try to drop that has perfect match and also d!=0, then none of the observations will be deleted. But, if I assign the value of one or a missing value to has_perfect_match and try to delete has_perfect_match and d!=0, as in the code I have provided earlier, then all the observations that are non-zero are deleted, totally discarding the has_perfect_match condition.

I have also tried your code for the sample dataset and same as before, '0 observations are deleted' is the result I end up with. There is some simple mistake but can't figure it out yet.

Regards.
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

17 Nov 2018, 11:47

Perhaps instead

Code:

by id, sort: gen has_perfect_match = 1 if d==0 & cum_simil>1

should be

Code:

by id, sort: egen has_perfect_match = max(d==0 & cum_simil>1)

Code:

// egen id = group( year_m const_m candidate_m )
xtset id
bysort id : egen n_cum_simil=total(similscore) // to take care of duplicates
by id, sort: egen n_has_perfect_match = max(d==0 & n_cum_simil>1)

list id d similscore cum_simil has_perfect_match n_cum_simil n_has_perfect_match ///
  if !(has_perfect_match & d!=0) | !(n_has_perfect_match), sepby(id) noobs
  
drop if n_has_perfect_match & d!=0

list id d similscore cum_simil has_perfect_match n_cum_simil n_has_perfect_match ///
  , sepby(id) noobs

Code:

. list id d similscore cum_simil has_perfect_match n_cum_simil n_has_perfect_match ///
>   if !(has_perfect_match & d!=0) | !(n_has_perfect_match), sepby(id) noobs

  +--------------------------------------------------------------------------+
  |    id           d   simils~e   cum_si~l   has_pe~h   n_cum_~l   n_has_~h |
  |--------------------------------------------------------------------------|
  | 17782   107.89971          1          7          .          7          0 |
  | 17782   43.095482          1          2          .          7          0 |
  | 17782   64.949232          1          4          .          7          0 |
  | 17782   51.194615          1          6          .          7          0 |
  | 17782     49.9999          1          5          .          7          0 |
  | 17782     49.9999          1          1          .          7          0 |
  | 17782   113.41942          1          3          .          7          0 |
  |--------------------------------------------------------------------------|
  | 18156           0          1          2          1          7          1 |
  |--------------------------------------------------------------------------|
  | 19101           0          1          4          1          7          1 |
  +--------------------------------------------------------------------------+

.   
. drop if n_has_perfect_match & d!=0
(12 observations deleted)

. 
. list id d similscore cum_simil has_perfect_match n_cum_simil n_has_perfect_match ///
>   , sepby(id) noobs

  +--------------------------------------------------------------------------+
  |    id           d   simils~e   cum_si~l   has_pe~h   n_cum_~l   n_has_~h |
  |--------------------------------------------------------------------------|
  | 17782   107.89971          1          7          .          7          0 |
  | 17782   43.095482          1          2          .          7          0 |
  | 17782   64.949232          1          4          .          7          0 |
  | 17782   51.194615          1          6          .          7          0 |
  | 17782     49.9999          1          5          .          7          0 |
  | 17782     49.9999          1          1          .          7          0 |
  | 17782   113.41942          1          3          .          7          0 |
  |--------------------------------------------------------------------------|
  | 18156           0          1          2          1          7          1 |
  |--------------------------------------------------------------------------|
  | 19101           0          1          4          1          7          1 |
  +--------------------------------------------------------------------------+

Comment

Vignesh Ponnusamy

Join Date: May 2018

Posts: 29
#5

17 Nov 2018, 19:30

Thanks, Lisowski.
Works perfectly.

Regards
Vignesh.
Comment

Announcement

Deleting observations based on a condition

Comment

Comment

Comment

Comment