Dropping variables depending on multiple criteria

Helena Petersen

Join Date: Apr 2023

Posts: 13
#1

Dropping variables depending on multiple criteria

15 Apr 2023, 12:14

Hi!

I hope someone can help me out. After spending several days reading tips&tricks, as well as multiple Statalist posts, I still can't solve this. Here goes.

In this example I want to be able to drop line 3&4. To be specific: For this ID (770), line 1&2 are linked because they have the same category, but I want to be able to specify that "if type = D identify corresponding line with same category as D" In this case I want to keep both line 1&2 as well as 5&6, but I want to drop line 3&4.

I've generated a simple example of my data. Original dataset contains several thousand IDs and multiple variables. Different ID's will not have the same combinations of type and category as this exampe, so what i'm searching for is a way of solving my problem for all ID's not depending on defining each possible combination of type and category.

Hope I made myself clear, otherwise please ask me to explain (English is not my first language), also I hope I've posted the right way, this is my first post.

Code:

* Example generated by -dataex-. For more info, type help dataex clear input int ID long(type category) 770 1 1 770 2 1 770 2 2 770 2 2 770 2 3 770 1 3 end label values type type label def type 1 "D", modify label def type 2 "J", modify label values category category label def category 1 "A", modify label def category 2 "B", modify label def category 3 "C", modify

Last edited by Helena Petersen; 15 Apr 2023, 12:25.
Tags: data, drop, variable
Clyde Schechter

Join Date: Apr 2014

Posts: 30174
#2

15 Apr 2023, 12:32

Thank you for using -dataex- on your very first post!

I don't understand why you want to drop lines 3 and 4, but keep lines 5 and 6. After all, the category value for lines 5 and 6 is also different from that in lines 1 and 2. So what distinguishes lines 3 and 4 here?
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10289
#3

15 Apr 2023, 12:43

If you want to keep groups that include type=1 (D)

Code:

bys ID category: egen tag= max(type==1) keep if tag

See https://www.stata.com/support/faqs/d...ble-recording/

Last edited by Andrew Musau; 15 Apr 2023, 12:47.
Comment
Helena Petersen

Join Date: Apr 2023

Posts: 13
#4

15 Apr 2023, 13:07

Hi Clyde.

Thank u for your quick response!

The overall purpose is to be able to compare other variables (I've deleted these in this example to simplify, cause these variables contain very complex info) within the specific two lines where:
- Line 1 must include D, and line 2 must include same category as the category in the Line 1 (The line where type is D).

The reason I want to drop line 4 and 5 is that they don't fulfill the criterion that one of these contains D in type.

I have sorted data by CPR and then by category.

If you need be to give a more exact example, please let me know.

Last edited by Helena Petersen; 15 Apr 2023, 13:18.
Comment
Helena Petersen

Join Date: Apr 2023

Posts: 13
#5

15 Apr 2023, 13:30

Hi Andrew,

Thank you for your suggestion - unfortunately it does not work as intended when i run the code.
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10289
#6

15 Apr 2023, 13:42

Line 1 must include D, and line 2 must include same category as the category in the Line 1 (The line where type is D).

This implies a separate condition. Referring to the position of an observation is not very useful without some variable defining order. I assume that sorting by category preserves the order within an ID.

Code:

bys ID (category type): gen tag1= type[1]==1 & category[1]==category[2] bys ID category: egen tag2= max(type==1) keep if tag1 & tag2

Last edited by Andrew Musau; 15 Apr 2023, 13:45.
Comment
Helena Petersen

Join Date: Apr 2023

Posts: 13
#7

15 Apr 2023, 14:05

Andrew Musau - Sorry, it still does not work. For the ID I included in my example only zeros are generated in both tag1 & tag2.
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 10289

15 Apr 2023, 14:38

You must be running a different example.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int ID long(type category)
770 1 1
770 2 1
770 2 2
770 2 2
770 2 3
770 1 3
end
label values type type
label def type 1 "D", modify
label def type 2 "J", modify
label values category category
label def category 1 "A", modify
label def category 2 "B", modify
label def category 3 "C", modify

bys ID (category type): gen tag1= type[1]==1 & category[1]==category[2]
bys ID category: egen tag2= max(type==1)
keep if tag1 & tag2

Res.:

Code:

. l

     +-------------------------------------+
     |  ID   type   category   tag1   tag2 |
     |-------------------------------------|
  1. | 770      D          A      1      1 |
  2. | 770      J          A      1      1 |
  3. | 770      D          C      1      1 |
  4. | 770      J          C      1      1 |
     +-------------------------------------+

Comment

Helena Petersen

Join Date: Apr 2023

Posts: 13
#9

16 Apr 2023, 00:41

Andrew Musau Thank you, I can see that it works perfect in my simple example. I guess that I made it to simple.
I've now made a new dataex bases on a few IDs from the original dataset.

Code:

* Example generated by -dataex-. For more info, type help dataex clear input long ID str10 Date long(Category Type) 770 "25-06-2008" 14 4 770 "25-06-2008" 14 4 770 "25-06-2008" 77 4 770 "25-06-2008" 9 4 770 "25-06-2008" 77 3 770 "25-06-2008" 9 3 3119 "30-03-2017" 10 2 3119 "30-03-2017" 10 2 2179 "20-09-2017" 25 4 2179 "20-09-2017" 25 3 2791 "16-12-2012" 10 2 2791 "17-12-2012" 10 2 2791 "25-12-2012" 10 2 2791 "24-12-2012" 10 2 2791 "24-12-2012" 10 2 2791 "13-12-2012" 118 4 2791 "13-12-2012" 118 4 2791 "13-12-2012" 118 4 2791 "13-12-2012" 118 4 2791 "13-12-2012" 118 4 2791 "13-12-2012" 118 4 374 "13-12-2012" 118 3 374 "13-12-2012" 118 3 2571 "29-08-2016" 10 2 2571 "29-08-2016" 10 2 end label values ID CPR label values Category Prøvekategori label values Type Prøvetype label def Prøvetype 2 "B", modify label def Prøvetype 3 "D", modify label def Prøvetype 4 "J", modify

In these specific observations I would like to ONLY keep observations for a specific ID if the following criteria are met
1. The ID MUST have one or more Type=D, if an ID does not have any Type=D, I want to drop all observations for this ID
2. For an ID that has min. one D, there must be a Category which has the same value as the Category value in the observation where Type=D, if not I want to be able to drop these observations as well. The corresponding Category with the same value can have more than one observation (I want to keep all these)

For a specific ID this combination can occour 0, 1 or even several times.

In my dataset I also have string variables containing random text.

I think it's very complex, and I've really tried many possible solutions including tagging and "foreach" but can't get it right.
I really hope you can help me, it would be much appreciated.

Best.
Comment
Helena Petersen

Join Date: Apr 2023

Posts: 13
#10

16 Apr 2023, 00:48

Andrew Musau Thank you, I can see that it works perfect in my simple example. I guess that I made it to simple.
I've now made a new dataex bases on a few IDs from the original dataset.

Code:

* Example generated by -dataex-. For more info, type help dataex clear input long ID str10 Date long(Category Type) 770 "25-06-2008" 14 4 770 "25-06-2008" 14 4 770 "25-06-2008" 77 4 770 "25-06-2008" 9 4 770 "25-06-2008" 77 3 770 "25-06-2008" 9 3 3119 "30-03-2017" 10 2 3119 "30-03-2017" 10 2 2179 "20-09-2017" 25 4 2179 "20-09-2017" 25 3 2791 "16-12-2012" 10 2 2791 "17-12-2012" 10 2 2791 "25-12-2012" 10 2 2791 "24-12-2012" 10 2 2791 "24-12-2012" 10 2 2791 "13-12-2012" 118 4 2791 "13-12-2012" 118 4 2791 "13-12-2012" 118 4 2791 "13-12-2012" 118 4 2791 "13-12-2012" 118 4 2791 "13-12-2012" 118 4 374 "13-12-2012" 118 3 374 "13-12-2012" 118 3 2571 "29-08-2016" 10 2 2571 "29-08-2016" 10 2 end label values ID CPR label values Category Prøvekategori label values Type Prøvetype label def Prøvetype 2 "B", modify label def Prøvetype 3 "D", modify label def Prøvetype 4 "J", modify

In these specific observations I would like to ONLY keep observations for a specific ID if the following criteria are met
1. The ID MUST have one or more Type=D, if an ID does not have any Type=D, I want to drop all observations for this ID
2. For an ID that has min. one D, there must be a Category which has the same value as the Category value in the observation where Type=D, if not I want to be able to drop these observations as well. The corresponding Category with the same value can have more than one observation (I want to keep all these)

For a specific ID this combination can occour 0, 1 or even several times.

In my dataset I also have string variables containing random text.

I think it's very complex, and I've really tried many possible solutions including tagging and "foreach" but can't get it right.
I really hope you can help me, it would be much appreciated.

Best.
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 10289

#11

16 Apr 2023, 05:03

The second condition is not very clear.

2. For an ID that has min. one D, there must be a Category which has the same value as the Category value in the observation where Type=D, if not I want to be able to drop these observations as well. The corresponding Category with the same value can have more than one observation (I want to keep all these)

Here, I assume that all kept IDs must have at least one category where both type= "D" and _N>1, where _N is the observation count. If so, this condition encompasses the first. You also need to create a numerical date variable.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long ID str10 Date long(Category Type)
 770 "25-06-2008"  14 4
 770 "25-06-2008"  14 4
 770 "25-06-2008"  77 4
 770 "25-06-2008"   9 4
 770 "25-06-2008"  77 3
 770 "25-06-2008"   9 3
3119 "30-03-2017"  10 2
3119 "30-03-2017"  10 2
2179 "20-09-2017"  25 4
2179 "20-09-2017"  25 3
2791 "16-12-2012"  10 2
2791 "17-12-2012"  10 2
2791 "25-12-2012"  10 2
2791 "24-12-2012"  10 2
2791 "24-12-2012"  10 2
2791 "13-12-2012" 118 4
2791 "13-12-2012" 118 4
2791 "13-12-2012" 118 4
2791 "13-12-2012" 118 4
2791 "13-12-2012" 118 4
2791 "13-12-2012" 118 4
 374 "13-12-2012" 118 3
 374 "13-12-2012" 118 3
2571 "29-08-2016"  10 2
2571 "29-08-2016"  10 2
end
label values ID CPR
label values Category Prøvekategori
label values Type Prøvetype
label def Prøvetype 2 "B", modify
label def Prøvetype 3 "D", modify
label def Prøvetype 4 "J", modify

gen date= daily(Date, "DMY"), before(Date)
format date %td
gen long seq=_n
bys ID Category: gen size=_N
bys ID (seq): egen tag= max(size>1& Type==3)
keep if tag

Res.:

Code:

. sort seq

. l, sepby(ID)

     +--------------------------------------------------------------------+
     |   ID        date         Date   Category   Type   seq   size   tag |
     |--------------------------------------------------------------------|
  1. |  770   25jun2008   25-06-2008         14      J     1      2     1 |
  2. |  770   25jun2008   25-06-2008         14      J     2      2     1 |
  3. |  770   25jun2008   25-06-2008         77      J     3      2     1 |
  4. |  770   25jun2008   25-06-2008          9      J     4      2     1 |
  5. |  770   25jun2008   25-06-2008         77      D     5      2     1 |
  6. |  770   25jun2008   25-06-2008          9      D     6      2     1 |
     |--------------------------------------------------------------------|
  7. | 2179   20sep2017   20-09-2017         25      J     9      2     1 |
  8. | 2179   20sep2017   20-09-2017         25      D    10      2     1 |
     |--------------------------------------------------------------------|
  9. |  374   13dec2012   13-12-2012        118      D    22      2     1 |
 10. |  374   13dec2012   13-12-2012        118      D    23      2     1 |
     +--------------------------------------------------------------------+

Comment

Helena Petersen

Join Date: Apr 2023

Posts: 13
#12

16 Apr 2023, 06:32

Andrew Musau I really appreciate your help. I't seems like we're getting closer, but still: For ID 770 I would like to drop obs. nr. 1&2 (Because they are the same category, but none of them has type=D). I only want to keep 3&5 (because these are "a couple" where Category is the same and one of them has type=D) AND 4&6 (because these are "a couple" where Category is the same and one of them has type=D)
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10289
#13

16 Apr 2023, 06:48

Change

bys ID (seq): egen tag= max(size>1& Type==3)

to

Code:

bys ID Category (seq): egen tag= max(size>1& Type==3)
Comment
Amna Ghaffar

Join Date: Apr 2023

Posts: 4
#14

16 Apr 2023, 06:50

in STATA i have 500 data in this one variable is "date of last menstrual period" and second variable is "date of outcome or date of delivery" so find the difference in between the date of delivery and date of last menstrual period and generate new variable so what command will use in STATA
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10289
#15

16 Apr 2023, 06:55

Amna Ghaffar, welcome to Statalist. Please start a new thread and include a data example, e.g., by copying and pasting the result of

Code:

dataex in 1/30
Comment

Announcement