drop, keep or cond() - which is the safest to use

Soren Nielsen

Join Date: Oct 2018

Posts: 24
#1

drop, keep or cond() - which is the safest to use

28 Dec 2022, 04:45

In the context of a large (N:105 434 persons – of these were 22 374 treated eating disorders (ED) patients) observational epidemiological follow-up study using record linkage, we are looking into educational attainment, i.e. highest level of education successfully completed in the two groups – the patients N: ~ 22 000 and the matched controls, N: ~82 000. Cf. The dataex-file attached. Using several sources of information, we have identified and recoded the datapoints where ISCED level changes.
The present datafile consist of ~ 1.6 million observations, and we would like to keep only the relevant observations, i.e. – by person: the first and last observation, plus all observations where ISCED-level changes.
We are unsure as to how to do this – I might have a ‘short circuit between the headphones’, but normally I am able to think.
Is it ‘keep’, ‘drop’, or ‘cond()’ we should use, or should we use the ‘if’ command during the analyses?
Happy New Year to all Statalisters
Søren Nielsen

Below you will find a sample of a .do file as well as the dataex.dta which might be helpful.

changeISCED.do
//purpose - to register changes in ISCED-level//
//AUTHOR Søren Nielsen//
//date 27-12-2022//
use "C:\Users\Søren\Documents\Projekter\ANDKREG197 0-2010\Delprojekter\Uddannelse\work\dataex.dta"
bysort id (HF10year): generate chISCED = ISCEDlevel - ISCEDlevel[_n-1]
//purpose - to find and locate changes in ISCEDlevel//

recode chISCED (1/7 = 1), gen(chptISCED).
// purpose - to identify the points in time where ISCED changes from one level to another level//

//as only observations where ISCED change level are of interest - all other observations might be dropped, or omitted in the analyses. Observations of interest are the following: the first [t==1] and the last [t=_N] observation of each person, as well as the observations where ISCED level changes[chptISCED==1].//

//question - how is this accomplished in a 'safe' way//

//keep if chptISCED== . |chptISCED==1|t=_N is 'not allowed'//

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str1 id str2 stratum float(exp sex) str1 ISCED byte ISCEDlevel float(HF10year t HF10age chISCED chptISCED) "6" "19" 0 2 "1" 1 1997 1 17.67 . . "6" "19" 0 2 "1" 1 1998 2 18.67 0 0 "6" "19" 0 2 "1" 1 1999 3 19.67 0 0 "6" "19" 0 2 "2" 2 2000 4 20.67 1 1 "6" "19" 0 2 "2" 2 2001 5 21.67 0 0 "6" "19" 0 2 "2" 2 2002 6 22.67 0 0 "6" "19" 0 2 "5" 5 2003 7 23.67 3 1 "6" "19" 0 2 "5" 5 2004 8 24.67 0 0 "6" "19" 0 2 "5" 5 2005 9 25.67 0 0 "6" "19" 0 2 "7" 7 2006 10 26.67 2 1 "6" "19" 0 2 "7" 7 2007 11 27.67 0 0 "6" "19" 0 2 "7" 7 2008 12 28.67 0 0 "6" "19" 0 2 "7" 7 2009 13 29.67 0 0 "6" "19" 0 2 "7" 7 2010 14 30.67 0 0 "6" "19" 0 2 "7" 7 2011 15 31.67 0 0 "6" "19" 0 2 "7" 7 2012 16 32.67 0 0 "6" "19" 0 2 "7" 7 2013 17 33.67 0 0 "6" "19" 0 2 "7" 7 2014 18 34.67 0 0 "7" "99" 1 2 "1" 1 1995 1 16.5 . . "7" "99" 1 2 "1" 1 1996 2 17.5 0 0 "7" "99" 1 2 "1" 1 1997 3 18.5 0 0 "7" "99" 1 2 "1" 1 1998 4 19.5 0 0 "7" "99" 1 2 "2" 2 1999 5 20.5 1 1 "7" "99" 1 2 "2" 2 2000 6 21.5 0 0 "7" "99" 1 2 "2" 2 2001 7 22.5 0 0 "7" "99" 1 2 "2" 2 2002 8 23.5 0 0 "7" "99" 1 2 "2" 2 2003 9 24.5 0 0 "7" "99" 1 2 "2" 2 2004 10 25.5 0 0 "7" "99" 1 2 "2" 2 2005 11 26.5 0 0 "7" "99" 1 2 "5" 5 2007 12 28.5 3 1 "7" "99" 1 2 "5" 5 2008 13 29.5 0 0 "7" "99" 1 2 "5" 5 2009 14 30.5 0 0 "7" "99" 1 2 "5" 5 2010 15 31.5 0 0 "7" "99" 1 2 "5" 5 2011 16 32.5 0 0 "7" "99" 1 2 "5" 5 2012 17 33.5 0 0 "7" "99" 1 2 "5" 5 2013 18 34.5 0 0 "7" "99" 1 2 "5" 5 2014 19 35.5 0 0 end

------------------ copy up to and including the previous line ------------------

Listed 37 out of 37 observations

This looks a bit more messy than I expected - this should be two examples of data "id" - id of persons, "stratum" means stratum - quintuplet number, "exp" - exposure 0 control, 1 ed-patient, "sex" 1 male 2 female,

Sorry for any inconvenience .
I hope that somebody will be able to advice on how to proceed

Sincerely Søren Nielsen
Tags: None
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#2

28 Dec 2022, 05:23

I don't understand the problem.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35664
#3

28 Dec 2022, 05:46

I don't understand the question either. The risks with either drop or keep come from a familiar 2 x 2 table: that you may lose observations you want and that you don't lose observations you don't want. I can't think of rules that always give the right answer, but common complications are

1. Missing values.

2. Whether blocks of observations for individuals are to be kept or dropped in entirety.

Similarly I don't know rules for infallibly choosing between reducing to a subset that is relevant and selecting a subset with the if qualifier (which is not the if command). Common considerations are just speed, memory requirements, clarity and consistency,

cond() is a function that can be useful. Again, I am not clear how it is relevant here.
Comment
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#4

28 Dec 2022, 06:07

The most I could gather, after reading it again, is OP wants to know how

Code:

keep if chptISCED== . |chptISCED==1|t=_N

can be done "safely". And perhaps this does accomplish the goal (well evidently not given the syntax error!)..... But, as Nick first introduced me to, I think this is sort of an xy problem. For us to get the the bottom of this, we need a better idea of what it is you want in simple terms. In other words, what part of this dataex do you wanna keep? And, on what criteria do you wanna keep it? Once we can answer that, then we can discuss what the solution is, so I wanna be clear about the problem first.
Comment
Soren Nielsen

Join Date: Oct 2018

Posts: 24
#5

28 Dec 2022, 08:55

Dear Statalisters

Thanks for your efforts

I shall excuse my muddled question, and below I try once more with some clarifications.

I have the data relevant for comparisons between the exposed and unexposed groups, by sex at birth, ED-diagnosis and age.

My goal is to remove all observations where ISCED level do not change, i.e. chptISCED==0, but to keep the the first and the last observation for each person.

Observations of interest are the following: the first [t==1] and the last [t=_N] observation of each person, as well as the observations where ISCED level changes[chptISCED==1]
?question - how is this accomplished in a 'safe' way

I have tried the following - .keep if chptISCED== . |chptISCED==1|t=_N , and Stata replied 'not allowed'

So - what is wrong with the command that also Jared suggests?
Comment
Soren Nielsen

Join Date: Oct 2018

Posts: 24
#6

28 Dec 2022, 09:10

Some further comments - what I want to do is to 'compare distributions of ordinal data' - cf. Stephen P. Jenkins The Stata Journal (2020), 20, Number 3, pp. 505-531

At this time I am trying to prepare my dataset for these analyses, and in that process the paper by Nicholas J. Cox - 'Speaking Stata: How to move step by: step (The Stata Journal (2002), 2, Number 1, pp. 86-102 was very helpful, as might be gathered from the .do-file in my post above.
Comment

Hemanshu Kumar

Join Date: Mar 2015
Posts: 1381

28 Dec 2022, 09:51

Perhaps this:

Code:

sort id HF10year
by id: gen byte ISCED_change = (ISCEDlevel != ISCEDlevel[_n-1]) & _n>1
by id: gen byte wanted = ISCED_change | _n == 1 | _n == _N

keep if wanted

This obviates the need for chISCED, chptISCED and t. The resulting observations:

Code:

. list id ISCEDlevel HF10year ISCED_change, noobs sepby(id) ab(20)

  +-------------------------------------------+
  | id   ISCEDlevel   HF10year   ISCED_change |
  |-------------------------------------------|
  |  6            1       1997              0 |
  |  6            2       2000              1 |
  |  6            5       2003              1 |
  |  6            7       2006              1 |
  |  6            7       2014              0 |
  |-------------------------------------------|
  |  7            1       1995              0 |
  |  7            2       1999              1 |
  |  7            5       2007              1 |
  |  7            5       2014              0 |
  +-------------------------------------------+

Comment

Jared Greathouse

Join Date: Sep 2021
Posts: 2170

28 Dec 2022, 09:58

In the future, you need to show us your code in code blocks so we can follow you better. Is this the result you'd imagined, then?

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str1 id str2 stratum float(exp sex) str1 ISCED byte ISCEDlevel float(HF10year t HF10age chISCED chptISCED)
"6" "19" 0 2 "1" 1 1997  1 17.67 . .
"6" "19" 0 2 "1" 1 1998  2 18.67 0 0
"6" "19" 0 2 "1" 1 1999  3 19.67 0 0
"6" "19" 0 2 "2" 2 2000  4 20.67 1 1
"6" "19" 0 2 "2" 2 2001  5 21.67 0 0
"6" "19" 0 2 "2" 2 2002  6 22.67 0 0
"6" "19" 0 2 "5" 5 2003  7 23.67 3 1
"6" "19" 0 2 "5" 5 2004  8 24.67 0 0
"6" "19" 0 2 "5" 5 2005  9 25.67 0 0
"6" "19" 0 2 "7" 7 2006 10 26.67 2 1
"6" "19" 0 2 "7" 7 2007 11 27.67 0 0
"6" "19" 0 2 "7" 7 2008 12 28.67 0 0
"6" "19" 0 2 "7" 7 2009 13 29.67 0 0
"6" "19" 0 2 "7" 7 2010 14 30.67 0 0
"6" "19" 0 2 "7" 7 2011 15 31.67 0 0
"6" "19" 0 2 "7" 7 2012 16 32.67 0 0
"6" "19" 0 2 "7" 7 2013 17 33.67 0 0
"6" "19" 0 2 "7" 7 2014 18 34.67 0 0
"7" "99" 1 2 "1" 1 1995  1  16.5 . .
"7" "99" 1 2 "1" 1 1996  2  17.5 0 0
"7" "99" 1 2 "1" 1 1997  3  18.5 0 0
"7" "99" 1 2 "1" 1 1998  4  19.5 0 0
"7" "99" 1 2 "2" 2 1999  5  20.5 1 1
"7" "99" 1 2 "2" 2 2000  6  21.5 0 0
"7" "99" 1 2 "2" 2 2001  7  22.5 0 0
"7" "99" 1 2 "2" 2 2002  8  23.5 0 0
"7" "99" 1 2 "2" 2 2003  9  24.5 0 0
"7" "99" 1 2 "2" 2 2004 10  25.5 0 0
"7" "99" 1 2 "2" 2 2005 11  26.5 0 0
"7" "99" 1 2 "5" 5 2007 12  28.5 3 1
"7" "99" 1 2 "5" 5 2008 13  29.5 0 0
"7" "99" 1 2 "5" 5 2009 14  30.5 0 0
"7" "99" 1 2 "5" 5 2010 15  31.5 0 0
"7" "99" 1 2 "5" 5 2011 16  32.5 0 0
"7" "99" 1 2 "5" 5 2012 17  33.5 0 0
"7" "99" 1 2 "5" 5 2013 18  34.5 0 0
"7" "99" 1 2 "5" 5 2014 19  35.5 0 0
end

destring id, replace

xtset id t, g

qbys id: keep if inlist(t,_N,1) | chptISCED==1

cls

l

Resulting Output:

Code:

     +--------------------------------------------------------------------------------------------+
     | id   stratum   exp   sex   ISCED   ISCEDl~l   HF10year    t   HF10age   chISCED   chptIS~D |
     |--------------------------------------------------------------------------------------------|
  1. |  6        19     0     2       1          1       1997    1     17.67         .          . |
  2. |  6        19     0     2       2          2       2000    4     20.67         1          1 |
  3. |  6        19     0     2       5          5       2003    7     23.67         3          1 |
  4. |  6        19     0     2       7          7       2006   10     26.67         2          1 |
  5. |  6        19     0     2       7          7       2014   18     34.67         0          0 |
     |--------------------------------------------------------------------------------------------|
  6. |  7        99     1     2       1          1       1995    1      16.5         .          . |
  7. |  7        99     1     2       2          2       1999    5      20.5         1          1 |
  8. |  7        99     1     2       5          5       2007   12      28.5         3          1 |
  9. |  7        99     1     2       5          5       2014   19      35.5         0          0 |
     +--------------------------------------------------------------------------------------------+

Comment

Soren Nielsen

Join Date: Oct 2018

Posts: 24
#9

28 Dec 2022, 10:02

Thanks a lot to Hemanshu and Jared

I have now two avenues to try out

Very helpful indeed
Comment
Hemanshu Kumar

Join Date: Mar 2015

Posts: 1381
#10

28 Dec 2022, 10:04

Alternatively, the code in #4 can be used, but only after modifying it to:

Code:

bysort id (HF10year): keep if chptISCED== . | chptISCED == 1 | t == _N
Comment

Announcement

drop, keep or cond() - which is the safest to use

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment