Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • drop, keep or cond() - which is the safest to use

    In the context of a large (N:105 434 persons – of these were 22 374 treated eating disorders (ED) patients) observational epidemiological follow-up study using record linkage, we are looking into educational attainment, i.e. highest level of education successfully completed in the two groups – the patients N: ~ 22 000 and the matched controls, N: ~82 000. Cf. The dataex-file attached. Using several sources of information, we have identified and recoded the datapoints where ISCED level changes.
    The present datafile consist of ~ 1.6 million observations, and we would like to keep only the relevant observations, i.e. – by person: the first and last observation, plus all observations where ISCED-level changes.
    We are unsure as to how to do this – I might have a ‘short circuit between the headphones’, but normally I am able to think.
    Is it ‘keep’, ‘drop’, or ‘cond()’ we should use, or should we use the ‘if’ command during the analyses?
    Happy New Year to all Statalisters
    Søren Nielsen

    Below you will find a sample of a .do file as well as the dataex.dta which might be helpful.

    changeISCED.do
    //purpose - to register changes in ISCED-level//
    //AUTHOR Søren Nielsen//
    //date 27-12-2022//
    use "C:\Users\Søren\Documents\Projekter\ANDKREG197 0-2010\Delprojekter\Uddannelse\work\dataex.dta"
    bysort id (HF10year): generate chISCED = ISCEDlevel - ISCEDlevel[_n-1]
    //purpose - to find and locate changes in ISCEDlevel//

    recode chISCED (1/7 = 1), gen(chptISCED).
    // purpose - to identify the points in time where ISCED changes from one level to another level//

    //as only observations where ISCED change level are of interest - all other observations might be dropped, or omitted in the analyses. Observations of interest are the following: the first [t==1] and the last [t=_N] observation of each person, as well as the observations where ISCED level changes[chptISCED==1].//

    //question - how is this accomplished in a 'safe' way//

    //keep if chptISCED== . |chptISCED==1|t=_N is 'not allowed'//

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str1 id str2 stratum float(exp sex) str1 ISCED byte ISCEDlevel float(HF10year t HF10age chISCED chptISCED)
    "6" "19" 0 2 "1" 1 1997  1 17.67 . .
    "6" "19" 0 2 "1" 1 1998  2 18.67 0 0
    "6" "19" 0 2 "1" 1 1999  3 19.67 0 0
    "6" "19" 0 2 "2" 2 2000  4 20.67 1 1
    "6" "19" 0 2 "2" 2 2001  5 21.67 0 0
    "6" "19" 0 2 "2" 2 2002  6 22.67 0 0
    "6" "19" 0 2 "5" 5 2003  7 23.67 3 1
    "6" "19" 0 2 "5" 5 2004  8 24.67 0 0
    "6" "19" 0 2 "5" 5 2005  9 25.67 0 0
    "6" "19" 0 2 "7" 7 2006 10 26.67 2 1
    "6" "19" 0 2 "7" 7 2007 11 27.67 0 0
    "6" "19" 0 2 "7" 7 2008 12 28.67 0 0
    "6" "19" 0 2 "7" 7 2009 13 29.67 0 0
    "6" "19" 0 2 "7" 7 2010 14 30.67 0 0
    "6" "19" 0 2 "7" 7 2011 15 31.67 0 0
    "6" "19" 0 2 "7" 7 2012 16 32.67 0 0
    "6" "19" 0 2 "7" 7 2013 17 33.67 0 0
    "6" "19" 0 2 "7" 7 2014 18 34.67 0 0
    "7" "99" 1 2 "1" 1 1995  1  16.5 . .
    "7" "99" 1 2 "1" 1 1996  2  17.5 0 0
    "7" "99" 1 2 "1" 1 1997  3  18.5 0 0
    "7" "99" 1 2 "1" 1 1998  4  19.5 0 0
    "7" "99" 1 2 "2" 2 1999  5  20.5 1 1
    "7" "99" 1 2 "2" 2 2000  6  21.5 0 0
    "7" "99" 1 2 "2" 2 2001  7  22.5 0 0
    "7" "99" 1 2 "2" 2 2002  8  23.5 0 0
    "7" "99" 1 2 "2" 2 2003  9  24.5 0 0
    "7" "99" 1 2 "2" 2 2004 10  25.5 0 0
    "7" "99" 1 2 "2" 2 2005 11  26.5 0 0
    "7" "99" 1 2 "5" 5 2007 12  28.5 3 1
    "7" "99" 1 2 "5" 5 2008 13  29.5 0 0
    "7" "99" 1 2 "5" 5 2009 14  30.5 0 0
    "7" "99" 1 2 "5" 5 2010 15  31.5 0 0
    "7" "99" 1 2 "5" 5 2011 16  32.5 0 0
    "7" "99" 1 2 "5" 5 2012 17  33.5 0 0
    "7" "99" 1 2 "5" 5 2013 18  34.5 0 0
    "7" "99" 1 2 "5" 5 2014 19  35.5 0 0
    end
    ------------------ copy up to and including the previous line ------------------

    Listed 37 out of 37 observations

    This looks a bit more messy than I expected - this should be two examples of data "id" - id of persons, "stratum" means stratum - quintuplet number, "exp" - exposure 0 control, 1 ed-patient, "sex" 1 male 2 female,

    Sorry for any inconvenience .
    I hope that somebody will be able to advice on how to proceed

    Sincerely Søren Nielsen

  • #2
    I don't understand the problem.

    Comment


    • #3
      I don't understand the question either. The risks with either drop or keep come from a familiar 2 x 2 table: that you may lose observations you want and that you don't lose observations you don't want. I can't think of rules that always give the right answer, but common complications are

      1. Missing values.

      2. Whether blocks of observations for individuals are to be kept or dropped in entirety.

      Similarly I don't know rules for infallibly choosing between reducing to a subset that is relevant and selecting a subset with the if qualifier (which is not the if command). Common considerations are just speed, memory requirements, clarity and consistency,

      cond() is a function that can be useful. Again, I am not clear how it is relevant here.

      Comment


      • #4
        The most I could gather, after reading it again, is OP wants to know how
        Code:
        keep if chptISCED== . |chptISCED==1|t=_N
        can be done "safely". And perhaps this does accomplish the goal (well evidently not given the syntax error!)..... But, as Nick first introduced me to, I think this is sort of an xy problem. For us to get the the bottom of this, we need a better idea of what it is you want in simple terms. In other words, what part of this dataex do you wanna keep? And, on what criteria do you wanna keep it? Once we can answer that, then we can discuss what the solution is, so I wanna be clear about the problem first.

        Comment


        • #5
          Dear Statalisters

          Thanks for your efforts

          I shall excuse my muddled question, and below I try once more with some clarifications.

          I have the data relevant for comparisons between the exposed and unexposed groups, by sex at birth, ED-diagnosis and age.

          My goal is to remove all observations where ISCED level do not change, i.e. chptISCED==0, but to keep the the first and the last observation for each person.

          Observations of interest are the following: the first [t==1] and the last [t=_N] observation of each person, as well as the observations where ISCED level changes[chptISCED==1]
          ?question - how is this accomplished in a 'safe' way

          I have tried the following - .keep if chptISCED== . |chptISCED==1|t=_N , and Stata replied 'not allowed'

          So - what is wrong with the command that also Jared suggests?

          Comment


          • #6
            Some further comments - what I want to do is to 'compare distributions of ordinal data' - cf. Stephen P. Jenkins The Stata Journal (2020), 20, Number 3, pp. 505-531

            At this time I am trying to prepare my dataset for these analyses, and in that process the paper by Nicholas J. Cox - 'Speaking Stata: How to move step by: step (The Stata Journal (2002), 2, Number 1, pp. 86-102 was very helpful, as might be gathered from the .do-file in my post above.

            Comment


            • #7
              Perhaps this:
              Code:
              sort id HF10year
              by id: gen byte ISCED_change = (ISCEDlevel != ISCEDlevel[_n-1]) & _n>1
              by id: gen byte wanted = ISCED_change | _n == 1 | _n == _N
              
              keep if wanted
              This obviates the need for chISCED, chptISCED and t. The resulting observations:

              Code:
              . list id ISCEDlevel HF10year ISCED_change, noobs sepby(id) ab(20)
              
                +-------------------------------------------+
                | id   ISCEDlevel   HF10year   ISCED_change |
                |-------------------------------------------|
                |  6            1       1997              0 |
                |  6            2       2000              1 |
                |  6            5       2003              1 |
                |  6            7       2006              1 |
                |  6            7       2014              0 |
                |-------------------------------------------|
                |  7            1       1995              0 |
                |  7            2       1999              1 |
                |  7            5       2007              1 |
                |  7            5       2014              0 |
                +-------------------------------------------+

              Comment


              • #8
                In the future, you need to show us your code in code blocks so we can follow you better. Is this the result you'd imagined, then?
                Code:
                * Example generated by -dataex-. For more info, type help dataex
                clear
                input str1 id str2 stratum float(exp sex) str1 ISCED byte ISCEDlevel float(HF10year t HF10age chISCED chptISCED)
                "6" "19" 0 2 "1" 1 1997  1 17.67 . .
                "6" "19" 0 2 "1" 1 1998  2 18.67 0 0
                "6" "19" 0 2 "1" 1 1999  3 19.67 0 0
                "6" "19" 0 2 "2" 2 2000  4 20.67 1 1
                "6" "19" 0 2 "2" 2 2001  5 21.67 0 0
                "6" "19" 0 2 "2" 2 2002  6 22.67 0 0
                "6" "19" 0 2 "5" 5 2003  7 23.67 3 1
                "6" "19" 0 2 "5" 5 2004  8 24.67 0 0
                "6" "19" 0 2 "5" 5 2005  9 25.67 0 0
                "6" "19" 0 2 "7" 7 2006 10 26.67 2 1
                "6" "19" 0 2 "7" 7 2007 11 27.67 0 0
                "6" "19" 0 2 "7" 7 2008 12 28.67 0 0
                "6" "19" 0 2 "7" 7 2009 13 29.67 0 0
                "6" "19" 0 2 "7" 7 2010 14 30.67 0 0
                "6" "19" 0 2 "7" 7 2011 15 31.67 0 0
                "6" "19" 0 2 "7" 7 2012 16 32.67 0 0
                "6" "19" 0 2 "7" 7 2013 17 33.67 0 0
                "6" "19" 0 2 "7" 7 2014 18 34.67 0 0
                "7" "99" 1 2 "1" 1 1995  1  16.5 . .
                "7" "99" 1 2 "1" 1 1996  2  17.5 0 0
                "7" "99" 1 2 "1" 1 1997  3  18.5 0 0
                "7" "99" 1 2 "1" 1 1998  4  19.5 0 0
                "7" "99" 1 2 "2" 2 1999  5  20.5 1 1
                "7" "99" 1 2 "2" 2 2000  6  21.5 0 0
                "7" "99" 1 2 "2" 2 2001  7  22.5 0 0
                "7" "99" 1 2 "2" 2 2002  8  23.5 0 0
                "7" "99" 1 2 "2" 2 2003  9  24.5 0 0
                "7" "99" 1 2 "2" 2 2004 10  25.5 0 0
                "7" "99" 1 2 "2" 2 2005 11  26.5 0 0
                "7" "99" 1 2 "5" 5 2007 12  28.5 3 1
                "7" "99" 1 2 "5" 5 2008 13  29.5 0 0
                "7" "99" 1 2 "5" 5 2009 14  30.5 0 0
                "7" "99" 1 2 "5" 5 2010 15  31.5 0 0
                "7" "99" 1 2 "5" 5 2011 16  32.5 0 0
                "7" "99" 1 2 "5" 5 2012 17  33.5 0 0
                "7" "99" 1 2 "5" 5 2013 18  34.5 0 0
                "7" "99" 1 2 "5" 5 2014 19  35.5 0 0
                end
                
                destring id, replace
                
                xtset id t, g
                
                qbys id: keep if inlist(t,_N,1) | chptISCED==1
                
                cls
                
                l
                Resulting Output:
                Code:
                     +--------------------------------------------------------------------------------------------+
                     | id   stratum   exp   sex   ISCED   ISCEDl~l   HF10year    t   HF10age   chISCED   chptIS~D |
                     |--------------------------------------------------------------------------------------------|
                  1. |  6        19     0     2       1          1       1997    1     17.67         .          . |
                  2. |  6        19     0     2       2          2       2000    4     20.67         1          1 |
                  3. |  6        19     0     2       5          5       2003    7     23.67         3          1 |
                  4. |  6        19     0     2       7          7       2006   10     26.67         2          1 |
                  5. |  6        19     0     2       7          7       2014   18     34.67         0          0 |
                     |--------------------------------------------------------------------------------------------|
                  6. |  7        99     1     2       1          1       1995    1      16.5         .          . |
                  7. |  7        99     1     2       2          2       1999    5      20.5         1          1 |
                  8. |  7        99     1     2       5          5       2007   12      28.5         3          1 |
                  9. |  7        99     1     2       5          5       2014   19      35.5         0          0 |
                     +--------------------------------------------------------------------------------------------+

                Comment


                • #9
                  Thanks a lot to Hemanshu and Jared

                  I have now two avenues to try out

                  Very helpful indeed

                  Comment


                  • #10
                    Alternatively, the code in #4 can be used, but only after modifying it to:

                    Code:
                    bysort id (HF10year): keep if chptISCED== . | chptISCED == 1 | t == _N

                    Comment

                    Working...
                    X