Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Dropping variables depending on multiple criteria

    Hi!

    I hope someone can help me out. After spending several days reading tips&tricks, as well as multiple Statalist posts, I still can't solve this. Here goes.


    In this example I want to be able to drop line 3&4. To be specific: For this ID (770), line 1&2 are linked because they have the same category, but I want to be able to specify that "if type = D identify corresponding line with same category as D" In this case I want to keep both line 1&2 as well as 5&6, but I want to drop line 3&4.

    I've generated a simple example of my data. Original dataset contains several thousand IDs and multiple variables. Different ID's will not have the same combinations of type and category as this exampe, so what i'm searching for is a way of solving my problem for all ID's not depending on defining each possible combination of type and category.

    Hope I made myself clear, otherwise please ask me to explain (English is not my first language), also I hope I've posted the right way, this is my first post.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input int ID long(type category)
    770 1 1
    770 2 1
    770 2 2
    770 2 2
    770 2 3
    770 1 3
    end
    label values type type
    label def type 1 "D", modify
    label def type 2 "J", modify
    label values category category
    label def category 1 "A", modify
    label def category 2 "B", modify
    label def category 3 "C", modify
    Last edited by Helena Petersen; 15 Apr 2023, 12:25.

  • #2
    Thank you for using -dataex- on your very first post!

    I don't understand why you want to drop lines 3 and 4, but keep lines 5 and 6. After all, the category value for lines 5 and 6 is also different from that in lines 1 and 2. So what distinguishes lines 3 and 4 here?

    Comment


    • #3
      If you want to keep groups that include type=1 (D)

      Code:
      bys ID category: egen tag= max(type==1)
      keep if tag
      See https://www.stata.com/support/faqs/d...ble-recording/
      Last edited by Andrew Musau; 15 Apr 2023, 12:47.

      Comment


      • #4
        Hi Clyde.

        Thank u for your quick response!

        The overall purpose is to be able to compare other variables (I've deleted these in this example to simplify, cause these variables contain very complex info) within the specific two lines where:
        - Line 1 must include D, and line 2 must include same category as the category in the Line 1 (The line where type is D).

        The reason I want to drop line 4 and 5 is that they don't fulfill the criterion that one of these contains D in type.

        I have sorted data by CPR and then by category.

        If you need be to give a more exact example, please let me know.
        Last edited by Helena Petersen; 15 Apr 2023, 13:18.

        Comment


        • #5
          Hi Andrew,

          Thank you for your suggestion - unfortunately it does not work as intended when i run the code.

          Comment


          • #6
            Line 1 must include D, and line 2 must include same category as the category in the Line 1 (The line where type is D).
            This implies a separate condition. Referring to the position of an observation is not very useful without some variable defining order. I assume that sorting by category preserves the order within an ID.

            Code:
            bys ID (category type): gen tag1= type[1]==1 & category[1]==category[2]
            bys ID category: egen tag2= max(type==1)
            keep if tag1 & tag2
            Last edited by Andrew Musau; 15 Apr 2023, 13:45.

            Comment


            • #7
              Andrew Musau - Sorry, it still does not work. For the ID I included in my example only zeros are generated in both tag1 & tag2.

              Comment


              • #8
                You must be running a different example.

                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input int ID long(type category)
                770 1 1
                770 2 1
                770 2 2
                770 2 2
                770 2 3
                770 1 3
                end
                label values type type
                label def type 1 "D", modify
                label def type 2 "J", modify
                label values category category
                label def category 1 "A", modify
                label def category 2 "B", modify
                label def category 3 "C", modify
                
                bys ID (category type): gen tag1= type[1]==1 & category[1]==category[2]
                bys ID category: egen tag2= max(type==1)
                keep if tag1 & tag2
                Res.:

                Code:
                . l
                
                     +-------------------------------------+
                     |  ID   type   category   tag1   tag2 |
                     |-------------------------------------|
                  1. | 770      D          A      1      1 |
                  2. | 770      J          A      1      1 |
                  3. | 770      D          C      1      1 |
                  4. | 770      J          C      1      1 |
                     +-------------------------------------+

                Comment


                • #9
                  Andrew Musau Thank you, I can see that it works perfect in my simple example. I guess that I made it to simple.
                  I've now made a new dataex bases on a few IDs from the original dataset.

                  Code:
                  * Example generated by -dataex-. For more info, type help dataex
                  clear
                  input long ID str10 Date long(Category Type)
                   770 "25-06-2008"  14 4
                   770 "25-06-2008"  14 4
                   770 "25-06-2008"  77 4
                   770 "25-06-2008"   9 4
                   770 "25-06-2008"  77 3
                   770 "25-06-2008"   9 3
                  3119 "30-03-2017"  10 2
                  3119 "30-03-2017"  10 2
                  2179 "20-09-2017"  25 4
                  2179 "20-09-2017"  25 3
                  2791 "16-12-2012"  10 2
                  2791 "17-12-2012"  10 2
                  2791 "25-12-2012"  10 2
                  2791 "24-12-2012"  10 2
                  2791 "24-12-2012"  10 2
                  2791 "13-12-2012" 118 4
                  2791 "13-12-2012" 118 4
                  2791 "13-12-2012" 118 4
                  2791 "13-12-2012" 118 4
                  2791 "13-12-2012" 118 4
                  2791 "13-12-2012" 118 4
                   374 "13-12-2012" 118 3
                   374 "13-12-2012" 118 3
                  2571 "29-08-2016"  10 2
                  2571 "29-08-2016"  10 2
                  end
                  label values ID CPR
                  label values Category Prøvekategori
                  label values Type Prøvetype
                  label def Prøvetype 2 "B", modify
                  label def Prøvetype 3 "D", modify
                  label def Prøvetype 4 "J", modify

                  In these specific observations I would like to ONLY keep observations for a specific ID if the following criteria are met
                  1. The ID MUST have one or more Type=D, if an ID does not have any Type=D, I want to drop all observations for this ID
                  2. For an ID that has min. one D, there must be a Category which has the same value as the Category value in the observation where Type=D, if not I want to be able to drop these observations as well. The corresponding Category with the same value can have more than one observation (I want to keep all these)

                  For a specific ID this combination can occour 0, 1 or even several times.


                  In my dataset I also have string variables containing random text.

                  I think it's very complex, and I've really tried many possible solutions including tagging and "foreach" but can't get it right.
                  I really hope you can help me, it would be much appreciated.

                  Best.

                  Comment


                  • #10
                    Andrew Musau Thank you, I can see that it works perfect in my simple example. I guess that I made it to simple.
                    I've now made a new dataex bases on a few IDs from the original dataset.

                    Code:
                    * Example generated by -dataex-. For more info, type help dataex
                    clear
                    input long ID str10 Date long(Category Type)
                     770 "25-06-2008"  14 4
                     770 "25-06-2008"  14 4
                     770 "25-06-2008"  77 4
                     770 "25-06-2008"   9 4
                     770 "25-06-2008"  77 3
                     770 "25-06-2008"   9 3
                    3119 "30-03-2017"  10 2
                    3119 "30-03-2017"  10 2
                    2179 "20-09-2017"  25 4
                    2179 "20-09-2017"  25 3
                    2791 "16-12-2012"  10 2
                    2791 "17-12-2012"  10 2
                    2791 "25-12-2012"  10 2
                    2791 "24-12-2012"  10 2
                    2791 "24-12-2012"  10 2
                    2791 "13-12-2012" 118 4
                    2791 "13-12-2012" 118 4
                    2791 "13-12-2012" 118 4
                    2791 "13-12-2012" 118 4
                    2791 "13-12-2012" 118 4
                    2791 "13-12-2012" 118 4
                     374 "13-12-2012" 118 3
                     374 "13-12-2012" 118 3
                    2571 "29-08-2016"  10 2
                    2571 "29-08-2016"  10 2
                    end
                    label values ID CPR
                    label values Category Prøvekategori
                    label values Type Prøvetype
                    label def Prøvetype 2 "B", modify
                    label def Prøvetype 3 "D", modify
                    label def Prøvetype 4 "J", modify

                    In these specific observations I would like to ONLY keep observations for a specific ID if the following criteria are met
                    1. The ID MUST have one or more Type=D, if an ID does not have any Type=D, I want to drop all observations for this ID
                    2. For an ID that has min. one D, there must be a Category which has the same value as the Category value in the observation where Type=D, if not I want to be able to drop these observations as well. The corresponding Category with the same value can have more than one observation (I want to keep all these)

                    For a specific ID this combination can occour 0, 1 or even several times.


                    In my dataset I also have string variables containing random text.

                    I think it's very complex, and I've really tried many possible solutions including tagging and "foreach" but can't get it right.
                    I really hope you can help me, it would be much appreciated.

                    Best.

                    Comment


                    • #11
                      The second condition is not very clear.

                      2. For an ID that has min. one D, there must be a Category which has the same value as the Category value in the observation where Type=D, if not I want to be able to drop these observations as well. The corresponding Category with the same value can have more than one observation (I want to keep all these)

                      Here, I assume that all kept IDs must have at least one category where both type= "D" and _N>1, where _N is the observation count. If so, this condition encompasses the first. You also need to create a numerical date variable.

                      Code:
                      * Example generated by -dataex-. To install: ssc install dataex
                      clear
                      input long ID str10 Date long(Category Type)
                       770 "25-06-2008"  14 4
                       770 "25-06-2008"  14 4
                       770 "25-06-2008"  77 4
                       770 "25-06-2008"   9 4
                       770 "25-06-2008"  77 3
                       770 "25-06-2008"   9 3
                      3119 "30-03-2017"  10 2
                      3119 "30-03-2017"  10 2
                      2179 "20-09-2017"  25 4
                      2179 "20-09-2017"  25 3
                      2791 "16-12-2012"  10 2
                      2791 "17-12-2012"  10 2
                      2791 "25-12-2012"  10 2
                      2791 "24-12-2012"  10 2
                      2791 "24-12-2012"  10 2
                      2791 "13-12-2012" 118 4
                      2791 "13-12-2012" 118 4
                      2791 "13-12-2012" 118 4
                      2791 "13-12-2012" 118 4
                      2791 "13-12-2012" 118 4
                      2791 "13-12-2012" 118 4
                       374 "13-12-2012" 118 3
                       374 "13-12-2012" 118 3
                      2571 "29-08-2016"  10 2
                      2571 "29-08-2016"  10 2
                      end
                      label values ID CPR
                      label values Category Prøvekategori
                      label values Type Prøvetype
                      label def Prøvetype 2 "B", modify
                      label def Prøvetype 3 "D", modify
                      label def Prøvetype 4 "J", modify
                      
                      gen date= daily(Date, "DMY"), before(Date)
                      format date %td
                      gen long seq=_n
                      bys ID Category: gen size=_N
                      bys ID (seq): egen tag= max(size>1& Type==3)
                      keep if tag
                      Res.:

                      Code:
                      . sort seq
                      
                      . l, sepby(ID)
                      
                           +--------------------------------------------------------------------+
                           |   ID        date         Date   Category   Type   seq   size   tag |
                           |--------------------------------------------------------------------|
                        1. |  770   25jun2008   25-06-2008         14      J     1      2     1 |
                        2. |  770   25jun2008   25-06-2008         14      J     2      2     1 |
                        3. |  770   25jun2008   25-06-2008         77      J     3      2     1 |
                        4. |  770   25jun2008   25-06-2008          9      J     4      2     1 |
                        5. |  770   25jun2008   25-06-2008         77      D     5      2     1 |
                        6. |  770   25jun2008   25-06-2008          9      D     6      2     1 |
                           |--------------------------------------------------------------------|
                        7. | 2179   20sep2017   20-09-2017         25      J     9      2     1 |
                        8. | 2179   20sep2017   20-09-2017         25      D    10      2     1 |
                           |--------------------------------------------------------------------|
                        9. |  374   13dec2012   13-12-2012        118      D    22      2     1 |
                       10. |  374   13dec2012   13-12-2012        118      D    23      2     1 |
                           +--------------------------------------------------------------------+

                      Comment


                      • #12
                        Andrew Musau I really appreciate your help. I't seems like we're getting closer, but still: For ID 770 I would like to drop obs. nr. 1&2 (Because they are the same category, but none of them has type=D). I only want to keep 3&5 (because these are "a couple" where Category is the same and one of them has type=D) AND 4&6 (because these are "a couple" where Category is the same and one of them has type=D)

                        Comment


                        • #13

                          Change

                          bys ID (seq): egen tag= max(size>1& Type==3)
                          to

                          Code:
                          bys ID Category (seq): egen tag= max(size>1& Type==3)

                          Comment


                          • #14
                            in STATA i have 500 data in this one variable is "date of last menstrual period" and second variable is "date of outcome or date of delivery" so find the difference in between the date of delivery and date of last menstrual period and generate new variable so what command will use in STATA

                            Comment


                            • #15
                              Amna Ghaffar, welcome to Statalist. Please start a new thread and include a data example, e.g., by copying and pasting the result of

                              Code:
                              dataex in 1/30

                              Comment

                              Working...
                              X