Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Delete row duplicates and move code forward (delete missing values)

    Hi all!
    I have a programming problem that I can't get my head around and hope that you'll be able to help me. Right now my dataset looks like:

    ID CODE1 CODE2 CODE3 CODE4 CODE5 CODE6
    1 X001 X001 X003
    2 X001 X002 X001
    3 X004 X001 X004
    4 T004 X005 T004
    5 R001 T004 R001 T004 T004 R002
    but I would like to remove all row duplicates and compress the data set so that missing values are placed towards the end:

    ID CODE1 CODE2 CODE3 CODE4 CODE5 CODE6
    1 X001 X003
    2 X001 X002
    3 X004 X001
    4 T004 X005
    5 R001 T004 R002
    Is there a smart way to do this?

    Thank you for taking your time!

    Best regards, Viktor

  • #2
    Please see https://www.statalist.org/forums/help#stata for how to present data examples.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte id str4(code1 code2 code3 code4 code5 code6)
    1 "X001" ""     "X001" ""     "X003" ""    
    2 ""     "X001" "X002" "X001" ""     ""    
    3 "X004" ""     "X001" ""     ""     "X004"
    4 ""     ""     ""     "T004" "X005" "T004"
    5 "R001" "T004" "R001" "T004" "T004" "R002"
    end
    
    reshape long code, i(id) j(which)
    drop if missing(code)
    bysort id code (which) : keep if _n == 1 
    bysort id (which) : replace which = _n 
    reshape wide code, i(id) j(which)
    
    list
    It is possible that the wide layout you ask for isn't ideal for later analysis, but the abstracted nature of the example makes it impossible to be sure on that.

    Comment


    • #3
      Hi again all,

      I used the code above which worked perfectly for all rows with several values. However, rows with only one value got unintentionally deleted completley - causing a lot of missing values in the dataset. Is there any way around this?

      Thanks in advance!
      Viktor

      Comment


      • #4
        I don't see why that would happen. I added an example observation with only one non-missing value and can't reproduce any problem.

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input byte id str4(code1 code2 code3 code4 code5 code6)
        1 "X001" ""     "X001" ""     "X003" ""    
        2 ""     "X001" "X002" "X001" ""     ""    
        3 "X004" ""     "X001" ""     ""     "X004"
        4 ""     ""     ""     "T004" "X005" "T004"
        5 "R001" "T004" "R001" "T004" "T004" "R002"
        6 "Z007" "" "" "" "" "" ""
        end
        
        reshape long code, i(id) j(which)
        drop if missing(code)
        bysort id code (which) : keep if _n == 1 
        bysort id (which) : replace which = _n 
        reshape wide code, i(id) j(which)
        
        list
        
             +----------------------------+
             | id   code1   code2   code3 |
             |----------------------------|
          1. |  1    X001    X003         |
          2. |  2    X001    X002         |
          3. |  3    X004    X001         |
          4. |  4    T004    X005         |
          5. |  5    R001    T004    R002 |
             |----------------------------|
          6. |  6    Z007                 |
             +----------------------------+

        Comment


        • #5
          Thank you very much for your quick reply Nick!

          You were right, the code works. The reason for missing values were that the "code 1" in the dataset had a different formatting, thus not included in the analysis. After correcting this, it works!

          Thanks again for your help!
          Viktor

          Comment


          • #6
            Thanks for confirmation.

            Comment


            • #7
              Hi again everyone,

              With the data set mentioned above I want STATA to look through CODE1-CODE5 for each ID.
              I have a code set of around 300 codes (i.e. X001-X200 + R001-R056 + R070 +R073 +R080-R120) that I want to include. If none of these codes are found in CODE1-CODE5, I want to drop that ID. Is there a smart way to do this?

              Thanks in advance!

              Best regards,
              Viktor

              Comment


              • #8
                So the first step is to -reshape- your existing data set to long layout, just as was done in #2.

                The next step depends on the kind of values these codes can take. In your example data, it appears that the codes are always a single upper case letter followed by a three digit number. If that is true of the full data set it looks like this:
                Code:
                reshape long code, i(id) j(which)
                drop if missing(code)
                by id (which), sort: replace which = _n
                
                by id (which), sort: egen to_keep = max( ///
                        inrange(code, "X001", "X200") ///
                        | inrange(code, "R001", "R056") ///
                        | inlist(code, "R070", "R073") ///
                        | inrange(code, "R080", "R120") ///
                    )
                keep if to_keep
                This approach, however, will fail if the codes do not all conform to the one upper case letter plus three digits pattern. If, for example, a code could be X0011, it would be considered as in the range between X001 and X200, which probably is not what you want. If that kind of situation can arise, a different approach is needed. Post back for additional help in that case.

                Comment

                Working...
                X