Delete row duplicates and move code forward (delete missing values)

Viktor Johansson

Join Date: Mar 2023

Posts: 1
#1

Delete row duplicates and move code forward (delete missing values)

01 Mar 2023, 01:54

Hi all!
I have a programming problem that I can't get my head around and hope that you'll be able to help me. Right now my dataset looks like:

ID CODE1 CODE2 CODE3 CODE4 CODE5 CODE6

1 X001 X001 X003

2 X001 X002 X001

3 X004 X001 X004

4 T004 X005 T004

5 R001 T004 R001 T004 T004 R002

but I would like to remove all row duplicates and compress the data set so that missing values are placed towards the end:

ID CODE1 CODE2 CODE3 CODE4 CODE5 CODE6

1 X001 X003

2 X001 X002

3 X004 X001

4 T004 X005

5 R001 T004 R002

Is there a smart way to do this?

Thank you for taking your time!

Best regards, Viktor
Tags: None

Nick Cox

Join Date: Mar 2014
Posts: 35754

01 Mar 2023, 02:03

Please see https://www.statalist.org/forums/help#stata for how to present data examples.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte id str4(code1 code2 code3 code4 code5 code6)
1 "X001" ""     "X001" ""     "X003" ""    
2 ""     "X001" "X002" "X001" ""     ""    
3 "X004" ""     "X001" ""     ""     "X004"
4 ""     ""     ""     "T004" "X005" "T004"
5 "R001" "T004" "R001" "T004" "T004" "R002"
end

reshape long code, i(id) j(which)
drop if missing(code)
bysort id code (which) : keep if _n == 1 
bysort id (which) : replace which = _n 
reshape wide code, i(id) j(which)

list

It is possible that the wide layout you ask for isn't ideal for later analysis, but the abstracted nature of the example makes it impossible to be sure on that.

Comment

Viktor Johanson

Join Date: Nov 2023

Posts: 10
#3

06 Nov 2023, 14:57

Hi again all,

I used the code above which worked perfectly for all rows with several values. However, rows with only one value got unintentionally deleted completley - causing a lot of missing values in the dataset. Is there any way around this?

Thanks in advance!
Viktor
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35754

06 Nov 2023, 15:15

I don't see why that would happen. I added an example observation with only one non-missing value and can't reproduce any problem.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte id str4(code1 code2 code3 code4 code5 code6)
1 "X001" ""     "X001" ""     "X003" ""    
2 ""     "X001" "X002" "X001" ""     ""    
3 "X004" ""     "X001" ""     ""     "X004"
4 ""     ""     ""     "T004" "X005" "T004"
5 "R001" "T004" "R001" "T004" "T004" "R002"
6 "Z007" "" "" "" "" "" ""
end

reshape long code, i(id) j(which)
drop if missing(code)
bysort id code (which) : keep if _n == 1 
bysort id (which) : replace which = _n 
reshape wide code, i(id) j(which)

list

     +----------------------------+
     | id   code1   code2   code3 |
     |----------------------------|
  1. |  1    X001    X003         |
  2. |  2    X001    X002         |
  3. |  3    X004    X001         |
  4. |  4    T004    X005         |
  5. |  5    R001    T004    R002 |
     |----------------------------|
  6. |  6    Z007                 |
     +----------------------------+

Comment

Viktor Johanson

Join Date: Nov 2023

Posts: 10
#5

07 Nov 2023, 10:01

Thank you very much for your quick reply Nick!

You were right, the code works. The reason for missing values were that the "code 1" in the dataset had a different formatting, thus not included in the analysis. After correcting this, it works!

Thanks again for your help!
Viktor
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35754
#6

07 Nov 2023, 10:05

Thanks for confirmation.
Comment
Viktor Johanson

Join Date: Nov 2023

Posts: 10
#7

17 Dec 2023, 11:09

Hi again everyone,

With the data set mentioned above I want STATA to look through CODE1-CODE5 for each ID.
I have a code set of around 300 codes (i.e. X001-X200 + R001-R056 + R070 +R073 +R080-R120) that I want to include. If none of these codes are found in CODE1-CODE5, I want to drop that ID. Is there a smart way to do this?

Thanks in advance!

Best regards,
Viktor
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30146
#8

17 Dec 2023, 11:25

So the first step is to -reshape- your existing data set to long layout, just as was done in #2.

The next step depends on the kind of values these codes can take. In your example data, it appears that the codes are always a single upper case letter followed by a three digit number. If that is true of the full data set it looks like this:

Code:

reshape long code, i(id) j(which) drop if missing(code) by id (which), sort: replace which = _n by id (which), sort: egen to_keep = max( /// inrange(code, "X001", "X200") /// | inrange(code, "R001", "R056") /// | inlist(code, "R070", "R073") /// | inrange(code, "R080", "R120") /// ) keep if to_keep

This approach, however, will fail if the codes do not all conform to the one upper case letter plus three digits pattern. If, for example, a code could be X0011, it would be considered as in the range between X001 and X200, which probably is not what you want. If that kind of situation can arise, a different approach is needed. Post back for additional help in that case.
1 like
Comment

Announcement

Delete row duplicates and move code forward (delete missing values)

Comment

Comment

Comment

Comment

Comment

Comment

Comment