Multiple observations per id, select id with highest value in other identifier and select corresponding other variables to that identifier

Anouk Klootwijk

Join Date: Jun 2020

Posts: 26
#1

Multiple observations per id, select id with highest value in other identifier and select corresponding other variables to that identifier

14 Jun 2020, 14:37

Dear statalists,

I have a question with regard to organizing the data from long to wide and removing duplicate ids.

The data that I'm working with contains ids with multiple observations. The data is besides id, identified by date of delivery (delivery_date) as well, so some ids have multiple delivery dates.
I imported the dataset from excel, variable 1 and variable 2 were originally in excel stated in different rows.
Like this:
id/ delivery_date/ var1/var2 /obs
1/ 01-05-2018/ var 1/ 5
1 /01-05-2018 /var 2 /7
2 /02-01-2018/ var 1 /4
2 /02-01-2018/ var 2 /3
2 /02-12-2019/ var 1 /2
2/ 02-12-2019/ var 2/ 10
3/ 04-05-2019/ var 1/ 6
3/ 04-05-2019/ var 2 /8

I subtracted var 1 and var 2 by sorting, selecting and copying into 2 different columns in excel and than imported it in Stata. Now, every id with an observation in variable 1 has a missing value in variable 2 and the the other way round (the same id number in the next row has a missing value in variable 1 and an observation in variable 2).
I would like to organise the data as following: one id per row, that id is selected on the highest value of var 1 (in my example, id 2 would become var 1 = 4 and var 2 = 3).

I tried a lot of commands like collapse (but than I got averages of the variables...), replace wide,
sort id date_delivery var 1
quietly by id date_delivery var 1: gen dup = cond(_N==1,0,_n)... I couldn't find the solution yet...

Thanks a lot in advance!

Regards, Anouk

Last edited by Anouk Klootwijk; 14 Jun 2020, 14:44.
Tags: None
Justin Niakamal

Join Date: Aug 2017

Posts: 760
#2

14 Jun 2020, 17:43

Welcome to Statalist.

I'm a little lost in the later part of your description. I would recommend reading through the FAQ about how to post an example of your data using dataex.

I apologize if I interpret your request wrong, but for what it's worth here's a toy example that might be of use. There's an id with multiple entries, and two variables: x and y. x contains missings, and the desired result is the largest value of x. To do so, you'll want to remove the missing(s) so the sorting is correct, and because you're sorting in ascending order, you'll keep the last observation per id. Hope this gets you on the right track.

Code:

clear set obs 10 set seed 1234 gen id = _n expand 10 gen x = cond(rnormal() > 0 ,rnormal(),.) gen y = runiform() drop if mi(x) bys id (x): keep if _n == _N
Comment
Anouk Klootwijk

Join Date: Jun 2020

Posts: 26
#3

15 Jun 2020, 00:36

Thanks a lot for your reply. Unfortunately my Stata version from university does not support dataex.
The problem in my dataset is that y also has missing values. Every missing value of y belongs to an observation of x.
For example:

id x y
1/ 3/ .
1/ . / 4
1/ 5 / .
1 / ./ 2

In this case I cannot drop all ids with missings in x, because the information on y would be lost. In this case id 1 with x 3 has y 4. And id 1 with x 5 has y 2.
So first I want it to look like this:

id/x/y
1/3/4
1/5/2

Than I can drop ids as you explained and keep the highest value of x.

Could you help me to transform the observations with missing observatioins in x and y to the combination without missings?

Thank you!

Last edited by Anouk Klootwijk; 15 Jun 2020, 00:43.
Comment
Wouter Wakker

Join Date: Nov 2018

Posts: 621
#4

15 Jun 2020, 01:02

Maybe:

Code:

collapse (max) x y, by(id date)
Comment
Anouk Klootwijk

Join Date: Jun 2020

Posts: 26
#5

15 Jun 2020, 01:18

In my example:

id x y
1/ 3/ .
1/ . / 4
1/ 5 / .
1 / ./ 2

y is date. So when I use your command, like: collapse (max) x, by (id y) the missing values of date (y) and x are still there...

Do you know another command to merge observations of different variables into one 1 id? And selecting the observation of x with the highest value (as in my example id 1 would remain with x 5 and y 2).

Thanks!
Comment
Wouter Wakker

Join Date: Nov 2018

Posts: 621
#6

15 Jun 2020, 01:24

This question really needs a proper data example to give code. In the absence of the possibility of using dataex, could you paste the output of

Code:

list in f/20

here within code tags?
Comment
Anouk Klootwijk

Join Date: Jun 2020

Posts: 26
#7

15 Jun 2020, 01:50

id date y z

1. 0003642 07-05-2019 . .
2. 0003642 . 23 7
3. 0002465 25-03-2018 . .
4. 0002465 . 28 2
5. 0013560 27-08-2019 . .

6. 0013560 . 30 1
7. 0014537 29-11-2018 . .
8. 0014537 . 28 3
9. 0016788 02-09-2019 . .
10. 0016788 . 26 6

11. 0039732 04-06-2019 . .
12. 0039732 . 29 4
13. 0021584 14-01-2018 . .
14. 0021584 . 27 4
15. 0073786 08-05-2018 . .

16. 0073786 . 28 1
17. 0048420 06-02-2018 . .
18. 0048420 . 28 7
19. 0044526 16-08-2018 . .
20. 0044526 . 26 1

In this part of my set, every id has one date value and one y value & one z value. But there are also ids with multiple dates that are accompanied by y & z values, than I would like to select the date with the highest y value.

Thanks in advance!
Comment

Wouter Wakker

Join Date: Nov 2018
Posts: 621

15 Jun 2020, 02:24

This should work.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long id str10 date byte(y z)
 3642 "07-05-2019"  . .
 3642 "."          23 7
 2465 "25-03-2018"  . .
 2465 "."          28 2
13560 "27-08-2019"  . .
13560 "."          30 1
14537 "29-11-2018"  . .
14537 "."          28 3
16788 "02-09-2019"  . .
16788 "."          26 6
39732 "04-06-2019"  . .
39732 "."          29 4
21584 "14-01-2018"  . .
21584 "."          27 4
73786 "08-05-2018"  . .
73786 "."          28 1
48420 "06-02-2018"  . .
48420 "."          28 7
44526 "16-08-2018"  . .
44526 "."          26 1
end

foreach var of varlist y z {
    bysort id (`var'): replace `var' = `var'[1]
}

drop if date == "."

I don't know if your date variable is string or numeric. The example assumes string, otherwise just remove the quotes in the last line.

Edit: I didn't see this:

But there are also ids with multiple dates that are accompanied by y & z values, than I would like to select the date with the highest y value.

In your example all dates have missing values for x and y. Please clarify how dates are connected to y values because it is not clear from your data example.

Last edited by Wouter Wakker; 15 Jun 2020, 02:28.

Comment

Anouk Klootwijk

Join Date: Jun 2020

Posts: 26
#9

15 Jun 2020, 04:38

Thanks for your reply! This really helps me and I think I'm almost there!

I simplified the data in the previous example, there is an extra variable (date 2, this has the same value as date) that connects y values to dates for every id.

id date 2 y z date
|-------------------------------------------------------|
1. | . . . . |
2. | 0003642 07may2019 34 6 . |
3. | 0003642 07may2019 . . 07-05-2019 |
4. | 0002465 25mar2018 39 1 . |
5. | 0002465 25mar2018 . . 25-03-2018 |
|-------------------------------------------------------|
6. | 0013560 27aug2019 41 0 . |
7. | 0013560 27aug2019 . . 27-08-2019 |
8. | 0014537 29nov2018 39 2 . |
9. | 0014537 29nov2018 . . 29-11-2018 |
10. | 0016788 02sep2019 37 5 . |
|-------------------------------------------------------|
11. | 0016788 02sep2019 . . 02-09-2019 |
12. | 0039732 04jun2019 40 3 . |
13. | 0039732 04jun2019 40 3 . |
14. | 0039732 04jun2019 . . 04-06-2019 |
15. | 0039732 04jun2019 . . 04-06-2019 |
|-------------------------------------------------------|
16. | 0021584 14jan2018 38 3 . |
17. | 0021584 14jan2018 . . 14-01-2018 |
18. | 0048420 08may2018 39 0 . |
19. | 0048420 08may2018 . . 08-05-2018 |
20. | 0044420 06feb2018 39 6 . |
|-------------------------------------------------------|
21. | 0044420 06feb2018 39 6 . |
22. | 0044420 06feb2018 . . 06-02-2018 |
23. | 0044420 06feb2018 . . 06-02-2018 |
24. | 0043596 16aug2018 37 0 . |
25. | 0043596 16aug2018 . . 16-08-2018 |
|-------------------------------------------------------|
26. | 0152902 20oct2018 38 2 . |
27. | 0152902 20oct2018 . . 20-10-2018 |
28. | 0055474 12nov2018 38 3 . |
29. | 0055474 12nov2018 . . 12-11-2018 |
30. | 0062844 15apr2019 40 3 . |
|-------------------------------------------------------|
31. | 0062844 15apr2019 . . 15-04-2019 |
32. | 0064146 03may2018 36 3 . |
33. | 0064146 09sep2019 37 2 . |
34. | 0064146 03may2018 . . 03-05-2018 |
35. | 0064146 09sep2019 . . 09-09-2019 |

In this example id 0039732 (12-15) has 4 identical values for date2, so these are duplicates.
Id 0064146 (32-35) has different values for date2 (03may2018 and 09sept2019), in this case I would like to select the date2 (corresponding with date) with the highest value for y.

Currently I'm thinking that I could use date2 rather than date so I can use the previously suggested command:

drop if mi(y)
bys id (y): keep if _n == _N

Than I got rows with unique ids and the values that I need from Date2 and y and z. Date2 represents date, it is imported from excel with a numeric value. I transformed it to a string by this command:

tostring date2, gen date1

Date1 is red (string) when I browse the dataset, but when I tried to format date1 (=elapsed date) back to human readable forms by

format date2 %td

error 120 appeared: string %fmt required for string variables.

Could you help me with this last step?

Thanks in advance!
Comment
Wouter Wakker

Join Date: Nov 2018

Posts: 621
#10

15 Jun 2020, 05:00

Please use code delimiters next time for Stata output and code, they make your posts much more readable (also explained in the FAQ).

%td is a date format for numeric variables containing the number of days since 1 Jan 1960. %td makes the variable more readable but doesn't change anything internally. In general, you want to have date variables as SIF (Stata internal form). Your date2 variable already looks like it is this type of variable so I don't know why you would want to make it a string variable again.

In any case, if you use tostring on a daily SIF variable you will just get the number of days since 1 Jan 1960, but as a string. This is not any more useful than the numeric variable.

I would just strick to your date2 variable as it is. Or please explain if you indeed have a good reason to convert it to string.

Finally, help datetime is the place to go if you have problems with datetime conversions.
Comment

Announcement