Copying data from observations

Andrea Parisi

Join Date: Jan 2018

Posts: 4
#1

Copying data from observations

07 Jan 2018, 17:49

Dear all,

I am having some issues with my dataset and I would be very grateful for your help. I have a large dataset of surveillance data concerning 1 infectious disease. There are many ID duplicates as several patients had several different samples tested. The samples come from different body sites such as Site 1 or Site 2 and they are recorded in the variable SAMPLE. Unfortunately, there are many missing observations in this variable as the SAMPLE was recorded only for the first different observation of each ID as depicted below.

Observation |ID | SAMPLE |
1 1234 Site 1
2 1234 .
3 1234 .
4 1234 Site 2
5 1234 Site 3
6 5678 Site 2
7 5678 .
8 5678 .

What I would like to do is to copy the observations to SAMPLE based on the same ID. For example, Observation 1 of the variable SAMPLE copy to observations 2 and 3 of the same variable, observation 6 copy to observation 7 and 8, etc. As there are about 15.000 missing values, I cannot do it one by one. Is there any command which would allow me to copy these based on the condition of having the same ID?

My second question relates to the possibility of merging several different observations again based on ID. Let's say I want to merge observations with same ID in the variable SAMPLE so that I get this:

Observation |ID | SAMPLE |
1 1234 Site 1, Site 2, Site 3
2 5678 Site 2

Is there any way I can do such a merge?

Thanks in advance for your help.

Andrea
Tags: None
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

07 Jan 2018, 18:00

Welcome to Statalist, Andrea.

To answer your first question, something like the following should do what you need:

Code:

by ID (Observation), sort: replace SAMPLE = SAMPLE[_n-1] if missing(SAMPLE)

On your second question, once you got your data like that, it would be very difficult to use it for any analytical purpose in Stata. I strongly recommend against that approach.
Comment
Andrea Parisi

Join Date: Jan 2018

Posts: 4
#3

11 Jan 2018, 01:08

Dear William,

thanks a lot, it gives me exactly what I want.

One more thing. Let's say the command looks as follows:

bysort ID (diagnosis_date): replace SAMPLE = SAMPLE[_n-1] if missing(SAMPLE)

As you can see, I am using diagnosis_date as the lead variable. What I would like to do now is to add another condition: SAMPLE observation of the person with the same ID is copied only if the diagnosis_date of the copied observation is no more than 3 days apart from the observation with missing value.

I thought of something like the example below but apparently there is somewhere a mistake.

bysort ID (diagnosis_date): replace SAMPLE = SAMPLE[_n-1] if (missing(SAMPLE)) & (diagnosis_date – diagnosis_date[_n-1] < 3)

Thanks again,

Andrea
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35726

11 Jan 2018, 03:08

To William's suggestion I add the reference of an FAQ on this matter. https://www.stata.com/support/faqs/d...issing-values/

On the new twist: A data example would help. Here is one. I give two solutions. The first requires rangestat (SSC) by Robert Picard and friends.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(ID Date SAMPLE)
1 1  42
1 2   .
1 3   .
1 4   .
1 5   .
1 6   .
1 7   .
2 1 666
2 2   .
2 3   .
2 4   .
2 5   .
2 6   .
2 7   .
end

* solution 1
rangestat (lastnm) previous=SAMPLE, inter(Date -3 0) by(ID)
gen SAMPLE2 = cond(missing(SAMPLE), previous, SAMPLE)
l, sepby(ID)

+-----------------------------------------+
| ID Date SAMPLE previous SAMPLE2 |
|-----------------------------------------|
1. | 1 1 42 42 42 |
2. | 1 2 . 42 42 |
3. | 1 3 . 42 42 |
4. | 1 4 . 42 42 |
5. | 1 5 . . . |
6. | 1 6 . . . |
7. | 1 7 . . . |
|-----------------------------------------|
8. | 2 1 666 666 666 |
9. | 2 2 . 666 666 |
10. | 2 3 . 666 666 |
11. | 2 4 . 666 666 |
12. | 2 5 . . . |
13. | 2 6 . . . |
14. | 2 7 . . . |
+-----------------------------------------+



* solution 2
gen LastDate = Date if SAMPLE < .
bysort ID (Date) : replace LastDate = LastDate[_n-1] if missing(LastDate)
gen SAMPLE3 = SAMPLE
bysort ID (Date) : replace SAMPLE3 = cond(missing(SAMPLE3) & (Date - LastDate) <= 3, SAMPLE3[_n-1], SAMPLE3)
l, sepby(ID)

+--------------------------------------------------------------+
| ID Date SAMPLE previous SAMPLE2 LastDate SAMPLE3 |
|--------------------------------------------------------------|
1. | 1 1 42 42 42 1 42 |
2. | 1 2 . 42 42 1 42 |
3. | 1 3 . 42 42 1 42 |
4. | 1 4 . 42 42 1 42 |
5. | 1 5 . . . 1 . |
6. | 1 6 . . . 1 . |
7. | 1 7 . . . 1 . |
|--------------------------------------------------------------|
8. | 2 1 666 666 666 1 666 |
9. | 2 2 . 666 666 1 666 |
10. | 2 3 . 666 666 1 666 |
11. | 2 4 . 666 666 1 666 |
12. | 2 5 . . . 1 . |
13. | 2 6 . . . 1 . |
14. | 2 7 . . . 1 . |
+--------------------------------------------------------------+

There remains the question of why your code didn't work. As mentioned, you did not give a data example so that we could see your data, But presumably the reason is that

Code:

 
(diagnosis_date – diagnosis_date[_n-1] < 3)

will only catch the first date after diagnosis, not any date afterwards. You need a device like the second solution above to copy the last diagnosis date within the dataset.

Last edited by Nick Cox; 11 Jan 2018, 03:14.

Comment

Andrea Parisi

Join Date: Jan 2018

Posts: 4
#5

25 Jan 2018, 03:36

Dear Nick,

thank you for your suggestions. I went with the second option and it worked great. Many thanks again.
Comment

Announcement